Why Apache Parquet instead of CSV files?

Mariano Silva
5 min read · Jun 9, 2021


This article was also published on my LinkedIn page. You can also read it here.

When you run your activity or business in the cloud and pay for resources such as memory, cores and storage, it is very important to optimize the data your system relies on. If that is your case, you have surely read about AWS, Azure and GCP, and their prices.

You may be wondering which cloud provider is better or cheaper, but before you jump into that deep and sometimes fuzzy abyss of instance types, storage classes and price lists, I think you should familiarize yourself with the opportunity of using Apache Parquet instead of regular text or CSV files, because you could be throwing money out the window without knowing it.

Just to give an example: if you store objects in S3 Standard buckets on AWS, the storage price for the first 50 TB per month is about $0.023 per GB in US East (Ohio). Let’s say you want to store a regular CSV file of 1 TB. That means $0.023 * 1024 ≈ $24 per month. That doesn’t sound like much, but it doesn’t include the PUT, COPY, POST and LIST requests to S3, nor at least one “in” and one “out” transfer of the full size of the data. With those additions, you could easily reach $120 per month. Now suppose you need to keep that file for a year: it will cost you about $1,440. The bill is self-explanatory :)

To simplify, let’s round this number to $1,500 per year and do the math assuming you have tens, hundreds or thousands of similar files. Your AWS bill will grow by at least $1,500 * (number of 1 TB files) per year. Thus, if you don’t keep an eye on your file sizes, your bill can get out of control!
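If it helps to see the arithmetic spelled out, here is the same back-of-the-envelope calculation as a short Python sketch; the request-and-transfer surcharge is an assumed figure chosen to match the ~$120/month estimate above, not an actual AWS quote:

```python
# Back-of-the-envelope S3 Standard cost estimate (not an AWS quote).
STORAGE_PER_GB_MONTH = 0.023   # first 50 TB/month, US East (Ohio)
FILE_SIZE_GB = 1024            # a 1 TB CSV file
EXTRA_PER_MONTH = 96           # assumed requests + data transfer in/out

storage_per_month = STORAGE_PER_GB_MONTH * FILE_SIZE_GB   # ~$24
total_per_month = storage_per_month + EXTRA_PER_MONTH     # ~$120
total_per_year = total_per_month * 12                     # ~$1,440

print(f"Storage only: ${storage_per_month:.2f}/month")
print(f"With requests and transfer: ${total_per_month:.2f}/month, ~${total_per_year:,.0f}/year")
```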

To put this into practice, we will run a small experiment with Spark. As you may know, Spark can read text and CSV files. While these formats give you access to many data sources, they are not always the most convenient to work with. Some common issues with CSV files include:

  • The schema is not defined: beyond an optional header row, there are no data types and no column names.
  • Using content containing a comma (or another delimiter) requires escaping.
  • Using the escape character within content requires even further escaping.
  • The available encoding formats are limited depending on the language used.

In addition to the above issues with CSV files, Spark has some specific problems processing CSV data. CSV files are quite slow to import and parse. The files cannot be shared between workers during the import process, and if no schema is defined, all data must be read before a schema can be inferred.

On the other hand, Parquet is a compressed columnar data format for use in any Hadoop-based system. This includes Spark, Hadoop, Apache Impala, and so forth. The Parquet format is structured with data accessible in chunks, allowing efficient read / write operations without processing the entire file. Finally, Parquet files automatically include schema information and handle data encoding. This is perfect for intermediary or on-disk representation of processed data. Note that Parquet is a binary file format and can only be used with the proper tools, in contrast to CSV files, which can be edited with any text editor.

Having said the above, let’s do a quick exercise using pySpark.

First, we are going to create a SparkSession named “sess”, which we will then use to create a Spark DataFrame:
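A minimal sketch of that step might look like this (the application name is an arbitrary choice):

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; "sess" is simply the variable name used below.
sess = SparkSession.builder \
    .appName("parquet_vs_csv") \
    .getOrCreate()
```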

Then, after copying the CSV file from the local filesystem to HDFS, we are going to read it. By the way, for this experiment I used an Amazon dataset (download it from here) that contains more than 400k Amazon electronics reviews. Reading this CSV dataset took 9.52 seconds, and counting the records took 2.31 seconds:
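A sketch of the read-and-count step; the HDFS path is a hypothetical placeholder for wherever you copied the file:

```python
import time

# Hypothetical HDFS path -- adjust it to wherever you copied the CSV file.
csv_path = "hdfs:///data/amazon_electronics_reviews.csv"

start = time.time()
# inferSchema forces a full pass over the data, which is part of why CSV reads are slow.
df = sess.read.csv(csv_path, header=True, inferSchema=True)
print(f"Read CSV in {time.time() - start:.2f} s")

start = time.time()
n_rows = df.count()
print(f"{n_rows} rows, counted in {time.time() - start:.2f} s")
```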

and here is what the DataFrame looks like:
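Something along these lines prints the inferred schema and the first few rows (the column names depend on the dataset you downloaded):

```python
df.printSchema()          # inferred column names and data types
df.show(5, truncate=50)   # first few reviews, long text truncated for readability
```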

To continue the experiment, I saved the DataFrame to a Parquet file, which took 7.87 seconds; the same operation for a CSV file took 9.38 seconds:
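A sketch of both write operations, timed the same way and reusing the hypothetical paths introduced above:

```python
start = time.time()
df.write.mode("overwrite").parquet("hdfs:///data/reviews_parquet")
print(f"Wrote Parquet in {time.time() - start:.2f} s")

start = time.time()
df.write.mode("overwrite").option("header", True).csv("hdfs:///data/reviews_csv")
print(f"Wrote CSV in {time.time() - start:.2f} s")
```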

Next, we read the Parquet file back and performed a count:
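For example, reusing the same hypothetical Parquet path:

```python
start = time.time()
pq_df = sess.read.parquet("hdfs:///data/reviews_parquet")
print(f"Read Parquet in {time.time() - start:.2f} s")

start = time.time()
print(f"{pq_df.count()} rows, counted in {time.time() - start:.2f} s")
```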

And finally, we summarized the file sizes:
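One simple way to do that is to ask HDFS for the total size of each output; the paths below are the assumed ones from the earlier snippets:

```python
import subprocess

# Print the total on-disk size of each output directory (CSV vs Parquet).
for path in ["hdfs:///data/reviews_csv", "hdfs:///data/reviews_parquet"]:
    subprocess.run(["hdfs", "dfs", "-du", "-s", "-h", path], check=True)
```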

To sum up, these are the results for this experiment:

The table above is just one example, based on this 83+ GB CSV file of 400k+ rows. Leaving the operation timings aside, you can see that the Parquet file is about 45% smaller than the CSV.

Thus, the moral of the story is: if, over the course of a year, you stick with an uncompressed 1 TB CSV file, your cost will be about $1,500. Using Parquet files, your total cost will be at most $660 per year. I know you love CSV files, but do you love them that much?

Thank you for reading this article! If you liked it, please leave your comments.

Keep Parquet-ing!
