Parquet and CSV: Querying & Processing in Power BI
The ability to load data from Parquet files into Power BI is relatively new and, given its storage structure, I wanted to see how Power Query dealt with it, and whether it gave any improvements over the more common CSV format.
This is a pound-for-pound Import-mode comparison between the two file types, covering the reading of the file and processing in the Power BI Data model.
An early admission: some of the results I've shared below are not strictly related to CSV vs. Parquet, but they're worth a look anyway.
The see-saw choice of Comma Separated Values versus Parquet.
This came about when I asked in an MVP/Microsoft discussion thread whether Power Query would benefit from the columnar storage format of Parquet files. Some tests were performed, and Chris Webb published a great blog post as a follow-up, as part of his recent series. He does mention to "look out for more posts on this subject", so here is one!
My comparison of performance of the two file types involved a few different source file scenarios so I could get a broad view of the results.
The scenarios I tested are as follows:
I had 11 CSV files with varying numbers of rows, totaling 28 million. I created Parquet versions of each file.
I combined the 11 files into a monolithic source file, still totaling 28 million rows, in both CSV and Parquet format.
I created smaller versions of the monolithic files, with 1K, 10K, 100K, 500K, 1M and 2M rows.
All of the conversions & smaller versions were done using Python. Nice and easy, only a few lines of code using the pandas module.
The files are 35 columns each, with a mixture of data types. They were stored on my local SSD so I could monitor disk read stats. In the real world, I'd be putting them on Azure Data Lake Storage.
Here are the 11 individual files. You'll also notice the Parquet files are much smaller, due to the efficient way the data is stored.
These are the shortened versions, again check out the difference in file size!
Capturing the data
I won't go into detail in this article exactly how I obtained the various counters, and loaded them into Power BI. In short though, I borrowed Phil Seamark's method here, to get the Processing Events, with some adjustments to load multiple files.
Adding to that are Performance Monitor counters which were extracted and saved to CSV using relog.exe. I then loaded all the data into Power BI for visualisation purposes.
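For reference, the Perfmon extraction step can be scripted. Below is a hedged sketch that builds a relog.exe command line from Python (kept in Python for consistency with the rest of this post); the file names are hypothetical, and it would only actually run on a Windows machine where relog.exe is available.

```python
# Sketch of converting a Performance Monitor binary log (.blg) to CSV
# via relog.exe. Paths here are hypothetical placeholders.
import subprocess

def relog_to_csv(blg_path: str, csv_path: str) -> list[str]:
    # Build the relog command: -f CSV sets the output format, -o the output file
    return ["relog", blg_path, "-f", "CSV", "-o", csv_path]

cmd = relog_to_csv("perfmon_capture.blg", "perfmon_counters.csv")
# subprocess.run(cmd, check=True)  # run only where relog.exe exists (Windows)
```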
If there's a taste for it, I can blog about this in depth & share the pbix file.
In order to understand the labels on the charts below, you need to know the naming convention I used for the scenarios:
All_MultiFile: I loaded All columns, from the 11 source files
All_SingleFile: I loaded All columns from the large file
All_SingleFile_1K, All_SingleFile_10K, etc: I loaded All columns from the shortened version of the files, with 1K rows, 10K rows, etc
3Col_SingleFile, 5Col_SingleFile: I only loaded 3 or 5 columns from the large file
Analysis & Findings
Given I have so many scenarios and measured so much, I'll have to be selective on what I report in this article! Spoiler alert: I am going to share the whole Power BI report with you, so you can filter/slice & dice the results yourself.
What format is faster to read?
CSV files load faster than Parquet files. Even though Parquet files are smaller than the equivalent data in CSV, it takes longer to unpack the data structure for encoding into the data model. It's the same thing when comparing the load time of a CSV against the same data in Excel: because Excel files are zipped, there's overhead in transforming the data.
The main thing I was hoping for is an improvement when selecting a subset of columns from Parquet as it's stored in a columnar format. This means Power Query shouldn't need to scan the whole file, then "remove" unwanted columns. The metrics I used to compare these are two trace file events called ExecuteSQL and ReadData.
What I noticed is that the ExecuteSQL event did not give me any telemetry on the reading of the file off of disk. The ReadData event did show that the Parquet file saw a much bigger improvement when reducing the columns.
The much bigger reduction in time is evident with the Parquet files.
To clarify, the ReadData event is not strictly a measure of reading the file contents; it's about the data model receiving the file data from Power Query. As such, you'll also notice a reduction when fewer columns were chosen from the CSV.
Rows per second
I didn't see any specific benefit of Parquet over CSV in rows per second. This metric was directly linked to the ReadData event subclass, so it is just a reflection of that. The chart shows CSV being generally faster than Parquet, but that comes back to the point above.
It did show another result worth sharing, though this applies equally to Parquet and CSV files. Firstly, fewer columns process faster - fairly obvious, as it's simply less data to encode and compress.
Secondly, I saw diminishing returns on rows per second as the row counts of the source files reduced. See the two light blue examples on the chart below? They are the 10K and 1K files; you're hardly going to achieve 120K rows per second when there are only 1,000 rows in the file, so this makes sense.
Data Model Processing
All of the compression and column-hierarchy processing events are the same duration between CSV and Parquet file sources. This is because the data has already been transformed by these steps and is effectively in the same format, so it's a like-for-like exercise. There are also interesting results here worth sharing, which again apply equally to Parquet and CSV files.
The Compress Segment event is where the individual columns in the dataset are compressed. Below shows that the 3- and 5-column versions were much faster, simply due to having fewer columns.
The Vertipaq Event is the compression of the table segments. This showed a much faster time for the 5-column scenario over the 3-column scenario. Why is that? It's because I deliberately chose very low grain columns in the 5-column scenario and the 3-column scenario contained the Fact key, so took just as long as the full 35 columns.
The Process event is done in parallel to the FileRead and is transforming the data ready for compression. There is more work to do for Parquet files, so this takes longer.
See for yourself
I have published the Power BI report and made it accessible for you. The first page is a full slice & dice of ALL of the metrics. The scenario you choose will filter both the SQL Trace data and Perfmon counters.
It also includes a Gantt chart view of the processing, and the ability to filter other events and file scenarios. There were too many combinations for me to blog about here, so the rest is up to you. It's right at the bottom of this article.
There is some benefit using a Parquet file if you are going to select a subset of the columns from the source file using Power Query.
Some of this benefit is negated by added processing time, as the encoding takes longer.
Compression and building of column hierarchy is the same between the two formats.
Outside the scope of this blog article is serverless querying, such as Azure Synapse Analytics SQL pools or Redshift Spectrum. There, Parquet data sources are a no-brainer: costs are based on data processed (Parquet files are much more efficient, hence smaller, hence cheaper), and reading from them using multiple nodes gives improved performance. Adding to that is the ability to push down your query predicate so only the necessary files are scanned.
My conclusion: I cannot see benefit in choosing a Parquet file over CSV if you're loading it directly into Power BI where your data volumes are low (let's say 50 million rows or less). When the row counts are getting higher, you're probably going to start looking at database technologies anyway, so at this point the pendulum swings in the favour of Parquet files.
Here's the Power BI Report