Accelerating Dataflows Gen2 with Fast Copy

Daniel Scott
Jun 30, 2024
6 min read

In this blog post I will share with you how to effectively leverage the Dataflow Gen2 Fast Copy capability when ingesting large volumes of data. I have found that there are two key components that are critical for success:

Enable high scale compute – allows use of Fast Copy connectors that benefit from the scalable backend of the pipeline Copy Activity.
Meet prerequisite criteria – Fast Copy is only supported for a limited number of Dataflow Gen2 connectors and a limited number of transformation steps.

Focusing on these two components resulted in up to a 90% performance improvement ingesting a 100M row CSV data source when comparing to the same workload without fast copy.

Architecture

By enabling high scale compute and meeting the prerequisite criteria, you can redistribute the workload from Power Query’s slower mashup engine to the scalable backend of the pipeline Copy Activity. The diagram below illustrates this concept.

Configuring Fast Copy

The Dataflow Gen2 mashup engine first checks if high scale compute (Fast Copy) is enabled, then checks if the prerequisite criteria are met.

If both checks are successful then the mashup engine redistributes the workload to the backend of the pipeline Copy Activity, and copies the data from the supported source to the chosen output destination – in this case, a Lakehouse built on top of OneLake.

Enable high scale compute

On the Home tab of the Dataflow Gen2, select Options.

Then choose the Scale tab on the Options dialog and select the Allow use of fast copy connectors checkbox to turn on Fast Copy. Then close the Options dialog.

Prerequisite criteria

There are several prerequisite criteria which shape the Power Query steps to be compatible with the pipeline Copy Activity.

Fast Copy is currently only supported for the following Dataflow Gen2 connectors: ADLS Gen2, Blob storage, Azure SQL DB, Lakehouse, PostgreSQL.
The copy activity only supports a few transformations when connecting to a file source: combine files, select columns, change data types, rename a column, remove a column.

There are also minimum file sizes and number of rows depending on the source. This is because the performance benefit for a small amount of data would be negligible compared to the cost of redistributing the workload from the Power Query mashup engine to the backend of the pipeline Copy Activity.

For file data, files are in .csv or parquet format of at least 100 MB, and stored in an Azure Data Lake Storage (ADLS) Gen2 or a Blob storage account.
For databases including Azure SQL DB and PostgreSQL, 5 million rows or more of data in the data source.

Implementing Fast Copy

Now that the Power Query environment has been configured for Fast Copy, the next step is to connect to a supported source and apply supported transformations.

Connect to the supported source

For the purposes of this blog, I have uploaded 10 CSV files (around 10GB in size with a total of 100M rows) to Azure Blob Storage.

I have used the Azure Data Lake Storage Gen2 connector and opted to connect with the Data Lake Storage endpoint as shown in the screenshot below:

Apply supported transformations

After authenticating, Power Query will ask you to Create or Combine. Select Combine, as this is a supported transformation.

This will automatically generate a set of functions and helper queries to combine all the files in the specified location.

❗Warning❗ the encoding and quote style options in the “Transform Sample file” query will cause Fast Copy to fail – even though this is automatically generated by Power Query. To resolve, simply remove these two options from the query.

I think this is a major barrier to successfully using Dataflows Gen2 Fast Copy. Therefore, I have created a Fabric idea to improve this, and if you agree, you can vote for it here: Microsoft Idea

Note 1: If you need to apply other transformations, or set a destination other than a Lakehouse, then you should stage and reference your query first. I will cover this in a future blog, so for now we assume there are no transformations needed other than those supported by Fast Copy.

Note 2: There are other methods to combine files without using the native GUI, such as encapsulating the Csv.Document function within the Table.AddColumn function. However, I have found that this more custom and flexible approach doesn’t trigger Fast Copy, even though it is effectively doing the same thing under the bonnet. I am looking into this further.

Add data destination and Save

Add the data destination as a Lakehouse (this needs to be setup before the dataflow).

Save the dataflow and wait for the refresh to complete. If successful, you can check the refresh history to see the confirmation that the backend of the pipeline Copy Activity has been used.

Performance analysis

Now let’s check how fast “Fast Copy” actually is. I analysed the performance of three runs on an isolated F64 capacity, taking the average (median) result for each of the following four scenarios:

The pipeline Copy Activity (no dataflows here).
Dataflow Gen2 – Fast Copy enabled (Power Query redistributes the workload to the backend of the pipeline Copy Activity).
Dataflow Gen2 – Fast Copy disabled (Power Query uses the mashup engine only).
Dataflow Gen1 – Fast Copy N/A (Power Query uses the mashup engine only).

I observed the following:

The Dataflow Gen2 with Fast Copy enabled is 40% slower than the pipeline Copy Activity.
The Dataflow Gen2 with Fast Copy enabled is 10 times faster than with it disabled.
The Dataflow Gen2 with Fast Copy disabled is 3 times slower than the Dataflow Gen1.

I am not surprised that the Dataflow Gen2 Fast Copy is slower than the pipeline Copy Activity, but perhaps a little surprised that it is as much as 40% slower. This is because even though both scenarios are using the same backend, there are many more moving parts to the Dataflow Gen2. Power Query first must check whether the prerequisite criteria are met, and then must redistribute the workload to the backend of the pipeline Copy Activity.

I think the performance of the Dataflow Gen2 with Fast Copy disabled is exceptionally slow in this example. It can be explained by the fact there is currently no support for parallel ingestion of files in Dataflow Gen2 i.e. Power Query is processing each file in series. However, it is listed as an area of improvement on the Fabric roadmap here.

The refresh performance is even slower than the Dataflow Gen1, which is on a much older architecture (CDM Folders format). However, this story changes when you analyse the exact same scenarios, but with compute resource as the metric.

The Dataflows Gen2 are much more efficient in terms of compute resource compared to Dataflows Gen1 – even with Fast Copy turned off. It will just take much longer to refresh until the support for parallel ingestion of files in Dataflow Gen2 is delivered.

Other best practices

Staging

Staging is enabled by default, and I would recommend disabling it when applying no transformations or working with small amounts of data.

This is because the time taken to write the data to an additional staging Lakehouse outweighs the benefit of having the SQL endpoint available for transformations that are not supported by Fast Copy.

However, Microsoft do plan to have staging disabled by default, as detailed on the Fabric roadmap here.

Require Fast Copy

You may have heard of the term “If you are going to fail, then fail quickly”. Selecting “Require fast copy” ensures that the Power Query mashup engine will try to execute Fast Copy, regardless of whether the criteria are met.

This can be beneficial because if the criteria are not met, then the dataflow will fail almost immediately. This is much better than waiting for the dataflow to finish, then realising it did not use Fast Copy and instead used the slower Power Query mashup engine.

You can set the “Require fast copy” option by right-clicking on the query to select and enable that option.

Fast Copy Indicator

Handily, there is a Fast Copy indicator that works much in the same way as the query folding indicator does i.e. Power Query will inform you whether the list of applied steps is compliant with the transformations supported by Fast Copy.

Summary

By enabling high scale compute and making your query compliant with the prerequisite criteria, you can trigger Fast Copy and leverage the scalable backend of the pipeline Copy Activity.

There are some handy aids such as “Require fast copy” and the “Fast Copy indicator” that keep you on the right track. But be warned, if using the native “Combine” functionality for files then you could (in this case) be tripped up by the (not so helpful) helper queries.

As with most things, Fast Copy requires some love and attention to get the most out of it. If you follow the steps outlined in this article, then you shouldn’t go too far wrong. Thank you for reading my blog post – interested to hear how you get on.