In this blog post I will share with you how to effectively leverage the performance boosting functionalities available in dataflows when preparing large volumes of data. I have found there are two key components that are critical for developing efficient dataflows:
Enhanced Compute Engine – loads data into a temporary SQL cache and enables Power Query to leverage query folding.
Incremental refresh – partitions your data into smaller chunks to refresh only what you need.
Focusing on these two components resulted in a 90% performance improvement at a customer site when comparing to the same workload in a Pro workspace.
Layered architecture
By breaking out the architecture into separate staging and transformation dataflows you can redistribute the workload from Power Query’s slower mashup engine to the faster SQL based Enhanced Compute Engine (ECE). The diagram below illustrates this concept.
Staging dataflows
The staging dataflows are focused on pure extraction of the data from the source.
If you limit the list of query steps to selecting columns and filtering rows, the data is prepared (staged) for the enhanced compute engine to process further transformations that are defined separately in another dataflow.
I typically use one dataflow per source table to minimise the number of dependencies in a single dataflow. Otherwise, you may find that if one query fails then all the other queries in the same dataflow also fail.
Transformation dataflows
As the name suggests, the transformation dataflows read the staged data and apply the required transformations.
Here is where you focus on triggering the enhanced compute engine to handle expensive transformation steps that are otherwise computed by the Power Query mashup engine.
Linked entities
The first step to trigger the enhanced compute engine is to create a linked entity - which is a reference to an existing table, defined in another dataflow.
Once created, the entity has a small linking icon to indicate it is linked to another dataflow. I usually prefix mine with a “stg” so that I know it is referencing another staging dataflow and then save it in a folder:
Computed entities
The next step is to create a computed entity – which is a reference to another entity in the same dataflow with operations performed on top of it.
If you right click your linked entity, then select Reference, Power Query creates a computed entity with a little lightning bolt icon to indicate as such. I usually prefix this with a “cmp” and then save in a folder too.
Both linked entities and computed entities have the Enable load option ticked, otherwise the enhanced compute engine is not triggered.
Once created, you can start to apply the more complex transformations first that fold to the enhanced compute engine e.g., merge, group by, distinct etc.
Step folding indicator
Annoyingly, the step folding indicator is oblivious to the existence of the enhanced compute engine. For example, the steps below choose columns and filter rows. These are steps that I know fold to the enhanced compute engine.
Nevertheless, Power Query states the step is evaluated outside of the data source. It does not consider the enhanced compute engine, even though in this case it is triggered upon next refresh of the dataflow.
This means you are likely to find yourself relying on your own knowledge of steps that fold to the enhanced compute engine.
If, like me, you do not have an encyclopedic knowledge of M folding steps in the enhanced compute engine of dataflows then I think this article by Cristian Petculescu is immensely helpful.
Enhanced compute engine status
The refresh logs are not always a reliable source to tell if the query folds and are often misleading. For example, it might state the compute engine status is "Folded”.
You might initially assume the enhanced compute engine status “Folded” is equivalent to the traditional Power Query folding concept. However, these are two different things:
Traditional Power Query concept: "Query folding is the ability for a Power Query query to generate a single query statement to retrieve and transform source data.”
The Computed Engine Status - Folded: “Folded means that the dataflow was able to use SQL Compute to read data. The calculated entity used the table from SQL to read data, and the SQL used is related to the constructs of their query.”
The refresh logs do not tell you if the fold breaks partway through your list of query steps, only that the entity “used the table from SQL to read data”.
You can easily test this by adding a step that you know breaks folding. For example, if you add an Index step to a large dataset. In this case the refresh durations skyrocket, but the refresh logs still adamantly state that the entity “Folded”, despite the obvious break in fold.
The phraseology of the enhanced compute engine status “Folded” could be clearer and introduced a source of confusion and frustration for me until I understood the differences.
Configure the enhanced compute engine
There are three settings available for the enhanced compute engine, and knowing which one to choose is not immediately obvious:
The default setting is Optimized. It sounds promising, but if you want to use incremental refresh or Direct Query in your Power BI dataset, then select the setting On instead.
Optimized automatically triggers the enhanced compute engine when an entity in the dataflow is referenced by another entity or when the dataflow is connected to another dataflow in the same workspace.
A Power BI dataset referencing a dataflow does not meet this criteria for Optimized, and so selecting the On setting makes it certain. You can tell the difference if you look at the refresh log:
Optimized --> ECE does not cache data --> no fold --> no incremental refresh nor direct query.
On --> ECE caches data --> folding --> enables incremental refresh and direct query
Incremental refresh
Now that we have covered how to make best use of the enhanced compute engine, we will continue onto leveraging incremental refresh.
Dataflow refreshes are significantly faster with incremental refresh because it leverages query folding to partition your data and only refresh what you need.
There are three main options to consider:
A DateTime column to define the partition.
The time period to store and refresh rows from the past i.e., Year/Quarter/Month/Day
A DateTime column to detect data changes (if enabled)
Incremental refresh can be applied to both your staging and transformation dataflows. The exception is if your staging dataflows use a source that does not support query folding. Typical examples include flat files, blob and web. Then, only apply to your transformation dataflows.
Define the partition
A good DateTime column to use is one that defines the created timestamp of the record. This is because the timestamp does not change for an individual row once created . Therefore, it will be created in a specific partition, and never needs to switch to another one in subsequent refreshes.
Consider the alternative of using a DateTime field which does change whenever there is an update to any value in the row e.g., a status changing from Open to Closed. As soon as the datetime does change then this introduces a risk of the row moving between partitions upon next refresh and creating a duplicated copy. So, I avoid this.
Partition size
For me, there is no fixed rule for partition size. Partitioning by month produces satisfactory results for some tables, whereas partitioning by year is much better for other smaller tables.
The reason is because the time saved processing 36 months (for example) in parallel versus processing 3 years in parallel is marginal for the smaller tables. There is always a time cost to manage the partitions, and in this case the time cost to manage the extra 33 partitions outweighed the time saved.
The fact that Power Query has a maximum parallelism of 30 queries is a major factor to consider in this regard.
For example, the blue bars have a decreasing amount of processor time for the smaller partitions, however, the orange “Other time” increases.
*There is a limit of maximum 120 partitions that can be created
Detect data changes
This functionality allows Power Query to go back and check over all historical partitions, and only refresh those partitions with new data.
If the Detect data changes option is ticked, Power Query stores the maximum value of the selected DateTime column and remembers this for the next refresh.
Upon the next evaluation, if this maximum value increases, the entire partition is refreshed (not just the individual rows that have changed). A modified datetime is a good candidate to use here because it updates whenever there is a change to any value in the row.
Power Query is not able to detect any deleted rows though, unless there is another row that has an updated modified date within the same partition.
It is good idea to round a modified DateTime column to the required latency of the dataset. For example, an hourly refresh does not need the minutes or seconds, which reduces the cardinality (and therefore size) of the column in your dataset.
Other best practices
Dataflow connector
At the time of writing there are two dataflow connectors available in Power Query.
1. PowerPlatform.Dataflows()
2. PowerBI.Dataflows()
If you want to make use of incremental refresh and direct query in your Power BI dataset, then make sure to select the PowerPlatform.Dataflows() connector. The other connector – PowerBI.Dataflows() – does not support these. Here is the documentation for more details.
Orchestration
Using dataflows in the same workspace can enable straightforward orchestration. For example, if you refresh your staging dataflow then Power BI automatically triggers the downstream transformation dataflow. There is no option to turn off this behaviour.
Therefore, you may need to find an alternative method to orchestrate your dataflows for more complex data models. For me, the easiest way is to create two separate workspaces: one for staging dataflows and another for transformation dataflows.
This breaks the automatic refresh, but it does mean you need to manage your own orchestration. For me, building out a Python library was a good way to programmatically refresh the dataflows using the Power BI REST APIs. There are also other options such as scheduling refreshes or using Power Automate. Here is the documentation for more details.
You also have the option to create a layer in another workspace that only contains the final curated data entities. There is no performance benefit, however, it serves to create a clean environment for data modelers to work with, so they are not privy to those intermediate calculation steps.
Unsupported data types
There are a number of unsupported mashup data types. The full documentation is here.
The one I come across most often is when applying a piece of logic that relies on nulls. If you have a null text value in your staging dataflow output, then this is converted to an empty string. This can cause spurious results in your downstream dataflows if it is not handled appropriately.
Summary
By breaking out the architecture into separate staging and transformation dataflows you can trigger the enhanced compute engine. This leverages query folding within your dataflows which unlocks incremental refresh to take your dataflow performance to the next level.
As with most things, the enhanced compute engine requires some love and attention to get the most out of it. If you follow the steps outlined in this article then you shouldn’t go too far wrong. Thank you for reading my blog post – interested to hear how you get on.
Comments