• Ben Watt

Use Python to get the latest Power BI dataflow contents from Azure Data Lake Storage Gen 2

If you are making use of the feature to store your Power BI dataflow results in your own Azure Data Lake Gen 2, then I'd like to share with you a Python script to get the latest file and load it into a pandas DataFrame.


This came from a recent Twitter conversation Ruth Pozuelo, who has been doing some very cool updates on the matplotlib Python module. She asked:

Is it possible to access a dataset curated with Power Query or/and DAX outside Microsoft walls? Like a dataflow or datamart?

Here's the actual Twitter conversation. You will see that I promised to share code the following day, and I did just that, although it was a week later by the time I shared it!

We also discussed getting data from a Power BI dataset via the executeQueries REST API. I have already written that in Python too, so will share in a separate blog.


I had been doing some code to get data from ADLS Gen2, so with some basic edits to my Python scripts I shared some code with Ruth, and I'd like to share it with you too. Ruth said she may do a video on how she uses it too, so keep your eyes open for that on her YouTube channel.


Before jumping straight into the code, here's a summary of what's taking place:

  • I hard-coded some variables, which you should manage properly via something like Azure KeyVault for credentials or parameters to run against different Workspaces/Dataflow/Entity

  • A connection is made to the model.json file to get the list of headers for the Entity. This will ensure the script dynamically adapts when changes are made to the dataflow. This is necessary as Power BI saves the CSVs without headers.

  • A connection is made to the container and sub-folder where the list of CSVs are stored for the Entity. It loops through and identifies the latest file, based on modified date

  • It downloads the contents, converts from string to a file-like object for pandas to load via the read_csv function.

  • After that, you can do whatever you want, I chose to plot the data using the seaborn module. Notice the title of the chart shows the date/time the data was refreshed?

The Python script for you!

from azure.storage.blob import ContainerClient, BlobClient
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import datetime
import pytz
from io import BytesIO
import json

# --------------------------------------
# Variables, usually stored safely with KeyVault, etc
# --------------------------------------
storage_endpoint  = 'https://yourstorage.blob.core.windows.net'
storage_sas_token = '?sv=2021-04-10&st=2022-11-12T00%3A01%3A00Z&se=2022-11-19T23%3A59%3A00Z&sr=c&sp=redacted'
workspace_name = 'MyDataflowWorkspace'
container_name = 'powerbi'
dataflow_name = 'Dimensions'
entity_name = 'Daily Sales'
column_list = []

# --------------------------------------
# Get the list of column headers from the model.json file
# --------------------------------------
# these steps could be nested a little more, however I'm separating for readability
model_file_url = f'{storage_endpoint}/{container_name}/{workspace_name}/{dataflow_name}/model.json/{storage_sas_token}'
model_file_blob_client = BlobClient.from_blob_url(model_file_url)
model_file_download = model_file_blob_client.download_blob().readall()
model_file_contents = BytesIO(model_file_download)
model_file_items = json.load(model_file_contents)

# loop through and grab the name of every column in the dataflow results
for x in model_file_items['entities']:
    if x['name'] == entity_name:
        for y in x['attributes']:
            column_list.append(y['name'])

# --------------------------------------
# Set up connection to the Azure Storage container
# --------------------------------------
container_client = ContainerClient.from_container_url(f'{storage_endpoint}/{container_name}{storage_sas_token}')
blob_list = container_client.list_blobs(name_starts_with=f'/{workspace_name}/{dataflow_name}/{entity_name}/{entity_name}.csv.snapshots/')
utc = pytz.UTC
latest_modified_date = utc.localize(datetime.datetime.fromisoformat('1900-01-01 00:00:00'))

# --------------------------------------
# Loop through all of the blobs found & retain the latest one, by modified date
# --------------------------------------
for blob in blob_list:
    file_modified_date = blob.last_modified
    if file_modified_date > latest_modified_date:
        latest_modified_date = file_modified_date
        latest_blob = blob

# Now we know the latest file, download it & convert it for reading into pandas dataframe
blob_download = container_client.download_blob(latest_blob).readall()
blob_contents = BytesIO(blob_download)
df = pd.read_csv(blob_contents, header=0, names=column_list)

# From this point, you can do whatever you want with the file! In this case, I'm plotting it

# Plot Sales data by Year
chart_title = 'Sales Data, last updated {0}'.format(latest_modified_date.strftime('%a, %d %b %Y %H:%M'))
sns.lineplot(x="Year", y="Sales", err_style='band', data = df).set(title=chart_title)
plt.show()



Here's the output of the plt.show()



Considerations

The list_blobs function can take a while if you have a lot of CSV's (let's say that a lot, in this case, means 100,000 or more).


Power BI only writes to the Azure Storage, so it is your job to do some housekeeping. Is there a Python script to perform that housekeeping? Yes!! I will blog about that too!


Links


Configuring dataflow storage to use Azure Data Lake Gen 2

https://learn.microsoft.com/en-us/power-bi/transform-model/dataflows/dataflows-azure-data-lake-storage-integration


Ruth Pozuelo's YouTube Channel

https://www.youtube.com/CurbalEN


Power BI REST API to execute DAX query (for a later blog)

https://learn.microsoft.com/en-us/rest/api/power-bi/datasets/execute-queries-in-group