top of page
  • Ben Watt

Power Query: Sort + Distinct, a quick tip.

When you perform a Distinct operation in Power Query, the elimination policy is to keep the first instance it sees of the column(s) you have nominated, then discard any subsequent matching rows.


There are occasions when you may want to be more selective in choosing which row you keep. For example, the below shows a list of Products with a Part Number, Description & Last Sold Date. As you can see, the description has changed over time, so I want my result set to reflect the currently used one, highlighted in yellow.

Logically, to fix this you simply perform a Sort operation on the LastSoldDate, in descending order, before doing the Distinct, then you are guaranteed to get the most recent description when you perform the Distinct operation. Yes?


No.


Power Query will still retain the first rows based on the result set at the point before the Sort operation.


This is easy to solve, but the reason behind it opens up a lot of discussion on caching which I've provided some links to below. In order to keep the row you want, you simply need to perform a Table.Buffer operation after the Sort and before the Distinct.


Take the below example M query and try this yourself. Create a blank query and you can simply comment/uncomment between Option 1 and Option 2 at the end and see the different result. Or you can grab a copy from my GitHub


let
    Source = Table.FromRecords ({
        [ PartNumber=101, 
        Description="Hiking Shoes", 
        LastSoldDate="2020-12-10"],
        [ PartNumber=101, 
        Description="Hiking Shoes", 
        LastSoldDate="2020-12-02"],
        [ PartNumber=101, 
        Description="Hiking Shoes Elite", 
        LastSoldDate="2020-12-15"],
        [ PartNumber=101, 
        Description="Hiking Shoes", 
        LastSoldDate="2020-12-08"],
        [ PartNumber=101, 
        Description="Hiking Shoes Elite", 
        LastSoldDate="2020-12-16"],
        [ PartNumber=102, 
        Description="Running Shoes", 
        LastSoldDate="2020-12-02"],
        [ PartNumber=102, 
        Description="Running Shoes", 
        LastSoldDate="2020-12-03"],
        [ PartNumber=102, 
        Description="Running Shoes Max", 
        LastSoldDate="2020-12-17"],
        [ PartNumber=102, 
        Description="Running Shoes", 
        LastSoldDate="2020-12-04"],
        [ PartNumber=102, 
        Description="Running Shoes Max", 
        LastSoldDate="2020-12-18"]
        }),
    ChangedType = Table.TransformColumnTypes(
        Source,
        {{"PartNumber", Int64.Type}, 
        {"Description", type text}, 
        {"LastSoldDate", type date}}),
    SortedRows = Table.Sort(
        ChangedType,
        {{"LastSoldDate", Order.Descending}}),
    // Option 1 - No Buffer, keeps wrong rows
    RemovedDuplicates = Table.Distinct(
        SortedRows, 
        {"PartNumber"})
    // Option 2 - with Buffer, keeps latest rows
    // RemovedDuplicates = Table.Distinct(
        Table.Buffer(SortedRows), 
        {"PartNumber"})
in
    RemovedDuplicates

User Scenarios


When are you likely to be experiencing this Sort & Distinct combination? Sometimes we are given data dumps with lot of columns containing sales/customer/product/etc data in a single flat file. Being good Power BI citizens we don't simply load in as is, we use our data modeling prowess to create a Sales Fact, a Customer Dimension, Product Dimension and Date dimension.


If this is a new concept for you, here's a 10 minute YouTube video from Patrick at GuyInACube to explain a little more.



I also have this diagram which shows how we take a sample flat file & develop a data model in the Kickstarter class I've put together (launching in 2021)


Additional Reading


Both of the links below go into a lot of detail on this topic. They are not recent but are still completely valid.


Referenced Queries And Caching In Power BI And Power Query


Query evaluation chain



Comments


bottom of page