• Ben Watt

Power Query: Cleaning data to keep only numbers

I had a requirement to clean out all data in a field except for the numbers. This was a plain text field where users would manually enter data and within the text was an order number.


For example, from the below I just want 51638, the rest is discarded.

The data I got was never entered consistently (humans, am I right?) so using any kind of pattern to obtain the order number was not possible.


The below shows an example of the data, and the result I want.

Power Query has a wonderful function called Text.Remove. The really powerful part is that you can pass in a range of characters to remove, using lists. Normally, Replace-type functions only allow a string value meaning it's difficult to remove a varying selection of characters from the original text.


Getting the second parameter right is the only minor challenge. To do this you need to know your unicode table a little. If I wanted to remove all lower case characters, I would pass {"a".."z"} as the second parameter. This list generates all characters in the table between the a and z, inclusive. It's character-specific, so {"A".."Z"} is a different list.


What I needed to remove was all characters in the ascii table before, and after, the sequence of 0..9. If you take a look at the table below, you'll see that I want to pass in the range of characters from the position 1 to position 47, then another list of characters from position 58 to 255 (the good old ascii table max). As it happens, positions 1 to 31 are "control" characters in the table, so I only need to start at 32.


As I know the position numbers, I can simply use the Character.FromNumber function which will convert to the relevant character for me.


Below is resulting M code. Text.Remove supports multiple ranges in a single list, so I've passed the range of chars before and the range of chars after 0-9.

Text.Remove(
 [The Text Column],
 {
 Character.FromNumber(32)..Character.FromNumber(47),
 Character.FromNumber(58)..Character.FromNumber(255)
 }
 )
 
 which is equivalent to 

Text.Remove(
 [The Text Column],
 {
 " ".."/", // first character is a space
 ":".."ÿ"
 }
 )

Symbols and Other languages


If the data you are getting is only going to contain the typical printed characters, then 255 is sufficient as the end of the second range.


However, if you are getting symbols or characters from other languages or, then you need to increase this. The Character.FromNumber function works for numbers up to 1,114,112, except for the range 55,296 to 57,343, which after some Googling I found are unicode surrogates and don't work on their own.


The maximum number you should use is going to depend on what characters you are getting. My example contains some Greek characters, which meant my range had to increase to 1000, because that's where they are on the unicode table. However I also threw in a ⅓ symbol, which has a value of 8531, so I had to bump my range up even further to accommodate for it.

Don't play the easy card and set the max number or your performance is going to suffer having to scan too many characters. Only do this by exception.


Do some data discovery and check the characters you are getting & see if 255 is going to work for you. In the example .pbix file (link below) I also include a query which splits the data into one column of single characters. You'll see in that query how I arrived at 8531 at the end of my range because of that pesky ⅓ symbol!



Lastly, just for fun, I added a word cloud visual to let you casually browse through the characters in the unicode range up to 10,000. Use the slider below and have a look. The cool emojis are around the 9000-10000 range.



You can grab a copy of this Power BI Desktop file here: GitHub



Copyright © 2019 Datalineo Ltd. All Rights Reserved