FreshBI

View Original

DAX-JUNGLE: NORM.DIST

It’s a jungle out there

Back in the day- when I was stuck on a DAX problem, I used to toggle through the IntelliSense in PowerBI one letter at a time. I’ve learned much since then and in this blog I’d like to share my experience with using NORM.DIST in Dax.

A:

  • ABS

  • ACOS

  • ACOSH

B:

  • BETA.DIST

  • BETA.INV

  • BLANK

Etc….

Hours wasted..

Mistakes were made

A MUCH better use of my time would have been reviewing quality solutions to real world problems.

So that’s what we’ll do. Real problems, real DAX.


The Basics: NORM.DIST

If you’ve ever worked with statistics, or gone through any formal computer science education, chances are very high that you have encountered statistical distributions. As an overview for if you are not familiar; distributions are mathematical models used to describe probabilities of occurrences based on a few parameters.

Distributions help us answer questions like:

“On an average day, Gollum catches 4 fish. What is the probability they will catch 3 or less fish?”

Asking these questions and being able to visualize the answer quickly in Power BI can be a vast boon to your business reporting solutions.

In our example today, we will look into utilizing the normal distribution in DAX: According to Microsoft Themselves, NORM.DIST:

“Returns the normal distribution for the specified mean and standard deviation”

… Well, that is underwhelming. Lets breakdown what we can do with this!


Example Use Case:

Introduction:

Now that we have see a brief overview, we are going to walk through an example to help illustrate a NORM.DIST use case.

In this example I have a data set which is a list of entities with three core features,:

  • Each entity belongs to a collection.

  • Each entity has a text description.

  • There are a number of keywords that can be identified within each text description.

Using this data I want to highlight the distribution of how often various keywords occur between collections.

Now lets get to the set-up.

Data Set-up:

As a quick preface, the NORM.DIST function takes 3 key parameters.

  1. The first is the value at which you want to evaluate the probability.

  2. The second is the mean (average) number of occurrences within your dataset.

  3. The third is the Standard deviation of occurrences within your dataset.

On-top of our entity based dataset, we will need two other tables to accomplish what we are setting out to do.

Firstly, we will need a “value table” on which to evaluate our distribution function. We can create this by clicking “New Table” in the Modelling tab in Power BI Desktop and entering the following:

Value Axis Table = GENERATESERIES(0,50,1)

This gives us a table with one column “Value” which contains values from 0 to 50. One important note is the range of values you want to evaluate on is highly dependent on the data you are working with. You may have to adjust this range accordingly.

Secondly, we will want to bring in a short list of keywords we wish to search our entities descriptions for. I have manually entered these through Power BI’s “Enter Data Option


DAX Set-Up:

Next up we are going to need to write some DAX Measures that calculate the mean and standard deviation for our cases. Disclaimer: this section delves into some basic statistics with the goal of calculating the mean and standard deviation for selected keyword occurrences in our dataset. Feel free to skip this section if you are just interested in how we use the NORM.DIST function.

Every dataset will have different steps required to get these two parameters. The challenge that we have set ourselves to solve may require an intermediate understanding of DAX. Some solutions will require nothing more than the “Average” function, and the STDEV.P function.

Calculate the Mean:

Lets start with the average with the assumption that we will always be analyzing exactly one keyword at a time.

Lets break down what we must do.

Firstly we are calculating the average count per collection. This means we must start with an AVERAGEX over DISTINCT(Entites[colection])

For each collection, we want to count the number of rows in which the Entites[description] field contains our selected key word. This looks like:

COUNTROWS(FILTER(Entites,CONTAINSSTRING(Entites[description],SELECTEDVALUE('Common Keywords'[Keywords]))))

When we put this all together our Average Count measure looks like:

Calculate the Standard Deviation:

Here is where things get a little hairy. There is a DAX function STDEV.S which takes a column of numerical values and returns the standard deviation. However, in our case, we do not have a single column that contains a numerical value of the count per keyword, per collection. We have to manually define the standard deviation. A quick google search yields the following definition.

This looks awfully complicated at first glance, but lets break down what this is asking us to do.

Firstly, this portion is the main workhorse of the standard deviation measure. In English, this is saying: for each collection obtain a result by subtracting the count of occurrences of the desired keyword from the average overall then square that value; we then add up the result for every collection. This is a “sum of squared differences”, to describe it short hand. (Note: the absolute value is extraneous in this definition, since squaring always yields a positive number)

After this we simply take this “sum of squared differences”, divide it by the total number of collections, and square root it. Lets see how this pans out in DAX!

Obviously the most difficult part is our “sum of squared differences”.

Similar to our Average Count above, we must start with SUMX over DISTINCT(Entites[colection])

We user the POWER function to square each difference.

In the difference itself, we can use exactly the same lines as in our average calculation. Recall this line counts the number of rows in a collection which contains the selected keyword.

CALCULATE(COUNTROWS(FILTER(Entites,CONTAINSSTRING(Entites[description],SELECTEDVALUE('Common Keywords'[Keywords])))))

And then subtract the average. Here we can reuse our Average Count Measure, specifying to count the average over all entities, thus ignoring the context provided in our SUMX.

 -CALCULATE([Average Count],all(Entites[colection]))

Once this portion is resolved we simply divide by the DISTINCTCOUNT of Entites[colection], and take the square root. All combined this looks like this:


Putting it all Together:

If you skipped over the DAX above, we now have two measures, [Average Count], and [Standard Deviation] which calculate the mean and standard deviation respectively.

Okay, now that our set up is complete we can get to the process of putting our NORM.DIST to use.

First, lets visualize what these measures have resulted in for us so far.

I’ve built a couple cards and given the user the option to select one and only one keyword at a time.

We can see that our measures are giving us the average count and standard deviation as hoped.

Calculating the Distribution:

Now that we have the mean and standard deviation, we can finally get back to the NORM.DIST function. Recall the values table that we built? Now is when we utilize this. Our goal is to evaluate the NORM.DIST function for each value in our table from 0-50 and display this on a line chart. Our NORM.DIST function hence looks. Note the “FALSE()” as a fourth parameter in the Function we’ll cover that very soon.

 Dist = NORM.DIST(max('Value Axis Table'[Value]),[Average Count],[Standard Deviation],FALSE())

Lets create a line chart, put the “Value Axis Table” value in the x-axis, and place our “Dist” measure in the value placeholder.

Now we can see the probability distribution for the number of occurrences of the Keyword Flying in any given collection! In the screen shot above we can see based on this model the probability that exactly 22 entities with the Keyword Flying occurs in a collection is about 3%.

We Must Go Deeper:

Okay, that was fun, but what else can we do with this?

Cumulative Distribution:

Notice how our like chart returns the probability that exactly 22 instances of the keyword occur. What if we want to answer:

“What is the probability that there are less than or equal to 22 instances of a keyword?”

Logically, it makes sense roughly that adding up all the probabilities of there being 0, 1, 2 …, 22 instances of a keyword would give you the probability of there being less than 22. Did you notice the cheeky FALSE() in our NORM.DIST function?

Dist = NORM.DIST(max('Value Axis Table'[Value]),[Average Count],[Standard Deviation],FALSE())

This is the “Cumulative” flag which does exactly what we just described! Change this to TRUE() and the chart looks like:

By hovering over 22, we see that the probability that the number of occurrences of “Flying” is less than or equal to 22 is around 64%.

One other critical note to consider: the normal distribution will give probabilities for negative values. Notice in the first line, based on the y-axis scale, the probability for 0 looks like around 1%. However, when we add up all the probabilities for negative values, which are impossible in our case, we get the probability for 0 being closer to 5%. Thus it can be important to visualize both.

Inverse Distribution:

“We’ve only ever asked, ‘what is the probability, not how is the probability?’”.

So far our questions have revolved around passing in a number and receiving the probability. The inverse normal distribution is here to help us ask the another question. What number will we be under 80% of the time.

Using the inverse almost identical, except, instead of passing the values and receiving the probabilities, we flip the script and pass the probabilities.

In order to test this out we’ll have to create a “Probability Value Table”, which contains values between 0 and 1 ( they cannot actually contain 0 or 1, for… important reasons)

Probability Axis Table = GENERATESERIES(0.01,0.99,0.01)

If we plug our newly created axis into a new “Inv Dist” using the DAX NORM.INV function we see:

Inv Dist = NORM.INV(max('Probability Axis Table'[Value]),[Average Count],[Standard Deviation])

If we find 64% on our x-axis, we see clearly the value that we will be under 64% of the time rounds up to 22.

Again notices the probabilities associated with negative values. We can apply a visual level filter to “Inv Dist” >= 0 to visually accommodate for this if desired.

Small Multiples:

After a long time waiting, Microsoft has recently added a highly requested feature to many visuals and this is a prime example of where to show this off.

Utilizing small multiples to show a variety of distribution shapes associated with each keyword can really highlight the differences and similarities between them.


Summary

Power BI’s statistical functions seem scary at first, but once you break down the requirements and strategies for implementing them, they’ll be scared of you.

Utilizing distributions as a means to communicate complex measurements like standard deviations is a surefire way to build up the value in your report suite; at the very least you’ll look good doing it.

Let’s Chat

At FreshBI, we’ve lowered the barriers, risk and cost of developing world-class Power BI dashboards so that you can unlock the value in your data.

Contact us through the scheduling app to start a conversation about how our data visualization consultants can design your best Power BI dashboards today.


Our Latest Blogs

See this gallery in the original post

About FreshBI

Operating throughout USA and Canada, FreshBI is a boutique Microsoft Partner using Power BI to unlock the value trapped in your data so that your business can succeed.

See this social icon list in the original post