Categories
Azure Synapse Analytics

Azure Synapse Analytics and telemetry data

With the recently announced public preview of Azure Synapse data explorer, IoT analytics solutions inside the Azure stack are taken to a new level. See this article here.

Categories
PowerBI

Enhance Relative Date Filter and allow for comparison of two periods

We are going to address the following limitations:

  1. Sometimes you may want to highlight the last two weeks of the month and compare them with the first two weeks, or compare the performance of the main KPIs of the last selected period against the previous one.
  2. The relative date filter of Power BI uses today’s date (17-Jan-2022) as the last date, and sometimes you need the last period to end on an earlier day, not today. For example, if you only have data until 15-Jan-2022 and today is 19-Jan-2022, when you select Last week you want it to refer to the days between 10-Jan-2022 and 16-Jan-2022.
  3. Sometimes you may want to customize the types of periods shown in the slicer. By default it shows Days, Weeks, Week Calendars, and so on; you may want to reduce that list and show fewer options.
  4. Weeks start on Sunday or on Monday depending on your regional settings. You may want to force your dashboard to always start weeks on a Monday, independently of the regional settings of the viewer.

To resolve these problems, we follow the idea of Josh Trewin, a Sports and Data Scientist, presented in his video, and go a little bit further.

Following the tutorial, these are the steps we need to take.

We assume that you have a Dates table in your data model, as can be seen below.

You then need to duplicate your Dates table. In the example below, we created the ‘Dates_Disconnected’ table.

Change the Date slicer to dropdown format.

Replace the ‘Dates’[Date] column in the Date slicer with the ‘Dates_Disconnected’[Date] column.

Create a table with the values for the period slicer (with the options that suit you), and a measure to get the selected value.

PeriodSelected = ALLSELECTED(Periods[Period])

Create a measure that returns the number of days to go back, depending on the selected period.

PeriodDays = 
SWITCH([PeriodSelected],
    "Week", 6,    // both endpoints are included, so going back 6 days covers a 7-day week
    "Month", 30
)

Create measures that compute the start and end dates of the last period we are going to filter, and of the previous one:

PeriodEndLast = LASTDATE(Dates_Disconnected[Date])
PeriodStartLast = [PeriodEndLast] - [PeriodDays]
PeriodEndPrev = [PeriodStartLast] - 1
PeriodStartPrev = [PeriodEndPrev] - [PeriodDays]

Add the period slicer to your canvas and set the selection to single select.

Create the magic filter measure. It flags the dates inside the selected period by returning a 1, without losing the dates (and the fact table rows) that fall before that period. This is why I love this idea: it allows you to compare against other periods on the same page of your report. The measure works as a visual-level filter, so it doesn’t actually filter your date table.

DateFilter = 
VAR _Date = MAX('Dates'[Date])
VAR _Filter =
    // 1 if the date falls inside the selected period, 0 otherwise
    IF( _Date <= [PeriodEndLast],
        IF( _Date >= [PeriodStartLast],
            1,
            0
        ),
        0
    )
RETURN
    _Filter

Now add this measure as a visual-level filter to the charts where you want to show the period, and this is the result so far:

Chart filtered in the last month

Chart filtered in the last week going two days back.

What we have left is to create the measures that show the amount of spend in the last selected period and in the period before that.

SpendPeriod = 
VAR _Value = 
    CALCULATE(
        SUM('Internet Sales Fact'[spend]),
        DATESBETWEEN(
            'Dates'[Date],
            [PeriodStartLast], [PeriodEndLast]
        )
    )
RETURN
    _Value

SpendPeriodPrev = 
VAR _Value = 
    CALCULATE(
        SUM('Internet Sales Fact'[spend]),
        DATESBETWEEN(
            'Dates'[Date],
            [PeriodStartPrev], [PeriodEndPrev]
        )
    )
RETURN
    _Value

Now just add cards to see the values and periods, and voilà, you have a comparison view:

You can see on the left that we can change the period and go back in time. Hope this helps!!!

Categories
Machine Learning

Building a Machine Learning Model to infer a Car Price in Azure Machine Learning Studio

The goal of this article is to build and explain a model in Azure Machine Learning Studio (ML Studio) that infers a price for a car based on some of its characteristics.

Getting the Data

The data we are going to use is provided by ML Studio as a sample, which can be found in the Datastore section inside azureml_globaldatasets:

While we could use the already created Dataset, in this case we will download the file and create our own dataset to illustrate the full process.

Once the file is downloaded to our local PC, we can upload it to our own Blob Storage. Every Azure Machine Learning workspace has an associated storage, which can be found from the Portal:

We will use the ML Studio UI to upload the file, but this can also be achieved directly through the portal as with any Blob Storage. To upload the file and create a dataset we need to use the Dataset section:

From there, we select:

  1. Create Dataset
  2. From Local Files
  3. Provide a name for the Dataset and, for this example, select Tabular
  4. In this screen we can upload the file and change where it’s going to be uploaded. Let’s use the default path for now
  5. We can see a preview of the data and the detected File Format (Parquet)
  6. We can modify the schema, changing which columns we include and their datatypes

Finally, we can check the details and create the dataset. Azure ML Studio provides some tools to inspect the Dataset that we can use to get a feel for the data we are dealing with:

This Profile view includes a lot of statistics about each column, which may be useful in some cases

Each row holds data for a specific car, including its price.
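
If you prefer code to clicks, the same upload-and-register flow could be sketched with the azureml (v1) Python SDK. The file name, target path and dataset name below are just illustrative placeholders:

# Rough sketch of uploading the file and registering a Tabular dataset with the azureml (v1) SDK.
# Workspace config, file name, target path and dataset name are illustrative.
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()                    # reads the workspace details from a local config.json
datastore = ws.get_default_datastore()          # the Blob Storage associated with the workspace

# Upload the downloaded Parquet file to the default blob container
datastore.upload_files(files=["./automobile_data.parquet"],
                       target_path="car-price/",
                       overwrite=True)

# Create a Tabular dataset on top of the uploaded file and register it in the workspace
dataset = Dataset.Tabular.from_parquet_files(path=(datastore, "car-price/automobile_data.parquet"))
dataset = dataset.register(workspace=ws, name="car-price-data",
                           description="Sample automobile price data")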

Design the pipeline

Now that we have the dataset ready, it’s time to train our model. Azure ML Studio offers three separate ways to do this:

  • Python Notebooks: The most versatile choice, meant for programmers.
  • Automated ML: The simplest choice. You just need to select the dataset and the intended goal, and it will automatically choose the best model for the job.
  • Designer Pipelines: A middle ground between Automated ML and Python notebooks, allowing you to build ML pipelines yourself, but in a visual way.

We are going to use the pipelines, as they provide a good trade-off between understandability and control.

First, we need to create the pipeline:

After that we can add the dataset we just created by dragging from the Dataset section:

After we import the data, we are going to clean it up by removing rows with empty values. However, before we do that, we want to get rid of the normalized-losses column, because it has a lot of missing values:

Here we can see that 20% of the values are missing for normalized-losses, which means we would lose 20% of the rows in the next step just from this column

To do this, we can drag a Select Columns in Dataset component from the Data Transformation section and connect the dataset to it. We will use a rule to include all columns except normalized-losses:

Now, from the Data Transformation section, we can drag a Clean Missing Data component and select All Columns. This is what it looks like:

After the clean-up we are ready to start with the training process. The first step is, as usual, to split the dataset into two groups, one for training and one for testing/evaluating our model. We will use a random 70/30 split, connecting the first output of Clean Missing Data to a Split Data component and specifying 0.7 as the Fraction of rows in the first output dataset:

Now it’s time to choose which algorithm we are going to use. Let’s recall our problem: we want to infer a price (a real number) from a set of characteristics of a car, so this is a regression problem. We could use many different algorithms, and the Designer includes components to compare them and see which one performs better, but for this example we will select the Linear Regression algorithm, which is simple to set up. There is a component for this algorithm under the Machine Learning Algorithms section, so we just need to drag it in. We will use the default parameters.

With the algorithm and the dataset ready we can set up the training, dragging the component Train Model and connecting to it both the algorithm and the first dataset from the split (the 70% of the original). We also need to tell it that we want to infer the price:

Note that we have set Model Explanations to True. This is an optional feature that helps you better understand how the model is predicting the price.

After this step is executed, the model is already trained. However, we will add a Score Model component (which receives the model and new data and infers the desired label, in this case the price) and an Evaluate Model component to check how good the model is. For this, we will use the remaining 30% of the dataset, the testing split, and compare the predictions of the model with the actual values. This is the final pipeline:
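
For readers who think in code, the whole Designer flow above (drop normalized-losses, remove rows with missing values, split 70/30, train a Linear Regression, score and evaluate) maps roughly onto the scikit-learn sketch below. This only illustrates the same steps under assumed file and column names; it is not what the Designer runs internally:

# Rough scikit-learn equivalent of the Designer pipeline (illustrative only).
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score

df = pd.read_parquet("automobile_data.parquet")   # the downloaded sample file (name assumed)
df = df.drop(columns=["normalized-losses"])       # Select Columns in Dataset: drop the sparse column
df = df.dropna()                                  # Clean Missing Data: remove rows with missing values
df = pd.get_dummies(df)                           # encode categorical columns as numbers

X, y = df.drop(columns=["price"]), df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)          # Split Data: random 70/30 split

model = LinearRegression().fit(X_train, y_train)  # Train Model with the Linear Regression algorithm
predictions = model.predict(X_test)               # Score Model on the testing split
print(mean_absolute_error(y_test, predictions))   # Evaluate Model: mean absolute error
print(r2_score(y_test, predictions))              # Evaluate Model: coefficient of determination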

Training the model and setup

Now it’s time to execute it and see the results. For this we first need to create a Compute. For this simple scenario the most basic compute should work. After that, we can attach the compute to the pipeline:

Now we can Publish the pipeline, creating a new Pipeline Endpoint which can be accessed from outside (e.g. from Azure Synapse), and Submit it. This runs the pipeline for the first time. After some minutes we have the results:

Everything completed without errors and a new button appeared on top. This button allows us to create an Inference Pipeline from the trained model, so we can use it through the API from different places like a Python notebook. Let’s do it, selecting real time inference. This is the generated Pipeline:

The Web Service Input indicates where the data from the API call enters, and the Web Service Output shows where the data comes out to answer the API call. This way the pipeline can be modified by the user if needed before submitting (removing unwanted columns before feeding the input, performing some transformations…). In our case we are just going to Submit and Deploy, selecting Azure Container Instance for simplicity instead of Kubernetes. Once it’s deployed we will use the REST endpoint to query the model. If we go into the model, it even provides us with the Python code needed to query the endpoint:

Testing the Inference

Inside ML Studio we can create and run python notebooks using the created computes, employing different IDEs like JupyterLab or VSCode, but in this case we will run the Notebook from an Apache Spark instance in our Synapse Workspace.
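
The kind of code such a notebook contains is roughly the following. The scoring URI, key and input schema all come from the endpoint’s Consume tab in ML Studio, so treat the values below as placeholders rather than the exact generated script:

# Illustrative sketch of querying the deployed real-time endpoint.
import json
import requests

scoring_uri = "http://<your-endpoint>.azurecontainer.io/score"   # placeholder ACI endpoint
api_key = "<your-key>"                                           # only needed if authentication is enabled

payload = {
    "Inputs": {
        "WebServiceInput0": [
            {
                "make": "toyota",            # example feature values; remember: no empty strings
                "fuel-type": "gas",
                "num-of-doors": "two",
                "num-of-cylinders": "four",
                # ...the remaining columns of the automobile dataset go here...
            }
        ]
    },
    "GlobalParameters": {}
}

headers = {"Content-Type": "application/json",
           "Authorization": f"Bearer {api_key}"}
response = requests.post(scoring_uri, data=json.dumps(payload), headers=headers)
print(response.json())   # the response carries the scored label, i.e. the inferred price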

After we run the Notebook, we get the expected output (keep in mind that, for this to work, none of the strings provided can be empty):

For this input, the predicted price is 14172 and the real one is 13495, which is roughly a 5% error. This could be further improved by choosing different algorithms or getting a larger dataset. If we change the input so the car has four doors, uses diesel and has six cylinders, we will get a different price:

In this case the inferred car price is 26163, which matches our expectation that the price rises after adding more cylinders and doors and changing the fuel type to diesel.