The goal of this article is to build and explain a model in Azure Machine Learning Studio (ML Studio) that infers a price for a car based on some of its characteristics.
Getting the Data
The data we are going to use is provided by ML Studio as a sample, which can be found in the Datastore section inside azureml_globaldatasets:
While we could use the already created Dataset, in this case we will download the file and create our own dataset to illustrate the full process.
Once the file is downloaded to our local PC, we can upload it to our own Blob Storage. Every Azure Machine Learning workspace has an associated storage, which can be found from the Portal:
We will use the ML Studio UI to upload the file, but this can also be achieved directly through the portal as with any Blob Storage. To upload the file and create a dataset we need to use the Dataset section:
From there, we select:
- Create Dataset
- From Local Files
- Provide a name for the Dataset and, for this example, select Tabular
- In this screen we can upload the file and change where it’s going to be uploaded. Let’s use the default path for now
- We can see a preview of the data and the detected File Format (Parquet)
- We can modify the schema changing what columns we include and the datatype
Finally, we can check the details and create the dataset. Azure ML Studio provides some tools to inspect the Dataset that we can use to get a feel for the data we are dealing with:
Each row holds data for a specific car, including its price.
Design the pipeline
Now that we have the dataset ready, it’s time to train our model. Azure ML Studio offers three separate ways to do this:
- Python Notebooks: The most versatile choice, meant for programmers.
- Automated ML: The simplest choice. You just need to select the dataset and the intended goal, and it will automatically choose the best model for the job.
- Designer Pipelines: A middle ground between Automated ML and Python, allowing you to build ML Pipelines yourself but in a visual way
We are going to use the pipelines, as they provide a good trade-off between understandability and control.
First, we need to create the pipeline:
After that we can add the dataset we just created by dragging from the Dataset section:
After we import the data, we are going to clean it up removing rows with empty data. However, before we do that, we want to get rid of the normalized-losses column, because there are a lot of missing values:
To do this, we can drag a Select Columns in Dataset component from the Data Transformation section and connect the dataset to it. We will use a rule to include all columns except normalized-losses:
Now, from the Data Transformation section we can drag a Clean Missing Data component and select All Columns. This is how it would look like:
After the clean-up we are ready to start with the training process. The first step is, as usual, to split the dataset into two groups, one for training and one for testing/evaluating our model. We will use a random 70/30 split, connecting the first output of Clean Missing Data to a Split Data component and specifying 0.7 as the Fraction of rows in the first output dataset:
Now its time to choose what algorithm we are going to use. Let’s remember our problem: We want to infer a price (a real number) from a set of characteristics of a car. This is a Regression problem then. We could use many different algorithms, and the Designer includes components to compare them and see which one performs better, but for this example we will select the Linear Regression algorithm, which is simple to set up. There is a component for this algorithm under the Machine Learning Algorithms section, so we just need to drag it. We will use the default parameters.
With the algorithm and the dataset ready we can set up the training, dragging the component Train Model and connecting to it both the algorithm and the first dataset from the split (the 70% of the original). We also need to tell it that we want to infer the price:
After this step is executed, the model is already trained. However, we will add an Score Model (which receives the model and new data and infers the desired label, in this case the price) and an Evaluate Model component to check how good the model is. For this, we will use the remaining 30% of the dataset, the Testing split, and compare the predictions of the model with the actual values. This is the final pipeline:
Training the model and setup
Now its time to execute it and see the results. For this we first need to create a Compute. For this simple scenario the most basic compute should work. After that, we can attach the compute to the pipeline:
Now we can Publish the pipeline, creating a new Pipeline Endpoint which can be accessed from outside (eg. From Azure Synapse) and Submit it. This runs the Pipeline for the first time. After some minutes we have the results:
Everything completed without errors and a new button appeared on top. This button allows us to create an Inference Pipeline from the trained model, so we can use it through the API from different places like a Python notebook. Let’s do it, selecting real time inference. This is the generated Pipeline:
The Web Service Input indicates where data enters from the API call and Web Service Output shows where the data comes out to answer the API call. This way the pipeline can be modified accordingly if needed (removing unwanted columns before feeding the input, performing some transformations…) by the user before submitting. In our case we are just going to Submit and Deploy, selecting Azure Container Instance for simplicity instead of Kubernetes. Once it’s deployed we will use the REST endpoint to query the model. If we go into the model it even provides us with the python code needed to query the endpoint:
Testing the Inference
Inside ML Studio we can create and run python notebooks using the created computes, employing different IDEs like JupyterLab or VSCode, but in this case we will run the Notebook from an Apache Spark instance in our Synapse Workspace.
After we run the Notebook, we get the expected output (keep in mind that, for this to work, none of the strings provided can be empty):
For this input, the predicted price is 14172 and the real one is 13495, which is roughly a 5% error. This could be further improved by choosing different algorithms or getting a larger dataset. If we change it so the input car has four doors, uses diesel and have six cylinders we will get a different price:
In this case the inferred car price is 26163, which match our expectations of the price rising after adding more cylinders, doors, and changing the fuel type to diesel.