Microsoft Azure Machine Learning x Udacity — Lesson 3 Notes
Detailed notes for the Machine Learning Foundation Course by Microsoft Azure & Udacity (2020), Lesson 3 — Model Training
This lesson is about how to prepare data and transform it into trained machine learning models. This lesson will also introduce you to ensemble learning and automated machine learning.
Data Import & Transformation
Data Wrangling:
Cleaning, restructuring, enriching data to transform it into a suitable format for training machine learning models. This is an iterative process.
Steps:
- Discovery & Exploration of data
- Transformation of raw data
- Publish data
Managing Data:
Datastores:
A layer of abstraction that stores all the information needed to connect to a particular storage service.
Datasets:
Resources for exploring, transforming, and managing data. A dataset is a reference to data located in a datastore.

Data Access Workflow:
- Create a datastore
- Create a dataset
- Create a dataset monitor (critical to detecting issues in the data, e.g. data drift); a sketch of the first two steps is shown below
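Below is a minimal, hedged sketch of the first two steps using the Azure Machine Learning Python SDK (v1). The datastore name, storage account details, and file path are placeholders, not values from the lesson.

```python
# Minimal sketch of the data access workflow with the Azure ML SDK (v1).
# The storage account details and file path below are placeholders.
from azureml.core import Workspace, Datastore, Dataset

ws = Workspace.from_config()  # loads the workspace config downloaded from the Azure portal

# Step 1: register a datastore that points at an Azure Blob container
datastore = Datastore.register_azure_blob_container(
    workspace=ws,
    datastore_name="training_datastore",
    container_name="training-data",
    account_name="<storage-account-name>",
    account_key="<storage-account-key>",
)

# Step 2: create a dataset that references a file in that datastore and register it
dataset = Dataset.Tabular.from_delimited_files(path=(datastore, "data/train.csv"))
dataset = dataset.register(workspace=ws, name="training-data", create_new_version=True)
```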
Introducing Features
- Feature Selection
- Dimensionality Reduction
Feature Engineering
- Core techniques
- Various places to apply feature engineering: 1) in datastores, 2) with a Python library, 3) during model training
- Used more often in classical machine learning
Feature Engineering Tasks:
- Aggregation: sum, mean, count, median, etc (math formulas)
- Part-of: extract a part of a data structure, e.g. the hour from a date
- Binning: group entities into bins and apply aggregations on them
- Flagging: deriving Boolean conditions
- Frequency-based: calculate the various frequencies of occurrence of data
- Embedding or Feature Learning: a relatively low-dimensional space into which you can translate high-dimensional vectors
- Deriving by Example: aims to learn the values of new features using examples of existing features
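As an illustration (not part of the lesson), here is a small pandas sketch of several of these tasks on a hypothetical ride dataset with rider_id, pickup_time, and fare columns.

```python
# Small pandas sketch of a few feature engineering tasks on a made-up ride dataset.
import pandas as pd

df = pd.DataFrame({
    "rider_id": [1, 1, 2, 2, 2],
    "pickup_time": pd.to_datetime([
        "2020-01-01 08:10", "2020-01-01 17:45", "2020-01-02 09:00",
        "2020-01-02 23:30", "2020-01-03 07:15"]),
    "fare": [12.5, 8.0, 22.0, 15.5, 9.0],
})

# Part-of: extract the hour from the pickup timestamp
df["pickup_hour"] = df["pickup_time"].dt.hour

# Flagging: derive a Boolean condition
df["is_night_ride"] = (df["pickup_hour"] < 6) | (df["pickup_hour"] >= 22)

# Binning: group the hour into coarse time-of-day bins
df["time_of_day"] = pd.cut(df["pickup_hour"], bins=[0, 6, 12, 18, 24],
                           labels=["night", "morning", "afternoon", "evening"],
                           right=False)

# Aggregation: mean fare and ride count per rider
rider_stats = df.groupby("rider_id")["fare"].agg(["mean", "count"])
```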
Feature Selection
Given an initial dataset, we may create a number of new features. Feature selection identifies the features that are useful and filters out the rest.
Reasons for Feature Selection:
1. Eliminate redundant, irrelevant, or highly correlated features
2. Dimensionality reduction
Dimensionality Reduction Algorithms:
- Principal Component Analysis (PCA): a linear technique with more of a statistical approach.
- t-Distributed Stochastic Neighbor Embedding (t-SNE): a probabilistic approach to reducing dimensionality. The target number of dimensions with t-SNE is usually 2 or 3, and it is used a lot for visualizing data.
- Feature Embedding: train a separate machine learning model to encode a large number of features into a smaller number of features, also called super-features.
All the above are also considered to be Feature Learning techniques.
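For reference, here is a minimal scikit-learn sketch of PCA and t-SNE applied to a synthetic feature matrix; the shapes and component counts are arbitrary choices for illustration.

```python
# Minimal sketch of the two classical dimensionality reduction techniques above.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.rand(200, 50)  # 200 samples, 50 features

# PCA: linear projection onto the directions of maximum variance
X_pca = PCA(n_components=10).fit_transform(X)    # shape (200, 10)

# t-SNE: probabilistic, non-linear embedding, typically into 2-3 dimensions
# for visualization
X_tsne = TSNE(n_components=2).fit_transform(X)   # shape (200, 2)
```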
Azure Machine Learning Prebuilt Modules Available:
- Filter-Based Feature Selection: helps identify the columns in the input dataset that have the greatest predictive power
- Permutation Feature Importance: helps determine the best features to use in a model by computing a set of feature importance scores for your dataset
Data Drift
Data Drift is a change in the input data for a model. Over time, data drift causes degradation in the model’s performance, as the input data drifts farther and farther from the data on which the model was trained.
Data drift is the top reason why model accuracy drops over time. Detecting data drift helps identify model performance issues and also lets us trigger a retraining process to avoid them.
Causes of Data Drift:
- Upstream process changes, e.g. a sensor is replaced, causing the units of measurement to change (e.g., from minutes to seconds)
- Data quality issues e.g. sensor breaks
- Natural data drift: data is not static e.g. customer behavior
- Covariate shift: a change in the distribution of the input features
Dataset Monitors:
Dataset monitors can be set up to provide alerts that assist in detecting data drift over a dataset. They help create not only models that are good at a given point in time, but also models that maintain their level of performance over time.
Monitoring for Data Drift:
Detecting data drift early enough is critical to maintaining a healthy model. The data drift algorithm provides an overall measure of the change that occurs within the data.
The process of monitoring for data drift involves:
- Specifying a baseline dataset, usually the dataset originally used to train the model
- Specifying a target dataset — usually the input data for the model
- Comparing these two datasets over time, to monitor for differences
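A hedged sketch of this process with the azureml-datadrift package (SDK v1) follows; the dataset names, compute target, frequency, and threshold are assumptions, and exact parameter names may vary between SDK versions.

```python
# Hedged sketch of setting up a dataset monitor with the azureml-datadrift package.
from azureml.core import Workspace, Dataset
from azureml.datadrift import DataDriftDetector

ws = Workspace.from_config()
baseline = Dataset.get_by_name(ws, "training-data")   # data originally used for training
target = Dataset.get_by_name(ws, "scoring-data")      # incoming input data for the model

monitor = DataDriftDetector.create_from_datasets(
    ws, "drift-monitor", baseline, target,
    compute_target="cpu-cluster",   # assumed compute cluster name
    frequency="Week",               # how often to compare the two datasets
    drift_threshold=0.3,            # alert when drift magnitude exceeds 30%
)
monitor.enable_schedule()           # start the recurring comparison
```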
Scenarios for setting up data drift monitors in Azure ML:
- Monitoring a model's input data for drift from the model's training data
- Monitoring a time-series dataset for drift from a previous time period.
- Performing analysis of past data. This is typically done on historical data to better understand its dynamics and to make better decisions when improving the model.
Understanding Data Drift Results:
Results are produced in charts — Drift Overview shows the following set of charts:
- Data Drift Magnitude: the percentage of drift between the baseline and target dataset over time, ranging from 0 to 100% (where 0 indicates identical datasets and 100 indicates completely separable datasets).

- Drift Contribution by Feature: helps you understand which particular features of your dataset have contributed most to the drift. Identifying those features is important because they can then be better understood, i.e. 1) the way they evolve through time, and 2) how to use them most appropriately in model training.

Here are a couple of different types of comparisons you might want to make when monitoring for data drift:
- Comparing input data vs. training data. This is a proxy for model accuracy; that is, an increased difference between the input vs. training data is likely to result in a decrease in model accuracy.
- Comparing different samples of time series data. In this case, you are checking for a difference between one time period and another. For example, a model trained on data collected during one season may perform differently when given data from another time of year. Detecting this seasonal drift in the data will alert you to potential issues with your model’s accuracy.
Model Training Basics
The goal of the Model Training process is to produce a trained model that you can later use to predict. We want to be able to give the model a set of input features, X, and have it predict the value of some output feature, y.
It is important to establish the problem to be solved, e.g. a classification or regression problem. The framing of the problem will influence both the choice of algorithms in the training process and the various approaches taken to get to the desired result.
An important prerequisite is to understand and transform the data, create new features, and select the features that are most relevant to the training process. Once we have both the problem type and the training features defined, the next steps are:
- Decide whether to scale or encode your data
- Splitting data (i.e. Training, Validation, and Test dataset)
Parameters and Hyperparameters:
When we train a model, a large part of the process involves learning the values of the parameters of the model. For example, earlier we looked at the general form for linear regression:
y = B0 + B1*x1 + B2*x2 + B3*x3 … + Bn*xn
The coefficients in this equation, B0 … Bn, determine the intercept and slope of the regression line. When training a linear regression model, we use the training data to figure out what the values of these parameters should be. Thus, we can say that a major goal of model training is to learn the values of the model parameters.
In contrast, some model parameters are not learned from the data. These are called hyperparameters and their values are set before training. Here are some examples of hyperparameters:
- The number of layers in a deep neural network
- The number of clusters (such as in a k-means clustering algorithm)
- The learning rate of the model
We must choose some values for these hyperparameters, but we do not necessarily know what the best values will be prior to training. Because of this, a common approach is to make a best guess, train the model, and then adjust or tune the hyperparameters based on the model's performance.
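As a small illustration of the distinction (using scikit-learn, not anything from the lesson): the regularization strength alpha of a ridge regressor is a hyperparameter set before training, while the coefficients, the B0 … Bn above, are parameters learned from the data.

```python
# Hyperparameters are chosen before training; parameters are learned during training.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=500, n_features=3, noise=0.1, random_state=0)

model = Ridge(alpha=1.0)      # hyperparameter: set by us before training
model.fit(X, y)               # training learns the parameters from the data

print(model.intercept_)       # learned B0
print(model.coef_)            # learned B1 .. Bn
```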
Splitting the Data:
As mentioned in the video, we typically want to split our data into three parts:
- Training data
- Validation data
- Test data
We use the training data to learn the values of the parameters. Then, we check the model's performance on the validation data and tune the hyperparameters: we adjust them, test the model on the validation data again, and repeat until the model performs well.
Finally, once we believe we have our finished model (with both parameters and hyperparameters optimized), we will want to do a final check of its performance — and we need to do this on some fresh test data that we did not use during the training process.
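A common way to produce the three splits with scikit-learn is to hold out the test set first and then split the remainder into training and validation data; the 70/15/15 ratio below is just one reasonable choice.

```python
# Split labeled data into training, validation, and test sets (~70/15/15).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out 15% as the test set first...
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.15, random_state=0)

# ...then split the remainder into ~70% training / ~15% validation overall
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.15 / 0.85, random_state=0)
```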
Model Training in Azure Machine Learning
Azure Machine Learning Service provides a comprehensive environment to implement model training processes, giving you a centralized place to work with all the artifacts involved in the process. It is also called a Machine Learning Managed Service and provides a platform which 1) simplifies the implementation of various tasks and 2) offers all the necessary kinds of services that will help create the best possible machine learning models. It provides out of the box managed services that help implement every step of the data science process.
Taxonomy of Azure Machine Learning:
Azure Machine Learning defines several types of artifacts and several classes of concepts related to the implementation of the various steps of the data science process.
Workspace: The centralized place for working with all the components of the machine learning process. Everything in Azure Machine Learning revolves around this concept and is the very first thing to be created in the pipeline.
Compute Instances: A cloud-based workstation that gives you access to various development environments, such as Jupyter Notebooks.
Datasets: A key component in the data preparation and transformation processes that makes data available to Machine Learning processes.
Experiment: A container that helps you organize the model training process and group various artifacts/tasks related to Machine Learning processes run within Azure.
Run: A process that is executed in one of the compute resources e.g. model training, model validation, feature engineering. Every Run will output a set of artifacts: snapshots of data, output files, metrics, and logs.
Compute Targets: the resources on which machine learning processes run, covering a large variety of environments:
- Local environments
- Remote environments: native Azure, Azure Data Lake Analytics
Registered Models: A service that provides snapshots and versioning for your trained models. After a model is created, it gets registered into the Model Registry. Note: Versioning is also important for Models, just like datasets, so that end-to-end traceability is achievable.
Deployment: Either in the form of web services or other types of environments, IoT Edge, etc.
Model Telemetry: Collect telemetry from live running models (model predictions using production input data).
Service Telemetry: Collect telemetry from live running services (model input data from web services)
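A hedged sketch tying a few of these concepts together with the Azure ML SDK (v1); the experiment name and the logged metric are made up for illustration.

```python
# Workspace -> Experiment -> Run, with a metric logged against the run.
from azureml.core import Workspace, Experiment

ws = Workspace.from_config()                     # the workspace
experiment = Experiment(ws, "lesson3-training")  # container that groups related runs

run = experiment.start_logging()                 # an interactive run
run.log("accuracy", 0.87)                        # metrics are stored with the run
run.complete()
```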

Training Classifiers
In a classification problem, the outputs are categorical or discrete.
There are three main types of classification problems:
- Binary Classification: the output is one of two classes (True/False, 0 or 1), e.g. anomaly detection, fraud detection
- Multi-Class Single-Label Classification: the output belongs to exactly one of three or more classes, e.g. recognition of handwritten digits
- Multi-Class Multi-Label Classification: there are multiple classes and the output can belong to more than one of them, e.g. text tagging
Examples of Classification Algorithms:
- Logistic Regression
- Support Vector Machine (SVM)
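A minimal scikit-learn sketch of these two classifiers on a synthetic binary classification problem (the data and settings are illustrative, not from the lesson):

```python
# Train a logistic regression and an SVM on synthetic binary classification data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

log_reg = LogisticRegression(max_iter=1000).fit(X, y)
svm = SVC(kernel="rbf").fit(X, y)

print(log_reg.predict(X[:5]))   # predicted class labels
print(svm.predict(X[:5]))
```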
Training Regressors
In a regression problem, the output is numerical or continuous.
There are two main types of regression problems:
- Regression to arbitrary values: prediction based on various inputs, with no boundary defined for the output
- Regression to values between 0 and 1: the output is bounded to the interval between 0 and 1 and interpreted as a probability
Examples of Regression Algorithms:
- Linear Regressor
- Decision Forest Regressor
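A minimal scikit-learn sketch of these two regressors, with RandomForestRegressor standing in for a decision forest regressor (the data is synthetic and purely illustrative):

```python
# Train a linear regressor and a forest-based regressor on synthetic data.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=5.0, random_state=0)

linear = LinearRegression().fit(X, y)
forest = RandomForestRegressor(n_estimators=100).fit(X, y)

print(linear.predict(X[:3]))    # continuous predictions
print(forest.predict(X[:3]))
```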
Evaluating Model Performance
The evaluation of a Machine Learning model is a critical step through which you calculate a set of performance metrics in order to assess its performance — such as the predictive power and accuracy of the model.
We need to split off a portion of our labeled data and reserve it for evaluating our model's final performance. We refer to this as the test dataset: a portion of labeled data that is split off and reserved for model evaluation.
When splitting the available data, it is important to preserve its statistical properties: the data in the training, validation, and test datasets need to have statistical properties similar to the original data to prevent bias in the trained model. Splitting the data up randomly helps ensure that the resulting datasets are statistically similar.
Confusion Matrices
A confusion matrix gets its name from the fact that it is easy to see whether the model is getting confused and misclassifying the data.

- True positives are the positive cases that are correctly predicted as positive by the model
- False positives are the negative cases that are incorrectly predicted as positive by the model
- True negatives are the negative cases that are correctly predicted as negative by the model
- False negatives are the positive cases that are incorrectly predicted as negative by the model
Evaluation Metrics for Classification
Accuracy: measures the goodness of a classification model as the proportion of true results to total cases
(TP+TN) / (TP+FP+FN+TN)
Precision: the proportion of true positive results over all positive predictions
TP / (TP+FP)
Recall: the fraction of all positive cases that are correctly identified by the model
TP / (TP+FN)
F1-Score: the harmonic mean of precision and recall, ranging between 0 and 1, where the ideal F1-Score value is 1
2 * (Precision * Recall) / (Precision + Recall)
Note: none of these metrics alone is enough to accurately assess a classifier; they are used together to get a complete picture of a classification algorithm.
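A small scikit-learn sketch computing the confusion matrix and the four metrics above from a set of true labels and predictions (the labels are made up for illustration):

```python
# Confusion matrix and the standard classification metrics for a binary problem.
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)

print(accuracy_score(y_true, y_pred))    # (TP+TN) / (TP+FP+FN+TN)
print(precision_score(y_true, y_pred))   # TP / (TP+FP)
print(recall_score(y_true, y_pred))      # TP / (TP+FN)
print(f1_score(y_true, y_pred))          # 2 * (precision*recall) / (precision+recall)
```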
Model Evaluation Charts
When evaluating models, a level of understanding can be quickly gained using charts:
- Receiver Operating Characteristics (ROC) Chart: a graph of the true positive rate against the false positive rate. A metric derived from this chart is the Area Under the Curve (AUC).
- AUC measures the area under the ROC curve, plotted with true positives on the y-axis and false positives on the x-axis. This metric is useful because it provides a single number that lets you compare models of different types. An AUC of 0.5 indicates random guessing, while an AUC of 1.0 indicates perfect classification; hence, a useful classifier's AUC should fall between 0.5 and 1.0.

- Gain and Lift Charts: rank-order the prediction probabilities and measure how much better the classifier performs than random selection. The diagonal line corresponds to random guessing; the other line corresponds to performance when a trained classifier is used. Ideally, the latter line should be as far away from the former as possible.
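A minimal scikit-learn sketch of computing the points of the ROC curve and the AUC, using the predicted probabilities of the positive class (synthetic data for illustration):

```python
# Compute the ROC curve points and the AUC for a trained classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]     # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, probs)   # points on the ROC curve
print(roc_auc_score(y_test, probs))               # area under that curve
```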

Evaluation Metrics for Regression
- Root Mean Squared Error (RMSE): the square root of the mean of the squared differences between the predicted and actual values. It produces a single value that summarizes the error in the model. By squaring the differences, the metric disregards the distinction between over-prediction and under-prediction.
- Mean Absolute Error (MAE): Average of the absolute difference between each prediction and the true value. It measures how close the predictions are to the actual outcomes; thus, a lower score is better.
- R-Squared: also known as the coefficient of determination; it measures how close the regression predictions are to the true values, with 1 indicating a perfect fit.
- Spearman Correlation: Strength and direction of the relationship between predicted and actual values.
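A small sketch computing these four regression metrics with scikit-learn and SciPy (the values are made up for illustration):

```python
# RMSE, MAE, R-squared, and Spearman correlation for a set of predictions.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0, 4.5])
y_pred = np.array([2.8, 5.3, 2.0, 7.4, 4.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
rho, _ = spearmanr(y_true, y_pred)
print(rmse, mae, r2, rho)
```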
Model Evaluation Charts
- Predicted vs. True Chart: displays the relationship between the predicted and the true values. The perfect regressor is represented by the diagonal line, whereas the predicted line sits below it, indicating errors. At the bottom of the chart, a histogram shows how the true values are distributed across your prediction results.

- Histogram of Residuals: shows the distribution of the residuals, i.e. the true value minus the predicted value. When the model has a fairly low bias, the histogram approaches a more or less normal (bell-shaped) distribution, which is what one should try to achieve.

Strength in Numbers
No matter how well-trained an individual model is, there is still a significant chance that it could perform poorly or produce incorrect results. The idea is to train multiple machine learning models and capture their collective wisdom in a way that alleviates the potential issues produced by single models. There are two main approaches along these lines. Both use the principle of Strength in Numbers to reduce the potential error and bias of individual machine learning models: we use the strength of a large number of trained models to improve the accuracy of predictions.
Ensemble Learning
It combines multiple machine learning models to produce one predictive model. There are three main types of ensemble algorithms:
1. Bagging or Bootstrap Aggregation
- Helps reduce overfitting for models that tend to have high variance (such as decision trees)
- Uses random subsampling of the training data to produce a bag of trained models.
- The resulting trained models are homogeneous
- The final prediction is an average prediction from individual models
2. Boosting
- Helps reduce bias for models.
- In contrast to bagging, boosting uses the same input data to train multiple models using different hyperparameters.
- Boosting trains models in sequence, training weak learners one by one, with each new learner correcting the errors of the previous ones and thus constantly improving the ensemble.
- The final predictions are a weighted average from the individual models.
3. Stacking
- Trains a large number of completely different (heterogeneous) models
- Combines the outputs of the individual models into a meta-model that yields more accurate predictions
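A minimal scikit-learn sketch of the three ensemble approaches, with a logistic regression used as the stacking meta-model (synthetic data, illustrative hyperparameters):

```python
# Bagging, boosting, and stacking with scikit-learn on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Bagging: homogeneous trees trained on random subsamples of the data
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50).fit(X, y)

# Boosting: weak learners trained in sequence, each correcting the previous ones
boosting = GradientBoostingClassifier(n_estimators=100).fit(X, y)

# Stacking: heterogeneous models combined by a meta-model
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("svm", SVC())],
    final_estimator=LogisticRegression(),   # meta-model combining the outputs
).fit(X, y)
```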
Strength in Variety: Automated ML
Automated Machine Learning plays on the principle of Strength in Variety and, as the name suggests, automates many of the iterative, time-consuming tasks involved in model development (such as selecting the best features, scaling features optimally, choosing the best algorithms, and tuning hyperparameters). Automated ML allows data scientists, analysts, and developers to build models with greater scale, efficiency, and productivity, all while sustaining model quality. Automated ML explores many combinations of these choices to find the best-performing one and improve the accuracy of predictions. Automated ML is not a replacement for the data scientist, but a way to get a baseline model quickly and then try other approaches to achieve superior performance.
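A hedged sketch of submitting an automated ML run with the Azure ML SDK (v1); the dataset name, label column, compute target, and experiment name are assumptions, not values from the lesson.

```python
# Configure and submit an automated ML experiment with the Azure ML SDK (v1).
from azureml.core import Workspace, Experiment, Dataset
from azureml.train.automl import AutoMLConfig

ws = Workspace.from_config()
train_data = Dataset.get_by_name(ws, "training-data")   # assumed registered dataset

automl_config = AutoMLConfig(
    task="classification",
    training_data=train_data,
    label_column_name="label",          # assumed label column
    primary_metric="AUC_weighted",      # metric used to rank candidate models
    compute_target="cpu-cluster",       # assumed compute cluster name
    experiment_timeout_minutes=30,
)

run = Experiment(ws, "automl-baseline").submit(automl_config)
best_run, fitted_model = run.get_output()   # best pipeline found by AutoML
```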
Summary
In this lesson, you’ve learned to perform the essential data preparation and management tasks involved in machine learning:
- Data importing and transformation
- The use of datastores and datasets
- Versioning
- Feature engineering
- Monitoring for data drift
The second major area we covered in this lesson was model training, including:
- The core model training process
- Two of the fundamental types of machine learning models: classifiers and regressors
- The model evaluation process and relevant metrics
The final part of the lesson focused on how to get better results by using multiple trained models instead of a single one. In this context, you learned about ensemble learning and automated machine learning. You’ve learned how the two differ, yet apply the same general principle of “strength in numbers”. In the process, you trained an ensemble model (a decision forest) and a straightforward classifier using automated Machine Learning.