Working with Spark Python or SQL on Azure Databricks

Here we look at some ways to work interchangeably with Python, PySpark and SQL on Azure Databricks, an Apache Spark-based big data analytics service for data science and data engineering offered by Microsoft. A minimal sketch of this interchange is shown below.
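As a quick illustration of that interchange (a minimal sketch of our own, not code from the article; the file path and table name are made up), a DataFrame created with PySpark can be registered as a temporary view and then queried with SQL in the same notebook:

# PySpark side: read data and register it as a temporary view
df = spark.read.csv("/mnt/demo/sales.csv", header=True, inferSchema=True)  # hypothetical path
df.createOrReplaceTempView("sales")

# SQL side: the same data is now queryable via spark.sql (or a %sql cell in Databricks)
top_products = spark.sql("""
    SELECT product, SUM(amount) AS total
    FROM sales
    GROUP BY product
    ORDER BY total DESC
    LIMIT 10
""")
top_products.show()

Here spark is the SparkSession that Databricks provides automatically in every notebook.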

Originally from KDnuggets https://ift.tt/3gzg07s

source https://365datascience.weebly.com/the-best-data-science-blog-2020/working-with-spark-python-or-sql-on-azure-databricks

Top KDnuggets tweets Aug 19-25: #MachineLearning-Handling Missing Data

Machine Learning - Handling Missing Data; The Last SQL Guide for Data Analysis You’ll Ever Need; How (not) to use #MachineLearning for time series forecasting: The sequel

Originally from KDnuggets https://ift.tt/32xlDhD

source https://365datascience.weebly.com/the-best-data-science-blog-2020/top-kdnuggets-tweets-aug-19-25-machinelearning-handling-missing-data

Data Versioning: Does it mean what you think it means?

Does data versioning mean what you think it means? Read this overview with use cases to see what data versioning really is, and the tools that can help you manage it.

Originally from KDnuggets https://ift.tt/3gsyQx7

source https://365datascience.weebly.com/the-best-data-science-blog-2020/data-versioning-does-it-mean-what-you-think-it-means

From local Jupyter Notebooks to AWS Sagemaker.

I will be covering the basics and a general overview of the services you'd need to know for the certification. This guide will not cover deployment in detail, nor will it be a tutorial on how to use these services.

Life cycle of a Machine learning model

Before you think about the Machine Learning Specialty certification from AWS, if you haven't done any AWS certification before, I suggest you complete AWS Cloud Practitioner first.

Getting through the Cloud Practitioner is relatively easy and you will get perks: a free practice test for the next certification of your choice and a 50 percent discount on your next certification exam.


There are certain key points to keep in mind before you embark on your journey toward the Machine Learning Specialty certification:

  • It is recommended that you have 1 to 2 years of experience using AWS for ML projects and pipelines.
  • It is recommended for people who have reasonable expertise in manipulating datasets, doing EDA, feature extraction, tuning, etc.
  • This exam is specifically built to weed out people who don't have an analytics background or an in-depth understanding of how machine learning pipelines work.
  • In my personal opinion, you should at least understand shell commands, Docker containers and model deployment to fully grasp the SageMaker services and pipelines.

I will be dividing the modules into a few parts, and my key focus will be on the SageMaker part of the certification, because that alone could get you through the examination if you know it very well.

Understanding AWS storage

For our certification we will be sticking to S3, but it's recommended to have at least a basic idea of the other storage services.

Amazon Simple Storage Service (S3) stores data as objects within buckets.

  • You can set individual permissions (create, delete, view the list of objects) for every bucket within S3.
  • S3 has three main storage classes: S3 Standard — general-purpose storage for any type of data, typically used for frequently accessed data; S3 Intelligent-Tiering — automatic cost savings for data with unknown or changing access patterns; S3 Glacier — for long-term backups and archives, with retrieval options from 1 minute to 12 hours. S3 Standard is the most expensive.
  • For our training purposes we can provide the training and validation data as separate channels using S3 buckets; we will get into more detail later when we go through the built-in algorithms.
  • For writing and reading data in S3 you use the boto3 library, which is preinstalled on SageMaker notebook instances (a minimal sketch follows this list).
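A minimal sketch of reading and writing S3 objects with boto3 (the bucket and key names below are placeholders, not from the original post):

import boto3

s3 = boto3.client("s3")
bucket = "my-example-bucket"   # placeholder: replace with your bucket name

# Upload a local file to S3
s3.upload_file("train.csv", bucket, "data/train.csv")

# Download it back
s3.download_file(bucket, "data/train.csv", "train_copy.csv")

# Or read the object body directly into memory
body = s3.get_object(Bucket=bucket, Key="data/train.csv")["Body"].read()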

For a better understanding of the S3 storage and availability, you can use the following link: https://aws.amazon.com/s3/storage-classes/

Jupyter notebooks: You could launch a Jupyter notebook directly from an EC2 instance, but then you're responsible for the following things:

  • Creating the AMI (Amazon Machine Image; in short, the OS).
  • Launching those instances with this AMI.
  • Configuring the autoscaling options depending on the task.

However, it's very straightforward: you just need the SSH key pair to work, and you need to add the IP of the device you are connecting from to the security group of the EC2 instance you are trying to reach (a sketch of that security-group step is shown below). If you use this approach you will also have to take care of the container registry, the endpoints, the distribution of training jobs and the tuning. The major advantage of using SageMaker is that it manages all of these things for you.
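For example, a minimal sketch of that security-group step with boto3 (the security-group ID and IP address below are placeholders, not values from the original post):

import boto3

ec2 = boto3.client("ec2")

# Placeholder values: use your own security group ID and your device's public IP
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": "203.0.113.25/32", "Description": "my laptop"}],
    }],
)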

Let's dive straight into AWS SageMaker; we will cover some key concepts in depth as we try to understand the various components.

SageMaker is a fully managed AWS service to build, train and deploy machine learning models at scale.

Simple Machine Learning pipeline on AWS Sagemaker

Building pipelines in Sagemaker:

You can read data from S3 in the following ways:

  • Directly connect to S3
  • Using AWS Glue to move data from Amazon RDS, Amazon DynamoDB, and Amazon Redshift into S3.

Training on AWS Sagemaker:

Flowchart for Training and deploying model using Sagemaker

We will be covering the inbuilt algorithms in this part.


Just as you need ingredients to cook a dish, a SageMaker training job needs these key components:

  • The S3 bucket URL of the training data (remember, bucket names must be globally unique!)
  • The type of ML instance you need for the job:

ml.t2.medium — "ml" stands for machine learning; the instance family and size that follow (here t2.medium) can be picked from AWS's instance-type table.

Apart from these instances you also have g4dn instances that can be used for training.

eia (Elastic Inference accelerator) and inf1 (Inferentia) instances can be used only for inference.
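To see where the instance type is actually used, here is a minimal sketch assuming the v1 SageMaker Python SDK (the same generation as the get_image_uri call used later in this guide); the bucket and output path are placeholders:

import boto3
import sagemaker
from sagemaker.amazon.amazon_estimator import get_image_uri

session = sagemaker.Session()
role = sagemaker.get_execution_role()
image = get_image_uri(boto3.Session().region_name, "xgboost")

estimator = sagemaker.estimator.Estimator(
    image,                               # built-in algorithm container
    role,
    train_instance_count=1,
    train_instance_type="ml.m4.xlarge",  # the ML instance type discussed above
    output_path="s3://my-example-bucket/output",  # placeholder bucket
    sagemaker_session=session,
)
# estimator.fit({"train": "s3://my-example-bucket/train"}) would then launch the training job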

Choosing the right Training Algorithm

Once you have decided on the instance types to use for the notebook, you can choose from the following algorithms already available from AWS, or use your own algorithm (we will cover that later):

Refer to these tree diagrams for easy recall.

Supervised Learning AWS algorithm tree
Unsupervised Learning AWS algorithm tree
Data based AWS algorithm tree
Text based AWS algorithm tree

These also include the parameters that are accepted by most of these algorithms (tip: you might want to remember them).

Transforming the Training Data

After you have launched a notebook, you need to import the following libraries; we're taking XGBoost as the example here:

import sagemaker
import boto3
from sagemaker.predictor import csv_serializer # Converts strings for HTTP POST requests on inference

import numpy as np # For performing matrix operations and numerical processing
import pandas as pd # For manipulating tabular data
from time import gmtime, strftime
import os

region = boto3.Session().region_name
smclient = boto3.Session().client('sagemaker')

from sagemaker import get_execution_role
# get_execution_role() returns the IAM role that you created when you created
# your notebook instance. You pass this role to the tuning job.

role = get_execution_role()
print(role)

bucket = 'sagemaker-MyBucket'   # replace with the name of your S3 bucket
prefix = 'sagemaker/DEMO-automatic-model-tuning-xgboost-dm'

Next, download the data and do some exploratory data analysis (EDA).

Hyperparameter Tuning

Hyperparameter tuning pipeline

Hyperparameter tuning job specifications can be found here

from sagemaker.amazon.amazon_estimator import get_image_uri
training_image = get_image_uri(boto3.Session().region_name, 'xgboost')

s3_input_train = 's3://{}/{}/train'.format(bucket, prefix)
s3_input_validation ='s3://{}/{}/validation/'.format(bucket, prefix)
tuning_job_config = {
    "ParameterRanges": {
        "CategoricalParameterRanges": [],
        "ContinuousParameterRanges": [
            {
                "MaxValue": "1",
                "MinValue": "0",
                "Name": "eta"
            },
            {
                "MaxValue": "2",
                "MinValue": "0",
                "Name": "alpha"
            },
            {
                "MaxValue": "10",
                "MinValue": "1",
                "Name": "min_child_weight"
            }
        ],
        "IntegerParameterRanges": [
            {
                "MaxValue": "10",
                "MinValue": "1",
                "Name": "max_depth"
            }
        ]
    },
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 20,
        "MaxParallelTrainingJobs": 3
    },
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
        "MetricName": "validation:auc",
        "Type": "Maximize"
    }
}

training_job_definition = {
    "AlgorithmSpecification": {
        "TrainingImage": training_image,
        "TrainingInputMode": "File"
    },
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "CompressionType": "None",
            "ContentType": "csv",
            "DataSource": {
                "S3DataSource": {
                    "S3DataDistributionType": "FullyReplicated",
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_input_train
                }
            }
        },
        {
            "ChannelName": "validation",
            "CompressionType": "None",
            "ContentType": "csv",
            "DataSource": {
                "S3DataSource": {
                    "S3DataDistributionType": "FullyReplicated",
                    "S3DataType": "S3Prefix",
                    "S3Uri": s3_input_validation
                }
            }
        }
    ],
    "OutputDataConfig": {
        "S3OutputPath": "s3://{}/{}/output".format(bucket, prefix)
    },
    "ResourceConfig": {
        "InstanceCount": 2,
        "InstanceType": "ml.c4.2xlarge",
        "VolumeSizeInGB": 10
    },
    "RoleArn": role,
    "StaticHyperParameters": {
        "eval_metric": "auc",
        "num_round": "100",
        "objective": "binary:logistic",
        "rate_drop": "0.3",
        "tweedie_variance_power": "1.4"
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 43200
    }
}

tuning_job_name = "MyTuningJob"
smclient.create_hyper_parameter_tuning_job(
    HyperParameterTuningJobName=tuning_job_name,
    HyperParameterTuningJobConfig=tuning_job_config,
    TrainingJobDefinition=training_job_definition
)

Monitoring can be done directly in the AWS console itself.

Evaluating is very straightforward: you use a Jupyter notebook in your Amazon SageMaker notebook instance to train and evaluate your model.

  • You use either the AWS SDK for Python (Boto) or the high-level Python library that Amazon SageMaker provides to send requests to the model for inference (a minimal sketch follows).
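A minimal sketch with the low-level Boto client (the endpoint name and the CSV payload are placeholders):

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="my-xgboost-endpoint",   # placeholder endpoint name
    ContentType="text/csv",
    Body="0.5,1.2,3.4,0.0",               # one CSV row of features
)
prediction = response["Body"].read().decode("utf-8")
print(prediction)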

How to Debug?

Say hello to the Amazon SageMaker Debugger!

It provides full visibility into model training by monitoring, recording, analyzing, and visualizing the tensors produced during the training process. Using the Amazon SageMaker Debugger Python SDK, we can interact with objects that help us debug the jobs. If you are more interested in the API, you can check it out here.

You can check the list of rules here.
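As a minimal sketch of that interaction, assuming the open-source smdebug library that underlies the Debugger SDK (the S3 path and tensor name are placeholders, and the API may differ slightly between versions):

from smdebug.trials import create_trial

# Point a trial at the S3 prefix where the training job wrote its debug tensors
trial = create_trial("s3://my-example-bucket/debug-output")   # placeholder path

print(trial.tensor_names())       # which tensors were recorded
t = trial.tensor("train-rmse")    # placeholder tensor name
for step in t.steps():
    print(step, t.value(step))    # inspect how the value evolved over training steps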

Deploying the model

  • First, create a model using the CreateModel API: specify the S3 path where the model artifacts are stored and the Docker registry path for the image that contains the inference code.
  • Create an HTTPS endpoint configuration, i.e. configure the endpoint to elastically scale the deployed ML compute instances for each production variant; for further details, check the CreateEndpointConfig API.
  • Next, launch it using the CreateEndpoint API (a sketch of these three calls follows below).
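A minimal sketch of those three calls with the Boto client (role and training_image are reused from the earlier training snippet; the model, config and endpoint names plus the artifact path are placeholders):

import boto3

smclient = boto3.client("sagemaker")

smclient.create_model(
    ModelName="my-xgboost-model",            # placeholder
    PrimaryContainer={
        "Image": training_image,             # Docker registry path of the inference image
        "ModelDataUrl": "s3://my-example-bucket/output/model.tar.gz",  # placeholder artifact path
    },
    ExecutionRoleArn=role,
)

smclient.create_endpoint_config(
    EndpointConfigName="my-xgboost-config",
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-xgboost-model",
        "InstanceType": "ml.m4.xlarge",
        "InitialInstanceCount": 1,
    }],
)

smclient.create_endpoint(
    EndpointName="my-xgboost-endpoint",
    EndpointConfigName="my-xgboost-config",
)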

I will discuss more about the details of deployment in the next part.



From local Jupyter Notebooks to AWS Sagemaker. was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.

Via https://becominghuman.ai/from-local-jupyter-notebooks-to-aws-sagemaker-b4a792f5d270?source=rss—-5e5bef33608a—4

source https://365datascience.weebly.com/the-best-data-science-blog-2020/from-local-jupyter-notebooks-to-aws-sagemaker

NLP Tutorial for Machine Learning

Natural Language Processing (NLP) consists of developing applications and services capable of understanding human languages. Some…

Via https://becominghuman.ai/nlp-tutorial-for-machine-learning-7fdac1f815b5?source=rss—-5e5bef33608a—4

source https://365datascience.weebly.com/the-best-data-science-blog-2020/nlp-tutorial-for-machine-learning

ResNet: Convolution Neural Network


ResNet, also known as the residual neural network, refers to the idea of adding residual learning to the traditional convolutional neural network. It addresses the problems of gradient vanishing and accuracy degradation (on the training set) in deep networks, so that networks can be made deeper and deeper while keeping both accuracy and speed under control.

The problems caused by increasing depth:

  • The first problem brought by increasing depth is gradient explosion / vanishing. As the number of layers increases, the gradients in backpropagation become unstable under repeated multiplication and end up particularly large or particularly small. Of these, gradient vanishing is the more common problem, i.e. the effect of the weight updates diminishes.
  • Another problem of increasing depth is network degradation: as the depth increases, the performance of the network becomes worse and worse, which is directly reflected in decreasing accuracy on the training set. The residual network paper solves this problem, and once it was solved, the depth of networks increased by orders of magnitude.


Fig. 1: Effect of network depth

From the above figure we can conclude that up to about 20 layers everything is fine, but if we keep increasing the number of layers, accuracy starts decreasing instead of increasing. ResNet was introduced to address this problem.


ResNet:

Fig. 2: ResNet skip connection diagram

The block contains two branches: (i) the identity branch, which passes the input x through unchanged, and (ii) F(x), the network part, called the residual mapping.

Assume x is the input. If the weighted outputs we are training turn out to be negative, the block can simply skip them: those values are passed into the ReLU activation function, which does not let them through for further calculation, while the identity branch still carries x forward.
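To make the two branches concrete, here is a minimal sketch of a residual block, written in PyTorch purely for illustration (the framework choice and layer sizes are ours, not the article's):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # F(x): the residual mapping (two 3x3 convolutions with batch norm)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                  # the identity branch carries x unchanged
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity          # add F(x) and x before the final ReLU
        return self.relu(out)

# Example: a batch of 4 feature maps with 64 channels keeps its shape
block = ResidualBlock(64)
y = block(torch.randn(4, 64, 32, 32))
print(y.shape)   # torch.Size([4, 64, 32, 32])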

Why do we use the identity branch if the ReLU already chops off all negative values?

The main architecture is image → convolution → ReLU. If, for negative values, I can stop them from passing through the convolution layer (and performing unnecessary calculations) before they are sent to the ReLU, then it can be said that I am able to reduce the parameters as well as the calculations.

This is equivalent to reducing the number of parameters for the same number of layers, so the idea can be extended to deeper models. The authors therefore proposed ResNets with 50, 101 and 152 layers, which not only avoided the degradation problem but also greatly reduced the error rate while keeping computational complexity at a very low level.



ResNet : Convolution Neural Network was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.

Via https://becominghuman.ai/resnet-convolution-neural-network-e10921245d3d?source=rss—-5e5bef33608a—4

source https://365datascience.weebly.com/the-best-data-science-blog-2020/resnet-convolution-neural-network

How to Optimize Your CV for a Data Scientist Career

As the number of data science positions continues to grow dramatically, so does the number of data scientists in the marketplace. Follow these expert tips and examples to help make your resume and job applications stand out in an increasingly competitive field.

Originally from KDnuggets https://ift.tt/3lkPt1j

source https://365datascience.weebly.com/the-best-data-science-blog-2020/how-to-optimize-your-cv-for-a-data-scientist-career

Breaking Privacy in Federated Learning

Despite the benefits of federated learning, there are still ways of breaching a user’s privacy, even without sharing private data. In this article, we’ll review some research papers that discuss how federated learning includes this vulnerability.

Originally from KDnuggets https://ift.tt/2CZtb3M

source https://365datascience.weebly.com/the-best-data-science-blog-2020/breaking-privacy-in-federated-learning

KDnuggets News 20:n33 Aug 26: If I had to start learning Data Science again how would I do it? Must-read NLP and Deep Learning articles for Data Scientists

If I had to start learning Data Science again, how would I do it? Must-read NLP and Deep Learning articles for Data Scientists; These Data Science Skills will be your Superpower; Accelerated Natural Language Processing: A Free Amazon Machine Learning University Course.

Originally from KDnuggets https://ift.tt/3aZ4Zva

source https://365datascience.weebly.com/the-best-data-science-blog-2020/kdnuggets-news-20n33-aug-26-if-i-had-to-start-learning-data-science-again-how-would-i-do-it-must-read-nlp-and-deep-learning-articles-for-data-scientists

Unifying Data Pipelines and Machine Learning with Apache Spark and Amazon SageMaker

Roll up your sleeves and charge up because you’re invited to an interactive, virtual Machine Learning workshop run by Amazon Web Services, Databricks, and Immuta on September 10.

Originally from KDnuggets https://ift.tt/32qfm73

source https://365datascience.weebly.com/the-best-data-science-blog-2020/unifying-data-pipelines-and-machine-learning-with-apache-spark-and-amazon-sagemaker
