Applied Natural Language Processing (NLP) in Python | Exploring NLP Libraries

Natural Language Processing (NLP) is one of the oldest branches of artificial intelligence, with work dating back to the 1950s. It is still under continuous development and commands a great deal of importance in the field of data science.

So if you want to add a new feather to your cap by learning applied NLP, you’ve come to the right spot. Whether you want to become a data scientist or simply pick up a new skill, this tutorial will get you hands-on with NLP and show you practical techniques for dealing with raw text data, without overwhelming you with a barrage of information.

Let us first look at how NLP came into use.

Text Mining

With the social media boom, companies have access to massive behavioral data of their customers, enabling them to use that data to fuel business processes and make informed decisions. But there’s a teeny, tiny problem.

The data is unstructured.

Raw and unstructured data is not of much use on its own, as it doesn’t give any valuable insights. Just like an uncut diamond needs to undergo polishing to reveal the flawless gem underneath, raw data needs to be mined and analyzed to be of any practical use.

This is where text mining comes in. It is the process of extracting and deriving useful information, patterns, and insights from a large collection of unstructured, textual data.

This field can be divided into four practice areas:

  1. Information Extraction
  • Deals with the identification and extraction of relevant facts and relationships from unstructured data.

  2. Document Classification and Clustering
  • Aims at grouping and categorizing terms, paragraphs, and documents using classification and clustering methods.

  3. Information Retrieval
  • Deals with the storage and retrieval of text documents.

  4. Natural Language Processing (NLP)
  • Uses different computational techniques to analyze and understand the underlying structure of text data.

As you can see, NLP is an area of text mining or text analysis where the final goal is to make computers understand the unstructured text and retrieve meaningful pieces of information from it.


Natural Language Toolkit (NLTK)

NLTK is a powerful Python package that contains several algorithms to help computers pre-process, analyze, and understand natural language and written text.

Common NLTK Algorithms:

  • Tokenization
  • Part-of-speech Tagging
  • Named-entity Recognition
  • Sentiment Analysis

Now, let’s download and install NLTK via terminal (Command prompt in Windows).

Note: The instructions given below are based on the assumption that you have Python and Jupyter Notebook installed. So, if you haven’t installed these, please pause here and install them before moving ahead.

  1. Go to the Scripts folder and copy the path.

  2. Open the Windows command prompt and navigate to the Scripts folder.

  3. Enter pip3 install nltk to install NLTK.

After you have successfully installed NLTK on your machine, we will explore the different modules of this package.

NLTK Corpora

A corpus (plural: corpora) is a large body of written or spoken text used for linguistic analysis and the development of NLP tools.

To download the datasets, open a Jupyter notebook and run the following code:

import nltk
nltk.download()

A GUI will pop up, where you can click on the ‘Download’ button to download all the data packages.

Since this consists of a large number of datasets, it might take some time to complete, so make sure you have a fast internet connection before downloading them.

After you’re done, come back to the Jupyter notebook, where we will be exploring the different NLP tools.

Tokenization

Tokenization is the process of breaking a string or text paragraph into smaller chunks, or tokens, such as words, phrases, keywords, and symbols. Some useful tokenization methods:

  • sent_tokenize
  • word_tokenize
  • RegexpTokenizer
  • BlanklineTokenizer

sent_tokenize

This function extracts all sentences present in a text document.

First, we import sent_tokenize and then pass it a paragraph as an argument, which outputs a list that contains all the sentences as individual elements.
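The original notebook’s code isn’t reproduced here, so here is a minimal sketch with a made-up sample paragraph:

from nltk.tokenize import sent_tokenize

text = "NLP is a fascinating field. It powers chatbots and search engines. Let's explore it."

sentences = sent_tokenize(text)
print(sentences)
# ['NLP is a fascinating field.', 'It powers chatbots and search engines.', "Let's explore it."]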

You can also make use of one of the datasets of the NLTK corpora instead of using a sample string.

word_tokenize

This function extracts individual words from a text document.
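Again, a minimal sketch with a made-up sentence:

from nltk.tokenize import word_tokenize

text = "Tokenization breaks text into words, punctuation and symbols."
print(word_tokenize(text))
# ['Tokenization', 'breaks', 'text', 'into', 'words', ',', 'punctuation', 'and', 'symbols', '.']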

RegexpTokenizer

This is used to match patterns in a text document and extract only the tokens that match a given regular expression (regex).

Here, we will use RegexpTokenizer to match with a regex which will give us a list of all the numbers present in the Bible.
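The original code isn’t shown here; the sketch below assumes the King James Bible text (bible-kjv.txt) that ships with the NLTK Gutenberg corpus downloaded earlier:

from nltk.corpus import gutenberg
from nltk.tokenize import RegexpTokenizer

bible = gutenberg.raw('bible-kjv.txt')   # raw text of the King James Bible
tokenizer = RegexpTokenizer(r'\d+')      # regex that matches one or more digits
numbers = tokenizer.tokenize(bible)
print(numbers[:10])                      # the first ten numbers found in the text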

BlanklineTokenizer

This tokenizer splits a text into chunks wherever it encounters one or more blank lines, so blocks of text separated by blank lines become separate tokens.


In the example below, RegexpTokenizer and BlanklineTokenizer are applied to the same sample text, which gives a clear idea of how the two differ.
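A sketch of what that comparison could look like (the sample text is made up):

from nltk.tokenize import RegexpTokenizer, BlanklineTokenizer

sample = """Chapter 1 has 3 short verses.

Chapter 2 adds 5 more.

The end."""

print(RegexpTokenizer(r'\d+').tokenize(sample))
# ['1', '3', '2', '5']  -> only the numbers
print(BlanklineTokenizer().tokenize(sample))
# ['Chapter 1 has 3 short verses.', 'Chapter 2 adds 5 more.', 'The end.']  -> split on blank lines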

Frequency Distribution

This is used to get word frequencies in a text. It works like a dictionary, where the keys are the words and the values are the number of times each word occurs.

After tokenizing the text into words, we pass the tokens to the FreqDist class to create a frequency distribution object.

The most_common method of the FreqDist object shows the most frequently occurring words, and the plot method graphs the distribution.

You can also iterate through the frequency distribution and find the number of occurrences of any particular word.
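A minimal sketch of these steps, using a made-up sentence:

from nltk import FreqDist
from nltk.tokenize import word_tokenize

text = "The quick brown fox jumps over the lazy dog, and the dog barks."
tokens = word_tokenize(text)

freq = FreqDist(tokens)
print(freq.most_common(5))   # e.g. [('the', 2), ('dog', 2), ('The', 1), ...]
freq.plot(10)                # line plot of the 10 most common tokens (needs matplotlib)
print(freq['dog'])           # occurrences of one particular word -> 2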

As you can see above, a lot of useless ‘words’ have been tokenized, from punctuation marks like commas to common words like ‘the’ and ‘a’.

Stopwords

Fortunately, NLTK ships with lists of such useless words, called stopwords, for 16 different languages. Using these, we can filter and clean the data.

We will try finding out the frequency distribution of the words but this time we will be using stopwords to get better results.

In the list comprehension below, we loop through the word tokens and keep only those words that are not present in the stopwords list and whose length is greater than 3.

If you have trouble understanding list comprehension, here is the same code, written in a more familiar way.
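Here is a sketch of both versions; it assumes a list of word tokens named tokens from the previous step:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# List-comprehension version
clean_tokens = [w for w in tokens if w.lower() not in stop_words and len(w) > 3]

# The same logic as an explicit loop
clean_tokens = []
for w in tokens:
    # compare in lowercase because the NLTK stopword list is all lowercase
    if w.lower() not in stop_words and len(w) > 3:
        clean_tokens.append(w)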

If we now plot the graph and look at the most common occurrences of words, we will get a much more sensible result.

But there are many cases where two or more consecutive words give us more information than individual words do. Fortunately, NLTK provides a way to handle that.

Bigrams, Trigrams, and Ngrams

Two consecutive words that occur in a text document are called bigrams.

We first break the text into word tokens and then pass those tokens to the bigrams module imported from nltk to convert them into bigrams. We then pass the result to FreqDist and use the most_common method to look at the top 5 most common bigrams.

Similarly, three consecutive words that appear in a sentence are called trigrams.

We can also look for more consecutive words by using the ngrams module.
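A combined sketch for bigrams, trigrams, and n-grams, again on a made-up snippet of text:

from nltk import bigrams, trigrams, ngrams, FreqDist
from nltk.tokenize import word_tokenize

tokens = word_tokenize("to be or not to be that is the question")

bigram_freq = FreqDist(bigrams(tokens))
print(bigram_freq.most_common(5))       # ('to', 'be') occurs twice

trigram_freq = FreqDist(trigrams(tokens))
print(trigram_freq.most_common(5))

fourgram_freq = FreqDist(ngrams(tokens, 4))   # any n via ngrams(tokens, n)
print(fourgram_freq.most_common(5))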

Stemming

Stemming is a text normalization process that reduces derived words to their root or base form. NLTK provides the PorterStemmer class to perform stemming on word tokens.
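A minimal sketch with a few illustrative words:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["studies", "studying", "flying", "beautiful"]
print([stemmer.stem(w) for w in words])
# ['studi', 'studi', 'fli', 'beauti']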

As you can see, stemming doesn’t really help much here, because it truncates words so aggressively that the results often have no morphological sense anymore. That is why we have another process called lemmatization.

Lemmatization

Lemmatization is a process that reduces words to their root or base form using vocabulary and morphological analysis. Unlike stemming, it takes the context of each word into account, so the result is always a valid word.

After creating an instance of the WordNetLemmatizer class, we call the method lemmatize to lemmatize the words. Words are reduced to their original form while still retaining their meaning.
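A minimal sketch:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))          # 'study'
print(lemmatizer.lemmatize("corpora"))          # 'corpus'
print(lemmatizer.lemmatize("better", pos="a"))  # 'good' (pos="a" marks it as an adjective)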

Parts-of-Speech Tagging

Here, the task is to label or tag each word in a sentence with its grammatical category, such as noun, pronoun, adjective, and many more. Some of the tag abbreviations are listed below, followed by an example:

CC — Coordinating Conjunction

JJ — Adjective

IN — Preposition/Subordinating Conjunction

JJR — Adjective, comparative

JJS — Adjective, superlative

NN — Noun, singular

NNP — Proper Noun, singular

PRP — Personal Pronoun
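A minimal sketch of tagging a made-up sentence (the exact tags can vary slightly between NLTK versions):

from nltk import pos_tag, word_tokenize

tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
print(pos_tag(tokens))
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'),
#  ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]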

As you can see, the pos_tag function takes in word tokens as input and returns a list of tuples consisting of the word alongside the part of speech tag.

Named-Entity-Recognition

This is used to identify important named entities in a text such as people, places, locations, dates, organizations, etc. Here are a few types of entities along with their examples-

ORGANIZATION — Facebook, Alphabet, etc.

LOCATION — 22 West St, Mount Everest

GPE (Geo-Political Entity) — India, Ukraine, South East Asia

MONEY — 100 Million Dollars

PERSON — Obama, George W. Bush

DATE — July, 2019–05–2

TIME — three forty pm, 3:12 am

FACILITY — Washington Monument, Stonehenge

In order to use this module, we have to pass in part-of-speech tags as the argument. So we first import the necessary modules, tokenize the text into words, tag each token with its part of speech, and then pass the tagged tokens to ne_chunk, which returns the named entities.
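A minimal sketch (the sentence is made up, and the exact grouping of entities can vary between NLTK versions and data files):

from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "Mark Zuckerberg founded Facebook in California."
tree = ne_chunk(pos_tag(word_tokenize(sentence)))
print(tree)
# Entities such as PERSON, ORGANIZATION and GPE appear as labelled subtrees, e.g.
# (ORGANIZATION Facebook/NNP) and (GPE California/NNP)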

Congratulations! You have completed this tutorial and now you’re equipped to take on NLP projects on a beginner level.

But beware: this tutorial is just the tip of the iceberg. NLP is a vast field with many more methods and modules than we have gone through here, so I would encourage you to check them out on your own and build small projects to put your newly learned skills to good use.

Github Repository link:

sthitaprajna-mishra/applied_nlp_python_tutorial



Applied Natural Language Processing (NLP) in Python | Exploring NLP Libraries was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.

Via https://becominghuman.ai/applied-natural-language-processing-nlp-in-python-exploring-nlp-libraries-1d95710d5186?source=rss—-5e5bef33608a—4

source https://365datascience.weebly.com/the-best-data-science-blog-2020/applied-natural-language-processing-nlp-in-python-exploring-nlp-libraries

Gradient Descent Explained


Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient.

Picture the cost function as a 3-dimensional mountainous landscape. Our goal is to move from a point high up on the mountain, which corresponds to a high cost, down to the dark blue sea below it, which corresponds to a low cost. The direction of steepest descent from any given point is the one that decreases the cost function as quickly as possible.

Starting at the top of the mountain, we take baby steps downhill in the direction of the negative gradient. After each step, we recalculate the negative gradient and take another step in that direction. We continue this process until we reach the bottom, i.e., a local minimum.

Learning Rate

The size of each step is called the learning rate. A high learning rate makes the descent faster, but you risk overshooting the minimum. A small learning rate is more precise, but it is time-consuming. Finding the “right” learning rate is therefore very important.

Cost Function

The cost (or loss) function describes how well the model performs given the current set of parameters (weights and biases), and gradient descent is used to find the set of parameters that minimizes it. (Clare Liu)


Math

Cost Function

f(m, b) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (m x_i + b) \right)^2

Gradient Descent

f'(m, b) = \begin{bmatrix} \frac{\partial f}{\partial m} \\ \frac{\partial f}{\partial b} \end{bmatrix} = \begin{bmatrix} \frac{1}{N} \sum -2 x_i \left( y_i - (m x_i + b) \right) \\ \frac{1}{N} \sum -2 \left( y_i - (m x_i + b) \right) \end{bmatrix}

Code

def update_weights(m, b, X, Y, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        # Calculate partial derivatives
        # -2x(y - (mx + b))
        m_deriv += -2*X[i] * (Y[i] - (m*X[i] + b))

        # -2(y - (mx + b))
        b_deriv += -2*(Y[i] - (m*X[i] + b))

    # We subtract because the derivatives point in direction of steepest ascent
    m -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m, b
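A usage sketch (the toy data, learning rate, and number of iterations below are illustrative, not from the original article); the update is applied repeatedly until m and b converge:

# Fit y = 2x + 1 on a few toy points
X = [1, 2, 3, 4, 5]
Y = [3, 5, 7, 9, 11]

m, b = 0.0, 0.0
for _ in range(5000):          # number of gradient steps
    m, b = update_weights(m, b, X, Y, learning_rate=0.01)

print(m, b)                    # approaches m ≈ 2, b ≈ 1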




Gradient Descent Explained was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.

Via https://becominghuman.ai/gradient-descent-explained-1d95436896af?source=rss—-5e5bef33608a—4

source https://365datascience.weebly.com/the-best-data-science-blog-2020/gradient-descent-explained

My Week in AI: Part 6

Photo by Aron Visuals on Unsplash

Welcome to My Week in AI! Each week this blog will have the following parts:

  • What I have done this week in AI
  • An overview of an exciting and emerging piece of AI research

Progress Update

Absorbing Best Practices

This week I attended the Spark + AI Summit, hosted by Databricks. This conference offered lots of informative and useful talks, mostly on the topics of data engineering and productionizing machine learning models. I found two talks particularly enlightening: ‘Accelerating MLFlow Hyper-parameter Optimization Pipelines with RAPIDS’ by John Zedlewski from NVIDIA, and ‘Scaling up Deep Learning by Scaling Down’ by Nick Pentreath from IBM.

Training Models Rapidly

Zedlewski’s talk covered RAPIDS, NVIDIA’s suite of open-source libraries that run on GPUs and can be used in place of the standard Python data science libraries (pandas, scikit-learn, PyTorch, Matplotlib). None of the standard libraries, with the exception of PyTorch, has built-in GPU support; they compute on the CPU instead, which takes a significant amount of time. The GPU-backed RAPIDS libraries allow machine learning model development to happen in a tiny fraction of the typical computation time.

Models that took an hour to train using scikit-learn were trained in less than 5 minutes using cuML, the corresponding RAPIDS library. When I heard this statistic, I was astounded by the amount of time saved; on top of the speed, the libraries are very easy to use. RAPIDS has a counterpart for each of Python’s data science libraries, and each counterpart has the same functions as the library it replaces. For example, to use RAPIDS’ version of pandas, you just replace each instance of pandas in your code with cudf, and similarly with the other libraries. The talk went on to demonstrate a hyperparameter sweep with Hyperopt, and how RAPIDS integrates with it to make the sweep extremely fast compared to a grid search in scikit-learn. RAPIDS is a toolkit that I plan to explore further, as computation time is a significant frustration for me (as it is for many data scientists!).
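As a rough illustration of that drop-in workflow (the file name and column names below are hypothetical, not from the talk), a CPU snippet and its RAPIDS counterpart look almost identical:

# CPU version with pandas + scikit-learn
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("data.csv")                     # hypothetical dataset
cpu_model = LinearRegression().fit(df[["x"]], df["y"])

# GPU version with RAPIDS: swap the imports, keep the code
import cudf
from cuml.linear_model import LinearRegression as CuLinearRegression

gdf = cudf.read_csv("data.csv")                  # same file, loaded into GPU memory
gpu_model = CuLinearRegression().fit(gdf[["x"]], gdf["y"])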


Optimizing Models for Production

Pentreath’s talk was on running deep learning models for inference on edge devices like mobile phones. These devices typically have limited resources, so models have to be scaled down in order to run efficiently. Pentreath presented four main ways of doing this: architecture improvement, model pruning, quantization, and model distillation. Each of the four techniques leads to significant efficiency improvements; however, their effect on accuracy varies. Architecture improvement and model distillation typically cause a decrease in accuracy, whereas model pruning and quantization can often cause an increase in accuracy. I think it is easy for models to become bloated, so these techniques can be useful for managing memory and computation time regardless of whether the models are being run on edge devices.

Emerging Research

A cheaper and more accurate BERT

The research I’m highlighting this week also focuses on scaled-down models. This week’s paper, ‘ALBERT: A Lite BERT for Self-supervised Learning of Language Representations’ by Lan et al.¹, presents a successor to the famous BERT. This research was presented at the ICLR conference in April 2020. The authors demonstrated two ways to reduce the training time and memory consumption of BERT, whilst also attaining superior accuracy on benchmark tasks.

This optimized architecture, ALBERT, uses two parameter reduction techniques: factorized embedding parameterization and cross-layer parameter sharing. Factorized embedding parameterization splits the vocabulary embedding matrix into two smaller matrices so that the vocabulary embedding is no longer connected to the size of the hidden layers in the model. Cross-layer parameter sharing means all parameters are shared across each layer, so the number of parameters does not necessarily grow as the network becomes deeper.
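As a back-of-the-envelope illustration of the first technique (the sizes below roughly follow the vocabulary, embedding, and hidden sizes discussed in the ALBERT paper), factorizing the embedding matrix shrinks it dramatically:

# Factorized embedding parameterization: V x H  ->  V x E + E x H
V = 30000   # WordPiece vocabulary size
H = 4096    # hidden size of a large configuration
E = 128     # ALBERT's much smaller embedding size

direct = V * H               # single V x H embedding matrix: ~123M parameters
factorized = V * E + E * H   # two smaller matrices: ~4.4M parameters
print(direct, factorized)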


Furthermore, the researchers used sentence-order prediction loss in training the model instead of the next-sentence prediction loss used in training BERT. Next-sentence prediction loss is a binary classification loss used to predict if two sequences of text appear sequentially in a dataset. The aim of using this loss originally was to improve BERT’s performance on downstream tasks, such as natural language inference, by focusing on topic prediction and coherence prediction. However, studies have found that it is unreliable. The loss proposed by Lan et al. focused only on coherence prediction, and helped to train an ALBERT model that is consistently more accurate on downstream tasks than BERT.

Other important takeaways are that an ALBERT configuration analogous to BERT-large has 1/18th the number of parameters and trains in less than 2/3 the amount of time, and that ALBERT can achieve state-of-the-art accuracy on three standard NLP benchmarks — GLUE, RACE and SQuAD.

Overall, seeing the advances made in NLP research since BERT was released has been very exciting for me, and NLP tasks are much easier when I can use such powerful and optimized pretrained models.

Next week I will be presenting more of my work in AI, and also discussing new research on the use of autoencoders in computer vision. Thanks for reading and I appreciate any comments/feedback/questions.

References

[1] Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P. and Soricut, R., 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. The International Conference on Learning Representations (ICLR).



My Week in AI: Part 6 was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.

Via https://becominghuman.ai/my-week-in-ai-part-6-42529f808be8?source=rss—-5e5bef33608a—4

source https://365datascience.weebly.com/the-best-data-science-blog-2020/my-week-in-ai-part-6

A Complete Guide To Survival Analysis In Python part 3

Concluding this three-part series covering a step-by-step review of statistical survival analysis, we look at a detailed example implementing the Kaplan-Meier fitter based on different groups, a Log-Rank test, and Cox Regression, all with examples and shared code.

Originally from KDnuggets https://ift.tt/339LZYX

source https://365datascience.weebly.com/the-best-data-science-blog-2020/a-complete-guide-to-survival-analysis-in-python-part-3

Top KDnuggets tweets Jul 22-28: Increase your expertise in machine learning with a foundational understanding of Bayesian Statistics

Also: Why You Should Get Google’s New #MachineLearning Certificate; Remote #DataScience Internships For Everyone.

Originally from KDnuggets https://ift.tt/39GkDdT

source https://365datascience.weebly.com/the-best-data-science-blog-2020/top-kdnuggets-tweets-jul-22-28-increase-your-expertise-in-machine-learning-with-a-foundational-understanding-of-bayesian-statistics

A Tour of End-to-End Machine Learning Platforms

An end-to-end machine learning platform needs a holistic approach. If you’re interested in learning more about a few well-known ML platforms, you’ve come to the right place!

Originally from KDnuggets https://ift.tt/3f7fAEJ

source https://365datascience.weebly.com/the-best-data-science-blog-2020/a-tour-of-end-to-end-machine-learning-platforms

AI for CFD: byteLAKE’s approach (part 3)


Welcome to part 3 of my AI for CFD blog post series. Previously I explained how the team at byteLAKE took their first steps into the world of CFD some 10 years ago. I summarized how we tried different approaches as we progressed and delivered projects focused on both GPU and FPGA adaptation work. While moving to heterogeneous computing seemed like the right, and the only, choice at that time to speed up the calculations, we always knew something was missing in the equation. Nevertheless, we listened to the voice of our clients and partners and eventually decided to build a cross-platform solution, with the option for extra benefits offered by heterogeneous architectures, rather than a solution tied to any particular hardware. Also, having completed several research projects, we saw a great and at the same time uncharted potential for applying AI (Artificial Intelligence) in one more area, this time HPC (High-Performance Computing) industrial simulations.

Based on all of these experiences we always asked the following question within the byteLAKE team: could AI enable tremendous acceleration of calculations across the CFD industry? Well, it seems that AI has a lot to offer for CFD. Follow this blog post series to stay up to date with byteLAKE’s CFD Suite development work.

byteLAKE’s AI for CFD, CFD Suite, is a cross-platform solution and is not tied to any particular hardware; it can run on CPU-only as well as heterogeneous architectures. We designed it so that our partners can use CFD Suite in their existing environments and no adjustments are needed.

So what is byteLAKE’s CFD Suite? It’s a collection of innovative AI Models for CFD (Computational Fluid Dynamics). We believe we are on our way to offering, through AI, a level of acceleration that is beyond what’s achievable with hardware upgrades or with adapting algorithms to hardware accelerators alone. Needless to say, the latter approach comes at a huge cost, which we aim to eliminate.


Our strategy is to replace the numerical solvers with equivalent AI models or enable interaction between AI and solvers for much faster analysis and reduced cost of trial & error experiments. In other words, the CFD Suite will make it possible to run many more experiments and better explore the design space before decisions are made. All within radically lower budgets related to the time and resources needed. We are working on solver specific AI models to ensure high accuracy of predictions. Also, byteLAKE’s CFD Suite will become a foundation for bringing the Digital Twin concept to the world of CFD simulations. The models will be generalized so that they can work across various use cases or geometries. We have already reached a milestone where the same model can be successfully used when simulation input parameters change.

byteLAKE’s CFD Suite is a collection of innovative AI Models for CFD (Computational Fluid Dynamics). Ultra-fast results, radically lower TCO. Explore new possibilities. One model even if simulation parameters change.

One can say that byteLAKE’s CFD Suite is a collection of plug & play modules or add-ons for your existing CFD software, or a collection of tools that can work in parallel to your existing toolchain or simulation workflow. There is no need to make any adjustments to your existing infrastructure. It can work on CPU-only architectures or CPU+GPU (with FPGAs on the roadmap as well), on your laptop/PC or on-premises data center, and it will be available through our cloud computing partners. In other words, when byteLAKE says you will eventually get an AI model that is trained to handle your simulation(s), you will get a tool that:

  • takes the same input data as you currently provide to your existing CFD tools
  • produces output that is compatible with what you get from your existing toolchain, meaning it can be used directly for visualization in your tools.

Bottom line is that byteLAKE’s CFD Suite has been designed as a cross-platform solution to address your CFD acceleration needs. Full compatibility with existing environments makes the learning curve as short as possible.

AI Models within byteLAKE’s CFD Suite work as plug & play modules or tools, fully compatible with your existing toolchains or infrastructure (no hardware-specific requirements). Their ultimate purpose is to generate CFD simulation results much faster than their counterpart CFD solvers.

How does the CFD Suite work? Although it is still a work in progress, with the first product scheduled for launch in November 2020, let me say a few words about the underlying architecture as of July 2020. A high-level view is illustrated below.

byteLAKE’s CFD Suite (High Level)

You typically create a line-up of input data describing the phenomena, objects, and related characteristics such as velocities. This data is then fed into CFD solvers, which produce results sequentially, each iteration or time step becoming an intermediate result of your simulation. All these intermediate results wrap up into what’s called the simulation result. This result then becomes the foundation for further analysis (and calculations) to answer the designer’s questions, dispel doubts, and establish optimal configurations.

Process-wise, this is unchanged in the case of AI. Only the results (the simulation result, the intermediate results if needed, and further calculations) are produced much faster. It must be noted that AI is just a tool here, used to accelerate the calculations. Thus, there is again no need for any adjustments on your side, and it really does not matter whether your organization is ready for AI or not, or whether you have already considered AI within your roadmaps or strategies. AI is a hidden layer within byteLAKE’s CFD Suite, designed to work seamlessly. As mentioned, its only purpose is to accelerate the calculations, not to generate requirements in terms of organizational changes, infrastructure upgrades, etc.

On the inside, you will find byteLAKE’s proprietary combination of various CNNs (Convolutional Neural Networks) and RNNs (Recurrent Neural Networks), the latter mostly being LSTM (Long Short-Term Memory) networks. For the time being we are mostly simulating results for 2D and 3D meshes of sizes from 32×32(×32) to 128×128(×128). The algorithm we are starting with is a general advection algorithm (MPDATA in our case, used e.g. in weather simulations as part of the COSMO model). So far (July 2020) we have successfully predicted the next 8, 16, and 32 time steps. Regarding accuracy, example results are as follows:

  • Pearson’s correlation: 0.75–0.98
  • MSE: 0.03–0.6
  • Relative error: 0.2–15%

Depending on the scenario, our neural networks consist of 2–14 layers, and we usually train for ~10 epochs for quick testing and 100 for general usage. To ensure that the AI Models are neither overfitted nor underfitted, we have also designed a dedicated software auto-tuning mechanism (machine learning) that allows us to optimize the hyperparameters automatically. Software automatic tuning (auto-tuning) is a paradigm enabling software to adapt automatically to a variety of computational conditions.

byteLAKE’s software autotuning mechanism, powered by Machine Learning is another tool that ensures optimal configuration and maximum performance of the AI Models within the byteLAKE’s CFD Suite.

As pictured below, auto-tuning analyzes the neural networks and for instance, ensures that the AI Models produce accurate results optimally.

byteLAKE’s CFD Suite — training history (July 2020)
  • X-Axis: epochs
  • Y-Axis: MSE
  • Blue: loss (MSE for training dataset)
  • Red: val-loss (MSE for validation dataset)

Currently, the main challenge is to increase the accuracy of predictions of time steps that are based on already predicted intermediate results.

Thank you for reading another blog post in the series! Hope you are as excited about our product as we are. I will be revealing more results as we progress with a call for partners coming up next! Stay tuned.




AI for CFD: byteLAKE’s approach (part3) was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.

Via https://becominghuman.ai/ai-for-cfd-bytelakes-approach-part3-e9cb60bd9e3b?source=rss—-5e5bef33608a—4

source https://365datascience.weebly.com/the-best-data-science-blog-2020/ai-for-cfd-bytelakes-approach-part3

My Week in AI: Part 5

Photo by Stephen Dawson on Unsplash

Welcome to My Week in AI! Each week this blog will have the following parts:

  • What I have done this week in AI
  • An overview of an exciting and emerging piece of AI research

Progress Update

Visualizing Data Interactively

I’ve spent this week working on a dashboarding project using Plotly and Dash. Dashboarding is new for me, and I have been excited to learn Dash whilst building something with real data. I have found Dash very easy to use and intuitive, mostly due to the fact that it is a declarative and reactive library. Similar to Altair, which I spoke about in Part 3, Dash is an all-Python library, making it very easy for data scientists like myself to develop interactive and aesthetically pleasing dashboards.

I’ve also spent some time at the Spark + AI Summit by Databricks, which started on Monday and is a conference that I have been greatly looking forward to. I am especially interested in the talks on productionizing machine learning models and on using MLflow, and I’ll be sharing my thoughts and reactions in next week’s blog.


Emerging Research

Model Inversion Attacks with Generative Models

The major AI research event that took place over the last week was CVPR 2020 — the pre-eminent computer vision conference. There were many fascinating papers submitted as always, so I decided that the research I would present this week should be from that conference. One paper that caught my eye was ‘The Secret Revealer: Generative Model-Inversion Attacks against Deep Neural Networks’ by Zhang et al., a research team from China and the US¹. In this research, they demonstrated a novel model-inversion attack method, which was able to extract information about training data from a deep neural network.

Model-inversion attacks are especially dangerous with models that use sensitive data for training, for example healthcare data, or facial image datasets. The running example used in this research was a white-box attack on a facial recognition classifier, to expose the training data and recover the face images. The researchers’ method involved training GANs on public ‘auxiliary data,’ which in their example was defined as images in which the faces were blurred. This encourages the generator to produce realistic images. The next step was to use the trained generator to then recover the missing sensitive regions in the image; this step was framed as an optimization problem.

Proposed attack method¹

The researchers found that their method performed favorably in this task when compared with previous state-of-the-art methods. They also made two further empirical observations that I would like to reiterate. First, that models with high predictive power can be attacked with higher accuracy because such models are able to build a strong correlation between features and labels; this characteristic is exactly what is exploited in model inversion attacks. Second, differential privacy could not protect against this method of attack, as it does not aim to conceal the private attributes of the training data — it only obscures them with statistical noise. This raises questions about models that rely on differential privacy for information security.


Unsupervised Learning of 3D Objects from 2D Images

I also wanted to mention the Best Paper Award winner from CVPR 2020, ‘Unsupervised Learning of Probably Symmetric Deformable 3D Objects from Images in the Wild,’ by Wu et al². They proposed a method of learning 3D objects from single-view images without any external supervision. The researchers centered their method on an autoencoder that draws information based on the depth, albedo, viewpoint and illumination of the input image. Many of the results they presented in their paper were noteworthy, and I highly recommend reading it and trying their demo (which is available on their Github page).

Join me next week for my thoughts on the Spark + AI summit, and an overview of a piece of exciting and emerging research. Thanks for reading and I appreciate any comments/feedback/questions.

References

[1] Zhang, Y., Jia, R., Pei, H., Wang, W., Li, B. and Song, D., 2020. The Secret Revealer: Generative Model-Inversion Attacks Against Deep Neural Networks. The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.253–261.

[2] Wu, S., Rupprecht, C. and Vedaldi, A., 2020. Unsupervised Learning of Probably Symmetric Deformable 3D Objects From Images in the Wild. The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.1–10.



My Week in AI: Part 5 was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.

Via https://becominghuman.ai/my-week-in-ai-part-5-9543453bfd90?source=rss—-5e5bef33608a—4

source https://365datascience.weebly.com/the-best-data-science-blog-2020/my-week-in-ai-part-5
