Applied Natural Language Processing (NLP) in Python | Exploring NLP Libraries

Natural Language Processing (NLP) is one of the oldest branches of artificial intelligence, with work dating back to the 1950s. It is still under continuous development and commands a great deal of importance in the field of data science.

So if you want to add a new feather to your cap by learning applied NLP, you’ve come to the right spot. Whether you want to become a data scientist or simply pick up a new skill, this tutorial will get you hands-on with NLP and show you practical techniques for dealing with raw text data, without overwhelming you with a barrage of information.

Let us first look at how NLP came into use.

Text Mining

With the social media boom, companies have access to massive behavioral data of their customers, enabling them to use that data to fuel business processes and make informed decisions. But there’s a teeny, tiny problem.

The data is unstructured.

Raw and unstructured data is not of much use on its own, as it doesn’t give any valuable insights. Just like an uncut diamond needs to undergo polishing to reveal the flawless gem underneath, raw data needs to be mined and analyzed to be of any practical use.

This is where text mining comes in. It is the process of extracting and deriving useful information, patterns, and insights from a large collection of unstructured, textual data.

This field can be divided into four practice areas:

  1. Information Extraction
  • Deals with the identification and extraction of relevant facts and relationships from unstructured data.

  2. Document Classification and Clustering
  • Aims at grouping and categorizing terms, paragraphs, and documents using classification and clustering methods.

  3. Information Retrieval
  • Deals with the storage and retrieval of text documents.

  4. Natural Language Processing (NLP)
  • Uses different computational techniques to analyze and understand the underlying structure of text data.

As you can see, NLP is an area of text mining or text analysis where the final goal is to make computers understand the unstructured text and retrieve meaningful pieces of information from it.


Natural Language Toolkit (NLTK)

NLTK is a powerful Python package that contains several algorithms to help computers pre-process, analyze, and understand natural language and written text.

Common NLTK Algorithms:

  • Tokenization
  • Part-of-speech Tagging
  • Named-entity Recognition
  • Sentiment Analysis

Now, let’s download and install NLTK via terminal (Command prompt in Windows).

Note: The instructions given below are based on the assumption that you have Python and Jupyter Notebook installed. So, if you haven’t installed these, please pause here and install them before moving ahead.

  1. Go to the Scripts folder and copy the path.

  2. Open the Windows command prompt and navigate to the Scripts folder.

  3. Enter pip3 install nltk to install NLTK.

After you have successfully installed NLTK on your machine, we will explore the different modules of this package.

NLTK Corpora

A corpus (plural: corpora) is a large body of written or spoken text used for linguistic analysis and the development of NLP tools.

To download the datasets, open a Jupyter notebook and run the following code:

import nltk
nltk.download()

A GUI will pop up, where you can click on the ‘Download’ button to download all the data packages.

Since this consists of a large number of datasets, it might take some time to complete, so make sure you have a fast internet connection before downloading them.

After you’re done, come back to the Jupyter notebook, where we will be exploring the different NLP tools.

Tokenization

Tokenization is the process of breaking a string or text paragraph into smaller chunks, or tokens, such as words, phrases, keywords, and symbols. Some useful tokenization methods:

  • sent_tokenize
  • word_tokenize
  • RegexpTokenizer
  • BlanklineTokenizer

sent_tokenize

This function extracts all sentences present in a text document.

First, we import sent_tokenize and then pass it a paragraph as an argument, which outputs a list that contains all the sentences as individual elements.
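The original notebook’s code isn’t reproduced here, so here is a minimal sketch with a made-up sample paragraph:

from nltk.tokenize import sent_tokenize

text = "NLP is a fascinating field. It powers chatbots and search engines. Let's explore it."

sentences = sent_tokenize(text)
print(sentences)
# ['NLP is a fascinating field.', 'It powers chatbots and search engines.', "Let's explore it."]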

You can also make use of one of the datasets of the NLTK corpora instead of using a sample string.

word_tokenize

This function extracts individual words from a text document.
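Again, a minimal sketch with a made-up sentence:

from nltk.tokenize import word_tokenize

text = "Tokenization breaks text into words, punctuation and symbols."
print(word_tokenize(text))
# ['Tokenization', 'breaks', 'text', 'into', 'words', ',', 'punctuation', 'and', 'symbols', '.']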

RegexpTokenizer

This is used to match patterns in a text document and extract only the tokens that match a given regular expression (regex).

Here, we will use RegexpTokenizer to match with a regex which will give us a list of all the numbers present in the Bible.
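The original code isn’t shown here; the sketch below assumes the King James Bible text (bible-kjv.txt) that ships with the NLTK Gutenberg corpus downloaded earlier:

from nltk.corpus import gutenberg
from nltk.tokenize import RegexpTokenizer

bible = gutenberg.raw('bible-kjv.txt')   # raw text of the King James Bible
tokenizer = RegexpTokenizer(r'\d+')      # regex that matches one or more digits
numbers = tokenizer.tokenize(bible)
print(numbers[:10])                      # the first ten numbers found in the text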

BlanklineTokenizer

This tokenizer splits a text into chunks wherever it encounters one or more blank lines, so blocks of text separated by blank lines become separate tokens.


In the example below, RegexpTokenizer and BlanklineTokenizer are applied to the same sample text, which gives a clear idea of how the two differ.
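A sketch of what that comparison could look like (the sample text is made up):

from nltk.tokenize import RegexpTokenizer, BlanklineTokenizer

sample = """Chapter 1 has 3 short verses.

Chapter 2 adds 5 more.

The end."""

print(RegexpTokenizer(r'\d+').tokenize(sample))
# ['1', '3', '2', '5']  -> only the numbers
print(BlanklineTokenizer().tokenize(sample))
# ['Chapter 1 has 3 short verses.', 'Chapter 2 adds 5 more.', 'The end.']  -> split on blank lines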

Frequency Distribution

This is used to get word frequencies in a text. It works like a dictionary, where the keys are the words and the values are the number of times each word occurs.

After tokenizing the text into words, we pass the tokens to the FreqDist class to create a frequency distribution object.

The most_common method of the FreqDist object shows the most frequently occurring words, and the plot method graphs the distribution.

You can also iterate through the frequency distribution and find the number of occurrences of any particular word.
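A minimal sketch of these steps, using a made-up sentence:

from nltk import FreqDist
from nltk.tokenize import word_tokenize

text = "The quick brown fox jumps over the lazy dog, and the dog barks."
tokens = word_tokenize(text)

freq = FreqDist(tokens)
print(freq.most_common(5))   # e.g. [('the', 2), ('dog', 2), ('The', 1), ...]
freq.plot(10)                # line plot of the 10 most common tokens (needs matplotlib)
print(freq['dog'])           # occurrences of one particular word -> 2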

As you can see above, a lot of useless ‘words’ have been tokenized, from punctuation marks like commas to common words like ‘the’ and ‘a’.

Stopwords

Fortunately, NLTK ships with lists of such useless words, called stopwords, for 16 different languages. Using these, we can filter and clean the data.

We will try finding out the frequency distribution of the words but this time we will be using stopwords to get better results.

In the list comprehension below, we loop through the word tokens and keep only those words that are not present in the stopwords list and whose length is greater than 3.

If you have trouble understanding list comprehension, here is the same code, written in a more familiar way.
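Here is a sketch of both versions; it assumes a list of word tokens named tokens from the previous step:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# List-comprehension version
clean_tokens = [w for w in tokens if w.lower() not in stop_words and len(w) > 3]

# The same logic as an explicit loop
clean_tokens = []
for w in tokens:
    # compare in lowercase because the NLTK stopword list is all lowercase
    if w.lower() not in stop_words and len(w) > 3:
        clean_tokens.append(w)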

If we now plot the graph and look at the most common occurrences of words, we will get a much more sensible result.

But there are many cases where two or more consecutive words give us more information than individual words do. Fortunately, NLTK provides a way to handle that.

Bigrams, Trigrams, and Ngrams

Two consecutive words that occur in a text document are called bigrams.

We first break the text into word tokens and then pass those tokens to the bigrams module imported from nltk to convert them into bigrams. We then pass the result to FreqDist and use the most_common method to look at the top 5 most common bigrams.

Similarly, three consecutive words that appear in a sentence are called trigrams.

We can also look for more consecutive words by using the ngrams module.
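A combined sketch for bigrams, trigrams, and n-grams, again on a made-up snippet of text:

from nltk import bigrams, trigrams, ngrams, FreqDist
from nltk.tokenize import word_tokenize

tokens = word_tokenize("to be or not to be that is the question")

bigram_freq = FreqDist(bigrams(tokens))
print(bigram_freq.most_common(5))       # ('to', 'be') occurs twice

trigram_freq = FreqDist(trigrams(tokens))
print(trigram_freq.most_common(5))

fourgram_freq = FreqDist(ngrams(tokens, 4))   # any n via ngrams(tokens, n)
print(fourgram_freq.most_common(5))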

Stemming

Stemming is a text normalization process that reduces derived words to their root or base form. NLTK provides the PorterStemmer class to perform stemming on word tokens.
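A minimal sketch with a few illustrative words:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["studies", "studying", "flying", "beautiful"]
print([stemmer.stem(w) for w in words])
# ['studi', 'studi', 'fli', 'beauti']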

As you can see, stemming doesn’t really help much here, because it truncates words so aggressively that the results often have no morphological sense anymore. That is why we have another process called lemmatization.

Lemmatization

Lemmatization is a process that reduces words to their root or base form using vocabulary and morphological analysis. Unlike stemming, it takes the context of each word into account, so the result is always a valid word.

After creating an instance of the WordNetLemmatizer class, we call the method lemmatize to lemmatize the words. Words are reduced to their original form while still retaining their meaning.
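A minimal sketch:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))          # 'study'
print(lemmatizer.lemmatize("corpora"))          # 'corpus'
print(lemmatizer.lemmatize("better", pos="a"))  # 'good' (pos="a" marks it as an adjective)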

Parts-of-Speech Tagging

Here, the task is to label or tag each word in a sentence with its grammatical category, such as noun, pronoun, adjective, and many more. Some of the tag abbreviations are listed below, followed by an example:

CC — Coordinating Conjunction

JJ — Adjective

IN — Preposition/Subordinating Conjunction

JJR — Adjective, comparative

JJS — Adjective, superlative

NN — Noun, singular

NNP — Proper Noun, singular

PRP — Personal Pronoun
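A minimal sketch of tagging a made-up sentence (the exact tags can vary slightly between NLTK versions):

from nltk import pos_tag, word_tokenize

tokens = word_tokenize("The quick brown fox jumps over the lazy dog")
print(pos_tag(tokens))
# [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'),
#  ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]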

As you can see, the pos_tag function takes in word tokens as input and returns a list of tuples consisting of the word alongside the part of speech tag.

Named-Entity-Recognition

This is used to identify important named entities in a text such as people, places, locations, dates, organizations, etc. Here are a few types of entities along with their examples-

ORGANIZATION — Facebook, Alphabet, etc.

LOCATION — 22 West St, Mount Everest

GPE (Geo-Political Entity) — India, Ukraine, South East Asia

MONEY — 100 Million Dollars

PERSON — Obama, George W. Bush

DATE — July, 2019–05–2

TIME — three forty pm, 3:12 am

FACILITY — Washington Monument, Stonehenge

In order to use this module, we have to pass in part-of-speech tags as the argument. So we first import the necessary modules, tokenize the text into words, tag each token with its part of speech, and then pass the tagged tokens to ne_chunk, which returns the named entities.
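A minimal sketch (the sentence is made up, and the exact grouping of entities can vary between NLTK versions and data files):

from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "Mark Zuckerberg founded Facebook in California."
tree = ne_chunk(pos_tag(word_tokenize(sentence)))
print(tree)
# Entities such as PERSON, ORGANIZATION and GPE appear as labelled subtrees, e.g.
# (ORGANIZATION Facebook/NNP) and (GPE California/NNP)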

Congratulations! You have completed this tutorial and now you’re equipped to take on NLP projects on a beginner level.

But beware: this tutorial is just the tip of the iceberg. NLP is a vast field with many more methods and modules than we have gone through here, so I would encourage you to check them out on your own and build small projects to put your newly learned skills to good use.

Github Repository link:

sthitaprajna-mishra/applied_nlp_python_tutorial



Applied Natural Language Processing (NLP) in Python | Exploring NLP Libraries was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.

Via https://becominghuman.ai/applied-natural-language-processing-nlp-in-python-exploring-nlp-libraries-1d95710d5186?source=rss—-5e5bef33608a—4

source https://365datascience.weebly.com/the-best-data-science-blog-2020/applied-natural-language-processing-nlp-in-python-exploring-nlp-libraries

Gradient Descent Explained


Gradient descent is an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent as defined by the negative of the gradient.

Picture the cost function as a 3-dimensional mountainous landscape. Our goal is to move from a point high up on the mountain, which corresponds to a high cost, down to the dark blue sea below it, which corresponds to a low cost. The direction of steepest descent from any given point is the one that decreases the cost function as quickly as possible.

Starting at the top of the mountain, we take baby steps downhill in the direction of the negative gradient. After each step, we recalculate the negative gradient and take another step in that direction. We continue this process until we reach the bottom, i.e., a local minimum.

Learning Rate

The size of each step is called the learning rate. A high learning rate makes the descent faster, but you risk overshooting the minimum. A small learning rate is more precise, but it is time-consuming. Finding the “right” learning rate is therefore very important.

Cost Function

The cost (or loss) function describes how well the model performs given the current set of parameters (weights and biases), and gradient descent is used to find the set of parameters that minimizes it. (Clare Liu)


Math

Cost Function

f(m, b) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - (m x_i + b) \right)^2

Gradient Descent

f'(m, b) = \begin{bmatrix} \frac{\partial f}{\partial m} \\ \frac{\partial f}{\partial b} \end{bmatrix} = \begin{bmatrix} \frac{1}{N} \sum -2 x_i \left( y_i - (m x_i + b) \right) \\ \frac{1}{N} \sum -2 \left( y_i - (m x_i + b) \right) \end{bmatrix}

Code

def update_weights(m, b, X, Y, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        # Calculate partial derivatives
        # -2x(y - (mx + b))
        m_deriv += -2*X[i] * (Y[i] - (m*X[i] + b))

        # -2(y - (mx + b))
        b_deriv += -2*(Y[i] - (m*X[i] + b))

    # We subtract because the derivatives point in direction of steepest ascent
    m -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m, b
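A usage sketch (the toy data, learning rate, and number of iterations below are illustrative, not from the original article); the update is applied repeatedly until m and b converge:

# Fit y = 2x + 1 on a few toy points
X = [1, 2, 3, 4, 5]
Y = [3, 5, 7, 9, 11]

m, b = 0.0, 0.0
for _ in range(5000):          # number of gradient steps
    m, b = update_weights(m, b, X, Y, learning_rate=0.01)

print(m, b)                    # approaches m ≈ 2, b ≈ 1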




Gradient Descent Explained was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.

Via https://becominghuman.ai/gradient-descent-explained-1d95436896af?source=rss—-5e5bef33608a—4

source https://365datascience.weebly.com/the-best-data-science-blog-2020/gradient-descent-explained

My Week in AI: Part 6

Photo by Aron Visuals on Unsplash

Welcome to My Week in AI! Each week this blog will have the following parts:

  • What I have done this week in AI
  • An overview of an exciting and emerging piece of AI research

Progress Update

Absorbing Best Practices

This week I attended the Spark + AI Summit, hosted by Databricks. This conference offered lots of informative and useful talks, mostly on the topics of data engineering and productionizing machine learning models. I found two talks particularly enlightening: ‘Accelerating MLFlow Hyper-parameter Optimization Pipelines with RAPIDS’ by John Zedlewski from NVIDIA, and ‘Scaling up Deep Learning by Scaling Down’ by Nick Pentreath from IBM.

Training Models Rapidly

Zedlewski’s talk covered RAPIDS, NVIDIA’s suite of open-source libraries that run on GPUs and can be used in place of the standard Python data science libraries (pandas, scikit-learn, PyTorch, Matplotlib). None of the standard libraries, with the exception of PyTorch, has built-in GPU support; they compute on the CPU instead, which takes a significant amount of time. The GPU-backed RAPIDS libraries allow machine learning model development to happen in a tiny fraction of the typical computation time.

Models that took an hour to train using scikit-learn were trained in less than 5 minutes using cuML, the corresponding RAPIDS library. When I heard this statistic, I was astounded by the amount of time saved; on top of the speed, the libraries are very easy to use. RAPIDS has a counterpart for each of Python’s data science libraries, and each counterpart has the same functions as the library it replaces. For example, to use RAPIDS’ version of pandas, you just replace each instance of pandas in your code with cudf, and similarly with the other libraries. The talk went on to demonstrate a hyperparameter sweep with Hyperopt, and how RAPIDS integrates with it to make the sweep extremely fast compared to a grid search in scikit-learn. RAPIDS is a toolkit that I plan to explore further, as computation time is a significant frustration for me (as it is for many data scientists!).
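As a rough illustration of that drop-in workflow (the file name and column names below are hypothetical, not from the talk), a CPU snippet and its RAPIDS counterpart look almost identical:

# CPU version with pandas + scikit-learn
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.read_csv("data.csv")                     # hypothetical dataset
cpu_model = LinearRegression().fit(df[["x"]], df["y"])

# GPU version with RAPIDS: swap the imports, keep the code
import cudf
from cuml.linear_model import LinearRegression as CuLinearRegression

gdf = cudf.read_csv("data.csv")                  # same file, loaded into GPU memory
gpu_model = CuLinearRegression().fit(gdf[["x"]], gdf["y"])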


Optimizing Models for Production

Pentreath’s talk was on running deep learning models for inference on edge devices like mobile phones. These devices typically have limited resources, so models have to be scaled down in order to run efficiently. Pentreath presented four main ways of doing this: architecture improvement, model pruning, quantization, and model distillation. Each of the four techniques leads to significant efficiency improvements; however, their effect on accuracy varies. Architecture improvement and model distillation typically cause a decrease in accuracy, whereas model pruning and quantization can often cause an increase in accuracy. I think it is easy for models to become bloated, so these techniques can be useful for managing memory and computation time regardless of whether the models are being run on edge devices.

Emerging Research

A cheaper and more accurate BERT

The research I’m highlighting this week also focuses on scaled-down models. This week’s paper, ‘ALBERT: A Lite BERT for Self-supervised Learning of Language Representations’ by Lan et al.¹, presents a successor to the famous BERT. This research was presented at the ICLR conference in April 2020. The authors demonstrated two ways to reduce the training time and memory consumption of BERT, whilst also attaining superior accuracy on benchmark tasks.

This optimized architecture, ALBERT, uses two parameter reduction techniques: factorized embedding parameterization and cross-layer parameter sharing. Factorized embedding parameterization splits the vocabulary embedding matrix into two smaller matrices so that the vocabulary embedding is no longer connected to the size of the hidden layers in the model. Cross-layer parameter sharing means all parameters are shared across each layer, so the number of parameters does not necessarily grow as the network becomes deeper.
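As a back-of-the-envelope illustration of the first technique (the sizes below roughly follow the vocabulary, embedding, and hidden sizes discussed in the ALBERT paper), factorizing the embedding matrix shrinks it dramatically:

# Factorized embedding parameterization: V x H  ->  V x E + E x H
V = 30000   # WordPiece vocabulary size
H = 4096    # hidden size of a large configuration
E = 128     # ALBERT's much smaller embedding size

direct = V * H               # single V x H embedding matrix: ~123M parameters
factorized = V * E + E * H   # two smaller matrices: ~4.4M parameters
print(direct, factorized)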


Furthermore, the researchers used sentence-order prediction loss in training the model instead of the next-sentence prediction loss used in training BERT. Next-sentence prediction loss is a binary classification loss used to predict if two sequences of text appear sequentially in a dataset. The aim of using this loss originally was to improve BERT’s performance on downstream tasks, such as natural language inference, by focusing on topic prediction and coherence prediction. However, studies have found that it is unreliable. The loss proposed by Lan et al. focused only on coherence prediction, and helped to train an ALBERT model that is consistently more accurate on downstream tasks than BERT.

Other important takeaways are that an ALBERT configuration analogous to BERT-large has 1/18th the number of parameters and trains in less than 2/3 the amount of time, and that ALBERT can achieve state-of-the-art accuracy on three standard NLP benchmarks — GLUE, RACE and SQuAD.

Overall, seeing the advances made in NLP research since BERT was released has been very exciting for me, and NLP tasks are much easier when I can use such powerful and optimized pretrained models.

Next week I will be presenting more of my work in AI, and also discussing new research on the use of autoencoders in computer vision. Thanks for reading and I appreciate any comments/feedback/questions.

References

[1] Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P. and Soricut, R., 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. The International Conference on Learning Representations (ICLR).



My Week in AI: Part 6 was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.

Via https://becominghuman.ai/my-week-in-ai-part-6-42529f808be8?source=rss—-5e5bef33608a—4

source https://365datascience.weebly.com/the-best-data-science-blog-2020/my-week-in-ai-part-6

A Complete Guide To Survival Analysis In Python part 3

Concluding this three-part series covering a step-by-step review of statistical survival analysis, we look at a detailed example implementing the Kaplan-Meier fitter based on different groups, a Log-Rank test, and Cox Regression, all with examples and shared code.

Originally from KDnuggets https://ift.tt/339LZYX

source https://365datascience.weebly.com/the-best-data-science-blog-2020/a-complete-guide-to-survival-analysis-in-python-part-3

Top KDnuggets tweets Jul 22-28: Increase your expertise in machine learning with a foundational understanding of Bayesian Statistics

Also: Why You Should Get Google’s New #MachineLearning Certificate; Remote #DataScience Internships For Everyone.

Originally from KDnuggets https://ift.tt/39GkDdT

source https://365datascience.weebly.com/the-best-data-science-blog-2020/top-kdnuggets-tweets-jul-22-28-increase-your-expertise-in-machine-learning-with-a-foundational-understanding-of-bayesian-statistics

A Tour of End-to-End Machine Learning Platforms

An end-to-end machine learning platform needs a holistic approach. If you’re interested in learning more about a few well-known ML platforms, you’ve come to the right place!

Originally from KDnuggets https://ift.tt/3f7fAEJ

source https://365datascience.weebly.com/the-best-data-science-blog-2020/a-tour-of-end-to-end-machine-learning-platforms

AI for CFD: byteLAKE’s approach (part 3)


Welcome to part 3 of my AI for CFD blog post series. Previously I explained how the team at byteLAKE took their first steps into the world of CFD some 10 years ago. I summarized how we tried different approaches as we progressed and delivered projects focused on both GPU and FPGA adaptation work. While moving to heterogeneous computing seemed like the right, and the only, choice at that time to speed up the calculations, we always knew something was missing in the equation. Nevertheless, we listened to the voice of our clients and partners and eventually decided to build a cross-platform solution, with the option for extra benefits offered by heterogeneous architectures, rather than a solution tied to any particular hardware. Also, having completed several research projects, we saw a great and at the same time uncharted potential for applying AI (Artificial Intelligence) in one more area, this time HPC (High-Performance Computing) industrial simulations.

Based on all of these experiences we always asked the following question within the byteLAKE team: could AI enable tremendous acceleration of calculations across the CFD industry? Well, it seems that AI has a lot to offer for CFD. Follow this blog post series to stay up to date with byteLAKE’s CFD Suite development work.

byteLAKE’s AI for CFD, CFD Suite, is a cross-platform solution and is not tied to any particular hardware; it can run on CPU-only as well as heterogeneous architectures. We designed it so that our partners can use CFD Suite in their existing environments and no adjustments are needed.

So what is byteLAKE’s CFD Suite? It’s a collection of innovative AI Models for CFD (Computational Fluid Dynamics). We believe we are on our way to offering, through AI, a level of acceleration that is beyond what’s achievable with hardware upgrades or with adapting algorithms to hardware accelerators alone. Needless to say, the latter approach comes at a huge cost, which we aim to eliminate.


Our strategy is to replace the numerical solvers with equivalent AI models or enable interaction between AI and solvers for much faster analysis and reduced cost of trial & error experiments. In other words, the CFD Suite will make it possible to run many more experiments and better explore the design space before decisions are made. All within radically lower budgets related to the time and resources needed. We are working on solver specific AI models to ensure high accuracy of predictions. Also, byteLAKE’s CFD Suite will become a foundation for bringing the Digital Twin concept to the world of CFD simulations. The models will be generalized so that they can work across various use cases or geometries. We have already reached a milestone where the same model can be successfully used when simulation input parameters change.

byteLAKE’s CFD Suite is a collection of innovative AI Models for CFD (Computational Fluid Dynamics). Ultra-fast results, radically lower TCO. Explore new possibilities. One model even if simulation parameters change.

One can say that byteLAKE’s CFD Suite is a collection of plug & play modules or add-ons for your existing CFD software, or a collection of tools that can work in parallel to your existing toolchain or simulation workflow. There is no need to make any adjustments to your existing infrastructure. It can work on CPU-only architectures or CPU+GPU (with FPGAs on the roadmap as well), on your laptop/PC or on-premises data center, and it will be available through our cloud computing partners. In other words, when byteLAKE says you will eventually get an AI model that is trained to handle your simulation(s), you will get a tool that:

  • takes the same input data as you currently provide to your existing CFD tools
  • produces output that is compatible with what you get from your existing toolchain, meaning it can be used directly for visualization in your tools.

Bottom line is that byteLAKE’s CFD Suite has been designed as a cross-platform solution to address your CFD acceleration needs. Full compatibility with existing environments makes the learning curve as short as possible.

AI Models within byteLAKE’s CFD Suite work as plug & play modules or tools, fully compatible with your existing toolchains or infrastructure (no hardware-specific requirements). Their ultimate purpose is to generate CFD simulation results much faster than their counterpart CFD solvers.

How does the CFD Suite work? Although it is still a work in progress, with the first product scheduled for launch in November 2020, let me say a few words about the underlying architecture as of July 2020. A high-level view is illustrated below.

byteLAKE’s CFD Suite (High Level)

You typically create a line-up of input data describing the phenomena, objects, and related characteristics such as velocities. This data is then fed into CFD solvers, which produce results sequentially, each iteration or time step becoming an intermediate result of your simulation. All these intermediate results wrap up into what’s called the simulation result. This result then becomes the foundation for further analysis (and calculations) to answer the designer’s questions, dispel doubts, and establish optimal configurations.

Process-wise, this is unchanged in the case of AI. Only the results (the simulation result, the intermediate results if needed, and further calculations) are produced much faster. It must be noted that AI is just a tool here, used to accelerate the calculations. Thus, there is again no need for any adjustments on your side, and it really does not matter whether your organization is ready for AI or not, or whether you have already considered AI within your roadmaps or strategies. AI is a hidden layer within byteLAKE’s CFD Suite, designed to work seamlessly. As mentioned, its only purpose is to accelerate the calculations, not to generate requirements in terms of organizational changes, infrastructure upgrades, etc.

On the inside, you will find byteLAKE’s proprietary combination of various CNNs (Convolutional Neural Networks) and RNNs (Recurrent Neural Networks), the latter mostly being LSTM (Long Short-Term Memory) networks. For the time being we are mostly simulating results for 2D and 3D meshes of sizes from 32×32(×32) to 128×128(×128). The algorithm we are starting with is a general advection algorithm (MPDATA in our case, used e.g. in weather simulations as part of the COSMO model). So far (July 2020) we have successfully predicted the next 8, 16, and 32 time steps. Regarding accuracy, example results are as follows:

  • Pearson’s correlation: 0.75–0.98
  • MSE: 0.03–0.6
  • Relative error: 0.2–15%

Depending on the scenario, our neural networks consist of 2–14 layers, and we usually train for ~10 epochs for quick testing and 100 for general usage. To ensure that the AI Models are neither overfitted nor underfitted, we have also designed a dedicated software auto-tuning mechanism (machine learning) that allows us to optimize the hyperparameters automatically. Software automatic tuning (auto-tuning) is a paradigm enabling software to adapt automatically to a variety of computational conditions.

byteLAKE’s software autotuning mechanism, powered by Machine Learning is another tool that ensures optimal configuration and maximum performance of the AI Models within the byteLAKE’s CFD Suite.

As pictured below, auto-tuning analyzes the neural networks and for instance, ensures that the AI Models produce accurate results optimally.

byteLAKE’s CFD Suite — training history (July 2020)
  • X-Axis: epochs
  • Y-Axis: MSE
  • Blue: loss (MSE for training dataset)
  • Red: val-loss (MSE for validation dataset)

Currently, the main challenge is to increase the accuracy of predictions of time steps that are based on already predicted intermediate results.

Thank you for reading another blog post in the series! Hope you are as excited about our product as we are. I will be revealing more results as we progress with a call for partners coming up next! Stay tuned.




AI for CFD: byteLAKE’s approach (part3) was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.

Via https://becominghuman.ai/ai-for-cfd-bytelakes-approach-part3-e9cb60bd9e3b?source=rss—-5e5bef33608a—4

source https://365datascience.weebly.com/the-best-data-science-blog-2020/ai-for-cfd-bytelakes-approach-part3

My Week in AI: Part 5

Photo by Stephen Dawson on Unsplash

Welcome to My Week in AI! Each week this blog will have the following parts:

  • What I have done this week in AI
  • An overview of an exciting and emerging piece of AI research

Progress Update

Visualizing Data Interactively

I’ve spent this week working on a dashboarding project using Plotly and Dash. Dashboarding is new for me, and I have been excited to learn Dash whilst building something with real data. I have found Dash very easy to use and intuitive, mostly due to the fact that it is a declarative and reactive library. Similar to Altair, which I spoke about in Part 3, Dash is an all-Python library, making it very easy for data scientists like myself to develop interactive and aesthetically pleasing dashboards.

I’ve also spent some time at the Spark + AI Summit by Databricks, which started on Monday and is a conference that I have been greatly looking forward to. I am especially interested in the talks on productionizing machine learning models and on using MLflow, and I’ll be sharing my thoughts and reactions in next week’s blog.


Emerging Research

Model Inversion Attacks with Generative Models

The major AI research event that took place over the last week was CVPR 2020 — the pre-eminent computer vision conference. There were many fascinating papers submitted as always, so I decided that the research I would present this week should be from that conference. One paper that caught my eye was ‘The Secret Revealer: Generative Model-Inversion Attacks against Deep Neural Networks’ by Zhang et al., a research team from China and the US¹. In this research, they demonstrated a novel model-inversion attack method, which was able to extract information about training data from a deep neural network.

Model-inversion attacks are especially dangerous with models that use sensitive data for training, for example healthcare data, or facial image datasets. The running example used in this research was a white-box attack on a facial recognition classifier, to expose the training data and recover the face images. The researchers’ method involved training GANs on public ‘auxiliary data,’ which in their example was defined as images in which the faces were blurred. This encourages the generator to produce realistic images. The next step was to use the trained generator to then recover the missing sensitive regions in the image; this step was framed as an optimization problem.

Proposed attack method¹

The researchers found that their method performed favorably in this task when compared with previous state-of-the-art methods. They also made two further empirical observations that I would like to reiterate. First, that models with high predictive power can be attacked with higher accuracy because such models are able to build a strong correlation between features and labels; this characteristic is exactly what is exploited in model inversion attacks. Second, differential privacy could not protect against this method of attack, as it does not aim to conceal the private attributes of the training data — it only obscures them with statistical noise. This raises questions about models that rely on differential privacy for information security.


Unsupervised Learning of 3D Objects from 2D Images

I also wanted to mention the Best Paper Award winner from CVPR 2020, ‘Unsupervised Learning of Probably Symmetric Deformable 3D Objects from Images in the Wild,’ by Wu et al². They proposed a method of learning 3D objects from single-view images without any external supervision. The researchers centered their method on an autoencoder that draws information based on the depth, albedo, viewpoint and illumination of the input image. Many of the results they presented in their paper were noteworthy, and I highly recommend reading it and trying their demo (which is available on their Github page).

Join me next week for my thoughts on the Spark + AI summit, and an overview of a piece of exciting and emerging research. Thanks for reading and I appreciate any comments/feedback/questions.

References

[1] Zhang, Y., Jia, R., Pei, H., Wang, W., Li, B. and Song, D., 2020. The Secret Revealer: Generative Model-Inversion Attacks Against Deep Neural Networks. The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.253–261.

[2] Wu, S., Rupprecht, C. and Vedaldi, A., 2020. Unsupervised Learning of Probably Symmetric Deformable 3D Objects From Images in the Wild. The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.1–10.



My Week in AI: Part 5 was originally published in Becoming Human: Artificial Intelligence Magazine on Medium, where people are continuing the conversation by highlighting and responding to this story.

Via https://becominghuman.ai/my-week-in-ai-part-5-9543453bfd90?source=rss—-5e5bef33608a—4

source https://365datascience.weebly.com/the-best-data-science-blog-2020/my-week-in-ai-part-5
