Getting started with NLP using NLTK

An easy Natural Language Processing tutorial using the NLTK package in Python

Natural Language Processing (NLP) is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages.

Wondering what NLTK is? The Natural Language Toolkit, more commonly known as NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) of English, written in the Python programming language. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.

The basic tasks in NLP are:

1. convert text to lower case
2. word tokenize
3. sent tokenize
4. stop words removal
5. lemmatization
6. stemming
7. get word frequency
8. POS tags
9. NER

Prerequisites:

Install Python.

Install NLTK (for example with pip install nltk) together with the corpora and models that the examples rely on.
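One way to fetch the data used by the examples in this tutorial is sketched below. The resource identifiers are assumptions based on commonly used NLTK releases; newer releases may expect differently named packages (for example punkt_tab instead of punkt), in which case the LookupError raised by NLTK names the package to download. The helper name download_resources is illustrative and not part of NLTK.

```python
# Resource identifiers assumed for the examples in this tutorial;
# adjust them if your NLTK version reports a different name.
RESOURCES = (
    "punkt",                       # word and sentence tokenizers
    "stopwords",                   # stop word lists
    "wordnet",                     # lexical database used by WordNetLemmatizer
    "averaged_perceptron_tagger",  # part-of-speech tagger model
    "maxent_ne_chunker",           # named entity chunker model
    "words",                       # word list used by the chunker
)


def download_resources(resources=RESOURCES):
    """Download each resource and return a dict of id -> success flag."""
    import nltk  # imported here so the module loads even before NLTK is installed

    return {name: nltk.download(name, quiet=True) for name in resources}
```

Calling download_resources() once after installing NLTK should be enough; any entry mapped to False was not downloaded and may need a different identifier.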

Examples:

1. import nltk

Import nltk in order to use its functions.

import nltk

2. convert text to lower case:

Convert the text to lower case first. String comparison in Python is case sensitive, so without this step "Demo" and "demo" would be treated as two different words.

text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
lower_text = text.lower()
print(lower_text)
[OUTPUT]: this is a demo text for nlp using nltk. full form of nltk is natural language toolkit
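The effect of case sensitivity can be sketched with the standard library alone. The sentence below is a made-up example, and collections.Counter stands in for any later counting or comparison step.

```python
from collections import Counter

# Hypothetical example sentence with the same words in two casings.
tokens = "The cat saw the other Cat".split()

# Without lowercasing, "The"/"the" and "cat"/"Cat" are counted separately.
raw_counts = Counter(tokens)

# After lowercasing, each pair collapses into a single entry.
lower_counts = Counter(token.lower() for token in tokens)

print(len(raw_counts))      # 6 distinct tokens
print(len(lower_counts))    # 4 distinct tokens
print(lower_counts["cat"])  # 2
```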

3. word tokenize

Word tokenization breaks the text into tokens, i.e. individual words and punctuation marks.

text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
word_tokens = nltk.word_tokenize(text)
print(word_tokens)
[OUTPUT]: ['This', 'is', 'a', 'Demo', 'Text', 'for', 'NLP', 'using', 'NLTK', '.', 'Full', 'form', 'of', 'NLTK', 'is', 'Natural', 'Language', 'Toolkit']
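To see why a tokenizer is used instead of plain whitespace splitting, the following standard-library sketch compares str.split with a rough regular expression. The pattern is an assumption made for illustration and is not the algorithm behind nltk.word_tokenize; it happens to give the same tokens for this sentence but will differ on contractions, abbreviations and similar cases.

```python
import re

text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"

# Whitespace splitting leaves the full stop glued to the preceding word.
whitespace_tokens = text.split()
print(whitespace_tokens[8])  # NLTK.

# A rough pattern: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)
print(regex_tokens[8:10])  # ['NLTK', '.']
```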

4. sent tokenize

Sentence tokenization splits a text that contains more than one sentence into a list of sentences.

text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
sent_token = nltk.sent_tokenize(text)
print(sent_token)
[OUTPUT]: ['This is a Demo Text for NLP using NLTK.', 'Full form of NLTK is Natural Language Toolkit']
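A naive rule-based splitter, sketched below with the standard library, gives the same result for this text but also shows where simple rules break down. The regular expression is an illustrative assumption, not what nltk.sent_tokenize does internally; NLTK relies on a trained Punkt model that is designed to cope with cases such as abbreviations.

```python
import re

text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"

# Split on whitespace that follows a full stop, question mark or exclamation mark.
naive_sentences = re.split(r"(?<=[.!?])\s+", text)
print(naive_sentences)

# The same rule wrongly splits after an abbreviation.
abbreviation_split = re.split(r"(?<=[.!?])\s+", "Dr. Smith uses NLTK.")
print(abbreviation_split)  # ['Dr.', 'Smith uses NLTK.']
```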

5. stop words removal

Remove stop words such as "is", "the" and "a" using NLTK's stop word list, as they carry little information. The list is all lower case, so the capitalized "This" is not matched and stays in the output.

import nltk
from nltk.corpus import stopwords
# list of English stop words (all lower case)
stopword = stopwords.words('english')
text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
word_tokens = nltk.word_tokenize(text)
# keep only the tokens that are not in the stop word list
removing_stopwords = [word for word in word_tokens if word not in stopword]
print(removing_stopwords)
[OUTPUT]: ['This', 'Demo', 'Text', 'NLP', 'using', 'NLTK', '.', 'Full', 'form', 'NLTK', 'Natural', 'Language', 'Toolkit']
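The output above still contains 'This' and '.', because the comparison is case sensitive and punctuation is not a stop word. A possible variant that compares in lower case and drops punctuation is sketched below. To keep the sketch self-contained it uses a tiny hand-picked stop word set (an assumption for illustration) and the token list from the word tokenization step; in practice the set would be built from stopwords.words('english').

```python
import string

# Tiny hand-picked stop word set, for illustration only.
STOP_WORDS = {"this", "is", "a", "for", "of", "the"}
PUNCTUATION = set(string.punctuation)

# Tokens as produced by the word tokenization step.
word_tokens = ['This', 'is', 'a', 'Demo', 'Text', 'for', 'NLP', 'using', 'NLTK', '.',
               'Full', 'form', 'of', 'NLTK', 'is', 'Natural', 'Language', 'Toolkit']

filtered = [
    word for word in word_tokens
    if word.lower() not in STOP_WORDS  # compare in lower case
    and word not in PUNCTUATION        # drop single punctuation marks
]
print(filtered)
```

Using a set rather than a list for the stop words also makes each membership test faster, which matters on longer texts.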

6. lemmatization

Lemmatize the text to reduce each word to its dictionary form (lemma), e.g. "functions" becomes "function". By default WordNetLemmatizer treats every word as a noun, which is why "are" and "barking" are left unchanged in the output.

import nltk
from nltk.stem import WordNetLemmatizer
# looks words up in the WordNet lexical database
wordnet_lemmatizer = WordNetLemmatizer()
text = "the dogs are barking outside. Are the cats in the garden?"
word_tokens = nltk.word_tokenize(text)
lemmatized_word = [wordnet_lemmatizer.lemmatize(word) for word in word_tokens]
print(lemmatized_word)
[OUTPUT]: ['the', 'dog', 'are', 'barking', 'outside', '.', 'Are', 'the', 'cat', 'in', 'the', 'garden', '?']

7. stemming

Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form. The stem does not have to be a real word, as "natur" and "languag" in the output show.

import nltk
from nltk.stem import SnowballStemmer
# the English Snowball stemmer is a refinement of the Porter stemming algorithm
snowball_stemmer = SnowballStemmer('english')
text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
word_tokens = nltk.word_tokenize(text)
stemmed_word = [snowball_stemmer.stem(word) for word in word_tokens]
print(stemmed_word)
[OUTPUT]: ['this', 'is', 'a', 'demo', 'text', 'for', 'nlp', 'use', 'nltk', '.', 'full', 'form', 'of', 'nltk', 'is', 'natur', 'languag', 'toolkit']

8. Get word frequency

Count how often each word occurs using NLTK's FreqDist class.

import nltk
from nltk import FreqDist
text = "This is a Demo Text for NLP using NLTK. Full form of NLTK is Natural Language Toolkit"
# lower-case the text first so that counting is case-insensitive
word = nltk.word_tokenize(text.lower())
freq = FreqDist(word)
# the five most frequent tokens with their counts
print(freq.most_common(5))
[OUTPUT]: [('is', 2), ('nltk', 2), ('this', 1), ('a', 1), ('demo', 1)]
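FreqDist is a subclass of Python's collections.Counter, so the counting itself can be sketched with the standard library. The token list below is the lower-cased example sentence written out by hand, so that the sketch runs without NLTK.

```python
from collections import Counter

# Lower-cased tokens of the example sentence.
tokens = ['this', 'is', 'a', 'demo', 'text', 'for', 'nlp', 'using', 'nltk', '.',
          'full', 'form', 'of', 'nltk', 'is', 'natural', 'language', 'toolkit']

counts = Counter(tokens)
print(counts.most_common(2))  # [('is', 2), ('nltk', 2)]
print(counts['toolkit'])      # 1
print(counts['python'])       # 0, missing keys count as zero
```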

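9. POS (part-of-speech) tags

POS tagging labels each word with its part of speech, such as noun, verb or adjective. NLTK's default tagger uses the Penn Treebank tag set, in which, for example, NNS is a plural noun and VBG is a gerund or present participle.

import nltk
text = "the dogs are barking outside."
word = nltk.word_tokenize(text)
pos_tag = nltk.pos_tag(word)
print(pos_tag)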

[OUTPUT]: [('the', 'DT'), ('dogs', 'NNS'), ('are', 'VBP'), ('barking', 'VBG'), ('outside', 'IN'), ('.', '.')]
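Tags like these are often mapped to the part-of-speech letters that WordNet uses ('n', 'v', 'a', 'r'), so that the lemmatizer from step 6 can be told which words are verbs or adjectives. A minimal sketch of such a mapping follows; the function name penn_to_wordnet is illustrative, and the mapping is deliberately coarse (everything that is not an adjective, verb or adverb is treated as a noun).

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet part-of-speech letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback


# Tagged tokens copied from the output above.
tagged = [('the', 'DT'), ('dogs', 'NNS'), ('are', 'VBP'),
          ('barking', 'VBG'), ('outside', 'IN'), ('.', '.')]

for word, tag in tagged:
    print(word, penn_to_wordnet(tag))
```

The returned letter can be passed as the pos argument of the lemmatizer, e.g. wordnet_lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)), so that a word such as "barking" is looked up as a verb instead of a noun.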

10. NER

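NER (Named Entity Recognition) is the process of finding the named entities in a text, such as people, organizations and places.

import nltk
text = "who is Barrack Obama"
word = nltk.word_tokenize(text)
pos_tag = nltk.pos_tag(word)
# ne_chunk groups the tagged tokens into a tree; named entities are subtrees
chunk = nltk.ne_chunk(pos_tag)
# join the words of each named-entity subtree into a single string
NE = [" ".join(w for w, t in ele) for ele in chunk if isinstance(ele, nltk.Tree)]
print(NE)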
[OUTPUT]: ['Barrack Obama']

PS: Run all of the code above and, ta-da, you know the basics of NLP!

You can also try some mini projects like:

  1. Extracting keywords from documents and articles.
  2. Generating part-of-speech tags for phrases.
  3. Getting the top used words among all documents.

You can also check out: NLP for Beginners using spaCy


Getting started with NLP using NLTK was originally published in Becoming Human: Artificial Intelligence Magazine on Medium.

Via https://becominghuman.ai/nlp-for-beginners-using-nltk-f58ec22005cd?source=rss—-5e5bef33608a—4

source https://365datascience.weebly.com/the-best-data-science-blog-2020/getting-started-with-nlp-using-nltk

Published by 365Data Science
