nltk: How to Remove Stop words in Python

This tutorial shows how to remove stop words using nltk in Python. Stop words are common words that carry little information, such as prepositions (“to”, “with”), articles (“an”, “a”, “the”), or conjunctions (“and”, “or”, “but”). We first need to import the required packages and download the stop word lists and the tokenizer data.

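import nltk
from nltk.tokenize import word_tokenize

# download the stop word lists and the tokenizer data (only needed once)
nltk.download('stopwords')
nltk.download('punkt')
# recent NLTK releases may also ask for: nltk.download('punkt_tab')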

We can then load the English stop word list. Before removing stop words, we need to tokenize the sentence, that is, break it into separate words and punctuation marks.

stopwords = nltk.corpus.stopwords.words('english')
example_sentence = "Britain's Andy Murray will face old rival Novak Djokovic for the first time since 2017 after beating Canada's Denis Shapovalov at the Madrid Open."

# tokenize the sentence
word_tokens = word_tokenize(example_sentence)
# print them out
print(word_tokens)
['Britain', "'s", 'Andy', 'Murray', 'will', 'face', 'old', 'rival', 'Novak', 'Djokovic', 'for', 'the', 'first', 'time', 'since', '2017', 'after', 'beating', 'Canada', "'s", 'Denis', 'Shapovalov', 'at', 'the', 'Madrid', 'Open', '.']

We can see that there are some words, such as “will”, “for”, and “the”, which can be considered stop words. We can then remove them.

# remove stop words
filtered_sentence = [w for w in word_tokens if w not in stopwords]
# print them out
print(filtered_sentence)
['Britain', "'s", 'Andy', 'Murray', 'face', 'old', 'rival', 'Novak', 'Djokovic', 'first', 'time', 'since', '2017', 'beating', 'Canada', "'s", 'Denis', 'Shapovalov', 'Madrid', 'Open', '.']
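One thing to keep in mind is that the filter above compares each token exactly as written, while the nltk English stop words are all lower-case, so a capitalized “The” at the start of a sentence would not be removed. Below is a minimal sketch of a case-insensitive filter. The helper name remove_stop_words and the small stop_set are assumptions made up for illustration, so that the example runs without downloading anything; in practice you would pass in the nltk list instead.

```python
# Hypothetical, hand-picked stop words so the sketch runs without NLTK data;
# in this tutorial the list comes from nltk.corpus.stopwords.words('english').
stop_set = {"the", "for", "will", "at", "after"}

def remove_stop_words(tokens, stop_words):
    """Drop every token whose lower-cased form is in stop_words."""
    return [t for t in tokens if t.lower() not in stop_words]

tokens = ["The", "match", "will", "start", "after", "lunch"]
print(remove_stop_words(tokens, stop_set))
# ['match', 'start', 'lunch']
```

Using a set rather than a list also makes each membership check faster, which helps when you filter long texts.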

While “will”, “for”, and “the” have been removed, “'s” and “.” are still there, and we do not really need them. The stop word list does not include “'s” or “.”, which is why they remain in filtered_sentence. We can manually add them to the list of stop words.

# additional stop words
new_words = ("'s", '.')

# append them into the list of stopwords
for i in new_words:
    stopwords.append(i)
filtered_sentence = [w for w in word_tokens if w not in stopwords]
print(filtered_sentence)
['Britain', 'Andy', 'Murray', 'face', 'old', 'rival', 'Novak', 'Djokovic', 'first', 'time', 'since', '2017', 'beating', 'Canada', 'Denis', 'Shapovalov', 'Madrid', 'Open']

Now we can see that “'s” and “.” are gone.
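Adding punctuation marks one by one can get tedious when the text contains commas, question marks, and so on. As an alternative, here is a minimal sketch that drops every token made up only of punctuation characters, using string.punctuation from the Python standard library. The helper name drop_punctuation and the short token list are assumptions for illustration, not part of nltk.

```python
import string

def drop_punctuation(tokens):
    """Keep only tokens that contain at least one non-punctuation character."""
    return [t for t in tokens
            if not all(ch in string.punctuation for ch in t)]

# hypothetical tokens, similar to what word_tokenize returns
tokens = ['Madrid', 'Open', '.', ',', '2017', "'s"]
print(drop_punctuation(tokens))
# ['Madrid', 'Open', '2017', "'s"]
```

Note that “'s” is kept, because it contains the letter “s”. It still has to be added to the stop word list manually, as we did above.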

For more information on how to write a for loop, please refer to my other tutorial. You can also see my tutorial on how to define a list.