A love letter to NLTK

Natural Language Toolkit, or NLTK, is a leading Python library for natural language processing (NLP). “Why dedicate an entire blog post to it?”, I hear you ask. Well, it’s a leader for a reason. Think of this as more of a love letter.

In a legal setting, technologies using NLP have been thriving in recent years. As we’ve discussed in a previous blog post, NLP is used to analyse text data and can therefore help with a number of tasks, including:

- Contract analysis

- Automation and summarisation of legal writing

- Information extraction for e-discovery

These tasks can be complex enough, so anything that helps us to save time and streamline processes is welcome. And NLTK was built for that purpose.

I was first introduced to NLTK when completing a sentiment analysis project which looked to determine the general opinion of tweets related to feminist issues. Like the majority of projects involving large quantities of data, an extraordinary amount of time was dedicated to preparing and cleaning the data before I could even begin to consider classification models.

Luckily, NLTK came to the rescue, along with the 50-plus corpora and lexical resources it provides access to.

One of these is WordNet, which can be imported as part of the NLTK module. WordNet is a lexical database of the English language created at Princeton University. It can be used to find synonyms, antonyms, and word meanings, and it works by grouping semantically close words into sets of synonyms called synsets. In the case of my sentiment analysis project, WordNet allowed me to build a list of Twitter hashtags that fully encompassed the subject of feminism, ensuring I was working with appropriate and objective terms. Applying each of the resulting hashtags as a search query gave an effective representation of the subject, helping to reduce selection bias.

And the support NLTK provides for NLP tasks doesn’t stop there.

Of course, one of the first aspects we need to consider in the preprocessing stage is tokenisation, which involves splitting text into individual tokens (words and punctuation marks) so each can be recognised and processed on its own. NLTK implements this pattern recognition, identifying the words in a string and splitting them into tokens. Following this, we can also count on NLTK to help us remove those pesky stop words. As the text has already been tokenised, words with little meaning, such as pronouns and prepositions, are easily identified and removed. This step is essential to reduce the word count, and therefore the processing time of training a model later.

NLTK also lets us apply stemming, which strips suffixes from words, as well as the arguably superior method, lemmatisation, which considers the morphological analysis of the word and returns its dictionary form. For example, applying lemmatisation to the word ‘going’ returns ‘go’. Of course, NLTK’s lemmatiser uses the WordNet database to do this.

As if that weren’t enough, NLTK also lets us perform part-of-speech tagging, named entity recognition, and text classification with various models, such as Naïve Bayes.
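As a tiny illustration of that last one, here’s a self-contained sketch of NLTK’s built-in Naïve Bayes classifier trained on a toy, hand-made feature set; the feature names and labels are invented for the example:

```python
from nltk.classify import NaiveBayesClassifier

# Toy training data: each example pairs a feature dict with a label.
train = [
    ({'contains_great': True,  'contains_poor': False}, 'pos'),
    ({'contains_great': True,  'contains_poor': False}, 'pos'),
    ({'contains_great': False, 'contains_poor': True},  'neg'),
    ({'contains_great': False, 'contains_poor': True},  'neg'),
]

classifier = NaiveBayesClassifier.train(train)

# Classify an unseen feature dict.
print(classifier.classify({'contains_great': True, 'contains_poor': False}))
```

In a real project, the feature dicts would come from the preprocessed tokens, for example flags for which informative words a tweet contains.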

As we can see, NLTK takes some of the pressure off when completing a variety of preprocessing tasks. It’s well-known and loved in the world of Python NLP, and it’s not hard to see why. Its numerous capabilities make it user-friendly and a great tool to work with, whether you’re working on a serious project or simply playing around with natural language for the first time.
