So you have a text problem. Maybe you need to detect fraud and the only data you have is user messages, or maybe you have a forum that is being flooded with spam posts, or maybe you’d like to find positive or negative comments about your business on Twitter. Either way, you’ve got a free text problem that needs solving, and doing it manually is a drain.
These sorts of problems can be solved (at least partially) using Natural Language Processing (NLP). In this guide I’ll present a brief overview of the major steps you’ll need to take to get started with NLP. This guide generally assumes that you have some programming knowledge, though I’ll largely be discussing the process of developing your system, rather than the technical details.
Step 1: Identify and acquire your data
The first step to solving your NLP problem is to find and organise all of your data. I’ll take it as given that you have at least some text, as you’re reading this guide. You’ll need to start by breaking your text into the individual units that you want to classify, and by storing those units somewhere that’s easy for you to access. This might seem obvious, but actually getting all of your text in one place will often take up a surprising amount of engineering time.
It’s also important to identify what other information you have about the problem. For example, if you’re trying to identify malicious users based on chat logs, you’ll want to look at the age of the user’s account, the number of interactions they’ve had on your site, the IPs they log in from, and so on. These bits of information should be stored with the units of text that you want to classify, again in an easily accessible format. This information can often end up being more important than the text itself. In fact, if you’re doing simple Naive Bayes classification of emails as spam or not spam, you’ll often find that the data in the email headers is more powerful than the actual email body.
I would highly recommend building a simple script that can take your raw data and transform it into the storage format that you decide on. It will make it much easier to acquire more data later, and to integrate the functionality into your eventual system.
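As a rough sketch of what such a script might look like (the record fields, file name, and chat-log shape here are all hypothetical stand-ins for whatever your raw data actually contains):

```python
import csv

def normalise_record(raw):
    """Map one raw record (here, a made-up chat-log dict) onto the
    fields we want to store alongside each unit of text."""
    return {
        "text": raw["message"],
        "user_id": raw["user"],
        "account_age_days": raw.get("account_age_days", 0),
    }

def export_to_csv(raw_records, path):
    """Dump the normalised records to a CSV file, one unit of text per row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["text", "user_id", "account_age_days"])
        writer.writeheader()
        for raw in raw_records:
            writer.writerow(normalise_record(raw))

# Example usage with a couple of invented records:
raw = [
    {"message": "Buy cheap watches now!!!", "user": "u1", "account_age_days": 1},
    {"message": "Great article, thanks.", "user": "u2", "account_age_days": 420},
]
export_to_csv(raw, "units.csv")
```

The point is less the exact format than having one script that owns the raw-data-to-storage transformation, so you can rerun it whenever new data arrives.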
Step 2: Annotate the data
No one ever wants to do this step. All it involves is sitting down and manually doing the task that you want your system to do. Easy, right? The thing is, engineers seem willing to spend days and weeks building rule-based approximations, features, bespoke models, or cleaning raw data, but are almost completely unwilling to spend a single day (or maybe an hour a day for a week) actually looking at and manually classifying their data!
This sucks, as actually annotating the data gives you three very important things:
You’ll get a much better sense of the problem (and you’ll probably find some edge cases you hadn’t thought of)
You’ll start to notice patterns in your data that you can use as rules or features
You will be able to test your system once it’s built
Actually annotating the data is pretty simple. Get a subset of the textual data you collected in step one, dump it in a spreadsheet, read each piece of text, and then make the decision that you want your system to make. It’s easy (though usually very boring)! Generally speaking, you’ll want a few hundred annotated pieces of text, though the exact number will depend on the problem you’re trying to solve.
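If you want to script the setup, here’s a minimal sketch that samples some units of text and writes a spreadsheet-ready CSV with an empty label column for you to fill in (the sample size, seed, and file name are arbitrary choices):

```python
import csv
import random

def make_annotation_sheet(texts, path, sample_size=200, seed=42):
    """Take a random sample of text units and write them to a CSV with an
    empty 'label' column, ready to fill in by hand in a spreadsheet."""
    random.seed(seed)  # fixed seed so the sample is reproducible
    sample = random.sample(texts, min(sample_size, len(texts)))
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "label"])
        for text in sample:
            writer.writerow([text, ""])

# Example usage with placeholder text units:
texts = [f"message number {i}" for i in range(1000)]
make_annotation_sheet(texts, "to_annotate.csv")
```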
Step 3: Clean the data
Text data -- especially text data produced by real users in the wild -- tends to be very messy in a lot of weird and unexpected ways. As such, cleaning the data is a very non-trivial task, and is absolutely essential to the success of your system. The first thing you’ll need to do here is to decide on a text encoding, and to make sure all of your text conforms to that standard. I’d recommend UTF-8, as it enjoys broad support, and is fairly common across the web. If you find characters that are mangled or can’t be encoded, just throw them away.
Next, you’ll need to make sure that there aren’t rogue elements, like HTML tags or escaped newline characters (“\n”) or other strange things in your text. In most applications you’ll want to throw these things away; however, in some instances (like fraud detection), you may want to replace instances of HTML with a placeholder token that indicates HTML was present. This means that your system can use the fact that HTML existed in the text, without breaking your n-gram features (more on these later).
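A minimal cleaning function along these lines might look like the following (the `_HTML_` placeholder token and the tag regex are illustrative choices, not a standard):

```python
import re

HTML_TAG = re.compile(r"<[^>]+>")  # crude match for anything tag-shaped

def clean_text(raw_bytes, keep_html_marker=True):
    """Decode to UTF-8, discarding characters that can't be decoded, then
    replace any HTML tags with a placeholder token (or drop them)."""
    text = raw_bytes.decode("utf-8", errors="ignore")
    # Escaped newlines often arrive as the literal two characters "\n".
    text = text.replace("\\n", " ")
    marker = " _HTML_ " if keep_html_marker else " "
    text = HTML_TAG.sub(marker, text)
    # Collapse the extra whitespace introduced by the substitutions above.
    return re.sub(r"\s+", " ", text).strip()

print(clean_text(b"Hello <b>world</b>"))  # -> "Hello _HTML_ world _HTML_"
```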
At this point you might be tempted to start throwing away normal pieces of the text that you think might make it harder for a computer, such as newlines, punctuation, and sentence boundaries. Don’t! These parts of the text can often make a big difference to your system, and they are crucial for many NLP tools to work properly.
Again, I would strongly recommend you build this part of your system as a separate script from the beginning. Having a pipeline of scripts, each with its own inputs and outputs, can help you debug your system, and will make it faster to fix issues when they occur, as you won’t need to rerun the full pipeline.
Step 4: Preprocess the data
This is where you’ll start doing things that actually tend to be called NLP. For almost any problem, you are going to want to tokenise and Part-of-speech tag your text. Tokenisation simply splits your text into words, or tokens. It will initially look a little strange as your text might go from:
Freelancer’s articles aren’t boring!
to:
Freelancer ‘s articles are n’t boring !
This might look like the tokenisation system has screwed up and garbled your text, but it’s actually splitting the text into tokens that better fit linguistic theory, and are better suited to getting practical results.
Once you’ve tokenised your data, you’ll almost always want to Part-of-speech tag it as well. Roughly speaking, a POS tagger will label each token as a noun, a verb, an adjective, or an adverb. In practice, rather than those four simple categories, there are 30 or more, depending on the POS scheme used. Fortunately, to use POS tags you don’t necessarily need to understand them (although it helps).
Most programming languages have some form of tokeniser and POS-tagger. Personally I use Python and NLTK, as it is fairly straightforward to get started. You can find a straightforward example of using NLTK to tokenise and POS tag text on the NLTK homepage (http://www.nltk.org/). Again, I’d recommend building this part of your system as a separate script, with its own input and output.
Step 5: Build and test your system
This is the final step, and the one where you probably expect to spend the most time. That isn’t necessarily true if you’ve done a good job of the previous steps. The approach you’ll want to take here is to first build a simple rule-based system that performs the task you’re interested in. There are two reasons for this. First, you can get a working system pretty quickly. Second, many businesses shy away from machine learning systems, as the output is a little unpredictable. With machine learning you may not always have a great answer as to why the system did what it did, but with simple rules you’ll always know.
Once you’ve built your rule-based system, take a third of the data you annotated in step 2, and run the system over it. Now you can count the number of instances where it is right and wrong and work out some evaluation metrics. The metric you use will depend on the problem. In cases of fraud, you’ll probably need to find every single instance, so you’ll care most about recall: whether the system found all of the fraud, without worrying too much about false positives. For spam classifiers you probably don’t want to mark legitimate text as spam, so you’ll care most about precision: making sure that what the system flags really is spam, while still catching most of it. If your rule-based system does well on your chosen metric then congratulations! The problem is solved and you can now go relax!
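To make this concrete, here’s a toy sketch of a rule-based spam detector and a precision/recall calculation over annotated data (the trigger words are placeholders for whatever rules you actually write):

```python
def rule_based_spam(text):
    """A toy rule-based spam detector: flag text containing any of a few
    hand-picked trigger words (stand-ins for your real rules)."""
    triggers = ("free", "winner", "click here")
    lowered = text.lower()
    return any(t in lowered for t in triggers)

def evaluate(predictions, gold):
    """Compute precision (how much of what we flagged was really spam) and
    recall (how much of the real spam we found)."""
    tp = sum(1 for p, g in zip(predictions, gold) if p and g)
    fp = sum(1 for p, g in zip(predictions, gold) if p and not g)
    fn = sum(1 for p, g in zip(predictions, gold) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example usage on a tiny made-up held-out set:
texts = ["Click here for a FREE prize", "Meeting moved to 3pm",
         "You are a winner!", "Lunch tomorrow?"]
gold = [True, False, True, False]
preds = [rule_based_spam(t) for t in texts]
print(evaluate(preds, gold))  # -> (1.0, 1.0) on this toy data
```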
If not, then it’s time to break out the Machine Learning (ML) algorithms. I won't go into a huge amount of detail here, as it would probably double the length of the article, but I will give you the information you need to explore further.
In order to make use of ML, you’ll first need to build a set of features that your ML algorithm can use to distinguish the things that you’re trying to identify (e.g. spam vs. not spam). For most NLP tasks you can start pretty well with what’s called a Bag-of-words model. It will contain the unigrams (and potentially bigrams) of both your tokens and your POS tags. To explain this with a simple example, suppose we have the (tokenised) example sentence from before:
Freelancer ‘s articles are n’t boring !
The unigrams here are the individual tokens, so the unigrams from that sentence will be:
Freelancer, ‘s, articles, are, n’t, boring, !
Each of these unigrams can then be used as a feature in your ML system, with an equivalent set for the POS tags. You can also use the output of your rule-based system and any metadata you identified in step 2 as extra features. As I said earlier, you’ll tend to find that these features add quite a bit of power to your model, so I’d highly recommend them.
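A bag-of-words feature dict is easy to build by hand; here’s one way to sketch it, counting unigrams and (optionally) bigrams over the tokenised sentence from before:

```python
from collections import Counter

def bag_of_words(tokens, use_bigrams=True):
    """Turn a token list into a bag-of-words feature dict:
    unigram counts, plus bigram counts if requested."""
    features = Counter(tokens)
    if use_bigrams:
        for a, b in zip(tokens, tokens[1:]):
            features[f"{a} {b}"] += 1
    return dict(features)

tokens = ["Freelancer", "'s", "articles", "are", "n't", "boring", "!"]
print(bag_of_words(tokens))
```

The same function applied to the POS tag sequence gives you the equivalent POS features.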
Once your features are built you’re ready to choose your classifier. The best starter ML algorithms for NLP are Naive Bayes and logistic regression. Both are simple enough that you could implement them yourself, but I’d recommend using one of the many ML packages out there. Personally, I use scikit-learn (http://scikit-learn.org/stable/), as it’s fast and quite easy to use.
Whatever you decide to use for a classifier, you should now be able to train it on two thirds of your annotated data. You can then use the remaining third to test the output of the classifier (use the same third that you used with your rule-based system). Generally speaking, if your system makes the right decision in 70-90% of cases, then it is working correctly. If it’s less than that, or suspiciously close to 100%, then you may have made a mistake somewhere. If not, then congrats! Your system is working and can start saving you time and effort.
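Putting the pieces together, here’s a minimal scikit-learn sketch of the train/test split described above (the six-example dataset is obviously made up; in practice you’d use your few hundred annotated examples from step 2):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented annotated dataset: 1 = spam, 0 = not spam.
texts = ["free money click here", "meeting at three",
         "win a free prize now", "see you at lunch",
         "click here to win money", "see you at the meeting"]
labels = [1, 0, 1, 0, 1, 0]

# Train on two thirds, hold out the remaining third for testing.
split = len(texts) * 2 // 3

# CountVectorizer builds the bag-of-words features for us.
vectoriser = CountVectorizer()
X_train = vectoriser.fit_transform(texts[:split])
X_test = vectoriser.transform(texts[split:])

clf = MultinomialNB()
clf.fit(X_train, labels[:split])
predictions = clf.predict(X_test)
accuracy = (predictions == labels[split:]).mean()
print(accuracy)
```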