This article introduces the basics of Natural Language Processing (NLP) and shows, as a practical application, how to classify texts with little effort using standard Python libraries. In the following, we will go into more detail on how to:
- Extract online data,
- Apply simple steps of language analysis to the extracted speeches,
- Extract statistical information about the vocabulary used,
- Train a machine-learning classifier to assign the speeches to the respective artist,
- Evaluate and visualize the results.
What is pre-processing and why is it needed?
When we read a text, we as humans can very quickly grasp its content, meaning and context. So why does a text have to be pre-processed first, and why is this interesting at all? Keep in mind that what we perceive as individual words, sentences, smileys, etc. is, for the computer, initially nothing more than a sequence of individual characters. In this context one often speaks of “unstructured data”. A text typically does have a certain structure, but that structure is linguistic in nature and thus designed for humans, not for machines.
One purpose of pre-processing is to give this unstructured collection of characters a structure (e.g. by dividing it into subunits such as sentences and words), which makes it possible to evaluate the text automatically for different characteristics. Examples are determining the general topic of a text (text classification), judging which texts are similar (clustering), or detecting what mood a text expresses (sentiment analysis).
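To make the idea of "giving the text a structure" concrete, here is a minimal sketch that splits a raw string into sentences and then into word tokens, using only the Python standard library. The function names `split_sentences` and `tokenize` and the regular expressions are illustrative assumptions, not part of any library; real tokenizers handle abbreviations, hyphenation and similar cases far more carefully.

```python
import re

def split_sentences(text):
    """Naive sentence splitter (assumption): break after '.', '!' or '?'
    when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Naive word tokenizer (assumption): lowercase the sentence and keep
    runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

raw = "Text is unstructured. A computer sees only characters!"

# Before: one flat string of characters.
print(len(raw), "characters")

# After: a nested structure of sentences, each a list of word tokens.
structured = [tokenize(s) for s in split_sentences(raw)]
print(structured)
```

The nested list is the kind of structure that later steps, such as counting vocabulary or training a classifier, can work with directly.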
Depending on its origin, text data often contains a large amount of different content, but not all of it is of interest for a given use case. Which content is unwanted varies with the use case; typical candidates are punctuation marks, HTML tags, smileys or numbers. Furthermore, texts often pose additional challenges, such as grammatical and spelling errors, colloquial language, abbreviations, etc. Therefore, a second goal of text pre-processing is to free the text from unwanted information or to replace it with useful information.
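The cleaning step described above could be sketched as follows, again with the standard library only. The function `clean_text`, the `EMOTICONS` mapping and the sample string are hypothetical and chosen for illustration; which elements to remove or replace depends entirely on the use case.

```python
import html
import re

# Hypothetical mapping: replace emoticons with a word instead of dropping them.
EMOTICONS = {":-)": " happy ", ":-(": " sad "}

def clean_text(raw):
    """Remove or replace content that is assumed to be unwanted here."""
    text = raw
    for emoticon, word in EMOTICONS.items():
        text = text.replace(emoticon, word)   # replace with useful information
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    text = html.unescape(text)                # decode entities such as &amp;
    text = re.sub(r"\d+", " ", text)          # drop numbers
    text = re.sub(r"[^\w\s]", " ", text)      # drop punctuation marks
    return re.sub(r"\s+", " ", text).strip().lower()

sample = "<p>Great talk &amp; 100 slides :-)</p>"
print(clean_text(sample))
```

Replacing the emoticon before stripping punctuation keeps its meaning in the text, which can matter for tasks such as sentiment analysis.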
Data Processing in Python
Natural Language Libraries for Python:
- Natural Language Toolkit
NLTK is a leading platform for building Python programs to work with human language data.
- Apache OpenNLP
The Apache OpenNLP library is a machine-learning-based toolkit for the processing of natural language text. Note that OpenNLP itself is written in Java and is not a native Python library.