Essential Algorithms and Techniques in NLP – Introduction

Natural Language Processing (NLP for short) is one of the earliest directions in Artificial Intelligence. It can be traced back to Alan Turing and his article “Computing Machinery and Intelligence”, published in 1950. The famous Turing Test was proposed in that article as well. The idea of the Turing Test is simple: if a machine exhibits intelligent behavior equivalent to, or indistinguishable from, that of a human, the machine passes the test.

It should be obvious that Natural Language Processing plays a critical role in a machine’s ability to pass the test. This is because language is exclusive to humans: many animals exhibit intelligence, but only humans have developed language (for completeness, it must be noted that some other animals, dolphins for example, do communicate with each other, but the form of communication they use is not a true language). Therefore, Natural Language Processing deals with one of the hardest problems in Artificial Intelligence and is often considered AI-complete.

As a programmer, you may wonder which algorithms and techniques are used in NLP and what has been achieved so far.

Before we review actual algorithms, it must be noted that NLP itself has several directions, such as search and text extraction, discourse analysis, maintaining a dialogue, machine translation, text synthesis, speech analysis and so on. Obviously, different algorithms are required for different tasks. A review of each and every algorithm is beyond the scope of a single post and would require a book (or maybe even a series of books). Therefore, I will review the most important algorithms that you should know. This will serve as a good starting point for exploring the topic further and enriching your arsenal.

NLP tasks can be grouped into four big categories: Syntax, Semantics, Discourse and Speech.

Syntax

Syntax is concerned with the structure of a sentence. In this context, structure means word order and punctuation. The human brain needs only milliseconds to signal that “am table a I” is a wrong sentence, while “I am a table” is a right one. The reason I used this nonsensical sentence is that syntax cares only about structure, not meaning. While a human would hardly ever say “I am a table”, syntactically it is a fully correct English sentence. Unfortunately for computers, processing syntax is not that easy, and there is no single algorithm that can be used for all syntax-related tasks. This is because syntax processing covers a whole array of tasks, such as part-of-speech tagging, terminology extraction, stemming, parsing, sentence breaking, morphological segmentation and so on.
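
To make this concrete, here is a minimal part-of-speech tagging sketch in Python using the NLTK library (it assumes the NLTK tokenizer and tagger resources have already been downloaded; the tag values in the comments are indicative):

    import nltk

    # Split the sentence into tokens, then tag each token with its part of speech.
    sentence = "I am a table"
    tokens = nltk.word_tokenize(sentence)   # ['I', 'am', 'a', 'table']
    tags = nltk.pos_tag(tokens)
    print(tags)  # e.g. [('I', 'PRP'), ('am', 'VBP'), ('a', 'DT'), ('table', 'NN')]

Notice that the tagger happily handles this nonsensical sentence, because it only looks at structure, not meaning.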

Semantics

The word semantics comes from ancient Greek and means “significant”. Semantics is the study of meaning. Intuitively, it should be clear that semantics is much more complicated than syntax. In semantics, the main focus is on words, phrases, symbols and signs, and their relationships; these are collectively called signifiers.

In terms of semantic processing, we face such problems as lexical semantics, named entity recognition (NER for short), natural language generation, natural language understanding, sentiment analysis, topic segmentation and question answering. Optical Character Recognition (better known as OCR) falls into the realm of semantics as well.
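
To make one of these tasks concrete, here is a minimal named entity recognition sketch using the spaCy library (it assumes the small English model en_core_web_sm is installed; the example sentence is just an illustration):

    import spacy

    # Load a small pretrained English pipeline and run it on a sentence.
    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Maria works for the Associated Press in New York.")

    # Print every detected entity together with its predicted label.
    for ent in doc.ents:
        print(ent.text, ent.label_)   # e.g. Maria PERSON, New York GPE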

Discourse

If syntax and semantics looked difficult, expect even more difficulties ahead. Discourse defines what statements can be said about a topic; it is concerned with describing a formal way of thinking that can be expressed through language. Put simply, a discourse is a whole text about some idea, information or knowledge. The main problem with discourse is that it does not exist in a vacuum. In other words, a discourse has relations with other discourses.

There are three groups of tasks in discourse: discourse analysis, automatic summarization and coreference resolution. While syntax and semantics cover more tasks, discourse deals only with these three groups, but they are extremely complex. I will briefly expand on each of them in a moment.

Discourse Analysis

Discourse analysis is one of the most critical tasks, as it allows a machine to understand what a text is about. Without a proper system for discourse analysis, conversational agents cannot do anything useful with texts. Have you ever wondered why all chatbots and conversational agents seem so stupid? Blame poor discourse analysis.

Coreference Resolution

Let’s take the following text: David was struggling with the assignment. Maria helped him with that. Without proper coreference resolution, the machine won’t understand what the final that in the second sentence refers to. Yet our brains do this on autopilot: we immediately know that in this case that refers to David’s assignment.
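
To see why this is hard for a machine, here is a deliberately naive sketch (a toy heuristic invented purely for illustration, not a real coreference system) that links a pronoun to the nearest preceding capitalized word:

    def nearest_name_before(tokens, pronoun_index):
        """Toy heuristic: walk backwards and return the first capitalized token."""
        for i in range(pronoun_index - 1, -1, -1):
            if tokens[i][0].isupper():
                return tokens[i]
        return None

    tokens = "David was struggling with the assignment . Maria helped him with that".split()
    print(nearest_name_before(tokens, tokens.index("him")))  # prints 'Maria'

The heuristic confidently answers Maria, even though him clearly refers to David. Our brains resolve this instantly; a machine needs a far more sophisticated model.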

Automatic Summarization

This one is relatively simple to state: you have a long text and want a readable summary. The latest developments have shown very good results in automatic summarization, and many companies (for example, the Associated Press) use it successfully.
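
For intuition, here is a minimal extractive summarization sketch: it scores each sentence by the frequency of the words it contains and keeps the top-scoring ones. This is a classic baseline, not what production systems like the one at the Associated Press actually run:

    import re
    from collections import Counter

    def summarize(text, num_sentences=1):
        # Split the text into sentences and count word frequencies.
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        freq = Counter(re.findall(r"\w+", text.lower()))
        # Score each sentence by the total frequency of its words.
        score = lambda s: sum(freq[w] for w in re.findall(r"\w+", s.lower()))
        top = set(sorted(sentences, key=score, reverse=True)[:num_sentences])
        # Return the selected sentences in their original order.
        return " ".join(s for s in sentences if s in top)

    print(summarize("NLP is hard. NLP is also fun. Cats sleep a lot.", 1))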

Speech

Speech is concerned with two major groups of tasks: speech recognition and speech synthesis. Speech recognition also requires a sub-task in the form of speech segmentation. This is needed because when we speak we do not pause between words, so the main job of speech segmentation is to break the continuous speech signal down into separate words.
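
As a small illustration of the recognition side, here is a sketch using the third-party SpeechRecognition package for Python (the file name speech.wav is a placeholder, and the transcription call sends the audio to Google’s free web API):

    import speech_recognition as sr

    recognizer = sr.Recognizer()
    # Read an audio file from disk (the file name is just a placeholder).
    with sr.AudioFile("speech.wav") as source:
        audio = recognizer.record(source)

    # Transcribe the recording; word segmentation happens inside the recognizer.
    print(recognizer.recognize_google(audio))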

Speech synthesis (often called text-to-speech, or TTS for short) is the generation of speech from text. It is an extremely important task, as it allows computers to speak to us. Until very recently, all TTS systems used an extremely “robotic” voice. However, this is quickly changing, and speech synthesis systems like Ivona are demonstrating close-to-human speech capabilities. The latest advances include speech synthesis fully based on Deep Learning.
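
The synthesis side can be sketched with the pyttsx3 package, which is just a thin wrapper around the operating system’s built-in voices (so expect exactly the kind of “robotic” output described above, not Deep Learning quality):

    import pyttsx3

    # Initialize the engine that wraps the operating system's built-in voices.
    engine = pyttsx3.init()
    engine.say("Natural Language Processing is fascinating.")
    engine.runAndWait()   # blocks until the sentence has been spoken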

In the second part, I will review the actual algorithms and their purpose. Stay tuned.