Special topics in machine learning

Previously, we learned about two general areas of machine learning: supervised and unsupervised learning. Here, we'll investigate two special fields of machine learning: time series forecasting and natural language processing (NLP).

Time Series Forecasting

Time series forecasting refers to any type of supervised machine learning where time is an important feature.

A good time series forecast will account for recent behaviour as well as weekly, monthly, or yearly trends.
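As a rough illustration of combining recent behaviour with a weekly pattern, here is a minimal sketch of a baseline forecast. The sales figures, the `forecast_next` helper, and the rule it uses (last week's value plus the recent week-over-week drift) are all assumptions made for this example, not a method prescribed by these notes.

```python
# Hypothetical daily sales for three weeks (Monday to Sunday), invented for
# illustration: a repeating weekly shape plus slow week-over-week growth.
weekly_shape = [50, 52, 51, 53, 60, 80, 75]
sales = [weekly_shape[i % 7] + 2 * (i // 7) for i in range(21)]


def forecast_next(values, period=7):
    """Forecast the next value from a seasonal value plus recent drift.

    Seasonal part: the value observed exactly one period ago.
    Recent part: how much the latest period's mean moved versus the one before.
    """
    if len(values) < 2 * period:
        raise ValueError("need at least two full periods of history")
    seasonal_value = values[-period]
    latest_total = sum(values[-period:])
    previous_total = sum(values[-2 * period:-period])
    drift = (latest_total - previous_total) / period
    return seasonal_value + drift


next_monday = forecast_next(sales)
print(next_monday)
```

The seasonal term keeps the forecast on the right part of the weekly cycle, while the drift term lets it follow the most recent change in level. Real forecasting models are usually more elaborate than this baseline.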

Seasonality

Time series forecasting can help us catch periodic events known as seasonality. Seasonality can happen on any time scale. For example, television viewership drops on weekends because many folks choose to go out rather than stay in and watch TV. This is a weekly trend. Certain spending can spike at the end of the month when people receive a paycheck. This is a monthly trend. Ice cream sales are lower in the winter because people don't like to eat cold food when it's cold outside. This is an annual trend.
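One simple way to spot a weekly pattern like the TV example is to average the series by day of the week. The sketch below does this with the standard library only. The viewership numbers and the `average_by_weekday` helper are hypothetical and exist only to make the idea concrete.

```python
from collections import defaultdict
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical daily TV viewership (in thousands) for four weeks.
# The numbers are invented: weekdays are high, weekends are low.
start = date(2024, 1, 1)  # a Monday
weekly_pattern = [100, 102, 101, 103, 98, 70, 72]  # Monday to Sunday
viewership = [weekly_pattern[i % 7] + (i // 7) for i in range(28)]


def average_by_weekday(start_day, values):
    """Return the mean value for each weekday name."""
    grouped = defaultdict(list)
    for offset, value in enumerate(values):
        day = start_day + timedelta(days=offset)
        grouped[DAY_NAMES[day.weekday()]].append(value)
    return {name: sum(v) / len(v) for name, v in grouped.items()}


weekday_means = average_by_weekday(start, viewership)
for name in DAY_NAMES:
    print(f"{name:<9} {weekday_means[name]:.1f}")
```

In this toy data the Saturday and Sunday averages sit well below the weekday averages, which is the weekly seasonality described above. The same grouping idea works for months or for weeks of the year when looking for monthly or annual patterns.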

Natural Language Processing

NLP refers to any machine learning problem where the dataset is text. Possible inputs include customer reviews, Tweets, medical records, or email subjects.

Understanding text is difficult to define and more difficult to do in practice, but NLP can accomplish many simpler tasks, such as classifying the sentiment of customer reviews or clustering medical records with similar pathologies.

Successful NLP depends on having a specific question and creating a good set of features from the input text. Previously, the features for our machine learning problems have been numbers or categories. What do we do when our data is text?

Word Counts

A simple option is to count the number of times important words appear in each piece of text. Suppose we wanted to analyse the following two sentences: "KKR is a great cricket team" and "RCB is a great cricket team". We might end up with the word counts shown in the table.
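A minimal sketch of how such counts could be computed for the two example sentences is shown here. It lowercases the text and splits on whitespace, which is an assumed, very simple tokenisation. It also counts every word rather than only the "important" ones, since the notes do not say which words were kept.

```python
from collections import Counter

sentences = [
    "KKR is a great cricket team",
    "RCB is a great cricket team",
]


def word_counts(text):
    """Count how often each lowercase, whitespace-separated word appears."""
    return Counter(text.lower().split())


counts = [word_counts(s) for s in sentences]

# The vocabulary is every distinct word seen in any sentence.
vocabulary = sorted(set().union(*counts))

# One row of features per sentence, one column per vocabulary word.
rows = [[c[word] for word in vocabulary] for c in counts]

print("sentence".ljust(30), *vocabulary)
for sentence, row in zip(sentences, rows):
    print(sentence.ljust(30), *row)
```

Each sentence becomes a row of numbers, so text can be fed to the same kinds of models that expect numeric features. Here the two rows differ only in the "kkr" and "rcb" columns.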

Although word counts are commonly used in NLP, they have a few obvious limitations.

Problems with word counts: negation

First, word counts don't take negation into account. Consider the sentence "KKR is not a great cricket team". Although "great" is present in this sentence, the word "not" reverses its meaning, so the sentence does not actually say the team is great.
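The negation problem can be seen directly in the counts. In this small sketch, which assumes the same simple lowercase, whitespace-split counting, the positive and negated sentences get an identical count for "great".

```python
from collections import Counter

positive = "KKR is a great cricket team"
negated = "KKR is not a great cricket team"


def word_counts(text):
    """Count how often each lowercase, whitespace-separated word appears."""
    return Counter(text.lower().split())


positive_counts = word_counts(positive)
negated_counts = word_counts(negated)

# Both sentences have the same count for "great".
print(positive_counts["great"], negated_counts["great"])

# The only difference between the two feature sets is the count of "not",
# and nothing in the counts links "not" to "great".
difference = negated_counts - positive_counts
print(difference)
```

A model that sees only these counts has to learn the effect of "not" as a separate feature, with no information about which word it modifies.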

Word counts and Synonyms

Another problem is that word counts treat synonyms as unrelated words. For example, many words describe the colour blue, such as "sky-blue", "aqua" and "cerulean". Ideally, we would like to group these as a single feature.
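To make the synonym problem concrete, the sketch below counts words in two sentences that differ only in which shade of blue they mention, first as-is and then with a hand-written synonym map. The `SYNONYMS` dictionary and the example sentences are hypothetical. Maintaining such a map by hand does not scale, which is part of the motivation for learned features.

```python
from collections import Counter

# Hypothetical hand-built mapping from specific shades to one shared feature.
SYNONYMS = {"sky-blue": "blue", "aqua": "blue", "cerulean": "blue"}


def word_counts(text, synonyms=None):
    """Count lowercase words, optionally replacing known synonyms first."""
    synonyms = synonyms or {}
    tokens = text.lower().split()
    return Counter(synonyms.get(token, token) for token in tokens)


first = "the dress is aqua"
second = "the dress is cerulean"

# Without grouping, the two sentences share no colour feature at all.
raw_first, raw_second = word_counts(first), word_counts(second)
print(raw_first["aqua"], raw_second["aqua"])

# With grouping, both sentences get the same "blue" feature.
grouped_first = word_counts(first, SYNONYMS)
grouped_second = word_counts(second, SYNONYMS)
print(grouped_first["blue"], grouped_second["blue"])
```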

Word embeddings

One solution to these problems is word embeddings, a way of creating features that groups similar words together. For example, word embeddings would create similar features for the various shades of blue.

Word embeddings have another interesting property: they are mathematical representations of words that obey intuitive rules. For example, in word embeddings, if we take the features for "king", subtract the features for "man", and add the features for "woman", we get a set of features that is very close to those of "queen".
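The arithmetic can be illustrated with toy vectors. The three-dimensional embeddings below are invented by hand for this sketch (loosely "royalty", "male", "female") and are not taken from any trained model. Real embeddings are learned from large text collections and have many more dimensions. The `analogy` helper is likewise an assumption of this example.

```python
import math

# Hypothetical hand-made embeddings: [royalty, male, female].
embeddings = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.1, 0.5, 0.5],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vectors[a] - vectors[b] + vectors[c]."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the three input words so the answer is a different word.
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))


result = analogy("king", "man", "woman", embeddings)
print(result)
```

Closeness is measured here with cosine similarity, a common choice for comparing embedding vectors. With these toy values the nearest remaining word to "king" minus "man" plus "woman" is "queen".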