Words in numbers
By Mikhail Dubov
Just how well can a computer understand written language? We look at some of the latest advances in machine learning and what they mean for text analytics.
It is often remarked that the easier a task is for a human, the harder it is to teach a machine to do it, and vice versa. Computing the Taylor series expansion of a sinusoidal curve? Easy. Scanning a photo of a face and recognising that it’s angry? Not so much.
However, over the past decade things have started to change, and that change is gathering pace. It all comes from the field of artificial neural networks, which might not be the first place you’d expect a revolution, since it’s a field that’s been around since the early days of the transistor.
The brain is the ultimate machine
Artificial neural networks, as the name implies, are inspired by studies of the human brain, but implementing the brain’s vast computational power comes with some sizeable challenges. For example, the brain has 100 billion neurons, making some 100 trillion connections between them. Advances in hardware have been indispensable, with the largest artificial networks now containing around 30 billion parameters, which are roughly analogous to connections. Traditional methods rarely go beyond a few thousand parameters, limiting their ability to model the complexities of natural language, for example.
But size is not a panacea. The modern-day reincarnation of artificial neural nets, known as ‘deep learning’, owes a large part of its existence to subtle changes in the mathematical equations that govern the behaviour of these networks. In particular, differential calculus is a familiar sight when reading many of the ground-breaking papers written in the area.
At Chattermill, we use this technology to classify open-ended customer feedback, labelling comments with the themes that appear and the associated sentiment (positive, negative or neutral) for each of those topics.
Despite its biological inspiration, deep learning is still just a way of approximating a mathematical function. Therefore, we need a way to convert the words into numbers. Enter word vectors.
The company a word keeps
“You shall know a word by the company it keeps”
John Rupert Firth, 1957
Word vectors are built on the premise that words which are close in meaning are also spatially close in sentences. For example, the word ‘embassy’ will appear near ‘diplomat’, ‘American’ and ‘government’ far more often than the average word, which tells us something about its meaning. It also tells us that the vectors for these words should be relatively similar.
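To make the idea of “similar vectors” concrete, here’s a minimal sketch using made-up three-dimensional vectors. Real models learn vectors with hundreds of dimensions from large text corpora; the words and numbers below are purely illustrative.

```python
# Toy word vectors: words that keep similar company end up pointing in
# similar directions, which we measure with cosine similarity.
# These vectors are made up for illustration, not learned from data.
from math import sqrt

word_vectors = {
    "embassy":  [0.9, 0.8, 0.1],
    "diplomat": [0.8, 0.9, 0.2],
    "banana":   [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Related words score high...
print(cosine_similarity(word_vectors["embassy"], word_vectors["diplomat"]))
# ...while unrelated words score much lower.
print(cosine_similarity(word_vectors["embassy"], word_vectors["banana"]))
```

Cosine similarity is the standard choice here because it compares direction rather than magnitude, so frequent and rare words can still be judged on meaning alone.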
But this is the English language so of course it’s not quite that simple. Take the word ‘kick’. If the comment is about some food, it probably refers to flavour or punchiness, making it a good thing. But imagine you’re a travel company — someone writing ‘the hotel said they were going to kick us out’ is probably not a happy customer.
Not that it’s usually that obvious. Even companies in the same sector can have subtle differences that can trip up the unsuspecting. Suppose you’re a restaurant owner. Many of the comments you receive are relatively clear — ‘it was too spicy’ and ‘the chicken was dry’ don’t leave much room for interpretation, but the comments aren’t always like that. For example, ‘the sauce was mild’. Is that a good or a bad thing?
It looks impossible to be sure, but we can find a reliable answer by looking at the context. In this case the context is that the restaurant serves Indian food, and looking back through the data, we find that 90% of the time someone uses the word ‘mild’ they’re being critical. So we can classify the comment as negative and be pretty confident about it. Similarly, we might regard the comment as positive if the restaurant were mainly aimed at children.
There are two important points to draw from this. First, using statistics, we were able to infer information that the comment alone couldn’t have given us. Second, these inferences are often specific to a particular type of business.
For this reason, at Chattermill we think it’s important to train our models specifically for each client (and we do). Not that there aren’t similarities to take advantage of. More data is almost always better in machine learning, so combined datasets can be used to learn the general structure of the English language, and the resulting model can then be fine-tuned using data from specific clients.
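The pretrain-then-fine-tune idea can be sketched with a tiny bag-of-words logistic model trained by gradient descent. This is not Chattermill’s actual pipeline; the vocabulary, data and hyperparameters are all made up to show the two-stage shape of the training.

```python
# Stage 1 learns general sentiment from pooled data; stage 2 continues
# training from those weights on one client's data, where 'mild' is negative.
from math import exp

VOCAB = ["great", "awful", "mild", "spicy"]  # toy vocabulary

def features(text):
    """One indicator feature per vocabulary word (bag of words)."""
    words = text.split()
    return [1.0 if w in words else 0.0 for w in VOCAB]

def predict(weights, x):
    """Probability that a comment is positive (logistic regression)."""
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + exp(-z))

def train(weights, data, steps=500, lr=0.5):
    """Plain stochastic gradient descent on log loss."""
    for _ in range(steps):
        for text, label in data:
            x = features(text)
            error = predict(weights, x) - label
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return weights

# Stage 1: pretrain on combined data from many clients (label 1 = positive).
general = [("great food", 1), ("awful service", 0)]
weights = train([0.0] * len(VOCAB), general)

# Stage 2: fine-tune on one client's data, where 'mild' reads as negative.
client = [("too mild", 0), ("nice and spicy", 1)]
weights = train(weights, client)
```

Because fine-tuning starts from the pretrained weights rather than from scratch, the model keeps what it learned about words like ‘great’ while adjusting client-specific words like ‘mild’.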