Each month our Principal Data & Machine Learning Engineer Dr. Joe Ainsworth shares his take on the world of data, de-mystifying the science and sharing his diagnosis of common issues.
Issue #1 – Tagging Data
Data Science has been likened to the famous novel *Anna Karenina*, on account of the opening line:
“Happy families are all alike; every unhappy family is unhappy in its own way.”
The idea is that there are many potential failure modes and only by avoiding all of them will you succeed.
In the case of data science we have many steps to take on the way from a collection of historical data, through cleaning, sorting and identifying related indicators, to predicting behaviour with an accuracy better than a coin toss. If we do any of these steps imperfectly we will lose what little signal we’re trying to tease out, and one of the most critical steps is tagging.
What do we mean by tagging?
Tagging in the data science sense is much like its social media equivalent: we’re labelling our data, usually either categorising it or reducing it to a binary choice (“did this agreement end up defaulting?” yes/no).
We do this to make sense of it. Numeric data is easy for a machine to understand and easy to build algorithms around, but there’s a real practical problem with computers inferring meaning from a string of natural language characters – text. We could choose to apply more software to this problem of course; this forms the basis of the natural language processing branch of the discipline, which is a subject for another post.
Natural language processing is a last resort though, for when text data is truly freeform. Not all text fields are like this and the most useful types, to an analyst searching for insight, are ones with a finite number of possible values. The number itself is known as the field’s cardinality, and the most useful cardinality is not too high and not too low.
Some data are naturally finite. If we were to record a person’s favourite professional football club the number of values this could take is limited to the set of all possible football teams, something which is easy to discover and of low enough cardinality to do something useful with. We might hope to discover some relationship between this datum and some other, less obvious aspect of the person in question and so it’s helpful to have lots of datapoints for each value. In this example we might be looking at a few hundred clubs, and to get a few hundred data points for each one we’d need a hundred thousand altogether, a not-impractical figure.
A person’s favourite food, on the other hand, is not so easily quantised. Should an open sandwich be considered the same as a closed sandwich? What about regional variations of a dish? One could in principle create a category for each value encountered but the resulting cardinality would be too high to be useful – we might end up with a million values, so in order to identify any patterns we’d want a hundred million datapoints. That’s not so feasible.
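The arithmetic behind those two estimates can be sketched in a few lines of Python. The function names, the sample club list and the per-value targets are assumptions chosen to reproduce the rough figures above, not a rule for how much data you need.

```python
def cardinality(values):
    """Number of distinct values observed in a field."""
    return len(set(values))


def datapoints_needed(n_values, per_value):
    """Rough sample size if every distinct value should appear `per_value` times."""
    return n_values * per_value


# Hypothetical sample of a 'favourite club' field.
clubs = ["Leeds", "Arsenal", "Leeds", "Celtic", "Arsenal", "Leeds"]
print(cardinality(clubs))  # 3

# A few hundred clubs, a few hundred datapoints each (assumed: 300 of each).
print(datapoints_needed(300, 300))  # 90000

# A million distinct foods, even at an assumed 100 datapoints each.
print(datapoints_needed(1_000_000, 100))  # 100000000
```

The point is only that the data requirement scales linearly with cardinality, so a field with a million values is several orders of magnitude more demanding than one with a few hundred.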
How do you like your sauce?
The reason we’d need so many datapoints is that, unlike us, the computer has no way to gauge the similarity of two values. A favourite food of ‘pasta and ragu as made in Naples’ is much more similar to ‘pasta and ragu as made in Sicily’ than it is to ‘pickled herring as made in Helsinki’, but to a computer they are all completely distinct.
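One way to see this is to look at how such a field is often handed to a model. The sketch below assumes one-hot encoding, a common choice that this post does not itself prescribe, and the helper names are made up for illustration. Each distinct string becomes its own axis, so every pair of different values looks equally unrelated.

```python
def one_hot(values):
    """Map each distinct string to a vector containing a single 1."""
    vocab = sorted(set(values))
    index = {v: i for i, v in enumerate(vocab)}
    return {
        v: [1 if i == index[v] else 0 for i in range(len(vocab))]
        for v in vocab
    }


def dot(a, b):
    """Dot product, used here as a crude similarity score."""
    return sum(x * y for x, y in zip(a, b))


foods = [
    "pasta and ragu as made in Naples",
    "pasta and ragu as made in Sicily",
    "pickled herring as made in Helsinki",
]
vectors = one_hot(foods)
naples, sicily, helsinki = (vectors[f] for f in foods)

print(dot(naples, sicily))    # 0: no more alike than...
print(dot(naples, helsinki))  # 0: ...ragu and herring
print(dot(naples, naples))    # 1: a value only matches itself
```

Under this encoding the two ragus score exactly the same similarity as ragu and herring, which is the problem the rest of this post is about.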
This similarity is what allows us to draw inference – we’re betting on the fact that, if there’s a relationship between favourite food and some other factor, those two people who like Sicilian and Neapolitan ragu will score similarly in the related factor too. We would be able to consider those two people as members of the same category while the computer would not, meaning it would need to see two more datapoints in order to produce a model of equivalent predictive power.
So should we teach the computer how to assess similarity? It depends on the data. If you can do so then you’re able to turn a finite, discrete field into a continuous one, amenable to all sorts of further processes. But for the vast majority of fields this is too complex.
For those we resolve this by decoupling the drawing of inference from the problem of assessing similarity, and we do that by exhaustively solving the similarity problem first and hardcoding its results into the inference system.
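In practice, hardcoding the results can be as simple as a lookup table written by a person. The following is a minimal sketch: the category names, the `TAGS` table and the `tag` helper are all hypothetical, and a real project would more likely keep the table in a spreadsheet or database than in code.

```python
from collections import Counter

# The similarity problem, solved once by a human and written down.
TAGS = {
    "pasta and ragu as made in Naples": "pasta",
    "pasta and ragu as made in Sicily": "pasta",
    "pickled herring as made in Helsinki": "preserved fish",
}


def tag(value, tags=TAGS, default="untagged"):
    """Return the hand-assigned category, flagging values nobody has reviewed."""
    return tags.get(value, default)


raw = [
    "pasta and ragu as made in Naples",
    "pasta and ragu as made in Sicily",
    "pickled herring as made in Helsinki",
    "open sandwich",
]
tagged = [tag(v) for v in raw]
counts = Counter(tagged)

print(tagged)              # ['pasta', 'pasta', 'preserved fish', 'untagged']
print(counts["pasta"])     # 2
print(counts["untagged"])  # 1
```

The inference step now only ever sees the tag, so the two ragu lovers land in the same category. The explicit `untagged` default is a design choice: it keeps unreviewed values visible rather than silently dropping them or guessing a category.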
And if you thought that sounded like a tedious and error-prone solution with the potential to add bias to the resulting system then you would be correct. Group everything into too few groups, for instance, and you risk losing the signal by mixing up too many dissimilar entries into the same category. If you group things badly it could be even worse: not only might you lose the signal in that variable but you may counteract the signal in a related variable (there are techniques to counteract that too, of course).
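A toy illustration of losing the signal by grouping too coarsely, using entirely made-up numbers (the outcomes below are invented for the example and do not come from any real dataset):

```python
def rate(outcomes):
    """Fraction of positive outcomes (1 = defaulted, 0 = did not)."""
    return sum(outcomes) / len(outcomes)


# Invented outcomes for two hand-tagged groups of ten people each.
pasta = [1] * 8 + [0] * 2
fish = [1] * 2 + [0] * 8

print(rate(pasta))  # 0.8
print(rate(fish))   # 0.2

# Merge them into a single 'food' category and the difference disappears.
food = pasta + fish
print(rate(food))   # 0.5
```

Kept apart, the two groups behave very differently. Merged, the combined rate is no better than the coin toss we were trying to beat.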
It turns out that, for any behaviour so subtle that we need databases and algorithms to tease it out, we first need to spend time manually categorising some or all of our data.
And not just some time – this can be a significant part of the time budget. Tagging is where the algorithms run out. We can build tooling to extend our reach and allow us to deal with larger datasets in a given time, but we’ll always have to put in the hours. The efforts of thousands of computers can be greatly improved just by plugging one human into the process, capturing the responses of their finely-honed pattern matching abilities to a selection of your test data.
If you don’t do this then the signal can be lost, just one of the ways you can fail to build a useful model!