Each month, our Principal Data & Machine Learning Engineer, Dr. Joe Ainsworth, shares his insight on the world of data, de-mystifying data science and providing guidance on the steps you can take to move your data agenda forward.
We kick off this year’s inaugural data clinic with a post dealing with the common challenge of…
Managing Timelines in Big Data Applications
To begin, let’s answer the question…
🦾 What is big data?
A good working definition of ‘big data’ is that it is an amount of data that is too big to handle without some extra thought. Certainly, you can handle larger amounts of data by scaling up or out, but that’s not all there is to it. There will always be an amount of data that’s bigger than you can afford to scale up to and there will always be someone who needs to process that amount of data.
💡 Solving for big data requests
There are tricks we can use and good practices that can eke a little more out. Some of these are provided to us neatly packaged up in new tools and platforms. But what if you’ve spent all the compute budget and added all the shiny toys and your users still want bigger datasets and more accurate models? You have to turn to the time-honoured engineering practice of ‘compromise’.
Specifically, I mean compromise in terms of the metrics, not the actual request – we in the technical departments live for the problem-solving; the question of whether the problem actually needs solving is one we are happy to delegate to the commercial arm of the business (though we reserve the right to inform them when their demands are outrageous, of course).
What I mean is that, while the user may ask for all the results of a query in a given amount of time, often they don’t need all of it immediately. In many cases they would be happy to receive part of it now and the balance later; alternatively, a less accurate result promptly with a refined value to follow. I have two examples in mind which we can use to talk through this slightly suspicious answer.
Traditional Queries and Lambda Architecture
The first case is typical: an enormous query. As a one-off, people are happy to wait minutes or even hours, if need be, for their results to come in. But if that query needed running daily then those timeframes would not do. Among the most demanding of consumers are dashboards, which like to offer dynamic monitoring of metrics. If you have an hour-long query to populate a metric on a dashboard then your data is always at least an hour stale and the granularity is an hour at best. If the dashboard appears more timely or granular then users might be basing decisions on false assumptions.
So let’s posit a simple example: a database going back many years recording the trading activity of our customers, which is generally frequent. A primary requirement for the contact centre is that, should a customer contact us, the operator needs immediate access to the total magnitude of all transactions ever made by that customer (we’ll imagine there’s a good reason for this). We have a great many customers and a great many records, so precomputing this for all customers takes four hours. Computing it on the fly for a given customer takes only thirty seconds, but that’s still too long.
However, if we restrict the query to only the last week, it returns in a second – an acceptable outcome. So what we can do is perform a weekly query to precompute everything up to the beginning of the week, which gives us immediate access to data that’s up to a week out of date. We can then run another query on-demand for the remainder, adding the values together when it returns. This approach gives us immediate results and no staleness without having to install vast amounts of hardware.
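As a sketch of how the two halves might be combined (the customer IDs, figures and helper functions here are invented for illustration – a real serving layer would issue a time-bounded query against the trades database), the on-demand path simply adds the stale batch total to a fast query over recent trades:

```python
from datetime import datetime, timedelta, timezone

# Precomputed by the weekly batch job: customer ID -> total magnitude
# of all trades up to the start of the current week.
batch_totals = {"cust_42": 1_250_000.0}

def recent_magnitude(customer_id, since):
    """Stub for the fast, time-bounded query over this week's trades.

    In a real system this would hit the trades database with a query
    restricted to `since` onwards, returning in about a second; the
    stub ignores `since` and uses canned data.
    """
    recent_trades = {"cust_42": [500.0, 1200.0]}
    return sum(recent_trades.get(customer_id, []))

def total_magnitude(customer_id):
    """Combine the week-old batch total with a live query for the rest."""
    week_start = datetime.now(timezone.utc) - timedelta(days=7)
    return batch_totals.get(customer_id, 0.0) + recent_magnitude(customer_id, week_start)

print(total_magnitude("cust_42"))  # 1251700.0
```

The key design point is that neither half is fast and fresh on its own: the batch total is cheap to serve but a week stale, and the live query is fresh but only affordable over a small window. Only the sum has both properties.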
This is the lambda architecture, a popular technique that bridges the gap between batch and streaming. Its drawbacks include maintaining a more complicated system – you now have two queries and an aggregator. Building a bespoke system for a few key metrics might be justifiable, but if the business needs this sort of access to arbitrary data then you would want to implement it generally. This sort of application has become more common, so you might think the market would respond with a turnkey solution. Instead, we found it cleaner to solve these problems using fully-streamed systems – we’ll come back to this.
⚡️ Live ML Decision Systems
When you talk about AI-enabled systems, this is what most people think of: a machine learning system that responds instantly with the answer to some riddle that a mere human can’t answer at all.
It turns out that such systems are phenomenally expensive to build (as of early 2021) and their utility doesn’t always justify such a cost. They get created by large organisations looking to edge out their competitors in tightly-run races in which all parties are already using all the usual tricks.
The expense comes about because of the demanding application. Using the fictitious example above, what if further to knowing the total magnitude of all the customer’s trades, we want to predict what transaction they’re calling about? We have a ton of data we can use – not just their trades but a rich seam of data collected, with their knowledge and consent, from the machine they use to interact with our trading platform.
Our data scientists have determined that, of the potentially hundreds of trades a customer made in the last few hours, the one they enquire about can be determined by looking at the colour of the ambient light detected by the ambient light sensor on their phone, as well as some other stuff.
In the past, having identified such a relationship, we might then quantify it and build an algorithm to make the prediction based on the new information received at the start of the contact. This means mining the historic records of ambient light colour values relating to customer contact subjects.
So now let’s suppose the relationship between colour and problematic trade varies slowly. The values we acquired from the data-mining exercise are good for a month or so, but as it took a solid month to generate them (the long way), that’s not going to work.
Machine Learning speeds the process up by generalising the data mining part of the problem. The relationships are what’s important, not the actual data (a theme I’ve discussed before), so if you ask your user to format their data in a certain way you can then use a generic algorithm to produce weights that work in a generic prediction routine. Great! You can now automate the whole thing and retrain the model every week, maintaining freshness in predictive ability.
So now we can do the time-consuming number-crunching offline, reducing a very large database to a small set of coefficients geared up to answer a specific question very quickly. When the customer makes contact the ambient light colour reading and other values are fed into the prediction algorithm and the most likely result pops out.
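To make the offline/online split concrete, here is a deliberately tiny sketch. It stands in a nearest-centroid classifier for whatever model the data scientists actually chose, and the hue values and trade labels are invented – the point is only that training reduces a large history to a few numbers, and prediction then needs nothing else:

```python
def train(records):
    """Offline step: reduce historic records to per-class statistics.

    `records` is a list of (hue, trade_type) pairs; the 'model' is just
    the mean hue observed for each trade type.
    """
    sums, counts = {}, {}
    for hue, label in records:
        sums[label] = sums.get(label, 0.0) + hue
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(centroids, hue):
    """Online step: pick the trade type whose mean hue is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - hue))

# Invented history: FX trades cluster at low hues, equity trades at high ones.
history = [(30.0, "fx"), (35.0, "fx"), (200.0, "equity"), (210.0, "equity")]
model = train(history)        # runs offline, e.g. as a weekly job
print(predict(model, 205.0))  # equity
```

The `train` call can take as long as it likes over as much data as exists; `predict` touches only the handful of coefficients it produced, which is what makes the live response instant.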
The compromise we made here is an accuracy one: the model is most accurate just after it is trained, losing accuracy with every moment. If we can retrain it regularly enough to keep this loss of accuracy to the minimum then it can be a satisfactory solution. Generally this is achievable at a lower cost than doing so the long way.
The above machine learning system is still not simple to implement. We now have two pipelines to manage – one to train the model and one to use the predictor – and there’s a significant data engineering effort required to support it.
Feeding this sort of system is quite similar to the former example, the one we used a lambda architecture for, and the streaming approach works well here too. So well, in fact, that it deserves its own treatment – look out for a future post on this subject!
That’s all for today, but check out Joe’s previous data clinic on Data Tagging.