Machine learning blog series – Part 2: How to (re)train your machine

Note: In our first post in this series, we gave a brief overview of machine learning. Be sure to give it a read before checking out part two. Now, on to data training!

One of the great challenges of machine learning (ML) may seem obvious: the billions of data points you used to train your system will eventually grow to trillions of data points. There will always be new data, new inputs. Because we apparently can’t stop talking about Netflix on this blog, here’s an example: Netflix claims its members watched more than 42 billion hours of content in 2015, up from 29 billion in 2014. That’s a lot of new user data that wasn’t part of your system’s training data set. How do you make sure the system keeps making the right choices?

The answer is feedback loops (otherwise known as retroaction loops). Feedback loops play a critical role in ensuring that systems provide accurate results on a continuous basis. And while simple in theory, feedback loops present difficult challenges in practice. Retraining a system that has accumulated massive amounts of data takes time. Lots of time. Unless you have a big budget for massive computing power, the CPU time required to retrain a model can stretch into weeks.


But by implementing a properly designed feedback loop, you can avoid the time and expense of full-scale retraining. Rather than retraining the model only when results show signs of degradation or when new types of input data appear, you can train it continuously as humans send feedback. This is called ‘online learning’ or ‘incremental learning’. While the benefits of online learning are well-recognized, we found that there are very few large-scale, referenceable implementations at this point. Maybe in the future, but not now. Online learning may be the white whale of data scientists everywhere…
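To make the idea concrete, here is a minimal sketch of incremental learning using scikit-learn’s partial_fit API. The feedback stream, feature shapes, and labels are illustrative stand-ins, not a production pipeline:

```python
# Minimal sketch of online (incremental) learning with scikit-learn's
# partial_fit API. The feedback batches below are synthetic placeholders.
import numpy as np
from sklearn.linear_model import SGDClassifier

classes = np.array([0, 1])       # e.g. "relevant" vs. "not relevant"
model = SGDClassifier()          # a linear model trained by stochastic gradient descent

def feedback_batches():
    """Simulate mini-batches of (features, human-corrected labels)."""
    rng = np.random.default_rng(0)
    for _ in range(100):
        X = rng.normal(size=(32, 20))                      # 32 examples, 20 features
        y = (X[:, 0] + rng.normal(size=32) > 0).astype(int)
        yield X, y

# Instead of retraining from scratch, fold each batch of human feedback
# into the existing model as it arrives.
for X_batch, y_batch in feedback_batches():
    model.partial_fit(X_batch, y_batch, classes=classes)
```

The point of the sketch is the loop at the bottom: the model is updated a batch at a time as feedback comes in, so there is never a single, weeks-long retraining job.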

Also, heed this warning, data trainers: It might be tempting to cap your retraining data at a fixed size by ignoring older data or randomly trimming records in order to contain training time and cost. But this kind of selective sampling is always dangerous. You simply cannot take a non-representative sample of data to retrain your system. Data should always be independent and identically distributed (IID). As soon as you start ignoring part of the data to save time or expense, you run the risk of introducing bias. If online learning is the white whale of data scientists, bias is the sworn mortal enemy.
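Here is a toy illustration of that pitfall, with entirely synthetic data: when the stream drifts over time and you keep only the most recent slice, the retained sample no longer reflects the overall distribution your system was built on.

```python
# Toy illustration of the sampling pitfall: a drifting data stream plus a
# "keep only recent data" policy yields a non-representative training set.
import numpy as np

rng = np.random.default_rng(42)

n = 100_000
p_positive = np.linspace(0.2, 0.6, n)     # the positive class slowly becomes more common
labels = rng.random(n) < p_positive

full_rate = labels.mean()                 # class balance across the full history
recent_rate = labels[-10_000:].mean()     # class balance in the recent-only cut

print(f"positive rate, full history: {full_rate:.2f}")
print(f"positive rate, recent-only:  {recent_rate:.2f}")
# A model retrained only on the recent slice sees a very different class
# balance than the full history, i.e. the sample is no longer representative.
```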

So what can be done? If you’re not Google or Facebook with their seemingly infinite computing power, but you don’t want crappy data, what do you do? To reconcile the promise of online learning with budgetary constraints, Innodata has adopted techniques such as model ‘pre-training’. In practice, this means keeping a fully trained neural network handy and spinning off task-specific networks by quickly training them on the latest data representative of the current work.
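A rough sketch of that idea, using PyTorch purely for illustration: keep one fully trained base network, freeze it, and spin off a small task-specific head trained quickly on recent, representative data. The model shapes, data, and training details here are assumptions, not Innodata’s actual setup:

```python
# Sketch of pre-training plus quick task-specific fine-tuning.
import torch
from torch import nn

# Pretend this backbone was already trained on the full historical corpus.
pretrained_backbone = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
)
for param in pretrained_backbone.parameters():
    param.requires_grad_(False)        # reuse it as-is; don't retrain it

# New task-specific head, trained only on the latest data.
task_head = nn.Linear(64, 2)
optimizer = torch.optim.Adam(task_head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Recent, representative sample of the current work (synthetic here).
recent_X = torch.randn(512, 128)
recent_y = torch.randint(0, 2, (512,))

for _ in range(20):                    # a quick fine-tune, not weeks of CPU time
    optimizer.zero_grad()
    features = pretrained_backbone(recent_X)
    loss = loss_fn(task_head(features), recent_y)
    loss.backward()
    optimizer.step()
```

Because only the small head is trained, each new task costs minutes of compute rather than a full retraining run over everything you have ever collected.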

Data re-training can be done cost-effectively and correctly. Just be careful of bias, and keep an eye out for that glorious white whale. It will come some day…