Machine learning blog series — Part 3: Bias, be gone.

We’ve been talking in this blog series about the benefits of machine learning, and specifically about the importance of confidence metrics and training data. If you haven’t read those posts yet, give them a look before checking out today’s. Because today, we’re talking about our enemies. One enemy, to be specific. The arch-villain of machine learning: bias. Machine learning bias can ruin your data, but like all arch-villains, it can be stopped.

Even with a well-implemented feedback loop and a highly trained model with confidence scoring, bias is a lurking flaw that can undermine your data integrity. Bias can creep into a system in a host of insidious ways. As ML expert Daniel Tunkelang said, “Once your machine learning system embeds biases into its model, it can continue generating new training data that reinforces those biases.” Tunkelang warned that we need to be careful not to create “self-fulfilling prophecies.” In other words, machine learning bias can set off a chain of unwanted consequences down the road.

To train and retrain your system, the machine needs data that reflects the real world. Perhaps no clearer example of this exists than Google’s experience in June 2015, when the tech giant’s photo categorization system labeled two African-Americans as “gorillas.” Mortified, the company’s immediate reaction was simply to stop the system from identifying anything as a gorilla. But in a New York Times op-ed, Microsoft AI researcher Kate Crawford correctly identified the root cause of the blunder as “…a white guy problem: the data used to train Google’s software relied upon too many photos of white people, which diminished its ability to accurately identify images of people with different features.”

The best way to guard against bias from the outset is to increase the number of training examples, include additional models, and/or try a larger set of features. For models already in production, low-confidence results should of course be sent to a human to check. But you also need to ensure that the feedback loop includes a stream of randomly selected data that the machine has marked as high confidence. As pictured below, we refer to this as a “multi-channel” feedback loop, because it selects data for validation from both the low-confidence and the high-confidence channels. Sometimes your machine just needs to know it was right. If you only ever correct its errors, the quality in the high-confidence channel degrades over time, and then you run into problems.
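To make the idea concrete, here is a minimal sketch of that routing logic in Python. The threshold, sampling rate, function name, and prediction format are illustrative assumptions, not a reference to any particular system or library.

```python
import random

# Minimal sketch of a "multi-channel" feedback loop.
# Assumed values: the confidence threshold and spot-check rate below are
# illustrative, not recommendations.

CONFIDENCE_THRESHOLD = 0.80   # below this, always route to a human
HIGH_CONF_SAMPLE_RATE = 0.05  # fraction of high-confidence results to spot-check

def route_for_review(predictions):
    """Split model output into items needing human validation and items accepted as-is.

    Each prediction is assumed to be a dict like
    {"id": ..., "label": ..., "confidence": 0.93}.
    """
    review_queue, accepted = [], []
    for pred in predictions:
        if pred["confidence"] < CONFIDENCE_THRESHOLD:
            # Low-confidence channel: always send to a human.
            review_queue.append(pred)
        elif random.random() < HIGH_CONF_SAMPLE_RATE:
            # High-confidence channel: randomly sample so the machine also
            # gets confirmation (or correction) on results it was sure about.
            review_queue.append(pred)
        else:
            accepted.append(pred)
    return review_queue, accepted
```

Everything the humans review, including the high-confidence items they simply confirm, flows back into the training data, which is what keeps the high-confidence channel honest over time.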

Machine learning can do wonders for your business. It can make your data smarter, help you monetize it, and drastically improve process efficiency. But it has to be done right. Don’t let bias become the 800-pound gorilla in the room.