Best Practices for Training a Machine Learning Model
Machine learning is finally starting to live up to the hype: 83% of organizations have increased their AI/ML budgets year over year, according to Algorithmia's 2021 enterprise trends in ML report. Machine learning models rely on complex algorithms and require significant effort to train. A defined strategy improves results and helps set realistic expectations for the project timeline. Before starting a machine learning project, it pays to prepare, since the journey is long and re-work cycles can cost more than expected. In this article we cover the important questions to ask when preparing an ML project, how to manage the training process, and how to monitor batch results and fine-tune the training strategy.
Preparing for the Project
Before starting a machine learning project, we need to answer a few questions:
1. What exactly are we trying to solve through machine learning?
A machine learning model is well suited to problems where you have large volumes of input data that share patterns, but those patterns cannot be captured with traditional programming approaches. The model identifies patterns in the input data and recognizes the classes it was trained on. A common example is identifying images of cats in a dataset containing images of different animals.
2. What are existing vs. expected accuracy levels?
During preparation, we should identify the average accuracy users currently achieve and the accuracy expected of the model. This matters because the accuracy expected of the model cannot exceed what users achieve under real conditions. Model accuracy is assessed through a combination of coverage and correctness on a test dataset: coverage (closely related to recall) is the proportion of inputs for which the model makes a prediction, and precision is the proportion of those predictions that are correct. Calculate these figures for the people performing the current process today; they represent the maximum you should expect from a model. A minimal sketch of this calculation follows.
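As a rough illustration, coverage and precision might be computed on a labelled test set as follows, assuming a model that returns None when it declines to predict (the function and labels are illustrative, not a specific library API):

```python
def coverage_and_precision(predictions, labels):
    """Compute coverage and precision for a model that may abstain.

    predictions: predicted labels, with None meaning "no prediction attempted"
    labels: ground-truth labels of the same length
    """
    attempted = [(p, y) for p, y in zip(predictions, labels) if p is not None]
    coverage = len(attempted) / len(labels) if labels else 0.0
    correct = sum(1 for p, y in attempted if p == y)
    precision = correct / len(attempted) if attempted else 0.0
    return coverage, precision

# Example: 4 of 5 inputs received a prediction, and 3 of those were correct
cov, prec = coverage_and_precision(["cat", "dog", None, "cat", "dog"],
                                   ["cat", "dog", "cat", "dog", "dog"])
print(f"coverage={cov:.2f}, precision={prec:.2f}")  # coverage=0.80, precision=0.75
```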
3. Do we have defined rules to instruct annotators on identification?
The identification done by humans will later be imitated by the machine learning model. Therefore, it's important to analyze and resolve the following (a sketch of how such rules might be recorded follows the list):
a. Conflicting rules – multiple rules that produce contradictory results for the same input
b. Ambiguous rules – a single rule that can be interpreted to produce different results
c. Obvious rules – not written down explicitly, but understood by subject-matter experts (SMEs)
d. Relative/conditional rules – if rule 1 is true, rule 2 must fire; or if object x exists, follow rule y
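Some teams keep such rules in a simple, versioned structure next to the dataset so annotators and reviewers work from the same source. A minimal sketch, where the rule ids, statements, and fields are hypothetical:

```python
# Hypothetical annotation guideline entries: each rule gets an id, a plain-language
# statement, and an example so annotators resolve edge cases the same way.
ANNOTATION_RULES = [
    {
        "id": "R1",
        "statement": "Label an animal as 'cat' only if the face is visible.",
        "example": "A cat photographed from behind is labelled 'unknown'.",
    },
    {
        "id": "R2",
        "statement": "If more than one animal appears, annotate each one separately.",
        "depends_on": "R1",  # conditional rule: apply R1 to every animal found
    },
]
```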
4. How can we prepare a good dataset for training?
Diversity and volume of accurately labelled data are the primary requirements for a model. One must evaluate the available samples for the following (a rough sketch of this evaluation follows the list):
a. Coverage – samples that represent the full variety of inputs
b. Quantity – enough volume to prepare both training and test sets
c. Elimination – identify and remove out-of-scope and inappropriate samples
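As a rough sketch of this evaluation, you might summarize class coverage and volume before splitting the data; the column names and threshold below are assumptions for illustration:

```python
import pandas as pd

# Hypothetical dataset: one row per labelled sample
samples = pd.DataFrame({
    "label": ["cat", "cat", "dog", "dog", "dog", "rabbit"],
    "in_scope": [True, True, True, True, False, True],
})

# Elimination: drop samples flagged as out of scope
samples = samples[samples["in_scope"]]

# Coverage: how many samples of each class remain
counts = samples["label"].value_counts()
print(counts)

# Quantity: warn if any class is too small to split into training and test sets
MIN_PER_CLASS = 100  # assumed threshold; adjust to your problem
print("Under-represented classes:", list(counts[counts < MIN_PER_CLASS].index))
```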
Managing the Training Process
The steps for training a machine learning model are quite straightforward. However, it is an iterative and incremental process, so the project scope must include time to act on observations from each iteration. To ensure success, follow these steps (a minimal sketch of the iterative loop follows the list):
1. Analyze the input samples and prepare the dataset within a defined scope.
2. Perform data cleaning to match input requirements.
3. Identify datapoints to be extracted from the dataset.
4. Identify datapoint expectations:
a. Accuracy – minimum value and desired value
b. Coverage – minimum value and desired value
5. Define training strategy, annotation rules and accuracy targets. Create sample batches for annotating.
6. Annotate and ingest sample batches for model training.
7. Monitor batch results – fine-tune the training strategy and make timeline projections. Continue annotating and ingesting sample batches.
8. Once the model reaches the point where machine accuracy matches human accuracy, plan for periodic monitoring to identify, maintain, and improve accuracy levels.
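A highly simplified sketch of the iterative loop in steps 5 through 8 appears below; the functions, the simulated metrics, and the target accuracy are placeholders, not a specific framework's API:

```python
import random

TARGET_ACCURACY = 0.92  # assumed human-level accuracy measured during preparation
MAX_BATCHES = 20

def annotate_next_batch(batch_id):
    """Stand-in for steps 5-6: return a freshly annotated sample batch."""
    return [{"batch": batch_id, "label": random.choice(["cat", "dog"])} for _ in range(50)]

def train_and_evaluate(training_data):
    """Stand-in for steps 6-7: retrain on all data so far and score on a test set."""
    # Simulated improvement as the training set grows; replace with real metrics.
    accuracy = min(0.95, 0.5 + 0.03 * len(training_data) / 50)
    coverage = min(0.90, 0.4 + 0.03 * len(training_data) / 50)
    return accuracy, coverage

training_data = []
for batch_id in range(MAX_BATCHES):
    training_data.extend(annotate_next_batch(batch_id))   # ingest a new annotated batch
    accuracy, coverage = train_and_evaluate(training_data)
    print(f"batch {batch_id}: accuracy={accuracy:.2f}, coverage={coverage:.2f}")
    if accuracy >= TARGET_ACCURACY:  # step 8: machine accuracy matches human accuracy
        break
```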
Monitoring Batch Results and Fine-Tuning the Training Strategy
Initial batches may show fluctuating accuracy, but as the volume of correctly annotated samples increases, accuracy and coverage should grow steadily. Each batch should be closely examined for:
1. Completeness – Are we annotating all instances?
The model learns from everything it observes, so if several cats are left un-annotated, the model learns that those objects look like cats but are not cats. This results in low coverage. Therefore, every instance in the input set should be annotated so the model receives an unambiguous signal. A simple check is sketched below.
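A simple completeness check might flag images that carry no annotations before a batch is ingested; the records below are a hypothetical example:

```python
# Hypothetical batch records: each image lists the annotations it received
batch = [
    {"image": "img_001.jpg", "annotations": ["cat"]},
    {"image": "img_002.jpg", "annotations": []},        # cat present but not annotated
    {"image": "img_003.jpg", "annotations": ["cat", "cat"]},
]

# Flag images with no annotations for manual review before ingesting the batch
unannotated = [record["image"] for record in batch if not record["annotations"]]
print("Needs review:", unannotated)  # -> ['img_002.jpg']
```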
2. Consistency – Are we annotating instances of an object consistently?
Annotation rules play an especially important role when a team of annotators is involved: the rules should be clear enough that every annotator makes the same decision for a given object. The level of agreement among annotators is an important quality measure, and different methodologies can be used to ensure the training dataset is consistent in annotation quality. One common measure is sketched below.
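One common way to quantify agreement between two annotators is Cohen's kappa. A minimal sketch using scikit-learn, with illustrative labels:

```python
from sklearn.metrics import cohen_kappa_score

# Labels assigned by two annotators to the same ten samples (illustrative data)
annotator_a = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog", "cat", "dog"]
annotator_b = ["cat", "cat", "dog", "cat", "cat", "dog", "cat", "dog", "dog", "dog"]

# Kappa near 1.0 means strong agreement; low values suggest the rules need clarifying
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```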
Conclusion
Training a machine learning model efficiently is essential to achieving the goals of the model. The quality of the dataset determines the model's ability to understand the input environment and make the right decisions. Model training should be executed in a planned way to achieve predictable growth in model performance. After the expected accuracy levels are reached, monitoring model accuracy and fine-tuning the training strategy remain continuous processes that keep model performance in line with real-world conditions.
If you would like to find out how businesses can further optimize AI and ML model training, book a 15-minute discovery call with our Data Strategy and Innovation Team.
If you found this material helpful, please share it on social media.