Poor data quality is enemy number one to the widespread, profitable use of machine learning. While the caustic observation “garbage in, garbage out” has plagued analytics and decision-making for generations, it carries a special warning for machine learning. The quality demands of machine learning are steep, and bad data can rear its ugly head twice: first in the historical data used to train the predictive model and second in the new data used by that model to make future decisions.
If Your Data Is Bad, Your Machine Learning Tools Are Useless
To ensure you have the right data for machine learning, you must have an aggressive, well-executed quality program. Such a program requires the leaders of the overall effort to take five steps. First, clarify your objectives and assess whether you have the right data to support them. Second, build plenty of time into your overall project plan to execute data quality fundamentals. Third, maintain an audit trail as you prepare the training data. Fourth, charge a specific individual or team with responsibility for data quality as you turn your model loose. Finally, obtain independent, rigorous quality assurance.
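To make the "twice" point concrete, here is a minimal Python sketch of one way a team might apply the same quality checks to historical training data and to new data at decision time, while keeping an audit trail of every rejected record. Everything in it is an illustrative assumption rather than a prescribed method: the field names (`age`, `income`), the rules, and the `AuditedFilter` class are hypothetical, and a real quality program would involve far more than simple range checks.

```python
from dataclasses import dataclass, field

# Hypothetical quality rules: field name -> predicate on that field's value.
RULES = {
    "age": lambda v: isinstance(v, (int, float)) and 0 <= v <= 120,
    "income": lambda v: isinstance(v, (int, float)) and v >= 0,
}


@dataclass
class AuditedFilter:
    """Applies RULES to records and logs every rejection (illustrative only)."""
    audit_trail: list = field(default_factory=list)

    def check(self, record, stage):
        # A rule fails if its field is missing or its predicate returns False.
        failed = [name for name, ok in RULES.items()
                  if name not in record or not ok(record[name])]
        if failed:
            self.audit_trail.append(
                {"stage": stage, "record": record, "failed": failed}
            )
        return not failed

    def clean(self, records, stage):
        # Keep only the records that pass every rule.
        return [r for r in records if self.check(r, stage)]


quality = AuditedFilter()

# First exposure to bad data: historical records used for training.
training = quality.clean(
    [{"age": 34, "income": 52000}, {"age": -1, "income": 48000}, {"age": 51}],
    stage="training",
)

# Second exposure: new records arriving when the model makes decisions.
incoming = quality.clean([{"age": 29, "income": "n/a"}], stage="scoring")

for entry in quality.audit_trail:
    print(entry["stage"], entry["failed"], entry["record"])
```

Using one shared rule set for both stages is a deliberate choice in this sketch: it keeps the model from being trained on data held to one standard and then fed data held to another. The audit trail gives whoever is responsible for data quality a record of what was excluded and why.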