If Your Data Is Bad, Your Machine Learning Tools Are Useless

Five steps to ensure higher-quality data.

by

Thomas C. Redman

April 02, 2018

Alan Schein Photography/Getty Images

Summary.

Poor data quality is enemy number one to the widespread, profitable use of machine learning. The quality demands of machine learning are steep, and bad data can rear its ugly head twice: first in the historical data used to train the predictive model, and second in the new data used by that model to make future decisions. To ensure you have the right data for machine learning, you must have an aggressive, well-executed quality program. This requires the leaders of the overall effort to take the following five steps: First, clarify your objectives and assess whether you have the right data to support these objectives. Second, build plenty of time to execute data quality fundamentals into your overall project plan. Third, maintain an audit trail as you prepare the training data. Fourth, charge a specific individual or team with responsibility for data quality as you turn your model loose. Finally, obtain independent, rigorous quality assurance.

Poor data quality is enemy number one to the widespread, profitable use of machine learning. While the caustic observation “garbage in, garbage out” has plagued analytics and decision-making for generations, it carries a special warning for machine learning. The quality demands of machine learning are steep, and bad data can rear its ugly head twice: first in the historical data used to train the predictive model, and second in the new data used by that model to make future decisions.

Thomas C. Redman, “the Data Doc,” is President of Data Quality Solutions. He helps companies and people chart their courses to data-driven futures, with special emphasis on quality, analytics, and organizational capabilities. His latest book, People and Data: Uniting to Transform Your Organization (Kogan Page), was published in summer 2023.
