Data quality: It’s a dirty job, but someone’s got to do it

Memorial Day marks the ringing in of the summer season. If you are like me, there’s no better way to celebrate the beginning of summer than a perfectly toasted bratwurst or hot dog. And while I am more than happy to enjoy the end product, I have very little interest in rolling up my sleeves and “seeing how the sausage is made.” Many of us treat data analytics the same way. We are happy to digest the insightful dashboards, intriguing graphics, and actionable takeaways. However, very few of us are interested in rolling up our sleeves and doing the dirty work needed to prepare the data so that it can yield useful insights.

Many companies starting their data analytics journey make the mistake of skipping the data cleaning process altogether. None of us wants to see how the sausage is made; we just want the bratwurst to magically appear. But as we have seen over and over, insightful analytics cannot be achieved with poor data quality. A 2017 Harvard Business Review article titled “Only 3% of Companies’ Data Meets Basic Quality Standards” outlined the data quality crisis best, stating that “47% of newly-created data records have at least one critical (e.g., work-impacting) error.” If you are struggling with data quality issues, know that you are not alone. In this article, we will examine why data quality is so critical, why companies struggle, and finally what can be done about it.

Poor data quality erodes

Sinkholes are caused by steady erosion of underground structures by water and other natural processes. At this point you might be asking yourself what in the world sinkholes and data quality have to do with one another. Well, like sinkholes, poor data quality causes erosion. Data quality is the foundation upon which all analytics are built. When the foundation is eroded, it affects the whole structure of the analytics process.

At first, when data is entered into a database, missing values, incorrect selections, relational hierarchy issues, and similar problems do not cause much trouble. However, as we continue to accumulate data, the data quality issues that we did not perceive at first compound as the volume and velocity of data increase. If data quality is not addressed before we attempt to perform analytics, we will be left with eroded insights. By eroding analytics capabilities, poor data quality will also lessen the return on investment (ROI) on EHS (Environmental, Health and Safety) management systems. A main reason companies are investing in EHS management systems is to learn more about the health of their programs through data analytics. Without data quality, insights are eroded along with the ROI on the EHS management system. If data quality is so vital to achieving analytics success, why do so many companies struggle?

The data quality struggle

Information technology adoption is growing in the EHS profession at an exponential rate. A recent report by Verdantix indicated that the compound annual growth rate of the EHS software industry is expected to be 11.5% from 2021 to 2026. This adoption has led, and will continue to lead, to a dramatic increase in the volume of data being collected, along with a push for more data-driven decision making. Ultimately, this is a good thing. However, to capitalize on the value of the increase in volume, variety, and velocity of data, EHS professionals should be aware of three common data quality pitfalls.

Field Dilution – An attempt to collect too much data can dilute the pertinent details. Leveraging EHS information technology systems greatly lowers the barrier to data collection and entry. However, this can lead to over-collection of data. For example, you might have been using a paper incident report with 15 key fields. Now that you have new software, fields can easily be added in multiple workflow stages, leaving you with an incident report with a total of 50 fields. More data is not always better. You might have an over-collection problem if many of these additional fields are blank or have missing values. This should prompt the team to review the information that is being collected and evaluate its importance.

Non-Descript Entry – Information management systems provide the user with the ability to enter information in many ways: via text field, dropdown, multiple choice, and so on. This can help lessen the burden of entry and standardize responses. A data quality pitfall that arises is the input of “Other”, “NA”, “Not Listed”, or similar non-descript answers into these fields. When non-descript options are available and used by end users, the data that is gathered loses its depth.

Hierarchy Alignment – When utilizing an EHS information management system, it is important to add the locations, departments, projects, inspection checklists, etc. where data will be collected. However, as the breadth of the data collection increases, the hierarchy alignment data quality pitfall rears its head. For example, is the same question being asked on multiple checklists, and if so, are those answers rolling up to a larger hierarchy for analytics purposes? Cleaning and organizing your data hierarchy is vital for the aggregation and trending of data.
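
Each of the three pitfalls above can be checked with a simple measurement on exported data. The following is a minimal Python sketch, not a reference to any particular EHS product. The record layout, field names, sample values, the list of non-descript answers, and the function names are all hypothetical and chosen only for illustration.

```python
# Hypothetical incident records exported from an EHS system (illustrative only).
records = [
    {"injury_type": "Laceration", "weather": "", "root_cause": "Other"},
    {"injury_type": "Other", "weather": None, "root_cause": "NA"},
    {"injury_type": "Strain", "weather": "", "root_cause": "Procedure not followed"},
    {"injury_type": "Not Listed", "weather": "Rain", "root_cause": "Other"},
]

# Assumed set of answers treated as non-descript; adjust to your own forms.
NON_DESCRIPT = {"other", "na", "n/a", "not listed"}


def is_blank(value):
    """True when a value is missing or contains only whitespace."""
    return value is None or str(value).strip() == ""


def blank_rate(rows, field):
    """Field dilution signal: share of rows where the field is blank."""
    return sum(is_blank(r.get(field)) for r in rows) / len(rows)


def non_descript_rate(rows, field):
    """Non-descript entry signal: share of filled-in values that say nothing."""
    filled = [str(r.get(field)).strip().lower()
              for r in rows if not is_blank(r.get(field))]
    if not filled:
        return 0.0
    return sum(v in NON_DESCRIPT for v in filled) / len(filled)


# Hypothetical mapping of (checklist, question) to a roll-up category.
# None marks a question that does not roll up to any larger hierarchy.
rollup = {
    ("Forklift Inspection", "Is PPE worn?"): "PPE Compliance",
    ("Site Walk", "Is PPE being worn?"): "PPE Compliance",
    ("Site Walk", "Are walkways clear?"): None,
}


def unmapped_questions(mapping):
    """Hierarchy alignment signal: questions with no roll-up category."""
    return sorted(key for key, category in mapping.items() if not category)


if __name__ == "__main__":
    for field in ("injury_type", "weather", "root_cause"):
        print(field, blank_rate(records, field), non_descript_rate(records, field))
    print(unmapped_questions(rollup))
```

A field with a high blank rate is a candidate for removal, a field with a high non-descript rate may need better dropdown options, and any unmapped question is a gap in the hierarchy. What counts as "high" is a judgment call for the team reviewing the data.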

How to improve data quality

What can we do to improve data quality? As the title of this article suggests, it is a dirty job, but someone must do it. Fixing data quality issues is not a glamorous job, but it is a necessary evil if we want to use data to drive decisions. The first step to fix data quality issues is to recognize where the issues lie. This can be accomplished with a systematic review of the data collection process, forms, and fields to identify gaps. Determine which forms, fields, or hierarchies need to be analyzed for data quality. Next, export field and form data from the EHS management system for review. Then, analyze the outputs for data quality issues. After analyzing the data, the second step is determining the problem. Are there field dilution issues, non-descript entries into fields, or hierarchy alignment problems in the data? Lastly, once we have the issues defined, we can create a plan for getting them resolved. This could mean consolidating fields, re-aligning hierarchies, or using Natural Language Processing (NLP) to create additional dropdown selections to correct non-descript pitfalls.
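
As a rough illustration of the last idea, the Python sketch below counts the words that appear most often in the free-text descriptions of records where "Other" was selected. Frequent terms are candidates for new dropdown options. This is a simple word-frequency count standing in for a real NLP pipeline, and the field names, sample rows, stop-word list, and function name are hypothetical.

```python
import re
from collections import Counter

# Hypothetical export: a dropdown selection plus a free-text description.
rows = [
    {"hazard_type": "Other", "description": "Worker slipped on ice near loading dock"},
    {"hazard_type": "Other", "description": "Ice on stairs caused a slip"},
    {"hazard_type": "Chemical", "description": "Solvent splash during transfer"},
    {"hazard_type": "Other", "description": "Slip on wet floor in the break room"},
]

# Small illustrative stop-word list; a real review would use a fuller one.
STOP_WORDS = {"a", "an", "the", "on", "in", "near", "during", "caused", "of", "and"}


def candidate_terms(rows, field, text_field, top_n=3):
    """Count how many 'Other' records mention each word in their description."""
    counts = Counter()
    for row in rows:
        if row[field].strip().lower() != "other":
            continue  # only look at records with a non-descript selection
        words = set(re.findall(r"[a-z]+", row[text_field].lower()))
        # Count each word once per record so one long report cannot dominate.
        counts.update(sorted(words - STOP_WORDS))
    return counts.most_common(top_n)


if __name__ == "__main__":
    for term, n in candidate_terms(rows, "hazard_type", "description"):
        print(term, n)
```

The output is only a starting point for discussion. A person still has to decide whether a frequent term deserves its own dropdown selection, and a plain word count will not merge related forms such as "slip" and "slipped".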

Data quality is a problem that plagues most companies utilizing an EHS management system. Much like a sinkhole, poor data quality is caused by erosion of the underlying structure; in this case, that structure is data. The erosion of data will dilute insights and decrease the ROI on your EHS management system. There are three common pitfalls that EHS professionals should be aware of when analyzing their data collection processes for quality issues: field dilution, non-descript entry, and hierarchy alignment. If we are proactive and honest in the pursuit of data quality, we will find and correct quality pitfalls. In turn, we will be better positioned to use the data as a positive change agent, ultimately helping us to identify and control risk and eliminate death on the job by 2050.
