The importance of a data dictionary
Data quality software was first reported at least 20 years ago. The market for data quality has changed dramatically over the past two decades. Where data quality initially started out as a standalone solution, it gradually became a commodity, and is now often integrated into other applications. Only a few standalone data quality solutions are still available. So why do most organisations still have problems with data quality, 20 years after tooling was introduced?
More surprisingly, the problems are often more or less the same as they were 20 years ago. That is awkward at best: nothing significant seems to have changed in those 20 years. We might even conclude that if we haven’t managed to improve data quality with tooling over the past 20 years, doing exactly the same thing for the next 20 years will likely produce similarly underwhelming results.
Let’s take a step back. Why are we having problems with data quality? Reference literature and various research studies show that problems with data quality are most often attributed to:
- Data entry by employees
- Data migration projects
- Batch feeds
- Mismatched user expectations
1. Data entry by employees
Data entry is consistently listed as the most important reason for poor data quality. Entering a new client into a system based on SAP, for example, involves a lot more than just typing in a name and address. A huge amount of information is needed to add a new client. That information often needs to be supplied by several different departments, depending on e.g. which type of client you’re entering. Is the new client the entity to which products are being sold? Or is it only a delivery location to ship the goods to? Are any special discounts applicable to this type of client? Has this company been a client before? If so, do you want to create a new entry, or just reactivate the old one?
The use and significance of the various client groups and bank details are often complex, and the dependencies and connections with other information are not always clear. This lack of clarity grows worse if new clients are only added occasionally, rather than updated regularly. In that case, employees often have to refresh their understanding of how the client data is structured all over again each time. Sometimes a department gradually loses the knowledge of how correct (client) data entry works. For example, a new employee who hasn’t been fully onboarded yet may not understand the context that defines how and when certain information needs to be entered.
2. Data migration
In general, data migration processes are handled by external consultants. These experts are familiar with data migration projects and the associated tooling. They know the challenges that are characteristic of data migration projects and have frequently resolved them. In an ideal world, these consultants would also be familiar with the source application that the data comes from, as well as the target application that the data will be migrated to. They may even know a lot about the sector or industry in which your organisation operates.
However, these external consultants are not familiar with how the data was structured in your organisation in the past, or how it is being used. Rules and exceptions that all the users know may or may not be shared with the migration team. Most of these rules and exceptions are identified by analysing errors during the migration process. Eventually, the team will manage to load the data into the target application, but only after repeated iterations, prolonged analyses and partially accurate assumptions – only to discover later that some assumptions were incorrect, and some of the data is therefore unusable.
Although the consultants may know the technical data model, they can’t possibly know all the company’s definitions and rules that apply to the data.
3. Batch feeds
A batch feed is when large quantities of data are entered all at once, in one batch. These situations often encounter a combination of the problems common in manual data entry and data migration. In general, batch feeds take place in two ways:
- Users who prepare data manually in a specific format, and then use a mechanism to load all the data into the application in one go
- A custom-developed interface between two applications that constantly transfers data from one application to the other in the background, without users noticing.
There are several problems that could cause difficulties here:
- The data is often prepared in Microsoft Excel. Over time, I have increasingly come to see Excel as a source of data quality issues, not the solution for those problems.
- Data input checks that generally take place when data is entered manually may or may not be performed during a batch feed. One example is autosuggestions that display a street name based on postal code and house number.
- It takes extra effort to ensure that the batch interface and/or the interface between the two applications stays up-to-date when one or both of the applications are changed.
The technical employee responsible for the batch interface may or may not be aware that the business implications of the data have changed over time. A new client segment may have been added which is not yet supported by the batch interface. There is a risk that the new client segment will vanish without a trace during the data transfer, or be overwritten by the default client segment.
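The risk described above can be reduced by validating batch rows against the same rules the manual entry screen enforces, and rejecting unknown values rather than silently substituting a default. A minimal sketch in Python; the field name and allowed segments are hypothetical, standing in for whatever the target application actually defines:

```python
# Hypothetical allowed client segments, as they would be defined in the
# target application's configuration or in a data dictionary.
ALLOWED_SEGMENTS = {"RETAIL", "WHOLESALE", "ONLINE"}

def validate_row(row):
    """Return a list of problems for one batch row; an empty list means OK."""
    problems = []
    segment = row.get("client_segment")
    if segment is None:
        problems.append("client_segment is missing")
    elif segment not in ALLOWED_SEGMENTS:
        # Reject, rather than overwrite with a default segment: a silently
        # substituted default is exactly how a new client segment can
        # "vanish without a trace" during a batch feed.
        problems.append(f"unknown client_segment: {segment!r}")
    return problems

def load_batch(rows):
    """Split a batch into loadable rows and rejected rows with reasons."""
    accepted, rejected = [], []
    for row in rows:
        problems = validate_row(row)
        if problems:
            rejected.append((row, problems))
        else:
            accepted.append(row)
    return accepted, rejected
```

The design choice here is to surface rejected rows for a human to review, which keeps the batch interface honest when the business adds a segment the interface does not yet know about.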
4. Mismatched expectations
Both during data entry and when the data is used, mismatched expectations can cause problems.
Different people may have different interpretations of the requested information during data entry. One example is a CRM application with an entry field labelled “ETA” on the screen for entering a new opportunity. Some employees may not know what ETA stands for, so they’ll leave the field empty. Others may know that ETA stands for “estimated time of arrival”, but not what that means in this context. When we’re talking about an opportunity, should the field contain the estimated time that the deal will be closed, or the estimated time that the prospect needs their items? The two interpretations are very different, but they can both be expressed in a date format.
Obviously, when everyone enters data according to their best judgement of what it means, but interprets it differently, it becomes almost impossible to use the data later on. For example, what does it mean if a date has not been entered into this field? Does it mean:
- There is no ETA?
- The employee entering the data was not sure what to enter?
- The employee does not have access to the required information and therefore skipped this field?
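One way to avoid this ambiguity is to make the meaning of an empty value explicit in the data model, instead of overloading a single optional date field. A sketch in Python, using a hypothetical opportunity record (the names and statuses are illustrative, not from any particular CRM):

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional

class EtaStatus(Enum):
    """Why the expected close date may be absent; the names are illustrative."""
    KNOWN = "known"             # a date has been provided
    NOT_APPLICABLE = "n/a"      # there genuinely is no ETA
    UNKNOWN = "unknown"         # the employee could not determine it

@dataclass
class Opportunity:
    name: str
    eta_status: EtaStatus
    expected_close: Optional[date] = None  # expected deal close date

    def __post_init__(self):
        # A date is required exactly when the status says one is known,
        # so an empty date can never be silently ambiguous.
        if (self.eta_status is EtaStatus.KNOWN) != (self.expected_close is not None):
            raise ValueError("expected_close must be set iff eta_status is KNOWN")
```

With this structure, an empty date always carries a recorded reason, and the field’s definition (“expected deal close date”, not “date the prospect needs the items”) lives next to the data instead of in each employee’s head.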
Change is the only constant
The only constant in business is change. What was perfectly correct and required for data entry one year will probably be different again next year. The information we need to do our work every day changes constantly, and so does the demand for the information we store. Different client types emerge as companies try to serve the market even better. Over time, those new client types may become established, change slightly or dramatically, or vanish altogether.
When making decisions about a client type, it’s obviously important to know whether it should be updated or discontinued. Discussions about a type of client and the reasons for updating or discontinuing its use are just as important as the final decision, possibly even more so.
Data quality tooling can be used effectively to map out problems involving data quality in your organisation. However, tooling is not the answer to resolving those problems. It is important to record the significance of the data, the permitted data range, and any exceptions – and to make sure that contextual information is recorded clearly and in an accessible way. What you often see now is that every project reinvents the wheel, and documents the slice of data that is important to finish the project at that point in time. The project documentation vanishes after the project’s conclusion, or isn’t recorded at all in an attempt to be ‘agile’. That is a missed opportunity. Solid documentation is the basis for good-quality data.
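To make this concrete, here is what such documentation could look like in a minimal, machine-readable form: a data dictionary entry that records a field’s definition, permitted values, and owner, so later projects don’t have to rediscover them. The field, values, and owner below are hypothetical examples:

```python
# A minimal data dictionary, kept alongside the data itself rather than
# in throwaway project documentation.
DATA_DICTIONARY = {
    "client_segment": {
        "definition": "Commercial segment used to determine discount rules",
        "allowed_values": ["RETAIL", "WHOLESALE", "ONLINE"],
        "nullable": False,
        "owner": "Sales Operations",  # who to ask when the rules change
    },
}

def check_value(field, value):
    """Check one value against the dictionary; return an error string or None."""
    entry = DATA_DICTIONARY[field]
    if value is None:
        return None if entry["nullable"] else f"{field} may not be empty"
    if value not in entry["allowed_values"]:
        return f"{field}: {value!r} not in {entry['allowed_values']}"
    return None
```

Because the definition, the permitted range, and the responsible owner are recorded in one accessible place, the same entry can serve data entry screens, migration teams, and batch feeds alike, instead of each project reinventing its own slice of documentation.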
Rudi van Lent, Master Data & Data Quality Specialist