A space probe worth $125M destroyed. An Air Canada Boeing 767 gliding to an emergency landing after running out of fuel mid-flight. How was this even possible? Both accidents were caused by careless data handling, which eventually had disastrous consequences. A fundamental principle for building information systems – especially those integrating data from various sources – is checking the integrity, or quality, of that data. Even the best system, when fed unreliable data, becomes useless.
Organizations want to make better decisions based on their collected data. With this goal in mind, they build data warehouses to integrate data from a variety of sources, which they process and use to support decision-making. Success, however, depends on a sufficiently large amount of quality data, and the more data and data sources, the greater the challenge. Only high-quality input allows reliable reports to be developed and used to make the right decisions.
Why is data quality important?
With small-scale projects, any mistake may be severe but not very costly in terms of dollars and cents. After all, it’s not a big deal to correct a few mistakes in invoices every once in a while. However, major difficulties will arise in the case of large-scale projects.
US scientists learned this the hard way when, in 1999, they accidentally destroyed the $125 million Mars Climate Orbiter because of a minor mistake. How did it happen? Two teams of scientists responsible for the project used different measurement systems: imperial and metric. This led to a miscalculation that caused the probe to enter orbit around Mars at an altitude of 60 km instead of 150 km, and the probe burned up in the denser atmosphere. A group of experienced scientists who had graduated from the best universities failed to notice that part of the team made their calculations in kilometers, while others used miles.
While the disaster of the unmanned Mars Climate Orbiter probe was a big financial loss, a mistake in aviation puts passengers’ lives at risk. In 1983, during a regularly scheduled Air Canada flight, both engines shut down at 26,000 feet (6,500 meters) halfway through the flight. It turned out the machine had no fuel at all. How did that happen? Before the flight, the ground crew loaded the plane’s tanks with 22,300 pounds (10,115 kg) of fuel – i.e., less than half of the 22,300 kilograms necessary to fly the planned route.
What was the reason for the mistake? That very same year, Canada abandoned the imperial system of units in favor of the metric system. Even though the new aircraft had gauges set according to the new unit system, the staff was still in the habit of using the previous units. Worse yet, the fuel-quantity indicator system in the tanks failed, so the flight crew could not identify the problem earlier, during refueling.
This could have been one of the biggest disasters in the history of aviation, but fortunately, the experienced captain skillfully put the plane in a controlled glide and steered it to a former Canadian airbase. None of the 70 people on board were killed or seriously injured. Just like with NASA, the underlying cause of the problem was a mistake in units of measurement—an error all too easily made, as everyone knows from physics classes at school.
Another incident with disastrous potential occurred in 2003 at Tokyo Disneyland in Japan. An axle in a roller coaster train broke during the ride. Fortunately, the safety systems worked, and there were no injuries or fatalities. Why did the incident even occur? It turned out that the new axle was ordered on the basis of technical designs from the mid-1990s, in which dimensions were given in inches. The part was ordered with the correct numerical value, but the unit was read as centimeters instead of inches. As a result, the new axle was roughly 2.5 times thinner than the original one, and therefore could not withstand the loads.
There may be dozens of types of errors in data management, including differences in rounding numbers, different measurement systems (imperial, metric, mixed), erroneous sensor data, differing date formats, errors in human data entry (e.g., common errors in vehicle registration cards), incompatible file formats, etc. The greater the chaos in the data, the greater the potential for errors in calculations.
Make sure your data quality is high
How do you ensure the quality of the data? Are you sure you are comparing the right values with each other during the process of integrating them from different sources? In a large organization, wrong decisions caused by incorrect data may cost millions of dollars and, in extreme cases, pose a risk to human life or health.
Data integrity protects data against unauthorized or accidental changes, which might otherwise introduce unintentional errors downstream.
Here are 6 steps to ensure data integrity:
1. Data source records
The first step towards ensuring data integrity in an organization is to record which systems supply data and how they make it available. Data suppliers can include systems operating within the organization, IoT devices (especially in the manufacturing or logistics industries), as well as business partners (e.g., accounting companies, logistics operators) and external or independent data sources (e.g., publicly available weather forecast data, government databases, etc.).
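Such a record of data sources can start out as a simple structured registry. The sketch below is illustrative only – the fields and the example entries are assumptions, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass
class DataSource:
    name: str    # human-readable identifier
    kind: str    # e.g. "internal", "partner", "public"
    owner: str   # who is accountable for the data's quality
    access: str  # how the data is obtained

# Hypothetical entries for illustration
registry = [
    DataSource("ERP exports", "internal", "finance team", "nightly CSV dump"),
    DataSource("warehouse sensors", "internal", "operations", "IoT gateway API"),
    DataSource("weather forecasts", "public", "external provider", "public REST API"),
]

# Accountability for quality becomes an explicit, queryable attribute
unowned = [s.name for s in registry if not s.owner]
```

Even a registry this small makes the question "who answers for this data?" answerable by a query rather than by tribal knowledge.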
Having such a record of data sources may seem fairly obvious. In practice, however, as an organization's operations grow, data may come from dozens or hundreds of sources, and accountability for its quality is not always obvious.
2. Connection to a data sharing system
Nowadays, most IT systems allow data exchange between applications from different vendors via APIs. An API standardizes and automates the process, thereby simplifying it.
In order to obtain data from the app, you must first connect to it. Usually, the authentication process is not complicated, and application providers offer ready-made solutions, such as login or token exchange.
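Token exchange usually boils down to sending the credential with every request. Here is a minimal sketch using Python's standard library; the endpoint and token value are hypothetical placeholders, and every real provider documents its own scheme:

```python
import urllib.request

# Hypothetical values – obtained via the provider's login or token exchange
API_TOKEN = "example-token"
BASE_URL = "https://api.example.com/v1/orders"

# A token is typically sent as a bearer credential in the Authorization header
request = urllib.request.Request(
    BASE_URL,
    headers={
        "Authorization": f"Bearer {API_TOKEN}",
        "Accept": "application/json",
    },
)
# urllib.request.urlopen(request) would perform the actual call; omitted here
```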
An important aspect that influences data sharing is the type of relationship you have with your data provider. If it is a partnership governed by a contract, you have more influence on the scope of the data or on how the data is shared. The situation is different when the relationship is asymmetrical, for example, when it is a public API. In such a case, the data depends on its provider, and you have no control over what is shared.
3. Data retrieval
Once you are connected to your provider, the next step is to retrieve the data. Data taken directly from a data provider is often referred to as "raw data," since it has not been modified at this point.
The most common way to share data is through web services that enable the user to download the data using a properly constructed query sent to a designated URL. The range of solutions also includes connecting to database systems or reconstructing databases based on so-called "snapshots" (images of the database at the moment the snapshot was generated). Often, however, systems make their data available by exporting it to flat files – sequential files in which data is stored in a linear structure. These may be Excel files or text files.
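Constructing such a query is typically a matter of encoding parameters into the URL. A minimal sketch, with a hypothetical endpoint and parameter names:

```python
from urllib.parse import urlencode

# Hypothetical endpoint and query parameters for illustration
base_url = "https://data.example.com/api/measurements"
params = {"station": "WAW-01", "from": "2023-01-01", "to": "2023-01-31"}

# urlencode handles escaping, so special characters in values stay safe
query_url = f"{base_url}?{urlencode(params)}"
print(query_url)
# → https://data.example.com/api/measurements?station=WAW-01&from=2023-01-01&to=2023-01-31
```

Sending this URL to the web service (e.g., with `urllib.request.urlopen`) would then return the raw data.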
In the worst-case scenario, the information system does not make the data available, and the developers did not provide for the possibility of exporting it, which requires the use of more creative solutions. These may include, e.g., “scraping” a website. Web scraping involves downloading a web page the way a web browser does, but the retrieved data is interpreted the way you want it to be. Important: This must be done in accordance with the applicable law to prevent copyright violations.
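Scraping in its simplest form means parsing the downloaded HTML and keeping only the fragments you care about. The toy extractor below uses only the standard library and a fixed snippet in place of a real HTTP download, so the page structure and values are illustrative assumptions:

```python
from html.parser import HTMLParser

class CellExtractor(HTMLParser):
    """Collects text from <td> cells – a toy stand-in for real scraping logic."""
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

# A real page would come from an HTTP request; a fixed snippet
# keeps the example self-contained
html = "<table><tr><td>Widget</td><td>9.99</td></tr></table>"
parser = CellExtractor()
parser.feed(html)
print(parser.cells)  # ['Widget', '9.99']
```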
4. Data processing and quality checks
When you have your raw data, you can process it. In order to analyze the collected data, it must first be unified, i.e., brought to a state where records from different sources can be compared or matched with one another. This step may require the most effort.
The clearest difference between data from different systems may be its format. If formats differ, even recognizing what value a given field actually represents can be the biggest challenge.
This may be demonstrated by the use of different date formats in European (DD/MM/YYYY) and Anglo-Saxon applications (MM/DD/YYYY). Or, because of the aforementioned inconsistencies in the units of measurement, even within the metric system, one data provider may provide distances in meters (260 m), another one in kilometers (0.26 km), etc. In the NASA probe story, one team used the metric system, while the other used the imperial system, as a result of which the final calculations had an error of nearly 90 km.
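Both problems – ambiguous date formats and mixed units – can be handled by normalizing everything to one convention at ingestion time. A minimal sketch; the source labels and conversion table are illustrative assumptions:

```python
from datetime import datetime

def parse_date(value: str, source_format: str) -> datetime:
    """Normalize dates from providers using different conventions."""
    formats = {"european": "%d/%m/%Y", "anglo": "%m/%d/%Y"}
    return datetime.strptime(value, formats[source_format])

def to_meters(value: float, unit: str) -> float:
    """Bring distances from different providers to a single base unit."""
    factors = {"m": 1.0, "km": 1000.0, "mi": 1609.344}
    return value * factors[unit]

# "03/04/2023" means 3 April in Europe but 4 March in the US
assert parse_date("03/04/2023", "european").month == 4
assert parse_date("03/04/2023", "anglo").month == 3

# 260 m and 0.26 km are the same distance once unified
assert abs(to_meters(260, "m") - to_meters(0.26, "km")) < 1e-9
```

Converting to a single base unit at the boundary of the system is exactly the discipline whose absence doomed the Mars Climate Orbiter.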
5. Data usage
With unified data stored in one place (e.g., in a data warehouse), you can import it and connect it to business intelligence solutions such as Data Studio or Power BI. These enable you to create interactive dashboards and reports.
The data collected may also become a new source of data to be used by other tools. Take, for example, a ticketing system built by the local government. Having collected and unified data from various entities (carriers), a new system can be created, based entirely on the collected data (e.g., a comprehensive application that integrates information on various modes of transportation).
6. Data validation
Data validation, or data verification, can have two purposes. The first one is to identify data that carry useless values and distort the big picture. In such a case, it’s worth performing a process called data cleaning, which involves the removal of outliers that may indicate wrong measurements or accidental aggregation of erroneous data.
The second purpose may be to verify the data so you can make sure the devices producing it are working properly. For example, if a supermarket monitors the temperature in refrigerated food storage warehouses and a temperature of -1,000 degrees Celsius is recorded, this means that the sensor failed. A faulty device should be replaced, and until it is, the values obtained from it should not be included in the monthly statement. This is also an example of an unrealistic value: a temperature below absolute zero is simply not physically possible.
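A validation step like this can be a simple plausibility filter. The acceptable range below is an illustrative assumption, not a real cold-storage specification:

```python
ABSOLUTE_ZERO_C = -273.15  # no reading below this is physically possible

def valid_readings(readings, low=-40.0, high=25.0):
    """Keep only plausible cold-storage temperatures; the [low, high]
    range is an assumed example, not a real specification."""
    return [t for t in readings if ABSOLUTE_ZERO_C < t and low <= t <= high]

# -1000 signals a failed sensor and must not reach the monthly statement
sensor_data = [-18.2, -17.9, -1000.0, -18.5]
clean = valid_readings(sensor_data)
print(clean)  # [-18.2, -17.9, -18.5]
```

In practice, rejected readings should also be logged and alerted on, since a stream of implausible values is itself the signal that a device needs replacing.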
Integration of data from independent systems built at different times can, on the one hand, bring great business value and, on the other, pose a major challenge. Resources may differ not only in their file format but also in content. Once you have successfully navigated the entire process from data retrieval to processing, you gain access to new resources that can lay the foundation for new software or become the basis for more informed business decisions.
For a complex project, the steps we have described may mean weeks or even months of work by a team of analysts, developers, and database specialists. At OChK, we have already gained vast experience conducting such projects for our clients, where verification and integration of data sources laid the foundation for building and developing IT systems.
Contact us – together we can plan how to help your organization ensure data integrity.