I was reading an article the other day about checksums. I briefly touched on why you may want to use checksums in my book, and thought I would go into a little more detail in this blog post.
A checksum is a short, fixed-size value, usually written in hexadecimal, that is derived from the contents of a file. If the contents change, the checksum will almost certainly change with them. If you want more details about what they are and how to create them, I will direct you to the article I have linked to above. For our purposes, I will just talk about why you may want to use one in the EDI world.
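For anyone who would like to see one in practice, here is a minimal sketch in Python using the standard library's hashlib module. The function name file_checksum and the choice of SHA-256 are my own assumptions for illustration; your trading partner may well use a different algorithm.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Return the hexadecimal checksum of the file at `path`.

    `file_checksum` is a hypothetical helper name, and SHA-256 is only an
    assumed default; pass whatever algorithm your partner actually uses.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling file_checksum on two files with identical contents returns the same string, no matter what the files are named or when they were written.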
Why use a checksum
It is unusual, but I have had situations where I connect to a trading partner’s server, download files, and then do not have permission to delete the file or files I just downloaded from the remote server. Usually, this is the type of situation where we have paid for a commercial data set, and the file needs to remain for their other customers to download.
One of the main things we need to concern ourselves with in the EDI space is making sure that we process every file that we should, and that we process each file once and only once. In our use case, we are going on the assumption that you don’t want to depend on the date timestamps of files to know which ones you have processed and which ones you haven’t. Two reasons I can think of off the top of my head for not wanting to use the date timestamp are:
- The previously downloaded files are no longer available to you.
- In a lot of places, the clocks change twice a year for daylight saving time, so a timestamp is not always a safe bet.
What the process may look like
You would need to compute the checksum of the file or files that are on the remote server, and compare them to the checksums of the files you have previously downloaded. To do this efficiently, and especially if the files you previously downloaded are no longer available to you, you would need to have the checksums of those files stored in a database table that you have access to.
Then you would compare the checksum of the file sitting on the remote server to the checksums stored in your database. If you find a match, you would ignore that file and not download it. If you do not find a match, you would download the file and then insert the computed checksum for that file into your database table, which prevents you from downloading that same file again.
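Here is a rough sketch of what that check could look like in Python, using SQLite as a stand-in for whatever database you have access to. The table name processed_files and the function names are hypothetical, and SHA-256 is an assumed algorithm. The sketch also assumes you already have the file's contents in hand; if the remote server publishes its own checksums, you could compare those instead and skip the download entirely.

```python
import hashlib
import sqlite3


def open_ledger(db_path=":memory:"):
    """Open (or create) the table of checksums for files already processed.

    The table name `processed_files` is an assumption for this sketch.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_files ("
        " checksum TEXT PRIMARY KEY,"
        " file_name TEXT NOT NULL,"
        " processed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def is_new_file(conn, checksum):
    """Return True when this checksum has not been recorded before."""
    row = conn.execute(
        "SELECT 1 FROM processed_files WHERE checksum = ?", (checksum,)
    ).fetchone()
    return row is None


def process_if_new(conn, file_name, content):
    """Process `content` only if its checksum is not already in the ledger.

    Returns True when the file was treated as new, False when it was skipped.
    """
    checksum = hashlib.sha256(content).hexdigest()
    if not is_new_file(conn, checksum):
        return False  # Seen before: ignore it.

    # ... hand the file to your EDI translator here ...

    # Record the checksum so the same contents are never processed twice.
    with conn:
        conn.execute(
            "INSERT INTO processed_files (checksum, file_name) VALUES (?, ?)",
            (checksum, file_name),
        )
    return True
```

Making the checksum the primary key means the database itself will refuse a duplicate insert, which is a handy safety net if two jobs ever run at the same time.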
I would be interested to know what other solutions people have come up with for solving this type of problem, or other situations where the use of checksums has proven helpful. Please leave a comment below.
Great overview of this often overlooked design feature.
Some more sophisticated transfer tools even have basic checksum functionality built-in.
Ensuring the entire content of a file is intact after transfer is one of many important steps to ensure a higher degree of data quality overall. Data quality organizations have risen out of a culmination of many data checking processes that were either incomplete or poorly designed to begin with… often a simple checksum operation could prevent significant and even expensive downstream process, system and reporting issues. In addition to completeness operations such as this one, I would add the importance of notification on error to the person(s) supporting those critical processes and log tracking. Building fault tolerance is critical and often missed by architecture teams, as well as put aside while speed to market overrules basic quality controls.
Naively and unintentionally, many developers responsible for data transfer don’t always understand the monetary value of one lost record, the potential of regulatory exposure of a single missed transaction, or that transaction that happened to be with a customer whose very first order was coming through your process. This is where back-end design can even impact the customer experience.
Every “bit” matters, and no pun intended!
In my post, I say that I often encounter this situation when downloading a commercial data set that we have paid for. I completely failed to mention that usually, they will have a second file sitting out there containing the checksum of the data file, so you can validate that you downloaded the file correctly. That is one of the very points that you bring out.
Thank you so much for taking the time to comment, and making this a better site.
And I for one love a pun.
Great content. Glad to contribute and thank you for spearheading important topics on best practices!