A user request led to an unsatisfying result: the requested PDF document was not displayed. Instead, an empty screen appeared with an error message.
In this case a set of PDFs was delivered as a “zip” file. One of the pre-processing steps was to “unzip” the file into separate documents for further processing. Unfortunately, the Info-ZIP “unzip” tool that was used had a parameter switched on that should not have been.
Normally, unzipping leaves all files unaltered. But for (plain) text files this is not always satisfactory. The reason is that not all operating systems use the same byte codes to indicate line endings in a text file. For example, under Windows / DOS the convention is to terminate each line with a two-byte sequence (carriage return + linefeed, 0x0D 0x0A, i.e. decimal 13 10), whereas a one-byte line ending (linefeed, 0x0A, decimal 10) is customary in Unix-like environments.
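As a small illustration of the two conventions, the following Python sketch prints the byte values of both line endings in hexadecimal and in decimal. The names `CRLF` and `LF` are only labels chosen for this example.

```python
# Line-ending byte sequences, shown in hexadecimal and in decimal.
CRLF = b"\r\n"  # Windows / DOS: carriage return + linefeed
LF = b"\n"      # Unix-like systems: linefeed only

print(CRLF.hex(" "), list(CRLF))  # 0d 0a [13, 10]
print(LF.hex(" "), list(LF))      # 0a [10]
```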
As a result of these differences, text files that were created under Windows don’t always display correctly under Unix, and vice versa. The “unzip” tool has a parameter to solve this. When activated, it first tries to identify which files in the “zip file” are plain text. For these files, it then tries to establish which line endings were used. If the line endings are different from the default ones in the destination environment, it then changes them in the unzipped file. (On a side note, the documentation of “unzip” explicitly warns that its identification of text files is far from flawless!)
In this case, the PDFs were unzipped in a Unix-based environment. Some of them were then mis-identified as text files with Windows (two-byte) line endings, which were subsequently replaced with Unix (one-byte) line endings in the output files.
As a result, bytes were lost from the PDF files: one byte for every byte pair that was treated as a Windows line ending. Internally, a PDF file is made up of a large number of numbered objects. A cross-reference table records the exact location (byte offset) of each object in the file. Since bytes were removed, the values in the cross-reference table no longer matched the actual object locations, and the PDFs could no longer be rendered.
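The effect can be reproduced on a toy example. The sketch below is not a real PDF parser: it builds a few PDF-like objects with two-byte line endings, records their byte offsets in a plain dictionary (standing in for the cross-reference table), and then applies the same kind of replacement a text-mode conversion would. The helper names `convert_crlf_to_lf` and `offsets_valid` are hypothetical and exist only for this illustration.

```python
def convert_crlf_to_lf(data):
    """Mimic a text-mode line-ending conversion applied to binary data."""
    return data.replace(b"\r\n", b"\n")


def offsets_valid(data, offsets):
    """Check that every recorded offset points at the start of 'N 0 obj'."""
    return all(
        data.startswith(f"{number} 0 obj".encode("ascii"), offset)
        for number, offset in offsets.items()
    )


# A toy file: a header followed by two objects, all with CRLF line endings.
chunks = [
    b"%PDF-1.4\r\n",
    b"1 0 obj\r\n<< >>\r\nendobj\r\n",
    b"2 0 obj\r\n<< >>\r\nendobj\r\n",
]
original = b"".join(chunks)

# Stand-in for the cross-reference table: object number -> byte offset.
offsets = {1: len(chunks[0]), 2: len(chunks[0]) + len(chunks[1])}

converted = convert_crlf_to_lf(original)

print(offsets_valid(original, offsets))   # True
print(offsets_valid(converted, offsets))  # False
print(len(original) - len(converted))     # 7 (bytes lost)
```

Every converted line ending that precedes an object shifts that object one byte forward, so the recorded offsets drift further from the real positions the deeper they point into the file.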
Can we avoid this?
Two lessons can be learnt from this case:
- Use checksums. If the original file has a checksum, it can be compared with the checksum of the “unzipped” file, so that you get a warning at an early stage, for example during the Ingest procedure.
- Know the tools you are using! In this case, the effects of the parameter were apparently not known by the creator of the “unzip” script, even though the documentation of “unzip” is completely clear about its behaviour.
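One checksum that is readily available is the CRC-32 value that the ZIP format stores for every archive member. The following Python sketch, using only the standard library, compares that stored value with the CRC-32 of each extracted file. It assumes the files were extracted into a directory that mirrors the paths inside the archive. The function name `find_altered_members` and its parameters are hypothetical, and this is not how the original workflow was implemented.

```python
import zipfile
import zlib
from pathlib import Path


def find_altered_members(zip_path, extracted_dir):
    """Return the names of members whose extracted bytes do not match
    the CRC-32 stored in the ZIP archive (or that are missing)."""
    altered = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            target = Path(extracted_dir) / info.filename
            if not target.is_file():
                altered.append(info.filename)
                continue
            # CRC-32 of the file as it exists on disk after extraction.
            crc = zlib.crc32(target.read_bytes()) & 0xFFFFFFFF
            if crc != info.CRC:
                altered.append(info.filename)
    return altered
```

A call such as `find_altered_members("delivery.zip", "unzipped/")` (both paths are placeholders) returns an empty list when every extracted file is byte-identical to what was packed. Any file whose line endings were converted would show up in the result.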
Research done in the KB-NL by Johan van der Knijff