Damaged scanned documents

At the heart of digital preservation is the original digital object, the “thing” we want to preserve. Organisations that preserve digital material have an ingest procedure in which they receive the digital objects and run various quality checks, such as whether the received object, be it born-digital or digitized, is what they expected to receive. But this quality control is only possible up to a certain level and mainly covers technical aspects such as file format, structure or size. The quality control of the intellectual content of the object is often done by the creator of the object. Digitized material can be compared with the analogue original and deviations can be identified. Once this process is finished, someone gives the green light that the digital object is “correct”. But what if one is not aware of errors that can occur and are difficult to notice? My colleague Johan van der Knijff pointed me to this interesting article.


See the change in the third row (from the article mentioned).

David Kriesel describes here his recent experience with the Xerox WorkCentre machines. While scanning an image and comparing it later with the original, he discovered that some numbers in the image had been changed: “66” became “86” in several cases. This is a deviation that is not easy to detect when scanning many pages with numbers, quite apart from the fact that no one expects to have to check for it! The error had nothing to do with the OCR process, as the OCR functionality was not active in this task. The current assumption is that it has to do with the use of JBIG2 for compression. Xerox has confirmed this error and will create a patch. However, this error might have been present in Xerox WorkCentres and other copiers for years, and we will never know how many documents have been “damaged” by it. Metadata in the original digital object about the scanning environment that was used and the date of creation might help to track down possibly faulty documents and support future digital detectives.
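To make the last point concrete, here is a minimal sketch of how such metadata could be used to shortlist documents that deserve a second look. It assumes that each preserved object carries a record with the scanning device and the creation date. The `ScanRecord` class, its field names, the device strings and the date window are all hypothetical and only serve as illustration; they are not taken from the article or from any particular metadata standard.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ScanRecord:
    # Hypothetical field names, chosen for illustration only.
    identifier: str
    scanner_model: str
    created: date


def possibly_affected(records, model_fragment, first_day, last_day):
    """Return the records scanned on a matching device within the date window."""
    fragment = model_fragment.lower()
    return [
        r for r in records
        if fragment in r.scanner_model.lower()
        and first_day <= r.created <= last_day
    ]


# Made-up sample records; the model names are placeholders.
records = [
    ScanRecord("doc-001", "Xerox WorkCentre (model A)", date(2012, 5, 14)),
    ScanRecord("doc-002", "Other vendor scanner", date(2012, 6, 2)),
    ScanRecord("doc-003", "Xerox WorkCentre (model B)", date(2009, 1, 20)),
]

# The date window is arbitrary; in practice it would follow from what is
# known about when the faulty devices were in use.
suspects = possibly_affected(
    records, "workcentre", date(2010, 1, 1), date(2013, 12, 31)
)
print([r.identifier for r in suspects])
```

A shortlist like this only says which documents were produced in a suspect environment, not that they actually contain errors. Each candidate would still have to be compared with the analogue original.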

Biggest Data Breaches

In a very nice visualisation the “Information is Beautiful” people present an overview of the major data breaches of the past few years, categorised by method of leak and by type of organisation from which the original data were stolen: banks, health services, insurance companies, you name it. We have all experienced such incidents, for example when we got a message to change our LinkedIn, Dropbox or Evernote password after a breach had occurred.


The underlying data of this visualisation can be examined in a separate file, which describes each incident with a brief explanation and, when available, a reference to the original source. This information offers some interesting examples of things that went wrong in real life, incidents that might also happen in the world of digital preservation. We are familiar with lists of (security) risks; this visualisation shows the evidence that these risks are real.

Although I don’t expect many organisations to have their preserved data on laptops that can be stolen (a frequently recurring cause), (unhappy) former employees and a lack of strict authorisation procedures for hired companies can lead to sensitive personal information being revealed. Both small and big organisations can fall victim to hackers, theft or stupidity: incidents that might lead to leaks of information that should be kept private, like credit card numbers, health information, address information and so on.

One could think that this kind of sensitive data is less likely to be present in National Libraries than in, for example, data centres preserving social science data. But (National) Libraries also preserve material with a commercial value, for example contemporary e-books, e-journals, movies and music. These materials are in their custody and should not be exposed to these risks. We talk a lot about how digital objects might be affected by technical risks. But are we sure we take enough measures to prevent preserved collections from appearing in this visualisation?