Loss of research data … or not?

There is a growing awareness amongst scientists that their data sets need attention if they are to remain usable in the future, and some of them learnt this the hard way. An interesting article by T. Vines et al. in Current Biology, Volume 24, Issue 1, 94-97, 19 December 2013 describes a study into the availability of research data years after the article was published. For 516 publicly available articles, published between 1991 and 2011, Vines and his colleagues tried to obtain the related data sets via the authors' email addresses, found either in the article or by searching the web. They received 101 data sets, and another 20 data sets were reported to be still in use. Especially for older papers, the related data sets were no longer readily available. The original authors were asked for the data, and they gave a variety of reasons why they could not provide them.

Responses included authors being sure that the data were lost (e.g., on a stolen computer) or thinking that they might be stored in some distant location (e.g., their parent’s attic) to authors having some degree of certainty that the data are on a Zip or floppy disk in their possession but no longer having the appropriate hardware to access it. In the latter two cases, the authors would have to devote hours or days to retrieving the data.

The article was discussed in Nature, where two other cases of lost data sets were mentioned. They are cited here, as they are too small to put in the Stories part of the Atlas.

These cases show that “benign neglect” is, after all, often not the way to preserve digital information.

Agricultural researcher Melvin McCarty, for instance, spent 15 years between 1958 and 1973 recording the life cycles of plants and grasses near Lincoln, Nebraska. Forty years later, ecologist Lizzie Wolkovich went searching for McCarty’s data as part of an effort to tie together experiments exploring how rising temperatures affect plant life cycles. But McCarty had died, and his raw data could not be found. “There is nothing we can replicate now. The loss of the long-term data set is very sad,” says Wolkovich, who works at the University of British Columbia in Vancouver.

A similar fate befell the raw data collected in the 1980s by Otto Solbrig, a biologist at Harvard University in Cambridge, Massachusetts, on species of violets in New England. Plant biologist Sydne Record at Michigan State University in East Lansing wrote to him in 2009 asking for the original data, to test out a mathematical analysis of population viability that she was developing — but Solbrig didn’t have them. “We had at least 20 big folders with those data, but nobody was interested in them so we threw them away,” he says.


Hyves, a social media network


In 2004 the social media network Hyves started in the Netherlands and became a big success, with at one point 10 million subscribers (the Netherlands has a population of 17 million). In 2010 the Telegraaf Media Group (TMG) took over the network and added functionality to it, with a focus on online gaming. Recently TMG announced that it will close the social media site and restrict its activities to online gaming. But over the past almost ten years, users had added 1 petabyte of content to the network. Would this be deleted, as happened in other cases? No! Instead, Hyves offers a service to rescue the personal content: users can request a copy of their content (images, blogs, conversations etc.) via email, and this will be sent to them by the end of the year. Because, as Hyves says in the press release, “this content is owned by the users”.

Very quickly after this announcement, another company, MijnAlbum.nl, offered a way for users to download their images from the Hyves site, even before Hyves starts its own service. This immediately became a success, with 2 million downloads a day.

It seems that awareness of the preservation of personal digital archives is growing among both the general public and service providers, a good sign!

Damaged scanned documents

At the heart of digital preservation is the original digital object, the “thing” we want to preserve. Organisations that preserve digital material have an ingest procedure in which they receive the digital objects and perform various quality checks, such as whether the received object is what they expected to receive, be it a born-digital or a digitized object. But this quality control is only possible up to a certain level, mainly related to technical aspects such as file format, structure or size. The quality control of the intellectual content of the object is often done by the creator of the object. Digitized material can be compared with the analogue original, and deviations can be identified. Once this process is finished, someone gives the green light that the digital object is “correct”. But what if one is not aware of errors that can occur and are difficult to notice? My colleague Johan van der Knijff pointed me to this interesting article.


See the change in the third row (from the article mentioned).

David Kriesel describes here his recent experience with the Xerox WorkCentre machines. While scanning an image and later comparing it with the original, he discovered that some figures in the image had been changed: “66” became “86” in several cases. This is a deviation that is not easy to detect when scanning many pages with figures, apart from the fact that no one expects to need to check for this! The error had nothing to do with the OCR process, as the OCR functionality was not active in this task. The current assumption is that it has to do with the use of JBIG2 for compression. Xerox has confirmed this error and will create a patch. However, this error might have been present in Xerox WorkCentres and other copiers for years; we will never know how many documents are “damaged” by this error. Metadata in the original digital object about the scanning environment that was used and the date of creation might be helpful to retrieve possibly faulty documents and support future digital detectives.
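
To make that last point concrete, here is a minimal sketch of how such metadata could be used to shortlist documents for re-inspection. Everything in it is an assumption for illustration: the `ScanRecord` fields, the `possibly_affected` helper, the model names, the dates and the catalogue entries are hypothetical and do not come from Kriesel's article or from any real archive.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ScanRecord:
    """Hypothetical provenance metadata kept with a digitized object."""
    object_id: str
    scanner_model: str   # device that produced the scan
    compression: str     # compression scheme recorded at creation time
    created: date        # date the digital object was created


def possibly_affected(records, model_prefix, compression, start, end):
    """Select records from a matching device and compression within [start, end]."""
    wanted_model = model_prefix.lower()
    wanted_compression = compression.lower()
    return [
        r for r in records
        if r.scanner_model.lower().startswith(wanted_model)
        and r.compression.lower() == wanted_compression
        and start <= r.created <= end
    ]


# Illustrative records only; none of these values describe real documents.
catalogue = [
    ScanRecord("doc-001", "Xerox WorkCentre (model A)", "JBIG2", date(2012, 5, 14)),
    ScanRecord("doc-002", "Xerox WorkCentre (model A)", "CCITT G4", date(2012, 5, 14)),
    ScanRecord("doc-003", "Other Scanner (model B)", "JBIG2", date(2013, 1, 9)),
    ScanRecord("doc-004", "Xerox WorkCentre (model C)", "JBIG2", date(2009, 3, 2)),
]

# The date window is an arbitrary example, not a known period for the error.
suspects = possibly_affected(
    catalogue, "Xerox WorkCentre", "JBIG2", date(2010, 1, 1), date(2013, 12, 31)
)
print([r.object_id for r in suspects])
```

A filter like this only narrows the search. Each shortlisted document would still have to be compared with its analogue original, and documents without recorded scanner, compression and date information cannot be found this way at all.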