June 24 Data: Cooked and Raw

In which I rant about fetishizing raw data!

I am on holiday! Yes, typing this while overlooking a swathe of farmland south of Exeter, the peace of the grazing cattle and wind-ruffled grasses broken only by the buzzing of a motocross track on the opposite hillside. As a nod to the holiday spirit, I will forbear describing the two days of lab work I accomplished before hitting the vacation trail and instead wax philosophical. About data.

Say “cooked data” to anyone and the image conjured up will be unsavory, as of a nicotine-fingered accountant with a pencil behind each ear, cooking the books. In contrast, say “raw data” and the image will be wholesome: Adam and Eve in their garden, before being contaminated by subjectivity. Unquestionably good stuff.

This is odd: although raw food is occasionally tasty and nutritious, much raw food is indigestible and some of it downright poisonous. Cooking, in contrast, delivers the pleasures and satisfactions of cuisine.

With this metaphor abused, let me turn to data. Nowadays we are increasingly urged to preserve and even upload our raw data. There is much handwringing about data being lost or, horrors!, being preserved in a format that won’t be readable in a few years. The US National Science Foundation demands that all grantees submit a “data management plan” outlining the steps that will be taken to ensure that the raw data live on, ideally in perpetuity, accessible to anyone and everyone. (Note: I am not talking about situations where a dataset, such as a transcriptome, is part of the result. I am talking about all of the data points obtained in the course of a study.)

This emphasis on raw data is misplaced. Apparently, it needs to be said: few data have been preserved since the dawn of, let’s say, the Enlightenment, and I have not noticed any resulting impediment to the progress of science. It is true that rather more data are being generated these days, but that is all the more reason to question the cost of running the server farms needed to keep them accessible to everyone for eternity.

As a philosophical matter, what is the reason for seeking to embalm raw data? I can understand why, in clinical trials, a full record is merited. But even there, it seems that what is really needed amounts to exhaustively complete Materials & Methods and Results sections: not so much raw data as total description. The difficulty of producing such a record is arguably justified by the lives and fortunes at stake.

However, when I do an experiment to test a hypothesis about the control of growth anisotropy, what justifies the cost of eternalizing those raw data? Costs borne by me, because I could be doing another experiment instead of managing data, and costs borne by whoever pays the bills to keep the data online. It is argued that, given access to the raw data, others can repeat the analysis themselves. This strikes me as ridiculous. For one thing, few will have the time to read the paper I publish about this hypothesis, let alone go fossicking through the original data sets. But suppose someone did: there is a practical problem.

The data must be framed with context, what are sometimes called ‘meta-data’. Raw data, in and of themselves, are ones and zeros. As soon as one includes explanations and details, i.e., meta-data, one has left the jungle of the raw and entered the kitchen. Meta-data are provided by the scientist. How much is enough? What sort of details should be supplied? Everyone will have different answers, and these so-called raw data will in fact be cooked, to one degree or another. My PhD research was in part repeating experiments done decades earlier, when the extreme sensitivity of plants to light was unknown; as a consequence, the scientists in question did not report in their materials and methods sections the amount and color of light used to grow and handle the plants. This variable turns out to be essential, so much so that even had these researchers sent their notebooks to the Smithsonian, where they could now be digitized, it would not help. The only reliable way to upload raw data would be to upload an entire four-dimensional replica of the experiment.

Practicalities aside, what bothers me about the raw-data obsession is that it removes the scientist. Such a removal, even if it were possible (see above), is not desirable. I can see why removing the scientist appeals. Science is supposed to be a rational business; and we all know there is nowt so irrational as folk, including scientist folk. We get things wrong; we miss the elephants and emphasize motes. As we squelch through this mire, raw data can loom crystalline and irresistible. But we should resist. We have no choice: science is done by humans, not by robots (at least for now). That, as they say, is life. We have to accept it, even embrace it, and certainly avoid pretending otherwise.

Think about art. No one would seek to take the brushstrokes out of a painting and store them by color in a box. A painter uses brushstrokes to make a comment about the world, large or small, joining imagination to paint in an indissoluble whole. Similarly, a scientist uses data to make a statement about the world. Art is differentiated from science by the character of those statements; nevertheless, both enterprises involve a human being using materials to make meaning. Storing data is no more helpful than storing brushstrokes.

It is the scientist’s job to understand their data and to communicate that understanding. It is in the telling that meaningless ones and zeros become ideas and content. Science cannot be made without data; on that point everyone agrees. But just as surely, science cannot be made without communication. Unfortunately, many forget this, and produce texts that are ugly, turgid, and all but incomprehensible. Rather than fretting about raw data, funding agencies would advance the progress of science more effectively by mandating that scientists improve their communication.