
Data, data everywhere, but how to consume it?

We're told it's a data deluge, a tsunami or even a bonanza - whatever we call it, there is a lot of data, and certainly more than traditional research techniques can cope with. It's coming at us from all directions, and if we can only process part of it then how do we know which part to process, and how do we see patterns in the big picture? And if we're working with multiple different sources this problem rapidly gets worse.

 

The UK e-Science programme was motivated by this very challenge. At the outset we had high-end data sources like the Large Hadron Collider, combinatorial chemistry, DNA sequencing, telescopes and arrays of sensors (be they in our environment, our vehicles or ourselves!). And the deluge isn't confined to scientific data - we have digitisation programmes to convert our historical archives and artefacts into data, we have new government policy to release data that was hidden, and we have secure data services to make data more accessible. Above all, we are all generating data in each and every interaction with the digital world - whether that's shopping with loyalty cards or online, using Facebook (I call it the Large People Collider...), making phone calls or being citizen scientists. Even our electricity consumption is measured at a micro level. And if capturing data from the physical world and its intersection with the digital world isn't enough, we run simulations to generate even more data!

 

At a nuts and bolts level this all leads to challenges in storage, transmission and processing - three facets emphasised in the early days of the Grid and epitomised by the gridftp command! It's also led to the creation of tools to automate data processing, like data analysis pipelines and scientific workflow systems, which let computers get on with the mundane work and free researchers to be creative. While we still occasionally hear "what do you mean by big?" type discussions at e-Science events, there is a growing awareness that it's what you do with the data that counts. The techniques used to process data are themselves essential knowledge when it comes to interpreting results and reproducing research. Hence experimental plans, workflow descriptions and provenance records are all important records of process and very much part of research know-how. If there's a data deluge then perhaps it's generating a method bonanza! All of this leads to a more data-centric research methodology (a.k.a. the “fourth paradigm” of science) as practice changes to be data-led, with new scientific results emerging from mining the volumes of available data.

 

So clearly the big challenge with data is not in generating it. But that data may as well not exist if it isn't discoverable and usable in some way. Generally it is collected with a purpose in mind and it may be fit for that immediate purpose - the challenge is making data fit for re-use by others. This takes extra effort and is really data publication rather than collection, and to continue to be useful the data needs curation too. A still bigger challenge is making data as useful as possible for unanticipated purposes, so that it may be used by others for things you haven't thought of yet. All of these require tools, techniques and effort, and hence extra resource and incentive too. We can proudly say the UK has a strong tradition in this area with our national data providers, archives and services, the Digital Curation Centre, and the libraries community embracing repositories, all of whom, incidentally, are very well represented in e-Research South.

 

Of late the Linked Data movement has made significant progress in terms of tools and techniques for data publication for unanticipated re-use. As the Web gives us linked documents, so the Linked Data Web gives us linked data. Instead of publishing data on individual websites for human consumption (which, incidentally, just creates more data silos), Linked Data is published for automated consumption. To be Linked Data compliant means complying with a set of simple rules which are readily achievable and buy a lot in return – a carefully tuned forcing effect in the digital ecosystem that is clearly working if we measure it by the increasing number of Linked Data sources available. The rules encourage use of common identifiers – which may relate to real objects like a star, chemical, person, place or a museum artefact – and these are what enable multiple sources to be integrated or 'mashed up'. This ease of automated assembly makes it possible to answer new research questions, be they in chemistry, musicology or digital social research, and in fact I see this as exemplifying the next level of e-Research.
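
To make the point about common identifiers concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the two sources, the `example.org` URIs, the property names and the helper functions are all invented for this example, and real Linked Data would normally be published as RDF and handled with a dedicated toolkit rather than plain tuples.

```python
from collections import defaultdict

# Two independently published sources, each a list of
# (subject, property, value) statements. Everything here is hypothetical:
# the URIs, the property names and the values are made up for illustration.
MUSEUM = [
    ("http://example.org/id/person/42", "name", "Ada Example"),
    ("http://example.org/id/person/42", "madeArtefact",
     "http://example.org/id/artefact/7"),
]
LIBRARY = [
    ("http://example.org/id/person/42", "authorOf",
     "http://example.org/id/book/3"),
    ("http://example.org/id/person/99", "name", "Someone Else"),
]


def merge_sources(*sources):
    """Combine statements from several sources, grouped by subject identifier."""
    merged = defaultdict(lambda: defaultdict(set))
    for statements in sources:
        for subject, prop, value in statements:
            merged[subject][prop].add(value)
    return merged


def shared_subjects(a, b):
    """Return the identifiers that both sources make statements about."""
    return {s for s, _, _ in a} & {s for s, _, _ in b}


merged = merge_sources(MUSEUM, LIBRARY)
person = "http://example.org/id/person/42"

# Because both sources use the same identifier for the same person,
# their statements line up with no source-specific matching logic.
for prop in sorted(merged[person]):
    print(prop, sorted(merged[person][prop]))
```

The merge step knows nothing about museums or libraries. The shared identifier does all the work, which is why a third or fourth source could be added without changing the code.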

 

The incentive structures aren't in place yet – we are measured by papers and not by data, let alone by creating the tools and techniques that process it. It's important to distinguish “can't share” from “won't share”. Tools and standards, and appropriate training, help address the “can't”, but “won't” requires individuals to be incentivised to make their data reusable. For this there are sticks and carrots – our funders wield sticks (no more funding unless you publish your data!) but we also need the carrots: to put it bluntly, academics need to gain in personal reputation when they publish reusable data. This is increasingly acknowledged on the national stage and emerging data citation standards will help, but there's nothing to stop individual institutions giving this recognition today.

Professor David De Roure
