As part of the Office of Scholarly Communication Open Access Week celebrations, we are uploading a blog a day written by members of the team. Tuesday is a piece by Dr Danny Kingsley reflecting on the talk she gave this morning to the Royal Society of Chemistry, Chemical Information and Computer Applications Group conference – Measurement, Information and Innovation: Digital Disruption in the Chemical Sciences.
The data policy landscape
The policy position on data management in the UK is driven at many levels. Many institutions now have policies – an example is the Cambridge University Research Data Management Policy Framework. Increasingly, publishers such as PLOS are requiring that research published in their journals is accompanied by the data underpinning it. Some journals, such as Nature's Scientific Data, are specifically data journals.
There has been a country-wide movement towards opening up data. Consultation on the Draft Concordat on Open Research Data released by the RCUK ended on 28 September. Cambridge coordinated a joint response to the Concordat with several other universities.
However, the real driver for action this year has been funder policies – specifically, that of the Engineering and Physical Sciences Research Council (EPSRC), which announced it would check compliance as of 1 May 2015, and has since begun doing so.
The devil is in the detail
While the Research Councils UK have RCUK Common Principles on Data Policy stating “Publicly funded research data are a public good (…), which should be made openly available with as few restrictions as possible”, these common principles are idiosyncratic when looked at from the individual council perspective, as the graphic on the second page of this document demonstrates.
There are variations on whether a data management plan is required, where the data should be stored, the level of support offered and even whether this can be funded through the grant (in most cases it can, but not all).
Places to share data
Open (cross-disciplinary) repositories
These include commercial options such as figshare, which is owned by Digital Science – an offshoot of Macmillan/Springer that also produces the Symplectic Elements research management system.
Open source solutions include Zenodo, developed by CERN and described as 'an open dependable home for the long-tail of science', which enables researchers to share and preserve research outputs of any size, in any format, from any science.
There are a significant number of discipline-specific data repositories. In many ways these are the most natural place for data, as disciplinary experts can curate it. For example, the Natural Environment Research Council (NERC) runs seven repositories.
The first repository ever created (in 1991) was arXiv, which holds e-prints in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics. Gene Expression Omnibus is a public functional genomics data repository. Social science data can be deposited in the UK Data Service, and the Oxford Text Archive holds literary and linguistic texts for higher education.
Cambridge University is using its DSpace repository, Apollo, to store and share data. To date the largest dataset we have received is 28 GB – larger datasets need to sit outside this repository. Not all institutions provide a data repository service: there are significant overheads associated with this activity, both for the technology and for the people who upload and curate the data.
Journals are increasingly requiring researchers to publish (or at least provide links to) their supporting data alongside their research articles. PLOS brought in their data sharing policy in December 2013.
There is also a growing selection of data-only journals on the market, for example Nature's Scientific Data. Others include the Journal of Open Archaeology Data, Open Health Data, the Journal of Open Psychology Data, GigaScience, Biodiversity Data Journal and Earth System Science Data.
The Australian National Data Service has built Research Data Australia (RDA), which collates and displays information about datasets held all over the country, both open and closed. As of 20 October, the RDA contains 115,823 datasets, of which 23,322 (about 20%) are open.
There appears to be little appetite yet for this sort of service in the UK, or at least for providing funds to create one.
What are we actually trying to achieve?
Cambridge University has taken a proactive approach to the RCUK data sharing policies by inviting funders to come and discuss issues with the research community. We have written up and published these discussions as an ‘In conversation with..’ series on our Unlocking Research blog.
In addition to seeking clarification on some aspects of the policies, one of the questions we are trying to answer is: what is the actual goal of these policies? Ben Ryan, Senior Manager for Research Outcomes at the EPSRC, clarified that researchers need to share:
- the data that underpins publications
- the data that validates research findings
- the data that is worth keeping
To summarise the philosophical goals of the EPSRC policy:
- The default position is ‘data should be open’
- Published research findings should be testable
- Maximise the impact of publicly funded research
- Maintain public trust in science and research
- They are trying to create a new research culture
While these might seem like lofty and even admirable goals, that does not mean academics, on being told of their grants' data sharing requirements, have responded with open arms. Far from it, in some cases.
A small selection of the responses we have received in our meetings with over 1,500 researchers at Cambridge this year includes:
- What’s the minimum we can get away with?
- This is crap
- ‘They’ are just doing this because ‘they’ can
- But it will take a huge effort to get the data into a usable form
- No-one will look at it
- What a waste of time
This has prompted us to play a game we call ‘data excuse bingo’ at some of our Research Data Management Workshops – see slide 16.
We are trying to start at the end
The problem might be that everyone is fixated on sharing data at the end of the research process. Yet sharing is one of the lesser data management activities if data management begins at the start of a project.
The second of our "In conversation…" series was with Michael Ball, Strategy and Policy Manager at the Biotechnology and Biological Sciences Research Council (BBSRC). In a long discussion about what exactly constitutes 'data' in the biological sciences, Michael emphasised that disciplines themselves must establish ways of dealing with data. This is the beginning of an ongoing process.
He noted that researchers need to consider how to deal with data from the beginning of a research project. Researchers can ask for money to manage data in the grant application (something which is currently quite rare).
The practice of sharing data requires the data to be: Accessible, Intelligible, Assessable and Reusable. So how do we achieve that?
Basics of Research Data Management
- Writing a research data management plan at the beginning of the research process – identifying all of these issues
- Using a file naming protocol (including version control)
- Backing up work in several places
- Identifying any data that might be politically, personally or commercially sensitive
- Determining who owns what data
- Ensuring data that is being used for research across collaborations is shared in safe, secure and legal shared facilities, bearing in mind Export Control Legislation.
- Having good metadata protocols
- Using a reputable and reliable storage/sharing facility that offers persistent identifiers (DOIs)
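By way of illustration only – none of these conventions is prescribed by the policies discussed above, and the naming scheme here is an invented example – the first three basics in this list can be partly automated. A small helper can compose sortable, versioned filenames, and a checksum recorded alongside each backup lets you verify later that copies are intact:

```python
import hashlib
from pathlib import Path


def versioned_name(project: str, description: str, datestamp: str,
                   version: int, ext: str) -> str:
    """Compose a predictable, sortable filename: project_description_date_vNN.ext"""
    def slug(text: str) -> str:
        # Lowercase and hyphenate so names are safe across filesystems.
        return "-".join(text.strip().lower().split())
    return f"{slug(project)}_{slug(description)}_{datestamp}_v{version:02d}.{ext}"


def checksum(path: Path) -> str:
    """MD5 digest of a file; store it with each backup copy for later verification."""
    return hashlib.md5(path.read_bytes()).hexdigest()


# Example: name a third version of a raw survey file.
name = versioned_name("ProjectX", "raw survey", "2015-10-20", 3, "csv")
# -> "projectx_raw-survey_2015-10-20_v03.csv"
```

Zero-padding the version number (`v03` rather than `v3`) keeps files in order when sorted alphabetically, and embedding the date means the naming protocol doubles as a rudimentary audit trail.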
Who owns the data?
This is an interesting question. EPSRC policies state that researchers should ensure that collaborators are aware of the sharing requirement before they embark on research. Then there are questions about who within the team might own the data – with again a suggestion to come to some sort of agreement before work starts.
Are all collaborators equal 'owners'? Or does the Principal Investigator hold a 50% stake, with the other half split amongst the PhD students who make up the rest of the team? You might want to talk to your legal advisors and/or your research office about this issue.
There is also the related issue of developing author-contribution transparency. Do you include author contribution statements in your articles?
If Michael Ball is correct and very few researchers are asking for funds to manage data, it is reasonable to assume that data is being managed in an ad hoc way – with reliance on the computer-savvy postdoc the project hired …
Required skillsets for managing and curating data
There is a considerable range of skill sets associated with managing data, and these have been described by the Digital Curation Centre as data creator, data scientist, data librarian, data manager.
Alma Swan and Sheridan Brown’s 2008 report to Jisc on ‘The skills, role and career structure of data scientists and curators: an assessment of current practice and future needs’ described these as:
- Data Creator: Researchers with domain expertise who produce data. These people may have a high level of expertise in handling, manipulating and using data
- Data Scientist: People who work where the research is carried out – or, in the case of data centre personnel, in close collaboration with the creators of the data – and may be involved in creative enquiry and analysis, enabling others to work with digital data, and developments in data base technology
- Data Manager: Computer scientists, information technologists or information scientists who take responsibility for computing facilities, storage, continuing access and preservation of data
- Data Librarian: People originating from the library community, trained and specialising in the curation, preservation and archiving of data
There is a simple graphic that clearly shows how these roles relate to one another.
Certainly an increasing number of data scientist jobs are being advertised. A search for ‘Data scientist’ + ‘London’ on the job site Indeed on 18 October produced 1,405 results. So where are all of these people coming from?
The Swan and Brown study in 2008 noted that 'People in data science roles face a big, continuing challenge in remaining properly skilled up', and this remains the situation today, although there are now many more opportunities for training – a few are listed here:
- The Digital Curation Centre offers data management and curation education and training.
- MANTRA is a free online course from the Data Library staff in Information Services, University of Edinburgh. It is designed for those who manage digital data as part of a research project. It was crafted for the use of post-graduate students, early career researchers, and also information professionals. It is freely available on the web for anyone to explore on their own.
- Data Scientist training for Librarians is a collection of notes and discussions about data work done by librarians. They ran an experimental course in August to teach “the latest tools for extracting, wrangling, storing, analysing, and visualising data”.
In addition, the professionalisation of these skills is beginning. For example, Data Science London is a community of data scientists that meets regularly to discuss data science ideas, concepts, and tools, methods and technologies used by many startups to analyse large scale data (big data), extract predictive insight, and exploit business opportunities from data products. Their website offers a list of free data science courses.
Cambridge University is one of the five founding university partners of the soon-to-be-launched Alan Turing Institute, which is intended to be the UK's national institute for data science. The Institute will address some of the issues around the data management skills gap.
Issues with sharing data
For all of the ‘feel good’ ideals about sharing data, and the processes being put in place to support this, there are some serious issues raised by researchers about the requirement to share data.
To start, there is a very real concern that the UK will become unattractive for collaborations. Why would a commercial collaborator choose a UK partner when a partner from elsewhere is not under obligation to share their data?
There have been some discussions at information sessions about the possible need to change the type of research being done to reduce the amount of data being produced because of the cost of long term storage of this data.
Indeed, there is discussion in some circles about whether applying for EPSRC funding is worth the hassle. It is a fair bet that none of these were intended outcomes of the RCUK policies.
Consequences of not sharing data
However, this does need to be a balanced strategy. There is a considerable argument that openness is related to academic integrity as it allows work to be verified and validated.
Here are examples in three disciplines where sharing data had a dramatic effect:
- Medicine – having the data publicly available in two trials of deworming pills demonstrated that a population wide deworming program did not improve school performance.
- Economics – a study widely cited to justify budget cutting in the US contained a mistake in its calculations, which was only revealed when the Excel file was released.
- Physics – it took 12.5 years to withdraw Jan Hendrik Schön's work on 'organic semiconductors' because reviewers were unable to replicate the results without access to the original data or lab books.
Sharing data poses great challenges to the research community, not least because it is far from clear what 'data' means in different disciplines. It will take some time for the research community to change its philosophy and practice. But the positives outweigh the negatives, and with luck we will look back on this time as a short transition period.
Note: the slides from the talk are available on SlideShare.