Sustaining long-term access to open research resources – a university library perspective

In the third in a series of three blog posts, Dave Gerrard, a Technical Specialist Fellow from the Polonsky-Foundation-funded Digital Preservation at Oxford and Cambridge project, describes how he thinks university libraries might contribute to ensuring access to Open Research for the longer-term.  The series began with Open Resources, who should pay, and continued with Sustaining open research resources – a funder perspective.

Blog post in a nutshell

This blog post works from the position that the user-bases for Open Research repositories in specific scientific domains are often very different to those of institutional repositories managed by university libraries.

It discusses how in the digital era we could deal with the differences between those user-bases more effectively. The upshot might be an approach to the management of Open Research that requires both types of repository to work alongside each other, with differing responsibilities, at least while the Open Research in question is still active.

And, while this proposed method of working together wouldn’t clarify ‘who is going to pay’ entirely, it at least clarifies who might be responsible for finding funding for each aspect of the task of maintaining access in the long-term.

Designating a repository’s user community for the long-term

Let’s start with some definitions. One of the core models in Digital Preservation, the International Standard Open Archival Information System Reference Model (or OAIS) defines ‘the long term’ as: 

“A period of time long enough for there to be concern about the impacts of changing technologies, including support for new media and data formats, and of a changing Designated Community, on the information being held in an OAIS. This period extends into the indefinite future.”

This leads us to two further important concepts defined by the OAIS:

“Designated Communities” are “an identified group of potential Consumers who should be able to understand a particular set of information”, i.e. the set of information collected by the ‘archival information system’.

A “Representation Information Network” is the tool that allows the communities to explore the metadata which describes the core information collected. This metadata will consist of:

  • descriptions of the data contained in the repository,
  • metadata about the software used to work with that data,
  • the formats in which the data are stored and related to each other, and so forth.

In the example of the Virtual Fly Brain Platform repository discussed in the first post in this series, the Designated Community appears to be: “… neurobiologists [who want] to explore the detailed neuroanatomy, neuron connectivity and gene expression of Drosophila melanogaster.” And one of the key pieces of Representation Information, namely “how everything in the repository relates to everything else”, is based upon a complex ontology of fly anatomy.

It is easy to conclude, therefore, that you really do need to be a neurobiologist to use the repository: it is fundamentally, deeply and unashamedly confusing to anyone else that might try to use it.

Tending towards a general audience

The concept of Designated Communities is one that, in my opinion, the OAIS Reference Model never adequately gets to grips with. For instance, the OAIS Model suggests including explanatory information in specialist repositories to make the content understandable to the general community.

Long term access within this definition thus implies designing repositories for Designated Communities consisting of what my co-Polonsky-Fellow Lee Pretlove describes as: “all of humanity, plus robots”. The deluge of additional information that would need to be added to support this totally general resource would render it unusable; to aim at everybody is effectively aiming at nobody. And, crucially, “nobody” is precisely who is most likely to fund a “specialist repository for everyone”, too.

History provides a solution

One way out of this impasse is to think about currently existing repositories of scientific information from more than 100 years ago. We maintain a fine example at Cambridge: The Darwin Correspondence Project, though it can’t be compared directly to Virtual Fly Brain. The former doesn’t contain specialist scientific information like that held by the latter – it holds letters, notebooks, diary entries etc – ‘personal papers’ in other words. These types of materials are what university archives tend to collect.

Repositories like Darwin Correspondence don’t have “all of humanity, plus robots” Designated Communities, either. They’re aimed at historians of science, and those researching the time period when the science was conducted. Such communities tend more towards the general than ‘neurobiologists’, but are still specialised enough to enable production and management of workable, usable, logical archives.

We don’t have to wait for the professor to die any more

So we have two quite different types of repository. There’s the ‘ultra-specialised’ Open Research repository for the Designated Community of researchers in the related domain, and then there’s the more general institutional ‘special collection’ repository containing materials that provide context to the science, such as correspondence between scientists, notebooks (which are becoming fully electronic), and rough ‘back of the envelope’ ideas. Sitting somewhere between the two are publications – the specialist repository might host early drafts and work in progress, while the institutional repository contains finished, published work. And the institutional repository might also collect enough data to support these publications, too, like our own Apollo Repository does.

The way digital disrupts this relationship is quite simple: a scientist needs access to her ‘personal papers’ while she’s still working, so, in the old days (i.e. more than 25 years ago) the archive couldn’t take these while she was still active, and would often have to wait for the professor to retire, or even die, before such items could be donated. However, now everything is digital, the prof can both keep her “papers” locally and deposit them at the same time. The library special collection doesn’t need to wait for the professor to die to get their hands on the context of her work. Or indeed, wait for her to become a professor.

Key issues this disruption raises

If we accept that specialist Open Research repositories are where researchers carry out their work, and that the institutional repository’s role is to collect contextual material to help us understand that work further down the line, then what questions does this raise about how those managing these repositories might work together?

How will the relationship between archivists and researchers change?

The move to digital methods of working will change the relationships between scientists and archivists.  Institutional repository staff will become increasingly obliged to forge relationships with scientists earlier in their careers. Of course, the archivists will need to work out which current research activity is likely to resonate most in future. Collection policies might have to be more closely in step with funding trends, for instance? Perhaps the university archivist of the digital future might spend a little more time hanging round the research office?

How will scientists’ behaviour have to change?

A further outcome of being able to donate digitally is that scientists become more responsible for managing their personal digital materials well, so that it’s easier to donate them as they go along. This has been well highlighted by another of the Polonsky Fellows, Sarah Mason at the Bodleian Libraries, who has delivered personal digital archiving training to staff at Oxford, in part based on advice from the Digital Preservation Coalition. The good news here is that such behaviour actually helps people keep their ongoing work neat and tidy, too.

How can we tell when the switch between Designated Communities occurs?

Is it the case that there is a ‘switch-over’ between the two types of Designated Community described above? Does the ‘research lifecycle’ actually include a phase where the active science in a particular domain starts to die down, but the historical interest in that domain starts to increase? I expect that this might be the case, even though it’s not in any of the lifecycle models I’ve seen, which mostly seem to model research as either continuing on a level perpetually, or stopping instantly. But such a phase is likely to vary greatly even between quite closely-related scientific domains. Variables such as the methods and technologies used to conduct the science, what impact the particular scientific domain has upon the public, to what degree theories within the domain conflict, indeed a plethora of factors, are likely to influence the answer.

How might two archives working side-by-side help manage digital obsolescence?

Not having access to the kit needed to work with scientific data in future is one of the biggest threats to genuine ‘long-term’ access to Open Research, but one that I think really does fall to the university to mitigate. Active scientists using a dedicated, domain-specific repository are by default going to be able to deal with the material in that repository: if one team deposits some material that others don’t have the technology to use, then they will as a matter of course sort that out amongst themselves at the time, and they shouldn’t have to concern themselves with what people will do 100 years later.

However, university repositories do have more of a responsibility to history, and a daunting responsibility it is. There is some good news here, though… For a start, universities have a good deal of purchasing power they can bring to bear upon equipment vendors, in order to insist, for example, that they produce hardware and software that creates data in formats that can be preserved easily, and that they grant software licenses in perpetuity for preservation purposes.

What’s more fundamental, though, is that the very contextual materials I’ve argued that university special collections should be collecting from scientists ‘as they go along’ are the precise materials science historians of the future will use to work out how to use such “ancient” technology.

Who pays?

The final, but perhaps most pressing question, is ‘who pays for all this’? Well – I believe that managing long-term access to Open Research in two active repositories working together, with two distinct Designated Communities, might at least make things a little clearer. Funding specialist Open Research repositories should be the responsibility of funders in that domain, but they shouldn’t have to worry about long-term access to those resources. As long as the science is active enough that it’s getting funded, then a proportion of that funding should go to the repositories that science needs to support it. The exact proportion should depend upon the value the repository brings, which might be calculated using factors such as how much the repository is used, how much time using it saves, what researchers’ time is worth, how many Research Excellence Framework brownie points (or similar) come about as a result of collaborations enabled by that repository, etc.
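
As a rough illustration only, the kind of calculation hinted at above might be sketched as follows. Everything here is an assumption made for the example: the class and function names, the idea of valuing the repository purely by researcher time saved, and the cap on the share. Harder-to-quantify factors, such as Research Excellence Framework outcomes, are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class RepositoryUsage:
    """Hypothetical yearly figures for one specialist repository."""
    sessions_per_year: int            # how much the repository is used
    hours_saved_per_session: float    # how much time using it saves
    hourly_cost_of_researcher: float  # what researchers' time is worth


def estimated_annual_value(usage: RepositoryUsage) -> float:
    """Value of the researcher time saved in a year, in the currency of the hourly cost."""
    return (
        usage.sessions_per_year
        * usage.hours_saved_per_session
        * usage.hourly_cost_of_researcher
    )


def funding_share(usage: RepositoryUsage, total_domain_funding: float, cap: float = 1.0) -> float:
    """Fraction of the domain's funding that the estimated value would justify.

    The cap is an illustrative safeguard so the share never exceeds a chosen maximum.
    """
    if total_domain_funding <= 0:
        raise ValueError("total_domain_funding must be positive")
    return min(estimated_annual_value(usage) / total_domain_funding, cap)
```

A funder could then compare the resulting share with what the repository actually costs to run, treating the number as a starting point for discussion rather than a verdict.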

On the other hand, I believe that university / institutional repositories need to find quite separate funding for their archivists to start building relationships with those same scientists, and working with them to both collect the context surrounding their science as they go along, and prepare for the time when the specialist repository needs to be mothballed. With such contextual materials in place, there don’t seem to be too many insurmountable technical reasons why, when it’s acknowledged that the “switch from one Designated Community to another” has reached the requisite tipping point, the university / institutional repository couldn’t archive the whole of the specialist research repository, describe it sensibly using the contextual material they have collected from the relevant scientists as they’ve gone along, and then store it cheaply on a low-energy medium (i.e. tape, currently). It would then be “available” to those science historians that really wanted to have a go at understanding it in future, based on what they could piece together about it from all the contextual information held by the university in a more immediately accessible state.

Hence the earlier the institutional repository can start forging relationships with researchers, the better. But it’s something for the institutional archive to worry about, and get the funding for, not the researcher.

Published 11 September 2017
Written by Dave Gerrard

Creative Commons License

Open at scale: sharing images in the Open Research Pilot

Dr Ben Steventon is one of the participants in the Open Research Pilot. He is working with the Office of Scholarly Communication to make his research process more open and here reports on some of the major challenges he perceives at the beginning of the project.

The Steventon Group is a new group, established last year, which looks at embryonic development, focusing in particular on the zebrafish. To investigate problems in this area the group uses time-lapse imaging and tracks cells in 3D visualisations. This presents many challenges when it comes to data sharing, which they hope to address through the Wellcome Trust Open Research Project. Whilst the difficulties that this group are facing are specific to a particular type of research, they highlight some common challenges across open research: sharing large files, dealing with proprietary software and joining up the different outputs of a group.

Sharing imaging data 

The data created by time-lapse imaging and cell tracking is frequently on a scale that presents a technical, as well as financial, challenge. The raw data consists of several terabytes of film which is then compressed for analysis into 500GB files. These compressed files are of a high enough quality that they can be used for analysis but they are still not small enough that they can be easily shared. In addition the group also generates spreadsheets of tracking data, which can be easily shared but are meaningless without the original imaging files and specific software to allow the two pieces of data to be connected. One solution which we are considering is the Image Data Resource, which is working to make imaging datasets in the life sciences, which have not previously been shareable due to their size, available to the scientific community to re-use.
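
One way the link between the tracking spreadsheets and their imaging files might be recorded is a small manifest shared alongside the data. The sketch below is only an illustration under stated assumptions: the function names, the manifest fields and the free-text software note are hypothetical, and it is not a description of how the group or the Image Data Resource actually handle this.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so that very large image files never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(pairs: dict[Path, Path], software_note: str) -> str:
    """Return a JSON manifest tying each tracking spreadsheet to its imaging file.

    `pairs` maps a tracking spreadsheet to the imaging file it was derived from.
    `software_note` is a free-text description of the software needed to connect them.
    """
    entries = []
    for tracking_file, image_file in pairs.items():
        entries.append({
            "tracking_file": tracking_file.name,
            "image_file": image_file.name,
            "image_size_bytes": image_file.stat().st_size,
            "image_sha256": sha256_of(image_file),
        })
    return json.dumps({"software": software_note, "datasets": entries}, indent=2)
```

Because the manifest stores a checksum and a size for each imaging file, someone who obtains the large file separately can confirm that it is the one the spreadsheet refers to.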

Making it usable

The software used in this type of research is a major barrier to making the group’s work reproducible. The Imaris software the group uses costs thousands of pounds, so anything shared in its proprietary formats is only accessible to an extremely small group of researchers at wealthier institutions, which is in direct opposition to the principles of Open Research. It is possible to use Fiji, an open source alternative, to recreate tracking with the imaging files and tracking spreadsheets; however, the data annotation originally performed in Imaris will be lost when the images are not saved in the proprietary formats.

An additional problem in such analyses is the sharing of protocols that detail the methodologies applied, from the preparation of the samples all the way through data generation and analysis. This is a common problem with standard peer-review journals that are often limited in the space available for the description of methods. The group are exploring new ways to communicate their research protocols and have created an article for the Journal of Visualised Experiments, but these are time consuming to create and so are not always possible. Open peer-review platforms potentially offer a solution to sharing detailed protocols in a more rapid manner, as do specialist platforms such as Wellcome Open Research and Protocols.io.

Increasing efficiency by increasing openness

Whilst the file size and proprietary software in this type of research present some barriers to sharing, there are also opportunities through sharing to improve practice across the community. Currently there are several different software packages being used for visualisation and tracking. Therefore, sharing more imaging data would allow groups to try out different types of images on different tools and make better purchasing decisions with their grant money. Furthermore, there is a great frustration in this area that lots of people are working on different algorithms for different datasets, so greater sharing of these algorithms could reduce the amount of time wasted creating algorithms when it might be possible to adapt a pre-existing one.

Shifting models of scholarly communication

As we move towards a model of greater openness, research groups are facing a new difficulty in working out how best to present their myriad outputs. The Steventon group intends to publish data (in some form), protocols and a preprint at the same time as submitting their papers to a traditional journal. This will make their work more reproducible, and it also allows researchers who are interested in different aspects of their work to access the bits that interest them. These outputs will link to one another, through citations, but this relies on close reading of the different outputs and checking references. The Steventon group would like to make the links between the different aspects of their work more obvious and browsable, so the context is clear to anyone interested in the lab’s work. As the research of the group is so visual it would be appropriate to represent the different aspects of their work in a more appealing form than a list of links.

The Steventon lab is attempting to link and contextualise their work through their website, and it is possible to cross-reference resources in many repositories (including Cambridge’s Apollo), but they would like there to be a more sustainable solution. They work in areas with crossovers to other disciplines – some people may be interested in their methodologies, others the particular species they work on, and others still the particular developmental processes they are researching. There are opportunities here for openness to increase the discoverability of interdisciplinary research and we will be exploring this, as well as the issues around sharing images and proprietary software, as part of the Open Research Pilot.

Published 8 May 2017
Written by Rosie Higman and Dr Ben Steventon

‘Paperless research’ solutions – Electronic Lab Notebooks

The Office of Scholarly Communication started 2017 with a discussion about ‘going digital’ – on 13 January 2017 we organised an event at Cambridge University’s Department of Engineering to flesh out the problems preventing researchers from implementing Electronic Lab Notebook solutions. Chris Brown from Jisc wrote an excellent blog post with his reflections on the event* and agreed to let us re-blog it here.

For researchers working in laboratories, the importance of recording experiments, results, workflows, etc in a notebook is ingrained from their student days. However, these paper-based solutions are not ideal when it comes to sharing and preservation. They pile up on desks and shelves, vary in quality and often include printed data stuck in. To improve on this situation and resolve many of these issues, e-lab notebooks (ELNs) have been developed. Jisc has been involved in this work through funding projects such as CamELN and LabTrove in the past. Recently, interest in this area has been renewed with the Next Generation Research Environment co-design challenge.

On Friday 13 January I attended the E-Lab Notebooks workshop at the University of Cambridge, organised by the Office of Scholarly Communication. Its purpose was to open up the discussion about how ELNs are being used in different contexts and formats, and the concerns and motivations for people working in labs. A range of perspectives and experience was given through presentations, group and panel discussions. The audience were mostly from Cambridge, but there was representation from other parts of the UK, as well as Denmark and Germany. A poll at the start showed that the majority of the audience were researchers (57%).

Institutional and researchers’ perspective on ELNs at Cambridge

The first part of the workshop focussed on the practitioners’ perspective with presentations from the School of Biological Sciences. Alastair Downie (Gurdon Institute) talked about their requirements for an ELN as well as anxieties and risks of adopting a particular system. Research groups currently use a variety of tools, such as Evernote and Dropbox, and often these are trusted more than ELNs. The importance of trust frequently came up during the day. Alastair conducted a survey to gather more detail on the use and requirements of ELNs and received an impressive 345 responses. Cost and complexity were given as the main reasons not to use ELNs. However, when asked for the most important features, cost was less important and ease of use was the most important. Researchers want training, voice recognition and remote access. There is clear interest across the school at all levels, but it requires a push with guidance and direction.

Marko Hyvönen (Dept of Biochemistry) gave the PI perspective and the issues with an ELN for a biochemical lab. He reinforced what Alastair had said about ELNs. He showed how paper log books pile up, deteriorate over time and sometimes include printed information. They are hard to read and easy to destroy, a poor return on effort, often disappear and are not searchable. It was interesting to hear about bad habits such as storing data in non-standardised ways, missing data, and printing out Word documents and sticking them into the lab books.

With 99% of their data electronic, many of the issues in the use of lab books generally are around data management and not ELNs. An ELN solution should be easy to use, cross platform, have a browser front end, be generic/adaptable, allow sharing of data and experiments, enforce Standard Operating Procedures when needed, have templates for standard work to minimise repetition, and allow data to be input from phones and other non-specific devices. What they don’t want are the “bells and whistles” features they don’t use. Getting buy-in from people is the top issue to overcome in implementing an ELN.

Views on ELNs from outside the UK

Jan Krause from the École Polytechnique Fédérale de Lausanne (EPFL) gave a non-UK perspective on ELNs. He described a study, as part of a national RDM project, where they separated ELNs (75 proprietary, 12 open source – 91 features) and Lab Info Management Systems (LIMS) (281 proprietary, 9 open source – 95 features) and compared their features. The two tools used mostly in Switzerland are SLims (commercial solution) and openBIS (homemade tool). To decide which tool to use they undertook a three phase selection process. The first selection was based on disciplinary and technical requirements. The second selection involved detailed analysis based on user requirements (interviews and evaluation weighted by feature) and price. The third selection was tendering and live demos.
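
To make the first two selection phases more concrete, here is a minimal sketch of how such a shortlist might be computed. It is not the EPFL method: the class and function names, the rating scale, the weights and the use of price as a simple budget cap are all assumptions made for illustration. The third phase (tendering and live demos) is a human process and is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    """A hypothetical ELN or LIMS product under evaluation."""
    name: str
    features: set[str]                                      # capabilities the tool offers
    ratings: dict[str, float] = field(default_factory=dict)  # user ratings per criterion
    annual_price: float = 0.0


def meets_requirements(candidate: Candidate, required: set[str]) -> bool:
    """Phase 1: keep only tools offering every required disciplinary/technical feature."""
    return required <= candidate.features


def weighted_score(ratings: dict[str, float], weights: dict[str, float]) -> float:
    """Phase 2: weighted average of ratings; criteria without a rating count as 0."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w * ratings.get(name, 0.0) for name, w in weights.items()) / total_weight


def shortlist(candidates: list[Candidate], required: set[str],
              weights: dict[str, float], budget: float) -> list[Candidate]:
    """Filter by requirements and budget, then rank by weighted score, best first."""
    eligible = [
        c for c in candidates
        if meets_requirements(c, required) and c.annual_price <= budget
    ]
    return sorted(eligible, key=lambda c: weighted_score(c.ratings, weights), reverse=True)
```

The tools at the top of the resulting list would be the ones invited to tender and to give live demos.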

Data storage, security and compliance requirements

When using and sharing data you need to make sure your data is safe and secure. Kieren Lovell, from the University Information Services, talked about how researchers should keep their data and accounts safe. Since he started in May 2015, all successful hacks on the university have been due to human error, such as unpatched servers, failures in processes, bad password management, and phishing. Even if you think your data and research isn’t important, the reputational damage of security attacks to the university is huge. He recommended that any research data be shared through cloud providers rather than email, and that researchers never trust public wifi, as it is not secure, and use Cambridge’s VPN service instead. If using a local machine you should encrypt your hard drive.


Providers’ perspective

In the afternoon, presentations were from the providers’ perspective. Jeremy Frey, from the University of Southampton, talked about his experience of developing an open source ELN to support open and interdisciplinary science. He works on getting the people and technology to work together. It’s not just recording what you have done, you need to include the narrative behind what you do. This is critical for understanding and ELNs are one part of the digital ecosystem in the lab. The solution they’ve developed is LabTrove, partly funded by Jisc, which is a flexible open source web based solution. Allowing pictures to be added to the notes has really helped with accessibility and usability, for example for researchers with dyslexia. Sustainability came up, as is often the case, along with how a community is required to support such a system. It also needs to expand beyond Southampton. Finally, Jeremy used Amazon Echo to query the temperature within part of his lab. He hopes that this will be used more in the lab in the future when it can recognise each researcher’s voice.

In the next two presentations, it was over to the vendors to show the advantages of adopting RSpace (by Rory Macneil) and Dotmatics (by Dan Ormsby). The functionality on offer in these types of solutions is attractive for scientists and RSpace showed how it links to most common file stores. Any ELN should enhance researchers’ workflow and integrate with the tools they use.

Removing the barriers

After lunch there were three parallel focus group discussions. I attended the one on sustainability, something that comes up frequently in discussions, particularly when looking at open source or proprietary solutions. Each group reported back as follows:

Focus group 1: Managing the supplier lock in risk

Stories of use need to be shared. The PDF is not a great format for sharing. Vendors tell the truth like estate agents. We have to accept the reality that we won’t have 100% export functionality, so we need to decide the minimum acceptable level. Determine specific users’ requirements.

Focus group 2: Sustainability of ELN solutions

What is the lifetime of an ELN? How long should everything be accessible? Various needs come from group and funder requirements, e.g. 10 years. There is concern if you are relying on one commercial solution as companies can die, so how can you guarantee the data will be available? Have exit policies and support standards and interoperability so data can be moved across ELNs. Broken links and file formats expiring are not just an ELN problem, but relate to the archiving of data in general. Should selection and support of an ELN be at group, department, institution or national level? This is difficult if it’s in one group as adopting any technical solution requires support in place. It requires institutional level support.

Focus group 3: Human element of ELN implementation

The biggest hurdle is culture change and showing the benefits of using an ELN. Training and technical support cost money and time. It would cost more initially but becomes more efficient. You can incentivise people by having champions. There are different needs in a large institution. You may join a lab and find the ELN is not adequate. Legal issues around sensitive data complicate matters. You need to believe it will save time. Long term solutions include using cloud-based solutions, even MS Office, but what happens when people leave? Need support from higher level. Functionality should be based on user requirements. A start would be to set up a mailing list of people interested in ELNs.

Remaining barriers to wide ELN adoption

Finally, I chaired a panel session with all the presenters. Marta Teperek had kindly asked me to give a short presentation on what Jisc does as many researchers don’t know (in fact I was asked “what’s Jisc?” in the focus group) and to promote the Next Generation Research Environment co-design challenge. Following my presentation the discussion was prompted by questions from the audience and remotely via sli.do. Much of the discussion re-iterated what had been said in the presentations, such as the importance of an ELN that meets the requirements of researchers. It should allow integration with other tools and exporting of the data for use in other ELNs. Getting ELNs used within a department is often difficult so it does need institution level commitment and support. Without this ELNs are unlikely to be adopted within an institution, never mind nationally. One size does not fit all and we should not try to build an ELN that tries to satisfy the different needs of various disciplines. A modular system that integrates with the tools and systems already in use would be a better solution. Much of what was said tallied with the feedback received for the Next Generation Research Environment co-design challenge.

Closing remarks

Ian Bruno closed the workshop and he reiterated what was said in the panel discussion. I found the event extremely helpful and it provided lots of useful information to feed into the Next Generation Research Environment work. I’d like to thank Marta Teperek for inviting me to chair the panel and for all her hard work putting the event together with @CamOpenData. Marta has put together the tweets from the day into the following storify.  All notes and presentations from the event are now published in Apollo, the University of Cambridge’s research repository.

Follow-up actions at the University of Cambridge – give it a go!

Those of you who are interested in ELNs and who are based at the University of Cambridge might be interested in knowing that we are planning to offer trial access to Electronic Lab Notebooks (ELNs). The purpose of this trial will be to test out several ELNs to decide on solutions which might best meet the requirements of the research community. A mailing list has been set up for people who are interested in being part of this pilot or would like to be involved in these discussions. If you would like to be added to the mailing list, please fill in the form here: https://lists.cam.ac.uk/mailman/listinfo/lib-eln

*Originally published by Jisc on 18 January 2017.

Published on 29 January 2017
Written by Chris Brown