Next steps for Text & Data Mining

Sometimes the best way to find a solution is to just get the different stakeholders talking to each other – and this is what happened at a recent Text and Data Mining symposium held in the Engineering Department at Cambridge.

The attendees were primarily postgraduate students and early career researchers, but senior researchers, administrative staff, librarians and publishers were also represented in the audience.

Background

This symposium grew out of a discussion held earlier this year to consider the issue of Text and Data Mining (TDM) and what a TDM library service at Cambridge might look like. The general outcome of that meeting of library staff was that people wanted to know more. Librarians at Cambridge have since developed a Text and Data Mining libguide to assist.

So this year the OSC has been doing some work around TDM, including running a workshop at the Research Libraries UK annual conference in March. This was a discussion about developing a research library position statement on Text and Data Mining in the UK. The slides from that event are available, and we published a blog post about the discussion.

We have also had discussions with different groups about this issue, including the FutureTDM project, which has been looking to increase the amount of TDM happening across Europe and is now finishing up. The impression we have from around the sector is that ‘everyone wants to know what everyone else is doing’.

Symposium structure

With this general level of understanding of TDM as our starting point, we structured the day to provide as much information as possible to the attendees. The Twitter hashtag for the event was #osctdm, and the presentations from the event are online.

The keynote presentation was by Kiera McNeice from the FutureTDM project, who gave an overview of what TDM is, how it can be achieved and what the barriers are. There is a video of her presentation (note there were some audio issues at the beginning of the recording).
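For readers new to the area, the snippet below is a minimal, illustrative sketch of the kind of task TDM covers – it is not taken from the keynote, and the abstracts are hypothetical stand-ins for a mined corpus. It simply counts term frequencies across a handful of texts using only the Python standard library, the most basic building block of a text-mining workflow.

```python
import re
from collections import Counter

# Hypothetical abstracts standing in for a mined corpus of articles.
abstracts = [
    "Graphene oxide membranes show selective water permeation.",
    "We report water permeation rates through graphene laminates.",
    "Selective ion transport was observed in oxide membranes.",
]

def tokenise(text):
    """Lowercase a text and split it into simple word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Count term frequencies across the whole corpus.
counts = Counter()
for abstract in abstracts:
    counts.update(tokenise(abstract))

# Report the most common terms – a (very) crude signal of what the corpus is about.
for term, n in counts.most_common(5):
    print(f"{term}: {n}")
```

Real TDM pipelines add entity recognition, relation extraction and machine learning on top of this kind of counting, but the point of the sketch is that the raw material is full-text access – which is where the legal and licensing barriers discussed during the day come in.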

The event broke into two parallel sessions after this. The main room was treated to a presentation about Wikimedia from Cambridge’s Wikimedian in Residence, Charles Matthews. Then Alison O’Mara-Eves discussed Managing the ‘information deluge’: How text mining and machine learning are changing systematic review methods. A video of Alison’s presentation is available.

In the breakout room, Dr Ben Outhwaite discussed Marriage, cheese and pirates: Text-mining the Cairo Genizah before Peter Murray Rust spoke about ContentMine: mining the scientific literature.

After lunch, Rosemary Dickin from PLOS talked about Facilitating Text and Data Mining: how an open access publisher supports TDM. PhD candidate Callum Court presented ChemDataExtractor: A toolkit for automated extraction of chemical information from the scientific literature. This presentation was filmed.
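To give a flavour of what such a toolkit does, the sketch below is based on ChemDataExtractor's documented interface (the Document class, its cems property and records.serialize()); it is indicative only, and the exact names and output may differ between versions of the package.

```python
# Requires the chemdataextractor package (pip install chemdataextractor).
from chemdataextractor import Document

# A short fragment standing in for a paragraph from a published paper.
text = ("The UV-vis spectrum of 5,10,15,20-tetra(4-carboxyphenyl)porphyrin "
        "was recorded in tetrahydrofuran (THF).")

doc = Document(text)

# Chemical entity mentions identified in the text (Span objects with text and offsets).
for span in doc.cems:
    print(span.text, span.start, span.end)

# Structured records (chemical names, properties) extracted from the document.
print(doc.records.serialize())
```

The appeal for researchers is clear: run this over thousands of papers and you have a machine-readable database of chemical information – provided, of course, that the papers can be downloaded and mined in the first place.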

In the breakout room, a discussion about how librarians support TDM was led by Yvonne Nobis and Georgina Cronin. In addition there was a presentation from John McNaught, Deputy Director of the National Centre for Text Mining (NaCTeM), who presented Text mining: The view from NaCTeM.

Round table discussion

The day concluded with the group reconvening together for a roundtable (which was filmed) to discuss the broader issue of why there is not more TDM happening in the UK.

We kicked off by asking each of the people who had presented during the event to describe what they saw as the major barrier to TDM. The answers ranged from the difficulty of recruiting and training staff, to the legal challenges and policies needed at institutional level to support TDM, to the failure of institutions and government to show leadership on the issue. We then opened the floor to discussion.

A librarian described what happens when a publisher cuts off access, including the process the library has to go through with various areas of the University to reinstate access. (Note this was the reason why the RLUK workshop concluded with the refrain: ‘Don’t cut us off!’). There was some surprise in the group that this process was so convoluted.

However, the suggestion that researchers let the library know when they want to do TDM so that the library can organise permissions was rejected by the group, on the grounds both that it is impractical for researchers to do this and that the effort of obtaining permission would take too long.

A representative from Taylor and Francis suggested that researchers contact publishers directly and let them know. Again this was rejected as ‘totally impractical’ because of the assumption it made about the nature of research. Far from being a linear, planned activity, research is iterative: having to request access for a period of three months and then go back to extend that permission if the work took an unexpected turn would be unworkable, particularly across multiple publishers.

One attendee in her blog about the event noted: “The naivety of the publisher, concerning research methodology, in this instance was actually quite staggering and one hopes that this publisher standpoint isn’t repeated across the board.”

Some researchers described the threats they had received from publishers about downloading material. There was anger about the inherent message that the researcher had done something criminal.

There was also some concern that TDM will drive price increases as publishers see ‘extra value’ to be extracted from their resources. This sparked a discussion about how people will experiment with anything that is made digitally available.

During the hour-long session the conversation moved from high-level problems to workflows: how do we actually do this? As is the way with these types of events, it was really only in the last 10 minutes that the real issues emerged. What was clear was something I have repeatedly observed over the past few years – that the players in this space, including librarians, researchers and publishers, have very little idea of how the others work and what they need. I have actually heard people say: ‘If only they understood…’

Perhaps it is time we started having more open conversations?

Next steps

Two things have come out of this event. The first is that people have very much asked for some hands-on sessions. We will have to look at how we deliver these, as they are likely to be quite discipline-specific.

The second is that there is clearly a very real need for publishers, researchers and librarians to get into a room together to discuss the practicalities of how we move forward with TDM. One of the comments on Twitter was that we need to have legal expertise in the room for this discussion. We will start planning this ‘stakeholder’ event after the summer break.

Feedback

The items that people identified as the ‘one most important thing’ they learnt were instructive. The answers reflect how unaware people are of the tools and services available, and of how access to information works. Many of the responses listed specific tools or services they had found out about; others commented on the opportunities for TDM.

There were many comments about publishers, both the bad:

  • Just how much impact the chilling effect of being cut off by publishers has on researchers
  • That researchers have received threats from publishers
  • Very interesting about publishers and ways of working with them to ensure not cut off
  • Lots can be done but it is being hindered by publishers

and the good:

  • That PLOS is an open access journal
  • That there are reasonable publishing companies in the UK
  • That journals make available big data for meta analysis

Commentary about the event

There have been some online discussions and blog posts about the event:

Published 17 August 2017
Written by Dr Danny Kingsley 

Planning scholarly communication training in the UK

In June 2017 a group of people (see the end of this post for attendees) met in London to discuss the issues around scholarly communication training delivery in the UK. Representatives from RLUK, UKSG, SCONUL, UKCoRR, Vitae, Jisc and some universities had a workshop to nut through the problem. Possibly because of the nature of the group, the discussion was very library-centric, but this does not preclude the need for training outside the library sector. This blog post is a summary of the discussion from that day.

Background

The decision to hold a meeting like this came out of a library skills workshop run at UKSG recently. In the ensuing discussions, it was agreed that it would be a good idea to get stakeholders together for a symposium of some description to try and nut out how we could collaborate and provide training solutions for scholarly communication across the sector. There is plenty of space in this area for multiple offerings, but we do want to make sure we are covering the range of areas and the types of delivery modes and levels required. In preparation for the discussion the group created a document listing the scholarly communication training currently on offer.

What is scholarly communication?

An informal survey of research libraries in the UK earlier this year showed that while all respondents had some kind of service that supports aspects of scholarly communication, only half actually used the term ‘scholarly communication’ to describe those services.

A discussion around the table concluded that the term scholarly communication encompasses a wide range of definitions. Some libraries draw the boundary at post-publication activity. Others address the pre-publication aspect and meet the needs of Early Career Researchers for advice on publishing. Services can focus on how academics profile themselves and their research, or on the research lifecycle. In some cases there is a question about whether research data management is part of the equation.

The failure of library schools to deliver

It is fairly universally acknowledged that it is a challenge to engage with library schools on the issue of scholarly communication, despite repositories having been a staple part of research library infrastructure for well over a decade. There are a few exceptions, but generally open access and other aspects of scholarly communication are completely absent from the curricula. (Note: any library school that wishes to challenge this statement, or to provide information about upcoming plans, is welcome to send these through to info@osc.cam.ac.uk)

This raises the question – if library schools are not providing this, how do we recruit and train the staff we need? Indeed, who are we actually recruiting? Is it essential for staff to have a library degree, or experience in an academic library? Or are our requirements more functional, such as the ability to manipulate large data sets, experience working with academics, or an understanding of the Higher Education environment?

While libraries are starting to employ postgraduate researchers because of the skills they can lend to the library, library culture is a consideration. Employing researchers who are not librarians has the benefit of bringing in expertise from outside, but there are challenges in integrating their work into the library culture. We need to look at competencies in terms of the structure and size of the organisation, both for current staff and staff of the future.

In the absence of scholarly communication instruction within the basic qualification, skills training in this space would appear to need to be addressed at the level of the profession.

One possible route to preparing the next generation is to offer a modular approach of on-the-job learning with very practical experience. An option could be to work with people who have come from outside the library space. Given that libraries seem to be starting to bring these skill sets in, we need to consider how this sits with the existing profession.

Audiences and their training needs

The goal of the meeting was to resolve what kinds of training the sector needs, for whom, and how it should be delivered. For example, many general library staff have a basic need to understand the issues around scholarly communication. The number one question is ‘what is scholarly communication?’ Possibly it is enough for these people simply to be familiar with the terminology.

It is possible we need lots of short courses on general topics – what open access is, the basics of RDM and so on – that could potentially be delivered online, but probably fewer, more complex courses on issues like analysing publisher and funder policies. There are also debates and higher-order areas which require face-to-face discussion.

  • Front facing staff
    • Need an overview so the language is familiar and they can refer queries on
  • People working in scholarly communication
    • Day to day practicalities of funder open access compliance
  • Specialist roles in scholarly communication
    • Specific areas
  • Senior managers
    • Very much need a refresher so they can help their staff.
    • Similar overview training; at this level leadership is around advocacy
    • Need conceptual framework for scholarly communication – how do the technical parts sit together for the infrastructure and governance of institutions
    • Stakeholder management skills.

Skill sets in scholarly communication

It was agreed that budgetary, presentation and negotiation skills are needed in this area as general skills. When it comes to specialist skills these include:

  • Research Integrity
  • Bibliometrics
    • Involved in providing specialist advice on metrics within a school discussion
    • Providing advice on impact
  • Pushing the open research agenda
  • Academic reward structure
  • Technical and infrastructure skills, e.g. integrating ORCIDs

Considerations – Lack of perceived need?

There appears to be a problem with a lack of perceived need for training in this space. We are encountering situations where people in libraries are saying ‘I don’t think this is our job’. This raises the question of what we should be presenting librarianship as – what kind of people do we want in the profession? The job of a ‘traditional librarian’ of 20 years ago is not the same job now; the skills are different. Today much of an academic librarian’s job is about winning over people who don’t want to hear the message. It is possible there does need to be a different sort of person pushing an open access agenda.

There have been other innovations in library work that required engaging different behaviours and tasks in the past. For example, is this move towards a scholarly communication future different from when discovery search services were introduced? The eResources experience is similar in terms of the new competencies required in the profession. However, the difference in the scholarly communication environment is that there is an external driver – we need to understand the politics of how open access can move forward in the UK.

Considerations – budgets

There is a mismatch between what people would love to have, what can be designed and what people can afford. Anecdotally the group heard that training budgets are really squeezed, so priority and focus might be heavily influenced by this, with geography and travelling costs being central to decisions.

The group discussed the need to make training accessible to all. Even free events can be prohibitive in terms of travel, and hosting them in off-peak periods can help with costs. The blockage is not just money; it includes time, in terms of the loss of a team member while they are away. This is particularly problematic if scholarly communication is only a part of their job. Most of the need comes from really small institutions where the work is part of a bigger role; however, that is where there is little money. This also raises challenges for the time available for those people to self-educate.

UKSG runs events in London, which is expensive for organisations north of London to attend. To increase participation UKSG is now trying to put on regional events, and has shifted its training to a webinar programme rather than face-to-face sessions.

SCONUL has done basic copyright training, and this has thrown up price sensitivity. One solution is to keep it local, with members volunteering staff in kind.

One option could be online training where participants log on at a certain time once a week for 10 weeks. Many of the people in scholarly communication work in universities and have distance education software available to them. An alternative is having courses done in house – that could be part of a modular package (but how do you link this?). The course content needs to be agnostic enough to be useful (not discussing DSpace or PURE, for example) before delving into institutional specifics. Make it modular with core principles and then have options.

There was a suggestion that we create a non-profit shared collaborative service. The costs of developing this type of deliverable include developing the training materials, infrastructure, room hire, catering and so on. Can we make it all online and available? This could work if it were modular.

Next steps

We have not yet bottomed out the need – perceptions of need at practitioner level and at senior management level might be different. Cost is an issue here. Universities need to work out how much it costs to do in-house training – what is the opportunity cost of employing a staff member without experience or training and then getting them up to speed?

It would be useful to have an understanding of what training is happening within institutions. What subjects and topics are being taught, who is doing it, what language is being used, and is there a dedicated staff member? Where else do people get information and support?

The general plan is to reconvene in September.

Useful Resources

Skill sets analyses

Here are links to work that has already been done on the required skill sets:

Organisations providing or coordinating training

Several organisations are running similar events, and participants then have to choose what to focus on. If we divvy the work up across the sector it might help the situation.

The Society for College, National and University Libraries (SCONUL) does basic copyright training, with more focus on the leadership end of the equation. The Collaboration Strategy Group is considering a shared service. People come from non-traditional backgrounds, and this reflects a broader set of skills being required in libraries than traditional library courses provide. SCONUL is about to scope out where those services might be and try to identify needs into the future. There are challenges in recruiting people, given the slightly moralistic nature of library culture and whether it is welcoming of people from different backgrounds. How do we promote, retain and incentivise people who may not come from this area?

Research Libraries UK (RLUK) don’t do direct training, but they do have programmes of work and networks around these issues. The RLUK board recently had a meeting to look at a new strategy – updating the existing 2014-2017 RLUK Strategy. They are looking at the bigger picture for scholarly communication – the infrastructure challenges, the bigger picture related to licensing and costs, and how to leverage members in the consortia. Their role is very much supporting and helping out.

UK Serials Group (UKSG) runs a conference programme. One-day events are a mix of standing repeated courses and one-off sessions. At conferences it is often the breakout sessions that people find really valuable; these include soft skills like mindfulness in leadership. The audience tends to be practitioners, people in mid-career. Traditional library areas have been focused around collection management because that is where publishers are, but it is not just about traditional publishing: these are UKSG’s members, and that is moving the UKSG agenda to meet their needs. UKSG cannot get anywhere in contributing to university publishing courses. Libraries are starting to employ people who have publishing backgrounds.

The Association of Research Managers and Administrators (ARMA) has special interest groups in open access. (Note: ARMA were invited to this meeting but unfortunately couldn’t attend.)

The Chartered Institute for Library and Information Professionals (CILIP) conducts training at a local level. It was agreed that we can’t have this conversation without having CILIP in the room – they want to offer more support for academic libraries and seem to be recognising that the library schools programme is not the be-all and end-all for CILIP any more. This is partly why they have developed a recognised trainer programme. (Note: CILIP were invited to this meeting but unfortunately couldn’t attend.)

Representatives attending the discussion

  • Helen Dobson – Manchester University
  • Danny Kingsley – Cambridge University
  • Claire Sewell – Cambridge University
  • Anna Grigson representing UKSG
  • Fiona Bradley – RLUK
  • Ann Rossiter – SCONUL
  • Katie Wheat – Vitae
  • Sarah Bull – UKSG
  • Stephanie Meece – UKCoRR
  • Frank Manista – Jisc
  • Helen Blanchett – Jisc (a member of the group coordinating the meeting, but was unable to attend on the day)

ARMA and CILIP were also invited but were not able to send a representative.

Published 15 August 2017
Written by Dr Danny Kingsley 

Sustaining open research resources – a funder perspective

This is the second in a series of three blog posts which set out the perspectives of researchers, funders and universities on support for open resources. The first was Open Resources: who should pay? In this post, David Carr from the Open Research team at the Wellcome Trust provides the view of a research funder on the challenges of developing and sustaining the key infrastructures needed to enable open research.

As a global research foundation, Wellcome is dedicated to ensuring that the outputs of the research we fund – including articles, data, software and materials – can be accessed and used in ways that maximise the benefits to health and society.  For many years, we have been a passionate advocate of open access to publications and data sharing.

I am part of a new team at Wellcome which is seeking to build upon the leadership role we have taken in enabling access to research outputs.  Our key priorities include:

  • developing novel platforms and tools to support researchers in sharing their research – such as the Wellcome Open Research publishing platform which we launched last year;
  • supporting pioneering projects, tools and experiments in open research, building on the Open Science Prize which we ran with the NIH and the Howard Hughes Medical Institute;
  • developing our policies and practices as a funder to support and incentivise open research.

We are delighted to be working with the Office of Scholarly Communication on the Open Research Pilot Project, where we will work with four Wellcome-funded research groups at Cambridge to support them in making their research outputs open.  The pilot will explore the opportunities and challenges, and how platforms such as Wellcome Open Research can facilitate output sharing.

Realising the long-term value of research outputs will depend critically upon developing the infrastructures to preserve, access, combine and re-use outputs for as long as their value persists.  At present, many disciplines lack recognised community repositories and, where they do exist, many cannot rely on stable long-term funding.  How are we as a funder thinking about this issue?

Meeting the costs of outputs sharing

In July 2017, Wellcome published a new policy on managing and sharing data, software and materials.  This replaced our long-standing policy on data management and sharing – extending our requirements for research data to also cover original software and materials (such as antibodies, cell lines and reagents).  Rather than ask for a data management plan, applicants are now asked to provide an outputs management plan setting out how they will maximise the value of their research outputs more broadly.

Wellcome commits to meet the costs of these plans as an integral part of the grant, and provides guidance on the costs that funding applicants should consider. We recognise, however, that many research outputs will continue to have value long after the funding period comes to an end. Further, while it is not appropriate to make all research data open indefinitely, researchers are expected to retain data underlying publications for at least ten years (a requirement which was recently formalised in the UK Concordat on Open Research Data). We must accept that preserving and making these outputs available into the future carries an ongoing cost.

Some disciplines have existing subject-area repositories which store, curate and provide access to data and other outputs on behalf of the communities they serve.  Our expectation, made more explicit in our new policy, is that researchers should deposit their outputs in these repositories wherever they exist.  If no recognised subject-area repository is available, we encourage researchers to consider using generalist repositories – such as Dryad, FigShare and Zenodo – or if not, to use institutional repositories.  Looking ahead, we may consider developing an orphan repository to house Wellcome-funded research data which has no other obvious home.

Recognising the key importance of this infrastructure, Wellcome provides significant grant funding to repositories, databases and other community resources.  As of July 2016, Wellcome had active grants totalling £80 million to support major data resources.  We have also invested many millions more in major cohort and longitudinal studies, such as UK Biobank and ALSPAC.  We provide such support through our Biomedical Resource and Technology Development scheme, and have provided additional major awards over the years to support key resources, such as PDB-Europe, Ensembl and the Open Microscopy Environment.

While our funding for these resources is not open-ended and is subject to review, we have been conscious for some time that the reliance of key community resources on grant funding (typically of three to five years’ duration) can create significant challenges, hindering their ability to plan for the long term and retain staff. As we develop our work on Open Research, we are keen to explore ways in which we can adapt our approach to help put key infrastructures on a more sustainable footing, but this is a far from straightforward challenge.

Gaining the perspectives of resource providers

In order to better understand the issues, we did some initial work earlier this year to canvass the views of those we support. We conducted semi-structured interviews with the leaders of 10 resources in receipt of Wellcome funding – six database and software resources, three cohort resources and one materials stock centre – to explore their current funding, long-term sustainability plans and thoughts on the wider funding and policy landscape.

We gathered a wealth of insights through these conversations, and several key themes emerged:

  • All of the resources were clear that they would continue to be dependent on support from Wellcome and/or other funders for the long-term.
  • While cohort studies (which provide managed access to data) can operate cost-recovery models to transfer some of the cost of accessing data onto users, such models were not appropriate for data and software resources which commit to open and unrestricted access.
  • Several resources had additional revenue-generation routes – including collaborations with commercial entities – and these had delivered benefits in enhancing their resources. However, the level of income was usually relatively modest in terms of the total cost of sustaining the resource. Commitments to openness could also limit the extent to which such arrangements were feasible.
  • Diversification of funding sources can give greater assurance and reduce reliance on single funders, but can bring an additional burden. There was felt to be a need for better coordination between funders where they co-fund resources. Europe PMC, which has 27 partner funders but is managed through a single grant, is a model which could be considered.
  • Several of the resources were actively engaged in collaborations with other resources internationally that house related data – it was felt that funders could help further facilitate such partnerships.

We are considering how Wellcome might develop its funding approaches in light of these findings. As an initial outcome, we plan to develop guidance for our funded researchers on key issues to consider in relation to sustainability. We are already working actively with other funders to facilitate co-funding and to streamline decision-making as much as possible, and we wish to explore how we might join forces in the future in developing our broader approaches for funding open resources.

Coordinating our efforts

There is growing recognition of the crucial need for funders and the wider research community to work together to develop and sustain research data infrastructure. As the first blog in this series highlighted, the scientific enterprise is global and this is an issue which must be addressed at an international level.

In the life sciences, the ELIXIR and US BD2K initiatives have sought to develop coordinated approaches for supporting key resources and, more recently, the European Open Science Cloud initiative has developed a bold vision for a cloud-based infrastructure to store, share and re-use data across borders and disciplines.

Building on this momentum, the Human Frontier Science Program convened an international workshop last November to bring together data resources and major funders in the life sciences. This resulted in a call for action (reported in Nature) to coordinate efforts to ensure the long-term sustainability of key resources, whilst supporting resources in providing access at no charge to users. The group proposed an international mechanism to prioritise core data resources of global importance, building on the work undertaken by ELIXIR to define criteria for such resources. It was proposed that national funders could then contribute a set proportion of their overall funding (with initial proposals suggesting around 1.5 to 2 per cent) to support these core data resources.

Grasping the nettle

Public and charitable funders are acutely aware that many of the core repositories and resources needed to make research outputs discoverable and useable will continue to rely on our long-term funding support.  There is clear realisation that a reliance on traditional competitive grant funding is not the ideal route through which to support these key resources in a sustainable manner.

But no one yet has a perfect solution and no funder will take on this burden alone. Aligning global funders and developing joint funding models of the type described above will be far from straightforward, but hopefully we can work towards a more coordinated international approach. If we are to realise the incredible potential of open research, it’s a challenge we must address.

Published 26 July 2017
Written by David Carr, Wellcome Trust (d.carr@wellcome.ac.uk)
