Tag Archives: TDM

Next steps for Text & Data Mining

Sometimes the best way to find a solution is to just get the different stakeholders talking to each other – and this is what happened at a recent Text and Data Mining symposium held in the Engineering Department at Cambridge.

The attendees were primarily postgraduate students and early career researchers, but senior researchers, administrative staff, librarians and publishers were also represented in the audience.

Background

This symposium grew out of a discussion held earlier this year among library staff at Cambridge to consider the issue of TDM and what a TDM library service here might look like. The general outcome of that meeting was that people wanted to know more. Librarians at Cambridge have since developed a Text and Data Mining libguide to assist.

So this year the OSC has been doing some work around TDM, including running a workshop at the Research Libraries UK annual conference in March. This was a discussion about developing a research library position statement on Text and Data Mining in the UK. The slides from that event are available and we published a blog post about the discussion.

We have also had discussions with different groups about this issue, including the FutureTDM project, which has been working to increase the amount of TDM happening across Europe. This project is now finishing up. The impression we have around the sector is that ‘everyone wants to know what everyone else is doing’.

Symposium structure

With this general level of understanding of TDM as our base point, we structured the day to provide as much information as possible to the attendees. The Twitter hashtag for the event was #osctdm, and the presentations from the event are online.

The keynote presentation was by Kiera McNeice from the FutureTDM project, who gave an overview of what TDM is, how it can be achieved and what the barriers are. There is a video of her presentation (note there were some audio issues at the beginning of the recording).

The event broke into two parallel sessions after this. The main room was treated to a presentation about Wikimedia from Cambridge’s Wikimedian in Residence, Charles Matthews. Then Alison O’Mara-Eves discussed Managing the ‘information deluge’: How text mining and machine learning are changing systematic review methods. A video of Alison’s presentation is available.

In the breakout room, Dr Ben Outhwaite discussed Marriage, cheese and pirates: Text-mining the Cairo Genizah before Peter Murray Rust spoke about ContentMine: mining the scientific literature.

After lunch, Rosemary Dickin from PLOS talked about Facilitating Text and Data Mining: how an open access publisher supports TDM. PhD candidate Callum Court presented ChemDataExtractor: A toolkit for automated extraction of chemical information from the scientific literature. This presentation was filmed.

In the breakout room, a discussion about how librarians support TDM was led by Yvonne Nobis and Georgina Cronin. In addition there was a presentation from John McNaught, Deputy Director of the National Centre for Text Mining (NaCTeM), who presented Text mining: The view from NaCTeM.

Round table discussion

The day concluded with the group reconvening together for a roundtable (which was filmed) to discuss the broader issue of why there is not more TDM happening in the UK.

We kicked off by asking each of the presenters to describe what they saw as the major barrier to TDM. The answers ranged from the difficulty of recruiting and training staff, to the legal challenges and institutional policies needed to support TDM, to the failure of institutions and government to show leadership on the issue. We then opened the floor to discussion.

A librarian described what happens when a publisher cuts off access, including the process the library has to go through with various areas of the University to reinstate access. (Note this was the reason why the RLUK workshop concluded with the refrain: ‘Don’t cut us off!’). There was some surprise in the group that this process was so convoluted.

However, the suggestion that researchers let the library know when they want to do TDM, with the library then organising permissions, was rejected by the group, on the grounds both that it is impractical for researchers to do this and that the effort associated with obtaining permission would take too long.

A representative from Taylor and Francis suggested that researchers contact publishers directly and let them know. Again this was rejected as ‘totally impractical’ because of the assumption it made about the nature of research. Far from being a linear, planned activity, research is iterative: having to request access for a period of three months and then go back to extend that permission if the work took an unexpected turn would be impractical, particularly across multiple publishers.

One attendee in her blog about the event noted: “The naivety of the publisher, concerning research methodology, in this instance was actually quite staggering and one hopes that this publisher standpoint isn’t repeated across the board.”

Some researchers described the threats they had received from publishers about downloading material. There was anger about the inherent message that the researcher had done something criminal.

There was also some concern raised that TDM will drive price increases as publishers see ‘extra value’ to be extracted from their resources. This sparked off a discussion about how people will experiment if anything is made digitally available.

During the hour-long session the conversation moved from high-level problems to workflows. How do we actually do this? As is the way with these types of events, it was really only in the last 10 minutes that the real issues emerged. What was clear was something I have repeatedly observed over the past few years – that the players in this space, including librarians, researchers and publishers, have very little idea of how the others work and what they need. I have actually heard people say: ‘If only they understood…’

Perhaps it is time we started having more open conversations?

Next steps

Two things have come out of this event. The first is that people have very much asked for some hands-on sessions. We will have to look at how we deliver this, as it is likely to be quite discipline-specific.

The second is that there is clearly a very real need for publishers, researchers and librarians to get into a room together to discuss the practicalities of how we move forward with TDM. One of the comments on Twitter was that we need legal expertise in the room for this discussion. We will start planning this ‘stakeholder’ event after the summer break.

Feedback

The items that people identified as the ‘one most important thing’ they learnt were instructive. The answers reflect how unaware people are of the tools and services available, and of how access to information works. Many of the responses listed specific tools or services they had found out about; others commented on the opportunities for TDM.

There were many comments about publishers, both the bad:

  • Just how much impact the chilling effect of being cut off by publishers has on researchers
  • That researchers have received threats from publishers
  • Very interesting about publishers and ways of working with them to ensure not cut off
  • Lots can be done but it is being hindered by publishers

and the good:

  • That PLOS is an open access journal
  • That there are reasonable publishing companies in the UK
  • That journals make available big data for meta analysis

Commentary about the event

There has also been some online discussion of the event, and several blog posts about it.

Published 17 August 2017
Written by Dr Danny Kingsley 
Creative Commons License

Service Level Agreements for TDM

Librarians expect publishers to support our researchers’ rights to Text and Data Mining, and not to cut off a library’s access over ‘suspicious’ activity before establishing whether it is legitimate. These were the conclusions of a group that met at a workshop in March to discuss the provision of Text and Data Mining services. The final conclusions were:

Expectations libraries have of publishers over TDM

The workshop concluded with very different expectations from what was originally proposed. The agreed messages to publishers were:

  1. Don’t cut us off over TDM activity! Have a conversation with us first if you notice abnormal behaviour*
  2. If you do cut us off and it turns out to be legitimate then we expect compensation for the time we were cut off
  3. Expected TDM behaviours need to be set out in separate licensing agreements for TDM

*And if you do want to cut us off – please demonstrate that these illegal TDM activities are actually happening in the UK

Workshop on TDM

The workshop “Developing a research library position statement on Text and Data Mining in the UK” was part of the recent RLUK2017 conference. My colleagues, Dr Debbie Hansen from the Office of Scholarly Communication and Anna Vernon from Jisc, and I wanted to open up the discussion about Text and Data Mining (TDM) with our library community. We have made the slides available and they contain a summary of all the discussions held during the event. This short blog post is an analysis of that discussion.

We started the workshop with a quick analysis of who was in the room using a live survey tool called Mentimeter. Eleven participants came from research institutions – six large, four small and one from an ‘other research institution’. There were two publishers, and four people who identified as ‘other’ – which were intermediaries. Of the 19 attendees, 14 worked in a library. There was only one person who said they had extensive experience in TDM, four people said they were TDM practitioners, but the largest group were the 14 who classified themselves as having ‘heard of TDM but have had no practical experience’.

The workshop then covered what TDM is, what the legal situation is and what publishers are currently saying about TDM. We then opened up the discussion.

Experiences of TDM for participants

In the initial discussion about the participants’ experiences, a few issues were raised that would arise if libraries were to offer TDM services. Indeed, there was a question whether this should form part of library service delivery at all. The issue is partly that this is new legislation, so currently publishers and institutions are reactive, not strategic, in relation to TDM. We agreed:

  • There is a need for a clearer understanding of the licensing situation around information
  • We also need a clear route for advice, both within the institution and from the publisher
  • We need to develop procedures for handling requests – which is a policy issue
  • Researcher behaviour is a factor – academics are often unconcerned about copyright.

Offering TDM represents a change in the role of the library – traditionally libraries have existed to preserve access to items. The group agreed we would like to be enabling this activity rather than saying “no, you can’t”. There are implications for libraries in offering support for TDM, not least that librarians are not always aware of TDM taking place within their institution. This makes it difficult to be the central point for the activity. In addition, TDM could threaten access if it triggers a cut-off, and this is causing internal disquiet.

TDM activity underway in Europe & UK

We then presented to the workshop some of the activities in TDM that are happening internationally, such as the FutureTDM project. There was also a short rundown of the new copyright exception proposed to the European Commission, which would allow research organisations carrying out research in the public interest to perform TDM on copyright-protected content to which they have lawful access (e.g. a subscription), without prior authorisation.

ContentMine is a not-for-profit organisation that supplies open source TDM software to access and analyse documents. It is currently partnering with the Wikimedia Foundation, with a grant to develop WikiFactMine, a project aiming to make scientific data available to editors of Wikidata and Wikipedia.

ChemDataExtractor is a tool built by the Molecular Engineering Group at the University of Cambridge. It is an open source software package that extracts chemical information from scientific documents (e.g. text, tables). The extracted data can be used for onward analysis. There is more information in a paper in the Journal of Chemical Information and Modeling: “ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature”.
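ChemDataExtractor itself uses trained taggers and parsing grammars, but the basic idea – pulling structured chemical identifiers out of running text – can be sketched with a toy standard-library example. This regex approach is purely illustrative and is not the toolkit’s actual method or API:

```python
import re

# Toy pattern: two or more element-like tokens (capital letter, optional
# lowercase letter, optional digits), e.g. "H2SO4" or "NaOH".
# Real extraction tools use trained models, not a regex like this.
FORMULA = re.compile(r'\b(?:[A-Z][a-z]?\d*){2,}\b')

def extract_formulae(text):
    """Return candidate chemical formulae found in a passage of text."""
    return [m.group(0) for m in FORMULA.finditer(text)]

sample = "The reaction of H2SO4 with NaOH yields Na2SO4 and water."
print(extract_formulae(sample))  # ['H2SO4', 'NaOH', 'Na2SO4']
```

Even this crude sketch shows why TDM needs machine-readable full text: the extraction step is trivial to automate once the documents themselves are accessible in bulk.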

The Manchester Institute of Biotechnology hosts the National Centre for Text Mining (NaCTeM), which works with research partners to provide text mining tools and services in the biomedical field.

The British Library had a call for applications for a PhD student placement to undertake text mining on the 150,000 theses held in EThOS, to extract new metadata such as the names of supervisors. Applications closed on 20 February 2017, but according to an EThOS newsletter from March, no applications had been received for the placement. The suggested explanation is that perhaps “few students have content mining skills sufficiently well developed to undertake such a challenging placement”.

The problem with supporting TDM in libraries

We proposed to the workshop group that libraries are worried about being cut off from their subscriptions by publishers because of the large downloads of papers that TDM activity involves. This is because publishers’ systems are pre-programmed to react to suspicious activity. If TDM triggers an automated investigation, this may cause an access block.
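As a rough illustration of why bulk TDM downloads trip these systems, the kind of threshold logic involved might look like the following sketch. The class name, limit and window here are invented for illustration; real publisher systems are proprietary and considerably more sophisticated:

```python
from collections import deque

class DownloadMonitor:
    """Toy rate monitor: flag an address that makes more than `limit`
    full-text downloads within a rolling `window` of seconds."""

    def __init__(self, limit=100, window=60):
        self.limit = limit
        self.window = window
        self.events = {}  # address -> deque of request timestamps

    def record(self, addr, timestamp):
        q = self.events.setdefault(addr, deque())
        q.append(timestamp)
        # Drop requests that have fallen outside the rolling window.
        while q and timestamp - q[0] > self.window:
            q.popleft()
        return len(q) > self.limit  # True means "looks suspicious"

# Five rapid requests against a toy limit of three per minute:
monitor = DownloadMonitor(limit=3, window=60)
flags = [monitor.record("10.0.0.1", t) for t in range(5)]
print(flags)  # [False, False, False, True, True]
```

A legitimate TDM crawl looks exactly like this to such a monitor, which is why the workshop’s “talk to us before you block us” request matters: the same flag could just as easily trigger an email to the library as an automatic cut-off.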

However, universities need to maintain support mechanisms to ensure continuity of access. For this to occur we require workflows for swift resolution, fast communication and a team of communicators. It also requires educating researchers about the potential issues.

We asked the group to discuss this issue – noting the reasons why their organisation is not actively supporting TDM and, if it is, the main challenges it faces.

Discussion about supporting TDM in libraries

The reasons put forward for not supporting TDM included practical issues such as the challenges of handling physical media and the risk of lockout.

The point was made that there was a lack of demand for the service. This is possibly because researchers are not coming to the Library for help. There may be a lack of awareness in IT areas that the Library can help, and they may not even pass on queries. This points to the need for internal discussion within institutions.

It was noted that there was an assumption in the discussion that the Library is at the centre of this type of activity, even though we are not joined up as organisations. The question is: who is responsible for this activity? There is often no institutional view on TDM because the issues are not raised at the academic level. Policy is required.

Even if researchers do come to the library, there are questions about how we can provide a service. Initially we would be responding to individual queries, but how do we scale it up?

The challenges raised included the need for libraries to ensure everyone understands the requirements at the content-owner level. The library, as the coordinator of this work, would need to ensure the TDM is not for commercial use and that people know their responsibilities. This means the library potentially intrudes on the research process.

Service Level Agreement proposal

The proposal we put forward to the group was that we draft a statement for a Service Level Agreement under which publishers assure us that if the library is cut off but the activity turns out to be legal, access will be reinstated within an agreed period of time. We asked the group to discuss the issues that would arise if we were to do this.

Expectation of publishers

The discussion raised several issues libraries had experienced with publishers over TDM. One participant said the contract with a particular publisher to allow their researchers to do TDM took two years to finalise.

There was a recognition that identifying genuine TDM might require some sort of registry of TDM activity, which might not be an administrative task all libraries want to take on. The alternative suggestion was a third-party IP registry, which could avoid some of the manual work. Given that LOCKSS crawls publisher sites without getting trapped, this could work in the same way, with a bank of IP addresses secured for this purpose.

Some solutions publishers could help with include delivering material in different ways – not on a hard drive. The suggestion was that material could be provided via a platform, in a format that allows TDM, at no extra cost.

Expectation of libraries

There was some distaste amongst the group for libraries taking on the responsibility of maintaining a TDM activity register. However, libraries could create a safe space for TDM, such as virtual private networks.

Licences are the responsibility of libraries, so we are involved whether we wish to be or not. Large-scale computational reading is completely different from current library provision. There are concerns that licensing via the library could be unsuitable for some institutions. This raises issues of delivery and legal responsibility. One solution for TDM could be to record IP address ranges in licence agreements. We need to consider:

  • How do we manage the licenses we are currently signed up to?
  • How do we manage licensing into the future so that we separate different uses? Should we have a separate TDM ‘bolt-on’ agreement?
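The idea of recording IP address ranges in licence agreements can be sketched with Python’s standard ipaddress module. The ranges and helper function below are hypothetical (using documentation-reserved addresses), simply to show how mechanical the check would be once the ranges were written into a licence:

```python
import ipaddress

# Hypothetical TDM-approved ranges as they might be recorded in a licence
# agreement; these are documentation-reserved example addresses, not real ones.
TDM_RANGES = [ipaddress.ip_network(r) for r in ("192.0.2.0/24", "198.51.100.0/25")]

def is_licensed_for_tdm(ip):
    """Check whether a requesting address falls inside a registered TDM range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in TDM_RANGES)

print(is_licensed_for_tdm("192.0.2.17"))   # True  – inside a registered range
print(is_licensed_for_tdm("203.0.113.9"))  # False – not registered for TDM
```

If a publisher’s access-control system consulted a registry like this before blocking, bulk downloads from a registered range could be waved through while genuinely anomalous traffic was still investigated.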

The Service Level Agreement (SLA) solution

The group noted that, particularly given the amount publisher licences cost libraries, being cut off for a week or two with no redress is unusual at best in a commercial environment. At minimum, publishers should contact the library and give it a grace period to investigate, rather than cutting access off automatically.

The basis for the conversation over the SLA includes the fact that the law is on the subscriber’s side as long as the activity is legal. It would help to have an understanding of the extent of infringing activity going on within university networks (considering that people can ‘mask’ themselves). This would be useful when thinking about thresholds.

Next steps

We need to open up the conversation to a wider group of librarians. We are hoping we might be able to work with RLUK and the funding councils to come to an agreed set of requirements that we can have endorsed by the community and then take to publishers.

Debbie Hansen and Danny Kingsley attended the RLUK conference thanks to the support of the Arcadia Fund, a charitable fund of Lisbet Rausing and Peter Baldwin.

Published 30 March 2017
Written by Dr Danny Kingsley
Creative Commons License