Value and benefits of text mining
Through text mining, data mining and analytics, we can exploit the vast amounts of information and data generated every day through economic, academic and social activities.
Executive summary
Businesses use data and text mining to analyse customer and competitor data to improve competitiveness; the pharmaceutical industry mines patents and research articles to improve drug discovery; within academic research, mining and analytics of large datasets are delivering efficiencies and new knowledge in areas as diverse as biological science, particle physics and media and communications.
We have explored the costs, benefits, barriers and risks associated with text mining within UKFHE research using the approach to welfare economics laid out in the UK Treasury best practice guidelines for evaluation.
The global research community generates over 1.5 million new scholarly articles per annum [1]. As the recent Hargreaves report into 'Digital Opportunity: A Review of Intellectual Property and Growth' [2] highlighted, text mining and analytics of this scholarly literature and other digitised text affords a real opportunity to support innovation and the development of new knowledge. However, current UK copyright laws are restricting this use of text mining. To remedy this, Hargreaves proposes an exception to support text mining and analytics for non-commercial research....
Key findings
- We found some significant use of text mining in fields such as biomedical sciences and chemistry and some early adoption within the social sciences and humanities. Current UK copyright restrictions, however, mean that most text mining in UKFHE is based on Open Access documents or bespoke arrangements. This means that the availability of material for text mining is limited.
- The costs of text mining relate to access rights to text-minable materials, transaction costs (participation in text mining), entry (setting up text mining), staff and underlying infrastructure. Currently, the most significant costs are transaction costs and entry costs. Given the sophisticated technical nature of text mining, entry costs will by and large remain high.
Current high transaction costs are attributable to the need to negotiate a maze of licensing agreements covering the collections researchers wish to study....
1. Introduction
1.1 Context
Economic, academic and social activities generate ever-increasing quantities of data. Businesses collect trillions of bytes of information on customer transactions, suppliers, internal operations and indeed competitors [4]; the global research community generates over 1.5 million new scholarly articles per annum; and social networking sites such as Facebook and Twitter enable users to share over 1.3 billion pieces of information/content per day [5].....
1.2 Report background
The 2011 Hargreaves report into 'Digital Opportunity: A Review of Intellectual Property and Growth' [15] explored whether the current intellectual property (IP) framework in the UK is hindering innovation and economic growth. In examining the potential obstacles, Hargreaves argued that exception(s) to the existing IP framework are required that: allow shifting between formats; are sufficiently general to enable emerging research tools to be applied; and cannot be overridden by contracts. Without such exceptions, Hargreaves argued, UK business and research will be unable to reap the full benefits of emerging technologies and business models.
1.3 Aim, focus and scope of the study
The overarching aim of this study was to explore the value and benefits of the use of text mining and analytics to UKFHE both currently and if Hargreaves exceptions were to be implemented. The research was guided by the following research questions:
- What is the potential for text mining and text analytic technologies and practices in UKFHE?
- What are the costs, benefits (in particular the economic value) and risks of exploiting this potential, for whom, both now and in the foreseeable future?
- What are the main barriers to the exploitation of this potential, and how might they be overcome?
Text mining is an enabling technology with applicability across learning, research and management. The focus of this study is on the public intellectual outputs of further and higher education, rather than (for example) administrative records, and how the application of text mining to these outputs can benefit UK academics, colleges and universities, and thereby the wider UK economy and society. That said, the study draws on the wider use of text mining software in the commercial sector and internationally to inform how it could be applied within UKFHE in the future...
1.4 Study approach
For text mining to be used in UKFHE for competitive advantage (as Hargreaves advocates), there needs to be a better understanding of the value and benefits it can generate, particularly in economic terms. Better evidence is required to help inform the decisions regarding the optimal policy, technical and support infrastructures to help UKFHE exploit the potential that text mining offers. Evidence gathering and analysis needs to be based on methodologically sound techniques that are appropriate to the further and higher education sector. Particular issues for assessment of the value and benefits include:
- Text mining within UKFHE is in relatively early stages of development but generation of benefits can have a long time frame
- Not all economic and social benefits can be captured in financial evaluations but require a broader perspective on economic value and non-market impacts [26]..
2. Text mining: UKFHE and beyond
Text mining is being used in research both within the UK and across the world. As well as NaCTeM, UK institutions using text mining include: University of Manchester, University of Cambridge, University of Oxford, Institute of Education, University of Strathclyde, University of Lancaster, King's College London, University of St Andrews, University of Bangor, London Metropolitan University, University of Surrey and University of Liverpool. Internationally, text mining is being undertaken in, for example, the USA [34, 35, 36], Sweden [37], Japan [38], Australia [39], Israel [40], Germany [41] and China [42].
2.1 Text mining and its rationale
Scholarly journals and data sources are increasingly available in electronic form making them more accessible to researchers and innovators, in theory at least. However, availability does not equate to being able to analyse easily the content to find sought after information or to develop new insights. The reason is two-fold:
- There is too much literature for a researcher to read. The scholarly publication base consists of 11,550 journals, to which 1.5 million articles are added per year [43]. Similarly, text-based research resources such as social networking communications or policy documents are too numerous for a single researcher or group to read.
- While keyword searches might reduce the number of documents [44], there is no guarantee that the search terms have an identical meaning in the documents retrieved. For example, 'tree', 'branch' and 'leaf' have very different meanings in ecology and informatics, something that is easy for a researcher to see but not for a computer....
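The keyword ambiguity described above can be illustrated with a minimal sketch. The documents, cue-word lists and function names below are hypothetical and invented purely for illustration; real text mining tools rely on far richer linguistic and statistical models than this.

```python
import re

# Hypothetical toy corpus: two domains share the words 'tree', 'branch' and 'leaf'.
DOCS = {
    "eco-1": "Leaf loss on each branch of the oak tree was recorded over one season.",
    "inf-1": "Each leaf node of the binary tree stores a key, and each branch is a pointer.",
    "eco-2": "Soil moisture was measured weekly across the forest plots.",
}

# Assumed cue words per domain; a real system would learn these, not hard-code them.
DOMAIN_CUES = {
    "ecology": {"oak", "forest", "soil", "season"},
    "informatics": {"node", "binary", "pointer", "key"},
}

def tokens(text):
    """Lower-case word tokens of a text, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_search(term):
    """Plain keyword search: every document containing the term, whatever its sense."""
    return sorted(doc_id for doc_id, text in DOCS.items() if term in tokens(text))

def guess_domain(doc_id):
    """Crude disambiguation: pick the domain whose cue words overlap the document most."""
    words = tokens(DOCS[doc_id])
    scores = {domain: len(words & cues) for domain, cues in DOMAIN_CUES.items()}
    return max(scores, key=scores.get)

hits = keyword_search("tree")                      # both senses of 'tree' are returned
by_domain = {doc_id: guess_domain(doc_id) for doc_id in hits}
print(hits)
print(by_domain)
```

The keyword search alone cannot separate the ecological from the informatics sense of 'tree'; only the surrounding context, approximated here by cue words, tells them apart.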
2.2 Applications of text mining in UKFHE and beyond
Text mining has applications in all parts of the research process from literature review and hypothesising, through experimentation and analysis to generalisation, peer review and publishing. Our investigation revealed six broad categories of use – systematic review of literature, developing new hypotheses, testing hypotheses, building reusable representations of knowledge, improving the quality of text-based artefacts and improving usability of research literature. This list is, however, not exhaustive.
- In systematic reviews of literature, text mining is used to automatically identify literature that should be reviewed by researchers wishing to establish the current state of knowledge in a particular field. The mining takes place across both traditional peer-reviewed academic journals and grey literature such as technical reports, policy documents and pre-prints. Researchers can use the information extracted to identify relevant documents from a much wider source pool, including from other disciplines and non-traditional sources. This enables efficiencies. For example, Thomas and O'Mara-Eves showed that text mining enabled identification of the relevant works with only 25% of the manual effort otherwise needed [53].
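One way to picture how text mining can reduce manual screening effort is to rank candidate documents so that reviewers read the most promising ones first. The sketch below is only an illustration of that idea under simple assumptions: the candidate titles, seed terms and scoring rule are hypothetical, and it is not the method used by Thomas and O'Mara-Eves.

```python
import re

def tokens(text):
    """Lower-case word tokens of a text, in order."""
    return re.findall(r"[a-z]+", text.lower())

def relevance(text, seed_terms):
    """Crude relevance score: the fraction of tokens that are seed terms."""
    words = tokens(text)
    if not words:
        return 0.0
    return sum(1 for w in words if w in seed_terms) / len(words)

def rank_for_screening(candidates, seed_terms):
    """Return candidate ids ordered from most to least likely relevant."""
    return sorted(
        candidates,
        key=lambda doc_id: relevance(candidates[doc_id], seed_terms),
        reverse=True,
    )

# Hypothetical candidate pool mixing journal articles and grey literature.
CANDIDATES = {
    "report-1": "Policy report on school meals and child nutrition outcomes",
    "article-1": "Child nutrition intervention trial in primary school settings",
    "preprint-1": "Thermal properties of road surfacing materials",
    "article-2": "Nutrition intervention outcomes in adults: a randomised trial",
}
# Assumed seed terms describing the review question.
SEED_TERMS = {"nutrition", "intervention", "trial", "child"}

ranking = rank_for_screening(CANDIDATES, SEED_TERMS)
screen_first = ranking[: len(ranking) // 2]   # reviewers start with the top half
print(ranking)
```

In practice the ranking would come from a trained classifier rather than a term count, and the reviewers' decisions would be fed back to improve it.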
3. Costs, benefits, barriers and risks associated with text mining in UKFHE
We explored how text mining is being used, the associated costs, benefits and the barriers, risks and other issues during 17 interviews with a range of researchers, tools and service providers, and representatives from business and non-commercial organisations. All have a strong interest in the value and benefits of text mining within UKFHE.
The following themes emerged:
- Costs include access, transaction, entry, staff and infrastructure costs
- Benefits include: efficiency; unlocking hidden information and developing new knowledge; exploring new horizons; improved research and evidence base; and improving the research process and quality
- Broader economic and societal benefits were also highlighted, such as cost savings and productivity gains, innovative new service development, new business models and new medical treatments
- Barriers and risks. In general those consulted felt that there were significant barriers to uptake of text mining in UKFHE. These include: legal uncertainty, orphaned works and attribution requirements; entry costs; 'noise' in results; document formats; information silos and corpora specific solutions; lack of transparency; lack of support, infrastructure and technical knowledge; and lack of critical mass
These broad themes (presented in no particular order) and observations are discussed in further detail below [68].
3.1 Costs associated with text mining
3.1.1 Access costs
Where text mining explores copyrighted materials, the copyright holders may require extra payment to allow their material to be used in text mining. This is in addition to the purchase of the right to view the materials. Indeed in some cases the user (or more likely their institution) may need to pay four different costs to enable the materials to be text mined – traditional access (reading) costs, the right to copy, the right to digitise and then the right to text mine. As several of those consulted highlighted, this means that most text mining is limited to exploring Open Access documents where no additional charges are incurred...
3.1.2 Transaction costs
Transaction costs in this context relate to the effort required to enable text mining to take place. This is principally associated with obtaining permission to mine particular corpora of documents. As several of those consulted noted, publishers' contracts are often ambiguous about whether text mining is permissible. The absence of a specific exclusion from an agreement does not imply permission, and it can take significant effort to find the correct contact and then obtain a definitive response. Where additional permission (and payment) is required, this may further prolong the discussion. For example, establishing permission to digitise alone takes roughly the equivalent of 1 FTE [69] as part of the national SHERPA/RoMEO service, which offers information about publishers' policies with respect to self-archiving pre-print and post-print research papers [70]; transaction costs associated with mining copyrighted material may be considerably more.
Such transaction costs mean that text mining in UKFHE is mostly limited to Open Access sources, abstracts or full texts or where the individual researcher/group already has a well-established relationship with publishers. This was, for example, the case in 'Digging into The Enlightenment: Mapping the Republic of Letters' [71], where a corpus of 53,000 18th-century letters was text mined.
3.1.3 Entry costs
Entry costs refer to the resources required to develop and/or configure text mining tools to be used within a specific context. There are some generic tools available that require little configuration; however, higher end tools generally require adaptation and significant training before they can be used in a different domain. For example, if a researcher wishes to use one of NaCTeM's higher level tools, they generally first need to explore with NaCTeM what the specific requirements are. NaCTeM then undertakes the required developments. Once the refined tool is available, it must be 'trained' by a domain expert to understand the key concepts and relationships within the domain....
3.2 Benefits and opportunities
3.2.1 Efficiency
A key benefit of text mining is that it enables much more efficient analysis of extant knowledge. The ability to extract information automatically cuts down the time spent on ensuring coverage of domain knowledge in the literature review process. For example, given the sheer volume of scholarly publications now available in the biomedical fields, it could take a human researcher several years to analyse the corpus to identify all relevant sources for a particular problem. Using text mining to identify relevant material could drastically cut down the time required. (See the case studies of section 4 for some examples.) Further, if the text mined documents were annotated with the semantic information that has been extracted and were then made available for reuse, key resources would be found more quickly....
3.2.2 Unlocking 'hidden' information and developing new knowledge
The enormous volume of academic publications and grey literature means that there may be underlying connections between different subtopics that could not be found without automated analysis. The potential links found between diseases and drugs developed for other purposes mentioned in section 2 are a good example of the unlocking of this hidden information. The unlocked information can lead to new knowledge and improved understanding. For example, text mining has been used to identify new therapeutic uses for thalidomide [78].
3.3 Barriers, risks and issues
3.3.1 Legal uncertainty, orphaned works and attribution requirements
As the Hargreaves report points out, at one level the legal position is quite clear – permission from the copyright holder is required before the digital copying and annotation required as part of text mining can be undertaken. However, where institutions already have existing contracts to access particular academic publications, it is often unclear whether text mining is a permissible use. The resource implications of seeking clarification can be significant...
3.3.2 Entry costs
The entry costs associated with development and 'training' of text mining tools for use within a different topic from that for which they were originally designed were also identified as a significant barrier to uptake of text mining. Investment in training for researchers is also required. Significant tools have been developed through various initiatives in, for example, biomedicine and chemistry. However, there is little uptake in other disciplines, which a number of those consulted felt was at least in part due to such entry costs. The Digging into Data Challenge [90] is however beginning to support and encourage development within the humanities.
3.3.3 Noise in text mining results
Text mining of documents may produce errors. False connections may be identified or genuine ones missed. In most contexts, where the noise (error rate) is sufficiently low, the advantages of automation outweigh the possibility of a higher error rate than that produced by a human reader. However, in some contexts even low error rates cannot be tolerated. While this can be viewed as a barrier, text mining is still used in a range of safety-critical areas such as drug development. In such cases the extraction of information is only partially automated, with a (human) domain expert checking the automated selections. More extensive (and complementary) mining of the full text could also reduce error rates, where the full text is available...
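The two kinds of error mentioned here, false connections and missed connections, are commonly summarised as precision and recall. The sketch below is a minimal illustration with hypothetical link data; the entity labels and the `expert_review` helper are assumptions made for the example, not part of any tool discussed in this report.

```python
def precision_recall(true_links, extracted_links):
    """Precision: share of extracted links that are correct.
    Recall: share of true links that were extracted."""
    true_links, extracted_links = set(true_links), set(extracted_links)
    correct = true_links & extracted_links
    precision = len(correct) / len(extracted_links) if extracted_links else 0.0
    recall = len(correct) / len(true_links) if true_links else 0.0
    return precision, recall

def expert_review(extracted_links, rejected_by_expert):
    """Partial automation: a domain expert removes the links judged to be false."""
    return set(extracted_links) - set(rejected_by_expert)

# Hypothetical data: four genuine links, of which the tool finds two,
# plus two false connections.
TRUE = {("A", "X"), ("B", "Y"), ("C", "Z"), ("D", "W")}
EXTRACTED = {("A", "X"), ("B", "Y"), ("C", "Q"), ("E", "X")}

before = precision_recall(TRUE, EXTRACTED)
reviewed = expert_review(EXTRACTED, rejected_by_expert={("C", "Q"), ("E", "X")})
after = precision_recall(TRUE, reviewed)
print(before, after)
```

In this toy example expert checking removes the false connections, so precision rises, but it cannot recover links that the tool never extracted, so recall is unchanged. Reducing missed links requires better extraction, for instance by mining full text where it is available.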
4. Case studies of the economic value of text mining to UKFHE
We undertook text mining case studies to collect, where possible, direct evidence across the whole value chain of the costs and benefits of text mining and text analytics which would enable generalisations pertinent for UK HE/FE to be drawn.
Sourcing suitable case studies to cover the range of potential uses and fields proved problematic for reasons mentioned earlier: text mining is used in just a few specialised fields; where text mining is taking place, data on its use and value are sparse and often anecdotal; legal and commercial restrictions limited participation. The five case studies presented in this section were therefore selected pragmatically; they focus on specific small-scale examples of the value and benefits of text mining and the wider potential value and benefits that could be delivered if technical and legal limitations were resolved....
4.1 Text mining to support literature review in systems biology
Researchers in the biomedical sciences trying to develop new understanding and medicines to treat diseases are increasingly struggling to keep up to date with relevant literature. PubMed alone has 21 million citations for abstracts or full articles and this is increasing at a rate of two per minute [98]. This case study is based on the literature review and synthesis undertaken by Professor Douglas Kell in 2008–2009 to produce the highly cited journal article – 'Iron behaving badly: inappropriate iron chelation as a major contributor to the aetiology of vascular and other progressive inflammatory and degenerative diseases' [99]. It provides insight into the benefits of and barriers to text mining, illustrating how the full potential value that text mining could offer is yet to be realised...
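To put the quoted growth rate in perspective, the short calculation below converts 'two per minute' into an annual figure. It assumes the rate holds constant around the clock for a 365-day year, which is a simplification made only for this illustration.

```python
# Figures quoted in the text.
EXISTING_CITATIONS = 21_000_000
CITATIONS_PER_MINUTE = 2

# Assumption: the quoted rate applies continuously for a 365-day year.
MINUTES_PER_YEAR = 60 * 24 * 365

new_per_year = CITATIONS_PER_MINUTE * MINUTES_PER_YEAR
annual_growth = new_per_year / EXISTING_CITATIONS

print(f"{new_per_year:,} new citations per year")
print(f"about {annual_growth:.1%} growth per year")
```

Under that assumption the database would gain just over one million citations a year, roughly 5% of its current size.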
4.2 Using text mining to expedite research
Text mining potentially offers two ways of decreasing the expensive and lengthy drug discovery life cycle. Internally, the pharmaceutical industry uses text mining to help identify information required to develop new drugs as well as to explore new application areas for existing drugs. This involves targeted information retrieval, entity extraction and finding links and associations across documents. As this is a highly competitive area, commercial considerations mean that it is not possible to make public the efficiency gains achieved; however, the extent of text mining undertaken by the pharmaceutical industry indicates that it finds the process valuable [111, 112]...
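The steps named above, entity extraction followed by finding links and associations across documents, can be sketched in a few lines. Everything in the example is hypothetical: the compound and condition names are placeholders, the lexicons are hand-written, and simple co-occurrence counting stands in for the much more sophisticated methods used in industry.

```python
from collections import Counter
from itertools import product

# Hypothetical lexicons; real systems use curated ontologies and trained recognisers.
DRUGS = {"compound-a", "compound-b"}
DISEASES = {"condition-x", "condition-y"}

# Hypothetical document snippets.
DOCS = [
    "Compound-A reduced inflammation markers in patients with condition-X.",
    "A trial of compound-A in condition-X reported fewer relapses.",
    "Compound-B was well tolerated; condition-Y symptoms were unchanged.",
    "Condition-Y prevalence is rising.",
]

def extract(text, lexicon):
    """Dictionary-based entity extraction: lexicon terms that occur in the text."""
    lowered = text.lower()
    return {term for term in lexicon if term in lowered}

def cooccurrence_links(docs):
    """Count how many documents mention each (drug, disease) pair together."""
    links = Counter()
    for text in docs:
        drugs = extract(text, DRUGS)
        diseases = extract(text, DISEASES)
        for pair in product(sorted(drugs), sorted(diseases)):
            links[pair] += 1
    return links

links = cooccurrence_links(DOCS)
for (drug, disease), count in links.most_common():
    print(drug, disease, count)
```

Co-occurrence only suggests that a link may be worth investigating. The third snippet mentions a compound and a condition together while reporting no effect, which is exactly the kind of false connection that a domain expert would need to filter out.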
4.3 Using text mining to increase accessibility and relevance of scholarly content
As this case study of the Jisc JournalArchives [121] illustrates, text mining can be used to provide more efficient searching, which returns higher quality results than traditional information retrieval techniques. Jisc JournalArchives contains a selection of journal archives that have been licensed for perpetual access by member institutions. MIMAS has recently developed a service that enables simple and fast conceptual searching across more than 450 journals published by Brill, Institution of Civil Engineers, Institute of Physics, ProQuest, Oxford University Press and the Royal Society of Chemistry. The aim of this subscription service [122] is to enable researchers to access well-targeted content through three simple clicks from one central interface rather than having to visit multiple content providers' websites and negotiate their differing interfaces. As Box 4 below illustrates, it increases researcher efficiency.
5. Economic analysis of the value and benefits of text mining in UKFHE
Improved understanding of how text mining in UKFHE can generate wider economic benefit is an important part of the evidence base underpinning discussions about text mining and whether the Hargreaves-recommended legislative change is necessary. The Hargreaves report [145] highlighted two core areas of potential economic and social benefit and value:
- Where text mining could potentially generate cost savings and productivity gains
- Where text mining in UKFHE could lead to wider innovation in products or services with broader economic and social benefit
We examined both of these areas and also went further to examine the implications of current barriers to text mining....
5.1 Cost savings and productivity gains
Although most text mining activity in UKFHE research has been in specialist areas such as biomedical sciences and computer science [146], there appears clear potential for use in every branch of university research. Different disciplines may use different terminologies and 'ontologies' and require tools tailored to their subject 'dictionaries'. However, all disciplines share the basic principle of requiring systematic reviews of literature (which is essentially a search for 'prior art'), and this is time consuming and resource intensive. There are a number of potential process benefits from text mining:...
5.2 Wider impact through innovation
Text mining has considerable potential to 'unlock' knowledge and help leverage maximum value from the higher education research base, at a time when maximising such value is seen as a high policy priority.
The newly published Government strategy towards innovation, 'Innovation and Research Strategy for Growth' [164], which proposes a raft of measures to open up access to data and information to stimulate innovation, is strongly underpinned by economic evidence and analysis, summarised in a BIS economics report [165]. This report draws on current analytical thinking to present innovation as underpinning the productivity gains that drive economic growth and social welfare.
5.2.1 The research base
Higher education and the public research base are recognised as having key roles in the innovation process alongside a wide range of influences and supporting factors such as training, skills and intellectual property, as well as governance regimes, manufacturing base, enterprise access to finance etc. Extensive research by NESTA has recognised that a strong research base is one of the six wider framework conditions necessary to foster innovation [167, 168]. There is substantial public investment in the research base every year (in 2010 68% of the £6.9bn research investment in UK higher education institutions was publicly funded) [169]. Maximising the knowledge to be extracted and diffused from that research base is seen as a high priority for innovation policy [170]. UK research is regarded as being high quality, 'with more articles per researcher, more citations per researcher, and more usage per article than researchers in the USA, China, Japan and Germany' [171]...
5.2.2 The potential for new cutting edge services and business models
Research using text mining requires an extensive range of supporting infrastructure and services. This includes domain-specific tools, training and the construction and curation of collections of documents that are in compatible formats for mining. As there is a strong predominance of English language journals in the scholarly publishing world this also gives an advantage to the UK for capitalising on demand for text mining. The time could come when UK-developed text mining tools and services become essential purchases alongside any journal access licences and create opportunities for new service development, which the UK's leading publishing industry is well placed to take...
5.3 Market failure and fairness
Current copyright law-driven restrictions on text mining, particularly the text mining of scholarly journals, appear to be inhibiting its wider usage or take-up in UKFHE. Without wider usage, the potential for text mining to generate gains for the economy and society cannot be exploited and the UK economy will be less able to take advantage of its strong public research base. This carries dangers of 'being left behind' as other competitor countries (such as Japan) adopt a more liberal approach that encourages text mining usage.
This observation raises some fundamental questions about the nature and structure of the market for text mining of scholarly journals and why such a situation exists. The current situation may be a result of market failure, which would be detrimental to the economy and society overall. Consideration of the value chain of the scholarly publishing communication system, which shows a substantial public investment in the underlying research base, also raises the question of 'equity' or fairness in current market operations: are copyright barriers to text mining in UKFHE preventing society from deriving a fair share of the return on society's own investment in research? [179]...
5.3.1 Market failure
The Green Book explains that 'market failure' occurs when the usual market mechanisms and transactions do not enable the achievement of 'economic efficiency'. Economic efficiency is the 'ideal' state when all relevant resources are being allocated and used to their maximum productivity: the point is reached where no one can become better off without someone else becoming worse off [184]...
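The efficiency condition described here, that no one can become better off without someone else becoming worse off, is often called Pareto efficiency. The sketch below checks that condition over a small set of hypothetical allocations; the welfare numbers are invented for illustration and do not come from the Green Book or from this study.

```python
def pareto_dominates(a, b):
    """True if allocation a leaves no one worse off than b and someone better off."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def is_pareto_efficient(candidate, feasible):
    """An allocation is efficient if no feasible alternative dominates it."""
    return not any(pareto_dominates(other, candidate) for other in feasible)

# Hypothetical welfare levels for two parties under four feasible allocations.
FEASIBLE = [(3, 3), (4, 3), (2, 5), (4, 2)]

efficient = [a for a in FEASIBLE if is_pareto_efficient(a, FEASIBLE)]
print(efficient)
```

Here (3, 3) and (4, 2) are inefficient because moving to (4, 3) would make one party better off without harming the other. A market that settles on such an allocation is leaving potential gains unrealised.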
5.4 Equity: who pays and who gains?
A number of those interviewed for this study drew attention to the significant investment embodied in the corpora of data being mined ie the actual research base itself. While the text mining process had particular costs associated with it (including equipment, software, training, licensing and curation costs), these paled into insignificance beside the underlying investment that had been made in the original research itself. The additional costs associated with text mining would be a small price to pay if they enabled the leverage of maximum value from the existing research base....
5.5 Reflections on the economic assessment of text mining in UKFHE
- We have found evidence for a clear potential for text mining usage in UKFHE to generate significant productivity gains, with benefit both to the business of the sector itself and to the wider economy
- Widespread take up of text mining by higher education researchers could be an opportunity for the UK, encouraging innovation and growth through leveraging additional value from the public research base
- The UK has a number of strengths, including good framework conditions for innovation and the natural advantage of English as its native language, that could allow it to be an early mover in text mining development. The scholarly publishing market is a global market with global potential for demand for text mining tools and services. This offers opportunities for new service companies as well as current content providers
- However, these opportunities for productivity improvements, knowledge discovery and innovation are being hindered by a range of economic-related barriers including legal restrictions, high transaction costs and information deficit, which are strongly indicative of market failure
6. Summary of Findings
The potential for text mining and text analytic technologies and practices in UKFHE
- Text mining offers a way of helping researchers to make sense of and leverage value from the vast sea of electronic resources, which is continually expanding. These research resources include both raw information sources such as the web and extant scholarly communications
- There is significant potential for using text mining to facilitate and advance research across all disciplines in UKFHE
- Use is most advanced within the biomedical sciences and related fields. Much of this work has involved development of and experimentation with text mining tools to explore their potential applications within the domain. However, text mining in these fields is beginning to be embedded in some workflows, which will aid uptake
- Use within other fields in UKFHE is less widespread, although pilot initiatives are beginning to explore its possibilities
- Where it is being used, text mining and analytics are being successfully employed in research to generate new knowledge and to support the research process
- Text mining and analytics have the potential to increase the research base available to business and society and to enable business and others to use the research base more effectively
- However, access restrictions to copyrighted documents, transaction costs, entry costs, lack of open infrastructure and lack of critical mass are all barriers to uptake
- Participants and evidence from the case studies suggest that the barriers and restrictions that are limiting uptake of text mining and analytics have wider implications in terms of hindering innovation...
6.1 Limitations and issues
As highlighted in the introduction (section 1.4), the short time scales, small scale of the project and the limited use of text mining in UKFHE restricted the evidence, particularly the quantitative data, that could be collected. Participation in the study was further limited by two more sensitive reasons. First, some of the text mining that is currently undertaken might not necessarily abide by strict copyright licensing agreements and, second, the qualitative data relating to use can be considered commercially sensitive. This data limitation affected the study in two ways. First, it meant that case studies needed to be stylised, combining existing and potential use cases, drawing on otherwise analogous uses in, for example, the commercial sector. Second, while best effort has been made to draw appropriate generalisations based on the case studies and quantitative data relating to research practice, these generalisations are indicative rather than statistically relevant. However, they provide a reasonable indication of the scale and magnitude of the economic benefits that could be derived....
7. Conclusions and Recommendations
7.1 Economic and regulatory related
There is evidence to suggest a degree of market failure in text mining. There are also fundamental questions about the 'fairness' of the current situation that limits text mining usage in UKFHE and thereby limits the returns to society. This would tend to support the Hargreaves recommendation for an exception permitting text mining for non-commercial use...
7.2 Infrastructure and support related
Realisation of the full potential of text mining within UKFHE is inextricably linked to the scholarly publication system. Issues relating to interoperability, information silos and access restrictions are limiting the uptake, degree of automation and potential application areas of text mining.
There is a significant lack of awareness regarding the potential for text mining in research apart from in specialised fields. This is hindering uptake.