Making data FAIR using InterMine
Principal Investigator / Supervisor
Dr Gos Micklem
University of Cambridge
We propose to extend InterMine to make it a better provider of FAIR data. We will provide unique and stable URIs for InterMine data objects, so that they can be accessed RESTfully and used safely in external datasets and systems. We will register these identifiers with third parties and embed search-engine-friendly metadata in web pages, to make InterMine-served data more findable. To enhance interoperability, we will generalize the existing facility for attaching the Sequence Ontology to the data model so that arbitrary ontologies can be attached. This will allow us to fully annotate the core model, and allow InterMine operators to annotate their own extensions to it. InterMine's existing JSON and XML query output formats will be extended with this new metadata, and these formats will also be available for individual objects. RDF will become a new InterMine output format, allowing users who want to integrate all InterMine data to bulk download RDF triples. For piecemeal integration, there will be RDF representations of InterMine objects, lists and query results. We will also provide a SPARQL endpoint as a Docker image, intended for experimentation rather than production, since SPARQL is a powerful technology but is still subject to performance and uptime issues. Data interoperability means adding links between datasets and data objects, and we will create as many such links as possible. In some cases these will target primary data providers, in other cases intermediate FAIR link registries. Increasing data integration also makes recording and providing data license information more important. We will improve InterMine's capabilities in this area and advocate for its importance. Because InterMine is a long-lived, professionally developed open source project, we will do all this following best practices. We will also make the new functionality usable by writing documentation, put it in front of people by presenting papers and running workshops, and adapt our plans in response to community feedback and contributions.
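As a rough illustration of the stable URIs, embedded metadata and RDF output described above, the following minimal Python sketch shows one way such representations could be generated. It is not InterMine code: the base URL, the function names and the example identifier are hypothetical, and the schema.org type "Thing" with an ontology term in "additionalType" is an assumption chosen for simplicity rather than the markup the project would adopt.

```python
import json

# Hypothetical base URL for illustration only; not a real InterMine endpoint.
BASE_URI = "https://example.org/mine"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def stable_uri(class_name, identifier):
    """Build a URI from a class name and a stable primary identifier
    (assumed here to be more durable than an internal database ID)."""
    return f"{BASE_URI}/{class_name.lower()}/{identifier}"


def to_jsonld(class_name, identifier, name, ontology_term_uri):
    """Describe one object as schema.org JSON-LD, typed by an ontology term."""
    uri = stable_uri(class_name, identifier)
    return {
        "@context": "https://schema.org",
        "@type": "Thing",
        "@id": uri,
        "url": uri,
        "identifier": identifier,
        "name": name,
        "additionalType": ontology_term_uri,
    }


def to_script_tag(doc):
    """Wrap JSON-LD in a script element suitable for embedding in a web page.
    '<' is escaped so a string value cannot close the element early."""
    payload = json.dumps(doc, indent=2).replace("<", "\\u003c")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


def _escape_literal(text):
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def to_ntriples(class_name, identifier, name, ontology_term_uri):
    """Serialize the same object as two N-Triples statements."""
    subject = stable_uri(class_name, identifier)
    return [
        f"<{subject}> <{RDF_TYPE}> <{ontology_term_uri}> .",
        f'<{subject}> <{RDFS_LABEL}> "{_escape_literal(name)}" .',
    ]


if __name__ == "__main__":
    # SO:0000704 is the Sequence Ontology term for "gene";
    # the identifier and name below are made up.
    so_gene = "http://purl.obolibrary.org/obo/SO_0000704"
    doc = to_jsonld("Gene", "GENE0001", "example gene", so_gene)
    print(to_script_tag(doc))
    print("\n".join(to_ntriples("Gene", "GENE0001", "example gene", so_gene)))
```

Deriving the JSON-LD and the triples from the same URI function keeps the web-page metadata and the RDF output pointing at one identifier per object, which is the property external datasets would rely on when linking to it.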
Against a backdrop of ever-increasing data generation by the world's bioscience community there is growing recognition that it is essential to describe data more precisely and so make it easier to find and reuse, including integrating it with other datasets. In order to achieve these goals, the community needs to describe the data better (i.e. create better "metadata") and also to store and transport the data according to rigorously defined standards. This is hard in the biosciences because of the huge diversity in experimental techniques and the extremely rapid pace of technological change. Accordingly, the recently published FAIR principles are defining a consensus that data should be "Findable, Accessible, Interoperable and Reusable". These principles are providing an opportunity for a concerted effort towards their increased adoption by the bioscience research community together with science funders and publishers, who are starting to ask for data to be managed according to the FAIR principles. InterMine is a data integration framework that has been developed for over a decade and is used for large-scale data integration projects around the world, including by many of the main plant and animal model organism databases (MODs). These MODs are integrated repositories representing the output of much of the world's basic research, and have correspondingly large user communities. Since starting in 2002, the InterMine project has been under continuous development to reflect changes in best practice and to exploit the best available technologies. Consistent with this approach, in this proposal, we aim to make extensive and coherent changes to InterMine, to enable InterMine database operators to create FAIR resources for their user communities. 
Thus this project should positively and directly impact current and future providers of integrated data resources based on InterMine, indirectly impact their collectively large user communities by providing better-described data, and facilitate one of the key aims of the FAIR principles, which is large-scale data re-use.
(Please see the Academic Beneficiaries section to see how the academic community will benefit from this proposal. Below we outline potential Economic and Societal benefits.) Agritech, biotech and pharmaceutical companies, which depend extensively on the academic community for tools and access to data, will benefit by being able to find and access data more easily, and subsequently to integrate and reuse it, so increasing their efficiency with consequent economic benefits. Google, Yahoo, Yandex, Microsoft and other companies involved in schema.org will benefit from our involvement in bioschemas.org: this proposal will yield a compelling and high-value use case for the Bioschemas community through thousands of end-users accessing the terabytes of data provided through InterMine databases. Funders, such as the Research Councils, will benefit from the increased impact of the projects they support. Data will be made FAIR more easily, so data generated from grants will be re-used more and will thus provide better value for taxpayers' money. Similarly, by increasing the availability of FAIR data, the overall research effort will be made more efficient. Local schools and school trips visiting Cambridge with a view to possible entry, together with the general public through outreach activities, will benefit from greater awareness of the importance of big data to modern bioscience, and how this can benefit the economy and society.
Research Committee C (Genes, development and STEM approaches to biology)
Technology and Methods Development
X – Research Priority information not available
Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
X – not Funded via a specific Funding Scheme