Award details

BioStudies and the Image Data Resource: Expanding Imaging Datasets, Linkage, Metadata, and Value

Reference BB/R015384/1
Principal Investigator / Supervisor Professor Jason Swedlow
Co-Investigators / Co-Supervisors
Institution University of Dundee
Department School of Life Sciences
Funding type Research
Value (£) 583,944
Status Current
Type Research Grant
Start date 01/07/2018
End date 30/06/2021
Duration 36 months

Abstract

Much of the published research in the life sciences includes multidimensional, quantitative image data. These images are routinely used for quantitative measures of biological processes and structures that form the foundation of many of the results published in peer-reviewed life sciences journals. In almost all cases, however, images are presented in published articles in processed, compressed formats that do not accurately convey the quality and complexity of the original image data. The sheer size and heterogeneity of image data sets (multi-dimensional image stacks combined with experimental metadata and analytic results) make image data handling and publication extremely complex and, in practice, rarely achieved. In this project we aim to build the submission pipeline for deposition of reference imaging data in BioStudies and then into IDR. This will grow the datasets that are publicly available in both BioStudies and IDR. Alongside the pipeline, we will update the data submission templates and build metadata validators for use by submitters. This will ensure correct metadata submission and reduce the time IDR staff spend curating submitted studies. We will also extend the value of the stored data by adding links to several valuable resources and extending the metadata the IDR holds.

Summary

Access to primary research data is vital for the advancement of the scientific enterprise. It facilitates the validation of existing observations and provides the raw materials to build on those observations. In the life sciences, research communities have repeatedly collaborated to build resources that allow public submission and access to particular types of datasets. These include gene sequences, protein structural data, and gene and protein expression profiles. In these cases the community united to standardize the structure of the data and its associated metadata, and to create centralized repositories to facilitate deposition, promote discoverability, and ensure the longevity of the data. Much of the published research in the life sciences carries with it detailed image data. These images are routinely used for quantitative measures of biological processes and structures that form the foundation of many of the results published in peer-reviewed life sciences journals. In almost all cases, however, images are presented in published articles in processed, compressed formats that do not accurately convey the quality and complexity of the original image data. Those original data are housed in thousands of individual labs in hundreds of different file formats. The data are difficult for researchers to share and in practice impossible to publish. The sheer size and complexity of image data sets, and even of individual multi-dimensional images, make data submission, handling and publication extremely complex. An image-based genome-wide "high content" screen (HCS) may have over a million images; new "virtual slide" and "light sheet" tissue imaging technologies generate individual images that contain gigapixels of data showing tissues or whole organisms at subcellular resolutions.
Many of these datasets, acquired on the latest-generation imaging systems, are valuable resources that contain so much data that their full value can only be achieved if a large community is given the opportunity to view, analyse and re-analyse the data, sometimes in combination with other datasets. Just as genomics and structural biology have already done, the imaging community must address the challenges posed by multidimensional image datasets so scientists, educators, students and the wider public can find, share, and validate the data that underlie published scientific results. This proposal connects a growing community resource called BioStudies with a previous BBSRC-funded project that developed a public Image Data Repository to deliver the next step in public resources for scientific images.

Impact Summary

There are several forms of impact from this project. The first will derive from the imaging datasets we make available in BioStudies and IDR. These datasets can be accessed through the interactive interfaces presented by the two resources, and thus meet two recent requirements for scientific data: that the datasets be findable and accessible. Those reference datasets that are included in IDR will further be integrated with other datasets through curation and normalisation, thus starting to make them interoperable, and will be available via the IDR Jupyter resource and downloadable by Aspera, so they are reusable. One of the aims of BioStudies is to catalyze the development of data standards in life sciences: data can be initially described using the lightweight structures offered by BioStudies, and then tighter requirements can be defined in an incremental fashion. The proposed project will serve as a proof of concept of this process. In addition, this project will help support the emerging movement to make the publication of imaging data routine and, possibly in the future, mandatory for scientific publications. Currently, journals, funders and community scientists are debating this issue; we hope to energise this debate and provide both technical solutions and scientific examples and rationales for publishing imaging data routinely. This potential impact is demonstrated by the letters of support from several leading journals. Finally, the datasets are all available for download from BioStudies or IDR, providing resources for the development of new tools for image processing and analysis. Moreover, the IDR application stacks and metadata databases are all available, which allows others to download and re-use IDR data and systems, and to integrate their own datasets and analytics.
Committee Research Committee D (Molecules, cells and industrial biotechnology)
Research Topics X – not assigned to a current Research Topic
Research Priority X – Research Priority information not available
Research Initiative Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
Funding Scheme X – not Funded via a specific Funding Scheme