Building a Next Generation Image Repository: Molecular Annotation and Cloud-based Data Processing and Analysis
Principal Investigator / Supervisor
Professor Jason Swedlow
Dr Alvis Brazma
Professor Rafael Edgardo Carazo Salas
University of Dundee
School of Life Sciences
We will construct the Image Data Repository (IDR) based on hardware infrastructure located at EMBL-EBI and integrated with its existing resources for hosting and delivering large datasets to the world's scientific community. These resources will serve as the storage and archive for IDR data. OME's Bio-Formats and OMERO will be used to read, manage, serve, and link the data to EMBL-EBI's molecular and structural resources. We will build custom user interfaces and workflows for the IDR to ensure easy access to, and browsing of, the datasets it holds. To enable computational re-analysis of the data, we will extend OMERO's distributed compute capacity and make use of EMBL-EBI's Embassy system to allow virtual access to IDR data. This virtual resource will provide a 'sandbox' for processing and reanalysis of data deposited in the IDR, and a working example of a next generation data repository that not only stores and manages data but also provides community services for scientific data.
Access to primary research data is vital for the advancement of the scientific enterprise. It facilitates the validation of existing observations and provides the raw materials to build on those observations. In the life sciences, there are numerous examples where members of a research community determined that a particular type of data would be useful and necessary to share. These include gene sequences, protein structural data, and gene and protein expression profiles. In these cases the community united to standardize the structure of the data and its associated metadata, and to create centralized repositories to facilitate deposition, promote discoverability, and ensure the longevity of the data. Imaging in the life sciences has undergone a revolution in recent years and is now used as a quantitative assay technology throughout the life and biomedical sciences. Imaging is used to understand the behavior of organisms, the formation of embryos, the structure and dynamics of cells, and the function and interactions of molecules that are the building blocks of life. Imaging datasets are complex, heterogeneous, and often extremely large, so they are rarely shared or published. Building on the recent development of several image data management technologies and the rapidly decreasing cost of large data storage facilities, we propose to create a resource to host, serve, and make available the original scientific image data that underpin life sciences research. Our proposal is based on open source technologies with proven utility and performance that already run online resources serving several terabytes (TB) of image data. We propose to place this resource at EMBL-EBI, the established home of molecular and structural life sciences data, and to interface it with ELIXIR, Europe's research infrastructure for life science informatics.
In particular, we will build links with established molecular and structural resources and work towards a seamless integration of these data, so that any scientist can easily browse, query, and compute on genomic, structural, and phenotypic data across several scales.
The resource has the potential to impact all branches of basic life sciences research. If the IDR is built and delivered, its impact on the community will be substantial. Datasets that have never previously been accessible will be available for the community to search, view, mine, and even process and analyze. Rich visualization and annotation will make both interactive browsing and programmatic mining possible for the first time. This project will deliver a resource valuable to scientists, funders, and journals by promoting the validation of experimental methods and scientific conclusions, comparison with new data obtained by scientists around the world, and re-use of data by developers of new analysis and processing tools. In particular, the IDR will provide an opportunity to test concepts and measure the true value of reproducibility in science. Finally, the IDR can serve as a model for how large, complex, multidimensional datasets can be shared with worldwide scientific communities.
Research Committee A (Animal disease, health and welfare)
Technology and Methods Development
X – Research Priority information not available
Bioinformatics and Biological Resources Fund (BBR) [2007-2015]
X – not Funded via a specific Funding Scheme