Publishing Data

From BILS Wiki

Guidelines for publication of life sciences research data

Introduction

The scientific method is based on hypotheses being supported or refuted by experimental evidence, that is, data. In order to keep scientific advances transparent and verifiable, it is important that the data is accessible to anyone. Furthermore, published data can be used for new research and for method development. Data archiving is a good investment, and sharing detailed research data is associated with an increased citation rate.

As the benefits of Open Access Data become more and more apparent, many journals and funding agencies now require that data is made available. See the JULIET project for a list of the Open Access policies of funding agencies across the world, covering both publications and data.

In this document, we try to give scientists in Sweden practical information about how to publish scientific data from the life sciences. As there are different standards and recommendations for different disciplines, we only give an overview, and refer to other sources of information where possible.

Publishing Data

The typical scenario for a scientific project is to conduct experiments, draw conclusions and publish a paper presenting the findings. Publishing the underlying data requires some additional effort. First of all, for the data to be regarded as Open Access Data, it should be saved in an open format, so that anyone can access the information in it. Open formats also help ensure that the data remains readable in the distant future.

Metadata

Data cannot be useful unless it is accompanied by detailed descriptions, such as how, where, under which conditions and by whom it was collected. This so-called metadata should ideally be published so that it can be searched easily. Different disciplines have different standards and requirements for metadata, but an often used minimal set is the Dublin Core Metadata Element Set (DCMES).
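As an illustration of what a minimal DCMES description might look like, the following sketch builds a small XML record using the fifteen Dublin Core element names and their standard namespace. The wrapping `metadata` root element, the function name `build_dc_record` and all example values are assumptions made for this sketch; individual repositories define their own required fields and submission formats.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1
DC_NS = "http://purl.org/dc/elements/1.1/"

# The fifteen elements defined by DCMES
DCMES_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def build_dc_record(fields):
    """Return an XML string with one dc:* element per (name, value) pair.

    The <metadata> root element is a hypothetical wrapper chosen for
    this sketch; it is not prescribed by DCMES.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")
    for name, value in fields:
        if name not in DCMES_ELEMENTS:
            raise ValueError(f"not a DCMES element: {name!r}")
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical example values, for illustration only
record = build_dc_record([
    ("title", "Example NMR measurements of protein X"),
    ("creator", "Doe, Jane"),
    ("date", "2015-01-31"),
    ("format", "text/csv"),
    ("rights", "CC0 1.0 Universal"),
])
print(record)
```

Elements may be repeated (for example several `creator` entries), which is why the sketch takes a list of pairs rather than a dictionary.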

Flow chart for publishing data

Repositories

Once the data set is prepared for publication, one must find a suitable repository in which to deposit it. In many cases, there are community standards for where to publish the data, such as the Protein Data Bank or the European Nucleotide Archive. A community database is usually the preferred place to publish data, as such databases also have policies on which metadata must be included, tools for searching the data, and APIs and web services for analyzing it. A few examples of community repositories can be found in the table below. A longer list can be found in the Wikipedia List of biological databases.

Example community repositories
Repository | URL | Description
Biological Magnetic Resonance Data Bank | http://www.bmrb.wisc.edu/ | Repository for Data from NMR Spectroscopy on Proteins, Peptides, Nucleic Acids, and other Biomolecules.
Protein Data Bank in Europe | http://www.ebi.ac.uk/pdbe/ | The European resource for the collection, organisation and dissemination of data on biological macromolecular structures.
European Nucleotide Archive | http://www.ebi.ac.uk/ena/ | A comprehensive record of the world's nucleotide sequencing information, covering raw sequencing data, sequence assembly information and functional annotation.
Proteomics Identifications database | http://www.ebi.ac.uk/pride/ | A centralized, standards compliant, public data repository for proteomics data.
ArrayExpress Archive | http://www.ebi.ac.uk/arrayexpress/ | Database of functional genomics experiments including gene expression.

Licenses

It is recommended to publish data with an explicit license. The recommended license for data is the Creative Commons CC0, which is a public domain dedication. For some of the reasoning behind this, see BMC Research Notes 2012, 5:494. For software source code, other licenses can be used; the appropriate license often depends on the licenses of third-party contributions to the software.

Identifiers

For research fields without community-wide data repositories, one can choose to publish the data elsewhere, for example by making the data available from the homepage of the research group, or by using SweStore to host the data set. However, in these cases it is recommended to also acquire a persistent identifier for the data set. Homepage addresses can change with time, and in order to make links to data stable, it is advantageous to use handles, which are permanent pointers to the data. If the data is moved to a different location, the pointer is updated, but the identifier remains the same. The most commonly used system is the Digital Object Identifier (DOI) system, which implements the Handle system. Using DOIs also facilitates citing the data. In Sweden, the Swedish National Data Service (SND) is a member of DataCite, and can thus delegate the right to mint DOIs. BILS has been given this right for the life sciences.
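The indirection described above, where the identifier stays fixed while the location it points to can be updated, can be sketched as follows. The `ToyResolver` class is a purely illustrative in-memory stand-in and not how the Handle system is implemented; the DOI `10.1234/example-dataset` and the URLs are hypothetical; and the regular expression is a common heuristic for modern DOIs rather than a complete validator. The only real-world fact relied on is that a DOI can be turned into a resolvable link by prefixing it with `https://doi.org/`.

```python
import re
from urllib.parse import quote

# Heuristic pattern for modern DOIs: "10.", a registrant code, "/", a suffix
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")
KNOWN_PREFIXES = ("https://doi.org/", "http://doi.org/",
                  "https://dx.doi.org/", "http://dx.doi.org/", "doi:")


def normalize_doi(raw):
    """Strip common prefixes and return the bare DOI, e.g. '10.1234/abc'."""
    doi = raw.strip()
    for prefix in KNOWN_PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"does not look like a DOI: {raw!r}")
    return doi


def doi_to_url(raw):
    """Build the resolver link for a DOI."""
    return "https://doi.org/" + quote(normalize_doi(raw), safe="/")


class ToyResolver:
    """Illustrative stand-in for a handle registry (not the real system)."""

    def __init__(self):
        self._locations = {}

    def register(self, doi, url):
        # Registering an existing DOI again simply updates its pointer
        self._locations[normalize_doi(doi)] = url

    def resolve(self, doi):
        return self._locations[normalize_doi(doi)]


resolver = ToyResolver()
resolver.register("doi:10.1234/example-dataset", "http://old.example.org/data.csv")
# The data set moves: only the pointer changes, the identifier does not
resolver.register("10.1234/example-dataset", "https://new.example.org/data.csv")
print(doi_to_url("doi:10.1234/example-dataset"))
print(resolver.resolve("10.1234/example-dataset"))
```

A citation that uses the DOI link therefore keeps working after the move, whereas a citation of the old homepage address would not.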

In addition to identifiers for publications and data sets, you can obtain an identifier for yourself. This allows research activities and outputs to be linked to you unambiguously. ORCID is an open, non-profit, community-based effort to provide a registry of unique researcher identifiers.
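When ORCID iDs are stored alongside data set metadata, it can be useful to catch typing errors early. An ORCID iD consists of 16 characters, the last of which is a check character computed with the ISO 7064 MOD 11-2 algorithm. The sketch below checks only that an iD is well formed; it does not confirm that the iD is actually registered. The function names are chosen for this example, and `0000-0002-1825-0097` is a sample iD commonly used in ORCID's documentation.

```python
import re


def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_well_formed_orcid(orcid):
    """Return True if the iD has 16 characters and a matching checksum."""
    compact = orcid.replace("-", "").upper()
    if not re.fullmatch(r"[0-9]{15}[0-9X]", compact):
        return False
    return orcid_check_character(compact[:15]) == compact[15]


print(is_well_formed_orcid("0000-0002-1825-0097"))
```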

Generic data publishing services

In recent years, several generic services for publishing data have appeared. CERN operates Zenodo, a service that welcomes research outputs from all fields of science. Zenodo is part of the FP7 project OpenAIRE, which supports the implementation of Open Access in Europe. Another EU-funded effort in this area is EUDAT, which offers a suite of tools designed to work together throughout the data management life cycle, e.g. B2SHARE for data publishing and B2DROP for data collaboration. There is also figshare, which is supported by a company (Digital Science), where anyone can publish data and obtain a DOI.

For data that is published in conjunction with an article within the biosciences, it is possible to use Data Dryad, which acts as a data repository and also issues a DOI for the data set. Another option is to publish in the integrated database and journal GigaScience, which publishes big-data studies from the entire spectrum of life and biomedical sciences. However, we recommend that research data is published in its own right (using domain-specific repositories or the generic data publishing services mentioned above) rather than in these "Supplementary Information" repositories.

More information about these services is available from our comparison of some generic research data repositories.

Data publication plans

Many funding agencies now require a data publication plan (DPP) as part of a funding application. The purpose of a DPP is to ensure that the applicants know how and where the data output of the project will be stored and how it will be made available. This includes specifying data formats (for interoperability) and metadata, and budgeting for the cost of data storage.
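To make the items above concrete, the following sketch collects them in a small structure and computes a rough storage budget. It is only an illustration: the class and field names are invented for this example, the price per terabyte and year is a hypothetical figure, and the simple formula (size times years times price) ignores growth of the data set, backups and price changes. Funding agencies may ask for other or additional information.

```python
from dataclasses import dataclass, field


@dataclass
class DataPublicationPlan:
    """Hypothetical checklist of DPP items named in the text above."""
    project: str
    data_formats: list = field(default_factory=list)
    metadata_standard: str = ""
    repository: str = ""
    expected_size_tb: float = 0.0
    storage_years: int = 0
    price_per_tb_year: float = 0.0  # hypothetical price, any currency

    def storage_cost(self):
        """Rough budget: size * years * price per TB and year."""
        return self.expected_size_tb * self.storage_years * self.price_per_tb_year

    def missing_items(self):
        """Names of items that have not been filled in yet."""
        items = {
            "data_formats": self.data_formats,
            "metadata_standard": self.metadata_standard,
            "repository": self.repository,
            "expected_size_tb": self.expected_size_tb,
            "storage_years": self.storage_years,
        }
        return [name for name, value in items.items() if not value]


# Hypothetical example project
plan = DataPublicationPlan(
    project="Example sequencing project",
    data_formats=["FASTQ", "CSV"],
    metadata_standard="Dublin Core",
    expected_size_tb=2.0,
    storage_years=10,
    price_per_tb_year=500.0,
)
print(plan.storage_cost())
print(plan.missing_items())
```

In this example the repository has not been chosen yet, so it is reported as a missing item.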

BILS can assist in writing a data publication plan. In the future, we will also publish DPP templates for common cases on this wiki.

Further reading

Contact

BILS can assist in all steps of the data publication process; you can request assistance by sending an e-mail to BILS.

Disclaimer

This document is produced and updated as a service to the Swedish life sciences research community, but may contain errors. If you have suggestions for improvements, please contact Mikael Borg.

This work is licensed under a Creative Commons Attribution 4.0 International License.