Data Publication

When considering data publication, there are three main topics of interest: the license, the repository, and the journal.

License

A data license is a legal arrangement between the creator of the data and the end-user, or the place the data will be deposited, specifying what users can do with the data.

D.B. Deutz, M.C.H. Buss, J. S. Hansen, K. K. Hansen, K.G. Kjelmann, A.V. Larsen, E. Vlachos, K.F. Holmstrand (2020). How to FAIR: a Danish website to guide researchers on making research data more FAIR.

License Types

Open vs. Closed

Open licensing is the practice of using a license that permits broader distribution and modification of a work, such as a dataset or a program. Many open licenses build on the idea of ‘copyleft,’ a form of licensing introduced by Richard Stallman and the GNU project that takes advantage of existing copyright law to achieve the aims above. Lawrence Lessig later co-founded Creative Commons, which applies related ideas to content and data.

Licenses that are not considered open are generally called closed or ‘proprietary.’ Proprietary licenses restrict access to the raw data and impose limitations on use, modification, and distribution.

Sample Licenses
| License (SPDX ID) | Domain | BY | SA | Comments |
|---|---|---|---|---|
| Creative Commons CCZero (CC0-1.0) | Content, Data | N | N | Dedicate to the Public Domain (all rights waived) |
| Open Data Commons Public Domain Dedication and Licence (PDDL-1.0) | Data | N | N | Dedicate to the Public Domain (all rights waived) |
| Creative Commons Attribution 4.0 (CC-BY-4.0) | Content, Data | Y | N | |
| Open Data Commons Attribution License (ODC-By-1.0) | Data | Y | N | Attribution for data(bases) |
| Creative Commons Attribution Share-Alike 4.0 (CC-BY-SA-4.0) | Content, Data | Y | Y | |
| Open Data Commons Open Database License (ODbL-1.0) | Data | Y | Y | Attribution-ShareAlike for data(bases) |
Table of Data/Content Conformant Licenses from https://opendefinition.org/licenses/.
Domain = Domain of application, i.e., what type of material this license should/can be applied to
BY = requires attribution
SA = requires share-alike
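As a small illustration of how the BY and SA columns can guide a choice, the sketch below encodes the data/content license table in Python and filters it by requirement. The class and function names are hypothetical and chosen only for this example; the SPDX IDs and flags are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataLicense:
    spdx_id: str
    attribution: bool  # BY column: license requires attribution
    share_alike: bool  # SA column: license requires share-alike


# Rows copied from the data/content license table.
LICENSES = [
    DataLicense("CC0-1.0", attribution=False, share_alike=False),
    DataLicense("PDDL-1.0", attribution=False, share_alike=False),
    DataLicense("CC-BY-4.0", attribution=True, share_alike=False),
    DataLicense("ODC-By-1.0", attribution=True, share_alike=False),
    DataLicense("CC-BY-SA-4.0", attribution=True, share_alike=True),
    DataLicense("ODbL-1.0", attribution=True, share_alike=True),
]


def find_licenses(attribution: bool, share_alike: bool) -> list[str]:
    """Return the SPDX IDs whose BY/SA flags match the requested combination."""
    return [
        lic.spdx_id
        for lic in LICENSES
        if lic.attribution == attribution and lic.share_alike == share_alike
    ]


# Licenses that require attribution but not share-alike.
print(find_licenses(attribution=True, share_alike=False))
# → ['CC-BY-4.0', 'ODC-By-1.0']
```

A lookup like this only narrows the candidates; the license text itself still needs to be read before it is applied to a dataset.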
| License | ID |
|---|---|
| Apache License, Version 2.0 | Apache-2.0 |
| Common Development and Distribution License 1.0 | CDDL-1.0 |
| Eclipse Public License version 2.0 | EPL-2.0 |
| GNU General Public License version 3 | GPL-3.0-only |
| GNU Lesser General Public License version 3 | LGPL-3.0-only |
| The 2-Clause BSD License | BSD-2-Clause |
| The 3-Clause BSD License | BSD-3-Clause |
| The MIT License | MIT |
Table of Open-source Licenses from the Open Source Initiative for Software.

Repository

When choosing a repository on which to host a dataset, there are six main questions to consider.

Does the repository assign a DOI to the data?

For data to be Findable (FAIR), it needs a globally unique and persistent identifier.

Examples of globally unique and persistent identifiers

The DOI is not the only such identifier, but it is one of the most common persistent identifiers for data.
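To make the idea of a persistent identifier concrete, here is a minimal Python sketch that checks whether a string has the usual DOI shape and turns it into a resolvable link through the doi.org resolver. The regular expression is a common heuristic rather than the full DOI syntax, and the DOI in the example is a made-up placeholder, not a real dataset.

```python
import re

# Heuristic DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This is an assumption that covers most modern DOIs, not the complete syntax.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi: str) -> str:
    """Return the https://doi.org/ link for a DOI, or raise ValueError."""
    doi = doi.strip()
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    return f"https://doi.org/{doi}"


# Hypothetical DOI used purely for illustration.
print(doi_to_url("10.1234/example.dataset.v1"))
# → https://doi.org/10.1234/example.dataset.v1
```

Because the identifier resolves through doi.org rather than a repository-specific address, the link in a paper keeps working even if the repository reorganizes its own URLs.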

Do I want to publish my data in a domain-specific or general repository?

General data repositories often have fairly stable funding/support and are the go-to for interdisciplinary scientists, but they may have less tailored metadata.

General Data Repository Examples

Domain-specific data repositories often have metadata tailored to the domain and a more specific audience, but they can sometimes have less stable funding and support.

Domain Specific Data Repository Examples

Can this repository support my dataset?

What are the repository data size limits?

Each repository limits the size of data that can be uploaded, both for individual files and for datasets as a whole. With the rise of open big data this can be a major limitation: many popular general repositories have limits on the order of gigabytes (GB) rather than terabytes (TB). That said, there is increasing interest in publishing big data, and repositories like Harvard Dataverse and Zenodo have policies for contacting the repository curators about publishing larger datasets.
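Before choosing a repository, it can help to measure a dataset against candidate limits. The Python sketch below is one way to do that; the function name and the limit values are assumptions for illustration only, and the real limits must come from the repository's own documentation.

```python
import tempfile
from pathlib import Path


def check_size_limits(root: Path, max_file_bytes: int, max_total_bytes: int) -> dict:
    """Compare a dataset directory against per-file and whole-dataset size limits."""
    sizes = {p: p.stat().st_size for p in root.rglob("*") if p.is_file()}
    total = sum(sizes.values())
    oversized = sorted(
        str(p.relative_to(root)) for p, size in sizes.items() if size > max_file_bytes
    )
    return {
        "total_bytes": total,
        "total_ok": total <= max_total_bytes,
        "oversized_files": oversized,
    }


# Tiny throwaway dataset with made-up limits, just to show the report format.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "small.csv").write_bytes(b"x" * 10)
    (root / "big.bin").write_bytes(b"x" * 100)
    print(check_size_limits(root, max_file_bytes=50, max_total_bytes=200))
    # → {'total_bytes': 110, 'total_ok': True, 'oversized_files': ['big.bin']}
```

In practice the function would be pointed at the real dataset directory, with the limits replaced by the values the repository publishes.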

Does this repository support my data formats?

Some repositories are tailored to specific data types and may only support a subset of file formats. Researchers should check for any format restrictions before publishing their data in a given repository.
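A format check can be scripted in the same spirit. The sketch below counts file extensions in a dataset and reports those outside an accepted set; the accepted set shown is a hypothetical example and should be replaced by the list the chosen repository actually publishes.

```python
from collections import Counter
from pathlib import Path


def unsupported_formats(filenames: list[str], accepted_suffixes: set[str]) -> dict[str, int]:
    """Count files whose extension (case-insensitive) is not in the accepted set."""
    accepted = {suffix.lower() for suffix in accepted_suffixes}
    counts = Counter(Path(name).suffix.lower() for name in filenames)
    return {suffix: n for suffix, n in counts.items() if suffix not in accepted}


# Hypothetical file list and accepted formats, for illustration only.
files = ["a.csv", "b.CSV", "c.h5", "README"]
print(unsupported_formats(files, {".csv", ".txt"}))
# → {'.h5': 1, '': 1}
```

Files without an extension show up under the empty string, which is a useful prompt to rename them or document their format before upload.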

Does this repository meet the data stakeholder’s requirements for publication?

In many research settings, including government research laboratories, datasets may be derived in whole or in part from an external source that has its own rules and regulations for data publication. This can affect the choice of an appropriate repository. For example, a customer may stipulate that while their data can be published openly, it must be in a repository sponsored by organizations in the country of origin.

When creating a data management plan (see Data Documentation page), it is always important to communicate with all data stakeholders as early as possible about any restrictions or requirements for data publication.

Is the repository actively funded and maintained?

This step is critical for long-term sustainability. When considering a data repository, check its data publication trends: are new datasets published frequently, or only rarely? Also spend some time researching who funds and maintains the repository.

How easy/difficult is it for someone to download my data from this repository?

In the case of open data, it is important to have an open protocol and as few steps as possible to download the data. It’s also critical to make links to any supporting software easily available.

Does the repository have a clear procedure for releasing and tracking updates to the dataset?

There is no such thing as a perfect dataset. Corrections will need to be made! A clear procedure and versioning system allows open data to be fixed or expanded upon with clear provenance. Clear updates linked to prior dataset versions allow for reproducible research on those datasets.
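The link between an update and its prior versions can be pictured as a simple chain of records. The Python sketch below is a hypothetical model, not any repository's actual API: each version carries its own identifier, a change note, and a pointer to the version it supersedes, so the full provenance can be recovered from the latest release.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DatasetVersion:
    version: str
    identifier: str  # e.g. a version-specific persistent identifier
    changes: str
    previous: Optional["DatasetVersion"] = None


def provenance(latest: DatasetVersion) -> list[str]:
    """Walk back from the latest version to the first, newest entry first."""
    chain = []
    current: Optional[DatasetVersion] = latest
    while current is not None:
        chain.append(f"{current.version}: {current.changes}")
        current = current.previous
    return chain


# Identifiers below are made-up placeholders.
v1 = DatasetVersion("1.0", "10.1234/example.v1", "initial release")
v2 = DatasetVersion("1.1", "10.1234/example.v2", "corrected unit labels", previous=v1)
print(provenance(v2))
# → ['1.1: corrected unit labels', '1.0: initial release']
```

Citing the version-specific identifier in a paper is what lets a later reader retrieve exactly the data that was analyzed, even after corrections have been released.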

Two great resources for searching and researching data repositories are the Registry of Research Data Repositories (re3data.org) and FAIRsharing.org.

Journal

Why should you also publish a paper about your dataset in parallel to your data release? It supports the F and I in FAIR! Paper publications add searchability, generate readable, motivation-driven documentation, and give publication credit.

There are many different kinds of publications that can be associated with a data release; we review these options below. Among them, data descriptors align particularly well with FAIR data release.

Ways to Publish

Conference Proceeding Data Tracks

More and more venues are recognizing that novel data is itself an important contribution to scientific research. A growing trend for many conferences, especially AI/ML and data mining conferences, is to include a proceedings track specifically to introduce novel datasets or benchmarks to the community. A few examples include:

  1. NeurIPS: Explicit “Datasets and Benchmarks” track since 2021
    1. 2022 Datasets and Benchmarks proceedings
    2. 2021 Datasets and Benchmarks proceedings
  2. ACM SIGKDD: “Applied Data Science” Track (data sets are a specific thrust of this track). The scope for this track was broadened in 2022.
    1. 2023 Applied Data Science Track proceedings papers
    2. 2022 Applied Data Science Track proceedings papers
  3. ICIP: Introduced a “Data and Benchmarks” track this year.

Methodology Papers

Methodology papers emphasize the novelty of how the data was generated rather than the novelty of the data itself. These papers are common for data created with a new technique, build, or pipeline, where the data is the “result” of the new technique. This form of data publication is common in traditionally experimental domains, such as materials science.

Analytical Papers

Not all journals emphasize the novelty of a data release on its own. Thus a traditional type of publication for a data release includes both data and analysis. This analysis can come in various forms: applying initial machine learning to the data, including initial benchmarks on the data, or even simply including statistical analysis of the data.

The issue with this style of publication is that it attaches an initial interpretation of the dataset to the data release. An analytical paper associated with a data release has the potential to become prescriptive. We therefore typically recommend releasing both a data descriptor article (free from interpretation) and a separate analytical paper on the data.

An analytical paper can still be the appropriate publication to associate with a data release in two situations:

  1. The data set corresponds to a very specific mission-driven real-world task that requires prescriptive interpretation for future results to be relevant in the mission space.
  2. The targeted publication journal only accepts analytical papers.

Data Descriptor Articles

A data descriptor article (sometimes also called a data note) emphasizes that releasing open data is just as valuable as (if not more valuable than) new analysis, methods, or algorithm development for furthering the state of the art in a given domain.

This is an article that presents an open dataset as the novel paper contribution without interpretation. We consider this the gold standard for a journal publication that accompanies data.

Because the article describes the data but does not interpret it, it becomes a highly accessible foundation that introduces researchers to the dataset without biasing the work that they do with it.

Choosing a Data Descriptor Journal

There are two kinds of journals that publish data descriptor articles: pure and mixed. Pure data journals are those that exclusively publish data descriptor articles. Mixed data journals are those that explicitly accept data descriptor articles alongside other article types. There are key advantages to both journal types.

Kindling and Strecker published an excellent list of journals from a variety of domains that accept data descriptors, available both on Zenodo and GitHub.

Pure Data Journals

For pure data journals, the journal editors and reviewers are more familiar with the expectations for a data descriptor. There will be no confusion about why the article does not include interpretation or analysis on the dataset. Pure data journals also typically have well-defined data descriptor templates. A few well-known pure data journals are Data in Brief, Scientific Data, and Data.

Image: Left are three general pure data journals; right are three domain-specific pure data journals

Mixed Journals

The problem with pure data journals, of course, is that relatively few exist.

Mixed data journals are far more prevalent, so it’s more likely you will find a mixed data journal that matches your target audience. More and more journals are including data descriptors as a new article type, but this means that there is a lack of precedent. Reviewers may be less familiar with the requirements of a data descriptor, and the journal may lack a data-descriptor-specific template. Since data descriptors are still fairly new in the publication world, there may be few or no examples of previously published data descriptors in the journal. In these cases, reach out to the mixed journal editor to get a better sense of the journal’s expectations.

Springer’s Discover journal series is very new but is domain specific for over 40 domains and has fast turnaround.

Resources and References