Data Documentation

Data Management Plan

A Data Management Plan (DMP or DMSP) details how data will be collected, processed, analyzed, described, preserved, and shared during the course of a research project.

NIH, National Library of Medicine

The Data Management Plan is a formal document discussing how data will be handled during and after the project. It can be used as a key tool for communicating expectations with data stakeholders. It is also an evolving document that should be revisited and revised as necessary.

The Fundamental README

The README is the key file that describes a dataset and its metadata. A template can be found below.

Template
Introduction
--------------------------------------------------------------------------
Title:
Authors:
 - % Author 1
 - % Author 2
 - % Author 3

Technical POC:
Data Access POC:

Funding Statement:

Data Description:
% Insert a brief "abstract" like description of the data here

Organization:
% Give an overview of the directory organization and contents

--------------------------------------------------------------------------
Data Type
--------------------------------------------------------------------------
-----------------
FILES
-----------------

Filename format:

File description:

File format:
% Describe the file format here. For columnar data list and explain column names. 

Important Notes:
% Add any important notes about issues, quirks, missing data, messiness, etc for this file type here.


-----------------
METHODS
-----------------
Setup:


Steps:
	1. 
	2.
	3.
	4.
	5. 

Important Notes:

Methods Section

The methods section of a README is arguably one of the most important. To the best of your ability, document your methods to improve the reproducibility of your results. However, with respect to human data, privacy is more important than reproducibility.

Experimental Data

In addition to the information in the general template, add these specifics.

Setup
----------------------------
For machines/sensors:
  - Machine types
  - Relevant machine settings
  - Process parameters in a meta file (ideally with ontology)

Consider which variables are critical to reproducibility:
  - Location
  - Time
  - Demographics
  - Materials
  - Machines
  - Sensors
  - Build architectures
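
One way to capture the machine settings and process parameters above is a small machine-readable meta file that travels with the raw data. A minimal sketch in Python; the machine name and parameter names (`laser_power_w`, `scan_speed_mm_s`) are illustrative placeholders, not values from any real run:

```python
import json

# Hypothetical process parameters for a single experimental run.
# Every key spells out the quantity and carries its unit.
run_metadata = {
    "machine_type": "Example Printer X100",   # placeholder; substitute your machine
    "laser_power_w": 285,                     # watts
    "scan_speed_mm_s": 960,                   # millimetres per second
    "location": "Building 4, Lab 210",        # placeholder location
    "timestamp": "2024-03-15T09:30:00Z",      # ISO 8601, UTC
}

# Write the meta file next to the raw data so it travels with it.
with open("run_001_meta.json", "w") as f:
    json.dump(run_metadata, f, indent=2)
```

Because the file is plain JSON with spelled-out keys, both a human reader and a script can recover the settings without consulting the original operator.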

Steps
----------------------------
  - Data collection
  - Data post-processing
  - Pointer to software release that captures any steps

Derivative Data
  • Clearly specify the original data source
    • Make sure the data release is in accordance with the original data license
  • List processing steps
    • Include code generated to process data
  • Detail the methods for collecting, compiling, and correlating any additional data outside the original dataset
A plot of derivative data, including references to original data sources
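The steps above can be captured in a small provenance record that ships with the derivative dataset. A sketch, assuming hypothetical source details (the DOI placeholder, license, processing steps, and script path are all illustrative):

```python
import json

# Hypothetical provenance record for a derivative dataset.
provenance = {
    "original_source": "doi:10.xxxx/example-dataset",  # placeholder DOI
    "original_license": "CC-BY-4.0",                   # check it permits re-release
    "processing_steps": [
        "Dropped rows with missing measurements",
        "Converted all lengths to microns",
    ],
    "processing_code": "scripts/clean_data.py",        # versioned with the data
}

# Ship the record alongside the derivative data files.
with open("PROVENANCE.json", "w") as f:
    json.dump(provenance, f, indent=2)
```

Listing the processing script by path (and releasing it with the data) lets a reader regenerate the derivative dataset from the original source.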
Simulated Data

The primary decision regarding a simulated data set is whether you are publishing only the generated data, only the data generating code, or both.

Use the previous sections’ suggestions based on this decision to complete your README.

Software and Data

Do any of these suggestions or better practices apply to your software?

YES.

Software that generates or processes data should be seen as an extension of the data itself. In fact, the FAIR principles have been extended to FAIR for Software.

Findable: Software, and its associated metadata, is easy for both humans and machines to find. Example: Journal of Open Source Software (JOSS)

Accessible: Software, and its metadata, is retrievable via standardised protocols. Example: Hosting software on GitHub

Interoperable: Software interoperates with other software by exchanging data and/or metadata, and/or through interaction via application programming interfaces (APIs), described through standards. Example: Use of standard file formats (e.g., CSV)

Reusable: Software is both usable (can be executed) and reusable (can be understood, modified, built upon, or incorporated into other software). Example: Use of open-source licenses
FAIR concepts for Software
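
One concrete way to make software findable and citable is a CITATION.cff file in the repository root, which GitHub recognizes and renders as a citation prompt. A minimal sketch; the title, author, version, and date are placeholders:

```python
# A minimal CITATION.cff makes software findable and citable on GitHub.
# All project details below are placeholders, not a real project.
citation_cff = """\
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "Example Data Processing Toolkit"
authors:
  - family-names: "Doe"
    given-names: "Jane"
version: "1.0.0"
date-released: "2024-01-15"
"""

with open("CITATION.cff", "w") as f:
    f.write(citation_cff)
```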

Implicit Metadata

Documentation needs to stand the test of time. It must be resilient to:

The four facets that can cause documentation to go out of date. [1] Icon by Iconsea, Freepik; [2] Icon by AGE, Freepik; [3] Icon by Haca Studio, Freepik.
When in doubt, spell it out

Say you are given the following list of column names:

R
Th
Data
Rs
RsA
RsB

Do you know what they stand for? Rs, for example, could be “Right side”, “Remote sensing”, “Row space”, “Rapid Start”, “Reed-Sternberg” – the list is endless!

The file these come from is named PtAu_20deg_sheet_resistance.csv. We can deduce that at least one of these columns refers to sheet resistance, but which one?

R = Radius, polar coordinate
Th = Theta, polar coordinate
Data = Sheet resistance
Rs = Sheet resistance, duplicate of Data
RsA = Sheet resistance in single mode configuration
RsB = Sheet resistance in dual mode configuration

How could we make these better? First, we can remove Rs, since it duplicates Data. Then we can rename the remaining variables to make them clear:

position_radius
position_theta
sheet_resistance
sheet_resistance_single_mode
sheet_resistance_double_mode
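
Applying the renaming to the data itself can be scripted. A sketch using Python's standard csv module, with a made-up sample row standing in for the real file:

```python
import csv
import io

# Mapping from the cryptic originals to spelled-out names.
rename_map = {
    "R": "position_radius",
    "Th": "position_theta",
    "Data": "sheet_resistance",
    "RsA": "sheet_resistance_single_mode",
    "RsB": "sheet_resistance_double_mode",
}

# A made-up sample row standing in for PtAu_20deg_sheet_resistance.csv.
raw = "R,Th,Data,Rs,RsA,RsB\n1.0,45.0,12.3,12.3,12.1,12.5\n"

rows = []
for row in csv.DictReader(io.StringIO(raw)):
    row.pop("Rs")  # Rs duplicates Data, so drop it
    rows.append({rename_map[name]: value for name, value in row.items()})
```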
If you don’t want to lose a very important tortoise, include your units
An image of Clarence, a Galapagos tortoise

This is Clarence. Clarence is a 101-year-old tortoise brought from the Galapagos Islands to the United States in 1927 as part of an effort to save his species. In 2001, Clarence moved from the Los Angeles Zoo to the Moorpark College Teaching Zoo. In preparation for this move, the Los Angeles Zoo told the Moorpark College Zoo to prepare an enclosure big enough for a tortoise that weighs 250.

250… what? The Los Angeles Zoo meant kilograms; the Moorpark College Zoo assumed pounds. Clarence immediately escaped from his new enclosure. (Don’t worry – they found him later!)

Clarence the Escape Artist

The moral of the story? Include your units! Using the variable names from before, we can adjust them to:

position_radius_microns
position_theta_deg
sheet_resistance_ohm/square
sheet_resistance_single_mode_ohm/square
sheet_resistance_double_mode_ohm/square
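
Mix-ups like Clarence's can be avoided in code as well: keep the unit in the variable name and convert explicitly instead of assuming. A small Python sketch:

```python
# Keep units in the name and convert explicitly rather than assuming.
LB_PER_KG = 2.20462

def kilograms_to_pounds(weight_kg: float) -> float:
    """Convert a weight in kilograms to pounds."""
    return weight_kg * LB_PER_KG

# The enclosure mix-up: 250 kg is more than twice 250 lb.
clarence_weight_kg = 250.0
clarence_weight_lb = kilograms_to_pounds(clarence_weight_kg)
```

Because both variables carry their unit, no reader has to guess which quantity "250" refers to.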

Ontologies

Metadata essentially constitutes all of the data that describes your dataset. While setting appropriate metadata names yourself (spelled out and including units) is important, this can be taken a step further by using domain-appropriate ontologies.

An ontology is a formal dictionary of terms for a given industry or field that shows how properties are related. Terms are stored as object-relationship pairs.

The key power of using a common ontology to describe your metadata is that it gives you human, machine, and dataset interoperability. When the same quantities are described by exactly the same predefined nomenclature across datasets, they can be seamlessly integrated and compared.
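
As an illustration only (the IRIs below are hypothetical placeholders, not terms from any real ontology), column metadata can point at ontology entries so that machines can decide which quantities are comparable across datasets:

```python
# Hypothetical mapping from column names to ontology terms.
# The IRIs below are placeholders, not real ontology entries.
column_ontology = {
    "sheet_resistance": {
        "quantity_iri": "https://example.org/ontology#SheetResistance",
        "unit_iri": "https://example.org/ontology#OhmPerSquare",
    },
    "position_radius": {
        "quantity_iri": "https://example.org/ontology#Radius",
        "unit_iri": "https://example.org/ontology#Micrometre",
    },
}

def same_quantity(col_a: str, col_b: str, mapping: dict) -> bool:
    """Two columns are comparable if they share a quantity IRI."""
    return mapping[col_a]["quantity_iri"] == mapping[col_b]["quantity_iri"]
```

With shared IRIs, two datasets that both tag a column as the same quantity can be merged automatically, regardless of what the columns were named locally.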

There are many fantastic resources available to get you started with understanding ontologies and how they might integrate with your system. A few we recommend:


Resources and References