Data Documentation

Data Management Plan

A Data Management Plan (DMP or DMSP) details how data will be collected, processed, analyzed, described, preserved, and shared during the course of a research project.

NIH, National Library of Medicine

The Data Management Plan is a formal document discussing how data will be handled during and after the project. It can be used as a key tool for communicating expectations with data stakeholders. It is also an evolving document that should be revisited and revised as necessary.

The Fundamental README

The README is the key file that describes a dataset and its metadata. A template can be found below.

Template
Introduction
--------------------------------------------------------------------------
Title:
Authors:
 - % Author 1
 - % Author 2
 - % Author 3

Technical POC:
Data Access POC:

Funding Statement:

Data Description:
% Insert a brief "abstract" like description of the data here

Organization:
% Give an overview of the directory organization and contents

--------------------------------------------------------------------------
Data Type
--------------------------------------------------------------------------
-----------------
FILES
-----------------

Filename format:

File description:

File format:
% Describe the file format here. For columnar data list and explain column names. 

Important Notes:
% Add any important notes about issues, quirks, missing data, messiness, etc for this file type here.


-----------------
METHODS
-----------------
Setup:


Steps:
	1. 
	2.
	3.
	4.
	5. 

Important Notes:

Methods Section

The methods section of a README is arguably one of the most important. To the best of your ability, document your methods to improve the reproducibility of your results. However, with respect to human data, privacy is more important than reproducibility.

Experimental Data

In addition to the information in the general template, add these specifics.

Setup
----------------------------
For machines/sensors:
  - Machine types
  - Relevant machine settings
  - Process parameters in a meta file (ideally with ontology)

Consider which variables are critical to reproducibility:
  - Location
  - Time
  - Demographics
  - Materials
  - Machines
  - Sensors
  - Build architectures
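
One way to capture the machine settings and process parameters above is a small machine-readable meta file that travels with the raw data. A minimal sketch in Python; the machine name and parameter names (`laser_power_w`, `scan_speed_mm_s`) are illustrative placeholders, not values from any real run:

```python
import json

# Hypothetical process parameters for a single experimental run.
# Every key spells out the quantity and carries its unit.
run_metadata = {
    "machine_type": "Example Printer X100",   # placeholder; substitute your machine
    "laser_power_w": 285,                     # watts
    "scan_speed_mm_s": 960,                   # millimetres per second
    "location": "Building 4, Lab 210",        # placeholder location
    "timestamp": "2024-03-15T09:30:00Z",      # ISO 8601, UTC
}

# Write the meta file next to the raw data so it travels with it.
with open("run_001_meta.json", "w") as f:
    json.dump(run_metadata, f, indent=2)
```

Because the file is plain JSON with spelled-out keys, both a human reader and a script can recover the settings without consulting the original operator.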

Steps
----------------------------
  - Data collection
  - Data post-processing
  - Pointer to software release that captures any steps

Derivative Data
  • Clearly specify the original data source
    • Make sure the data release is in accordance with the original data license
  • List processing steps
    • Include code generated to process data
  • Detail the methods for collecting, compiling, and correlating any additional data outside the original dataset
A plot of derivative data, including references to original data sources
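The steps above can be captured in a small provenance record that ships with the derivative dataset. A sketch, assuming hypothetical source details (the DOI placeholder, license, processing steps, and script path are all illustrative):

```python
import json

# Hypothetical provenance record for a derivative dataset.
provenance = {
    "original_source": "doi:10.xxxx/example-dataset",  # placeholder DOI
    "original_license": "CC-BY-4.0",                   # check it permits re-release
    "processing_steps": [
        "Dropped rows with missing measurements",
        "Converted all lengths to microns",
    ],
    "processing_code": "scripts/clean_data.py",        # versioned with the data
}

# Ship the record alongside the derivative data files.
with open("PROVENANCE.json", "w") as f:
    json.dump(provenance, f, indent=2)
```

Listing the processing script by path (and releasing it with the data) lets a reader regenerate the derivative dataset from the original source.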
Simulated Data

The primary decision regarding a simulated data set is whether you are publishing only the generated data, only the data generating code, or both.

Use the previous sections’ suggestions based on this decision to complete your README.

Software and Data

Do any of these suggestions or better practices apply to your software?

YES.

Software that generates or processes data should be seen as an extension of the data itself. In fact, the FAIR principles have been extended to FAIR for Software.

Findable: Software, and its associated metadata, is easy for both humans and machines to find. Example: Journal of Open Source Software (JOSS)

Accessible: Software, and its metadata, is retrievable via standardised protocols. Example: Hosting software on GitHub

Interoperable: Software interoperates with other software by exchanging data and/or metadata, and/or through interaction via application programming interfaces (APIs), described through standards. Example: Use of standard file formats (e.g., CSV)

Reusable: Software is both usable (can be executed) and reusable (can be understood, modified, built upon, or incorporated into other software). Example: Use of open-source licenses
FAIR concepts for Software
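
One concrete way to make software findable and citable is a CITATION.cff file in the repository root, which GitHub recognizes and renders as a citation prompt. A minimal sketch; the title, author, version, and date are placeholders:

```python
# A minimal CITATION.cff makes software findable and citable on GitHub.
# All project details below are placeholders, not a real project.
citation_cff = """\
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "Example Data Processing Toolkit"
authors:
  - family-names: "Doe"
    given-names: "Jane"
version: "1.0.0"
date-released: "2024-01-15"
"""

with open("CITATION.cff", "w") as f:
    f.write(citation_cff)
```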

Implicit Metadata

Documentation needs to stand the test of time. It must be resilient to:

The four facets that can cause documentation to go out of date. [1] Icon by Iconsea, Freepik; [2] Icon by AGE, Freepik; [3] Icon by Haca Studio, Freepik.
When in doubt, spell it out

Say you are given the following list of column names:

R
Th
Data
Rs
RsA
RsB

Do you know what they stand for? Rs, for example, could be “Right side”, “Remote sensing”, “Row space”, “Rapid Start”, “Reed-Sternberg” – the list is endless!

The file these come from is named PtAu_20deg_sheet_resistance.csv. We can deduce that at least one of these columns refers to sheet resistance, but which one?

R = Radius, polar coordinate
Th = Theta, polar coordinate
Data = Sheet resistance
Rs = Sheet resistance, duplicate of Data
RsA = Sheet resistance in single mode configuration
RsB = Sheet resistance in dual mode configuration

How could we make these better? First, we can remove Rs, since it duplicates Data. Then we can rename the remaining variables to make them clear:

position_radius
position_theta
sheet_resistance
sheet_resistance_single_mode
sheet_resistance_double_mode
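
Applying the renaming to the data itself can be scripted. A sketch using Python's standard csv module, with a made-up sample row standing in for the real file:

```python
import csv
import io

# Mapping from the cryptic originals to spelled-out names.
rename_map = {
    "R": "position_radius",
    "Th": "position_theta",
    "Data": "sheet_resistance",
    "RsA": "sheet_resistance_single_mode",
    "RsB": "sheet_resistance_double_mode",
}

# A made-up sample row standing in for PtAu_20deg_sheet_resistance.csv.
raw = "R,Th,Data,Rs,RsA,RsB\n1.0,45.0,12.3,12.3,12.1,12.5\n"

rows = []
for row in csv.DictReader(io.StringIO(raw)):
    row.pop("Rs")  # Rs duplicates Data, so drop it
    rows.append({rename_map[name]: value for name, value in row.items()})
```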
If you don’t want to lose a very important tortoise, include your units
An image of Clarence, a Galapagos tortoise

This is Clarence. Clarence is a 101-year-old tortoise brought from the Galapagos Islands to the United States in 1927 as part of an effort to save his species. In 2001, Clarence moved from the Los Angeles Zoo to the Moorpark College Teaching Zoo. In preparation for this move, the Los Angeles Zoo told the Moorpark College Zoo to prepare an enclosure big enough for a tortoise that weighs 250.

250… what? The Los Angeles Zoo meant kilograms; the Moorpark College Zoo assumed pounds. Clarence immediately escaped from his new enclosure. (Don’t worry – they found him later!)

Clarence the Escape Artist

The moral of the story? Include your units! Using the variable names from before, we can adjust them to:

position_radius_microns
position_theta_deg
sheet_resistance_ohm/square
sheet_resistance_single_mode_ohm/square
sheet_resistance_double_mode_ohm/square
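
Mix-ups like Clarence's can be avoided in code as well: keep the unit in the variable name and convert explicitly instead of assuming. A small Python sketch:

```python
# Keep units in the name and convert explicitly rather than assuming.
LB_PER_KG = 2.20462

def kilograms_to_pounds(weight_kg: float) -> float:
    """Convert a weight in kilograms to pounds."""
    return weight_kg * LB_PER_KG

# The enclosure mix-up: 250 kg is more than twice 250 lb.
clarence_weight_kg = 250.0
clarence_weight_lb = kilograms_to_pounds(clarence_weight_kg)
```

Because both variables carry their unit, no reader has to guess which quantity "250" refers to.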

Ontologies

Metadata essentially constitutes all of the data that describes your dataset. While setting appropriate metadata names yourself (spelled out and including units) is important, this can be taken a step further by using domain-appropriate ontologies.

An ontology is a formal dictionary of terms for a given industry or field that shows how properties are related. Terms are stored as object-relationship pairs.

The key power of using a common ontology to describe your metadata is that it gives you human, machine, and dataset interoperability. When the same quantities are described by exactly the same predefined nomenclature across datasets, they can be seamlessly integrated and compared.
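
As an illustration only (the IRIs below are hypothetical placeholders, not terms from any real ontology), column metadata can point at ontology entries so that machines can decide which quantities are comparable across datasets:

```python
# Hypothetical mapping from column names to ontology terms.
# The IRIs below are placeholders, not real ontology entries.
column_ontology = {
    "sheet_resistance": {
        "quantity_iri": "https://example.org/ontology#SheetResistance",
        "unit_iri": "https://example.org/ontology#OhmPerSquare",
    },
    "position_radius": {
        "quantity_iri": "https://example.org/ontology#Radius",
        "unit_iri": "https://example.org/ontology#Micrometre",
    },
}

def same_quantity(col_a: str, col_b: str, mapping: dict) -> bool:
    """Two columns are comparable if they share a quantity IRI."""
    return mapping[col_a]["quantity_iri"] == mapping[col_b]["quantity_iri"]
```

With shared IRIs, two datasets that both tag a column as the same quantity can be merged automatically, regardless of what the columns were named locally.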

There are many fantastic resources available to get you started with understanding ontologies and how they might integrate with your system. A few we recommend:


Resources and References