There is a gap between the data used to develop fundamental research and the data actively used in operations applications.
Many governments and research laboratories use some variant of the technology readiness level (TRL) scale, which ranges from fundamental research (TRL 1-3) to actively deployed operations (TRL 7-9).
TRL 1: Basic principles observed and reported.
TRL 2: Technology concept and/or application formulated.
TRL 3: Analytical and experimental critical function and/or characteristic proof of concept.
TRL 4: Component and/or breadboard validation in a laboratory environment.
TRL 5: Component and/or breadboard validation in a relevant environment.
TRL 6: System/subsystem model or prototype demonstration in a relevant environment.
TRL 7: System prototype demonstration in an operational environment.
TRL 8: Actual system completed and qualified through test and demonstration.
TRL 9: Actual system proven through successful mission operations.
Technology Readiness Levels as defined by the U.S. Department of Defense.
A persistent issue is that the data used to develop fundamental research at low TRL frequently differs in key ways from the data used in actual applications at high TRL.
Algorithms developed on data that doesn’t mimic the messiness and scope of real applications data typically can’t even be applied to those applications, creating a gap between “state-of-the-art” and “state-of-use.”
How do we build a clear path from the data we release to the application that inspired it?
Data size matters
Researchers put an emphasis on “big data,” but there is a case to be made for small data.
Relative largeness: In scientific domains, “large” may have a different meaning. In materials science, 1,000 data points is very large, whereas that may seem small in other fields. Just because your dataset seems small compared to another scientific domain does not mean it isn’t valuable!
High risk/Rare events: Some events just don’t happen that often. A dataset of any rare or high-risk event is valuable precisely because examples are scarce.
Sensitive/classified applications: In sensitive or classified applications, there are inherently fewer data points. As appropriate, these should still be published so others studying the same or related application can benefit.
Publish raw data
No dataset is perfect. Frequently datasets are noisy, messy, or have missing elements. That’s okay – the whole premise of FAIR(ER) data is that, realistically, data isn’t perfect. Publishing the raw data (plus a record of anything done to make it cleaner) is essential to FAIRification.
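As a minimal sketch of what this can look like in practice, the snippet below preserves a raw file untouched and publishes a cleaned copy together with a record of every change. All file names, column names, and cleaning steps here are hypothetical placeholders, not a prescription:

```python
import json
import pandas as pd

# Load the raw file; it stays untouched and is published as-is.
raw = pd.read_csv("sensor_readings_raw.csv")

# Illustrative cleaning steps only.
cleaned = raw.dropna(subset=["temperature"])                    # drop rows missing a key field
cleaned["temperature"] = cleaned["temperature"].clip(-40, 125)  # clip to the sensor's rated range

# Record exactly what was done so others can judge (or redo) the cleaning.
provenance = {
    "source": "sensor_readings_raw.csv",
    "steps": [
        "dropped rows with missing 'temperature'",
        "clipped 'temperature' to the rated range [-40, 125]",
    ],
    "rows_in": len(raw),
    "rows_out": len(cleaned),
}

cleaned.to_csv("sensor_readings_clean.csv", index=False)
with open("sensor_readings_provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)
```

Publishing both files together gives readers the imperfect original and the exact path to the cleaned version.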
Connect pre-processing to operations
How did you get your data to a point at which it could be processed or analyzed? Did you write code or apply an algorithm? Publish this connection. Whenever you can, share the code or method used to clean up, modify, or generate data. This enhances the R in FAIR – Reusability.
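One way to publish that connection, sketched below under assumed names, is to ship the pre-processing as a single function that both the research release and the operational pipeline import. The column names, thresholds, and file names are hypothetical:

```python
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """The exact transform applied before analysis; publish this with the data."""
    out = df.copy()
    out["signal"] = out["signal"].interpolate(limit=3)                 # fill only short gaps
    out["signal_db"] = 10 * np.log10(out["signal"].clip(lower=1e-12))  # avoid log(0)
    return out.dropna(subset=["signal_db"])

# The research benchmark and the live system run the same code path:
research_df = preprocess(pd.read_csv("benchmark_release.csv"))
# ops_df = preprocess(load_live_feed())  # hypothetical operational entry point
```

Because the same function runs in both settings, a result on the released benchmark carries over directly to operational data.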
Clearly define where synthetic/surrogate data does and does not mimic reality
Synthetic anomalies rarely mimic the real thing perfectly; by nature, most real anomalies are odd. If your synthetic data does not align with real anomalies, point that out explicitly!
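As a toy sketch, the snippet below injects spike anomalies into a stationary baseline and publishes the caveats alongside the data. Every parameter here is an illustrative assumption, not a validated anomaly model:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
baseline = rng.normal(loc=0.0, scale=1.0, size=1000)  # well-behaved background

# Inject a handful of isolated spikes as "anomalies."
anomaly_idx = rng.choice(baseline.size, size=5, replace=False)
synthetic = baseline.copy()
synthetic[anomaly_idx] += rng.uniform(5.0, 8.0, size=5)

# State plainly where the synthetic data does and does not mimic reality,
# and publish this record alongside the dataset itself.
caveats = {
    "mimics": "isolated point spikes well above a stationary baseline",
    "does_not_mimic": "drift, correlated sensor failures, ramp-up precursors, novel modes",
    "injection_indices": sorted(anomaly_idx.tolist()),
}
print(caveats)
```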
Resources and References
Technology Readiness Assessment Guidebook (2023). United States Department of Defense.