4 Documentation and metadata
Clear and comprehensive documentation is essential to ensure that the data collected during the study is FAIR (Findable, Accessible, Interoperable, and Reusable), not only today but also in the future. This section outlines the approach to documenting all datasets and associated metadata, including the tools, standards, and processes that will be applied. By maintaining detailed descriptions of variables, data sources, and collection methods, and by implementing robust metadata management practices, we aim to support data integrity, make the data easier to use, and comply with the FAIR principles.
4.1 Documentation
All data stored on the GenomeDK servers will be generated during the study, either by study staff or directly by participants. No external datasets (such as national registers) will be stored within the ON LiMiT environment. The primary sources of data include:
- Participant input via the Liva Healthcare app and responses to questionnaires.
- Medical devices and scanners, which will produce structured data files. A detailed list of scanners and devices will be maintained as part of the technical documentation.
- Additional sensors and site-level data collection, uploaded as CSV files by study staff.
Data generation will occur throughout the two-year participation period for each individual. While there are plans to link study data with register data, this work will take place on Statistics Denmark’s servers. Access to the data will be strictly controlled through GenomeDK. Researchers must submit an application to the Publications Panel to gain access, ensuring transparency and governance.
4.2 Metadata strategy
To ensure data usability and interoperability, comprehensive metadata will be created and maintained throughout the study:
Field-level documentation: We will write detailed variable descriptions using the Field Annotation functionality in REDCap, adhering to the naming conventions defined in our internal rules.
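Purely as an illustration, a field-level entry might look like the sketch below; the variable name, naming pattern, and annotation text are hypothetical and do not reflect the study's actual conventions.

```python
# Hypothetical example of field-level documentation, mirroring a row of a
# REDCap data dictionary export. All names and values are illustrative;
# the study's internal rules define the real naming conventions.
example_field = {
    "field_name": "bp_sys_v01",           # assumed pattern: <measure>_<detail>_<visit>
    "form_name": "clinical_measurements",
    "field_type": "text",
    "field_label": "Systolic blood pressure (mmHg)",
    "field_annotation": "Measured seated after 5 minutes of rest; upper-arm cuff.",
}
print(example_field)
```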
Supplier data dictionaries: All external data providers (e.g., Liva Healthcare, MyFood24) will supply data dictionaries in CSV format. These will be integrated into the metadata repository by the data architect.
Device and sensor data: The site staff and the data architect will collaboratively document the CSV files originating from scanners and sensors to ensure completeness and clarity.
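As a rough sketch of how supplier and device data dictionaries could be brought together for review, the example below consolidates a folder of data dictionary CSV files into a single overview; the file paths, file names, and column layout are assumptions for illustration, not the agreed formats.

```python
# Illustrative sketch: combine supplier and device data dictionaries (CSV files)
# into one field-level overview. Folder layout and file names are hypothetical.
import csv
from pathlib import Path

def load_dictionary(path: Path, source: str) -> list[dict]:
    """Read one data dictionary CSV and tag each row with its source."""
    with path.open(newline="", encoding="utf-8") as f:
        return [{"source": source, **row} for row in csv.DictReader(f)]

# Hypothetical folder holding the dictionaries delivered by suppliers and site staff.
dictionary_dir = Path("metadata/data_dictionaries")
combined = []
for csv_file in sorted(dictionary_dir.glob("*.csv")):
    combined.extend(load_dictionary(csv_file, source=csv_file.stem))

# Write the consolidated overview for review by the data architect and site staff.
if combined:
    fieldnames = sorted({key for row in combined for key in row})
    with open("metadata/combined_dictionary.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(combined)
```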
Metadata packaging: We will use software developed by the Seedcase Project to generate and store metadata in a standardized data package format. This software will also structure the data in a standardized format and run checks to confirm that the data corresponds to the metadata. The data package will be stored on GenomeDK, where it will be version controlled and backed up in case of errors or issues. The metadata will be connected to a GitHub repository to increase accessibility and discoverability and to support future collaboration on analyses of the data.
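The Seedcase software will carry out the actual packaging and checks; purely as an illustration of the underlying idea, the sketch below validates data files against a data package descriptor using the open-source frictionless Python library, with an assumed file name and layout.

```python
# Illustration only: the Seedcase tooling performs the real packaging and checks.
# This sketch shows the general idea of confirming that data files match their
# metadata, using the open-source `frictionless` library and an assumed
# `datapackage.json` descriptor stored alongside the data.
from frictionless import validate

report = validate("datapackage.json")

if report.valid:
    print("All data files match the documented metadata.")
else:
    # Each error identifies where the data disagrees with the metadata,
    # so either the data or the documentation can be corrected.
    for task in report.tasks:
        for error in task.errors:
            print(error.message)
```

In practice, checks of this kind would run whenever data or metadata are added or changed, so discrepancies are caught before a package is released.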
4.3 Versioning and updates
Metadata will be treated as a living resource. Updates will occur whenever new variables are introduced, existing fields are modified, or additional data sources are integrated. Version control will be implemented through the Seedcase software, ensuring that:
- Each metadata package is assigned a unique version identifier.
- Changes are logged with timestamps and descriptions.
- Previous versions remain accessible for audit and reproducibility purposes.
This approach ensures traceability and supports long-term data integrity.
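As an illustration of the kind of record such versioning could produce, the sketch below shows a hypothetical change-log entry; the field names and the version scheme are assumptions, not the Seedcase software's actual output.

```python
# Hypothetical change-log entry for a metadata package release. Field names and
# the semantic-versioning scheme are assumptions for illustration.
from datetime import datetime, timezone

changelog_entry = {
    "version": "1.3.0",            # assumed: minor bump when new variables are added
    "previous_version": "1.2.0",   # earlier packages remain available for audit
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "description": "Added sleep-sensor variables; clarified units for two questionnaire fields.",
}
print(changelog_entry)
```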
4.4 Data discovery and accessibility
To facilitate data discovery, a dedicated website will be developed. This site will provide:
- A searchable catalog of all variables.
- Clear descriptions and metadata for each variable.
- A foundation for the Publications Panel application process, ensuring researchers can identify relevant data before requesting access.
Together, these measures ensure that the data is well-documented, discoverable, and compliant with the FAIR principles.
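Purely as an illustration, the sketch below shows what a single catalog entry and a simple search over the catalog could look like; the variables, fields, and search behaviour are assumptions, not the site's actual design.

```python
# Illustrative sketch of a searchable variable catalog; entries and fields are
# hypothetical and do not reflect the actual catalog design.
catalog = [
    {
        "variable": "bp_sys_v01",
        "label": "Systolic blood pressure (mmHg)",
        "source": "Clinical measurement, site visit 1",
        "description": "Measured seated after 5 minutes of rest.",
    },
    {
        "variable": "steps_daily",
        "label": "Daily step count",
        "source": "Liva Healthcare app",
        "description": "Total steps per day as reported by the participant's device.",
    },
]

def search(term: str) -> list[dict]:
    """Return catalog entries whose label or description mentions the term."""
    term = term.lower()
    return [
        entry for entry in catalog
        if term in entry["label"].lower() or term in entry["description"].lower()
    ]

print(search("blood pressure"))
```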