6 Storage, backup, recovery, and access
As part of a DMP, consideration should be given to the physical storage and security of the data that a research project generates. The aim is to minimize the risk of accidental data loss, whether from overwriting raw data with an edited copy, losing a laptop, or a disk breaking down. Some data will be easier to replace than other data: measurements taken in a clinic and entered directly on a laptop will be harder to recover than a data set downloaded from an online repository. It is therefore important that we are careful with how we store the data generated within the study, and that we do our due diligence in making sure that the data we store is recoverable should a disk fail or a server stop working.
6.1 Storage of data and metadata
There are three main places where we will store data from ON-LiMiT. The first two, where the data is entered or generated, are REDCap and the LIVA application. These are not final storage areas, but they will hold data for long enough that we need to take them into consideration. The final storage place for data will be GenomeDK, which is also the place where we are most responsible for setting up not only the storage but also the subsequent safety and security of the data.
6.1.1 GenomeDK (GDK)
GDK provides several storage areas. We will use the project storage in their Open zone for the ON-LiMiT project (see a description of the zones here). These locations are designed for collaborative data and are eligible for backup. We do not need a Data Processing Agreement with GenomeDK, as our host organisations, Aarhus University (AU) and Aarhus University Hospital (AUH), already have an agreement in place that covers us.
6.1.2 REDCap
REDCap is a secure web application for research databases and e‑forms. It stores data in a MySQL/MariaDB database and related files in a controlled file area (“File Repository”). AU provides REDCap facilities for both University and AUH employees. The File Repository is intended for storing data and study files, but it should not be the primary long‑term file storage for bulk data; we will export and archive data into our primary storage (GDK) using the API facility provided by REDCap. As with GDK, we don’t need a Data Processing Agreement, for the same reasons as above.
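The export-and-archive step could look roughly like the sketch below. It is an illustration under stated assumptions, not our agreed procedure: the function names, the archive folder, and the file naming scheme are hypothetical, and the API URL and token would come from the AU REDCap instance. The form fields follow REDCap's "Export Records" API method.

```python
from datetime import datetime, timezone
from pathlib import Path
from urllib import parse, request


def build_export_payload(token: str) -> dict:
    """Form fields for a REDCap 'Export Records' API call."""
    return {
        "token": token,       # project-specific API token; keep it out of scripts and version control
        "content": "record",  # ask for the project's records
        "format": "csv",
        "type": "flat",       # one row per record
    }


def archive_path(archive_dir: Path, now: datetime) -> Path:
    """Hypothetical naming scheme: one timestamped file per export."""
    return archive_dir / f"redcap_records_{now:%Y%m%dT%H%M%SZ}.csv"


def export_records(api_url: str, token: str, archive_dir: Path) -> Path:
    """POST the export request and save the response in the archive folder."""
    data = parse.urlencode(build_export_payload(token)).encode()
    with request.urlopen(request.Request(api_url, data=data), timeout=60) as resp:
        body = resp.read()
    target = archive_path(archive_dir, datetime.now(timezone.utc))
    target.parent.mkdir(parents=True, exist_ok=True)
    # Mode "xb" refuses to overwrite, so an earlier export is never replaced.
    with open(target, "xb") as fh:
        fh.write(body)
    return target
```

Writing each export to a new timestamped file, and refusing to overwrite an existing one, keeps earlier raw exports intact if the export is run again.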
6.1.3 LIVA
Liva Healthcare states compliance with GDPR, ISO/IEC 27001/27002, and NHS Data Security & Protection standards on its website. We understand that data is stored on AWS services that are physically based in Frankfurt, Germany. The data will be made available to us via AWS over an FTP connection on request. We hope to transfer the data directly to folders on GDK, from where we will transform and load them into storage.
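One way to check that a file has arrived on GDK intact is to compare checksums before it is transformed. The sketch below assumes that a SHA-256 checksum is available for each file from the sending side; that is an assumption on our part and is not something the Liva agreement currently promises. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(path: Path, expected_sha256: str) -> bool:
    """True if the received file matches the checksum supplied by the sender."""
    return sha256_of(path) == expected_sha256.strip().lower()
```

If no checksum is supplied, the same function can still be used to record a digest on arrival, so that later copies of the raw file can be compared against it.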
6.2 Backup and recovery
When looking at backup and recovery we again have two situations. With the LIVA system we will need to trust that what is stated in the contract is also what LIVA is doing, and that backup and recovery are possible without any significant loss of data. We also need to trust REDCap’s system, as the setup is done entirely without any user engagement.
Where we will play a role is with GenomeDK. As described below, it is the responsibility of the user to set up the files for backup. However, we cannot ourselves check that a recovery is possible, as the backup itself is still run by the GDK team.
6.2.1 GenomeDK
GDK runs its own disk‑based backup at a remote site. By default nothing is backed up. There is guidance given on the GDK website which details how the backup should be set up. It is the responsibility of the individual study teams to set this up correctly, but help can be requested from the GDK team (in particular to check that the folders have been set up correctly).
Backups are conducted once per week and retained for 14 days, and it is possible to review recent backup runs. However, it is recommended that larger files are kept out of the backup structure and only added in compressed format. At present we don’t believe that we will have any files that require this, but we will keep it under observation (see section on data volume).
We will back up the original raw data, the cleaned data in Parquet format, and the metadata. As we do not expect to do any analysis in the main study folder (see section on Sharing Data), we don’t expect to be backing up any analysis scripts or results here. This will be done in separate project folders, which will be set up to back up only scripts and results.
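Keeping file sizes under observation could be done with a small check such as the sketch below. The size limit is a placeholder that we would choose ourselves; GDK's guidance, as summarised here, does not state a specific threshold, and the function name is hypothetical.

```python
from pathlib import Path


def files_over(root: Path, limit_bytes: int) -> list[tuple[Path, int]]:
    """List files under root that are larger than limit_bytes, with their sizes."""
    found = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                found.append((path, size))
    return found
```

Running this on the backup folder before each weekly backup would show which files, if any, should be compressed or moved out of the backup structure.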
6.2.2 REDCap
There isn’t any information on the website for REDCap at AU about backup and recovery, but we have received the following reply from the team maintaining the servers in answer to our questions about the procedures in place.
“AU-IT performs daily backups of the REDCap application and the data contained within it. All backups are managed under IBM’s Tivoli Storage Manager and stored in encrypted form. Ubuntu servers are configured to automatically install security updates. A monthly vulnerability scan is carried out using Tenable Security Centre. In accordance with Aarhus University’s annual information security cycle, a system and data classification is conducted at least once a year, followed by a risk assessment. In addition to the regular annual classification, a new risk assessment is carried out whenever significant changes are made to the system.”
6.2.3 LIVA
Backups, retention, recovery, and export options are governed by our contract with Liva. The contract draft should specify backup frequency, encryption, location (EU/EEA), recovery time, and access/audit reporting. We will confirm these parameters during contract finalization and update this section once the contract is signed.
6.3 Access to the main data structures
Only a limited number of people should have access to the full set of research data. Currently these are the project manager and the Seedcase Project team, which includes the data architect. They will have access to data on GDK as they will be helping with the transformation of raw data to Parquet files and the creation of metadata. It should be made clear that no data analysis is to take place using the data without prior approval of a project synopsis from the Publication Panel and the Project Management Team. There will also be staff at LIVA who will have access to their back-end data; this is covered by the contract.
6.3.1 GDK
Access is managed on a project folder basis. GDK maintains logs and states it records login and file access events to help ensure GDPR compliance. We have been informed that we will be receiving regular reports from the system once we start using the site. These will be reviewed and signed off by the project team when we receive them.
6.3.2 REDCap
REDCap provides a logging module that automatically records project‑level activity (e.g., data exports, data changes, user creation/deletion). We will review logs on a monthly basis and after any major change. The project team will do this when prompted by the automated email that REDCap sends out once a month, encouraging project owners to check their log files and access controls.
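The monthly review could start from a simple summary of the exported log, for example a count of data exports per user. The sketch below assumes the log has been exported to CSV with `username` and `action` columns and that export events contain the word "export" in the `action` text; both are assumptions that should be checked against an actual export from the AU REDCap instance. The function name is hypothetical.

```python
import csv
import io
from collections import Counter


def count_exports_by_user(log_csv: str) -> Counter:
    """Count log rows whose action mentions an export, grouped by username."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(log_csv)):
        if "export" in row["action"].lower():
            counts[row["username"]] += 1
    return counts
```

A user who appears in this summary without a known reason to export data would be a prompt to check the access controls for the project.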
6.3.3 LIVA
We will rely on the Liva contract for access control, audit reporting cadence, and data‑sharing/export mechanisms. We expect that Liva will either give us access to a list of users with data viewing access or provide such a list on demand.