Archiving and preservation for research environments

LABDRIVE Research Data Management and Digital Preservation platform

LIBNOVA is a European SME, founded in 2009, that develops software for digital preservation. Its products aim to guarantee secure, long-term access to information, with full availability, in a simple and efficient way that does not require deep technical knowledge, and they follow digital preservation standards such as ISO 14721 (OAIS) and ISO 16363 (Trustworthy Digital Repository). During the ARCHIVER project, Libnova developed LABDRIVE, a Research Data Management and digital preservation solution that manages the full research project lifecycle, including budgeting, collaboration and preservation. It targets research organizations that need a unified platform to understand, protect and re-use their PB-range research datasets. The LABDRIVE SaaS resulting from ARCHIVER is now Libnova’s solution for research organizations that want to build a consolidated preservation repository for their scientific data, abstracted from the underlying platforms and technologies, in a cost-effective manner.


LABDRIVE Research Data Management and digital preservation solution

The R&D activities during the ARCHIVER project included re-engineering the platform and migrating it to a new architecture that scales to high volumes and throughputs. This was achieved through an incremental R&D process validated by the ARCHIVER buyers, starting at 100k files, then 400k, 1 million, 20 million and 40 million, and finishing with 140 million files and a total of 1.86 PB of ingested content. Libnova’s final tests reached a total ingested volume of 15.87 PB, representing 618,162,714 files (including 739,416 large files), ingested over 31 days from a Libnova location to AWS in Frankfurt through the GÉANT network at a rate of approximately 500 TB every 24 hours. The ingested data was based on the ARCHIVER buyers’ datasets, replicated hundreds of times.

Libnova considers this ingestion volume a new milestone for the preservation industry. The product resulting from ARCHIVER (a new product line named LABDRIVE Research) has a redesigned architecture that performs stably with data volumes 2-3 orders of magnitude larger than the legacy architecture handled before the ARCHIVER project.
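The ingestion figures reported above can be cross-checked with simple arithmetic. The sketch below assumes decimal units (1 PB = 1000 TB) and assumes that the 31-day window covers the full 15.87 PB; the variable names are illustrative only and are not part of any LABDRIVE tooling.

```python
# Figures as reported in the text for the final ingestion tests.
TOTAL_PB = 15.87          # total ingested volume, in petabytes
TOTAL_FILES = 618_162_714  # total number of ingested files
DAYS = 31                  # duration of the ingestion (assumed to cover the full volume)

SECONDS = DAYS * 24 * 3600

# Assumption: decimal units, 1 PB = 1000 TB = 1e9 MB = 1e15 bytes.
tb_per_day = TOTAL_PB * 1000 / DAYS
files_per_second = TOTAL_FILES / SECONDS
avg_file_mb = TOTAL_PB * 1e9 / TOTAL_FILES
gbit_per_second = TOTAL_PB * 1e15 * 8 / SECONDS / 1e9

print(f"Daily volume:       {tb_per_day:.0f} TB/day")
print(f"File rate:          {files_per_second:.0f} files/s")
print(f"Average file size:  {avg_file_mb:.1f} MB")
print(f"Sustained bandwidth: {gbit_per_second:.1f} Gbit/s")
```

Under these assumptions the daily volume comes out at roughly 512 TB, which agrees with the stated rate of approximately 500 TB every 24 hours.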

The services Libnova developed in ARCHIVER address the four layers of the R&D challenge as follows:

  • Layer 1 (Storage/basic archiving/secure backup): Capability to handle several PB of content, with a strong foundation to scale towards hundreds of PB in the future. The resulting Libnova services can manage large peaks in data traffic and are directly linked to the GÉANT network. They support multi-cloud data storage and an escrow that keeps multiple automated replicas of the data.

  • Layer 2 (Preservation): Libnova uses standard data packages (BagIt) with full support for data ingestion and retrieval. The services offer a full set of APIs: every byte of information about every preserved object is accessible through the API, and metadata of all types can be extracted. Metadata management is flexible, covering cases in which no metadata exists as well as cases in which complex metadata needs to be preserved.

  • Layer 3 (User services): Support for searching over complex data types: dates, times, locations, numbers, etc. can all be used as search criteria. Use cases such as “get all datasets belonging to experiment X” are supported through granular search (fields, specific dates, etc.). Full support for federated authentication and authorization (AuthC/Z).

  • Layer 4 (Reproducibility services): Implementation of a generic emulation engine supporting, for example, native Jupyter Notebooks, and the creation of a Python SDK. REANA workflows and Snakemake workloads have been integrated as examples of a fully integrated reproducibility environment.
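To illustrate the BagIt packaging mentioned under Layer 2, the following minimal sketch builds and verifies a bag using only the Python standard library. It follows the basic layout of the BagIt specification (a bagit.txt declaration, a data/ payload directory and a SHA-256 manifest). It is not the LABDRIVE API or its Python SDK; the function names make_bag and verify_bag and the sample file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict[str, bytes]) -> None:
    """Write payload files under data/ plus the BagIt declaration and manifest."""
    manifest_lines = []
    for rel_name, content in sorted(payload.items()):
        target = bag_dir / "data" / rel_name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest line format: checksum, whitespace, path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{rel_name}\n")

    # Bag declaration required by the BagIt specification.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "".join(manifest_lines), encoding="utf-8"
    )


def verify_bag(bag_dir: Path) -> bool:
    """Recompute each payload checksum and compare it with the manifest."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp) / "example_bag"
        # Hypothetical payload standing in for a research dataset.
        make_bag(bag, {"run1/readings.csv": b"t,value\n0,1.5\n", "README.txt": b"demo"})
        print("bag is valid:", verify_bag(bag))
```

The manifest is what allows a repository to detect corruption on ingestion or retrieval: any change to a payload file changes its checksum and causes verification to fail.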

Libnova’s approach consisted of forming a consortium of partners covering different aspects, converging on a final service that can be made available to a broader audience.

The main benefits can be summarized as follows:

  • Demystification of the concepts surrounding data preservation, by providing a service that is easy to understand and use.
  • Optimized storage costs, with resources consumed only as needed (e.g. scaling from tens of groups of applications to several thousand and back again in approx. 30 minutes).
  • Security auditing support for system events such as user logins and permission and configuration changes.
  • Promotion of best practices for data models, data workflows, etc., following the FAIR and TRUST principles.
  • Cost calculators for cloud, hybrid or on-premise deployment and carbon footprint metrics management.

 

As with Arkivum, Libnova developed detailed models and commercialization plans, as well as a self-assessment as a TDR service provider, during the ARCHIVER project. These can be used in the context of the European Open Science Cloud (EOSC).