To stimulate an open dialogue with companies interested in the ARCHIVER project, all information given in answers to questions raised by potential suppliers will be documented and published in this FAQ.
This page will be regularly updated as new questions are answered.
Procurement / Legal:
What will be the time period during which contractors have to commercially exploit the results of the PCP before ownership of the results transfers to the Buyers Group?
This time period has not been fixed yet. As an indication, however, in HelixNebula Science Cloud, the last PCP project coordinated by CERN, the contractors were given 2 years to commercially exploit their results. The time period for ARCHIVER will be confirmed in the final draft of the Framework Agreement released with the Request for Tenders.
How will the actual price and virtual price be used to evaluate tenders?
It is likely that only the actual price will be taken into consideration for the evaluation of the bids. The final evaluation criteria and formula will be indicated in the Request for Tenders.
If firms reply to the Request for Tender in a Consortium, do the firms need to be from the same country?
No. Firms replying to the Request for Tender in a Consortium can be from the same country or from different countries. There is no restriction on where firms responding to the Request for Tender have to be located. However, it is a requirement of the PCP that the majority of the R&D activities, including the main researchers working on the contracts, be located in EU Member States or Horizon 2020 Associated Countries.
Any Results we develop would be significantly based on our Background IP. Considering that, how is it possible for you to use and possibly sublicence the Results?
The Buyers Group will require a sub-licensable licence to use the Results for the purposes of the Framework Agreement and for their own non-commercial use. This would include a sub-licensable right to use any Background IP or Sideground IP owned by the contractor that is necessary for the use of the Results for the afore-mentioned non-commercial purposes.
In addition, the Buyers Group may require a sub-licensable licence to exploit the Results.
This would include a sub-licensable right to exploit any Background IP or Sideground IP owned by the contractor that is necessary for the exploitation of the Results. However, any such licence would be granted under fair and reasonable conditions which may include financial compensation for the contractor. It is also intended that a one year “embargo” period would apply before the Buyers Group may require any such licence.
Please refer to the draft Framework Agreement section 7.3 for further details. In order to distinguish between Background IP and Results, the first deliverable you will have to provide under the PCP contracts is a declaration of your Background IP.
Is it the lead Contractor or every member of the Consortium that needs to fulfil the selection criteria?
For some selection criteria, it will be required that the Consortium as a whole meets the specified criteria (i.e. it is sufficient for one member of the Consortium to meet them). For others, it will be required that each member of the Consortium meets the specified criteria. This will be clearly specified in the Request for Tender documents. We do not plan to impose selection criteria that must be met specifically by the lead Contractor.
Will the bids submitted to the future Request for Tenders be published?
No. The Request for Tenders is published openly, but the submitted bids are treated confidentially. The Request for Tenders will contain details about confidentiality and the circumstances in which the Buyers Group will use and share information in bids.
Contractors keep ownership of the results developed in the PCP project, but it was previously mentioned that open source solutions are favoured. How are these two things compatible?
The contractors keep ownership of the Results developed in the PCP. However, points will be attributed in the evaluation of the bid if the proposed services favour the use of open source licenses. More information about the scoring of bids will be provided in the award criteria in the Request for Tenders.
Right now, in the draft tender documents on your website, the numbers are not provided (XXX) for the weighting of the Award Criteria. Will this information be provided?
Yes, the Buyers Group will agree on the numbers and publish them in the Request for Tenders released in October 2019.
What is the maximum number of tenderers per phase?
We anticipate working with a minimum of four tenderers in the design phase, three in the prototype phase and two in the pilot phase. We may contract with more if the budget is sufficient and a sufficient number of compliant tenders are received.
The requirements from the different deployment scenarios vary a lot. Can we bid for just one of the deployment scenarios?
No, the Tender requirements are the same for all Tenderers. All the selection criteria have to be met, as a minimum basic requirement, and the selection criteria cover the minimum level needed across all deployment scenarios. We would also emphasise that, while the precise needs across deployment scenarios differ, there are also many commonalities. We believe that future (post-PCP) market applications of your solutions, including for other customers, will need to address all of these commonalities to be commercially successful.
That said, we do expect variation in the solutions that are developed during the PCP and it is likely that different Tenderers will want to focus on different aspects of the overall R&D challenge. You will specify in your Tender the area or areas of the R&D challenge you propose to focus on. The award criteria will be used to attribute a quality score to your Tender based on this information. Scoring guidelines and weighting for each award criterion will be indicated in the Request for Tenders so that you can build your Tender accordingly.
Given that there is some variation across deployment scenarios, should our bid be based on a single service or a suite of services?
We do not intend to specify one or the other.
Will the functional specification be based on the specific deployment scenarios presented during the OMC or will it remain more vague?
The functional specification will be more detailed than the draft document currently available on the project website (as of May 2019), but it will not go to the level of the atomic use cases from the planning poker sessions. Although the deployment scenarios are used to illustrate the application of the solution that contractors will develop during the PCP, it is important to remember we are buying R&D services for a general solution that will have applications in a much bigger market than the four ARCHIVER buyers.
The section on business models in the Functional Specification, covering the cost sustainability of the solution to be developed gives the impression that the effort falls only on the providers. This might not give the full picture of the costs.
This is an important point which ARCHIVER does accept. For example, in future phases there will be a deliverable allowing us to start a dialogue about the Total Cost of Service.
Are there any restrictions on using public clouds e.g. AWS?
No. However, it is important to bear in mind that at least half of the R&D services have to be located in EU Member States or H2020 Associated Countries, and that some buyers have requirements on where their data is allowed to be held (e.g. CERN data can only be stored in CERN Member States and Associate Member States).
Your draft documents propose that a single Tenderer or lead Tenderer cannot participate in another Tender (including as a consortium member). This prevents me from submitting multiple bids. Where does this rule come from, and why?
This was proposed to the project team by the European Commission given the specific context of the ARCHIVER project. It is not a PCP requirement per se. Please note that the requirement does not prevent you from submitting multiple bids. You can participate in as many Tenders as you wish as a Consortium member and/or subcontractor, provided that you do not act as a lead Tenderer or single Tenderer in any other Tender.
Do you expect to re-apply the same types of award criteria in all three phases?
Yes. However, the detailed description and the weighting of the various criteria may change from phase to phase.
Are the results of the previous phase taken into account for the evaluation of responses to a call-off?
The results of the previous phase will be used to assess whether or not the contractor is eligible to bid for the next phase.
I understand that there is a binary selection based on the price during the evaluation of the tender. Is that correct?
Yes, the price you propose for the services to be provided in a PCP phase is not allowed to exceed the maximum available budget per contractor for that phase stated in the Request for Tenders. If it does, the bid will be discarded. In addition, the price of compliant bids will be part of the overall ranking of your bid. Lower priced bids will rank higher than higher priced bids, all other things being equal. The evaluation process and the formula used for ranking bids will be fully described in the Request for Tenders.
Which price is used to evaluate and score our bid? The price for the services during the PCP, or the future price at which we will sell the solution we develop during the PCP.
Both. Bids will be ranked on the basis of a “price score” and a “quality score”. The price for the services during the PCP is used to attribute the “price score” to your bid. The future price at which you will sell the services will be indirectly evaluated through our award criteria which are used to attribute the “quality score” to your bid. Specifically, the award criteria will score how cost-effectiveness is being taken into account in the architectural design of the future services, as well as the total cost of the services being developed. The cost effectiveness of the future services is a fundamental part of the R&D challenge posed by ARCHIVER and the corresponding award criteria will be very important in the evaluation of tenders.
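As a rough illustration of the two-step treatment of price described in the answers above, the sketch below first discards bids whose price exceeds the per-contractor budget ceiling and then ranks the remaining bids on a combined score. The weights, the price-score formula (lowest compliant price divided by the bid price) and all names are assumptions made purely for illustration; the actual evaluation formula will be the one defined in the Request for Tenders.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    tenderer: str
    price: float          # price for the services in this PCP phase (must be > 0)
    quality_score: float  # assumed scale 0..100, from the award criteria


def rank_bids(bids, max_budget, price_weight=0.3, quality_weight=0.7):
    """Return compliant bids ordered best-first (hypothetical formula)."""
    # Step 1: binary check. Bids above the budget ceiling are discarded.
    compliant = [b for b in bids if b.price <= max_budget]
    if not compliant:
        return []

    lowest = min(b.price for b in compliant)

    def total(b):
        # Step 2: the cheapest compliant bid gets 100 price points,
        # more expensive bids get proportionally fewer (assumed formula).
        price_score = 100.0 * lowest / b.price
        return price_weight * price_score + quality_weight * b.quality_score

    return sorted(compliant, key=total, reverse=True)
```

With equal quality scores, the cheaper of two compliant bids ranks first, which matches the "all other things being equal" statement above.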
Where is the data from the deployment scenarios ingested from? Is the ingest done after data calibration/validation? Are there more scenarios?
Typically, raw data has to be calibrated and validated before ANY scientific process can take place. The goal is not to pour bits as far as possible into a bucket. The goal is to ingest those bits together with all the necessary associated information into a long-term OAIS preservation archive.
Is the replication of the data partial? Are there multiple different scenarios?
There are multiple different cases for data replication. For the majority of the deployment scenarios, the Buyers Group has several copies held within the same system but no external disaster recovery mechanism.
In which scenario(s) are we looking into full remote archive deployments?
All of the proposed deployment scenarios would benefit from full remote archive deployments.
What are the data access patterns requirements (how diverse and how complex for the deployments/use cases presented)?
The data access patterns vary drastically from one deployment scenario to another. There are cases where data are very rarely recalled, e.g. less than once a year, but there are also cases in which data might need to be accessed on a daily basis.
Is there a difference between ingest under OAIS and simple ingest?
The Buyers Group have agreed to follow the OAIS reference model as the “best way” of ensuring long-term preservation of data. “Simple ingest” suggests that some or most of the OAIS guidelines would be skipped. That’s not the intention underlying the proposed project.
Should one of the R&D challenges be ensuring a link between the data and the research institute that created the data?
Yes. However, different behaviours are expected in the different deployment scenarios.
The presentation in Geneva mentioned unstructured data. This has implications in managing the personal data. Is it part of the project?
In some use cases, the data will include personal data. One of the use cases deals directly with this question. Handling personal data according to European legislation (GDPR) is a requirement in the project.
The open data model has long been led by the USA. Is there a model that will be followed in the project?
The EC is pushing for the open data movement, specifically for data produced by the public sector.
There are no archivist organisations in the ARCHIVER consortium. Why?
There is at least one archivist organisation interested in being an Early Adopter of the ARCHIVER resulting services.
Are the companies doing the risk evaluation in the planning poker based on their own experience or based on what is available on the market?
As the R&D challenges are complex and no single company can meet them all currently, we need to take into account not only a single experience but also the wider knowledge of the market and the current state-of-the-art.
In the tender documents, it seems that the archive is not running on the Buyers' infrastructure. So how is the following atomic use case related: “As a Collaboration Data Manager, I can provide a transparent service (F and A from FAIR) to the user by deploying a federated storage environment between multiple research centre archives and commercial archives and by providing catalogues that contain data I own and data that is managed/produced and stored so that Data stored in different locations can be searched and downloaded via the archive in a seamless way, irrespectively of where it is maintained/produced”?
There are different requirements foreseen in each of the deployment scenarios.
We can consider three types of relevant data for this project: structured scientific data, communications data and other supporting data. Is it just scientific data?
All types of data are in scope. More information about the data types is available in the deployment scenario slides from the OMC event in Stansted: https://www.archiver-project.eu/open-market-consultation-event-london-stansted-airport
Are banking organisations going to be early adopters?
ARCHIVER is not talking to the banking sector. Early adopters will be public organisations in the research domain.
Does ARCHIVER want to achieve Long Term Data Preservation or only Data Preservation? It is important as it has an implication on the format.
Some use cases are based on long-term data preservation, e.g. BaBar. Others, e.g. EMBL, are more focused on storage. Some use cases use custom research data formats (such as ROOT), while other use cases use widespread formats (such as JPEG, TIFF, PDF). In the latter case, proper full-scale long-term data preservation handling, including format conversions, is more necessary than in the former. The award criteria will reflect the respective weights of the different elements of the R&D challenge.
N.B. format conversion of (HEP) scientific data ALSO requires changes in the s/w. We (CERN) have done it at the scale of 1 per mil of the current LHC data. At the time, it took considerable resources to perform, one year to plan and test and one year to execute. Whilst these numbers do not scale with data volume (thanks to advancements in technology), this is NOT a trivial operation!
Please provide more information about the security aspects. For example, VPN used to access browsers, what is meant by encryption, confidentiality, integrity etc.
The Buyers Group is considering adding more information regarding security aspects in the PCP Contract Notice criteria.
Did you consider splitting this project into multiple parts: one for each of the presented layers?
The solutions that the Buyers Group (and the research community more broadly) requires have to cover all layers. The R&D was deliberately structured in layers in order to stage it and keep it realistic within the project lifetime. Layers 1 to 3 are considered the minimum R&D objectives. Layer 4 comprises the added-value, advanced services. Solutions that achieve them will be considered.
More information about the services that CERN currently deploys for the preservation of scientific data can be found at https://indico.cern.ch/event/448571/.
Can you provide any more information about the federated entities/server proxies e.g. user preferences?
There is a current effort to improve the approach for AAI compared to previous projects (e.g. HNSciCloud). There are ongoing discussions between the buyers and GÉANT in order to agree on a common approach regarding federated entities, with clearly defined specifications for contractors.
Cost is a trade-off and a limitation. Is it foreseen that the Buyers Group will provide the contractor with their internal archiving costs?
A budget/design cost target is subject to many factors. Further internal discussions are necessary. The objective is to make progress in this aspect as the project evolves and as solution costs can be compared. In parallel, contractors will be asked for a deliverable reporting on cost implications from architecture (Design Phase) to pre-production service delivery (Pilot).
However, it’s a common “mistake” to focus predominantly on storage costs when these are only one of the factors contributing to the total cost of long-term data preservation. As foreseen in OAIS, long-term implies a change of practices, not only technology but also services. For example, media migration can be performed “as a service” with little or no impact on data producers or consumers; a change in practices concerning the media service, however, would often require (major) migrations.
As far as cost of preservation versus the quality of preservation is concerned, the cost cannot be lowered beyond the cost of one copy of the data. However, it is possible to lower the quality of the data with low value. Do you expect such functionalities to be provided by the preservation services provider?
We expect to see various solutions, each with varying costs and risks. The Buyers Group is willing to analyse possible trade-offs: carrying certain risks in order to lower costs or, conversely, accepting higher costs for certain “valuable” collections.
Users will access the service through a web browser. Do you consider any other possibilities? Also, will accessibility be possible outside of a network?
The Buyers Group wants to be able to access the services through file transfer protocols, not only web interfaces, as all of the deployment scenarios would benefit from these functionalities. Security of access through one network only is not a specific requirement, but it would be a benefit if present. In addition, given the volume of data involved, some bulk upload/download capabilities are required.
According to FAIR principles, metadata needs to be maintained after deletions of the underlying data. What is the amount of metadata vs underlying data? For how long do you want to hold it? Is it immutable?
Yes, it is immutable (except for a few deployments). In terms of duration, we can assume ‘forever’ (some decades or more, in any case much longer than the duration of the ARCHIVER project). In terms of amounts, it is not easy to generalise across all deployment scenarios since there are differences. That said, it’s anticipated that the volume of metadata is significantly less than that of the underlying data (less than, or even much less than, 1 per mil would be an appropriate yardstick).
FAIR does not give any concrete examples of when this might be useful. Some concrete examples can be provided regarding e.g. the LEP and / or pre-LEP experiments at CERN. For the LEP experiments (1989 - 2000), the total volume of “data” aka the bits (not including documentation, s/w, web pages, newsgroups, etc.) was around 4 x 100TB. Assuming a canonical file size of 200MB (limited by the 3480 cartridge technology used at LEP startup), the file-level metadata was around 1KB. Eventually (perhaps in 2-3 decades, when a potential new collider is working at the same energies but with much more precision and possibly 10,000 times the statistics), not only the data but also the file-level metadata will become redundant and a higher-level and much reduced summary could be retained. Other levels of metadata also exist, e.g. “run level” (groups of files from an archive point of view).
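To put the LEP figures above in perspective, the following back-of-the-envelope calculation estimates the file count, the total file-level metadata volume and the metadata-to-data ratio. It assumes that "4 x 100TB" means a total of roughly 400 TB, that about 1KB of file-level metadata is kept per 200MB file, and that units are decimal; these readings are assumptions, and the result is only an order-of-magnitude estimate.

```python
# Decimal units are assumed throughout.
KB = 10**3
MB = 10**6
TB = 10**12

total_data = 4 * 100 * TB        # "4 x 100TB", read here as ~400 TB in total
file_size = 200 * MB             # canonical file size
metadata_per_file = 1 * KB       # file-level metadata per file

n_files = total_data // file_size               # number of canonical files
total_metadata = n_files * metadata_per_file    # total file-level metadata
ratio = total_metadata / total_data             # metadata / data

print(f"files:          {n_files:,}")
print(f"metadata total: {total_metadata / 10**9:.1f} GB")
print(f"ratio:          {ratio:.1e} ({ratio * 1000:.3f} per mil)")
```

Under these assumptions the file-level metadata amounts to about 2 GB for about two million files, a ratio of 5 in a million, which is well below the 1 per mil yardstick mentioned above.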
Are you planning to provide information about each Buyers Group member's access patterns, the type of data, the metadata in scope, etc.?
This information is summarised in the technical summaries of each of the buyers’ requirements that are available here: https://www.archiver-project.eu/deployment-scenarios-technical-summaries
Q&A related to the Deployment Scenarios for Astronomy (PIC):
What does scrubbing mean?
Data scrubbing is an error correction technique that uses a background task to periodically inspect storage for errors, then correct detected errors using redundant data. Data scrubbing reduces the likelihood that single correctable errors will accumulate, leading to reduced risks of uncorrectable errors. (Adapted from Wikipedia.)
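The definition above can be sketched as a small routine that compares each stored block with its recorded checksum and repairs a damaged copy from a healthy replica. This is a minimal illustration only: the in-memory replica layout, the use of SHA-256 and the function names are assumptions, not a description of any buyer's or supplier's storage system.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Digest used to detect silent corruption (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def scrub(replicas, expected):
    """Inspect every block on every replica and repair bad copies.

    replicas: list of dicts mapping block id -> bytes (one dict per copy)
    expected: dict mapping block id -> checksum recorded at ingest time
    Returns the list of (replica index, block id) pairs that were repaired.
    """
    repaired = []
    for block_id, digest in expected.items():
        # Find one copy that still matches the recorded checksum.
        good = next(
            (r[block_id] for r in replicas
             if block_id in r and checksum(r[block_id]) == digest),
            None,
        )
        if good is None:
            continue  # no healthy copy left: the error is uncorrectable
        for index, replica in enumerate(replicas):
            current = replica.get(block_id)
            if current is None or checksum(current) != digest:
                replica[block_id] = good  # overwrite with the redundant copy
                repaired.append((index, block_id))
    return repaired
```

Running such a task periodically in the background means single errors are corrected while a healthy copy still exists, before they can accumulate into an uncorrectable loss.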
What is the role of LDAP in the context of PIC use cases?
In some deployment scenarios, the user authentication and user authorization for scientists working on a given project is centralized in a single existing LDAP server operated by the buyer. In such cases, this server will be made available through the network, using industry-standard secure methodologies, for binding as an Auth/AuthZ provider to the supplier’s servers which provide data access. This binding may be direct, through a supplier-provided proxy, or through a supplier-provided credential translation service. The end result in all cases should be that users identify themselves through the existing, familiar mechanism in order to gain access to data for which they are authorized and which is stored in the supplier’s service.
In PIC use cases, are ACLs enforced at the file level, folder level or collection level?
For PIC’s use cases, it is sufficient to enforce Access Control at the folder level, through ACL or any other mechanism with similar functionality. A folder in this context is defined as a convenient way to refer to or interact with a group of files. For PIC’s use cases, collections are defined through metadata queries whose results are lists of files. The user will then attempt to access the files in the list, succeeding if allowed by the permission of the folder where the file is stored. An alternative, acceptable implementation would be to have file-level access control specification, but in this case the supplier would have to provide tools to easily set and modify access control specifications on lists of files resulting from metadata queries.
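The folder-level model described above can be sketched as follows: a collection is the result of a metadata query, and each file in the resulting list is readable only if the user has permission on the folder that holds it. The catalogue entries, folder names, users and function names are all hypothetical and serve only to illustrate the access logic.

```python
from pathlib import PurePosixPath

# Hypothetical folder-level permissions: folder -> users allowed to read it.
FOLDER_ACL = {
    "/survey/raw": {"alice"},
    "/survey/public": {"alice", "bob"},
}

# Hypothetical metadata catalogue: one entry per file.
CATALOGUE = [
    {"path": "/survey/raw/img001.fits", "band": "r"},
    {"path": "/survey/public/img002.fits", "band": "r"},
    {"path": "/survey/public/img003.fits", "band": "g"},
]


def query_collection(catalogue, **criteria):
    """A collection is the list of files whose metadata match the query."""
    return [entry["path"] for entry in catalogue
            if all(entry.get(key) == value for key, value in criteria.items())]


def can_read(user, path, folder_acl):
    """Access is decided by the folder in which the file is stored."""
    folder = str(PurePosixPath(path).parent)
    return user in folder_acl.get(folder, set())


def accessible_files(user, catalogue, folder_acl, **criteria):
    """Files of the queried collection that the user may actually read."""
    return [path for path in query_collection(catalogue, **criteria)
            if can_read(user, path, folder_acl)]
```

For example, `accessible_files("bob", CATALOGUE, FOLDER_ACL, band="r")` returns only the file in `/survey/public`, because `bob` has no permission on `/survey/raw`.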
Q&A related to the Deployment Scenarios for High Energy Physics (CERN):
The CERN Open Data Portal deployment scenario is referring to the XRootD protocol. Can you provide more information?
The XRootD software framework is a fully generic suite for fast, low-latency and scalable data access, which can natively serve any kind of data organized as a hierarchical, file-system-like namespace based on the concept of a directory. More information on XRootD can be found on the relevant webpage: http://xrootd.org/. Please note that the XRootD protocol in the CERN Open Data deployment scenario is only necessary for the "live reuse" use case, and only for the data recall direction. The data upload direction can use any standard protocol such as HTTP. Moreover, in the basic "cloud archiving" use case, only the Service Managers will access the data on the Archive, and support for the XRootD protocol is not necessary.
For BaBar, what’s the restart capability requirement?
It should be possible to restart the ingest process at a reasonable check-point. Maybe this would be implemented by restarting the ingest of the current file, directory or other reasonable point.
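One possible reading of this requirement, with the file as the check-point unit, is sketched below: the ingest loop records each completed file in a small check-point file and skips those entries when it is restarted. The `upload` callable, the JSON check-point format and the function name are assumptions for illustration, not a prescribed design.

```python
import json
import os


def ingest(files, upload, checkpoint_path):
    """Ingest files in order, resuming after the last completed file.

    files:           ordered list of file names to ingest
    upload:          callable that transfers one file (may raise on failure)
    checkpoint_path: where the list of completed files is persisted
    """
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            done = set(json.load(handle))

    for name in files:
        if name in done:
            continue  # already ingested in a previous run
        upload(name)  # a failure here leaves the check-point untouched
        done.add(name)
        # Write to a temporary file and rename it, so that an interruption
        # never leaves a half-written check-point behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as handle:
            json.dump(sorted(done), handle)
        os.replace(tmp_path, checkpoint_path)
    return done
```

If a run is interrupted, calling `ingest` again with the same arguments restarts at the first file that was not fully transferred; a directory or another unit could serve as the check-point in the same way.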
Re BaBar: what are the requirements for the access of the data?
The current request is simply for an archive copy of the data to be resilient to the SLAC Directorate’s statement that the data can no longer be hosted at SLAC. Other copies of the data are likely to exist in the short-term and for short-term (re-)analysis it is these copies that are likely to be targeted (hosted at institutes that are members of the BaBar Collaboration and who therefore have the necessary ancillary infrastructure).
Re BaBar: How can you compare data?
Physicists will compare their current work with previous analysis. This includes statistical comparisons with data from other experiments, e.g. the analysis from BaBar may be compared with analysis from Belle II. (Some of the BaBar data is unique - further details in the original request for Tina Cartaro). It is important to note that the “comparisons” do not expect bit-level agreement and are often eye-ball comparisons of histograms or other plots. This is the same technique as used to validate new s/w releases within on-going experiments.
Re BaBar: Is the file format such that 1 bit of corruption invalidates the whole file or just a subset?
The question is unclear; we are not sure whether it relates to failures of the fixity checks. HEP (High Energy Physics) has traditionally used file formats designed for unreliable tapes (e.g. Hydra, Zebra) that were resilient to errors. In case of problems, typically the current “event” would be skipped. As HEP moved away from tape towards disk as the primary support for production and analysis, the I/O software has (probably) lost some of this capability. This highlights the tensions between performance for on-going production and analysis and the lower performance needs of long-term re-usability.
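The idea of skipping only the damaged "event" can be illustrated with a toy record format in which every event carries its own length and checksum. This is not the Hydra or Zebra format, nor the BaBar file format; the layout (little-endian length and CRC32 header per record) and the function names are assumptions chosen to show the principle.

```python
import struct
import zlib

HEADER = struct.Struct("<II")  # payload length, CRC32 of the payload


def pack_events(events):
    """Serialise events as [length][crc32][payload] records."""
    out = bytearray()
    for payload in events:
        out += HEADER.pack(len(payload), zlib.crc32(payload)) + payload
    return bytes(out)


def read_events(blob):
    """Return (good events, number of skipped events)."""
    good, skipped, pos = [], 0, 0
    while pos + HEADER.size <= len(blob):
        length, crc = HEADER.unpack_from(blob, pos)
        start = pos + HEADER.size
        payload = blob[start:start + length]
        if len(payload) == length and zlib.crc32(payload) == crc:
            good.append(payload)
        else:
            skipped += 1  # corrupted event: skip it, keep reading the rest
        pos = start + length
    return good, skipped
```

In this sketch a flipped bit inside one payload invalidates only that event. A flipped bit in a length field would break the framing of everything that follows, which is why real error-resilient formats add further safeguards that are not shown here.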
Re BaBar: Please give more details about security for BaBar.
The security is minimal as it is physics data that does not include confidential or personal data.
CERN digital memory: Is it live data or historical data?
Personal information in CERN digital memory. Is there a need for GDPR content review as the data is moved into Archive?
CERN is not subject to GDPR but ensures the adoption of best practices for the processing operations of Personal Data governed by CERN Operational Circular 11 (OC11). This will need to be ensured by your solution.
These projects (BaBar and CERN Digital Memory) seem to not include any R&D.
We believe that proven production functionality at the PB range for complex data types, integrated in the EOSC context and ensuring the full preservation life cycle, etc., does not currently exist. If it does, please provide some references. ARCHIVER is an opportunity to demonstrate that functionality.
Is CERNVM-FS a requirement? Is it accepted if another software can provide the same functionality?
For the CERN Open data "cloud archiving" use case, the Archive does not need to support CernVM-FS, since the data will be accessed only by the Service Managers. For the CERN Open Data "live reuse" use case, the Archive will have to support CernVM-FS in order to serve the virtual machines and the software necessary for running example open data analysis completely decoupled from the usual CERN infrastructure.
PLEASE NOTE that CVMFS and CernVM have been offered by EGI (European Grid Infrastructure) - and hence also in the EOSC context - for more than a decade. They are used to snapshot the s/w and the necessary associated environment and are widely used both within and outside HEP.
You mentioned in the case of CERN Digital Memory that even the service manager does not have access to the data. Who has access to the material? And why bother to keep this data? Who is setting the security levels?
In general, only a fraction of institutional data (but also personal data and physics datasets) are subject to embargoes. Each experiment also has its own embargo period before releasing world-wide the datasets used for analysis. In CERN Official Archive, rules are precisely defined to set the date of public opening for the historical content that is selected for preservation. These rules, which are applied to physical material, must also be applied to digital resources. The system administrators shall not be able to read, copy, or disseminate data which have not yet been cleared for public opening (in some cases occurring 50 years after their creation).
An archival system that would propose a solution for this issue would clearly be a plus.
Regarding the case of the CERN Digital Memory Deployment Scenario, does the web interface represent the level 4 of the stack?
No. CERN Digital Memory preservation will act as a "dark" archive, only searchable/retrievable by the live information systems administrators.
What is the expectation regarding the handling of the metadata and the personal information associated with it?
The requirement is that the archive conforms with GDPR. This is not really an issue for the Digital Memory archive, but more for the systems that are managing the metadata/data.
For the CERN Open Data Deployment Scenario, there is the open data software reproducibility case on Layer 4. Could you please expand on this?
For the CERN Open Data deployment, the data being archived on the provider infrastructure contains a variety of data types including collision and simulated datasets, software environments, virtual machine images and Docker containers, as well as configuration files and analysis examples exploring the open data.
When users try to reproduce an open data analysis example, they have to download a VM image or a Docker container and start the analysis on their own computer. The data and software necessary for the given analysis are downloaded "on demand" at runtime from the CERN Open Data portal and from the CERN cloud. This is achieved through the CERNVM-FS service. The use case is included in Layer 4 since it’s considered an advanced use case aiming at the full reproduction of an open data analysis using non-CERN computing infrastructure. The complexity of the use case can grow progressively, from (a) simple hosting of CERNVM-FS data in an S3 service on the service provider infrastructure, while the CERNVM-FS runtime server still runs at CERN in a hybrid model (i.e. users will still access "cvmfs.cern.ch" as their visible entry point), to (b) hosting both CERNVM-FS data and runtime server on the service provider infrastructure (i.e. users will access "cvmfs.example.com" only). It can go further to an additional level of functionality, (c), allowing users to instantiate VMs or Docker containers on the service provider cloud in order to run analyses directly in the cloud (i.e. the service provider would provide compute time to the individual scientist).
Do you expect that computation (reproducibility) to rely on the host cloud or on the CERN cloud? If the answer is the host cloud, then what is the role of the CERN cloud?
For the CERN Open Data analysis reproducibility use case, the computation usually runs on the user's own hardware. The software libraries and the condition databases that are necessary for computation are served from the CERN cloud via the CERNVM-FS service. The base requirement of the Level 4 open data analysis reproducibility deployment is that the service provider serves backend object storage to CERN's CERNVM-FS service. As stated above, the next (intermediate) use case (b) requires the service provider to run the CERNVM-FS runtime server as well, for users' data access. The most advanced version (c) allows the service provider to offer compute time directly to end users, allowing them to run analysis examples on the service provider cloud without having to run them locally on their own computer, achieving “total reproducibility” independent of the underlying infrastructure.
Q&A related to the Deployment Scenarios for Life Sciences (EMBL-EBI):
What is the distribution pattern for the Life Sciences deployment scenario?
Our users come from pretty much everywhere; you can see a live map at https://www.ebi.ac.uk/web/livemap/. They are heavily concentrated in Europe, the USA and China, but also come from the southern hemisphere. Essentially anyone doing research into genomics, proteomics or related fields is very likely to download data from us at some point, if not regularly.
Is Food and Drug Administration (FDA) regulation relevant for the Life Sciences Deployment Scenarios?
No, though we have strict legal requirements for protecting some of our data - e.g. certain human genome sequences that must be accessed according to well-defined protocols.
In the following atomic use case: “As a user, I can deploy my own instances (development, testing & production) of the archive for multiple communities, e.g. on top of my own infrastructure, so that I can handle the diversity of different communities & use cases and don't have to trust on monolithic instance”, are we talking about software in the local infrastructure or an archive?
This means that we are not tied to using a particular platform for running an instance of an archive. I.e. I want the archive to be delivered as a platform/framework/whatever that can be deployed on any suitable hardware, much the same as I can install Kubernetes or OpenStack on a variety of base systems. This is for two reasons: 1) I want to avoid vendor lock-in, and 2) I want to be able to deploy an instance that a new community can play with in isolation, so they can experience it and familiarise themselves with it without having to commit.
How is access to the information of the EMBL use case managed?
Lots of data is public. A lot is also confidential. See slide 11 of the EMBL presentation from the Stansted OMC event for more information. In general, the metadata associated with the bulk data is public. Management of the EMBL metadata is outside the scope of ARCHIVER.
Is the time for the embargo period on the EMBL data defined, or is it sometimes linked to an event (e.g. a study publication) without a fixed deadline?
Embargo periods will not necessarily be fixed in time. They could be (e.g. for 6 months from submission of the data), or they could be bound to external events (e.g. until I publish my thesis).
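The two kinds of embargo mentioned above (time-bound and event-bound) could be represented by a single record. The following Python sketch is a hypothetical illustration only: the Embargo class, its fields and the event label are assumptions, not part of any EMBL or ARCHIVER specification.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import FrozenSet, Optional


@dataclass
class Embargo:
    submitted: date
    duration: Optional[timedelta] = None  # fixed-time embargo, counted from submission
    release_event: Optional[str] = None   # event-bound embargo (hypothetical label)

    def is_lifted(self, today: date, events: FrozenSet[str] = frozenset()) -> bool:
        """Return True once every condition attached to the embargo is met."""
        if self.duration is not None and today < self.submitted + self.duration:
            return False
        if self.release_event is not None and self.release_event not in events:
            return False
        return True
```

An event-bound embargo in this sketch never expires on its own: it is lifted only when the named event is reported, whatever the date.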
Looking after metadata goes hand in hand with data preservation. How can it be out of scope?
There are at least two forms of metadata: domain-specific and system-specific. The system-specific metadata will be things like the creation time of a file, its size, its checksum, path/URI and name. These are things the archive should know and manage for us. The domain-specific metadata is things like what data-type it is (DNA sequence, protein sequence, medical image…), and how it was obtained (lab protocols, types of instruments and procedures etc).
We do not expect ARCHIVER to manage our domain-specific metadata. We already have portals which allow searching our metadata in the ways we need, and we do not intend to move away from them anytime soon. Our portals allow users to discover data and then retrieve them by giving them a URI, and it’s there that ARCHIVER comes into play.
Collaboration aspect: where is authorisation performed in a group of collaborators?
Authorisation will typically be at the level of the portal accessing the data. This will be done with standard protocols such as SAML and OAuth. So we have our own identity providers, and it is up to any portal that guards the data to authenticate users against those providers.
Group membership will also typically be implemented in the portal itself. An external tool that could manage groups for us would be of interest; I don’t believe we currently have anything like that.
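To make the distinction concrete, the system-specific metadata listed above can be derived from the file alone, without any domain knowledge. The following Python sketch shows one way to collect it. The function name and the dictionary keys are hypothetical, SHA-256 is an assumed choice of checksum, and the modification time stands in for the creation time, which is not portably available.

```python
import hashlib
from pathlib import Path


def system_metadata(path: Path) -> dict:
    """Collect system-level metadata for one file (illustrative field names)."""
    stat = path.stat()
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in 1 MiB chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "name": path.name,
        "uri": path.resolve().as_uri(),
        "size_bytes": stat.st_size,
        "modified_epoch": stat.st_mtime,  # stand-in for creation time
        "checksum_sha256": digest.hexdigest(),
    }
```

Nothing in this record says whether the file holds a DNA sequence or a medical image; that information remains in the domain-specific metadata held by the portals.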
Are the EMBL applications you already have Linux-based or something else?
Authentication services are classic web-based APIs. The applications behind them are all Linux, or at least overwhelmingly so.
What are the access patterns to EMBL data like?
We don’t have clear information about this; however, most of our data is still used. A PB of data is downloaded per month, with 20 billion requests per month. Because that comes from all over the world, we can expect the patterns to be fairly flat, definitely not strongly peaked.
Is EMBL performing a cleaning of the data?
A curation process already exists: data is curated on POSIX storage and is then moved to containers. Therefore the cleaning process will not be part of the ARCHIVER project. Once archived, EMBL may provide a new version of the files at some later date, but this will be a new file, with new file-related metadata, and domain-specific metadata specifying its relationship to the previous version of the file. An example of such a file would be the human reference genome. Every so often a new, updated version is released. The old one is deprecated, but not deleted. Many ongoing analyses will still need it, and future analyses may need to come back to it to understand any discrepancies between studies over time.
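The versioning behaviour described above (a new version is a new file, and the old one is deprecated but kept) can be illustrated with a short Python sketch. All names here are hypothetical. In practice the link between versions lives in EMBL's domain-specific metadata rather than in the archive, and treating archived files as immutable is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ArchivedFile:
    uri: str
    checksum: str
    supersedes: Optional[str] = None  # link to the previous version, if any
    deprecated: bool = False


class Archive:
    def __init__(self) -> None:
        self.files: Dict[str, ArchivedFile] = {}

    def add(self, uri: str, checksum: str, supersedes: Optional[str] = None) -> ArchivedFile:
        if uri in self.files:
            # Assumption: an archived file is never overwritten in place.
            raise ValueError("file already archived; add the new version as a new file")
        if supersedes is not None:
            self.files[supersedes].deprecated = True  # deprecated, not deleted
        self.files[uri] = ArchivedFile(uri, checksum, supersedes)
        return self.files[uri]

    def history(self, uri: str) -> List[str]:
        """Walk back from a file through all the versions it supersedes."""
        chain: List[str] = []
        current: Optional[str] = uri
        while current is not None:
            chain.append(current)
            current = self.files[current].supersedes
        return chain
```

After a second version is added, the first remains retrievable under its original URI, so ongoing analyses that depend on it keep working.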
Does EMBL currently keep data in cloud or on premises?
What if we propose something that handles the curation process for EMBL as part of our solution?
We would not rule this out. We would be concerned about the financial impact of this. What EMBL is most interested in is you developing a storage and archive platform which is more financially attractive than our current in-house service.
Does EMBL encrypt data to manage access to it or for another reason?
Encryption ensures that the data can be stored elsewhere in a safe way. The encryption key is managed by a party outside of the organisation.
What are the expected functionalities and performance of the storage model?
We are not averse to the idea of tiering but it is not clear to us that this could help because we have a very long tail of access. We are expecting to use mostly warm storage.
What are your expectations beyond the infrastructure level (scale, ingest rate…)? Are you expecting the data to follow the FAIR principles?
FAIR will be implemented in metadata portals to allow data discovery. In addition, FAIR has to be embedded in the applications, which for EMBL are outside the scope of ARCHIVER (in the context of the EMBL deployment scenarios).
Do you expect accounting for individual users?
User types will be grouped, and accounting will be handled differently for different group types.
Q&A related to the Deployment Scenarios for Photon-Neutron Sciences (DESY):
In the deployment scenarios from DESY, it is required that a user can manage / add new versions (state) of derived/added data to existing archives. Does this correspond to storing a delta?
Yes, but only down to the file level (not the bit level): the delta can contain all changed files and new files.
What about human-generated data? And what is your intention regarding personal data?
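A file-level delta of this kind can be computed by comparing two manifests that map file paths to checksums. The following Python sketch is a minimal illustration; the function name and manifest format are assumptions, and deleted files are deliberately ignored because the answer above mentions only changed and new files.

```python
from typing import Dict, List


def file_level_delta(previous: Dict[str, str], current: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two {path: checksum} manifests at file granularity."""
    new = sorted(p for p in current if p not in previous)
    changed = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return {"new": new, "changed": changed}
```

A file whose checksum differs is included in the delta as a whole, regardless of how few bytes actually changed.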
The data mentioned on our slides does not contain any personal data, even if generated by a human. All the data is used (potentially) for publishing at the end so it doesn’t contain personal data. Regarding personal data, we have separate solutions in place and therefore it is out of scope of the ARCHIVER project for DESY.
DESY is expecting to pay for the services up front. Why do you then need a financial dashboard?
We want to ensure that the option for a pay-as-you-go model exists. It may be needed for some applications or experiments in the future. In addition, the financial dashboard is required to monitor the resources consumed (possibly by different sub-tenants) and to verify that this is within expectations.
In the DESY use case, layer 1 (i.e. storage) is on-premises. Is this the only use case with such a requirement? Why is this?
This is a misunderstanding: layer 1 does not have to be on-premises. Ideally, the solution developed should just be able to attach/use a storage layer wherever it is located (on-prem, in a friendly institute, in a cloud). It should also be able to attach several different storage layers (thus making it hybrid).
Therefore we are asking for all three types of storage layer configurations: ‘on-premises only’, ‘off-premises only’ and ‘on- and off-premises’ (the hybrid one).
Q&A related to GÉANT Connectivity
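The three configurations follow directly from where the attached storage layers are located. The following Python sketch classifies a set of attached layers; the enum and function names are hypothetical and serve only to illustrate the definitions above.

```python
from enum import Enum
from typing import Iterable


class Location(Enum):
    ON_PREMISES = "on-premises"
    OFF_PREMISES = "off-premises"


def classify(layers: Iterable[Location]) -> str:
    """Name the configuration formed by the attached storage layers."""
    locations = set(layers)
    if not locations:
        raise ValueError("at least one storage layer must be attached")
    if locations == {Location.ON_PREMISES}:
        return "on-premises only"
    if locations == {Location.OFF_PREMISES}:
        return "off-premises only"
    return "on- and off-premises (hybrid)"
```

Attaching a second layer in a different location is what turns either of the single-location configurations into the hybrid one.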
What Quality of Service (QoS) guarantees does GÉANT have?
As GÉANT is not a commercial network operator, it does not provide guarantees or SLAs but Service Level Targets. GÉANT Service Level Targets can be found here. Information about the QoS can be found in the Monthly Service reports. However, the QoS is also dependent on the specific path or network segment. More information on the process to connect to GÉANT can be found in section 2.2.1 of the Draft Functional Specification.
We can deploy already onto AWS and Azure. There is connectivity from those providers to GÉANT, and an agreement allowing users of GÉANT to purchase from these providers at reduced costs. How does this fit together with ARCHIVER?
GÉANT has put in place the IaaS Framework, under which several cloud providers have been selected. Any country and NREN that is part of the framework can purchase via this framework. ARCHIVER will not buy resources from the GÉANT IaaS framework for this project.
Are there cloud providers that are peering with the GÉANT network but are not part of the IaaS framework?
Yes, such as AWS (only resellers are present in the GEANT IaaS framework), Exoscale (not present in the GÉANT IaaS framework) and T-Systems (present in the framework but for Germany only).
In HNSciCloud, there were some problems with connecting to the GEANT network on the buyers’ side. The interconnect link was not able to generate IP traffic above 5G. Will there be a dedicated circuit to address this in ARCHIVER?
Not initially. Capacity is reserved for R&I institutions (from the two ends of the GÉANT backbone). However, the assumption is that there is no need for reserved capacity in the GÉANT backbone, as there is virtually always free capacity available. Reserved capacity might be needed at the end of the network path. Monitoring will be put in place by installing perfSONAR probes.
If a company is peered with an NREN (GRNET), is this sufficient for being connected to the GÉANT network?
There are essentially 3 different ways of peering with GÉANT: connection via the NREN, direct peering with GÉANT, or connection at Internet Exchange (IX) locations. Please refer to the network presentation given at the event in Stansted for more information: https://www.archiver-project.eu/open-market-consultation-event-london-stansted-airport
Q&A related to the European Open Science Cloud (EOSC)
What is the process for the industry to interact with the European Open Science Cloud?
The planning of the European Open Science Cloud is still under way and further details can be provided at the end of 2019. The latest information is available at the following URLs:
Five Working Groups have been created (sustainability, landscaping, infrastructure, FAIR, rules of participation). The rules of participation and sustainability WGs have a direct impact on industry. Service providers can give their feedback in the Stakeholder Forum.
As a service provider, I connected to the European Open Science Cloud but since then I didn’t receive any feedback.
A new catalogue is under development for the European Open Science Cloud.
For further details please check the following URLs:
The ARCHIVER project will actively pursue the integration of its resulting services in the EOSC. As the rules of engagement mature, the ARCHIVER contractors will be informed of the implications.