Technical Overview

How does FEGA work? Here you can find a technical overview of its many components.

Register

To use a LEGA node, a researcher must first obtain an EGA account by sending a request to the appropriate Helpdesk (e.g. the Portuguese Helpdesk). The researcher then registers an RSA or Ed25519 public key on the EGA website (managed by CEGA). This key is used to upload and download data at the LEGA node(s) where the researcher is registered.

Submitting Data

To submit to FEGA, a user starts the ingestion process by making two separate submissions: a data file submission to a chosen LEGA node, and a metadata submission through a country submission portal (in this case, the Portugal submission portal, hosted at CEGA).

Data Submission

Data submission requires the LEGA node’s public key file. To create it, copy the text below into a new file (saved as ‘LEGA.pub’, for example).

ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIJQJY9fFb6Bft7frHDeCLlQrh102adymVXgc9hlWPYN3 Service.Key@LocalEGA

For the data file submission, a user has to first encrypt the data in C4GH format using the LEGA node’s public key and Crypt4GH. To do that, a user may use the command presented below, with the variables set accordingly:

crypt4gh encrypt --recipient_pk $lega_pubkey_file < $filename > $filename.c4gh

Once the file is encrypted, the user can connect to the inbox’s SSH File Transfer Protocol (SFTP) port (2222 in this deployment) using the command shown below:

sftp -o User=$username -P 2222 -i $privkey_file $lega_ip

The parameters must be set to, respectively, the EGA username, the user’s private key file (whose public key was previously registered at CEGA), and the LEGA node’s IP address. Alternatively, the user may use another SFTP client, such as FileZilla.
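For users who prefer to script these two steps, the sketch below assembles the same crypt4gh and sftp invocations from Python. It is only an illustration: the function names and the example username, key file and IP address are hypothetical, and running encrypt_file assumes the crypt4gh command-line tool is installed and on the PATH.

```python
import shlex
import subprocess


def encrypt_command(lega_pubkey_file):
    """argv for the encryption step; crypt4gh reads stdin and writes stdout."""
    return ["crypt4gh", "encrypt", "--recipient_pk", lega_pubkey_file]


def sftp_command(username, privkey_file, lega_ip, port=2222):
    """argv for connecting to the inbox, mirroring the sftp command above."""
    return ["sftp", "-o", f"User={username}", "-P", str(port),
            "-i", privkey_file, lega_ip]


def encrypt_file(filename, lega_pubkey_file):
    """Encrypt filename to filename.c4gh (assumes crypt4gh is installed)."""
    encrypted = filename + ".c4gh"
    with open(filename, "rb") as src, open(encrypted, "wb") as dst:
        subprocess.run(encrypt_command(lega_pubkey_file),
                       stdin=src, stdout=dst, check=True)
    return encrypted


if __name__ == "__main__":
    # Hypothetical values, shown only to print the resulting command line.
    print(shlex.join(sftp_command("alice", "id_ed25519", "192.0.2.10")))
```

Building the commands as argument lists, rather than as one shell string, avoids quoting problems when file names contain spaces.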

Once connected, the user has access to a personal inbox directory, where all uploaded files are stored in LEGA until the metadata submission is finished. To upload the data file, the user may issue the command put <filename> in the SFTP shell. Incorrect or outdated submissions may also be removed from the inbox using the command rm <filename>. This does not apply, however, to submissions that are already in staging or in the vault.

When the file has finished uploading, a routine running on LEGA will try to ingest the file. To do that, it will try to decrypt the file using crypt4gh and the LEGA private key. If for some reason the file is not successfully decrypted (e.g. not a C4GH file, wrong public key used) or a staging pipeline fails to execute, the ingestion process is aborted, the file in staging is deleted and the file will appear in the submission portal as not ingested.
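One of the failure causes mentioned above, uploading a file that is not in C4GH format, can be caught locally before uploading. The sketch below checks the fixed 16-byte preamble that the Crypt4GH specification places at the start of every file (the magic string "crypt4gh", a version number and a header packet count). It is not the LEGA ingestion code, the function name is hypothetical, and it cannot detect a file encrypted with the wrong public key.

```python
import struct

MAGIC = b"crypt4gh"   # first 8 bytes of every Crypt4GH file
PREAMBLE_SIZE = 16    # magic (8) + version (4) + packet count (4)


def looks_like_crypt4gh(first_bytes):
    """Return True if first_bytes starts with a Crypt4GH version 1 preamble."""
    if len(first_bytes) < PREAMBLE_SIZE:
        return False
    # Version and packet count are little-endian unsigned 32-bit integers.
    magic, version, packet_count = struct.unpack(
        "<8sII", first_bytes[:PREAMBLE_SIZE])
    return magic == MAGIC and version == 1 and packet_count > 0
```

In practice the argument would be the first 16 bytes read from the .c4gh file opened in binary mode.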

Metadata Submission

If the ingestion of the file succeeds, the user will be able to submit the metadata of the dataset. After the submission is accepted by the Helpdesk and released, the file is imported to the vault and the dataset becomes findable on the EGA website, where accessors can request it.

Before the ingestion of the data files is complete, the files pass through a staging phase, in which pipelines may be run on the files to perform checks, make changes, and scrape relevant information from them.

In the current deployment, no pipelines are active, but the staging phase remains part of the system.

Encryption & Backup

After staging completes successfully, the file is re-encrypted, which only requires re-encrypting the Crypt4GH header. The header and the payload are then split: the header (now encrypted with the LEGA public key) is saved in the archive database, while the payload is copied to the vault and to a backup.
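To illustrate where the header ends and the payload begins, the sketch below walks the header packets of a Crypt4GH file. It relies on the layout in the Crypt4GH specification: a 16-byte preamble followed by packets, each prefixed with a 4-byte little-endian length that includes the length field itself. The function names are hypothetical, and this is not the code LEGA runs.

```python
import struct


def header_length(data):
    """Return the size in bytes of the Crypt4GH header at the start of data."""
    magic, version, n_packets = struct.unpack_from("<8sII", data, 0)
    if magic != b"crypt4gh" or version != 1:
        raise ValueError("not a Crypt4GH version 1 file")
    offset = 16  # skip magic, version and packet count
    for _ in range(n_packets):
        (packet_len,) = struct.unpack_from("<I", data, offset)
        if packet_len < 4:
            raise ValueError("invalid header packet length")
        offset += packet_len  # the length counts its own 4 bytes
    if offset > len(data):
        raise ValueError("truncated header")
    return offset


def split_header_payload(data):
    """Split the bytes of a Crypt4GH file into (header, payload)."""
    n = header_length(data)
    return data[:n], data[n:]
```

For large files only the first few kilobytes need to be read to locate the boundary; the payload can then be copied as a stream rather than held in memory.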

File backups are saved to an S3 bucket by mounting an S3 filesystem on the backup directory, which replicates the directory’s contents from the host to the S3 server.

The same process is applied to the archive database, which holds the headers, each containing the symmetric key needed to decrypt its payload. The backups of these headers should be stored separately from the backups of the payload, so that an attacker would need to compromise both backups to access the data.

If the data files stored in EGA were small, it would be feasible to copy a file from the vault to a user’s outbox and replace its header so that the accessor could decrypt it. Because the files are large and the header is very small compared to the data, the LEGA instead mounts a Filesystem in Userspace (FUSE) on each accessor’s outbox. For each file an accessor has access to, the FUSE presents what appears to be a modified copy of the vault file, but it actually computes the bytes to return on each read call.

For byte addresses that fall within the header, it re-encrypts the vault file header on the fly: it decrypts the header with the LEGA private key and then encrypts it again for each of the user’s public keys registered at EGA, so that the user can decrypt the header with any one of the corresponding private keys. For byte addresses beyond the header region, the server maps the addresses using the difference between the vault and outbox header sizes and returns the payload bytes exactly as they are stored in the vault file.
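The address mapping described above can be sketched as a single read function. This is a simplified illustration, not the actual LEGA FUSE code: read_outbox, outbox_header, vault_header_len and read_vault are hypothetical names, and the re-encrypted header is assumed to have been computed already.

```python
def read_outbox(offset, size, outbox_header, vault_header_len, read_vault):
    """Return bytes [offset, offset + size) of the virtual outbox file.

    outbox_header    -- header already re-encrypted for the accessor
    vault_header_len -- size of the header stored with the vault file
    read_vault       -- callable (vault_offset, length) -> bytes
    """
    out = b""
    header_len = len(outbox_header)
    end = offset + size
    if offset < header_len:
        # The request starts inside the header region: serve it from the
        # re-encrypted header, never from the vault.
        out += outbox_header[offset:end]
    if end > header_len:
        # The rest is payload: shift the address by the difference between
        # the two header sizes and return the vault bytes unchanged.
        start = max(offset, header_len)
        vault_offset = start - header_len + vault_header_len
        out += read_vault(vault_offset, end - start)
    return out
```

Since the payload is never rewritten, serving a file to several accessors costs one small header re-encryption each, however large the file is.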

Note: If the ingestion of a data file fails, the user has to remove the datasets that point to problematic runs, then remove all the problematic runs, delete the file from the inbox and submit the new file correctly.

Accessing Data

There are two methods for accessing data in FEGA Portugal: SFTP and EGA-QuickView.

SSH File Transfer Protocol

The most direct way to download a file from a LEGA node is to connect to its distribution endpoint over SFTP, download the file as usual with get <filename>, and then decrypt the entire C4GH file using Crypt4GH. This, however, requires the user to download the whole file and decrypt it in full once the download completes.

This method is inefficient because genomic data files are often large, and researchers usually do not need the whole file. Even when they do, they only need to read some regions of it at a time. Using SFTP as the only distribution tool would therefore cause significant bandwidth, memory, and CPU overhead on the server, and would force researchers to provision large amounts of memory to access the files, even when they need only a small portion of them.

EGA-QuickView

To circumvent the downsides of an SFTP-only solution, explained in the previous section, EGA-QuickView was developed.

This tool takes advantage of two things: Crypt4GH encrypts data in blocks, and the LEGA distribution serves requests for specific bytes of a file without requiring all the preceding ones to be downloaded.
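To make the benefit of block encryption concrete, the sketch below computes which encrypted bytes a client would need to request in order to read a given range of the decrypted file. It uses the segment sizes from the Crypt4GH specification (65536 bytes of plaintext per segment, each stored with a 12-byte nonce and a 16-byte authentication tag). The function name and the header_len parameter are hypothetical, and this is not taken from the EGA-QuickView source.

```python
SEGMENT_SIZE = 65536      # plaintext bytes per Crypt4GH segment
CIPHER_OVERHEAD = 12 + 16  # nonce + authentication tag per segment
CIPHER_SEGMENT_SIZE = SEGMENT_SIZE + CIPHER_OVERHEAD


def encrypted_range(plain_offset, length, header_len):
    """Return (start, end) byte offsets in the encrypted file that cover
    plaintext bytes [plain_offset, plain_offset + length).

    Whole segments must be fetched, since each one is authenticated as a
    unit. The end offset may lie past the end of the file when the last
    segment is shorter, so callers should clamp it to the file size.
    """
    if plain_offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    first = plain_offset // SEGMENT_SIZE
    last = (plain_offset + length - 1) // SEGMENT_SIZE
    start = header_len + first * CIPHER_SEGMENT_SIZE
    end = header_len + (last + 1) * CIPHER_SEGMENT_SIZE
    return start, end
```

Reading a few bytes anywhere in a multi-gigabyte file therefore requires at most one or two segments of about 64 KiB each, plus the header.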

In particular, EGA-QuickView creates a mount point on the data accessor’s file system, backed by a local FUSE that works much like the one used at the LEGA distribution. Instead of reading bytes from the vault encrypted with the LEGA public key and re-encrypting them for the accessor, it requests the re-encrypted header from the server, decrypts the header using a private key provided by the user, and then uses the decrypted secret key to return unencrypted file bytes.

Can I Create a Node?

Yes, it is possible to create a FEGA Portugal node at your own institution. The node stores the data locally and stores only the non-sensitive metadata in Central EGA for public display.

Advantages

A local node suits institutions with very strict data policies, where the data produced inside the institution cannot leave it, even if encrypted.

Responsibilities

You have to provide and maintain the hardware infrastructure to store the data and install the software. BioData.pt can provide assistance in the installation of the software for its associates.

Hardware Requirements

A virtual machine with 4 CPUs, 8 GB of RAM and 40 GB of disk space should be enough for the installation. In addition, at least two volumes must be set up for file storage (one for the vault and another for backup); their size depends on the volume of data your organization produces.

Costs

No cost for BioData.pt associates. Budget on demand for other institutions.