Technical Overview

How does FEGA work? Here you can find a technical overview of its many components.

Start by contacting your institution’s Data Protection Officer (DPO). Ensure you have ethical committee approval documents ready for submission. Completing the Data Processing Agreement (DPA) with the FEGA Portugal node is required.

Register

To use a LEGA node, a researcher must first obtain an EGA account by sending a request to the appropriate Helpdesk (e.g. the Portuguese Helpdesk). The researcher must then register an RSA or ED25519 public key at the EGA website (managed by the CEGA). This key is used to upload and/or download data at the LEGA node(s) where the researcher is registered.
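
As an illustration of the key registration step, the sketch below creates an ED25519 key pair with ssh-keygen. It assumes the key registered at the EGA website is in OpenSSH format; the function name make_ega_keypair, the key comment and the key path are hypothetical choices, not FEGA requirements.

```shell
# Hypothetical helper: create an ED25519 key pair in OpenSSH format.
# Usage: make_ega_keypair <key path> [extra ssh-keygen options]
make_ega_keypair() {
  key_path=$1
  if [ -e "$key_path" ]; then
    echo "refusing to overwrite existing key: $key_path" >&2
    return 1
  fi
  mkdir -p "$(dirname "$key_path")"
  shift
  # Extra arguments (for example -N "passphrase") are passed to ssh-keygen.
  # Without -N, ssh-keygen asks for a passphrase interactively.
  ssh-keygen -q -t ed25519 -C "ega-fega-key" -f "$key_path" "$@"
}
```

Running make_ega_keypair "$HOME/.ssh/ega_ed25519" would write the private key to that path and the public key to the same path with a .pub suffix; only the .pub file is meant to be registered.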

Submitting Data

To make a submission to the FEGA, a user starts the ingestion process by making two separate submissions: a data file submission on a chosen LEGA node, and a metadata submission on a country submission portal (in this case, the Portugal submission portal, hosted in CEGA).

Prerequisites

You must be able to use a shell terminal. The instructions are aimed at Linux environments, but the same steps can be followed in other environments using similar tools.

Install sda-cli

  • Download sda-cli, the tool used to upload data into FEGA. There are different releases available here. If you are using a terminal, you can run the following command to download the Linux version:
    wget https://github.com/NBISweden/sda-cli/releases/download/v0.3.0/sda-cli_.0.3.0_Linux_x86_64.tar.gz
    
  • Extract the tool (substitute the filename, if needed):
    tar -xzvf sda-cli_.0.3.0_Linux_x86_64.tar.gz
    
  • When doing submissions, you will need to call the sda-cli file from the terminal (e.g., using ./sda-cli).
  • (Optional) Add sda-cli to system-wide binaries (you need root access!):
    sudo cp sda-cli /usr/bin
    

Getting the server’s public key

  • Download the Portuguese LEGA’s public key from https://inbox.ega.biodata.pt/c4gh.pub.pem . You can download it using a web browser or using wget https://inbox.ega.biodata.pt/c4gh.pub.pem . The downloaded file should look like this:
    -----BEGIN CRYPT4GH PUBLIC KEY-----
    B2gV8b0FoVLDz0x156JBpLXdB069w4UTtWYeQf9Yzz4=
    -----END CRYPT4GH PUBLIC KEY-----
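
A quick sanity check of the downloaded key can catch a truncated or HTML error page saved under the key's filename. The sketch below assumes the three-line layout shown above and relies on the fact that a Crypt4GH public key is 32 raw bytes, base64-encoded; the function name check_c4gh_pubkey is hypothetical.

```shell
# Hypothetical helper: verify that a file looks like a Crypt4GH public key.
# Returns 0 when the markers match and the key decodes to 32 bytes.
check_c4gh_pubkey() {
  key_file=$1
  [ "$(sed -n 1p "$key_file")" = "-----BEGIN CRYPT4GH PUBLIC KEY-----" ] || return 1
  [ "$(sed -n 3p "$key_file")" = "-----END CRYPT4GH PUBLIC KEY-----" ] || return 1
  # Line 2 holds the base64-encoded key; count the decoded bytes.
  bytes=$(sed -n 2p "$key_file" | base64 -d 2>/dev/null | wc -c | tr -d ' ')
  [ "$bytes" -eq 32 ]
}
```

For example, check_c4gh_pubkey c4gh.pub.pem && echo "key looks valid" prints the message only when the file passes all three checks.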
    

Configuration file

  • Create a file named s3cmd.conf using the following template and replace the values of access_key and secret_key with your EGA username:
    [default]
    encoding = UTF-8
    guess_mime_type = True
    use_https = True
    host_base = https://inbox.ega.biodata.pt
    host_bucket = https://inbox.ega.biodata.pt
    human_readable_sizes = True
    multipart_chunk_size_mb = 50
    socket_timeout = 30
    access_key = <EGA USERNAME>
    secret_key = <EGA USERNAME>
    access_token = <ACCESS TOKEN>
    
  • You can leave the access token placeholder as is for now; you will get the token once you log in. The access token is valid for 7 days, so you will need to replace it periodically.
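
Because the token has to be replaced regularly, it can be convenient to update only the access_token line of an existing s3cmd.conf instead of editing the file by hand. The sketch below assumes the "access_token = ..." line format from the template above; the function name set_access_token is hypothetical.

```shell
# Hypothetical helper: write a fresh access token into an existing config file.
# Usage: set_access_token <config file> <token>
set_access_token() {
  conf=$1
  token=$2
  tmp=$(mktemp) || return 1
  # Replace only the access_token line; every other setting is kept as-is.
  sed "s|^access_token = .*|access_token = ${token}|" "$conf" > "$tmp" &&
    mv "$tmp" "$conf"
}
```

For example, set_access_token s3cmd.conf "<ACCESS TOKEN>" rewrites the file in place with the new value.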

Uploading a file

Configuring credentials

  • Go to https://login.ega.biodata.pt and log in using EGA. If you don’t have EGA credentials, visit this URL.
  • Copy the access token (the long text starting with “eyJ”) and put it in the s3cmd.conf file. Note: Try using triple-click on the text to select everything.
  • You may also click on “Download credentials to upload to the inbox” to get a ready-made configuration file when logging in, but please delete the lines setting check_ssl_hostname and check_ssl_certificate to False.
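
To find out when a token stops working, its expiry time can be read from the token itself. The sketch below assumes the access token is a JSON Web Token (which the "eyJ" prefix suggests) carrying a numeric exp claim; the function name jwt_exp is hypothetical.

```shell
# Hypothetical helper: print the "exp" claim (Unix timestamp) of a JWT.
# Assumption: the token is a JWT whose payload contains a numeric "exp" field.
jwt_exp() {
  # The payload is the second dot-separated part, base64url-encoded.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url omits padding; restore it so base64 -d accepts the input.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do
    payload="${payload}="
  done
  printf '%s' "$payload" | base64 -d |
    sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9]*\).*/\1/p'
}
```

On Linux with GNU date, date -d "@$(jwt_exp "$TOKEN")" converts the timestamp into a readable date, where TOKEN holds the text copied from the login page.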

Uploading

  • After installing sda-cli, downloading the server’s public key, and configuring an s3cmd.conf file, you are ready to upload a file. Use the following command and check the result:
    ./sda-cli -config s3cmd.conf upload -encrypt-with-key c4gh.pub.pem <FILE TO UPLOAD>
    
  • If an error is returned, check that all the paths you are using are correct and that the access token is still valid.
  • To check if the upload was successful, run the following command to see what files are in your inbox:
    ./sda-cli -config s3cmd.conf list
    
  • If the file you uploaded appears in the list, the file was successfully uploaded!
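
The upload and the follow-up check can be combined into one step. The sketch below only uses the two sda-cli invocations shown above; it assumes the inbox listing contains the uploaded file's base name (possibly with a suffix added during encryption). The function name upload_and_verify and the SDA_CLI variable are hypothetical.

```shell
# Hypothetical helper: upload a file, then confirm it appears in the inbox.
# SDA_CLI can point to the sda-cli binary; it defaults to ./sda-cli.
upload_and_verify() {
  file=$1
  sda=${SDA_CLI:-./sda-cli}
  "$sda" -config s3cmd.conf upload -encrypt-with-key c4gh.pub.pem "$file" || return 1
  # Assumption: the listing shows the file under its base name.
  if "$sda" -config s3cmd.conf list | grep -F -q "$(basename "$file")"; then
    echo "upload confirmed: $file"
  else
    echo "upload not found in inbox listing: $file" >&2
    return 1
  fi
}
```

The function returns a non-zero status when either the upload fails or the file is missing from the listing, so it can be used in scripts that submit several files.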

Metadata Submission

If the ingestion of the file succeeds, the user will be able to submit the metadata of the dataset. After the submission is accepted by the Helpdesk and released, the file is imported to the vault and the dataset is findable on the EGA website for accessors to request it.

Before the ingestion of the data files is complete, the files pass through a staging phase, in which pipelines may be run on the files to perform checks, make changes, and scrape relevant information from them.

In the current deployment, no pipelines are active, but the staging phase is still part of the system.

Encryption & Backup

After staging completes successfully, the file is re-encrypted. This only requires re-encrypting the Crypt4GH header, not the payload. The header and the payload are then split: the header (now encrypted with the LEGA public key) is saved in the archive database, while the payload is copied to the vault and to a backup.
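
To illustrate where the header ends and the payload begins, the sketch below computes the header size of a Crypt4GH file. It follows the layout in the Crypt4GH specification (8-byte magic "crypt4gh", 4-byte version, 4-byte packet count, then packets that each start with a 4-byte little-endian length covering the whole packet). It is not the LEGA implementation, and the function names are hypothetical.

```shell
# Read a 4-byte little-endian unsigned integer at a byte offset.
# Usage: le32_at <file> <offset>
le32_at() {
  # od prints the four bytes as decimals; they become $1..$4.
  set -- $(od -An -tu1 -j "$2" -N 4 "$1")
  echo $(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
}

# Hypothetical helper: print the total header size of a Crypt4GH file.
c4gh_header_size() {
  file=$1
  [ "$(head -c 8 "$file")" = "crypt4gh" ] || {
    echo "not a Crypt4GH file: $file" >&2
    return 1
  }
  count=$(le32_at "$file" 12)   # number of header packets
  offset=16                     # magic (8) + version (4) + count (4)
  i=0
  while [ "$i" -lt "$count" ]; do
    len=$(le32_at "$file" "$offset")   # length includes its own 4 bytes
    offset=$(( offset + len ))
    i=$(( i + 1 ))
  done
  echo "$offset"
}
```

With the size known, head -c "$(c4gh_header_size file.c4gh)" file.c4gh would extract the header alone, and the remaining bytes are the payload.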

The file backups are saved to an S3 bucket by mounting an S3 filesystem on the backup directory, which replicates the contents of the host directory to the S3 server.

The same process is applied to the archive database, which stores the headers containing the symmetric key needed to decrypt the payload. The backups of these headers should be stored separately from the backups of the payload, so that an attacker would need to compromise both backups to access the data.

If the data files stored in EGA were small, it would be feasible to copy a file from the vault to a user’s outbox and change its header so the accessor could decrypt it. Since the files are large and the header is very small compared to the data itself, the LEGA instead mounts a Filesystem in Userspace (FUSE) mount point on each accessor’s outbox. For each file a particular accessor has access to, the FUSE presents what appears to be a modified copy of the vault file, but it actually computes the bytes to return on each call. For byte addresses that fall within the header, it re-encrypts the vault file header on the fly: it decrypts the header using the LEGA private key and then encrypts it again for each of the user’s public keys registered at EGA, so the user can decrypt the header with any one of the matching private keys. For byte addresses beyond the header region, the server maps the addresses using the difference between the vault and outbox header sizes and returns the payload bytes exactly as they are stored in the vault file.
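
The address mapping for bytes outside the header region amounts to a single shift by the difference in header sizes. The sketch below shows that arithmetic only; the function name map_outbox_offset and the header sizes used in the example are hypothetical.

```shell
# Hypothetical helper: translate an offset in the outbox view of a file
# into the matching offset in the vault file.
# Usage: map_outbox_offset <outbox offset> <vault header size> <outbox header size>
map_outbox_offset() {
  offset=$1
  vault_header=$2
  outbox_header=$3
  if [ "$offset" -lt "$outbox_header" ]; then
    # These bytes belong to the header that is re-encrypted on the fly.
    echo "header"
  else
    # Payload bytes are identical; only the header length differs.
    echo $(( offset - outbox_header + vault_header ))
  fi
}
```

For instance, with a 124-byte vault header and a 232-byte outbox header, map_outbox_offset 1000 124 232 prints 892, while any offset below 232 is reported as part of the header.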

Note: If the ingestion of a data file fails, the user has to remove the datasets that point to problematic runs, then remove all the problematic runs, delete the file from the inbox and submit the new file correctly.

Accessing Data

There are two methods that can be used to access data in FEGA Portugal: SSH File Transfer Protocol (SFTP) and EGA-QuickView.

SSH File Transfer Protocol

The most immediate way to download files from a LEGA node is to access its distribution endpoint using SFTP, download the file as usual with get <filename>, and then decrypt the entire C4GH file using Crypt4GH. This, however, requires the user to download the entire file and decrypt it as a whole once the download finishes.

This method is not efficient, because genomic data files are often large and researchers usually do not need to read the whole file. Even when they do, they only need to read some regions of the file at a time. Using SFTP as the only distribution tool would therefore cause significant bandwidth, memory, and CPU overhead on the server, and would force researchers to provision large amounts of memory to access the files even when they only need a small portion of them.
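
A minimal sketch of the decryption step is shown below. It assumes the crypt4gh command-line tool (the Python reference implementation) is installed and that the file was fetched beforehand in an SFTP session with get <filename>; the function name decrypt_c4gh and the CRYPT4GH variable are hypothetical.

```shell
# Hypothetical helper: decrypt a downloaded .c4gh file next to the original.
# Usage: decrypt_c4gh <private key file> <encrypted file>
# CRYPT4GH can point to the crypt4gh executable; it defaults to "crypt4gh".
decrypt_c4gh() {
  secret_key=$1
  encrypted=$2
  # Strip the .c4gh suffix for the output name, or add a suffix if absent.
  plain=${encrypted%.c4gh}
  [ "$plain" != "$encrypted" ] || plain="${encrypted}.decrypted"
  # crypt4gh reads the encrypted stream on stdin and writes plaintext to stdout.
  "${CRYPT4GH:-crypt4gh}" decrypt --sk "$secret_key" < "$encrypted" > "$plain"
}
```

For example, decrypt_c4gh my_private.key sample.bam.c4gh would write sample.bam to the same directory; the key and file names here are placeholders.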

EGA-QuickView

To circumvent the downsides of an SFTP-only solution described in the previous section, EGA-QuickView was developed.

This tool takes advantage of Crypt4GH using block encryption and the fact that the LEGA distribution allows for requests for certain bytes of a file without downloading all the previous ones.
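
The block structure makes it possible to work out which encrypted bytes cover a given plaintext range. The sketch below uses the segment sizes from the Crypt4GH specification: 65536 plaintext bytes per segment, stored as 65564 bytes (12-byte nonce, ciphertext, 16-byte MAC). Offsets are relative to the start of the payload, after the header. It is an illustration of the arithmetic, not EGA-QuickView's code, and the function name is hypothetical.

```shell
C4GH_SEGMENT=65536          # plaintext bytes per segment
C4GH_CIPHER_SEGMENT=65564   # 12-byte nonce + 65536 ciphertext + 16-byte MAC

# Hypothetical helper: for a plaintext range, print
#   <first segment> <last segment> <encrypted start> <encrypted end>
# where the encrypted offsets are relative to the payload start and the end
# is exclusive (an upper bound if the last segment of the file is shorter).
# Usage: c4gh_segment_range <plaintext start> <length>
c4gh_segment_range() {
  start=$1
  length=$2
  first=$(( start / C4GH_SEGMENT ))
  last=$(( (start + length - 1) / C4GH_SEGMENT ))
  enc_start=$(( first * C4GH_CIPHER_SEGMENT ))
  enc_end=$(( (last + 1) * C4GH_CIPHER_SEGMENT ))
  echo "$first $last $enc_start $enc_end"
}
```

For example, c4gh_segment_range 70000 10 prints "1 1 65564 131128": reading 10 bytes at plaintext offset 70000 only requires fetching and decrypting the second segment.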

In particular, EGA-QuickView creates a mount point on the data accessor’s file system backed by a local FUSE that works much like the one used at the LEGA distribution. Instead of reading bytes from the vault encrypted with the LEGA public key and re-encrypting them with the accessor’s key, it requests the re-encrypted header from the server, decrypts the header using a private key provided by the user, and then uses the decrypted secret key to return unencrypted file bytes.

Can I Create a Node?

Yes, it is possible to create a FEGA Portugal node in your own institution. This node will store the data locally and store the non-sensitive metadata in central EGA for public display.

Advantages

This option suits institutions with very strict data policies, where the data produced inside the institution cannot leave the institution itself, even if encrypted.

Responsibilities

You have to provide and maintain the hardware infrastructure to store the data and install the software. BioData.pt can provide assistance in the installation of the software for its associates.

Hardware Requirements

A Virtual Machine with 4 CPUs, 8GB of RAM and 40GB of disk space should be enough for the installation. In addition, it is necessary to set up at least two volumes for file storage (one for the vault and another for backup), whose size depends on the volume of data your organization produces.

Costs

No cost for BioData.pt associates. Budget on demand for other institutions.