6.4.2. Sessions

Sessions have been added to vantage6 to provide a way to prepare a dataset that can be re-used in many computation tasks. They ensure that vantage6 can interact with the data in a flexible and reproducible way. Sessions are especially useful when the data is large and querying it from the nodes is slow. They also allow for flexible preprocessing of the data and for storing the final result reliably so that it can be easily reused.

A session is started by extracting data from one or more of the node’s databases to a dataframe. Subsequently, as many preprocessing steps as necessary can be performed on the dataframe. When the data is preprocessed and ready to be used, as many computations can be done on the data as you wish. It is also possible to preprocess the data further after computation tasks have been executed.

Data that is extracted from a node database is added to the session as a dataframe. Each session contains one or more dataframes. These dataframes can be used for computation tasks. Some computation tasks may use all dataframes in the session, while others may just use one.
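The lifecycle described above can be illustrated with a toy, in-memory model. This is only a sketch to make the order of operations concrete: ToySession, ToyDataFrame and their methods are hypothetical names invented for this example and are not part of vantage6.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable


@dataclass
class ToyDataFrame:
    rows: list                # each row is a plain dict in this toy model
    last_modified_by: str     # name of the last step that changed the rows


@dataclass
class ToySession:
    dataframes: dict = field(default_factory=dict)

    def extract(self, name: str, source: list) -> None:
        # Copy the rows once, so later steps never touch the source again.
        self.dataframes[name] = ToyDataFrame(
            [dict(row) for row in source], "data-extraction"
        )

    def preprocess(self, name: str, step: str, fn: Callable[[list], list]) -> None:
        # Replace the rows with the modified version and record who did it.
        df = self.dataframes[name]
        df.rows = fn(df.rows)
        df.last_modified_by = step

    def compute(self, name: str, fn: Callable[[list], float]) -> float:
        # Compute steps only read the current version of the dataframe.
        return fn(self.dataframes[name].rows)


def mean_age(rows: list) -> float:
    return mean(row["age"] for row in rows)


source = [{"age": 34}, {"age": 71}, {"age": 52}]

session = ToySession()
session.extract("patients", source)
mean_all = session.compute("patients", mean_age)

# Preprocess further after a computation has already been done.
session.preprocess(
    "patients", "keep_over_50", lambda rows: [r for r in rows if r["age"] > 50]
)
mean_over_50 = session.compute("patients", mean_age)
```

The second computation sees only the filtered rows, while the source list itself is left unchanged.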

Sessions are related to the other entities in the following way:

!theme superhero-outline

rectangle Session
rectangle DataFrame
rectangle Column
rectangle Node
rectangle Collaboration
rectangle Study
rectangle User
rectangle Task

User "1" - "n" Session: \t
Session "n" -- "1" Collaboration
Collaboration "1" - "n" Study: \t
Study "0" - "n" Session
Session "1" - "n" DataFrame: \t
DataFrame "1" - "n" Column: \t
Column "n" - "1" Node: \t
Task "n" -- "1" Session
Task "0" - "1" DataFrame

Dataframes are the representation of the data that will eventually be used in the most important tasks: computation tasks that produce the research results. Dataframes provide the following features:

  • The data is loaded in the dataframes once and can then be used multiple times. This saves the time of having to load the data from the source every time a new task is executed. This is especially useful when the database is large and retrieving the data is slow.

  • Dataframes can be modified using preprocessing tasks. These can, for example, add or remove columns, or filter rows. The latest version of the dataframe is used in the computation tasks. The dataframe keeps track of the last task that modified it.

  • Dataframes can have different permission scopes. You can create dataframes that are for your own use, but you can also share them with other users in your organization or with the entire collaboration. Users with the permission to share sessions with the organization or collaboration may also be able to see, modify and delete your own dataframes. In other words, scoping a dataframe to yourself does not make it fully private: it remains accessible to users with those higher permissions. Dataframes with an organization or collaboration scope are shared with all users in the organization or collaboration.

  • Dataframes provide a standardized way to store data. This makes it easier to write algorithms that can be used across different collaborations.

  • Data extraction, preprocessing and computation on the data are separated processes. This makes it easier to share algorithms with other projects. It even allows for the different steps to be written in different programming languages. Finally, it is also more secure, as the compute tasks no longer have access to the source data.

  • Dataframes have standardized metadata that they can share. This allows the infrastructure to provide the researchers with information about the data, such as which columns are available, and what the data types of those columns are.

Algorithm Step Types

Every algorithm function that is executed in a vantage6 network performs one of the following actions:

  • data-extraction: function to retrieve the data from the source, and store it in a dataframe.

  • preprocessing: function to modify the dataframe.

  • compute: function to use the dataframe to answer a research question. This can be a machine learning model, a statistical analysis, or any other type of computation.

!theme superhero-outline
skinparam linetype ortho
left to right direction

package "Modify Session" {
    package "Data extraction" {
        rectangle Extract as A
    }
    package "Pre-processing" {
        rectangle "Step 1" as C
        rectangle "Step n" as D
    }
}

package "Compute" {
    rectangle 1 as E
    rectangle 2 as F
    rectangle m as M
}

rectangle HQ as HQ
rectangle Researcher as user

A --> C
C --> D
C --> E
D --> F
D --> M
E --> HQ
F --> HQ
M --> HQ
HQ --> user

Fig. 6.1 An illustration of how the actions are related to each other. In this example, there are n preprocessing steps and m compute steps. First, the data is extracted. Then, the data is pre-processed. Finally, the data is used to compute the research results. Note that in this schema, the first compute task is done after the first preprocessing step - not after n steps. At any point in the preprocessing steps, it is possible to send a task to the current dataframe. It is thus also possible to execute a compute task directly after data extraction. Finally, the results of each compute task are sent to vantage6 HQ, where the researcher can access them.

These actions are managed by the infrastructure. For example, the infrastructure ensures that data extraction functions are the only functions that are allowed to access the source data.
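The distinction between the three actions can be written down as a small sketch. The enum values are taken from the list above; the class and helper functions are hypothetical and only restate the rules from this section. They do not show how the infrastructure actually enforces them.

```python
from enum import Enum


class StepType(Enum):
    DATA_EXTRACTION = "data-extraction"
    PREPROCESSING = "preprocessing"
    COMPUTE = "compute"


def may_access_source(step: StepType) -> bool:
    # Only data extraction is allowed to read the node's source database.
    return step is StepType.DATA_EXTRACTION


def modifies_dataframe(step: StepType) -> bool:
    # Extraction creates the dataframe and preprocessing changes it;
    # compute steps only read it.
    return step in (StepType.DATA_EXTRACTION, StepType.PREPROCESSING)
```

For example, may_access_source(StepType("compute")) evaluates to False, which mirrors the statement that compute tasks no longer have access to the source data.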

Note

The user interface does not require you to know how these actions are triggered, but the API endpoints used are as follows: compute tasks can be triggered using the /task endpoint, and data extraction and preprocessing actions are triggered with the /session endpoints. In the Python client, the three actions are represented by client.task.create(), client.dataframe.create() and client.dataframe.preprocess(), respectively.

Dependent tasks

As described above, there are tasks that modify the dataframe (data extraction and preprocessing) and tasks that compute on the dataframe (compute). In order to ensure that the dataframe is not modified while another task is using it to compute analysis results, the infrastructure ensures that such tasks are executed in the proper order. This is done by making the tasks dependent on each other.

There are three scenarios:

  • A data-extraction task is not dependent on any other task.

  • A preprocessing task is always dependent on the previous preprocessing task or, in case there is none, the data-extraction task. It is also dependent on all compute tasks that were requested before the new preprocessing task.

  • A compute task is always dependent on the last preprocessing task or, in case there is none, the data-extraction task.

!theme superhero-outline
skinparam linetype polyline
left to right direction

rectangle "(1) Data Extraction" as data_extraction
rectangle "(2) Compute 1" as compute_1
rectangle "(3) Pre-processing 1" as preprocessing_1
rectangle "(4) Compute 2" as compute_2
rectangle "(5) Compute 3" as compute_3
rectangle "(6) Pre-processing 2" as preprocessing_2
rectangle "(7) Pre-processing 3" as preprocessing_3
rectangle "(8) Compute 4" as compute_4

data_extraction --> preprocessing_1
data_extraction --> compute_1
compute_1 --> preprocessing_2

preprocessing_1 --> compute_2
preprocessing_1 --> compute_3

compute_3 --> preprocessing_3

preprocessing_1 --> preprocessing_2
preprocessing_2 --> preprocessing_3
preprocessing_3 --> compute_4

Fig. 6.2 Example task dependency tree in a single dataframe. Note that (7) is not dependent on (4) because, in this case, (7) was requested after (4) had completed.
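The three dependency rules can be sketched as a small queue for a single dataframe. This is a simplified illustration, not the vantage6 implementation: the Task and DataframeQueue names are hypothetical, and the sketch assumes, following the figure caption, that compute tasks that have already finished are not added as dependencies. The task numbering is the sketch's own and does not reproduce Fig. 6.2.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    id: int
    action: str                      # "data-extraction", "preprocessing" or "compute"
    finished: bool = False
    depends_on: list = field(default_factory=list)


class DataframeQueue:
    """Tracks the tasks requested on one dataframe, in request order."""

    ACTIONS = ("data-extraction", "preprocessing", "compute")

    def __init__(self) -> None:
        self.tasks: list = []

    def request(self, action: str) -> Task:
        if action not in self.ACTIONS:
            raise ValueError(f"unknown action: {action}")

        deps = []
        if action != "data-extraction":
            # Rules 2 and 3: depend on the last task that modified the dataframe.
            modifiers = [t for t in self.tasks if t.action != "compute"]
            if not modifiers:
                raise ValueError("data must be extracted first")
            deps.append(modifiers[-1].id)
        if action == "preprocessing":
            # Rule 2: also wait for earlier compute tasks that are still pending.
            deps += [
                t.id for t in self.tasks
                if t.action == "compute" and not t.finished
            ]

        task = Task(len(self.tasks) + 1, action, depends_on=sorted(deps))
        self.tasks.append(task)
        return task


queue = DataframeQueue()
extraction = queue.request("data-extraction")
compute_1 = queue.request("compute")
preprocessing_1 = queue.request("preprocessing")
compute_1.finished = True            # compute 1 completes before the next request
compute_2 = queue.request("compute")
preprocessing_2 = queue.request("preprocessing")
```

In this run, the second preprocessing task waits for the first preprocessing task and for the still-pending second compute task, but not for the first compute task, which had already finished.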

Session storage

When a new session is created, each node creates a new session folder. In this folder, the dataframes and the session log are stored. This log keeps track of which actions were performed on the dataframes. You can inspect the log on the node by using the command parquet-tools show state.parquet.

The session folder can also be used to share data between different tasks that are not related to sessions, for example, when you need to store a secret key that is used in a successor computation task. In an algorithm, you can locate the session folder through the environment variable SESSION_FOLDER.
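A minimal sketch of the secret-key example, assuming that SESSION_FOLDER contains the path to the session folder and that the folder is writable by the algorithm. The helper functions and the file name secret.key are hypothetical and chosen for illustration only.

```python
import os
from pathlib import Path


def session_path(filename: str) -> Path:
    # Assumption: SESSION_FOLDER holds the path of the session folder.
    return Path(os.environ["SESSION_FOLDER"]) / filename


def store_secret(key: bytes, filename: str = "secret.key") -> Path:
    # Called by the first task: persist the key for later tasks.
    path = session_path(filename)
    path.write_bytes(key)
    return path


def load_secret(filename: str = "secret.key") -> bytes:
    # Called by a successor task in the same session: read the key back.
    return session_path(filename).read_bytes()
```

A first task would call store_secret(key), and a successor task running in the same session would call load_secret() to retrieve the same bytes.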