.. _algo-sessions: Sessions -------- Sessions have been added to vantage6 to provide a way to prepare a dataset that can be re-used in many computation tasks. They are important to ensure that vantage6 can interact with the data in a flexible and reproducible way. Sessions are especially useful when the data is large and querying it from the nodes is slow. Also, it allows for flexible preprocessing of the data and storing the final result reliably so that it can be easily reused. A session is started by extracting data from one or more of the node's databases to a dataframe. Subsequently, as many preprocessing steps as necessary can be performed on the dataframe. When the data is preprocessed and ready to be used, as many computations can be done on the data as you wish. It is also possible to preprocess the data further after computation tasks have been executed. Data that is extracted from a node database is added to the session as a dataframe. Each session contains one or more dataframes. These dataframes can be used for computation tasks. Some computation tasks may use all dataframes in the session, while others may just use one. Sessions are related to the other entities in the following way: .. uml:: !theme superhero-outline rectangle Session rectangle DataFrame rectangle Column rectangle Node rectangle Collaboration rectangle Study rectangle User rectangle Task User "1" - "n" Session: \t Session "n" -- "1" Collaboration Collaboration "1" - "n" Study: \t Study "0" - "n" Session Session "1" - "n" DataFrame: \t DataFrame "1" - "n" Column: \t Column "n" - "1" Node: \t Task "n" -- "1" Session Task "0" - "1" DataFrame Dataframes are the representation of the data that will eventually be used in the most important tasks: computation tasks that produce the research results. Dataframes provide the following features: - The data is loaded in the dataframes once and can then be used multiple times. This saves the time of having to load the data from the source every time a new task is executed. This is especially useful when the database is large and retrieving the data is slow. - Dataframes can be modified using preprocessing tasks. These can, for example, add or remove columns, or filter rows. The latest version of the dataframe is used in the computation tasks. The dataframe keeps track of the last task that modified it. - Dataframes can have different permission scopes. You can create dataframes that are for your own use, but you can also share them with other users in your organization or with the entire collaboration. Users with the permission to share sessions with the organization or collaboration may also be able see, modify and delete your own dataframes. In other words, scoping a dataframe to yourself is not a way to keep the dataframe private, but it is only shared with users with higher permissions. Dataframes with an organization or collaboration scope are shared with all users in the organization or collaboration. - Dataframes provide a standardized way to store data. This makes it easier to write algorithms that can be used across different collaborations. - Data extraction, preprocessing and computation on the data are separated processes. This makes it easier to share algorithms with other projects. It even allows for the different steps to be written in different programming languages. Finally, it is also more secure, as the compute tasks no longer have access to the source data. - Dataframes have standardized metadata that they can share. This allows the infrastructure to provide the researchers with information about the data, such as which columns are available, and what the data types of those columns are. Algorithm Step Types ^^^^^^^^^^^^^^^^^^^^ Every algorithm function that is being executed in a vantage6 network is one of the following actions: - ``data-extraction``: function to retrieve the data from the source, and store it in a dataframe. - ``preprocessing``: function to modify the dataframe. - ``compute``: function to use the dataframe to answer a research question. This can be a machine learning model, a statistical analysis, or any other type of computation. .. uml:: :caption: An illustration of how the actions are related to each other. In this example, there are ``n`` preprocessing steps and ``m`` compute steps. First, the data is extracted. Then, the data is pre-processed. Finally, the data is used to compute the research results. Note that in this schema, the first compute task is done after the first preprocessing step - not after ``n`` steps. At any point in the preprocessing steps, it is possible to send a task to the current dataframe. It is thus also possible to execute a compute task directly after data extraction. Finally, the results of each compute task are sent to vantage6 HQ, where the researcher can access them. !theme superhero-outline skinparam linetype ortho left to right direction package "Modify Session" { package "Data extraction" { rectangle Extract as A } package "Pre-processing" { rectangle "Step 1" as C rectangle "Step n" as D } } package "Compute" { rectangle 1 as E rectangle 2 as F rectangle m as M } rectangle HQ as HQ rectangle Researcher as user A --> C C --> D C --> E D --> F D --> M E --> HQ F --> HQ M --> HQ HQ --> user These actions are managed by the infrastructure. For example, the infrastructure ensures that data extraction functions are the only functions that are allowed to access the source data. .. note:: The user interface does not require you to know how these actions are triggered, but the API endpoints used are as follows: ``compute`` tasks can be triggered using the ``/task`` endpoint, and ``data extraction`` and ``preprocessing`` actions are triggered with the ``/session`` endpoints. In the Python client, the three actions are represented by ``client.task.create()``, ``client.dataframe.create()`` and ``client.dataframe.preprocess()``, respectively. Dependent tasks ^^^^^^^^^^^^^^^ As described above, there are tasks that modify the dataframe (``data extraction`` and ``preprocessing``) and tasks that compute on the dataframe (``compute``). In order to ensure that the dataframe is not modified while another task is using it to compute analysis results, the infrastructure ensures that such tasks are executed in the proper order. This is done by making the tasks dependent on each other. There are three senarions: - A ``data-extraction`` task is not dependent on any other task. - A ``preprocessing`` task is *always* dependent on the previous ``preprocessing`` or, in case there is none, the ``data-extraction`` task. But it is also dependent on all ``compute`` tasks that have been requested prior to the new ``preprocessing`` task. - A ``compute`` task is *always* dependent on the last ``preprocessing`` task or, in case there is none, the ``data-extraction`` task. .. uml:: :caption: Example dependency tasks tree in a single dataframe. Note that (7) is not dependent on (4) as in this case (7) was requested after (4) was completed. !theme superhero-outline skinparam linetype polyline left to right direction rectangle "(1) Data Extraction" as data_extraction rectangle "(2) Compute 1" as compute_1 rectangle "(3) Pre-processing 1" as preprocessing_1 rectangle "(4) Compute 2" as compute_2 rectangle "(5) Compute 3" as compute_3 rectangle "(6) Pre-processing 2" as preprocessing_2 rectangle "(7) Pre-processing 3" as preprocessing_3 rectangle "(8) Compute 4" as compute_4 data_extraction --> preprocessing_1 data_extraction --> compute_1 compute_1 --> preprocessing_2 preprocessing_1 --> compute_2 preprocessing_1 --> compute_3 compute_3 --> preprocessing_3 preprocessing_1 --> preprocessing_2 preprocessing_2 --> preprocessing_3 preprocessing_3 --> compute_4 Session storage ^^^^^^^^^^^^^^^ When a new session is created, each node creates a new session folder. In this folder, the dataframes and session log are stored. This log keeps track on which action was performed on the dataframe. You can inspect the log on the node by using the command ``parquet-tools show state.parquet``. The session folder can also be used to share data between different tasks that are not related to sessions, for example, when you need to store a secret key that is used in a successor computation task. In the algorithms you can use the session folder with the environment variable ``SESSION_FOLDER``.