4.3. Algorithm code structure

Note

This information is specific to Python algorithms.

Here we provide some more information on algorithm code is organized. Most of these structures are generated automatically when you create a personalized algorithm starting point. We detail them here so that you understand why the algorithm code is structured as it is, and so that you know how to modify it if necessary.

4.3.1. Defining functions

The functions that will be available to the user have to be defined in the __init__.py file at the base of your algorithm module. Other than that, you have complete freedom in which functions you implement.

Vantage6 algorithms commonly have an orchestator or aggregator part and a remote part. The orchestrator part is responsible for combining the partial results of the remote parts. The remote part is usually executed at each of the nodes included in the analysis. While this structure is common for vantage6 algorithms, it is not required.

You may also define algorithm functions to extract data from the node data sources, or to preprocess data that has been extracted. The most common order of execution is to:

  1. Create a session

  2. Extract data from the node data sources into one (or more) dataframes

  3. Preprocess the dataframes

  4. Run the analyses

By doing this, you ensure that your analyses are all run on the same data, and it saves time in extracting data from the node data sources once instead of once per analysis. More information about sessions can be found in the Sessions section.

If you do follow this structure however, we recommend the following file structure:

my_algorithm/
├── __init__.py
├── central.py
└── partial.py
└── preprocessing.py
└── extraction.py

where __init__.py contains something like the following:

from .central import my_central_function
from .partial import my_partial_function1, my_partial_function2
from .preprocessing import my_preprocessing_function
from .extraction import my_data_extraction_function

The other files obviously contain the implementation of those functions. You may create as many files and functions as you wish, but note that only functions that are imported in __init__.py will be available to the vantage6 user.

4.3.2. Implementing the algorithm functions

Let’s say you are implementing a function called my_function:

def my_function(column_name: str):
    pass

You have complete freedom as to what arguments you define in your function - column_name is just an example. These arguments have to be provided by the user when the algorithm is called. The only restriction to the arguments is that they must be JSON serializable. This is because vantage6 uses JSON to pass arguments to the algorithm. How a user can provide the arguments is explained here for the Python client. In the user interface, a user can provide the arguments by filling in a form.

In many functions you implement, you will want to use the data that is available at the node. This data can be provided to your algorithm function in the following way:

import pandas as pd
from vantage6.algorithm.decorator import dataframe
from vantage6.algorithm.decorator.action import federated

@federated
@dataframe(2)
def my_function(df1: pd.DataFrame, df2: pd.DataFrame, column_name: str):
    pass

The @dataframe(2) decorator indicates that the first two arguments of the function are dataframes that are provided by the vantage6 infrastructure. In this case, the user will have to specify two dataframes when calling the algorithm.

Another useful decorator is the @algorithm_client decorator:

import pandas as pd
from vantage6.client.algorithm_client import AlgorithmClient
from vantage6.algorithm.decorator.algorithm_client import algorithm_client
from vantage6.algorithm.decorator.action import central

@central
@algorithm_client
def my_function(client: AlgorithmClient, column_name: str):
    pass

This decorator provides the algorithm with a client that can be used to interact with the vantage6 HQ. For instance, you can use this client in the central part of an algorithm to create a subtasks for each node with client.task.create(). A full list of all commands that are available can be found in the algorithm client documentation.

Warning

The decorators @dataframe, @algorithm_client and @database_connection each have reserved keywords:

  • mock_data for the @dataframe decorator

  • mock_client for the @algorithm_client decorator

  • mock_uri and mock_type for the @database_connection decorator

These keywords should not be used as argument names in your algorithm functions. The reserved keywords are used by the MockNetwork to mock the data and the algorithm client. This is useful for testing your algorithm locally.

4.3.3. Advanced decorators

A useful decorator for computation tasks is the @metadata decorator:

from vantage6.algorithm.decorator.metadata import (metadata, RunMetaData)


@metadata
def my_function(metadata: RunMetaData, <other_arguments>):
    # The metadata contains a dataclass with the following attributes:
    # task_id, node_id, collaboration_id, organization_id, temporary_directory,
    # output_file, input_file, token, action.
    #
    # They can be easily accessed using the dot notation. For example:
    return metadata.task_id

For some data sources it’s not trivial to construct a dataframe from the data. One of these data sources is the OHDSI OMOP CDM database. For this data source, the @omop_data_extraction is available:

from rpy2.robjects import RS4
from vantage6.algorithm.decorators import omop_data_extraction
from vantage6.algorithm.decorator.ohdsi import OHDSIMetaData

@omop_data_extraction(include_metadata=True)
def my_function(connection: RS4, metadata: OHDSIMetaData,
                <other_arguments>):
    pass

This decorator provides the algorithm with a database connection that can be used to interact with the database. For instance, you can use this connection to execute functions from python-ohdsi package. The include_metadata argument indicates whether the metadata of the database should also be provided.

Note

The returned connection object (RS4) is an R object, mapped to Python using the rpy2, package. This object can be passed directly on to the functions from python-ohdsi <https://python-ohdsi.readthedocs.io/>.

4.3.4. Algorithm wrappers

The vantage6 wrappers are used to simplify the interaction between the algorithm and the node. The wrappers are responsible for translating user input to call the right algorithm method with the right arguments. They also take care of writing the results back to the data source.

As algorithm developer, you do not have to worry about the wrappers. The main point you have to make sure is that the following line is present at the end of your Dockerfile:

CMD python -c "from vantage6.algorithm.tools.wrap import wrap_algorithm; wrap_algorithm()"

The wrap_algorithm function will wrap your algorithm to ensure that the vantage6 algorithm tools are available to it. Note that the wrap_algorithm function will also read the PKG_NAME environment variable from the Dockerfile so make sure that this variable is set correctly.

4.3.5. Dockerfile structure

Once the algorithm code is written, the algorithm needs to be packaged and made available for retrieval by the nodes. The algorithm is packaged in a Docker image. A Docker image is created from a Dockerfile, which acts as a blue-print.

The Dockerfile is already present in the boilerplate code. Usually, the only line that you need to update is the PKG_NAME variable to the name of your algorithm package.