4.3. Algorithm code structure
Note
This information is specific to Python algorithms.
Here we provide some more information on how algorithm code is organized. Most of these structures are generated automatically when you create a personalized algorithm starting point. We detail them here so that you understand why the algorithm code is structured as it is, and so that you know how to modify it if necessary.
4.3.1. Defining functions
The functions that will be available to the user have to be defined in the
__init__.py file at the base of your algorithm module. Other than that,
you have complete freedom in which functions you implement.
Vantage6 algorithms commonly have an orchestrator (or aggregator) part and a remote part. The remote part is usually executed at each of the nodes included in the analysis, and the orchestrator part is responsible for combining the partial results that the remote parts return. While this structure is common for vantage6 algorithms, it is not required.
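The split between a remote part and an orchestrating part can be illustrated without any vantage6 API. The sketch below is plain Python, not vantage6 code: partial_mean and central_mean are hypothetical names, and the list node_data stands in for data that would live at separate nodes. Each partial call returns only a sum and a count, and the central part combines them into a global mean.

```python
def partial_mean(values: list[float]) -> dict:
    """Remote part: runs on one node's local data and returns only aggregates."""
    return {"sum": sum(values), "count": len(values)}


def central_mean(partial_results: list[dict]) -> float:
    """Orchestrator part: combines the partial results into a global mean."""
    total = sum(r["sum"] for r in partial_results)
    count = sum(r["count"] for r in partial_results)
    if count == 0:
        raise ValueError("no data points were reported by any node")
    return total / count


# Hypothetical data held by three nodes (in reality it never leaves the node).
node_data = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
partials = [partial_mean(values) for values in node_data]
global_mean = central_mean(partials)
```

In a real algorithm the central part would not loop over the data itself. It would ask each node to run the partial function and then collect the results.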
You may also define algorithm functions to extract data from the node data sources, or to preprocess data that has been extracted. The most common order of execution is to:
Create a session
Extract data from the node data sources into one (or more) dataframes
Preprocess the dataframes
Run the analyses
By doing this, you ensure that your analyses are all run on the same data, and you save time because data is extracted from the node data sources once rather than once per analysis. More information about sessions can be found in the Sessions section.
If you follow this structure, we recommend the following file layout:
my_algorithm/
├── __init__.py
├── central.py
├── partial.py
├── preprocessing.py
└── extraction.py
where __init__.py contains something like the following:
from .central import my_central_function
from .partial import my_partial_function1, my_partial_function2
from .preprocessing import my_preprocessing_function
from .extraction import my_data_extraction_function
The other files contain the implementation of those functions. You may create
as many files and functions as you wish, but note that only functions that are
imported in __init__.py will be available to the vantage6 user.
4.3.2. Implementing the algorithm functions
Let’s say you are implementing a function called my_function:
def my_function(column_name: str):
    pass
You have complete freedom as to what arguments you define in your function -
column_name is just an example. These arguments have to be provided by the user when
the algorithm is called. The only restriction to the arguments is that they must be
JSON serializable. This is because vantage6 uses JSON to pass arguments to the
algorithm. How a user can provide the arguments is explained
in the Python client documentation. In the user interface, a user
can provide the arguments by filling in a form.
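Whether a set of arguments meets the JSON requirement can be checked up front with the standard library. The helper below is a minimal sketch: is_json_serializable is a hypothetical name and not part of vantage6, and the argument dictionaries are made-up examples.

```python
import json
from typing import Any


def is_json_serializable(value: Any) -> bool:
    """Return True if `value` can be encoded with the standard json module."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        return False
    return True


# Plain strings, numbers, lists and dicts are fine.
ok_arguments = {"column_name": "age", "bins": [0, 18, 65]}
# A set has no JSON representation, so it should be converted to a list first.
bad_arguments = {"columns": {"age", "weight"}}

ok = is_json_serializable(ok_arguments)
bad = is_json_serializable(bad_arguments)
```

Types such as sets, datetime objects or dataframes would need to be converted to lists, strings or dictionaries before they are passed as arguments.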
In many functions you implement, you will want to use the data that is available at the node. This data can be provided to your algorithm function in the following way:
import pandas as pd
from vantage6.algorithm.decorator import dataframe
from vantage6.algorithm.decorator.action import federated
@federated
@dataframe(2)
def my_function(df1: pd.DataFrame, df2: pd.DataFrame, column_name: str):
    pass
The @dataframe(2) decorator indicates that the first two arguments of the
function are dataframes that are provided by the vantage6 infrastructure.
In this case, the user will have to specify two dataframes when calling the
algorithm.
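To make the idea of injected arguments concrete, here is a simplified stand-in written in plain Python. It is not the vantage6 implementation: inject_tables, load_table and the in-memory FAKE_TABLES are hypothetical, and lists of dictionaries are used instead of pandas dataframes so that the sketch has no dependencies. It only shows how a decorator can place a fixed number of data arguments in front of the ones supplied by the user.

```python
import functools

# Hypothetical in-memory "node data"; a real node would load this itself.
FAKE_TABLES = [
    [{"age": 30}, {"age": 50}],
    [{"age": 40}],
]


def load_table(index: int) -> list[dict]:
    """Hypothetical loader standing in for the infrastructure's data access."""
    return FAKE_TABLES[index]


def inject_tables(number: int):
    """Decorator factory: pass `number` tables as the first positional arguments."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            tables = [load_table(i) for i in range(number)]
            return func(*tables, *args, **kwargs)
        return wrapper
    return decorator


@inject_tables(2)
def count_rows(table1: list[dict], table2: list[dict], column_name: str) -> int:
    """Count rows across both tables that contain `column_name`."""
    return sum(column_name in row for row in table1 + table2)


# The caller supplies only the user-facing argument.
n_rows = count_rows(column_name="age")
```

The point of the pattern is that the function signature lists the data arguments first, while the caller never passes them directly.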
Another useful decorator is the @algorithm_client decorator:
import pandas as pd
from vantage6.client.algorithm_client import AlgorithmClient
from vantage6.algorithm.decorator.algorithm_client import algorithm_client
from vantage6.algorithm.decorator.action import central
@central
@algorithm_client
def my_function(client: AlgorithmClient, column_name: str):
    pass
This decorator provides the algorithm with a client that can be used to interact
with the vantage6 HQ. For instance, you can use this client in
the central part of an algorithm to create a subtask for each node with
client.task.create(). A full list of all commands that are available
can be found in the algorithm client documentation.
Warning
The decorators @dataframe, @algorithm_client and @database_connection
each have reserved keywords:
mock_data for the @dataframe decorator
mock_client for the @algorithm_client decorator
mock_uri and mock_type for the @database_connection decorator
These keywords should not be used as argument names in your algorithm functions. The reserved keywords are used by the MockNetwork to mock the data and the algorithm client. This is useful for testing your algorithm locally.
4.3.3. Advanced decorators
A useful decorator for computation tasks is the @metadata decorator:
from vantage6.algorithm.decorator.metadata import (metadata, RunMetaData)
@metadata
def my_function(metadata: RunMetaData, <other_arguments>):
    # The metadata contains a dataclass with the following attributes:
    # task_id, node_id, collaboration_id, organization_id, temporary_directory,
    # output_file, input_file, token, action.
    #
    # They can be easily accessed using the dot notation. For example:
    return metadata.task_id
For some data sources it’s not trivial to construct a dataframe from the data.
One of these data sources is the OHDSI OMOP CDM database. For this data source,
the @omop_data_extraction decorator is available:
from rpy2.robjects import RS4
from vantage6.algorithm.decorator.ohdsi import omop_data_extraction
from vantage6.algorithm.decorator.ohdsi import OHDSIMetaData
@omop_data_extraction(include_metadata=True)
def my_function(connection: RS4, metadata: OHDSIMetaData,
                <other_arguments>):
    pass
This decorator provides the algorithm with a database connection that can be
used to interact with the database. For instance, you can use this connection
to execute functions from the
python-ohdsi package. The
include_metadata argument indicates whether the metadata of the database
should also be provided.
Note
The returned connection object (RS4) is an R object, mapped
to Python using the rpy2 package. This
object can be passed directly on to the functions from
python-ohdsi (https://python-ohdsi.readthedocs.io/).
4.3.4. Algorithm wrappers
The vantage6 wrappers are used to simplify the interaction between the algorithm and the node. The wrappers are responsible for translating user input to call the right algorithm method with the right arguments. They also take care of writing the results back to the data source.
As an algorithm developer, you do not have to worry about the wrappers. You
only need to make sure that the following line is present at the end of
your Dockerfile:
CMD python -c "from vantage6.algorithm.tools.wrap import wrap_algorithm; wrap_algorithm()"
The wrap_algorithm function will wrap your algorithm to ensure that the
vantage6 algorithm tools are available to it. Note that the wrap_algorithm
function will also read the PKG_NAME environment variable from the
Dockerfile so make sure that this variable is set correctly.
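As a rough illustration of why PKG_NAME matters, the sketch below shows one way a wrapper could locate an algorithm function from that variable. This is an assumption about the general mechanism, not the actual wrap_algorithm code: resolve_function is a hypothetical helper, and the standard library package json stands in for an algorithm package.

```python
import importlib
import os
from typing import Callable


def resolve_function(method_name: str) -> Callable:
    """Import the package named in PKG_NAME and return one of its functions."""
    package_name = os.environ.get("PKG_NAME")
    if not package_name:
        raise RuntimeError("PKG_NAME is not set")
    package = importlib.import_module(package_name)
    # Only names exposed at the top level of the package can be found here.
    func = getattr(package, method_name, None)
    if not callable(func):
        raise AttributeError(f"{package_name} has no function {method_name!r}")
    return func


# Stand-in: pretend the standard library package `json` is the algorithm package.
os.environ["PKG_NAME"] = "json"
dumps = resolve_function("dumps")
encoded = dumps({"column_name": "age"})
```

Under this assumption, a wrong PKG_NAME would fail at import time, and a function that is not imported in __init__.py would not be found by name.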
4.3.5. Dockerfile structure
Once the algorithm code is written, the algorithm needs to be packaged and made available for retrieval by the nodes. The algorithm is packaged in a Docker image. A Docker image is built from a Dockerfile, which acts as a blueprint.
The Dockerfile is already present in the boilerplate code. Usually, the only
line that you need to update is the one that sets the PKG_NAME variable, which
should be the name of your algorithm package.