5.3. Algorithm code structure#


These guidelines are Python specific.

Here we provide some more information on how algorithm code is organized. Most of these structures are generated automatically when you create a personalized algorithm starting point. We detail them here so that you understand why the algorithm code is structured as it is, and so that you know how to modify it if necessary.

5.3.1. Defining functions#

The functions that will be available to the user have to be defined in the __init__.py file at the base of your algorithm module. Other than that, you have complete freedom in which functions you implement.

Vantage6 algorithms commonly have an orchestrator or aggregator part and a remote part. The orchestrator part is responsible for combining the partial results of the remote parts. The remote part is usually executed at each of the nodes included in the analysis. While this structure is common for vantage6 algorithms, it is not required.

If you do follow this structure however, we recommend the following file structure:

├── __init__.py
├── central.py
└── partial.py

where __init__.py contains the following:

from .central import my_central_function
from .partial import my_partial_function

and where central.py and partial.py contain the implementations of those functions.
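To make the split concrete, here is a minimal sketch of what the two files might contain. The function names and the aggregation logic (a federated mean) are illustrative only, and plain Python lists stand in for the dataframes that vantage6 would normally provide; in a real algorithm, the central function would use the AlgorithmClient to create subtasks and collect their results.

```python
# partial.py -- runs at each node; returns only aggregates, never raw data
def my_partial_function(rows: list, column_index: int) -> dict:
    """Compute the local sum and count for one column."""
    values = [row[column_index] for row in rows]
    return {"sum": sum(values), "count": len(values)}


# central.py -- combines the partial results into a global mean
def my_central_function(partial_results: list) -> float:
    """Aggregate the per-node sums and counts into an overall mean."""
    total = sum(p["sum"] for p in partial_results)
    count = sum(p["count"] for p in partial_results)
    return total / count
```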

5.3.2. Implementing the algorithm functions#

Let’s say you are implementing a function called my_function:

def my_function(column_name: str):

You have complete freedom as to what arguments you define in your function; column_name is just an example. Note that these arguments have to be provided by the user when the algorithm is called. How to do so is explained in the Python client documentation.

Often, you will want to use the data that is available at the node. This data can be provided to your algorithm function in the following way:

import pandas as pd
from vantage6.algorithm.tools.decorators import data

@data(2)
def my_function(df1: pd.DataFrame, df2: pd.DataFrame, column_name: str):

The @data(2) decorator indicates that the first two arguments of the function are dataframes that should be provided by the vantage6 infrastructure. In this case, the user would have to specify two databases when calling the algorithm. Note that depending on the type of the database used, the user may also have to specify additional parameters such as a SQL query or the name of a worksheet in an Excel file.

Note that it is also possible to just specify @data() without an argument - in that case, a single dataframe is added to the arguments.
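To illustrate the mechanism, the sketch below shows how a decorator in the style of @data can inject dataframes as the leading arguments of a function. This is not the vantage6 implementation: the registry `_NODE_DATA` and the plain-list "dataframes" are stand-ins for the databases that the node infrastructure would load.

```python
import functools

# Stand-in for the databases configured at a node; in vantage6 the node
# infrastructure loads these from the actual data sources.
_NODE_DATA = [
    [{"age": 30}, {"age": 40}],   # stand-in for the first database
    [{"age": 50}],                # stand-in for the second database
]


def data(number_of_databases: int = 1):
    """Sketch of a @data-style decorator: prepend the node's dataframes
    to the arguments supplied by the user."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            dataframes = _NODE_DATA[:number_of_databases]
            return func(*dataframes, *args, **kwargs)
        return wrapper
    return decorator


@data(2)
def my_function(df1, df2, column_name):
    # the user only supplies column_name; df1 and df2 are injected
    values = [row[column_name] for row in df1 + df2]
    return sum(values) / len(values)
```

Calling `my_function("age")` therefore only requires the user-facing argument; the two dataframes arrive via the decorator.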

For some data sources it is not trivial to construct a dataframe from the data. One of these data sources is the OHDSI OMOP CDM database. For this data source, the @database_connection decorator is available:

from rpy2.robjects import RS4
from vantage6.algorithm.tools.decorators import (
    database_connection, OHDSIMetaData
)

@database_connection(types=["OMOP"], include_metadata=True)
def my_function(connection: RS4, metadata: OHDSIMetaData, <other arguments>):

This decorator provides the algorithm with a database connection that can be used to interact with the database. For instance, you can use this connection to execute functions from the python-ohdsi package. The include_metadata argument indicates whether the metadata of the database should also be provided. The types argument specifies which database(s) to connect to; by listing multiple types you can connect to several databases at once:

from rpy2.robjects import RS4
from vantage6.algorithm.tools.decorators import database_connection

@database_connection(types=["OMOP", "OMOP"], include_metadata=False)
def my_function(connection1: RS4, connection2: RS4, <other arguments>):


The @database_connection decorator is currently only available for OMOP CDM databases. The connection object RS4 is an R object, mapped to Python using the rpy2 package. This object can be passed directly on to the functions from python-ohdsi (https://python-ohdsi.readthedocs.io/).

Another useful decorator is the @algorithm_client decorator:

import pandas as pd
from vantage6.client.algorithm_client import AlgorithmClient
from vantage6.algorithm.tools.decorators import algorithm_client, data

@data(1)
@algorithm_client
def my_function(client: AlgorithmClient, df1: pd.DataFrame, column_name: str):

This decorator provides the algorithm with a client that can be used to interact with the vantage6 central server. For instance, you can use this client in the central part of an algorithm to create a subtask for each node with client.task.create(). A full list of all commands that are available can be found in the algorithm client documentation.


The decorators @data and @algorithm_client each have one reserved keyword: mock_data for the @data decorator and mock_client for the @algorithm_client decorator. These keywords should not be used as argument names in your algorithm functions.

The reserved keywords are used by the MockAlgorithmClient to mock the data and the algorithm client. This is useful for testing your algorithm locally.
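The sketch below illustrates why these keywords are reserved: a @data-style decorator can intercept a `mock_data` keyword and inject it in place of the node's real databases. This is a simplified illustration of the mechanism, not the vantage6 implementation; `load_node_data` is a hypothetical stand-in for the node's data loading.

```python
import functools


def load_node_data(number_of_databases: int):
    """Hypothetical stand-in: in vantage6, the node infrastructure loads
    the real databases. Not available outside a node container."""
    raise NotImplementedError("only available inside a node container")


def data(number_of_databases: int = 1):
    """Sketch of a @data-style decorator with a reserved mock_data keyword."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, mock_data=None, **kwargs):
            if mock_data is not None:
                # testing path: use the mocked dataframes instead
                dataframes = mock_data[:number_of_databases]
            else:
                dataframes = load_node_data(number_of_databases)
            return func(*dataframes, *args, **kwargs)
        return wrapper
    return decorator


@data(1)
def count_rows(df):
    return len(df)
```

Because `mock_data` is consumed by the wrapper, it can never double as a regular argument of your function, which is why the keyword is reserved.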

5.3.3. Algorithm wrappers#

The vantage6 wrappers are used to simplify the interaction between the algorithm and the node. The wrappers are responsible for reading the input data from the data source and supplying it to the algorithm. They also take care of writing the results back to the data source.

As algorithm developer, you do not have to worry about the wrappers themselves. The main thing you have to ensure is that the following line is present at the end of your Dockerfile:

CMD python -c "from vantage6.algorithm.tools.wrap import wrap_algorithm; wrap_algorithm()"

The wrap_algorithm function will wrap your algorithm to ensure that the vantage6 algorithm tools are available to it. Note that wrap_algorithm also reads the PKG_NAME environment variable from the Dockerfile, so make sure that this variable is set correctly.

For R, the command is slightly different:

CMD Rscript -e "vtg::docker.wrapper('$PKG_NAME')"

Also, note that when using R, this only works for CSV files.

5.3.4. VPN#

Within vantage6, it is possible to communicate with algorithm instances running on different nodes via the VPN network feature. Each algorithm instance has its own IP address and port within the VPN network. In your algorithm code, you can use the AlgorithmClient to obtain the IP addresses and ports of other algorithm instances. For example:

from vantage6.client import AlgorithmClient

def my_function(client: AlgorithmClient, ...):
    # Get the IP addresses and ports of the child algorithm instances
    child_addresses = client.get_child_addresses()
    # returns something like:
    # [
    #     {
    #       'port': 1234,
    #       'ip': '<some ip address>',
    #       'label': 'some_label',
    #       'organization_id': 22,
    #       'task_id': 333,
    #       'parent_id': 332,
    #     }, ...
    # ]

    # Do something with the IP address and port

    # Do something with the IP address and port

The function get_child_addresses() gets the VPN addresses of all child tasks of the current task. Similarly, the function get_parent_address() is available to get the VPN address of the parent task. Finally, there is a client function get_addresses() that returns the VPN addresses of all algorithm instances that are part of the same task.
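Once you have an IP address and port, communication itself is up to your algorithm; ordinary TCP sockets work over the VPN network. The helpers below are an illustrative sketch (the function names are not part of vantage6), demonstrated on localhost; in a real algorithm the `ip` and `port` would come from one of the address dictionaries returned by the client.

```python
import socket


def serve_once(port: int) -> bytes:
    """Listen on the given port and return the first message received."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("127.0.0.1", port))
        srv.listen(1)
        conn, _ = srv.accept()
        with conn:
            return conn.recv(1024)


def send_to_instance(ip: str, port: int, payload: bytes) -> None:
    """Connect to another algorithm instance's address and send data."""
    with socket.create_connection((ip, port)) as conn:
        conn.sendall(payload)
```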

VPN communication is only possible if the docker container exposes ports to the VPN network. In the algorithm boilerplate, one port is exposed by default. If you need to expose more ports (e.g. for sending different information to different parts of your algorithm), you can do so by adding lines to the Dockerfile:

# port 8888 is used by the algorithm for communication purposes
EXPOSE 8888
LABEL p8888="some-label"

# port 8889 is used by the algorithm for data exchange
EXPOSE 8889
LABEL p8889="some-other-label"

The EXPOSE command exposes the port to the VPN network, and the LABEL command attaches a label to it. This label is returned by the client's get_addresses() functions. You may specify as many ports as you need. Note that the label name must consist of the prefix p followed by the port number; the vantage6 infrastructure relies on this naming convention.

5.3.5. Dockerfile structure#

Once the algorithm code is written, the algorithm needs to be packaged and made available for retrieval by the nodes. The algorithm is packaged in a Docker image. A Docker image is created from a Dockerfile, which acts as a blueprint.

The Dockerfile is already present in the boilerplate code. Usually, the only line that you need to update is the PKG_NAME variable to the name of your algorithm package.
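For orientation, a minimal Dockerfile might look like the sketch below. The base image and package name are illustrative; your boilerplate may differ, and only the PKG_NAME value normally needs to change.

```docker
# base image that ships the vantage6 algorithm tools (illustrative)
FROM harbor2.vantage6.ai/infrastructure/algorithm-base

# must match the name of your algorithm's Python package:
# wrap_algorithm() reads this variable at runtime
ENV PKG_NAME="v6-my-algorithm"

# install the algorithm package into the image
COPY . /app
RUN pip install /app

# hand control to the vantage6 wrapper when the container starts
CMD python -c "from vantage6.algorithm.tools.wrap import wrap_algorithm; wrap_algorithm()"
```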