4.2. Algorithm development step-by-step guide

This page offers a step-by-step guide to develop a vantage6 algorithm. We refer to the algorithm concepts section regularly. In that section, we explain the fundamentals of algorithm containers in more detail than in this guide.

Also, note that this guide is mainly aimed at developers who want to develop their algorithm in Python, although we will make an effort to indicate where this differs from algorithms written in other programming languages. Writing your algorithm in Python is recommended because it is currently the best supported language for vantage6.

4.2.1. Starting point

When starting to develop a new vantage6 algorithm in Python, the easiest way to start is:

v6 algorithm create

Running this command will prompt you to answer some questions, which will result in a personalized starting point or ‘boilerplate’ for your algorithm. After doing so, you will have a new folder with the name of your algorithm, boilerplate code and a checklist in the README.md file that you can follow to complete your algorithm.

4.2.2. Setting up your environment

It is good practice to set up a virtual environment for your algorithm package.

# This code is just a suggestion - there are many ways of doing this.

# go to the algorithm directory
cd /path/to/algorithm

# create a Python virtual environment. By default, uv places it in the
# .venv folder inside the current directory.
uv venv --python 3.13
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# install the algorithm dependencies
uv sync

Also, it is always good to use a version control system such as git to keep track of your changes. An initial commit of the boilerplate code could be:

cd /path/to/algorithm
git init
git add .
git commit -m "Initial commit"

Note that having your code in a git repository is necessary if you want to update your algorithm at a later stage.

4.2.3. Implementing your algorithm

Your personalized starting point should make clear to you which functions you need to implement - there are TODO comments in the code that indicate where you need to add your own code.

You may wonder why the boilerplate code is structured the way it is. This is explained in the code structure section.

4.2.4. Returning results

Returning the results of your algorithm is straightforward. At the end of your algorithm function, you can simply return the results as a dictionary:

def my_function(column_name: str):
    return {
        "result": 42
    }

These results will be returned to the user after the algorithm has finished.

Warning

The results that you return should ideally be JSON serializable. This means that you should not, for example, return a pandas.DataFrame or a numpy.ndarray. Such objects may not be readable to a non-Python-using recipient, or may even be insecure to send over the internet. They should be converted to a JSON-serializable format first (e.g. with df.to_json() in pandas).
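
As a minimal sketch of the advice above, the helper below checks that a result dictionary can be encoded as JSON before it is returned. The name check_json_serializable and the example values are assumptions made for illustration; the helper is not part of vantage6.

```python
import json


def check_json_serializable(result: dict) -> dict:
    """Return the result unchanged if it can be encoded as JSON.

    Hypothetical helper, not part of vantage6. json.dumps raises a
    TypeError for objects such as a pandas.DataFrame or a numpy.ndarray.
    """
    json.dumps(result)
    return result


def my_function() -> dict:
    values = [1.5, 2.5, 4.0]
    # Plain ints, floats, strings, lists and dicts are JSON serializable.
    # pandas and numpy objects would need converting first, for example
    # with df.to_dict() or array.tolist().
    return check_json_serializable(
        {"count": len(values), "mean": sum(values) / len(values)}
    )
```

Running the check inside the algorithm function makes a serialization problem surface where the result is built, rather than later when the result is sent.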

4.2.5. Environment variables

The algorithms have access to several environment variables. You can also specify additional environment variables via the algorithm_env option in the node configuration files (see the example node configuration file). You can access environment variables in your functions as follows:

import os

def my_function():
    # environment variable that was set via algorithm_env in the node configuration
    env_var = os.environ["ENV_VAR_SPECIFIED_IN_NODE_CONFIG"]

    # do something with the value of the environment variable
    pass

You can view all environment variables that are available to your algorithm by running print(os.environ). This includes a number of environment variables that are provided by the vantage6 infrastructure.
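
Because os.environ[...] raises a bare KeyError when a variable is missing, it can help to read variables through a small wrapper that supports a default and gives a clearer message. The sketch below is one way to do that; get_setting and the variable names used with it are hypothetical and not part of vantage6.

```python
import os
from typing import Optional


def get_setting(name: str, default: Optional[str] = None) -> str:
    """Read an environment variable, with an optional fallback value.

    Hypothetical helper. The error message names the variable only,
    so no configured value ends up in the logs.
    """
    value = os.environ.get(name, default)
    if value is None:
        raise KeyError(f"Environment variable '{name}' is not set")
    return value
```

For example, get_setting("ENV_VAR_SPECIFIED_IN_NODE_CONFIG") returns the value from the node configuration, while get_setting("SOME_OPTIONAL_SETTING", "fallback") returns "fallback" when the variable is absent.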

4.2.6. Example functions

Below are simple but typical examples of different types of algorithm functions - central, partial, data extraction, and preprocessing functions.

Central function

from vantage6.algorithm.decorator.algorithm_client import algorithm_client
from vantage6.algorithm.decorator.action import central
from vantage6.algorithm.client import AlgorithmClient
from vantage6.algorithm.tools.util import info, error

@central
@algorithm_client
def main(client: AlgorithmClient, *args, **kwargs):
    # Run partial function.
    info("Creating subtask for partial function")
    task = client.task.create(
        method="my_partial_function",
        arguments={
            "function_argument_1": "value_1",
            "function_argument_2": "value_2"
        },
        organizations=[1, 2]
    )

    # wait for the federated part to complete
    # and return
    results = client.wait_for_results(task_id=task.get("id"))

    return results

Partial function

import pandas as pd
from vantage6.algorithm.tools.decorator import dataframe
from vantage6.algorithm.decorator.action import federated

@federated
@dataframe(1)
def add_one_and_sum(data: pd.DataFrame, column_name: str):
    # do something with the data
    data[column_name] = data[column_name] + 1

    # return the results
    return {
        "result": sum(data[column_name].to_list())
    }
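
The central function shown earlier returns the partial results unchanged. In practice a central function usually combines them first. The sketch below shows one way to do that, assuming the collected results arrive as a list with one dictionary per organization, each shaped like the return value of add_one_and_sum. The name aggregate_sums and the example numbers are made up for illustration.

```python
def aggregate_sums(partial_results: list) -> dict:
    """Add up the "result" value reported by each organization.

    Assumption: every item is a dict with a numeric "result" key,
    as returned by the partial function above.
    """
    total = sum(partial["result"] for partial in partial_results)
    return {"result": total}


# Made-up partial results from two organizations
combined = aggregate_sums([{"result": 9}, {"result": 18}])
print(combined)
```

In the central function, the last line could then become return aggregate_sums(results).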

Data extraction function

import os
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.engine.url import make_url

from vantage6.algorithm.decorator.action import data_extraction
from vantage6.algorithm.tools.util import info

@data_extraction
def read_csv(db_connection_details: dict):
    info("Extracting data")

    # for a CSV database, the URI is the path to the CSV file
    df = pd.read_csv(db_connection_details["uri"])

    return df

@data_extraction
def read_sql_database(db_connection_details: dict):
    # for a SQL database, the URI is the connection string. Environment variables
    # such as username+password can be provided in the node configuration file.
    # Assumption: SQLAlchemy is installed and the URI is a SQLAlchemy-style URL.
    url = make_url(db_connection_details["uri"]).set(
        username=os.getenv("USERNAME"),
        password=os.getenv("PASSWORD"),
    )

    # pandas expects the query first and the connection second
    df = pd.read_sql_query(db_connection_details["query"], create_engine(url))

    return df

Note that the USERNAME and PASSWORD environment variables are not provided by the vantage6 infrastructure. They should be added to the node configuration file as explained in the Environment variables section.

Preprocessing function

import pandas as pd

from vantage6.algorithm.decorator.action import preprocessing
from vantage6.algorithm.tools.util import info

@preprocessing
def add_one(df: pd.DataFrame, column_name: str):
    # do some preprocessing with the data
    df[column_name] = df[column_name] + 1
    return df

4.2.7. Functions provided by the vantage6 infrastructure

There are already some data extraction and preprocessing functions provided by the vantage6 infrastructure. These contain the most common data extractions (such as CSV, Excel, Parquet and basic SQL wrappers) and common preprocessing transformations.

You can make these functions available in your algorithm by importing them from the vantage6 algorithm tools:

# in your algorithm's __init__.py file
from vantage6.algorithm.data_extraction import *
from vantage6.algorithm.preprocessing import *

Note

As an algorithm developer, you should keep in mind that error messages may contain sensitive information. In Python, we often see pandas errors when manipulating data, for instance an error stating that a certain data value is not a valid date.

To help you keep such sensitive information private, vantage6 provides a decorator that can be used to handle pandas errors. This decorator will catch all pandas errors and return a generic error message. You can use this decorator by adding it to your algorithm function:

import pandas as pd
from vantage6.algorithm.tools.error_handling import handle_pandas_errors

@handle_pandas_errors
def my_function(data: pd.DataFrame):
    return data

All data extraction and preprocessing functions provided by the vantage6 algorithm tools package are decorated with the handle_pandas_errors decorator, so that any pandas-related error occurring during the execution of the function will be caught and a generic error message will be returned instead of the traceback. Note that this decorator does not catch all errors, so you should still be careful with the data you handle in your algorithm functions.
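
For errors that the decorator does not cover, a similar pattern can be applied by hand. The sketch below is a generic illustration of the idea and is NOT the vantage6 implementation of handle_pandas_errors. The decorator name hide_error_details, the choice of exception types and the example function are all assumptions.

```python
from functools import wraps


def hide_error_details(func):
    """Replace exception details with a generic message.

    Hypothetical illustration, not the vantage6 implementation.
    Only ValueError and KeyError are caught here.
    """

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except (ValueError, KeyError) as exc:
            # Report only the exception type, since the original message
            # may contain data values. "from None" stops Python from
            # displaying the original exception alongside the new one.
            raise RuntimeError(
                f"{type(exc).__name__} occurred while processing the data"
            ) from None

    return wrapper


@hide_error_details
def parse_age(raw: str) -> int:
    return int(raw)
```

Calling parse_age("42") returns 42 as usual, while an unparseable input raises a RuntimeError whose message names the error type but not the offending value.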

4.2.8. Testing your algorithm

It can be helpful to test your algorithm outside of a containerized environment using the MockNetwork. This may save time as it does not require you to set up a test infrastructure with a vantage6 hub and nodes, and allows you to test your algorithm without building a Docker image every time. The algorithm boilerplate code comes with a test file that you can use to test your algorithm using the MockNetwork - you can of course extend that to add more or different tests.

The MockNetwork comes with a MockAlgorithmClient and a MockUserClient that have the same interface as the AlgorithmClient and the UserClient, so it should be easy to switch between the two. The following example shows how to use the MockUserClient to test your algorithm:

from vantage6.algorithm.mock.mock_network import MockNetwork
network = MockNetwork(
    module_name="my_algorithm",
    datasets=[
        # datasets for node 1
        {"dataset_1": {"database": "mock_data.csv", "db_type": "csv"}},
        # datasets for node 2
        {"dataset_1": {"database": "mock_data.csv", "db_type": "csv"}},
        # datasets for node 3
        {"dataset_1": {"database": "mock_data.csv", "db_type": "csv"}},
    ],
)
client = network.user_client
client.dataframe.create(
    label="dataset_1", method="my_method", arguments={}
)
task = client.task.create(
    method="my_method",
    organizations=[0],
    arguments={
        "example_argument": 10
    },
    databases=[{"label": "dataset_1"}]
)
results = client.result.from_task(task.get("id"))
print(results)

Alternatively, if you do not want to test data extraction, you can provide a pandas DataFrame directly as the dataset instead of a file path and database type:

import pandas as pd
from vantage6.algorithm.mock.mock_network import MockNetwork

network = MockNetwork(
    module_name="my_algorithm",
    datasets=[
        # datasets for node 1
        {"dataset_1": pd.DataFrame({"column_1": [1, 2, 3]})},
        # datasets for node 2
        {"dataset_1": pd.DataFrame({"column_1": [4, 5, 6]})},
        # datasets for node 3
        {"dataset_1": pd.DataFrame({"column_1": [7, 8, 9]})},
    ],
)
client = network.user_client
task = client.task.create(
    method="my_method",
    organizations=[0],
    arguments={
        "example_argument": 10
    },
    databases=[{"label": "dataset_1"}]
)
results = client.result.from_task(task.get("id"))
print(results)
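
To turn the printed results into an actual check, the expected values can be computed independently from the mock data. The sketch below assumes that my_method behaves like the add_one_and_sum example from the example functions section, applied to column_1 of the three DataFrames above. The helper name is hypothetical, and how the mock client structures its results is not shown in this guide, so the final comparison against results is left out.

```python
def expected_add_one_and_sum(values: list) -> dict:
    """Expected output of a partial that adds one to each value and sums them."""
    return {"result": sum(value + 1 for value in values)}


# The same values as column_1 in the three mock datasets above
node_values = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
expected = [expected_add_one_and_sum(values) for values in node_values]
print(expected)
```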

4.2.9. Writing documentation

It is important that you add documentation of your algorithm so that users know how to use it. In principle, you may choose any format of documentation, and you may choose to host it anywhere you like. However, in our experience it works well to keep your documentation close to your code. We recommend using the readthedocs platform to host your documentation. A template for such documentation can be generated when running the v6 algorithm create command.

Alternatively, you could use a README file. If the documentation is not too extensive, e.g. because the algorithm is only for testing purposes, this may be sufficient.

4.2.10. Package & distribute

The algorithm boilerplate comes with a Dockerfile that is a blueprint for creating a Docker image of your algorithm. This Docker image is the package that you will distribute to the nodes.

The Dockerfile is located in the top-level directory of the folder containing your algorithm. From that folder, you can build the image as follows:

docker build -t repo/image:tag .

The -t option specifies the name of your image. This name also serves as the reference to where the image is located on the internet. Once the Docker image has been created, it needs to be uploaded to a registry so that nodes can retrieve it, which you can do by pushing the image:

docker push repo/image:tag

Here are a few examples of how to build and upload your image:

# Build and upload to Docker Hub. Replace my-user-name with your Docker
# Hub username and make sure you are logged in with ``docker login``.
docker build -t my-user-name/algorithm-example:latest .
docker push my-user-name/algorithm-example:latest

# Build and upload to a private registry. Note that to be able to use this, you
# need to have an account at the registry and be logged in with ``docker login``.
docker build -t ghcr.io/vantage6/algorithm/example:latest .
docker push ghcr.io/vantage6/algorithm/example:latest

Now that your algorithm has been uploaded it is available for nodes to retrieve when they need it.

4.2.11. Uploading your algorithm to the algorithm store

To upload your algorithm to the algorithm store, you should generate an algorithm.json file that contains the metadata of your algorithm, such as which functions are available and which arguments they need.

The easiest way to generate this file is to run the following command:

v6 algorithm generate-store-json

That command will help you to generate the appropriate JSON file. Note that type hints and docstrings are important to generate a fully correct JSON file.
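
To illustrate why type hints and docstrings matter, the sketch below reads both from a function definition using only the Python standard library. It is not the implementation of the generate-store-json command, and the dictionary keys it produces are hypothetical rather than the algorithm.json schema. The example function partial_average is also made up.

```python
import inspect
from typing import get_type_hints


def partial_average(column_name: str, min_count: int = 5) -> dict:
    """Compute the sum and count of a column.

    Parameters
    ----------
    column_name : str
        Name of the column to average.
    min_count : int
        Minimum number of rows required to report a result.
    """
    return {}


def describe(func) -> dict:
    """Collect metadata that can be read from a function definition."""
    hints = get_type_hints(func)
    hints.pop("return", None)  # keep argument types only
    doc = inspect.getdoc(func) or ""
    return {
        "name": func.__name__,
        "description": doc.splitlines()[0] if doc else "",
        "arguments": {
            name: getattr(hint, "__name__", str(hint))
            for name, hint in hints.items()
        },
    }
```

Without the type hints, the arguments entry would be empty, and without the docstring, the description would be blank. That is the kind of gap that would have to be filled in by hand.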

Once you have the algorithm.json file, you can upload it to the algorithm store by going to the relevant page in the UI and uploading the file.

4.2.12. Calling your algorithm from vantage6

If you want to test your algorithm in the context of vantage6, you should set up a vantage6 infrastructure. To do that quickly, you can use the v6 sandbox new command, which will create a sandbox environment with a hub and several nodes. Once you have a vantage6 sandbox running, you can create a task for your algorithm. You can do this either via the UI or via the Python client.

It is also possible to test your algorithm by running a test script on a local vantage6 sandbox. This can be done by running the following CLI command:

# Run your own script
v6 test client-script --create-sandbox --script path/to/test_script.py

# OR
# provide task arguments to the default test script
v6 test client-script --create-sandbox --task-arguments "{ 'collaboration': 1, 'organizations': [1], 'name': 'task_name', 'image': 'my_image', 'description': '', 'method': 'my_method', 'arguments': {'column_name': 'my_column'}, 'databases': [{'label': 'db_label'}]}"

Note

For v5.0, you need to have a sandbox that already has extracted dataframes in the database, or the test script should create them. We hope to add features to make this easier in the future.

The commands above will create a sandbox and run the test script on that sandbox. The infrastructure contains a default test script that creates a task where only the arguments for client.task.create have to be provided.
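
Writing the --task-arguments string by hand is error-prone. The sketch below builds it from a Python dictionary, on the assumption that the single-quoted style shown in the command above is what the CLI expects. How the CLI parses the string is not described in this guide, so this should be treated as a convenience rather than a guarantee.

```python
import ast

task_arguments = {
    "collaboration": 1,
    "organizations": [1],
    "name": "task_name",
    "image": "my_image",
    "description": "",
    "method": "my_method",
    "arguments": {"column_name": "my_column"},
    "databases": [{"label": "db_label"}],
}

# repr() of a dict yields the single-quoted style used in the command above
argument_string = repr(task_arguments)

# Sanity check: the string parses back to the same dictionary
assert ast.literal_eval(argument_string) == task_arguments

print(f'v6 test client-script --create-sandbox --task-arguments "{argument_string}"')
```

Note that repr() switches to double quotes for a string value that itself contains a single quote, which would clash with the double quotes around the shell argument. Such values need separate handling.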

The more flexible, but more complex, option is to write your own test script. In this case, the script should contain the code to run and test the algorithm, and return the execution result. For example, to test the average algorithm, the script could look like this:

from vantage6.client import Client

def run_test():
    # Create a client and authenticate
    client = Client(
        hq_url="http://localhost:30761/hq",
        auth_url="http://localhost:30764"
    )
    client.authenticate()

    # create the task
    task = client.task.create(
        organizations=[1],
        name="test_average_task",
        image="ghcr.io/vantage6/algorithm/demo/average:latest",
        description="",
        method="central_average",
        arguments={"column_name": "Age"},
        session=1,
        collaboration=1,
        databases=[{"dataframe_id": 1}],
        action="central_compute",
    )

    # wait for the task to complete
    task_result = client.wait_for_results(task["id"])

    # verify the result
    assert task_result.get("data")[0].get("result") == '{"average": 27.613448844884488}'

if __name__ == "__main__":
    run_test()
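
The assertion in the script above compares the raw JSON string, so it fails as soon as the floating-point formatting or rounding differs slightly. A more tolerant check is sketched below. It assumes, as in the script, that the result is a JSON-encoded string with an "average" key. The helper name average_matches is hypothetical.

```python
import json
import math


def average_matches(raw_result: str, expected: float, rel_tol: float = 1e-9) -> bool:
    """Parse a JSON-encoded result and compare its "average" within a tolerance."""
    parsed = json.loads(raw_result)
    return math.isclose(parsed["average"], expected, rel_tol=rel_tol)
```

In run_test, the final assertion could then read assert average_matches(task_result.get("data")[0].get("result"), 27.613448844884488).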

After running the CLI command, the sandbox that was created or started for this test will be stopped or removed unless you specify the --keep flag in the command.

If a dataset different from the default ones is needed, it can be included in the sandbox by specifying the label and the path to the dataset in the --add-dataset argument of the command:

v6 test client-script --script /path/to/test_script.py --create-sandbox --add-dataset my_label /path/to/dataset

If a sandbox configuration exists, but the sandbox is not running, it is possible to start the existing sandbox and run the test script on it:

v6 test client-script --script /path/to/test_script.py --start-sandbox --name my_sandbox

If neither the --start-sandbox nor the --create-sandbox argument is specified, the test script will be executed on the running sandbox. If no sandbox is running, an error will be raised.

4.2.13. Updating your algorithm

At some point, there may be changes in the vantage6 infrastructure that require you to update your algorithm. Such changes are made available via the v6 algorithm update command. This command will update your algorithm to the latest version of the vantage6 infrastructure.

You can also use the v6 algorithm update command to update your algorithm if you want to modify your answers to the questionnaire. In that case, you should be sure to commit the changes in git before running the command.