.. _algo-dev-guide:

Algorithm development step-by-step guide
========================================

This page offers a step-by-step guide to develop a vantage6 algorithm.
We refer to the :ref:`algorithm concepts <algo-concepts>` section
regularly. In that section, we explain the fundamentals of algorithm containers
in more detail than in this guide.

Also, note that this guide is mainly aimed at developers who want to develop
their algorithm in Python, although we will make an effort to indicate where
this differs from algorithms written in other programming languages. Writing your algorithm in
Python is recommended because it is currently the best-supported language for vantage6.

.. _algo-dev-create-algorithm:

Starting point
--------------

When starting to develop a new vantage6 algorithm in Python, the easiest way to
start is:

.. code::

   v6 algorithm create

Running this command will prompt you to answer some questions, which will
result in a personalized starting point or 'boilerplate' for your algorithm.
After doing so, you will have a new folder with the name of your algorithm,
boilerplate code and a checklist in the README.md file that you can follow to
complete your algorithm.

Setting up your environment
---------------------------

It is good practice to set up a virtual environment for your algorithm
package.

.. code:: bash

   # This code is just a suggestion - there are many ways of doing this.

   # go to the algorithm directory
   cd /path/to/algorithm

   # create a Python virtual environment in the .venv folder and activate it
   uv venv --python 3.13
   source .venv/bin/activate  # On Windows: .venv\Scripts\activate

   # install the algorithm dependencies
   uv sync

Also, it is always good to use a version control system such as ``git`` to
keep track of your changes. An initial commit of the boilerplate code could be:

.. code:: bash

   cd /path/to/algorithm
   git init
   git add .
   git commit -m "Initial commit"

Note that having your code in a git repository is necessary if you want to
:ref:`update your algorithm <algo-dev-update-algo>` at a later stage.

Implementing your algorithm
---------------------------

Your personalized starting point should make clear which functions you need to
implement - there are ``TODO`` comments in the code that indicate where you need
to add your own code.

You may wonder why the boilerplate code is structured the way it is. This
is explained in the :ref:`code structure section <algo-code_structure>`.

Returning results
-----------------

Returning the results of your algorithm is rather straightforward. At the end
of your algorithm function, you can simply return the results as a dictionary:

.. code:: python

    def my_function(column_name: str):
        return {
            "result": 42
        }

These results will be returned to the user after the algorithm has finished.

.. warning::

    The results that you return should ideally be JSON serializable. This means that
    you should not, for example, return a ``pandas.DataFrame`` or a
    ``numpy.ndarray``. Such objects may not be readable to a non-Python-using
    recipient, or may even be insecure to send over the internet. They should
    be converted to a JSON-serializable format first (e.g. with ``df.to_json()`` in
    pandas).
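
As a minimal sketch of such a conversion, the helper below turns nested Python
containers into JSON-friendly types and fails early on anything else. The name
``to_json_safe`` is hypothetical and not part of vantage6; it only illustrates
the idea of checking your results before returning them.

.. code:: python

   import json


   def to_json_safe(value):
       """Recursively convert common containers into JSON-friendly types."""
       if isinstance(value, dict):
           return {str(key): to_json_safe(item) for key, item in value.items()}
       if isinstance(value, (list, tuple)):
           return [to_json_safe(item) for item in value]
       if isinstance(value, (str, int, float, bool)) or value is None:
           return value
       # anything else (e.g. a DataFrame) must be converted by the caller first
       raise TypeError(f"Cannot serialize value of type {type(value).__name__}")


   result = to_json_safe({"count": 3, "columns": ("age", "weight")})
   # json.dumps raises a TypeError if anything non-serializable is left
   encoded = json.dumps(result)

Calling such a helper just before the ``return`` statement of your function
surfaces serialization problems during development rather than at runtime.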

.. _algo-env-vars:

Environment variables
---------------------

The algorithms have access to several environment variables. You can also
specify additional environment variables via the ``algorithm_env`` option
in the node configuration files (see the
:ref:`example node configuration file <node-configure-structure>`). You can access
environment variables in your functions as follows:

.. code:: python

   import os

   def my_function():
       # read an environment variable that was set in the node configuration
       env_var = os.environ["ENV_VAR_SPECIFIED_IN_NODE_CONFIG"]

       # do something with the value
       pass

You can view all environment variables that are available to your algorithm by
running ``print(os.environ)``. This includes a number of environment variables that are
provided by the vantage6 infrastructure.
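
Because a variable from the node configuration may be missing at some nodes, it
can help to fail with a clear message. The sketch below shows one way to do
that; ``get_setting`` and ``MY_ALGO_BATCH_SIZE`` are hypothetical names used
only for illustration.

.. code:: python

   import os
   from typing import Optional


   def get_setting(name: str, default: Optional[str] = None) -> str:
       """Return an environment variable, a default, or raise a clear error."""
       value = os.environ.get(name, default)
       if value is None:
           raise RuntimeError(
               f"Environment variable {name} is not set; "
               "add it to algorithm_env in the node configuration"
           )
       return value


   # set here only to make the example self-contained
   os.environ["MY_ALGO_BATCH_SIZE"] = "100"
   # environment variables are always strings, so convert explicitly
   batch_size = int(get_setting("MY_ALGO_BATCH_SIZE"))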

Example functions
-----------------

Below are simple but typical examples of different types of algorithm functions -
central, partial, data extraction, and preprocessing functions.

Central function
~~~~~~~~~~~~~~~~

.. code:: python

   from vantage6.algorithm.decorator.algorithm_client import algorithm_client
   from vantage6.algorithm.decorator.action import central
   from vantage6.algorithm.client import AlgorithmClient
   from vantage6.algorithm.tools.util import info

   @central
   @algorithm_client
   def main(client: AlgorithmClient, *args, **kwargs):
       # Run partial function.
       info("Creating subtask for partial function")
       task = client.task.create(
           method="my_partial_function",
           arguments={
               "function_argument_1": "value_1",
               "function_argument_2": "value_2",
           },
           organizations=[1, 2],
       )

       # wait for the partial functions to complete and return their results
       info("Waiting for results")
       results = client.wait_for_results(task_id=task.get("id"))

       return results

Partial function
~~~~~~~~~~~~~~~~

.. code:: python

   import pandas as pd
   from vantage6.algorithm.tools.decorator import dataframe
   from vantage6.algorithm.decorator.action import federated

   @federated
   @dataframe(1)
   def add_one_and_sum(data: pd.DataFrame, column_name: str):
       # do something with the data
       data[column_name] = data[column_name] + 1

       # return the results
       return {
           "result": sum(data[column_name].to_list())
       }
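
To illustrate how a central function might combine the output of this partial
function, the following minimal sketch sums the per-node results. It assumes
that the central function receives a list with one dictionary per node;
``aggregate_sums`` and the example values are hypothetical and not part of
vantage6.

.. code:: python

   def aggregate_sums(partial_results: list) -> dict:
       """Combine the per-node sums into one global sum."""
       total = sum(partial["result"] for partial in partial_results)
       return {"result": total}


   # stand-in for the results that the subtasks of a central function return
   example_results = [{"result": 9}, {"result": 18}, {"result": 27}]
   global_result = aggregate_sums(example_results)

Only the aggregated value leaves the central function in this sketch, so the
individual node contributions are not passed on to the user.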

Data extraction function
~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

   import os
   import pandas as pd
   from sqlalchemy import create_engine
   from sqlalchemy.engine import make_url

   from vantage6.algorithm.decorator.action import data_extraction
   from vantage6.algorithm.tools.util import info

   @data_extraction
   def read_csv(db_connection_details: dict):
       info("Extracting data")

       # for a CSV database, the URI is the path to the CSV file
       return pd.read_csv(db_connection_details["uri"])

   @data_extraction
   def read_sql_database(db_connection_details: dict):
       # for a SQL database, the URI is the connection string. Environment variables
       # such as username+password can be provided in the node configuration file.
       # Assumption: SQLAlchemy is installed and is used to open the connection.
       url = make_url(db_connection_details["uri"]).set(
           username=os.getenv("USERNAME"),
           password=os.getenv("PASSWORD"),
       )
       df = pd.read_sql_query(db_connection_details["query"], create_engine(url))

       return df

Note that the ``USERNAME`` and ``PASSWORD`` environment variables are not provided by the
vantage6 infrastructure. They should be added to the node configuration file as
explained in the :ref:`algo-env-vars` section.

Preprocessing function
~~~~~~~~~~~~~~~~~~~~~~

.. code:: python

   import pandas as pd

   from vantage6.algorithm.decorator.action import preprocessing
   from vantage6.algorithm.tools.util import info

   @preprocessing
   def add_one(df: pd.DataFrame, column_name: str):
       # do some preprocessing with the data
       df[column_name] = df[column_name] + 1
       return df

.. _algo-functions-provided:

Functions provided by the vantage6 infrastructure
-------------------------------------------------

There are already some data extraction and preprocessing functions provided by the
vantage6 infrastructure. These contain the most common data extractions (such as
CSV, Excel, Parquet and basic SQL wrappers) and common preprocessing transformations.

You can make these functions available in your algorithm by importing them from the
vantage6 algorithm tools:

.. code:: python

   # in your algorithm's __init__.py file
   from vantage6.algorithm.data_extraction import *
   from vantage6.algorithm.preprocessing import *

.. note::

   As an algorithm developer, you should keep in mind that error messages may contain
   sensitive information. In Python, we often see pandas errors when manipulating data,
   for instance that a certain data value is not a valid date.

   To help you keep such sensitive information private, vantage6
   provides a decorator that can be used to handle pandas errors. This decorator will
   catch all pandas errors and return a generic error message. You can use this
   decorator by adding it to your algorithm function:

   .. code:: python

      import pandas as pd
      from vantage6.algorithm.tools.error_handling import handle_pandas_errors

      @handle_pandas_errors
      def my_function(data: pd.DataFrame):
         return data

   All data extraction and preprocessing functions provided by the vantage6 algorithm
   tools package are decorated with the ``handle_pandas_errors`` decorator, so that any
   pandas-related error occurring during the execution of the function will be caught
   and a generic error message will be returned instead of the traceback. Note that
   this decorator does not catch all errors, so you should still be careful with the
   data you handle in your algorithm functions.
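
For errors that are not covered by ``handle_pandas_errors``, you could apply
the same idea yourself. The sketch below is not the vantage6 implementation; it
is a generic illustration of a decorator that replaces selected exceptions with
a message that contains no data values. ``mask_errors`` and ``AlgorithmError``
are hypothetical names.

.. code:: python

   import functools


   class AlgorithmError(Exception):
       """Hypothetical generic error that is safe to show to users."""


   def mask_errors(*error_types):
       """Replace the listed exception types with a generic message."""
       def decorator(func):
           @functools.wraps(func)
           def wrapper(*args, **kwargs):
               try:
                   return func(*args, **kwargs)
               except error_types as exc:
                   # 'from None' hides the original exception, whose message
                   # may contain data values
                   raise AlgorithmError(
                       f"{type(exc).__name__} in {func.__name__}"
                   ) from None
           return wrapper
       return decorator


   @mask_errors(ValueError)
   def parse_age(raw: str) -> int:
       return int(raw)


   try:
       parse_age("secret-value")
       message = ""
   except AlgorithmError as exc:
       message = str(exc)

Here ``message`` only names the error type and the function, not the value that
caused the failure.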

.. _mock-test-algo-dev:

Testing your algorithm
----------------------

It can be helpful to test your algorithm outside of a containerized environment using
the ``MockNetwork``. This may save time as it does not require you to set up a test
infrastructure with a vantage6 hub and nodes, and allows you to test your algorithm
without building a Docker image every time. The algorithm boilerplate code comes with a
test file that you can use to test your algorithm using the ``MockNetwork`` - you can
of course extend that to add more or different tests.

The ``MockNetwork`` comes with a ``MockAlgorithmClient`` and a ``MockUserClient`` that
have the same interface as the ``AlgorithmClient`` and the ``UserClient``, so it should
be easy to switch between the two. The following example shows how to use the
``MockUserClient`` to test your algorithm:

.. code:: python

        from vantage6.algorithm.mock.mock_network import MockNetwork
        network = MockNetwork(
            module_name="my_algorithm",
            datasets=[
                # datasets for node 1
                {"dataset_1": {"database": "mock_data.csv", "db_type": "csv"}},
                # datasets for node 2
                {"dataset_1": {"database": "mock_data.csv", "db_type": "csv"}},
                # datasets for node 3
                {"dataset_1": {"database": "mock_data.csv", "db_type": "csv"}},
            ],
        )
        client = network.user_client
        client.dataframe.create(
            label="dataset_1", method="my_method", arguments={}
        )
        task = client.task.create(
            method="my_method",
            organizations=[0],
            arguments={
                "example_argument": 10
            },
            databases=[{"label": "dataset_1"}]
        )
        results = client.result.from_task(task.get("id"))
        print(results)

Alternatively, if you do not want to test data extraction, you can provide a pandas
DataFrame directly as the dataset value:

.. code:: python

        import pandas as pd
        from vantage6.algorithm.mock.mock_network import MockNetwork

        network = MockNetwork(
            module_name="my_algorithm",
            datasets=[
                # datasets for node 1
                {"dataset_1": pd.DataFrame({"column_1": [1, 2, 3]})},
                # datasets for node 2
                {"dataset_1": pd.DataFrame({"column_1": [4, 5, 6]})},
                # datasets for node 3
                {"dataset_1": pd.DataFrame({"column_1": [7, 8, 9]})},
            ],
        )
        client = network.user_client
        task = client.task.create(
            method="my_method",
            organizations=[0],
            arguments={
                "example_argument": 10
            },
            databases=[{"label": "dataset_1"}]
        )
        results = client.result.from_task(task.get("id"))
        print(results)

Writing documentation
---------------------

It is important that you add documentation of your algorithm so that users
know how to use it. In principle, you may choose any format of documentation,
and you may choose to host it anywhere you like. However, in our experience it
works well to keep your documentation close to your code. We recommend using the
``readthedocs`` platform to host your documentation. A template for such documentation
can be generated when running the ``v6 algorithm create`` command.

Alternatively, you could use a ``README`` file - if the documentation is not too
extensive, e.g. the algorithm is only for testing purposes, this may be sufficient.

Package & distribute
--------------------

The algorithm boilerplate comes with a ``Dockerfile`` that is a blueprint for
creating a Docker image of your algorithm. This Docker image is the package
that you will distribute to the nodes.

The Dockerfile is located at the top level of your algorithm folder. From that
folder, you can build the image as follows:

.. code:: bash

   docker build -t repo/image:tag .

The ``-t`` flag sets the name of your image. This name also serves as the
reference to where the image is located on the internet. Once the Docker image is
created, it needs to be uploaded to a registry so that nodes can retrieve it,
which you can do by pushing the image:

.. code:: bash

   docker push repo/image:tag

Here are a few examples of how to build and upload your image:

.. code:: bash

    # Build and upload to Docker Hub. Replace my-user-name with your Docker
    # Hub username and make sure you are logged in with `docker login`.
    docker build -t my-user-name/algorithm-example:latest .
    docker push my-user-name/algorithm-example:latest

    # Build and upload to private registry. Note that to be able to use this, you need
    # to have an account at the registry and be logged in with ``docker login``
    docker build -t ghcr.io/vantage6/algorithm/example:latest .
    docker push ghcr.io/vantage6/algorithm/example:latest

Now that your algorithm has been uploaded it is available for nodes to retrieve
when they need it.

Uploading your algorithm to the algorithm store
-----------------------------------------------

To upload your algorithm to the algorithm store, you should generate an
``algorithm.json`` file that contains the metadata of your algorithm, such as
which functions are available and which arguments they need.

The easiest way to generate this file is to run the following command:

.. code:: bash

   v6 algorithm generate-store-json

That command will help you to generate the appropriate JSON file. Note that type hints
and docstrings are important to generate a fully correct JSON file.
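
To see why type hints and docstrings matter, the sketch below uses Python's
standard ``inspect`` module to read the kind of metadata that can be derived
from a function definition. It does not show how the CLI works internally, and
the keys of the resulting dictionaries are illustrative only; they are not the
``algorithm.json`` schema. ``partial_mean`` is a hypothetical function.

.. code:: python

   import inspect


   def partial_mean(column_name: str, min_rows: int = 10) -> dict:
       """Compute the mean of a column.

       Parameters
       ----------
       column_name : str
           Name of the column to average.
       min_rows : int
           Minimum number of rows required to return a result.
       """
       return {}


   signature = inspect.signature(partial_mean)
   arguments = [
       {
           "name": name,
           # without a type hint, the annotation would be Parameter.empty
           "type": parameter.annotation.__name__,
           "has_default": parameter.default is not inspect.Parameter.empty,
       }
       for name, parameter in signature.parameters.items()
   ]
   # the first docstring line is a natural candidate for a short description
   description = inspect.getdoc(partial_mean).splitlines()[0]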

Once you have the ``algorithm.json`` file, you can upload it to the algorithm store
by going to the relevant page in the UI and uploading the file.

Calling your algorithm from vantage6
------------------------------------

If you want to test your algorithm in the context of vantage6, you should
set up a vantage6 infrastructure. To do that quickly, you can use the ``v6 sandbox new``
command, which will create a sandbox environment with a hub and several nodes.
Once you have a vantage6 sandbox running, you can create a task for
your algorithm. You can do this either via the :ref:`UI <ui>` or via the
:ref:`Python client <pyclient-create-task>`.

It is also possible to test your algorithm by running a test script on a local
vantage6 :ref:`sandbox <local-test>`. This can be done by running
the following CLI command:

.. code:: bash

   # Run your own script
   v6 test client-script --create-sandbox --script path/to/test_script.py

   # OR
   # provide task arguments to the default test script
   v6 test client-script --create-sandbox --task-arguments "{ 'collaboration': 1, 'organizations': [1], 'name': 'task_name', 'image': 'my_image', 'description': '', 'method': 'my_method', 'arguments': {'column_name': 'my_column'}, 'databases': [{'label': 'db_label'}]}"

.. note::

    For v5.0, you need to have a sandbox that already has extracted dataframes in the
    database, or the test script should create them. We hope to add features to make
    this easier in the future.


The commands above will create a sandbox and run the test script on that sandbox. The
infrastructure contains a default test script that creates a task where only the
arguments for ``client.task.create`` have to be provided.

The more flexible, but more complex, option is to write your own test script.
In this case, the script should contain the code to run and test the algorithm, and
return the execution result. For example, to test the average algorithm, the script
could look like this:

.. code:: python

    from vantage6.client import Client

    def run_test():
        # Create a client and authenticate
        client = Client(
            hq_url="http://localhost:30761/hq",
            auth_url="http://localhost:30764"
        )
        client.authenticate()

        # create the task
        task = client.task.create(
            organizations=[1],
            name="test_average_task",
            image="ghcr.io/vantage6/algorithm/demo/average:latest",
            description="",
            method="central_average",
            arguments={"column_name": "Age"},
            session=1,
            collaboration=1,
            databases=[{"dataframe_id": 1}],
            action="central_compute",
        )

        # wait for the task to complete
        task_result = client.wait_for_results(task["id"])

        # verify the result
        assert task_result.get("data")[0].get("result") == '{"average": 27.613448844884488}'

    if __name__ == "__main__":
        run_test()
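
Comparing the result as an exact string, as in the script above, is sensitive
to floating-point formatting. A possible alternative, sketched below under the
assumption that the result is a JSON string with an ``average`` key, is to
parse the string and compare the number with a tolerance. ``check_average`` is
a hypothetical helper.

.. code:: python

   import json
   import math


   def check_average(raw_result: str, expected: float, rel_tol: float = 1e-9) -> bool:
       """Parse a JSON result string and compare the average numerically."""
       parsed = json.loads(raw_result)
       return math.isclose(parsed["average"], expected, rel_tol=rel_tol)


   ok = check_average('{"average": 27.613448844884488}', 27.6134488448845)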

After running the CLI command, the sandbox that was created or started for this test
will be stopped and removed, unless you specify the ``--keep`` flag in the command.

If a dataset different from the default ones is needed, it can be included in the
sandbox by specifying the label and the path to the dataset in the ``--add-dataset``
argument of the command:

.. code:: bash

   v6 test client-script --script /path/to/test_script.py --create-sandbox --add-dataset my_label /path/to/dataset

If a sandbox configuration exists, but the sandbox is not running, it is possible
to start the existing sandbox and run the test script on it:

.. code:: bash

   v6 test client-script --script /path/to/test_script.py --start-sandbox --name my_sandbox

If neither the ``--start-sandbox`` nor the ``--create-sandbox`` argument is specified,
the test script will be executed on the running sandbox - if no sandbox is running, an
error will be raised.


.. _algo-dev-update-algo:

Updating your algorithm
-----------------------

At some point, there may be changes in the vantage6 infrastructure that require
you to update your algorithm. Such changes are made available via
the ``v6 algorithm update`` command. This command will update your algorithm
to the latest version of the vantage6 infrastructure.

You can also use the ``v6 algorithm update`` command to update your algorithm
if you want to modify your answers to the questionnaire. In that case, you
should be sure to commit the changes in ``git`` before running the command.