.. _algo-concepts: Algorithm concepts ================== This page details the concepts used in vantage6 algorithms. Understanding these concepts is helpful when you to create your own algorithms. A guide to develop your own algorithms can be found in the :ref:`algo-dev-guide`. Algorithms are executed at the vantage6 node. The node receives a computation task from the vantage6 HQ. The node will then retrieve the algorithm, execute it and return the results to HQ. Algorithms are shared using `Docker images `_ which are stored in a :ref:`Docker image registry `. The node downloads the algorithms from the Docker registry. In the following sections we explain the fundamentals of algorithm containers. 1. `Algorithm structure`_: The typical structure of an algorithm 2. `Input & output`_: Interface between the node and algorithm container 3. `Wrappers`_: Library to simplify and standardized the node-algorithm input and output 4. `Child containers`_: Creating subtasks from an algorithm container 5. `Networking`_: Communicate with other algorithm containers and the vantage6 HQ 6. `Cross language`_: Communicate across languages with data serialization Algorithm structure ------------------- Multi-party analyses commonly have a central part and a remote part. The remote part is responsible for the actual analysis and the central part is often responsible for aggregating the partial results of the remote parts. An alternative to aggregating is orchestration, where the central part does not combine the partial results itself, but instead orchestrates the remote parts in a certain way that also leads to a global result. Of course, the central part may also do both aggregation and orchestration. In vantage6, we refer to the orchestration part as the **central function** and the remote part as the **partial function** or **federated function**. A common pattern for a central function would be: 1. Start remote tasks for the partial models from all participants 2. Wait for the remote tasks to complete and obtain the partial models 3. Combine the partial models to a global model For iterative algorithms, step 1-3 can be repeated in a single central function until the model converges. In vantage6, it is also possible to run only the partial parts of the analysis on the nodes and combine them on your own machine, but it is usually preferable to run the central part within vantage6, because then you don't have to keep your machine running during the analysis, you do not need to run the central part yourself, and the results are stored on the HQ, so they may also be accessed by other users. .. note:: Central functions also run at a node and *not* at HQ. For more information, see `here `_. Input & output -------------- The algorithm runs in an isolated environment at the data station. It is important to limit the connectivity and accessability of an algorithm run for security reasons. For instance, by default, algorithms cannot access the internet, which would prevent any malicious algorithms from sending data over the internet. In order for the algorithm to do its work, it is provided with several environment variables and file mounts. These file mounts are provided by the vantage6 infrastructure, so the algorithm developer does not need to provide these files themselves. .. TODO we might want to move this to a more technical section of the docs .. as it is not very relevant to most readers The available file mounts are: *Input* The input file contains the user defined input. Usually, this consists of the algorithm method that should be called together with its arguments. The user specifies this when a task is created. *Output* The algorithm writes its output to this file. When the docker container exits, the contents of this file will be sent back to the vantage6 HQ. The file mounts are mapped to environment variables that are described in the :ref:`algo-env-vars` section. .. TODO the above statement is not correct anymore, that section contains info .. on the env vars that node config file can set. See #2522 .. _wrapper-concepts: Wrappers -------- The vantage6 algorithm wrappers simplify and standardize the interaction between algorithm and node. The algorithm wrapper does the following: - Supply the appropriate environment variables and file mounts to your algorithm. - Select the appropriate algorithm function to run and call it with the appropriate arguments. - Write the output of your algorithm function to the output file The wrapper is language specific and currently only supports Python. Extending this to other languages should however be straightforward. .. figure:: /images/algorithm_wrapper.png The algorithm wrapper handles algorithm input and output. .. _algo-child-containers: Child containers ---------------- When a user creates a task, one or more nodes spawn an algorithm container. These algorithm containers can create new tasks themselves. Only central functions are supplied with a JWT token (see `Input & output`_). This token can be used to communicate with the vantage6 HQ. In case you use an algorithm wrapper, you can supply an ``AlgorithmClient`` using the :ref:`appropriate decorator `. A child container can be a parent container itself. There is no limit to the amount of task layers that can be created, but this is exceedingly rare. It is common to have only a single parent container which handles multiple child containers. .. figure:: /images/container_hierarchy.png Each container can spawn new containers in the network. Only central function containers are provided with a unique token which they can use to communicate to the vantage6 HQ. The token to which the central function containers have access supplies limited permissions to the container. For example, the token can only be used to create child tasks in the same collaboration. Networking ---------- The algorithm containers are deployed in an isolated network to reduce their exposure. Hence, by default, the algorithm containers cannot reach the internet. The main exception is that the vantage6 HQ is reachable through a local proxy service. In the algorithm you can use the ``HOST``, ``POST`` and ``API_PATH`` to find the address of the HQ. Algorithm containers can be granted access to the internet by whitelisting addresses and domains in your :ref:`node configuration `. This can for instance be used to allow the algorithm to reach a database in the local network. This feature should be used with caution - whitelisting unsafe addresses and domains can expose your data to attacks. Cross language -------------- Because algorithms are run as Docker images they can be written in any programming language. This is an advantage as developers can use their preferred language for the problem they need to solve. .. warning:: The wrappers are only available for Python, so when you use different language you need to handle the IO yourself. Consult the `Input & Output`_ section on what the node supplies to your algorithm container. When data is exchanged between the user and the algorithm they both need to be able to read the data. When the algorithm uses a language specific serialization (e.g. a ``pickle`` in the case of Python or ``RData`` in the case of R) the user needs to use the same language to read the results. Therefore, we recommend using a type of serialization that is not specific to a language. In the vantage6 infrastructure wrappers, JSON is used. .. note:: Communication between algorithm containers can use language specific serialization as long as the different parts of the algorithm use the same language.