.. _algo-run-context:

Run context
-----------

The ``run_context`` feature is an experimental step towards a more standard way
of providing an algorithm with everything it needs to carry out its specific
job.

The idea behind ``run_context`` is to convey that information in a JSON file
that the algorithm itself reads. It is **not** yet meant to be a stable
cross-platform standard or a complete description of the underlying executor.

Why this exists
^^^^^^^^^^^^^^^

The node currently conveys context (which inputs, their paths, the output path,
arguments, etc.) to the algorithm container through a mix of (base32-encoded)
environment variables and written files. This tightly couples the algorithm
wrapper (``vantage6-algorithm-tools``) to the platform, which makes it harder to
write vantage6 algorithms in other languages and harder to reuse vantage6
algorithms on other platforms.

Beyond other platforms and languages, some potential new features, such as
letting the researcher specify the role of a dataset for an algorithm or attach
richer metadata to inputs and outputs, were pushing the limits of what
environment variables can reasonably be used for.


Current scope
^^^^^^^^^^^^^

When the experimental node configuration option ``run_context_file`` is enabled,
the node writes a ``run_context.json`` file for each algorithm run and exposes
its path inside the algorithm container via the ``RUN_CONTEXT_FILE`` environment
variable.

Algorithms can read the file directly, or a more portable algorithm wrapper can
be written to make use of it. The current JSON file tries to describe, in a
compact way:

- the selected algorithm entrypoint
- the positional and named arguments for that entrypoint
- the inputs made available to the algorithm
- the outputs the algorithm is expected to write
- a small amount of runtime metadata about where it's being executed
- other vantage6-specific metadata

This feature is intentionally minimal and still under development. The exact
field names and contents may change.
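
As a minimal sketch of the algorithm side, the file can be located through
``RUN_CONTEXT_FILE`` and parsed with nothing but the standard library. The
helper name ``load_run_context`` is hypothetical and not part of any vantage6
package; the only things taken from this page are the environment variable name
and the fact that the file is JSON.

.. code:: python

    import json
    import os


    def load_run_context(env=os.environ):
        """Load the run context JSON that RUN_CONTEXT_FILE points to.

        ``env`` defaults to the process environment; passing a plain dict
        makes the function easy to exercise outside a container.
        """
        path = env.get("RUN_CONTEXT_FILE")
        if not path:
            # The variable is only set when the node has the experimental
            # ``run_context_file`` option enabled.
            raise RuntimeError("RUN_CONTEXT_FILE is not set")
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)

Because the field names may still change, a caller would probably want to
inspect the ``schema_version`` field before relying on any other key.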

Early example
^^^^^^^^^^^^^

A very small demo algorithm using this approach is available here:
https://github.com/mdw-nl/average-py

It uses a lightweight Python helper library that reads ``run_context.json`` and
dispatches on it, available here:
https://github.com/mdw-nl/run-context-py

This is conceptually similar to what ``vantage6-algorithm-tools`` does for the
current vantage6 runtime, but with the algorithm reading from
``RUN_CONTEXT_FILE`` instead of depending on the current vantage6-specific
environment-variable and file conventions.

Ideally, the information that the aggregator component of an algorithm needs to
communicate with other nodes could in the future also be "standardized" and
included here.

Illustrative example
^^^^^^^^^^^^^^^^^^^^

The following example shows the current shape of the file. The comments are
explanatory only and are not part of the actual JSON file.

.. code:: javascript

    {
      // At the moment, this is a proposal/experiment, hence version 0.1.
      "schema_version": "0.1",
      // Method selected for this run. The running algorithm can use this to
      // pick a starting point within it
      "entrypoint": {
        "name": "my_function"
      },
      // Positional and named arguments from task input
      "arguments": {
        "positional": [],
        "named": {
          "column_name": "my_column"
        }
      },
      // Minimal runtime identity of the executing node. Provides some extra
      // context that the algorithm might need to use for its operation
      "executor": {
        "id": 17,
        "kind": "vantage6-node"
      },
      // Data sources the algorithm should use
      "inputs": [
        {
          "id": "default",
          "uri": "/mnt/data/default.csv",
          "type": "csv",
          // Eventually it might be better to move 'arguments'.* a level up
          "arguments": {
            "bind": "dataset"
          }
        },
        {
          // For example, a config file for the dataset itself
          "id": "default_config",
          "uri": "/mnt/data/default.yaml",
          "type": "other",
          // Eventually it might be better to move 'arguments'.* a level up
          "arguments": {
            "bind": "config"
          }
        }
      ],
      // For example, the researcher could have selected this input with:
      // {"label": "somelabel", "arguments": {"bind": "dataset"}}
      // Locations where results should be written
      "outputs": [
        {
          "id": "result",
          "uri": "/mnt/data/task_123/output"
        }
      ],
      // vantage6-specific run/task metadata
      // Perhaps, if this run context "standard" proves useful, some of these
      // keys can move upward out of this extra/out-of-standard
      // vantage6-specific section
      "x-vantage6": {
        "run_id": 10,
        "task_id": 123,
        "collaboration_id": 1,
        "token_file": "/mnt/data/task_123/token",
        "api_proxy": {
          "host": "http://proxyserver",
          "port": "8080",
          "api_path": ""
        },
        "temporary_directory": "/mnt/tmp"
      }
    }
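
To illustrate how an algorithm might act on such a file, the sketch below
selects a function by ``entrypoint.name``, calls it with the ``arguments``
object, and writes the result to the first entry of ``outputs``. The registry,
the ``entrypoint`` decorator, ``dispatch`` and ``write_result`` are hypothetical
names introduced for this example. Writing the result as JSON to a single file
is likewise an assumption, since this page does not specify an output format.

.. code:: python

    import json

    # Hypothetical registry mapping entrypoint names to Python callables.
    ENTRYPOINTS = {}


    def entrypoint(func):
        """Register ``func`` under its own name so it can be dispatched to."""
        ENTRYPOINTS[func.__name__] = func
        return func


    @entrypoint
    def my_function(column_name):
        # Placeholder body; a real algorithm would read its inputs here.
        return {"column": column_name}


    def dispatch(context):
        """Call the function named by ``entrypoint.name`` with the arguments."""
        name = context["entrypoint"]["name"]
        if name not in ENTRYPOINTS:
            raise ValueError(f"unknown entrypoint: {name!r}")
        arguments = context.get("arguments", {})
        positional = arguments.get("positional", [])
        named = arguments.get("named", {})
        return ENTRYPOINTS[name](*positional, **named)


    def write_result(context, result):
        """Write ``result`` as JSON to the first declared output (assumption)."""
        uri = context["outputs"][0]["uri"]
        with open(uri, "w", encoding="utf-8") as handle:
            json.dump(result, handle)

With the example file above, ``dispatch`` would call
``my_function(column_name="my_column")``.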

How this relates to the current vantage6 runtime
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If the researcher supplied ``arguments`` for a selected database when
creating the task, those values are included in the corresponding
``inputs[*].arguments`` object. Other top-level selector fields such as
``query``, ``sheet_name`` and ``preprocessing`` are not currently copied into
the run context.
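
As a rough sketch of how an algorithm could consume these per-input arguments,
the helper below indexes the inputs by their ``bind`` value. It assumes that
``bind`` names the role an input plays for the algorithm, as in the
illustrative file above; ``inputs_by_bind`` is a hypothetical name and not part
of vantage6.

.. code:: python

    def inputs_by_bind(context):
        """Map each input's ``arguments.bind`` value to that input's ``uri``."""
        bound = {}
        for item in context.get("inputs", []):
            role = item.get("arguments", {}).get("bind")
            if role is None:
                # No researcher-supplied bind for this input; skip it.
                continue
            if role in bound:
                raise ValueError(f"duplicate bind role: {role!r}")
            bound[role] = item["uri"]
        return bound

For the illustrative file, this yields a mapping from ``dataset`` to
``/mnt/data/default.csv`` and from ``config`` to ``/mnt/data/default.yaml``.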

As mentioned, vantage6 algorithms already receive runtime context through the
existing vantage6-specific mechanism of written files and environment variables.
For example, as of v4.13.7, some of the input/output paths and connection
details that the node shares with the algorithm container via environment
variables look something like this:

- ``INPUT_FILE``
- ``OUTPUT_FILE``
- ``HOST``
- ``USER_REQUESTED_DATABASE_LABELS``
- ``<DB_LABEL>_DATABASE_URI``

In the current node implementation, most node-provided environment variable
values are base32-encoded before they are injected into the algorithm
container. The Python algorithm tools decode them again when the wrapper starts.
Algorithms that use the standard vantage6 Python wrappers therefore continue to
use the existing interface unchanged.
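
To give a feel for what this round trip involves, here is a generic base32
encode/decode pair built on Python's standard ``base64`` module. It is an
illustration of the idea only: the function names are hypothetical, and the
exact scheme used by the node and by ``vantage6-algorithm-tools`` may differ in
details (for example in how padding characters are handled).

.. code:: python

    import base64


    def encode_env_value(value: str) -> str:
        """Base32-encode a string so it is safe to place in an env variable."""
        return base64.b32encode(value.encode("utf-8")).decode("ascii")


    def decode_env_value(encoded: str) -> str:
        """Reverse ``encode_env_value``."""
        return base64.b32decode(encoded.encode("ascii")).decode("utf-8")

An algorithm written in another language would need an equivalent decoding
step for every such variable, which is part of the coupling described earlier.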

The ``RUN_CONTEXT_FILE`` path is added to that existing list of environment
variables. Unlike most node-provided environment variable values, this path is
intentionally left plain so that algorithms can read it directly without
needing vantage6-specific awareness. Note that enabling
``run_context_file`` does not change the current way of passing information to
the algorithm container via files and environment variables. It is only an
addition; existing algorithms using ``vantage6-algorithm-tools`` should
continue to work as before. The only difference is that ``run_context.json``
is created as well and ``RUN_CONTEXT_FILE`` points to it.

Current limitations
^^^^^^^^^^^^^^^^^^^

At the moment, the experimental ``run_context`` support does not yet model
several existing features, including:

- VPN and port-forwarding details
- SSH tunnel metadata
- whitelist / Squid proxy policy details
- linked Docker service aliases
- full environment-variable compatibility details

In addition, the current ``run_context`` implementation only supports
file-based databases. If the feature is enabled and a task requests a non-file
database, the task fails to start. This is a deliberate security measure, e.g.
to avoid writing database secrets to additional files that may persist for some
time.

We have opened a discussion on this topic where more information can be found:
https://github.com/orgs/vantage6/discussions/2556