6.3.1. Run context

The run_context feature is an experimental step towards a more standard way of providing an algorithm with everything it needs to carry out its specific job.

The idea behind run_context is to expose a JSON file that conveys that information and that the algorithm itself reads. It is not yet meant to be a stable cross-platform standard or a complete description of the underlying executor.

Why this exists

Currently, the node conveys context (which inputs to use, their paths, the output path, arguments, etc.) to the algorithm container through a mix of (base32-encoded) environment variables and written files. This tightly couples the algorithm wrapper (vantage6-algorithm-tools) to the platform, which makes it harder to write vantage6 algorithms in other languages and trickier to reuse algorithms written for vantage6 on other platforms.

Beyond other platforms and languages, some potential new features, such as letting the researcher specify the role of a dataset for an algorithm or attach richer metadata to inputs and outputs, were pushing the limits of what environment variables can reasonably be used for.

Current scope

When the experimental node configuration option run_context_file is enabled, the node writes a run_context.json file for each algorithm run and exposes its path inside the algorithm container via the RUN_CONTEXT_FILE environment variable.
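As a minimal sketch of the algorithm side, assuming only what is described above (a plain path in RUN_CONTEXT_FILE pointing at a JSON file), the file could be located and parsed as follows. The function name load_run_context and the optional env parameter are hypothetical and exist only for this illustration.

```python
import json
import os
from pathlib import Path
from typing import Any, Mapping, Optional


def load_run_context(env: Optional[Mapping[str, str]] = None) -> Optional[dict[str, Any]]:
    """Return the parsed run context, or None if no path was provided."""
    env = os.environ if env is None else env
    path = env.get("RUN_CONTEXT_FILE")
    if not path:
        # The option is not enabled on this node, so the caller can fall
        # back to the existing environment variables and files.
        return None
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)
```

Returning None instead of raising keeps the same code usable on nodes where the option is disabled.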

Algorithms can read it directly, or a more portable algorithm wrapper can be written to make use of it. The current JSON file tries to describe, in a compact way:

  • the selected algorithm entrypoint

  • the positional and named arguments for that entrypoint

  • the inputs made available to the algorithm

  • the outputs the algorithm is expected to write

  • a small amount of runtime metadata about where it’s being executed

  • other vantage6-specific metadata

This feature is intentionally minimal and still under development. The exact field names and contents may change.
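Because field names may still change, a reader of the file may want to fail early on versions it does not know. The sketch below assumes the schema_version field shown in the illustrative example further down. The set of supported versions and the function name are hypothetical choices for this illustration, not part of vantage6.

```python
# Assumption: this reader was written against version "0.1" only.
SUPPORTED_SCHEMA_VERSIONS = {"0.1"}


def check_schema_version(context: dict) -> str:
    """Return the schema version, or raise if this reader does not know it."""
    version = context.get("schema_version")
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"Unsupported run context schema_version: {version!r}")
    return version
```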

Early example

A very small demo algorithm using this approach is available here: https://github.com/mdw-nl/average-py

It uses a lightweight Python helper library for reading and dispatching run_context.json here: https://github.com/mdw-nl/run-context-py

This is conceptually similar to what vantage6-algorithm-tools does for the current vantage6 runtime, but with the algorithm reading from RUN_CONTEXT_FILE instead of depending on the current vantage6-specific environment-variable and file conventions.

Ideally, in the future, the information that the aggregator component of an algorithm needs to communicate with other nodes could also be “standardized” and included here.

Illustrative example

The following example shows the current shape of the file. The comments are explanatory only and are not part of the actual JSON file.

{
  // At the moment, this is a proposal/experiment, hence version 0.1.
  "schema_version": "0.1",
  // Method selected for this run. The running algorithm can use this to
  // pick a starting point within it
  "entrypoint": {
    "name": "my_function"
  },
  // Positional and named arguments from task input
  "arguments": {
    "positional": [],
    "named": {
      "column_name": "my_column"
    }
  },
  // Minimal runtime identity of the executing node. Provides some extra
  // context that the algorithm might need to use for its operation
  "executor": {
    "id": 17,
    "kind": "vantage6-node"
  },
  // Data sources the algorithm should use
  "inputs": [
    {
      "id": "default",
      "uri": "/mnt/data/default.csv",
      "type": "csv",
      // Eventually it might be better to move 'arguments'.* a level up
      "arguments": {
        "bind": "dataset"
      }
    },
    {
      // For example, a config file for the dataset itself
      "id": "default_config",
      "uri": "/mnt/data/default.yaml",
      "type": "other",
      // Eventually it might be better to move 'arguments'.* a level up
      "arguments": {
        "bind": "config"
      }
    }
  ],
  // For example, the researcher could have selected this input with:
  // {"label": "somelabel", "arguments": {"bind": "dataset"}}
  // Locations where results should be written
  "outputs": [
    {
      "id": "result",
      "uri": "/mnt/data/task_123/output"
    }
  ],
  // vantage6-specific run/task metadata
  // Perhaps, if this run context "standard" proves useful, some of these
  // keys can move upward out of this extra/out-of-standard
  // vantage6-specific section
  "x-vantage6": {
    "run_id": 10,
    "task_id": 123,
    "collaboration_id": 1,
    "token_file": "/mnt/data/task_123/token",
    "api_proxy": {
      "host": "http://proxyserver",
      "port": "8080",
      "api_path": ""
    },
    "temporary_directory": "/mnt/tmp"
  }
}
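To illustrate how an algorithm might act on such a file, the sketch below selects a function by entrypoint.name, calls it with the positional and named arguments, and writes the result to the first output. It takes an already parsed context (the real file contains no comments, so json.load can read it). The registry, the decorator, the example function body and the choice to serialize the result as JSON are all assumptions made for this illustration; the output format that the platform expects is not specified here.

```python
import json
from pathlib import Path
from typing import Any, Callable

# Hypothetical registry of the functions this algorithm exposes.
ENTRYPOINTS: dict[str, Callable[..., Any]] = {}


def entrypoint(func: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function under its own name."""
    ENTRYPOINTS[func.__name__] = func
    return func


@entrypoint
def my_function(column_name: str) -> dict:
    # Placeholder body; a real algorithm would compute something here.
    return {"column": column_name}


def run(context: dict) -> Any:
    """Dispatch to the selected entrypoint and write its result."""
    name = context["entrypoint"]["name"]
    if name not in ENTRYPOINTS:
        raise KeyError(f"Unknown entrypoint: {name!r}")
    arguments = context.get("arguments", {})
    result = ENTRYPOINTS[name](
        *arguments.get("positional", []),
        **arguments.get("named", {}),
    )
    # Assumption: one output whose "uri" is a plain file path, written as JSON.
    output_path = Path(context["outputs"][0]["uri"])
    output_path.write_text(json.dumps(result), encoding="utf-8")
    return result
```

With the example file above, run would call my_function(column_name="my_column") and write the result to /mnt/data/task_123/output.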

How this relates to the current vantage6 runtime

If the researcher supplied arguments for a selected database when creating the task, those values are included in the corresponding inputs[*].arguments object. Other top-level selector fields such as query, sheet_name and preprocessing are not currently copied into the run context.
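One way an algorithm could use inputs[*].arguments is to look up each input by the role the researcher assigned to it. The sketch below assumes the "bind" key from the illustrative example and treats it as a unique role name per input; that interpretation, and the function name, are assumptions for this illustration.

```python
def inputs_by_binding(context: dict) -> dict[str, str]:
    """Map each input's "bind" value to that input's URI."""
    bound: dict[str, str] = {}
    for item in context.get("inputs", []):
        bind = item.get("arguments", {}).get("bind")
        if bind is None:
            # Inputs without a binding are skipped in this sketch.
            continue
        if bind in bound:
            raise ValueError(f"Binding {bind!r} is used by more than one input")
        bound[bind] = item["uri"]
    return bound
```

For the two inputs in the example file, this yields a mapping from "dataset" to /mnt/data/default.csv and from "config" to /mnt/data/default.yaml.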

As mentioned, vantage6 algorithms already receive runtime context through the existing vantage6-specific mechanism of written files and environment variables. For example, some of the input/output paths and connection details that the node shares with the algorithm container via environment variables as of version 4.13.7 look something like this:

  • INPUT_FILE

  • OUTPUT_FILE

  • HOST

  • USER_REQUESTED_DATABASE_LABELS

  • <DB_LABEL>_DATABASE_URI

In the current node implementation, most node-provided environment variable values are base32-encoded before they are injected into the algorithm container. The Python algorithm tools decode them again when the wrapper starts. Algorithms that use the standard vantage6 Python wrappers therefore continue to use the existing interface unchanged.
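For readers unfamiliar with base32, the round trip can be sketched with the Python standard library. This shows only the standard RFC 4648 encoding; the exact scheme used by the node and by vantage6-algorithm-tools (for example how padding characters are handled) is not described here and may differ, and the function names are hypothetical.

```python
import base64


def encode_env_value(value: str) -> str:
    """Encode a string as standard base32 text."""
    return base64.b32encode(value.encode("utf-8")).decode("ascii")


def decode_env_value(encoded: str) -> str:
    """Decode standard base32 text back into the original string."""
    return base64.b32decode(encoded.encode("ascii")).decode("utf-8")
```

An algorithm written in another language would need an equivalent decoding step for each such variable, which is part of the coupling that the run context tries to avoid.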

The RUN_CONTEXT_FILE path is added to that existing list of environment variables. Unlike most node-provided environment variable values, this path is intentionally left plain so that algorithms can read it directly without needing vantage6-specific awareness. Note that enabling run_context_file does not change the current way of passing information to the algorithm container via files and environment variables. It is only an addition; existing algorithms using vantage6-algorithm-tools should continue to work as before. The only difference is that run_context.json will be created as well.

Current limitations

At the moment, the experimental run_context support does not yet model several existing features, including:

  • VPN and port-forwarding details

  • SSH tunnel metadata

  • whitelist / Squid proxy policy details

  • linked Docker service aliases

  • full environment-variable compatibility details

In addition, the current run_context implementation only supports file-based databases. If the feature is enabled and a task requests a non-file database, the task fails to start. This is done for security reasons, e.g. to prevent database secrets from being present in additional files that may persist for some time.

We have opened a discussion on this topic where more information can be found: https://github.com/orgs/vantage6/discussions/2556