6.3.1. Run context
The run_context feature is an experimental step towards a more standard way
of providing an algorithm with everything it needs to carry out its specific
job.
The idea behind run_context is to convey that information in a JSON file
that the algorithm itself can read.
It is not yet meant to be a stable cross-platform standard or a complete
description of the underlying executor.
Why this exists
Currently, the node conveys context (which inputs, their paths, the output
path, arguments, etc.) to the algorithm container through a mix of
(base32-encoded) environment variables and written files. This tightly couples
the algorithm wrapper (vantage6-algorithm-tools) to the platform, which makes
it harder to write vantage6 algorithms in other languages and to reuse
algorithms written for vantage6 on other platforms.
Beyond other platforms and languages, some potential new features were pushing the limits of what environment variables can reasonably be used for. Examples include allowing the researcher to specify the role of a dataset for an algorithm, or attaching richer metadata to inputs and outputs.
Current scope
When the experimental node configuration option run_context_file is enabled,
the node writes a run_context.json file for each algorithm run and exposes
its path inside the algorithm container via the RUN_CONTEXT_FILE environment
variable.
Algorithms can read it directly, or a more portable algorithm wrapper can be written to make use of it. The current JSON file tries to describe, in a compact way:
the selected algorithm entrypoint
the positional and named arguments for that entrypoint
the inputs made available to the algorithm
the outputs the algorithm is expected to write
a small amount of runtime metadata about where it’s being executed
other vantage6-specific metadata
This feature is intentionally minimal and still under development. The exact field names and contents may change.
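As a minimal sketch of the reading side, an algorithm could locate and parse the file as follows. Only the RUN_CONTEXT_FILE variable name and the JSON format come from the description above; the function name load_run_context and the choice to return None when the variable is unset are illustrative assumptions, not part of vantage6.

```python
import json
import os
from pathlib import Path


def load_run_context(env=None):
    """Parse the run context JSON found at the path in RUN_CONTEXT_FILE.

    Hypothetical helper (not a vantage6 API). Returns None when the variable
    is not set, for example because the node option run_context_file is
    disabled.
    """
    env = os.environ if env is None else env
    path = env.get("RUN_CONTEXT_FILE")
    if not path:
        return None
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


if __name__ == "__main__":
    context = load_run_context()
    if context is None:
        print("RUN_CONTEXT_FILE is not set; no run context available")
    else:
        # Field names may change while the feature is experimental.
        print("schema version:", context.get("schema_version"))
```

Because the exact field names may still change, reading optional fields with .get() rather than direct indexing keeps such a reader a little more tolerant of schema changes.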
Early example
A very small demo algorithm using this approach is available here: https://github.com/mdw-nl/average-py
It uses a lightweight Python helper library for reading and dispatching
run_context.json here:
https://github.com/mdw-nl/run-context-py
This is conceptually similar to what vantage6-algorithm-tools does for the
current vantage6 runtime, but with the algorithm reading from
RUN_CONTEXT_FILE instead of depending on the current vantage6-specific
environment-variable and file conventions.
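To illustrate what "reading and dispatching" can mean in practice, here is a rough sketch of a dispatcher. It is not the API of run-context-py; the dispatch function and the registry dictionary are hypothetical, and the field names (entrypoint.name, arguments.positional, arguments.named) are assumed to match the illustrative file shown in this section.

```python
def dispatch(run_context, registry):
    """Call the function named by the run context with its arguments.

    `registry` maps entrypoint names to callables. Both this function and
    the registry are illustrative, not part of any vantage6 library.
    """
    name = run_context["entrypoint"]["name"]
    try:
        func = registry[name]
    except KeyError:
        raise ValueError(f"unknown entrypoint: {name!r}") from None
    arguments = run_context.get("arguments", {})
    positional = arguments.get("positional", [])
    named = arguments.get("named", {})
    return func(*positional, **named)


def my_function(column_name):
    # Stand-in for a real algorithm method.
    return f"processing {column_name}"


example_context = {
    "entrypoint": {"name": "my_function"},
    "arguments": {"positional": [], "named": {"column_name": "my_column"}},
}

result = dispatch(example_context, {"my_function": my_function})
print(result)
```

An explicit registry means only functions the algorithm author has listed can be selected by the task, rather than any name that happens to exist in the module.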
Ideally, the information that the aggregator component of an algorithm needs to communicate with other nodes could, in the future, also be “standardized” and included here.
Illustrative example
The following example shows the current shape of the file. The comments are explanatory only and are not part of the actual JSON file.
{
  // At the moment, this is a proposal/experiment, hence version 0.1.
  "schema_version": "0.1",
  // Method selected for this run. The running algorithm can use this to
  // pick a starting point within it
  "entrypoint": {
    "name": "my_function"
  },
  // Positional and named arguments from task input
  "arguments": {
    "positional": [],
    "named": {
      "column_name": "my_column"
    }
  },
  // Minimal runtime identity of the executing node. Provides some extra
  // context that the algorithm might need to use for its operation
  "executor": {
    "id": 17,
    "kind": "vantage6-node"
  },
  // Data sources the algorithm should use
  "inputs": [
    {
      "id": "default",
      "uri": "/mnt/data/default.csv",
      "type": "csv",
      // Eventually it might be better to move 'arguments'.* a level up
      "arguments": {
        "bind": "dataset"
      }
    },
    {
      // For example, a config file for the dataset itself
      "id": "default_config",
      "uri": "/mnt/data/default.yaml",
      "type": "other",
      // Eventually it might be better to move 'arguments'.* a level up
      "arguments": {
        "bind": "config"
      }
    }
  ],
  // For example, the researcher could have selected this input with:
  // {"label": "somelabel", "arguments": {"bind": "dataset"}}
  // Locations where results should be written
  "outputs": [
    {
      "id": "result",
      "uri": "/mnt/data/task_123/output"
    }
  ],
  // vantage6-specific run/task metadata
  // Perhaps, if this run context "standard" proves useful, some of these
  // keys can move upward out of this extra/out-of-standard
  // vantage6-specific section
  "x-vantage6": {
    "run_id": 10,
    "task_id": 123,
    "collaboration_id": 1,
    "token_file": "/mnt/data/task_123/token",
    "api_proxy": {
      "host": "http://proxyserver",
      "port": "8080",
      "api_path": ""
    },
    "temporary_directory": "/mnt/tmp"
  }
}
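Given a file of this shape, an algorithm might resolve its inputs by their bind value and write its result to the declared output. The sketch below rests on several assumptions: the helper names inputs_by_bind and write_result are hypothetical, each uri is treated as a local file path, and serializing the result as JSON is an illustrative choice rather than something the run context prescribes.

```python
import json
from pathlib import Path


def inputs_by_bind(run_context):
    """Map each input's arguments.bind value to its uri.

    Inputs without a bind value are skipped. Hypothetical helper.
    """
    bound = {}
    for item in run_context.get("inputs", []):
        bind = item.get("arguments", {}).get("bind")
        if bind is not None:
            bound[bind] = item["uri"]
    return bound


def write_result(run_context, result, output_id="result"):
    """Write `result` as JSON to the output with the given id.

    Assumes the output uri is a local file path. Returns the path written.
    """
    for output in run_context.get("outputs", []):
        if output["id"] == output_id:
            Path(output["uri"]).write_text(json.dumps(result), encoding="utf-8")
            return output["uri"]
    raise KeyError(f"no output with id {output_id!r}")


# Input section taken from the illustrative file above.
example_inputs = inputs_by_bind(
    {
        "inputs": [
            {"id": "default", "uri": "/mnt/data/default.csv", "type": "csv",
             "arguments": {"bind": "dataset"}},
            {"id": "default_config", "uri": "/mnt/data/default.yaml",
             "type": "other", "arguments": {"bind": "config"}},
        ]
    }
)
print(example_inputs)
```

Looking inputs up by role ("dataset", "config") instead of by position keeps the algorithm independent of the order in which the researcher selected the inputs.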
How this relates to the current vantage6 runtime
If the researcher supplied arguments for a selected database when
creating the task, those values are included in the corresponding
inputs[*].arguments object. Other top-level selector fields such as query,
sheet_name and preprocessing are not currently copied into the run
context.
As mentioned, vantage6 algorithms already receive runtime context through the existing vantage6-specific mechanism of written files and environment variables. For example, as of version 4.13.7, the environment variables through which the node shares input/output paths and connection details with the algorithm container include:
INPUT_FILE
OUTPUT_FILE
HOST
USER_REQUESTED_DATABASE_LABELS
<DB_LABEL>_DATABASE_URI
In the current node implementation, most node-provided environment variable values are base32-encoded before they are injected into the algorithm container. The Python algorithm tools decode them again when the wrapper starts. Algorithms that use the standard vantage6 Python wrappers therefore continue to use the existing interface unchanged.
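To make the encoding step concrete, the following sketch round-trips a value through base32 with Python's standard library. It assumes plain RFC 4648 base32 as implemented by the base64 module; the node's exact encoding (for example, how padding characters are handled) is not described here and may differ, so treat decode_env_value as a hypothetical stand-in rather than the wrapper's actual decoder.

```python
import base64


def decode_env_value(value: str) -> str:
    """Decode a base32-encoded string back to text.

    Assumes standard RFC 4648 base32 with '=' padding; the encoding actually
    used by the node may differ in detail.
    """
    return base64.b32decode(value.encode("ascii")).decode("utf-8")


# Simulate what an encoded environment variable value could look like.
encoded = base64.b32encode("/mnt/data/task_123/output".encode("utf-8")).decode("ascii")
decoded = decode_env_value(encoded)
print(encoded)
print(decoded)
```

An algorithm written in another language would need an equivalent decoding step for each such variable, which is part of the coupling that run_context aims to reduce.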
The RUN_CONTEXT_FILE path is added to that existing list of environment
variables. Unlike most node-provided environment variable values, this path is
intentionally left plain so that algorithms can read it directly without
needing vantage6-specific awareness. Note that enabling
run_context_file does not change the current way of passing information to
the algorithm container via files and environment variables. It is only an
addition; existing algorithms using vantage6-algorithm-tools should
continue to work as before. The only difference is that run_context.json
will be created as well.
Current limitations
At the moment, the experimental run_context support does not yet model
several existing features, including:
VPN and port-forwarding details
SSH tunnel metadata
whitelist / Squid proxy policy details
linked Docker service aliases
full environment-variable compatibility details
etc.
In addition, the current run_context implementation only supports
file-based databases. If the feature is enabled and a task requests a non-file
database, the task fails to start. This is done for security reasons, e.g. to
prevent database secrets from being present in additional files that may persist
for some time.
We have opened a discussion on this topic where more information can be found: https://github.com/orgs/vantage6/discussions/2556