2.5. Configure

The vantage6-node requires a configuration file to run. This is a YAML file with a specific format.

The next sections describe how to configure the node. They first provide a few quick answers on setting up your node, then show an example of all configuration file options, and finally explain where your vantage6 configuration files are stored.

2.5.1. How to create a configuration file

The easiest way to create an initial configuration file is to run v6 node new. This lets you configure the basic settings. For more advanced configuration options, which are listed below, you can view the example configuration file.

2.5.2. Where is my configuration file?

To see where your configuration file is located, you can use the following command:

v6 node files

Warning

This command will not work if you have put your configuration file in a custom location. Also, you may need to specify the --system flag if you put your configuration file in the system folder.

2.5.3. All configuration options

The following configuration file is an example that intends to list all possible configuration options.

You can download this file here: node_config.yaml

# API key used to authenticate the node at the server. The API key is generated when
# the node is registered at the server. If you lost the API key you can generate a new
# one (if you have sufficient permissions), in the UI under Administration -> Node
# -> Reset API key, or by using the API endpoint POST /recover/node.
api_key: ***

# URL of the vantage6 server
server_url: https://cotopaxi.vantage6.ai

# port the server listens to
port: 443

# API path prefix that the server uses. Usually '/api' or an empty string
api_path: ''

# subnet of the VPN server
vpn_subnet: 10.76.0.0/16

# set the devices the algorithm container is allowed to request.
algorithm_device_requests:
  gpu: false

# Add additional environment variables to the algorithm containers. In case
# you want to supply database specific environment (e.g. usernames and
# passwords) you should use `env` key in the `database` section of this
# configuration file.
# OPTIONAL
algorithm_env:

  # in this example the environment variable 'player' has
  # the value 'Alice' inside the algorithm container
  player: Alice

# Add additional environment variables to the node container. This can be useful
# if you need to modify the configuration of certain python libraries that the
# node uses. For example, if you want to use a custom CA bundle for the requests
# library you can specify it here.
node_extra_env:
  REQUESTS_CA_BUNDLE: /etc/ssl/certs/ca-certificates.crt

# Add additional volumes to the node container. This can be useful if you need
# to mount a custom CA bundle for the requests library for example.
node_extra_mounts:
  - /etc/ssl/certs/ca-certificates.crt:/etc/ssl/certs/ca-certificates.crt:ro

node_extra_hosts:
  # In Linux (no Docker Desktop) you can use this (special) mapping to access
  # the host from the node.
  # See: https://docs.docker.com/reference/cli/docker/container/run/#add-host
  host.docker.internal: host-gateway
  # For testing purposes, it can also be used to map a public domain to a
  # private IP address, allowing you to avoid breaking TLS hostname verification
  v6server.example.com: 192.168.1.10

# specify custom Docker images to use for starting the different
# components.
# OPTIONAL
images:
  node: ghcr.io/vantage6/infrastructure/node:cotopaxi
  alpine: ghcr.io/vantage6/infrastructure/alpine
  vpn_client: ghcr.io/vantage6/infrastructure/vpn-client
  network_config: ghcr.io/vantage6/infrastructure/vpn-configurator
  ssh_tunnel: ghcr.io/vantage6/infrastructure/ssh-tunnel
  squid: ghcr.io/vantage6/infrastructure/squid

# path or endpoint to the local data source. The client can request a
# certain database by using its label. The type is used by the
# auto_wrapper method used by algorithms. This way the algorithm wrapper
# knows how to read the data from the source. The auto_wrapper currently
# supports: 'csv', 'parquet', 'sql', 'sparql', 'excel', 'omop'. You can
# also use 'folder' to mount an entire data folder as a single database.
# If your algorithm does not use the wrapper and you have a different type of
# data source you can specify 'other'.
databases:
  - label: default
    uri: C:\data\datafile.csv
    type: csv
    # OPTIONAL: controls how file/folder databases are exposed to algorithm
    # containers.
    # - copy (default): copy file data into the node data volume
    # - ro: bind-mount host path read-only into algorithm containers
    # NOTE: For now, `ro` has tested support for Linux hosts only. On
    # Windows/macOS hosts, use `copy` (default).
    # Use `ro` only for file/folder-backed data sources.
    mount_mode: copy

  - label: omop
    uri: jdbc:postgresql://host.docker.internal:5454/postgres
    type: omop
    # additional environment variables that are passed to the algorithm
    # containers (or their wrapper). This can be used for usernames
    # and passwords for example. Note that these environment variables are
    # only passed to the algorithm container when the user requests that
    # database. In case you want to pass some environment variable to all
    # algorithms regardless of the data source the user specifies you can
    # use the `algorithm_env` setting.
    env:
      user: admin@admin.com
      password: admin
      dbms: postgresql
      cdm_database: postgres
      cdm_schema: public
      results_schema: results

  # For folder mounts, the directory on the host will be mounted under /mnt/<label>.
  # In the example below, the folder `/path/to/share/with/container` will be
  # made available as `/mnt/persistent`. The folder will be mounted read/write.
  # **WARNING**: do *NOT* use 'data' as label.
  - label: persistent
    type: folder
    uri: /path/to/share/with/container
    # OPTIONAL: set to `ro` to mount this host folder read-only in algorithm
    # containers. For now, `ro` has tested support for Linux hosts only. On
    # Windows/macOS hosts, use `copy` (default).
    mount_mode: ro


# end-to-end encryption settings
encryption:

  # whether encryption is enabled or not. This should be the same
  # as the `encrypted` setting of the collaboration to which this
  # node belongs.
  enabled: false

  # location of the private key file
  private_key: /path/to/private_key.pem

# Define who is allowed to run which algorithms on this node.
policies:
  # Control which algorithm images are allowed to run on this node. This is
  # expected to be a valid regular expression. If you don't specify this, all algorithm
  # images are allowed to run on this node (unless other policies restrict this).
  allowed_algorithms:
    - ^ghcr\.io/vantage6/algorithm/[a-zA-Z]+/[a-zA-Z]+$
    - ^ghcr\.io/vantage6/algorithm/glm$

  # It is also possible to allow all algorithms from particular algorithm stores. Set
  # these stores here. They may be strings or regular expressions. If you don't specify
  # this, algorithms from any store are allowed (unless other policies prevent this).
  allowed_algorithm_stores:
    # allow all algorithms from the vantage6 community store
    - https://store.cotopaxi.vantage6.ai
    # allow any store that is a subdomain of vantage6.ai
    - ^https://[a-z]+\.vantage6\.ai$

  # If you define both `allowed_algorithm_stores` and `allowed_algorithms`, you can
  # choose to only allow algorithms that both policies allow, or to allow
  # algorithms that either of them allows. With the default value `false`, only
  # algorithms that pass *both* `allowed_algorithms` and `allowed_algorithm_stores`
  # are allowed. Set this to `true` to allow algorithms that pass either policy.
  allow_either_whitelist_or_store: false

  # Define which users are allowed to run algorithms on your node by their ID
  allowed_users:
    - 2
  # Define which organizations are allowed to run images on your node by
  # their ID or name
  allowed_organizations:
    - 6
    - root

  # The basics algorithm (ghcr.io/vantage6/algorithm/basics:latest) is whitelisted
  # by default. It is used to collect column names in the User Interface to
  # facilitate task creation. Set to false to disable this.
  allow_basics_algorithm: true

  # Set to true to always require that the algorithm image is successfully pulled. This
  # ensures that no potentially outdated local images are used if internet connection
  # is not available. This option should be set to false if you are testing with local
  # algorithm images. Default value is true.
  require_algorithm_pull: true

# credentials used to login to private Docker registries
docker_registries:
  - registry: docker-registry.org
    username: docker-registry-user
    password: docker-registry-password

# Create SSH Tunnel to connect algorithms to external data sources. The
# `hostname` and `tunnel:bind:port` can be used by the algorithm
# container to connect to the external data source. This is the address
# you need to use in the `databases` section of the configuration file!
ssh-tunnels:

  # Hostname to be used within the internal network. I.e. this is the
  # hostname that the algorithm uses to connect to the data source. Make
  # sure this is unique and the same as what you specified in the
  # `databases` section of the configuration file.
  - hostname: my-data-source

    # SSH configuration of the remote machine
    ssh:

      # Hostname or ip of the remote machine, in case it is the docker
      # host you can use `host.docker.internal` for Windows and MacOS.
      # In the case of Linux you can use `172.17.0.1` (the ip of the
      # docker bridge on the host)
      host: host.docker.internal
      port: 22

      # fingerprint of the remote machine. This is used to verify the
      # authenticity of the remote machine.
      fingerprint: "ssh-rsa ..."

      # Username and private key to use for authentication on the remote
      # machine
      identity:
        username: username
        key: /path/to/private_key.pem

    # Once the SSH connection is established, a tunnel is created to
    # forward traffic from the local machine to the remote machine.
    tunnel:

      # The port and ip on the tunnel container. The ip is always
      # 0.0.0.0 as we want the algorithm container to be able to
      # connect.
      bind:
        ip: 0.0.0.0
        port: 8000

      # The port and ip on the remote machine. If the data source runs
      # on this machine, the ip most likely is 127.0.0.1.
      dest:
        ip: 127.0.0.1
        port: 8000

# Whitelist URLs and/or IP addresses that the algorithm containers are allowed to reach
# using the http protocol. Note that the addresses given below are examples and should
# be replaced with the actual addresses that you want to whitelist.
whitelist:
  domains:
    - google.com
    - github.com
    - host.docker.internal # docker host ip (windows/mac)
  ips:
    - 172.17.0.1 # docker bridge ip (linux)
    - 8.8.8.8
  ports:
    - 443

# Containers that are defined here are linked to the algorithm containers and
# can therefore be accessed by the algorithm when it is running. Note that
# for using this option, the container with 'container_name' should already be
# started before the node is started. Also, if you are using this option together with
# the `whitelist` option, make sure to whitelist the `container_label` under `ips`,
# as well as the port(s) that you want to reach on the container.
docker_services:
    container_label: container_name

# Settings for the logger
logging:
  # Controls the logging output level. Could be one of the following
  # levels: CRITICAL, ERROR, WARNING, INFO, DEBUG, NOTSET
  level:        DEBUG

  # whether the output needs to be shown in the console
  use_console:  true

  # The number of log files that are kept, used by RotatingFileHandler
  backup_count: 5

  # Size (in kB) of a single log file, used by RotatingFileHandler
  max_size:     1024

  # Format: input for logging.Formatter
  format:       "%(asctime)s - %(name)-14s - %(levelname)-8s - %(message)s"
  datefmt:      "%Y-%m-%d %H:%M:%S"

  # (optional) set the individual log levels per logger name, for example
  # mute some loggers that are too verbose.
  loggers:
    - name: urllib3
      level: warning
    - name: requests
      level: warning
    - name: engineio.client
      level: warning
    - name: docker.utils.config
      level: warning
    - name: docker.auth
      level: warning

# Additional debug flags
debug:

  # Set to `true` to put Flask/socketio into debug mode.
  socketio: false

  # Set to `true` to set the Flask app used for the LOCAL proxy service
  # into debug mode
  proxy_server: false


# directory where local task files (input/output) are stored
task_dir: C:\Users\<your-user>\AppData\Local\vantage6\node\mydir

# Whether or not your node shares some configuration (e.g. which images are
# allowed to run on your node) with the central server. This can be useful
# for other organizations in your collaboration to understand why a task
# is not completed. Obviously, no sensitive data is shared. Default true
share_config: true


# Whether or not to share algorithm logs with the server. Otherwise they will
# only be displayed as part of the node logs. Default is true.
# NOTE: It's recommended to set this to false when using real data
share_algorithm_logs: false

# EXPERIMENTAL: current scope is intentionally minimal.
# Whether or not to write a run context file for each algorithm run and
# expose its path as RUN_CONTEXT_FILE inside the algorithm container.
# At the moment, run context does not model several node runtime features
# (e.g. VPN, SSH tunnels, whitelist/Squid proxy policy, etc).
# Also, run context currently supports only file-based databases. If a task
# requests a non-file database while this feature is enabled, task startup fails.
# Default is false.
run_context_file: false

# Prometheus settings, for sending system metadata to the server.
prometheus:
  # Whether or not to enable Prometheus reporting. Default is false.
  enabled: false

  # Interval (in seconds) at which the node sends system metadata to the server.
  # This should align with the Prometheus scrape_interval to avoid stale data.
  report_interval_seconds: 45
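
The comments in the `ssh-tunnels` section say that the tunnel `hostname` and `tunnel:bind:port` together form the address to use in the `databases` section. The following is a minimal Python sketch of that relationship, not vantage6 code. The dictionary mirrors the example values above; the helper names `tunnel_address` and `database_entry`, the label `remote`, and the bare `host:port` form of the URI are assumptions for illustration. The exact URI (scheme, path, database name) depends on your data source.

```python
# One `ssh-tunnels` entry from the example configuration, reduced to the
# fields that determine the address an algorithm container connects to.
tunnel_entry = {
    "hostname": "my-data-source",
    "tunnel": {
        "bind": {"ip": "0.0.0.0", "port": 8000},
        "dest": {"ip": "127.0.0.1", "port": 8000},
    },
}


def tunnel_address(entry: dict) -> str:
    """Return '<hostname>:<bind port>', the address reachable by algorithms."""
    return f"{entry['hostname']}:{entry['tunnel']['bind']['port']}"


def database_entry(label: str, entry: dict, db_type: str = "other") -> dict:
    """Build a `databases` item that points at the tunnel (hypothetical helper)."""
    return {"label": label, "uri": tunnel_address(entry), "type": db_type}


print(tunnel_address(tunnel_entry))  # my-data-source:8000
print(database_entry("remote", tunnel_entry))
```

Note that the `dest` address is only used on the remote side of the tunnel. The algorithm never sees it, which is why it does not appear in the `databases` entry.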

2.5.4. Configuration file location

The directory where the configuration file is stored depends on your operating system (OS). It is possible to store the configuration file at system or at user level. By default, node configuration files are stored at user level, which makes this configuration available only for your user.

The default directories per OS are as follows:

Operating System   System-folder                                 User-folder
----------------   -------------------------------------------   --------------------------------------------------------
Windows            C:\ProgramData\vantage\node\                  C:\Users\<user>\AppData\Local\vantage\node\
MacOS              /Library/Application Support/vantage6/node/   /Users/<user>/Library/Application Support/vantage6/node/
Linux              /etc/vantage6/node/                           /home/<user>/.config/vantage6/node/

Note

The command v6 node looks in these directories by default. However, you can use any directory and specify its location with the --config flag. Note that you then have to specify the --config flag every time you execute a v6 node command!

Similarly, you can put your node configuration file in the system folder by using the --system flag. Note that in that case, you have to specify the --system flag for all v6 node commands.
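
As a rough illustration of the user-level defaults listed above, the following Python sketch maps an operating system name to the corresponding folder. It is not how vantage6 resolves the path internally. The function name `default_user_config_dir` is hypothetical, and the OS names are assumed to be the values returned by Python's `platform.system()`. System-level folders (used with --system) are deliberately left out.

```python
from pathlib import PurePosixPath, PureWindowsPath


def default_user_config_dir(os_name: str, user: str):
    """Return the default user-level node config folder for `os_name`.

    `os_name` is expected to be a value such as platform.system() returns:
    "Windows", "Darwin" (macOS) or "Linux".
    """
    if os_name == "Windows":
        return PureWindowsPath("C:/Users", user, "AppData", "Local", "vantage", "node")
    if os_name == "Darwin":
        return PurePosixPath("/Users", user, "Library", "Application Support", "vantage6", "node")
    if os_name == "Linux":
        return PurePosixPath("/home", user, ".config", "vantage6", "node")
    raise ValueError(f"No default folder known for OS {os_name!r}")


print(default_user_config_dir("Linux", "alice"))  # /home/alice/.config/vantage6/node
```

In practice, v6 node files remains the authoritative way to find the location on your machine.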

2.5.5. Security

As a data owner it is important that you take the necessary steps to protect your data. Vantage6 allows algorithms to run on your data and share the results with other parties. It is important that you review the algorithms before allowing them to run on your data.

Once you have approved an algorithm, it is important that you can verify that the algorithm that runs on your data is the one you approved. Two steps are important to accomplish this:

  • Setting policies on the allowed algorithms in the policies section of the node-configuration file. You can specify a list of regular expressions here. Some examples of what you could define (note that these examples overlap so in practice you would not use all of them):

    policies:
       allowed_algorithms:
          - ^ghcr\.io/vantage6/algorithm/[a-zA-Z]+/[a-zA-Z]+
          - ^ghcr\.io/vantage6/algorithm/glm$
          - ^ghcr\.io/vantage6/algorithm/glm@sha256:82becede498899ec668628e7cb0ad87b6e1c371cb8a1e597d83a47fac21d6af3$
       allowed_algorithm_stores:
          - https://store.cotopaxi.vantage6.ai
    

    These four examples lead to the following restrictions:

    1. ^ghcr\.io/vantage6/algorithm/[a-zA-Z]+/[a-zA-Z]+: allow all images from the ghcr.io/vantage6 registry that start with algorithm/

    2. ^ghcr\.io/vantage6/algorithm/glm$: only allow the GLM image, but all builds of this image

    3. ^ghcr\.io/vantage6/algorithm/glm@sha256:82becede498899ec668628e7cb0ad87b6e1c371cb8a1e597d83a47fac21d6af3$: allows only this specific build of the GLM image to run on your data

    4. https://store.cotopaxi.vantage6.ai: allow all algorithms from the cotopaxi algorithm store

    Note that you can also define regular expressions for the algorithm stores, and that you can combine the two policies. The section Configuring algorithm access to the data below explains the considerations you need to take into account when setting these policies.

  • Enable DOCKER_CONTENT_TRUST to verify the origin of the image. For more details see the documentation from Docker.

Warning

By enabling DOCKER_CONTENT_TRUST you might not be able to use certain algorithms. You can check this by verifying that the images you want to be used are signed.
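
To get a feeling for how regular expressions in `allowed_algorithms` behave, you can try them out with Python's `re` module before putting them in your configuration. The sketch below assumes that a pattern is matched from the start of the image name as given (like `re.match`). This is an assumption and not a statement about the node's internals, and it ignores tags and digests. The function name `is_image_allowed` and the image names other than `glm` are made up for the example.

```python
import re

# Two of the example patterns from the policies section above.
ALLOWED_ALGORITHMS = [
    r"^ghcr\.io/vantage6/algorithm/[a-zA-Z]+/[a-zA-Z]+",  # no trailing `$`
    r"^ghcr\.io/vantage6/algorithm/glm$",
]


def is_image_allowed(image: str, patterns: list[str]) -> bool:
    """Return True if `image` matches at least one pattern from the start."""
    return any(re.match(pattern, image) for pattern in patterns)


for image in [
    "ghcr.io/vantage6/algorithm/glm",
    "ghcr.io/vantage6/algorithm/glmnet",
    "ghcr.io/vantage6/algorithm/stats/mean",
    "ghcr.io/vantage6/algorithm/stats/mean-unreviewed",
]:
    print(image, is_image_allowed(image, ALLOWED_ALGORITHMS))
```

Under this assumption the last image is accepted as well: the first pattern has no trailing `$`, so anything that merely starts with an allowed name passes. Ending a pattern with `$` and escaping dots as `\.` keeps it as strict as intended.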

Configuring algorithm access to the data

As explained above, you can specify a list of allowed algorithms in the configuration file of the data station. Only algorithms specified on that list, by providing the names of the Docker images of these algorithms, are allowed to run on the data station. Also, you can specify the exact (non-forgeable) hash (i.e. version) of the trusted algorithm. Note that this process requires manual updates to the data station configuration, as well as a data station restart, each time that a new algorithm is approved or an existing algorithm is updated.

It is also possible to allow a set of algorithms at once by providing a pattern, i.e. a regular expression. This makes it possible, for example, to allow a certain directory of algorithms. The disadvantage of this approach is that if an attacker (or IT personnel with malicious intent) gains access to that directory, a malicious algorithm placed there would pass the filter of allowed algorithms. Similarly, specifying single algorithms without hashes is not fully secure if an attacker can access that address.

A third possibility is to allow algorithms from a trusted algorithm store. The benefit of this is that the algorithm store already manages the algorithms currently allowed including most up-to-date version information. When the algorithm is updated, the store will tell the node automatically to only allow the new version. The disadvantage of this approach is that if an attacker gains access to the store, the node is not protected from malicious algorithms.

The safest policy regarding allowed algorithms is to specify an exact list of all allowed algorithms, including the version (specified by the image hash), at the node. However, this also entails a significant maintenance burden if the algorithms are updated frequently. Institutes following this policy would have to log in to their data station for every algorithm change to update the allowed algorithm configuration. Although this is a quick update, it would still require a manual action every time. Also, as a manual action, it is error-prone. Errors will probably prevent the algorithm from running successfully on that node. Alternatively, manual errors may lead to security concerns, but this is less likely.

If your project has a separate algorithm store and image registry, a good alternative is to define two policies at the node that enforce restrictions on both the algorithm store and the registry. One policy allows only algorithms from the project's own algorithm store, and the other allows only algorithms from the project's own image registry. That way, an attacker would need to gain access to the private registry, the algorithm store, and the server before being able to send a malicious task. The probability of a successful attack on all of these components is much lower than a successful attack on a single component.
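
The effect of combining a registry policy with a store policy can be sketched as follows. This is a simplified model of the decision described in the configuration comments for `allow_either_whitelist_or_store`, assuming both policies are defined. It is not the node's actual implementation. The registry `registry.example.org`, the store `https://store.example.org`, and the function names are hypothetical.

```python
import re

# Hypothetical project-specific policies.
ALLOWED_ALGORITHMS = [r"^registry\.example\.org/myproject/[a-z]+$"]
ALLOWED_ALGORITHM_STORES = [r"^https://store\.example\.org$"]


def matches_any(value: str, patterns: list[str]) -> bool:
    """Return True if `value` matches at least one pattern from the start."""
    return any(re.match(pattern, value) for pattern in patterns)


def is_task_allowed(image: str, store_url: str, allow_either: bool = False) -> bool:
    """Combine the image policy and the store policy.

    With allow_either=False (the default), both policies must pass.
    With allow_either=True, passing one of them is enough.
    """
    image_ok = matches_any(image, ALLOWED_ALGORITHMS)
    store_ok = matches_any(store_url, ALLOWED_ALGORITHM_STORES)
    return (image_ok or store_ok) if allow_either else (image_ok and store_ok)


# An image from an unknown registry, offered through the trusted store:
print(is_task_allowed("docker.io/attacker/glm", "https://store.example.org"))
print(is_task_allowed("docker.io/attacker/glm", "https://store.example.org", allow_either=True))
```

With the default setting the first call is rejected, because a compromised store alone is not enough. The second call shows why setting `allow_either_whitelist_or_store` to `true` weakens the two-policy setup described above.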

2.5.6. Logging

To configure the logger, look at the logging section in the example configuration file in All configuration options.

Useful commands:

  1. v6 node files: shows you where the log file is stored

  2. v6 node attach: shows live logs of a running node in your current console. This can also be achieved when starting the node with v6 node start --attach
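
The options in the logging section correspond closely to Python's standard `logging` module. The sketch below shows one plausible way such settings could be applied. It is an illustration under assumptions, not the node's actual logging setup. It assumes `max_size` is given in kB and converted to bytes, and the function name `build_logger`, the logger name `node-sketch`, and the use of `delay=True` (so the log file is only opened on the first write) are choices made for this example.

```python
import logging
from logging.handlers import RotatingFileHandler

# A subset of the `logging` section from the example configuration.
logging_config = {
    "level": "DEBUG",
    "use_console": True,
    "backup_count": 5,
    "max_size": 1024,  # assumed to be in kB
    "format": "%(asctime)s - %(name)-14s - %(levelname)-8s - %(message)s",
    "datefmt": "%Y-%m-%d %H:%M:%S",
    "loggers": [{"name": "urllib3", "level": "warning"}],
}


def build_logger(cfg: dict, log_file: str, name: str = "node-sketch") -> logging.Logger:
    """Create a logger with a rotating file handler and optional console output."""
    formatter = logging.Formatter(cfg["format"], cfg["datefmt"])
    logger = logging.getLogger(name)
    logger.setLevel(cfg["level"])

    # Rotate once the file reaches max_size kB; keep backup_count old files.
    file_handler = RotatingFileHandler(
        log_file,
        maxBytes=cfg["max_size"] * 1024,
        backupCount=cfg["backup_count"],
        delay=True,
    )
    file_handler.setFormatter(formatter)
    logger.addHandler(file_handler)

    if cfg["use_console"]:
        console_handler = logging.StreamHandler()
        console_handler.setFormatter(formatter)
        logger.addHandler(console_handler)

    # Per-logger overrides, e.g. to mute verbose third-party libraries.
    for entry in cfg.get("loggers", []):
        logging.getLogger(entry["name"]).setLevel(entry["level"].upper())
    return logger
```

Calling `build_logger(logging_config, "node.log").info("started")` would then write a formatted line to both the console and `node.log`. With the values above, up to five rotated files of about 1 MB each are kept next to the active log file.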