2.4. Configure

The vantage6 node requires a configuration file to run. This is a YAML file with a specific format, which is used to start the node using its Helm chart.

The next sections describe how to configure the node. They first give a few quick answers on setting up your node and then show an example that lists all configuration file options. Next, we explain where your vantage6 configuration files are stored, and finally how to configure the node so that your data is optimally protected.

2.4.1. How to create a configuration file?

The easiest way to create an initial configuration file is to run v6 node new, which lets you configure the basic settings. For the more advanced configuration options, which are listed below, see the example configuration file.

2.4.2. Where is my configuration file?

To see where your configuration file is located, you can use the following command:

v6 node files

Warning

This command will not work if you have put your configuration file in a custom location. Also, you may need to specify the --system flag if you put your configuration file in the system folder.

2.4.3. All configuration options

The following configuration file is an example that intends to list all possible configuration options.

You can download this file here: node_config.yaml

# Set the namespace in which the node will be deployed. Optional, by default the
# Release.Namespace is used
namespace: vantage6-node

node:

  # The name of the node.
  name: node-1

  # The API key of the node.
  apiKey: abcdefghijklmnopqrstuvwxyz1234567890

  # The image to use for the node.
  image: ghcr.io/vantage6/infrastructure/node:latest

  # Node image pull policy. Possible values are: Always (default), IfNotPresent, Never
  imagePullPolicy: Always

  # The URL, port and API path to the vantage6 HQ
  hq:
    url: http://my-vantage6-hq.org
    port: 7601
    path: /api

  # Namespace in which the task kubernetes resources are created. This must be a
  # namespace where the node has access to create pods.
  taskNamespace: vantage6-tasks

  # Kubernetes node name, used for local persistent volumes
  k8sNodeName: docker-desktop

  # Keycloak configuration to authenticate the node.
  keycloak:
    url: http://my-vantage6-auth.org
    realm: vantage6

  encryption:
    # Whether encryption is enabled or not. This should be the same as the `encrypted`
    # setting of the collaboration to which this node belongs.
    enabled: false

    # Location to the private key file. Required if encryption is enabled.
    private_key: /path/to/private_key.pem

  # Whether kubernetes secrets should be created for the node. If set to False, it will
  # be expected that the secrets are already created. By default true.
  createSecrets: true

  # Port for the node proxy to run on
  proxyPort: 7654

  # Storage settings on host of the node machine. This defines where the database is
  # stored as well as the task directory (which will contain the input/output of the
  # tasks).
  persistence:
    tasks:
      storageClass: local-storage
      size: 10Gi
      hostPath: /path/to/where/task/files/are/stored
    database:
      storageClass: local-storage
      size: 5Gi

  # Database settings
  databases:
    # File-based databases - these must be existing local files on the node machine.
    # They will be mounted in the node container.
    fileBased:
    - name: olympic_athletes_db
      uri: /my/local/path/to/data/olympic_athletes_2016.csv
      type: csv
      volumePath: /my/local/path/to/data
      originalName: olympic_athletes_2016.csv
      # OPTIONAL: controls how file/folder databases are exposed to algorithm
      # containers.
      # - copy (default): copy file data into the node data volume
      # - ro: bind-mount host path read-only into algorithm containers
      # NOTE: For now, `ro` has tested support for Linux hosts only. On
      # Windows/macOS hosts, use `copy` (default).
      # Use `ro` only for file/folder-backed data sources.
      mount_mode: copy
    # Service-based databases - these are databases that are running on the node
    # machine. They will be accessed using the URI.
    serviceBased:
    - name: myPostgres
      uri: postgres://vantage6-postgres:5432/vantage6
      type: other

      # The environment variables set here will be passed to the algorithm containers as
      # DATABASE_<NAME>_<ENV_VAR_NAME>. In this example, the node will set the
      # DATABASE_MYPOSTGRES_USER and DATABASE_MYPOSTGRES_PASSWORD environment variables.
      env:
        USER: vantage6
        PASSWORD: vantage6
      # It is also possible not to specify the details of the service-based databases
      # in this configuration file, but elsewhere in kubernetes (e.g. a parent chart's
      # values.yaml file).
      #
      # In this case, the node expects the following environment variables to be set:
      #
      # DATABASE_LABELS: comma-separated list of database names
      # DATABASE_<NAME>_URI: URI of the database
      # DATABASE_<NAME>_TYPE: type of the database
      #
      # Optionally, you can also specify additional environment variables for each
      # database by setting DATABASE_<NAME>_<ENV_VAR_NAME> variables, for example:
      #
      # DATABASE_<NAME>_MY_ENV_VAR: my_value
      #
      # It is recommended to do this through Kubernetes secrets.

  # Whether or not your node shares some configuration (e.g. which images are allowed
  # to run on your node) with HQ. This can be useful for other organizations in your
  # collaboration to understand why a task is not completed. Obviously, no sensitive
  # data is shared. Default true.
  share_config: true

  # Whether or not to share algorithm logs with HQ. Otherwise they will only be
  # displayed as part of the node logs. Default is true.
  # NOTE: It's recommended to set this to false when using sensitive data
  share_algorithm_logs: true

  # Define who is allowed to run which algorithms on this node.
  policies:
    # Control which algorithm images are allowed to run on this node. This is
    # expected to be a valid regular expression. It's important to set this policy when
    # using sensitive data to control which algorithms are allowed to run on the node.
    allowed_algorithms:
      - ^ghcr\.io/vantage6/algorithm/[a-zA-Z]+/[a-zA-Z]+$
      - ^myalgorithm\.ai/some-algorithm

    # It is also possible to allow all algorithms from a particular algorithm store. Set
    # these stores here. They may be strings or regular expressions. If you don't
    # specify this, algorithms from any store are allowed (unless other policies prevent
    # this).
    allowed_algorithm_stores:
      # allow all algorithms from the vantage6 community store
      - https://store.uluru.vantage6.ai
      # allow any store that is a subdomain of vantage6.ai
      - ^https://[a-z]+\.vantage6\.ai$

    # If you define both `allowed_algorithm_stores` and `allowed_algorithms`, you can
    # choose to only allow algorithms that both policies allow, or you can allow
    # algorithms that either of them allows. By default, this is False: only algorithms
    # that are given in *both* the `allowed_algorithms` and `allowed_algorithm_stores`
    # are allowed.
    allow_either_whitelist_or_store: false

    # Define which users are allowed to run algorithms on your node by their ID
    allowed_users:
      - 2
    # Define which organizations are allowed to run images on your node by
    # their ID or name
    allowed_organizations:
      - 6
      - root

    # Whether or not to always require that the algorithm image is successfully pulled
    # before running it. This ensures that no potentially outdated local images are used
    # if internet connection is not available. Default value is true.
    require_algorithm_pull: true

  # Whitelisting settings to specify external addresses algorithms are allowed to
  # connect to. Only Kubernetes network policies are supported. See:
  # https://kubernetes.io/docs/concepts/services-networking/network-policies
  # Only add egress destinations that are absolutely necessary for your algorithm
  # to function (for example, a required database server). Whitelists are defined
  # per task type. The allowed task types are: data_extraction, preprocessing,
  # federated_compute, central_compute, postprocessing.
  #
  # Each entry in the whitelist defines one allowed connection path. Multiple entries
  # are OR'd together (traffic is allowed if it matches any entry). You can specify:
  # - namespaceSelector: to allow connections to all pods in a namespace or a subset of
  #   pods in a namespace when used in combination with podSelector
  # - podSelector: to allow connections to pods matching labels (make sure that the
  #   namespace of this pod is whitelisted)
  # - ipBlock: to allow connections to IP addresses or CIDR ranges (e.g. a database
  #   server that is not part of the cluster)
  whitelist:
    data_extraction:
      egress:
      # Whitelist the namespace 'frontend' and 'backend'
      - namespaceSelector:
          matchExpressions:
          - key: namespace
            operator: In
            values: ["frontend", "backend"]
      # Whitelist the namespace 'ops' and the pod (within this namespace) with the
      # label 'app: ledger'
      - namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: ops
        podSelector:
          matchLabels:
            app: ledger
    central_compute:
      egress:
        # Whitelist the IP block 10.20.0.0/16
        - ipBlock:
            cidr: 10.20.0.0/16

  # Prometheus settings, for sending system metadata to HQ.
  prometheus:
    # Whether or not to enable Prometheus reporting. Default is false.
    enabled: false

    # Interval (in seconds) at which the node sends system metadata to HQ.
    # This should align with the Prometheus scrape_interval to avoid stale data.
    report_interval_seconds: 45

  # If your project uses a private Docker registry for algorithm images, you can specify
  # them here so that the node can login and pull private images.
  private_docker_registries:
    - registry: my-private-registry.com
      username: my-username
      password: my-password

  # Logging settings for the node.
  logging:
    # Controls the logging output level. Could be one of the following
    # levels: CRITICAL, ERROR, WARNING, INFO, DEBUG, NOTSET
    level: DEBUG

    # Location to the log file
    file: node.log

    # Size in kb of a single log file
    max_size: 1024
    use_console: true

    # Date format for the log file
    datefmt: "%Y-%m-%d %H:%M:%S"

    # Format for the log file
    format: "%(asctime)s - %(name)-14s - %(levelname)-8s - %(message)s"

    # Storage configuration for logs. Storage size is set to 128M by default,
    # and storage class is k8s storage class that is used for the node.
    storageSize: "128M"
    storageClass: "local-storage"

    # Host path for storing the logs (required for local-storage PV)
    volumeHostPath: "/var/log/vantage6-node"

    # Maximum number of log files to keep. Log files are rotated when the size of the
    # current log file exceeds `max_size`.
    backup_count: 5

    # Loggers to include in the log file
    loggers:
    - level: warning
      name: urllib3
    - level: warning
      name: socketIO-client
    - level: warning
      name: socketio.server
    - level: warning
      name: engineio.server
    - level: warning
      name: sqlalchemy.engine

  # Additional debug flags
  debug:
    # Set to `true` to put Flask/socketio into debug mode.
    socketio: false
    # Set to `true` to set the Flask app used for the LOCAL proxy service
    # into debug mode
    proxy_server: false

  # Development settings - these should ONLY be used when running `v6 dev` environment,
  # which will use these settings automatically.
  dev:
    # Set extension for the task directory. In the development environment, the task
    # directory is mounted as a volume and shared by multiple nodes. This extension is
    # added to the task directory to avoid conflicts in storing the task results for
    # this node. It should not be used outside of the ``v6 dev`` environment.
    task_dir_extension: my_node_1
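
The naming rule for database environment variables described in the `serviceBased` part of the example above can be sketched in Python. This is a minimal illustration, not the node's actual implementation: the helper `database_env_vars` is hypothetical, and it assumes that the database name is simply upper-cased to form the variable prefix, as the `DATABASE_MYPOSTGRES_USER` example suggests.

```python
from typing import Optional


def database_env_vars(
    name: str, uri: str, db_type: str, env: Optional[dict] = None
) -> dict:
    """Build the DATABASE_<NAME>_* variables for one database (hypothetical helper)."""
    # Assumption: the database name is upper-cased to form the prefix,
    # e.g. "myPostgres" -> "DATABASE_MYPOSTGRES".
    prefix = f"DATABASE_{name.upper()}"
    variables = {f"{prefix}_URI": uri, f"{prefix}_TYPE": db_type}
    # Extra settings become DATABASE_<NAME>_<ENV_VAR_NAME>.
    for key, value in (env or {}).items():
        variables[f"{prefix}_{key}"] = value
    return variables


postgres_vars = database_env_vars(
    name="myPostgres",
    uri="postgres://vantage6-postgres:5432/vantage6",
    db_type="other",
    env={"USER": "vantage6", "PASSWORD": "vantage6"},
)
for variable, value in sorted(postgres_vars.items()):
    print(f"{variable}={value}")
```

A mapping like this can help when the variables are defined outside the configuration file, for example in a Kubernetes secret, because it shows which names the node expects for a given database.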

2.4.4. Configuration file location

The directory where the configuration file is stored depends on your operating system (OS). It is possible to store the configuration file at system or at user level. By default, node configuration files are stored at user level, which makes this configuration available only for your user.

The default directories per OS are as follows:

Operating System	System-folder	User-folder
Windows	`C:\ProgramData\vantage\node\`	`C:\Users\<user>\AppData\Local\vantage\node\`
macOS	`/Library/Application Support/vantage6/node/`	`/Users/<user>/Library/Application Support/vantage6/node/`
Linux	`/etc/vantage6/node/`	`/home/<user>/.config/vantage6/node/`

Note

The v6 node command looks in these directories by default. You can, however, store the configuration file in any directory and pass its location with the --config flag. Note that you then have to specify the --config flag every time you execute a v6 node command!

Similarly, you can put your node configuration file in the system folder by using the --system flag. Note that in that case, you have to specify the --system flag for all v6 node commands.
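
The lookup rules above can be summarized in a short Python sketch. It is illustrative only: `resolve_config_path` is a hypothetical helper, it covers just the Linux folders from the table, and it assumes that the file is named `<name>.yaml`, which the text above does not state.

```python
from pathlib import PurePosixPath
from typing import Optional

# Default Linux folders, taken from the table above.
SYSTEM_FOLDER = PurePosixPath("/etc/vantage6/node")
USER_FOLDER_TEMPLATE = "/home/{user}/.config/vantage6/node"


def resolve_config_path(
    name: str, user: str, system: bool = False, config: Optional[str] = None
) -> PurePosixPath:
    """Return the path where a node configuration would be looked up."""
    # An explicit --config location takes precedence over the default folders.
    if config is not None:
        return PurePosixPath(config)
    if system:
        folder = SYSTEM_FOLDER
    else:
        folder = PurePosixPath(USER_FOLDER_TEMPLATE.format(user=user))
    # Assumption: the file name is derived from the node name.
    return folder / f"{name}.yaml"


print(resolve_config_path("node-1", user="alice"))
print(resolve_config_path("node-1", user="alice", system=True))
print(resolve_config_path("node-1", user="alice", config="/opt/v6/node-1.yaml"))
```

The sketch also shows why the --system or --config flag has to be repeated on every command: without it, the lookup falls back to the user folder.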

2.4.5. Security

As a data owner it is important that you take the necessary steps to protect your data. Vantage6 allows algorithms to run on your data and share the results with other parties. Vantage6 offers maximum flexibility when it comes to algorithms, but this also means that you (or someone you trust) need to validate that algorithms do what they promise and are secure before allowing them to run on your data.

Once you have approved an algorithm, it is important that you can verify that the algorithm running on your data is the one you approved. There are two important steps to accomplish this:

First, set policies on the allowed algorithms in the policies section of the node configuration file. You can specify a list of regular expressions here. Some examples of what you could define (note that these examples overlap, so in practice you would not use all of them):
```
policies:
   allowed_algorithms:
      - ^ghcr\.io/vantage6/algorithm/[a-zA-Z]+/[a-zA-Z]+
      - ^ghcr\.io/vantage6/algorithm/glm$
      - ^ghcr\.io/vantage6/algorithm/glm@sha256:82becede498899ec668628e7cb0ad87b6e1c371cb8a1e597d83a47fac21d6af3$
   allowed_algorithm_stores:
      - https://store.uluru.vantage6.ai
```
These four examples lead to the following restrictions:

1. ^ghcr\.io/vantage6/algorithm/[a-zA-Z]+/[a-zA-Z]+: allow all images from the ghcr.io/vantage6 registry that start with algorithm/
2. ^ghcr\.io/vantage6/algorithm/glm$: only allow the GLM image, but all builds of this image
3. ^ghcr\.io/vantage6/algorithm/glm@sha256:82becede498899ec668628e7cb0ad87b6e1c371cb8a1e597d83a47fac21d6af3$: allow only this specific build of the GLM image to run on your data
4. https://store.uluru.vantage6.ai: allow all algorithms from the Uluru community algorithm store. Only the most recent version of the algorithm uploaded to the store will be allowed to run on your data.

By default, only algorithms that fulfill both the allowed_algorithms and allowed_algorithm_stores policies are allowed to run. You can change this by setting the allow_either_whitelist_or_store policy to true.

Note that you can also define regular expressions for the algorithm stores, and that you can combine the two policies. The section Configuring algorithm access to the data below explains the considerations you need to take into account when setting these policies.

Second, enable DOCKER_CONTENT_TRUST to verify the origin of the image. For more details, see the Docker documentation.

Warning

By enabling DOCKER_CONTENT_TRUST you might not be able to use certain algorithms. You can check this by verifying that the images you want to use are signed.
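
Before adding regular expressions to allowed_algorithms, it can help to try them against a few image names. The sketch below uses Python's `re` module for this. It is a stand-alone check under stated assumptions, not the node's own matching code: the helper `matches_policy` and the "-modified" image names are hypothetical, and any preprocessing the node may apply to image names (such as handling tags or digests) is not modelled.

```python
import re

ALLOWED_ALGORITHMS = [
    r"^ghcr\.io/vantage6/algorithm/glm$",
    r"^myalgorithm\.ai/some-algorithm",  # no trailing "$": any suffix is accepted
]


def matches_policy(image: str, patterns=ALLOWED_ALGORITHMS) -> bool:
    """Return True if the image name matches at least one pattern."""
    # re.search is sufficient because the patterns carry their own anchors.
    return any(re.search(pattern, image) for pattern in patterns)


for image in [
    "ghcr.io/vantage6/algorithm/glm",
    "ghcr.io/vantage6/algorithm/glm-modified",
    "myalgorithm.ai/some-algorithm",
    "myalgorithm.ai/some-algorithm-modified",
]:
    print(f"{image}: {matches_policy(image)}")

# An unescaped "." matches any character, so this pattern is too permissive.
loose_pattern = [r"^ghcr.io/vantage6/algorithm/glm$"]
print(matches_policy("ghcrXio/vantage6/algorithm/glm", loose_pattern))
```

The second pattern accepts the "-modified" image because it has no closing `$`, and the unescaped dot in the last pattern accepts a different host name. Both are easy to overlook when writing a policy by hand.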

Configuring algorithm access to the data

As explained above, you can specify a list of allowed algorithms in the configuration file of the data station. Only algorithms specified on that list, by providing the names of the images of these algorithms, are allowed to run on the data station. Also, you can specify the exact (non-forgeable) hash (i.e. version) of the trusted algorithm. Note that this process requires manual updates to the data station configuration, as well as a data station restart, each time that a new algorithm is approved or an existing algorithm is updated.

It is also possible to allow a set of algorithms at once by providing a pattern, i.e. a regular expression. This makes it possible, for example, to allow a certain directory of algorithms. The disadvantage of this approach is that if an attacker (or IT personnel with malicious intent) manages to get access to that directory, a malicious algorithm placed there would pass the filter of allowed algorithms. Similarly, specifying single algorithms without hashes is not fully secure if an attacker can access that address.

A third possibility is to allow algorithms from a trusted algorithm store. The benefit of this is that the algorithm store already manages which algorithms are currently allowed, including the most up-to-date version information. When an algorithm is updated, the store automatically tells the node to allow only the new version. The disadvantage of this approach is that if an attacker gains access to the store, the node is not protected from malicious algorithms.

The safest policy regarding allowed algorithms is to specify an exact list of all allowed algorithms, including the version (specified by the image hash), at the node. However, this also entails a significant maintenance burden if the algorithms are updated frequently. Institutes following this policy would have to log in to their data station for every algorithm change to update the allowed algorithm configuration. Although this is a quick update, it still requires a manual action every time. As a manual action, it is also error-prone. Errors will probably prevent the algorithm from running successfully on that node. Alternatively, manual errors may lead to security concerns, but this is less likely.
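
Because hand-written patterns are error-prone, a hash-pinned entry can also be generated. The following Python sketch builds such an expression with `re.escape`; the helper name `pinned_pattern` is hypothetical, and the digest is the one used in the example policies earlier in this section.

```python
import re


def pinned_pattern(image: str, digest: str) -> str:
    """Return a regular expression that matches exactly one image build."""
    # re.escape neutralizes characters such as "." that are special in a regex.
    return "^" + re.escape(f"{image}@{digest}") + "$"


digest = "sha256:82becede498899ec668628e7cb0ad87b6e1c371cb8a1e597d83a47fac21d6af3"
pattern = pinned_pattern("ghcr.io/vantage6/algorithm/glm", digest)
print(pattern)

# The pattern accepts the pinned build and nothing else.
print(bool(re.search(pattern, f"ghcr.io/vantage6/algorithm/glm@{digest}")))
print(bool(re.search(pattern, "ghcr.io/vantage6/algorithm/glm@sha256:" + "0" * 64)))
```

The printed pattern can be pasted into the allowed_algorithms list. The node still needs the configuration update and restart described above before the change takes effect.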

If your project has a separate algorithm store and image registry, a good alternative is to define two policies at the node that enforce restrictions on both the algorithm store and the registry. One policy allows only algorithms from the project’s own algorithm store, and the other allows only algorithms from the project’s own image registry. That way, an attacker would need to gain access to the private registry, the algorithm store and HQ before being able to send a malicious task. The probability of a successful attack on all of these components is much lower than that of a successful attack on a single component.
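
The way the two policies combine can be sketched as a small decision function. This is a simplified model of the behaviour described for allow_either_whitelist_or_store, not the node's actual code: the function `algorithm_allowed` is hypothetical, and treating an unconfigured policy as "no restriction" is an assumption.

```python
from typing import Optional


def algorithm_allowed(
    image_ok: Optional[bool],
    store_ok: Optional[bool],
    allow_either: bool = False,
) -> bool:
    """Combine the outcomes of allowed_algorithms and allowed_algorithm_stores.

    A value of None means that the corresponding policy is not configured.
    """
    configured = [result for result in (image_ok, store_ok) if result is not None]
    if not configured:
        # Assumption: without any policy, nothing is restricted.
        return True
    # Default (allow_either=False): every configured policy must accept.
    return any(configured) if allow_either else all(configured)


# Image comes from a trusted store but from an unknown registry.
print(algorithm_allowed(image_ok=False, store_ok=True))
print(algorithm_allowed(image_ok=False, store_ok=True, allow_either=True))
```

With the default setting, a compromised store alone is not enough to get an image accepted, which is the layered protection this paragraph relies on. Setting allow_either_whitelist_or_store to true removes that layering.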

2.4.6. Logging

To configure the logger, look at the logging section in the example configuration file in All configuration options.

Useful commands:

v6 node files: shows you where the log file is stored
v6 node attach: shows live logs of a running node in your current console. This can also be achieved when starting the node with v6 node start --attach
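
The `format` and `datefmt` values in the logging section use the syntax of Python's standard `logging` module. Assuming the other options map onto that module in the obvious way (which this page does not confirm), their effect can be approximated with the sketch below. The function `build_logger` and the logger name `demo-node` are hypothetical.

```python
import logging
import logging.handlers
import os
import tempfile

# A subset of the logging section from the example configuration.
LOGGING_CONFIG = {
    "level": "DEBUG",
    "file": "node.log",
    "max_size": 1024,  # size in kb of a single log file
    "backup_count": 5,
    "datefmt": "%Y-%m-%d %H:%M:%S",
    "format": "%(asctime)s - %(name)-14s - %(levelname)-8s - %(message)s",
    "loggers": [{"level": "warning", "name": "urllib3"}],
}


def build_logger(name: str, config: dict, log_dir: str) -> logging.Logger:
    """Create a rotating file logger from the settings above."""
    handler = logging.handlers.RotatingFileHandler(
        os.path.join(log_dir, config["file"]),
        maxBytes=config["max_size"] * 1024,  # kb -> bytes
        backupCount=config["backup_count"],
    )
    handler.setFormatter(logging.Formatter(config["format"], config["datefmt"]))
    logger = logging.getLogger(name)
    logger.setLevel(config["level"])
    logger.addHandler(handler)
    # Per-library levels, used to quieten verbose dependencies.
    for entry in config["loggers"]:
        logging.getLogger(entry["name"]).setLevel(entry["level"].upper())
    return logger


with tempfile.TemporaryDirectory() as log_dir:
    logger = build_logger("demo-node", LOGGING_CONFIG, log_dir)
    logger.info("node started")
    with open(os.path.join(log_dir, "node.log")) as log_file:
        print(log_file.read(), end="")
    # Close the handler so that the temporary directory can be removed.
    for handler in list(logger.handlers):
        handler.close()
        logger.removeHandler(handler)
```

In this model, a log file is rotated once it exceeds `max_size`, and at most `backup_count` rotated files are kept next to the current one.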