.. _main-intro: Introduction ============ In many research projects, data is distributed across multiple organizations. This makes it difficult to perform analyses that require data from multiple sources, as the data owners don't want to share their data with others. Vantage6 is a platform that enables privacy-enhancing analyses on distributed data. It allows organizations to collaborate on analyses while only sharing aggregated results, not the raw data. As a user, you can use vantage6 to run your algorithms on sensitive data. In order to create the tasks to run your algorithms, you need to understand how vantage6 works. We will first explain the basic architecture of vantage6, followed by a description of the resources available in vantage6. Using those concepts, we will explain give an example of a simple algorithm and explain how it is run within vantage6. .. _vantage6-components-intro: Vantage6 components ------------------- In vantage6, a **client** can pose a question to the **headquarters** (HQ). The headquarters knows which organizations work together in a network. Each organization contributes one **node** to the network that may contain sensitive data. HQ alerts the nodes of new questions. Nodes then fetch **algorithms** to answer the question. When the algorithm completes, the node sends the non-sensitive, aggregated results back to HQ. .. _architecture-figure: .. uml:: !theme superhero-outline left to right direction skinparam nodesep 80 skinparam ranksep 80 rectangle client as "Client" rectangle hq as "HQ" client --> hq rectangle node1 as "Node 1" rectangle node2 as "Node 2" hq <-- node1 hq <-- node2 .. TODO reactivate this dashed node line when communication between nodes is implemented .. node1 <-r[dashed]-> node2 The roles of these vantage6 components are as follows: * **Headquarters** coordinates communication with clients and nodes. HQ tracks the status of the computation requests, stores the results, and keeps track of who is allowed to do what. * **Node(s)** have access to data and execute algorithms * **Clients** (i.e. users or applications) request computations from the nodes via the client * **Algorithms** are scripts that are run on the sensitive data. Each algorithm is packaged in a container image; the node pulls the image from a container registry and runs it on the local data. Note that the node owner can control which algorithms are allowed to run on their data. Headquarters is part of the vantage6 **hub**, which is the collection of all central components of the vantage6 infrastructure. Apart from HQ, the hub contains the following important components: - **Authentication service**: The authentication service for the vantage6 network. - **Algorithm store**: A place to share vantage6 algorithms (optional). - **User interface**: A web interface to use vantage6 (optional). In addition, the hub can also spin up more services, such as message brokers, databases, and monitoring services, if the configuration specifies so. On a technical level, vantage6 may be seen as a container orchestration tool for privacy-preserving analyses. It deploys a network of containerized applications that together ensure insights can be exchanged without sharing record-level data. .. _components: Vantage6 resources ------------------ There are several entities in vantage6, such as users, organizations, tasks, etc. These entities are created by users that have sufficient permission to do so and are stored in a database that is managed by HQ. This process ensures that the right people have the right access to the right actions, and that organizations can only collaborate with each other if they agree to do so. The following statements and the figure below should help you understand their relationships. - A **collaboration** is a collection of one or more **organizations**. - For each collaboration, each participating organization needs a **node** to compute tasks. When a collaboration is created, accounts are also created for the nodes so that they can securely communicate with HQ. - Collaborations can contain **studies**. A study is a subset of organizations from the collaboration that are involved in a specific research question. By setting up studies, it can be easier to send tasks to a subset of the organizations in a collaboration and to keep track of the results of these analyses. - Each organization has zero or more **users**. - The permissions of the user are defined by the assigned **rules**. - It is possible to collect multiple rules into a **role**, which can also be assigned to a user. - Each collaboration can contain multiple **sessions** in which data may be analysed. A session can contain multiple **dataframes**. A dataframe is a collection of data retrieved from the original source database that is stored on the node. A dataframe can be modified by additional user defined **preprocessing** steps and can be used as input for **tasks**. - Users can create **tasks** for one or more organizations within a collaboration and session. Tasks lead to the execution of the algorithms. - A task should produce an algorithm **run** for each organization involved in the task. The **results** are part of such an algorithm run. The following schema is a *simplified* version of the database. The `1-n`, `0-n` and `n-n` relationships describe one-to-many, zero-to-many and many-to-many relationships, respectively. .. uml:: !theme superhero-outline skinparam nodesep 100 skinparam ranksep 100 left to right direction skinparam linetype polyline rectangle Collaboration rectangle Node rectangle Organization rectangle Session rectangle DataFrame rectangle Study rectangle Task rectangle Result rectangle User rectangle Role rectangle Rule Collaboration "1" -- "n" Node Collaboration "n" -- "n" Organization Collaboration "1" -- "n" Study Collaboration "1" - "n" Session Collaboration "1" -- "n" Task Study "n" -left- "n" Organization Study "1" -right- "n" Task Task "n" -right- "1" Session Node "n" -right- "1" Organization Organization "1" -- "n" User Organization "0" -- "n" Role Task "1" - "n" Result Session "n" -left- "1" User Session "1" -- "n" DataFrame User "n" -left- "n" Role Role "n" -- "n" Rule User "n" -- "n" Rule A simple federated average algorithm ------------------------------------ To compute an average, you usually sum all the values and divide them by the number of values. In Python, this can be done as follows: .. code:: python x = [1,2,3,4,5] average = sum(x) / len(x) In a federated data set the values for `x` are distributed over multiple locations. Let's assume `x` is split into two parties: .. code:: python a = [1,2,3] b = [4,5] In this case we can compute the average as: .. code:: python average = (sum(a) + sum(b))/(len(a) + len(b)) The goal is to compute the average without sharing the individual numbers. In the case of an average algorithm, each node therefore shares only the sum and the number of elements in the dataset. By summing the sums and dividing by the sum of the number of elements, the average can be found. This way, the individual numbers are never shared: .. code:: python # on node 1 a = [1,2,3] return {"sum": sum(a), "count": len(a)} # on node 2 b = [4,5] return {"sum": sum(b), "count": len(b)} # computing the average of both nodes average = (node_1["sum"] + node_2["sum"]) / (node_1["count"] + node_2["count"]) How algorithms work in vantage6 ------------------------------- The average algorithm explained above can be separated in a central part and a federated part. The federated part uses the data to compute the sum and the number of elements. The central part is the aggregation of these results. In order to do so, it is also responsible to start the federated parts and to collecting their results. Note that for more complex algorithms, this can be an iterative process: the central part can send new tasks to the federated parts based on the results of the previous round of federated tasks. .. figure:: /images/algorithm_central_and_subtasks.png :alt: Algorithm hierarchy :align: center Common task hierarchy in vantage6. The user (left) creates a task for the central part of the algorithm (pink hexagon). The central part creates subtasks for the federated parts (green hexagons). When the subtasks are finished, the central part collects the results and computes the final result, which is then available to the user. Now, let's see how this works in vantage6. It is easy to confuse HQ with the central part of the algorithm: HQ is the central part of the infrastructure but not the place where the central part of the algorithm is executed (:numref:`algorithm-flow`). The central part is actually executed at one of the nodes, because it gives more flexibility: for instance, an algorithm may need heavy compute resources to do the aggregation, and it is better to do this at a node that has these resources rather than having to upgrade HQ's resources whenever a new algorithm needs more resources. .. figure:: /images/task_journey.png :name: algorithm-flow :alt: algorithm-flow :align: center The flow of the average algorithm in vantage6. The user creates a task for the central part of the algorithm. This is registered at HQ, and leads to the creation of a central algorithm container on one of the nodes. The central algorithm then creates subtasks for the federated parts of the algorithm, which again are registered at HQ. All nodes for which the subtask is intended start their work by executing the federated part of the algorithm. The nodes send the results back to HQ, from where they are picked up by the central algorithm. The central algorithm then computes the final result and sends it to HQ, where the user can retrieve it. .. note:: It is also possible for the user to create the subtasks directly, and to compute the central part of the algorithm themselves. However, this is not the most common approach as it is generally easier to let the central algorithm do the work. How to run algorithms in vantage6 --------------------------------- Once you have set up a vantage6 hub and nodes, you are ready to run your algorithms. You can create tasks from the :ref:`web interface `, the :ref:`Python client ` or by interacting with the :ref:`API ` directly. There are a number of public algorithms available from the :ref:`community algorithm store `. :ref:`Linking this store ` to your HQ will allow you to quickly get a set of algorithms that you can run on your nodes. You can also develop your own vantage6 algorithms. The only requirement is that you package the algorithm in a container image that vantage6 can run. The focus of vantage6 is on setting up an infrastructure to run algorithms on sensitive data and ensuring that the data is kept private - the algorithm implementation is kept highly flexible. The freedom in defining the code also allows you to use federated learning libraries such as `PySyft `_, `TensorFlow `_ or `Flower `_ within your vantage6 algorithm. Also, it is not only possible to run federated algorithms, but also MPC algorithms or other protocols. .. note:: Vantage6 tries to limit the definition of algorithms as little as possible. This means that within a project, it should be established which algorithms are allowed to run on the nodes. Review of this code - or trust in persons that have created the algorithm - is the responsibility of each node owner. They are ultimately in control over which algorithms are run on their data. Vantage6 is designed to be as flexible as possible, so you can use any programming language and any libraries you like. Python is the most common language to use within the vantage6 community, and also has the most :ref:`tools ` available to help you with algorithm development.