.. _algo-code_structure:

Algorithm code structure
========================

.. note::

    These guidelines are Python specific.

Here we provide some more information on how algorithm code is organized.
Most of these structures are generated automatically when you create a
:ref:`personalized algorithm starting point `. We detail them
here so that you understand why the algorithm code is structured as it is, and
so that you know how to modify it if necessary.

Defining functions
------------------

The functions that will be available to the user have to be defined in the
``__init__.py`` file at the base of your algorithm module. Other than that, you
have complete freedom in which functions you implement.

Vantage6 algorithms commonly have an orchestrator or aggregator part and a
remote part. The orchestrator part is responsible for combining the partial
results of the remote parts. The remote part is usually executed at each of the
nodes included in the analysis. While this structure is common for vantage6
algorithms, it is not required.

If you do follow this structure, however, we recommend the following file
structure:

.. code:: bash

    my_algorithm/
    ├── __init__.py
    ├── central.py
    └── partial.py

where ``__init__.py`` contains the following:

.. code:: python

    from .central import my_central_function
    from .partial import my_partial_function

and where ``central.py`` and ``partial.py`` contain the implementation of
those functions.

.. _implementing-decorators:

Implementing the algorithm functions
------------------------------------

Let's say you are implementing a function called ``my_function``:

.. code:: python

    def my_function(column_name: str):
        pass

You have complete freedom as to what arguments you define in your function;
``column_name`` is just an example. Note that these arguments have to be
provided by the user when the algorithm is called. This is explained
:ref:`here ` for the Python client.

Often, you will want to use the data that is available at the node.
This data can be provided to your algorithm function in the following way:

.. code:: python

    import pandas as pd
    from vantage6.algorithm.tools.decorators import data

    @data(2)
    def my_function(df1: pd.DataFrame, df2: pd.DataFrame, column_name: str):
        pass

The ``@data(2)`` decorator indicates that the first two arguments of the
function are dataframes that should be provided by the vantage6 infrastructure.
In this case, the user would have to specify two databases when calling the
algorithm. Depending on the type of database used, the user may also have to
specify additional parameters such as a SQL query or the name of a worksheet in
an Excel file.

It is also possible to specify ``@data()`` without an argument. In that case,
a single dataframe is added to the arguments.

For some data sources it's not trivial to construct a dataframe from the data.
One of these data sources is the OHDSI OMOP CDM database. For this data source,
the ``@database_connection`` decorator is available:

.. code:: python

    from rpy2.robjects import RS4
    from vantage6.algorithm.tools.decorators import (
        database_connection, OHDSIMetaData
    )

    @database_connection(types=["OMOP"], include_metadata=True)
    def my_function(connection: RS4, metadata: OHDSIMetaData):
        pass

This decorator provides the algorithm with a database connection that can be
used to interact with the database. For instance, you can use this connection
to execute functions from the `python-ohdsi `_ package. The
``include_metadata`` argument indicates whether the metadata of the database
should also be provided.

The ``types`` argument lists the databases to connect to. Pass a single entry
to connect to one database, or several entries to connect to multiple databases
at once:

.. code:: python

    from rpy2.robjects import RS4
    from vantage6.algorithm.tools.decorators import database_connection

    # One connection argument is supplied per entry in ``types``
    @database_connection(types=["OMOP", "OMOP"], include_metadata=False)
    def my_function(connection1: RS4, connection2: RS4):
        pass

..
.. note::

    The ``@database_connection`` decorator is currently only available for
    OMOP CDM databases. The connection object ``RS4`` is an R object, mapped
    to Python using the `rpy2 `_ package. This object can be passed directly
    on to the functions from `python-ohdsi `_.

Another useful decorator is the ``@algorithm_client`` decorator:

.. code:: python

    import pandas as pd
    from vantage6.client.algorithm_client import AlgorithmClient
    from vantage6.algorithm.tools.decorators import algorithm_client, data

    @data()
    @algorithm_client
    def my_function(client: AlgorithmClient, df1: pd.DataFrame, column_name: str):
        pass

This decorator provides the algorithm with a client that can be used to
interact with the vantage6 central server. For instance, you can use this
client in the central part of an algorithm to create a subtask for each node
with ``client.task.create()``. A full list of all commands that are available
can be found in the :ref:`algorithm client documentation `.

.. warning::

    The decorators ``@data`` and ``@algorithm_client`` each have one reserved
    keyword: ``mock_data`` for the ``@data`` decorator and ``mock_client`` for
    the ``@algorithm_client`` decorator. These keywords should not be used as
    argument names in your algorithm functions.

    The reserved keywords are used by the :ref:`MockAlgorithmClient ` to mock
    the data and the algorithm client. This is useful for testing your
    algorithm locally.

Algorithm wrappers
------------------

The vantage6 :ref:`wrappers ` are used to simplify the interaction between the
algorithm and the node. The wrappers are responsible for reading the input data
from the data source and supplying it to the algorithm. They also take care of
writing the results back to the data source.

As an algorithm developer, you do not have to worry about the wrappers. The
one thing you have to ensure is that the following line is present at the end
of your ``Dockerfile``:

..
.. code:: docker

    CMD python -c "from vantage6.algorithm.tools.wrap import wrap_algorithm; wrap_algorithm()"

The ``wrap_algorithm`` function will wrap your algorithm to ensure that the
vantage6 algorithm tools are available to it. Note that the ``wrap_algorithm``
function will also read the ``PKG_NAME`` environment variable from the
``Dockerfile``, so make sure that this variable is set correctly.

For R, the command is slightly different:

.. code:: r

    CMD Rscript -e "vtg::docker.wrapper('$PKG_NAME')"

Also, note that when using R, this only works for CSV files.

.. _vpn-in-algo-dev:

VPN
---

Within vantage6, it is possible to communicate with algorithm instances running
on different nodes via the :ref:`VPN network feature `. Each of the algorithm
instances has its own IP address and port within the VPN network. In your
algorithm code, you can use the ``AlgorithmClient`` to obtain the IP address
and port of other algorithm instances. For example:

.. code:: python

    from vantage6.client import AlgorithmClient
    from vantage6.algorithm.tools.decorators import algorithm_client

    @algorithm_client
    def my_function(client: AlgorithmClient):
        # Get the VPN addresses (IP and port) of all child tasks of this task
        child_addresses = client.get_child_addresses()
        # returns something like:
        # [
        #     {
        #         'port': 1234,
        #         'ip': '11.22.33.44',
        #         'label': 'some_label',
        #         'organization_id': 22,
        #         'task_id': 333,
        #         'parent_id': 332,
        #     }, ...
        # ]

        # Do something with the IP address and port

The function ``get_child_addresses()`` gets the VPN addresses of all child
tasks of the current task. Similarly, the function ``get_parent_address()`` is
available to get the VPN address of the parent task. Finally, there is a client
function ``get_addresses()`` that returns the VPN addresses of all algorithm
instances that are part of the same task.

VPN communication is only possible if the Docker container exposes ports to the
VPN network. In the algorithm boilerplate, one port is exposed by default. If
you need to expose more ports (e.g.
for sending different information to different parts of your algorithm), you
can do so by adding lines to the Dockerfile:

.. code:: docker

    # port 8888 is used by the algorithm for communication purposes
    EXPOSE 8888
    LABEL p8888="some-label"

    # port 8889 is used by the algorithm for data-exchange
    EXPOSE 8889
    LABEL p8889="some-other-label"

The ``EXPOSE`` instruction exposes the port to the VPN network. The ``LABEL``
instruction adds a label to the port. This label is returned by the client's
``get_addresses()`` family of functions. You may specify as many ports as you
need. Note that you *must* specify the label with ``p`` as prefix followed by
the port number. The vantage6 infrastructure relies on this naming convention.

Dockerfile structure
--------------------

Once the algorithm code is written, the algorithm needs to be packaged and made
available for retrieval by the nodes. The algorithm is packaged in a Docker
image. A Docker image is created from a Dockerfile, which acts as a blueprint.

The Dockerfile is already present in the boilerplate code. Usually, the only
thing you need to change is the ``PKG_NAME`` variable, which should be set to
the name of your algorithm package.
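
The port labels set in the Dockerfile are what allow your code to tell the
exposed ports apart. As a minimal sketch, assuming the address dictionaries
have the shape shown in the VPN section, the helper ``addresses_by_label``
below groups the addresses by their ``label``. The helper is hypothetical and
not part of vantage6, and the sample addresses are made up for illustration:

```python
from collections import defaultdict


def addresses_by_label(addresses: list[dict]) -> dict[str, list[tuple[str, int]]]:
    """Group (ip, port) pairs by the port label declared in the Dockerfile.

    Hypothetical helper for illustration; it is not part of vantage6.
    """
    grouped = defaultdict(list)
    for address in addresses:
        grouped[address["label"]].append((address["ip"], address["port"]))
    return dict(grouped)


# Made-up sample in the shape described for client.get_child_addresses()
sample_addresses = [
    {"port": 8888, "ip": "11.22.33.44", "label": "some-label", "organization_id": 22},
    {"port": 8889, "ip": "11.22.33.44", "label": "some-other-label", "organization_id": 22},
    {"port": 8888, "ip": "11.22.33.55", "label": "some-label", "organization_id": 23},
]

grouped = addresses_by_label(sample_addresses)
print(grouped["some-label"])
```

In an algorithm function you would pass the result of
``client.get_child_addresses()`` instead of the sample list.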