Running analyses

Here's what there is to it ...

A little background on what's happening under the hood:

As explained in previous sections, running an algorithm consists of the following steps:

  • Send a task to the nodes in your collaboration, consisting of:

    • The docker image you want the nodes to run (the docker image contains the algorithm you're interested in)

    • Any input you might want to provide to the docker image

  • Each node will execute the docker image.

    • The code/algorithm in the docker image will have access to:

      • The node's data

      • The input that was provided in the task

    • The node returns the result returned by the algorithm
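The steps above can be sketched as a toy simulation. This is plain Python, not the real infrastructure: the node data, the algorithm function, and the task input are all made up purely to illustrate that each node applies the same algorithm to its own local data plus the task input, and returns only the result.

```python
# A toy simulation of the task flow described above (not the real API).
def column_names_algorithm(data, input_):
    # The "algorithm" inside the docker image: it sees the node's local
    # data and the task input, and returns only a result.
    return sorted(data.keys())

# Each node holds its own data; the server never sees it.
node_data = {
    "node_a": {"Age": [50, 61], "Time": [12, 3]},
    "node_b": {"Age": [47], "Censor": [1]},
}

# The task input that would travel along with the docker image.
task_input = {"method": "column_names"}

# Every node executes the image and returns only the result.
results = [column_names_algorithm(data, task_input)
           for data in node_data.values()]
print(results)  # [['Age', 'Time'], ['Age', 'Censor']]
```

Only `results` ever leaves the nodes; the raw `node_data` stays local, which is the whole point of the federated setup.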

Of course, few people would actually want to run a regression (or any other statistical analysis) this way. Therefore, the algorithms written for the infrastructure have been packaged in such a way that using them is manageable.

So how, then?

As a researcher you would normally fire up your (least) favorite programming language and do something like the following to check which columns are in your data:

# This assumes the package devtools is installed:
devtools::install_github('mellesies/vtg.basic', subdir='src')
# Load the SEER dataset located in the vtg.basic package
data('SEER', package='vtg.basic')
# Print all the column names in the dataset
print( colnames(SEER) )
# Expected output:
#  [1] "Age"      "Race2"    "Race3"    "Mar2"     "Mar3"     "Mar4"     "Mar5"
#  [8] "Mar9"     "Hist8520" "hist8522" "hist8480" "hist8501" "hist8201" "hist8211"
# [15] "grade"    "ts"       "nne"      "npn"      "er2"      "er4"      "Time"
# [22] "Censor"

In a federated situation, you won't have direct access to the data. Instead, you'd have to instruct the nodes to run a docker image that returns the list of column names. Since we'd also have to communicate with the infrastructure, we need two things:

  • A Docker image (with software that returns the column names)

  • A client to facilitate communication

Fortunately, a docker image that returns the column names is already available, and using it is not too difficult:

# This assumes the package 'devtools' is installed and will automatically
# install the package 'vtg'.
devtools::install_github('mellesies/vtg.basic', subdir='src')

# Function to create a client
setup.client <- function() {
    # Username/password should be provided by the administrator of
    # the server.
    username <- ""
    password <- "password"

    host <- ''
    api_path <- ''

    # Create the client & authenticate
    client <- vtg::Client$new(host, api_path=api_path)
    client$authenticate(username, password)

    return(client)
}

# Create a client
client <- setup.client()

# Get a list of available collaborations
print( client$getCollaborations() )

# Should output something like this:
#   id     name
# (one row per collaboration)

# Instruct the client to use collaboration "PIPELINE"; replace 2 with the
# id listed for "PIPELINE" in the output above.
client$setCollaborationId(2)

# Since vtg.basic exports function names that collide with built-in functions,
# it's probably better to not attach the package, but call functions with a
# prefix instead.
result <- vtg.basic::colnames(client)
print(result)

# Should output something like this:
# [[1]]
#  [1] "Age"      "Race2"    "Race3"    "Mar2"     "Mar3"     "Mar4"     "Mar5"
#  [8] "Mar9"     "Hist8520" "hist8522" "hist8480" "hist8501" "hist8201" "hist8211"
# [15] "grade"    "ts"       "nne"      "npn"      "er2"      "er4"      "Time"
# [22] "Censor"
#
# [[2]]
#  [1] "Age"      "Race2"    "Race3"    "Mar2"     "Mar3"     "Mar4"     "Mar5"
#  [8] "Mar9"     "Hist8520" "hist8522" "hist8480" "hist8501" "hist8201" "hist8211"
# [15] "grade"    "ts"       "nne"      "npn"      "er2"      "er4"      "Time"
# [22] "Censor"
The same can be done with the Python client:

import vantage6.client as vtgclient

# Location of the server
HOST = 'http://vserver'
PORT = 5000

IMAGE = ''
USERNAME = 'yourusername'
PASSWORD = 'yourpassword'

# ID of a collaboration that includes your organization
COLLABORATION_ID = 1

# IDs of the organizations where you wish to run the task.
# They need to be part of the collaboration.
ORGANIZATION_IDS = [1, 2]

client = vtgclient.Client(HOST, PORT)
client.authenticate(USERNAME, PASSWORD)

# If you have configured end-to-end encryption between nodes you need to
# uncomment the following line and fill in the path to your rsa private key
# client.setup_encryption('path/to/rsa_private_key')

# Retrieve the column names of the datasets at all datastations
task = client.post_task('mytask', image=IMAGE,
                        collaboration_id=COLLABORATION_ID,
                        organization_ids=ORGANIZATION_IDS,
                        input_={'method': 'column_names'})
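Posting a task is asynchronous: the nodes still have to pick it up, run the image, and return their results, so in practice you poll until results appear. How exactly you fetch results depends on your client version, so the sketch below deliberately uses a stand-in `fetch` function (not a vantage6 method) and only demonstrates the polling pattern itself:

```python
import time

def wait_for_results(fetch, task_id, interval=1.0, timeout=60.0):
    """Poll fetch(task_id) until it returns something other than None.

    `fetch` is a stand-in for whatever "get results for this task" call
    your client offers; it is not itself part of the vantage6 API.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch(task_id)
        if result is not None:
            return result
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish within {timeout}s")

# Toy fetch that "completes" on the third poll, to exercise the loop:
calls = {"n": 0}
def fake_fetch(task_id):
    calls["n"] += 1
    return ["Age", "Time", "Censor"] if calls["n"] >= 3 else None

print(wait_for_results(fake_fetch, 42, interval=0.01))
# ['Age', 'Time', 'Censor']
```

With a real client you would pass a small wrapper around its result-retrieval call as `fetch`, keeping the timeout so a stuck node cannot block your script forever.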