Federated learning is not a quick fix for every situation that requires privacy-preserving computation. Below is a general list of the steps required to start (and finish) a successful project, assuming horizontally partitioned data.
Setting up a (formal or informal) collaboration between several organizations is the first step. Within a collaboration, organizations can decide which data they want to make available for analysis and which algorithms they want to allow to be run on their data. The collaboration would also decide what to do with any knowledge obtained from the performed analyses (i.e., what to do with any resulting intellectual property).
Although it's possible to mix and match types of data storage (e.g., site A uses a .csv file, while site B uses a relational database), this makes algorithm development much harder since the code would have to cater for these different sources.
Current algorithms available for the vantage6 infrastructure (e.g., the Chi-squared and the Cox Proportional Hazards model) expect .csv files as the data source, since these are generally easy to work with. While .csv files are not suitable for more complex or permanent setups, rewriting these algorithms to support a different source is not difficult.
The implementation of the vantage6 server is agnostic when it comes to data storage used by the nodes.
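To illustrate why mixed storage types complicate algorithm code, here is a minimal sketch that uses only the Python standard library and no vantage6 API. The function names, the `measurements` table, and the `length` and `sex` columns are hypothetical. It shows that each source type needs its own loader, and that the loaders do not return identical types: the .csv reader yields strings, while the database returns typed values.

```python
import csv
import io
import sqlite3


def rows_from_csv(text_stream):
    """Read a .csv source into a list of dicts (all values are strings)."""
    return [dict(row) for row in csv.DictReader(text_stream)]


def rows_from_sqlite(connection, table):
    """Read a database table into a list of dicts (values keep their SQL types).

    The table name is interpolated into the query, so only pass trusted,
    hard-coded names here.
    """
    connection.row_factory = sqlite3.Row
    cursor = connection.execute(f'SELECT * FROM "{table}"')
    return [dict(row) for row in cursor]


def mean_length(rows):
    """Compute the mean of the hypothetical 'length' column.

    The explicit float() conversion is what lets the same code run on
    both sources, since the .csv loader returns strings.
    """
    values = [float(row["length"]) for row in rows]
    return sum(values) / len(values)


# Site A (hypothetical): data stored in a .csv file.
site_a = rows_from_csv(io.StringIO("length,sex\n170,female\n180,male\n"))

# Site B (hypothetical): data stored in a relational database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (length REAL, sex TEXT)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [(160.0, "female"), (190.0, "male")],
)
site_b = rows_from_sqlite(conn, "measurements")

print(mean_length(site_a), mean_length(site_b))
```

Every additional storage type adds another loader and another set of type conversions to test, which is the extra burden that agreeing on a single storage format avoids.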
Typically, a research question determines which data is required to answer it. Knowing the goal of an investigation makes the next steps much easier.
Determining the data dictionary consists of two parts:
Specifying the list of variables to include in the dataset
Defining the exact meaning of each of the variables
For example, for a dataset that describes body measurements and vital signs, one could decide to include the variables
Defining the exact meaning of the variables involves two things, and the collaboration may even decide to adopt an existing standard for this. First, all parties within the collaboration agree to use the same units (e.g., for length one could decide to use cm instead of mm). For categorical variables this would mean deciding on the value sets. Second, the collaboration would have to make sure that all variables are actually comparable. For example, make sure that all sites measured blood pressure in the same way (e.g., lying down).
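An agreed data dictionary can also be written down in machine-readable form so that each site can check its local data before joining an analysis. The following is a minimal sketch, not part of vantage6. The `Variable` class, the `sex` variable, its value set, and the plausible range for `length` are all illustrative assumptions. The range check is a cheap way to catch unit mix-ups, such as a length recorded in mm instead of cm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variable:
    """One entry of a (hypothetical) data dictionary."""
    name: str
    unit: Optional[str] = None           # agreed unit, for numeric variables
    allowed: Optional[frozenset] = None  # agreed value set, for categorical variables
    minimum: Optional[float] = None      # plausible lower bound in the agreed unit
    maximum: Optional[float] = None      # plausible upper bound in the agreed unit


# Illustrative dictionary; the real one is whatever the collaboration agrees on.
DATA_DICTIONARY = [
    Variable("length", unit="cm", minimum=30.0, maximum=250.0),
    Variable("sex", allowed=frozenset({"female", "male", "unknown"})),
]


def validate_row(row, dictionary=DATA_DICTIONARY):
    """Return a list of human-readable problems found in a single record."""
    problems = []
    for var in dictionary:
        if var.name not in row:
            problems.append(f"{var.name}: missing")
            continue
        value = row[var.name]
        if var.allowed is not None:
            # Categorical variable: the value must be in the agreed value set.
            if value not in var.allowed:
                problems.append(f"{var.name}: {value!r} not in agreed value set")
            continue
        # Numeric variable: it must parse and fall inside the plausible range.
        try:
            number = float(value)
        except (TypeError, ValueError):
            problems.append(f"{var.name}: {value!r} is not numeric")
            continue
        if var.minimum is not None and number < var.minimum:
            problems.append(f"{var.name}: {number} below {var.minimum} {var.unit}")
        if var.maximum is not None and number > var.maximum:
            problems.append(f"{var.name}: {number} above {var.maximum} {var.unit}")
    return problems


# A length of 1750 was probably recorded in mm and is flagged by the range check.
print(validate_row({"length": "1750", "sex": "female"}))
```

A check like this only covers units and value sets. Whether measurements are comparable across sites (e.g., how blood pressure was taken) still has to be agreed on and documented by the collaboration itself.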
Lastly, the parties need to install the proper software, as described in the next section.