
Vertical vs. Horizontal Federated Learning

Federated learning has emerged as a powerful approach for training machine learning models on decentralized data residing on edge devices or servers. It improves data privacy because raw data never leaves its source: models are trained locally, and only model updates are aggregated. Within this paradigm, two main branches exist: Horizontal Federated Learning (HFL) and Vertical Federated Learning (VFL). Understanding their differences is crucial for selecting the appropriate strategy for a given scenario.



Understanding the Fundamentals:

Horizontal Federated Learning (HFL): Also known as sample-based federated learning, HFL is suitable when different data sources share the same feature space but differ in the sample space. In simpler terms, the datasets across different participants contain similar attributes but describe different individuals or entities. Think of different hospitals collecting similar patient data (e.g., age, blood pressure, medical history) but focusing on different patient populations.


Vertical Federated Learning (VFL): Also referred to as feature-based federated learning, VFL is applicable when different data sources share the same sample space but differ in the feature space. This means that the datasets across participants contain information about the same entities, but each participant holds different attributes. For example, a bank, an e-commerce company, and a social media platform might all have information about the same set of users, but the bank has financial data, the e-commerce company has purchase history, and the social media platform has demographic and interest data.
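
To make the two partitioning schemes concrete, here is a minimal Python sketch; all record values and column names are invented for illustration:

```python
# Hypothetical records; names and values are illustrative only.

# Horizontal partition (HFL): same columns, different rows.
hospital_a = [
    {"patient_id": 1, "age": 64, "blood_pressure": 138},
    {"patient_id": 2, "age": 51, "blood_pressure": 121},
]
hospital_b = [
    {"patient_id": 7, "age": 43, "blood_pressure": 117},  # different patients
]

# Vertical partition (VFL): same rows (user ids), different columns.
bank = {1: {"credit_score": 710}, 2: {"credit_score": 655}}
ecommerce = {1: {"monthly_purchases": 9}, 2: {"monthly_purchases": 2}}

# HFL grows the sample dimension; VFL grows the feature dimension.
print(len(hospital_a) + len(hospital_b), "total samples across hospitals")
print({**bank[1], **ecommerce[1]}, "combined feature view of user 1")
```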


Key Differences Summarized:

| Feature | Horizontal Federated Learning (HFL) | Vertical Federated Learning (VFL) |
| --- | --- | --- |
| Data Similarity | Shared feature space, different samples | Shared sample space, different features |
| Data Partitioning | Horizontally partitioned | Vertically partitioned |
| Focus | Expanding the training dataset with more samples | Expanding the training dataset with more features |
| Privacy Concerns | Primarily protects sample privacy | Protects both sample and feature privacy |
| Communication Complexity | Relatively lower | Relatively higher |


Deep Dive into Horizontal Federated Learning (HFL):

Process:


  1. Initialization: A central server initializes a global model.

  2. Distribution: The server distributes the model to participating clients.

  3. Local Training: Each client trains the model locally using its own dataset.

  4. Update Aggregation: Clients send model updates (e.g., gradients, weights) back to the server.

  5. Global Model Update: The server aggregates the updates, most often with federated averaging (sketched after this list), to create a new, improved global model.

  6. Iteration: Steps 2-5 are repeated iteratively until the model converges or a pre-defined stopping criterion is met.
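
The aggregation in step 5 is most commonly federated averaging (FedAvg): a weighted average of client parameters in which each client's contribution is proportional to its local sample count. A minimal NumPy sketch of one aggregation round, with invented client sizes and parameter vectors:

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Weighted average of client parameter vectors (FedAvg, McMahan et al.)."""
    total = sum(client_sizes)
    # Each client's contribution is proportional to its local dataset size.
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Toy example: three clients, each holding a 4-parameter model.
client_weights = [np.random.randn(4) for _ in range(3)]
client_sizes = [1000, 250, 4000]  # local sample counts

global_weights = federated_average(client_weights, client_sizes)
print(global_weights)
```

In a full system this function would run on the server once per round, between collecting updates (step 4) and redistributing the new global model (step 2).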


Advantages:


  • Scalability: Easily scales to a large number of clients with distributed datasets.

  • Privacy: Raw data remains on clients' devices, protecting sample privacy.

  • Reduced Communication Costs: Clients typically send only model updates, which are much smaller than the raw datasets.


Disadvantages:


  • Data Heterogeneity: Differences in data distribution across clients (non-IID data) can degrade model performance (a sketch of such a split follows this list).

  • Client Selection Bias: If client participation is biased, the global model might not generalize well to the entire population.

  • Vulnerability to Model Poisoning: Malicious clients can inject corrupted updates, potentially compromising the global model.
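
To make the non-IID point above concrete, experiments often simulate heterogeneous clients with label-skewed splits, for example by drawing each class's allocation across clients from a Dirichlet distribution. A small sketch with an invented label array (smaller alpha means more skew):

```python
import numpy as np

rng = np.random.default_rng(0)

num_clients, num_classes = 3, 4
labels = rng.integers(0, num_classes, size=1200)  # toy label array

# Dirichlet-based label skew: small alpha -> highly non-IID clients.
alpha = 0.3
client_indices = [[] for _ in range(num_clients)]
for c in range(num_classes):
    idx = np.where(labels == c)[0]
    rng.shuffle(idx)
    # Split this class's samples across clients in Dirichlet proportions.
    proportions = rng.dirichlet(alpha * np.ones(num_clients))
    cut_points = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
    for client, part in zip(client_indices, np.split(idx, cut_points)):
        client.extend(part)

for i, idx in enumerate(client_indices):
    counts = np.bincount(labels[idx], minlength=num_classes)
    print(f"client {i}: class counts = {counts}")
```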


Example: Healthcare - Predicting Patient Readmission


Imagine several hospitals across different regions collaborating to predict patient readmission rates. Each hospital has patient data containing features like age, medical history, diagnoses, and treatment plans. They all share a similar feature space but have different patient populations.


HFL Application: Each hospital trains a model (e.g., logistic regression, neural network) locally using its own patient data. They then send the model updates to a central server. The server aggregates these updates to create a global model that benefits from the collective knowledge of all hospitals, improving prediction accuracy while keeping patient data private within each hospital's premises.


Deep Dive into Vertical Federated Learning (VFL):

Process:


VFL is more complex than HFL, often relying on secure multi-party computation (MPC) or homomorphic encryption (HE) to protect feature privacy. A simplified overview of the process (with a toy end-to-end sketch after the steps):


  1. Entity Alignment: Participants identify the overlapping samples (entities) between their datasets. This might involve private set intersection (PSI) techniques.

  2. Encryption (Optional): Data may be encrypted using HE or other secure techniques to further enhance privacy.

  3. Collaborative Training:

    • One participant, typically the one holding the labels (the 'active party'), acts as the driver and initiates the training process.

    • Other participants (holding different features) act as the 'passive parties'. They provide their features during training.

    • Training proceeds in iterations: each party computes a partial result from its own features, and the active party combines these partial results with the labels to compute the loss and the intermediate values (e.g., residuals) needed for the gradients.

    • Passive parties receive those intermediate values and use them, together with their own features, to compute the gradients for their portion of the model.

    • MPC or HE techniques are used to ensure that the data of each party remains private during gradient computation.

  4. Model Update: Each party updates its own portion of the model using the gradients computed for its own features.

  5. Iteration: Steps 3-4 are repeated iteratively until the model converges.
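
Putting the steps together, here is a toy two-party VFL sketch for logistic regression. To keep it readable, the cryptographic layer is deliberately omitted: the intermediate values exchanged below would be protected by HE or MPC in a real deployment, and the ID intersection would use PSI rather than a plain set intersection. All IDs, data, and dimensions are invented, and logistic regression is just one illustrative model choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Step 1 (entity alignment): a naive ID intersection stands in for PSI.
active_ids = {101, 102, 103, 104, 105}   # active party (holds the labels)
passive_ids = {103, 104, 105, 106}       # passive party (features only)
shared = sorted(active_ids & passive_ids)

# Toy aligned data: in a real system each party would look up the
# feature rows for the shared IDs in its own database.
n = len(shared)
X_active = rng.normal(size=(n, 2))   # active party's features
X_passive = rng.normal(size=(n, 3))  # passive party's features
y = rng.integers(0, 2, size=n)       # labels, held only by the active party

w_active = np.zeros(2)
w_passive = np.zeros(3)
lr = 0.1

for _ in range(100):
    # Step 3: each party computes a partial score on its own features...
    partial_active = X_active @ w_active
    partial_passive = X_passive @ w_passive  # sent to active party (encrypted in practice)

    # ...and the active party combines them with the labels to get residuals.
    pred = 1.0 / (1.0 + np.exp(-(partial_active + partial_passive)))
    residual = pred - y                      # sent back (encrypted in practice)

    # Step 4: each party updates only its own slice of the model.
    w_active -= lr * X_active.T @ residual / n
    w_passive -= lr * X_passive.T @ residual / n

print("active party weights:", w_active)
print("passive party weights:", w_passive)
```

Note how each party only ever touches its own features; what crosses the boundary are partial scores and residuals, which is exactly what the secure computation layer must protect.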


Advantages:


  • Feature Privacy: Protects the features held by each participant. No participant sees the raw features of others.

  • Data Integration: Enables collaboration between organizations with complementary datasets without sharing the underlying data.


Disadvantages:


  • Complexity: VFL is technically more complex than HFL, requiring sophisticated secure computation techniques.

  • Communication Overhead: Requires more rounds of communication between participants, especially with MPC or HE (a toy HE example follows this list).

  • Entity Alignment Challenges: Accurately identifying overlapping samples across different datasets can be challenging and potentially introduce bias.

  • Trust Assumptions: Relies on assumptions about the trustworthiness of the participants and the security of the underlying secure computation protocols.
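
To give a flavor of the HE machinery behind these costs, below is a minimal demonstration of additively homomorphic encryption using the python-paillier library (installable as `phe`); this is a toy illustration that assumes the library is available, not a hardened protocol:

```python
from phe import paillier  # python-paillier: additively homomorphic encryption

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# A passive party could encrypt its partial result...
encrypted_partial = public_key.encrypt(3.75)

# ...and another party can add to (or scale) the ciphertext without decrypting it.
encrypted_sum = encrypted_partial + 1.25   # ciphertext + plaintext
encrypted_scaled = encrypted_partial * 2   # ciphertext * plaintext scalar

# Only the private key holder can decrypt.
print(private_key.decrypt(encrypted_sum))     # 5.0
print(private_key.decrypt(encrypted_scaled))  # 7.5
```

Each Paillier ciphertext is on the order of the key size (a few kilobits) rather than a few bytes, which accounts for much of VFL's communication overhead.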


Example: Finance - Credit Risk Assessment

A bank (Party A) has customers' financial transaction history and credit scores. An e-commerce platform (Party B) has the same customers' purchase history and browsing behavior. They both want to collaboratively build a more accurate credit risk assessment model.


VFL Application:


  • The bank, holding the credit scores (labels), acts as the active party.

  • The e-commerce platform, holding purchase history and browsing behavior, acts as the passive party.

  • They first perform entity alignment to identify the overlapping customers.

  • The training process involves secure computation techniques (e.g., MPC) to ensure that the bank doesn't see the customers' purchase history, and the e-commerce platform doesn't see the customers' credit scores.

  • The parties jointly train a model that incorporates the features from both the bank and the e-commerce platform, resulting in a more accurate and robust credit risk assessment model.


Choosing between HFL and VFL:

The choice between HFL and VFL depends heavily on the data characteristics and the specific problem you are trying to solve:


  • HFL: Choose HFL when your data sources share the same features but differ in the individuals/entities they describe. Think of situations where you want to leverage data from multiple sources that collect similar information about different populations (e.g., healthcare, retail).

  • VFL: Choose VFL when your data sources have information about the same individuals/entities but hold different types of data (different features). This is suitable when organizations want to collaborate and leverage their complementary data to build a richer model (e.g., finance, advertising).


Beyond HFL and VFL: Hybrid Approaches:

It's also possible to combine HFL and VFL in a hybrid approach to address more complex scenarios. For example, several regional banks could use HFL among themselves (same features, different customers) to learn from a broader customer base, while each bank also uses VFL with a partner e-commerce platform (same customers, different features) to enrich the model with complementary data.


Horizontal and Vertical Federated Learning offer powerful tools for training machine learning models on decentralized data while preserving data privacy. HFL excels when data is partitioned horizontally (similar features, different samples), while VFL shines when data is partitioned vertically (different features, same samples). Understanding their strengths, weaknesses, and practical considerations is crucial for selecting the appropriate approach for your specific federated learning application. The choice hinges on data characteristics, privacy requirements, computational constraints, and the goals of collaboration between participating organizations. As the field evolves, hybrid approaches that combine HFL and VFL may become increasingly important for tackling complex real-world challenges.
