

Isolated Data, Unified Goal: Federated Learning as the Future of AI

Federated Learning is a rapidly advancing branch of machine learning that encompasses various approaches to distributed, collaborative model training without pooling raw data.

Depending on how data is shared among participants and how the global model is constructed, Federated Learning can be classified into two key types:

Horizontal Federated Learning (HFL)

Participants share the same feature space but have different data samples.

For instance, this approach can be used to train models based on customer data from multiple banks operating in different regions.

[Diagram: Party A and Party B share the same feature space and labels but cover different parts of the sample space]

It can take the following forms:

Cross-silo Federated Learning

In this type, data owners collaborate to build a more accurate global model than their local data could achieve individually.

It assumes participants have the necessary computational resources and perform independent global model inference after training.

Cross-device Federated Learning

In this scenario, the global model is fine-tuned directly on devices that generate the data. The primary stakeholder is the provider of the original model, which is updated as it interacts with new data.

The secondary stakeholder is the device owner, who benefits from a more accurate outcome as the model adapts to their specific data.

Vertical Federated Learning (VFL)

This approach applies when participants have different features for the same set of entities.

For example, a bank and an online retailer may have information on the same users, but the bank holds financial transaction data, while the retailer has information on shopping preferences.

[Diagram: Party A and Party B share the same sample space but hold different parts of the feature space; the labels belong to one party]

Key Comparative Features of HFL and VFL

| Feature | HFL | VFL |
| --- | --- | --- |
| Data partitioning | Entity space | Feature space |
| Model training | Each client can train a local model | Training is not possible without combining local dataset parts |
| Training scenario | Cross-device, cross-silo | Cross-silo |
| Data transmitted | Model parameters | Intermediate results |
| Local information | Data | Client models, data |
| Outcome for participants | Global model | Client models |
| Independent inference | Yes | No |

Practical examples of usage are often more effective for understanding than theoretical explanations.

Let’s explore the key features of the mentioned types using a binary classification task as an example.

The experimental dataset, prepared from publicly available data on Kaggle, contains 36,457 unique records of credit card holders.

Each customer is described by 15 features and a class label: 0 indicates no overdue debt, while 1 indicates an overdue payment.

The targets are fairly balanced:

class 0 (no overdue debt): 52%
class 1 (overdue payment): 48%

Each credit card holder is described by the following fields:

CODE_GENDER: gender
FLAG_OWN_CAR: owns a car
FLAG_OWN_REALTY: owns real estate
CNT_CHILDREN: number of children
AMT_INCOME_TOTAL: annual income
NAME_INCOME_TYPE: income category
NAME_EDUCATION_TYPE: education level
NAME_FAMILY_STATUS: marital status
NAME_HOUSING_TYPE: way of living
CUSTOMER_AGE: birthday
EMPLOYED_LENGTH: duration of employment
FLAG_WORK_PHONE: has a work phone
FLAG_PHONE: has a phone
FLAG_EMAIL: has an email
OCCUPATION_TYPE: occupation
STATUS: credit status

Let’s assume that, based on accumulated data, the financial organization aims to predict the creditworthiness of new clients.

To compare results, we will use the XGBClassifier from the XGBoost library in each case.

The training will be conducted on a sample of 30,000 records, while the remaining 6,457 records will form a common test set for all scenarios.
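The split described above can be sketched as follows. The synthetic array here is only a stand-in for the Kaggle credit-card dataset, but the sizes match the experiment (30,000 training records, 6,457 test records):

```python
import numpy as np

rng = np.random.default_rng(42)
n_total, n_train = 36457, 30000

# Synthetic stand-in for the credit-card dataset: 15 features, binary label
X = rng.normal(size=(n_total, 15))
y = rng.integers(0, 2, size=n_total)

# Shuffle once, then carve out a common test set shared by all scenarios
idx = rng.permutation(n_total)
X_train, y_train = X[idx[:n_train]], y[idx[:n_train]]
X_test, y_test = X[idx[n_train:]], y[idx[n_train:]]
```

Keeping the test set fixed across all four cases is what makes the accuracy figures below directly comparable.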

Case 1: Baseline (Centralized Learning)

Let’s consider a scenario where several branches of a bank have accumulated the entire dataset. The first step is to consolidate the data into a single location.

Then, we can train the XGBoost classifier to categorize the data. The model trained using this method achieves an accuracy of 0.657 on the test set.

This conservative approach to training has clear advantages: transparency and relatively high accuracy. However, it also requires data to be centralized, which often presents challenges.

Result: train accuracy 0.823, test accuracy 0.657

Case 2:
Training on Local Datasets Only

The data is now divided among several owners: two banks have collected similar information, and each decides to train locally on its own dataset of 15,000 customer records.

Simulating this situation by training the classifier on a randomly selected half of the records yields a model with an accuracy of 0.595 on the test set.

There is a noticeable decline in test-set quality despite higher accuracy on the training set, indicating overfitting caused by the reduced representativeness of the training data.

Result: train accuracy 0.892, test accuracy 0.595
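The train/test gap can be reproduced with any high-variance learner. As an illustration only (a 1-nearest-neighbour classifier on synthetic data, not the article's XGBoost model), shrinking the local sample keeps training accuracy perfect while test accuracy degrades:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Synthetic stand-in with a non-linear decision boundary
    X = rng.uniform(-2, 2, size=(n, 2))
    y = (X[:, 1] > np.sin(3 * X[:, 0])).astype(int)
    return X, y

def knn1(Xtr, ytr, Xte):
    # 1-nearest-neighbour: a high-variance learner that memorises its sample
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    return ytr[d.argmin(axis=1)]

Xte, yte = make_data(2000)
accs = {}
for n in (200, 2000):                 # small vs. large "local" sample
    Xtr, ytr = make_data(n)
    accs[n] = (np.mean(knn1(Xtr, ytr, Xtr) == ytr),   # train accuracy
               np.mean(knn1(Xtr, ytr, Xte) == yte))   # test accuracy
```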

Case 3:
Horizontal Federated Learning

In this scenario, data is divided among several owners, specifically Bank A and Bank B, each possessing 15,000 records of customers.

Both banks are interested in developing a more accurate model for better customer classification, so they decide to collaborate on distributed training using Horizontal Federated Learning (HFL).

In this case, the metrics of the trained model are closer to those achieved with centralized learning. Additionally, each client ultimately receives a global model that it can use independently.

However, this framework poses privacy risks to the training data, especially in cases of malicious intent or curiosity from the opposing party or server. Numerous studies demonstrate the possibility of reconstructing training data by analyzing update parameters.

Furthermore, one participant could intentionally degrade the learning outcomes.

Result (global model, Bank A and Bank B): train accuracy 0.737, test accuracy 0.635
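Federated training of gradient-boosted trees requires specialised protocols, so as an illustration of the HFL communication pattern only, here is a minimal FedAvg-style sketch: each bank refines the current global weights on its own data, and a coordinator averages the updates. A linear model stands in for XGBoost, and all data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_client(n):
    # Both banks draw from the same feature space (the HFL setting)
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(float)
    return X, y

clients = [make_client(500), make_client(500)]   # Bank A, Bank B

def local_update(w, X, y, lr=0.5, epochs=5):
    # Local logistic-regression refinement; only weights leave the client
    w = w.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * X.T @ (p - y) / len(y)
    return w

w_global = np.zeros(2)
for _ in range(10):                               # communication rounds
    local = [local_update(w_global, X, y) for X, y in clients]
    w_global = np.mean(local, axis=0)             # server averages the updates

X_te, y_te = make_client(1000)
acc = np.mean(((X_te @ w_global) > 0) == y_te)
```

Note that only model parameters cross the boundary, matching the "Data transmitted" row of the comparison table; the reconstruction attacks mentioned above analyse exactly these transmitted updates.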

Case 4: Vertical Federated Learning

In this scenario, the data is divided among several owners. Client A, a bank, has collected information about financial and property characteristics, while Client B possesses social characteristics.

Specifically, dataset A contains 30,000 records with the fields [FLAG_OWN_CAR, FLAG_OWN_REALTY, AMT_INCOME_TOTAL, NAME_INCOME_TYPE, EMPLOYED_LENGTH, FLAG_WORK_PHONE, OCCUPATION_TYPE, STATUS], whereas Client B holds information on the fields [CODE_GENDER, CNT_CHILDREN, NAME_EDUCATION_TYPE, NAME_FAMILY_STATUS, CUSTOMER_AGE, NAME_HOUSING_TYPE, FLAG_PHONE, FLAG_EMAIL] for the same set of individuals.

Client B does not have access to the target variables, which are only available to Client A. Therefore, Client B cannot independently use their data to train models. This situation can be resolved using Vertical Federated Learning (VFL) technology.
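The "intermediate results" exchange that characterises VFL can be illustrated with a minimal split linear model on synthetic data (a sketch, not a production VFL protocol): Party B never shares raw features and never sees labels; only partial scores and residuals cross the boundary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
XA = rng.normal(size=(n, 3))   # Party A: financial features, plus the labels
XB = rng.normal(size=(n, 3))   # Party B: social features, no labels
y = ((XA.sum(axis=1) + XB.sum(axis=1)) > 0).astype(float)  # depends on both

wA = np.zeros(3)
wB = np.zeros(3)
lr = 0.5
for _ in range(200):
    zB = XB @ wB                               # B sends only an intermediate result
    p = 1.0 / (1.0 + np.exp(-(XA @ wA + zB)))  # A combines it with its own part
    g = (p - y) / n                            # residuals require A's labels
    wA -= lr * XA.T @ g
    wB -= lr * XB.T @ g                        # A returns residuals; B updates locally

p = 1.0 / (1.0 + np.exp(-(XA @ wA + XB @ wB)))
acc = np.mean((p > 0.5) == y)
```

The final line also shows why independent inference is impossible in VFL: scoring a new customer requires both XA and XB.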

Result: train accuracy 0.802, test accuracy 0.652

Addressing Privacy Risks in Federated Learning

The first challenge faced by participants is the need to establish correspondences between records from both parties without revealing the datasets themselves, which is known as the private set intersection (PSI) problem.
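The idea of matching records without exposing them can be sketched with salted hashing. This is a toy illustration only: real PSI protocols rely on asymmetric cryptography, because salted hashes alone can be brute-forced when the identifier space is small.

```python
import hashlib

def blind(ids, salt):
    # Toy "blinding" via salted hashing; a real protocol would use e.g.
    # Diffie-Hellman-style PSI instead of bare hashes
    return {hashlib.sha256((salt + i).encode()).hexdigest(): i for i in ids}

salt = "value agreed out of band"          # an assumption of this sketch
bank = blind(["alice", "bob", "carol"], salt)
retailer = blind(["bob", "carol", "dave"], salt)

# Each side publishes only hashes; matching keys reveal the common customers
common = sorted(bank[h] for h in bank.keys() & retailer.keys())
```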

Similar to HFL, if we consider the potential for dishonesty from the associated party, we cannot assume that the training data is secure.

A key distinction of VFL is that the inference of the trained model must also be performed collaboratively, which creates further interdependence between the parties and increases the risk of leakage of newly incoming information.

Thus, by removing the need to centralize data, federated learning enables models to be trained to comparable quality metrics.

However, there exists an extensive list of threats that necessitates a comprehensive approach to ensuring data privacy during federated learning.

This includes employing various cryptographic methods such as encryption, integrity checks, and authentication of updates, as well as implementing mechanisms to detect and block suspicious activities during the training process.

The implementation of effective solutions requires expertise in customizing training scenarios to achieve both practical value from the method and the security of the data.

Share your feedback
on Discord community

Are you interested in FL (Federated Learning) in a secure implementation?
What challenges have you faced in implementing Federated Learning solutions?
How can we improve our discussions on Federated Learning?

We welcome your suggestions and feedback in our Discord community.
