Isolated Data, Unified Goal: Federated Learning as the Future of AI
Federated Learning is a rapidly advancing branch of machine learning in which multiple parties collaboratively train a model without exchanging their raw data.
Depending on how data is shared among participants and how the global model is constructed, Federated Learning can be classified into two key types:
In Horizontal Federated Learning (HFL), participants share the same feature space but hold different data samples.
For instance, this approach can be used to train models based on customer data from multiple banks operating in different regions.
In the cross-silo setting, data owners collaborate to build a global model that is more accurate than any participant could achieve with its local data alone. Participants are assumed to have sufficient computational resources, and each can run inference with the trained global model independently.
In the cross-device setting, the global model is fine-tuned directly on the devices that generate the data. The primary stakeholder is the provider of the original model, which is updated as it encounters new data.
The secondary stakeholder is the device owner, who benefits from a more accurate outcome as the model adapts to their specific data.
Vertical Federated Learning (VFL) applies when participants hold different features for the same set of entities.
For example, a bank and an online retailer may have information on the same users, but the bank holds financial transaction data, while the retailer has information on shopping preferences.
Features | HFL | VFL |
---|---|---|
Data Partitioning | Entity space | Feature space |
Model Training | Each client can train a local model | Training is not possible without combining local dataset parts |
Training Scenario | Cross-device, cross-silo | Cross-silo |
Data Transmitted | Model parameters | Intermediate results |
Local Information | Data | Client models, data |
Outcome for Participants | Global model | Client models |
Independent Inference | Yes | No |
Practical examples of usage are often more effective for understanding than theoretical explanations.
Let’s explore the key features of the mentioned types using a binary classification task as an example.
The experimental dataset, prepared from publicly available data on Kaggle, contains 36,457 unique records of credit card holders.
Each customer is described by 15 features and a class label: 0 indicates no overdue debt, while 1 indicates an overdue payment.
The two classes are fairly balanced. The dataset fields are:
Field | Description |
---|---|
CODE_GENDER | Gender |
FLAG_OWN_CAR | Car ownership flag |
FLAG_OWN_REALTY | Property ownership flag |
CNT_CHILDREN | Number of children |
AMT_INCOME_TOTAL | Annual income |
NAME_INCOME_TYPE | Income category |
NAME_EDUCATION_TYPE | Education level |
NAME_FAMILY_STATUS | Marital status |
NAME_HOUSING_TYPE | Housing type |
CUSTOMER_AGE | Age |
EMPLOYED_LENGTH | Duration of employment |
FLAG_WORK_PHONE | Work phone flag |
FLAG_PHONE | Phone flag |
FLAG_EMAIL | Email flag |
OCCUPATION_TYPE | Occupation |
STATUS | Credit status (class label) |
Let’s assume that, based on accumulated data, the financial organization aims to predict the creditworthiness of new clients.
To compare results, we will use the XGBClassifier from the XGBoost library in each case.
The training will be conducted on a sample of 30,000 records, while the remaining 6,457 records will form a common test set for all scenarios.
Let’s consider a scenario where several branches of a bank have accumulated the entire dataset. The first step is to consolidate the data into a single location.
Then, we can train the XGBoost classifier to categorize the data. The model trained using this method achieves an accuracy of 0.657 on the test set.
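This centralized baseline can be sketched as follows. To keep the snippet self-contained, synthetic data stands in for the Kaggle file, and a plain logistic model trained with gradient descent stands in for XGBClassifier; both substitutions are illustrative assumptions, not the article's exact pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the 36,457-record dataset: 15 features, binary label.
n_total, d = 36_457, 15
X = rng.normal(size=(n_total, d))
w_true = rng.normal(size=d)
y = (X @ w_true > 0).astype(float)

# 30,000 records for training; the remaining 6,457 form the common test set.
perm = rng.permutation(n_total)
train, test = perm[:30_000], perm[30_000:]
X_tr, y_tr = X[train], y[train]

# Centralized training: all records sit in one place, so a single model
# is fit on the full training set (logistic loss, full-batch gradient descent).
w = np.zeros(d)
for _ in range(100):
    p = 1.0 / (1.0 + np.exp(-X_tr @ w))
    w -= 0.5 * X_tr.T @ (p - y_tr) / len(y_tr)

test_acc = np.mean((X[test] @ w > 0) == y[test])
print(f"test accuracy: {test_acc:.3f}")
```

The same 30,000/6,457 split is reused in the scenarios below so the test-set numbers stay comparable.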
This conservative approach to training has clear advantages: transparency and relatively high accuracy. However, it also requires data to be centralized, which often presents challenges.
Train accuracy | Test accuracy |
---|---|
0.823 | 0.657 |
Now suppose the data is divided between two owners: two banks have collected similar information, and each decides to train locally on its own dataset of 15,000 customer records.
Simulating this situation by training the classifier on a randomly selected half of the records yields a model with a test accuracy of 0.595. Quality on the test set declines noticeably even as training accuracy improves, indicating overfitting caused by the reduced representativeness of the smaller training sample.
Train accuracy | Test accuracy |
---|---|
0.892 | 0.595 |
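The local-only simulation amounts to fitting the same model on a random half of the training records. A minimal sketch, again with synthetic data and a logistic stand-in (note that this toy model degrades far less from halving the data than the tree ensemble in the article, so the snippet only illustrates the setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 15 features, labels made noisy so sample size matters.
n_total, d = 36_457, 15
X = rng.normal(size=(n_total, d))
w_true = rng.normal(size=d)
y = (X @ w_true + rng.normal(scale=2.0, size=n_total) > 0).astype(float)

perm = rng.permutation(n_total)
train, test = perm[:30_000], perm[30_000:]
# One bank keeps only a randomly selected half of the training records.
half = rng.choice(train, size=15_000, replace=False)

def fit(idx, steps=100, lr=0.5):
    """Logistic-regression stand-in for the article's XGBClassifier."""
    w = np.zeros(d)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X[idx] @ w))
        w -= lr * X[idx].T @ (p - y[idx]) / len(idx)
    return w

acc_full = np.mean((X[test] @ fit(train) > 0) == y[test])
acc_half = np.mean((X[test] @ fit(half) > 0) == y[test])
print(f"full: {acc_full:.3f}  half: {acc_half:.3f}")
```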
In this scenario, data is divided among several owners, specifically Bank A and Bank B, each possessing 15,000 records of customers.
Both banks are interested in developing a more accurate model for better customer classification, so they decide to collaborate on distributed training using Horizontal Federated Learning (HFL).
In this case, the metrics of the trained model are closer to those achieved with centralized learning. Additionally, each client ultimately receives a global model that they can use independently.
However, this framework poses privacy risks to the training data, especially when a participant or the coordinating server is malicious or merely curious. Numerous studies demonstrate that training data can be partially reconstructed by analyzing the exchanged update parameters.
Furthermore, one participant could intentionally degrade the learning outcomes.
Train accuracy | Test accuracy |
---|---|
0.737 | 0.635 |
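Gradient-boosted trees do not average parameter-wise, so federating XGBoost requires specialized protocols. To illustrate the HFL mechanics themselves, here is a minimal FedAvg-style sketch with a logistic model and two synthetic clients; the client sizes, learning rate, and round counts are illustrative assumptions, not the article's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_sgd(w, X, y, lr=0.1, epochs=5):
    """One client's local training: a few epochs of full-batch gradient
    descent on logistic loss, starting from the current global weights."""
    w = w.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

# Two banks with horizontally partitioned data: same 15 features,
# disjoint customers (synthetic stand-ins for the real records).
d = 15
X_a, X_b = rng.normal(size=(500, d)), rng.normal(size=(500, d))
w_true = rng.normal(size=d)
y_a = (X_a @ w_true > 0).astype(float)
y_b = (X_b @ w_true > 0).astype(float)

# FedAvg: each round, clients train locally and send back only model
# parameters; the server averages them into a new global model.
w_global = np.zeros(d)
for _ in range(20):
    w_a = local_sgd(w_global, X_a, y_a)
    w_b = local_sgd(w_global, X_b, y_b)
    w_global = (w_a + w_b) / 2  # equal weights: equal-sized datasets

acc = np.mean(((X_a @ w_global) > 0) == y_a)
print(f"client A accuracy of global model: {acc:.2f}")
```

Note what crosses the network: weight vectors only, never records, which is exactly the property the comparison table attributes to HFL.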
In this scenario, the data is divided among several owners. Client A, a bank, has collected information about financial and property characteristics, while Client B possesses social characteristics.
Specifically, dataset A contains 30,000 records with the fields [FLAG_OWN_CAR, FLAG_OWN_REALTY, AMT_INCOME_TOTAL, NAME_INCOME_TYPE, EMPLOYED_LENGTH, FLAG_WORK_PHONE, OCCUPATION_TYPE, STATUS], whereas Client B holds information on the fields [CODE_GENDER, CNT_CHILDREN, NAME_EDUCATION_TYPE, NAME_FAMILY_STATUS, CUSTOMER_AGE, NAME_HOUSING_TYPE, FLAG_PHONE, FLAG_EMAIL] for the same set of individuals.
Client B does not have access to the target variables, which are only available to Client A. Therefore, Client B cannot independently use their data to train models. This situation can be resolved using Vertical Federated Learning (VFL) technology.
Train accuracy | Test accuracy |
---|---|
0.802 | 0.652 |
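The VFL training loop above can be sketched with a linear model over synthetic data: each party updates only the weights for its own feature slice, and only intermediate scores and residuals cross the boundary. Real deployments additionally protect these exchanges, for example with homomorphic encryption; this dependency-free sketch omits that and uses illustrative sizes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Vertically partitioned data: the same 600 customers, different features.
# Party A (the bank) holds 8 features plus the labels; party B holds 7.
n = 600
X_a, X_b = rng.normal(size=(n, 8)), rng.normal(size=(n, 7))
w_true = rng.normal(size=15)
y = (np.hstack([X_a, X_b]) @ w_true > 0).astype(float)

w_a, w_b = np.zeros(8), np.zeros(7)
lr = 0.5
for _ in range(200):
    # Each party computes a partial score on its own features and shares
    # only this intermediate result, never the raw feature values.
    z = X_a @ w_a + X_b @ w_b
    p = 1.0 / (1.0 + np.exp(-z))
    residual = p - y                  # computed by A, who holds the labels
    # A sends the residual back; each party updates its own weight slice.
    w_a -= lr * X_a.T @ residual / n
    w_b -= lr * X_b.T @ residual / n

acc = np.mean(((X_a @ w_a + X_b @ w_b) > 0) == y)
print(f"joint-model training accuracy: {acc:.2f}")
```

Inference has the same shape as the loop body: both parties must contribute their partial scores for every new customer, which is why VFL offers no independent inference.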
The first challenge faced by participants is the need to establish correspondences between records from both parties without revealing the datasets themselves, known as the private set intersection (PSI) problem.
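A naive illustration of record alignment via salted hashes follows. This is emphatically not a secure PSI protocol: hashed identifiers from a small domain can be brute-forced, and production systems use cryptographic PSI schemes (for example, Diffie-Hellman-based ones). The identifiers and shared salt here are hypothetical.

```python
import hashlib

def blind(ids, salt):
    """Hash each identifier with a shared salt before exchange, so raw
    IDs are never sent directly. Illustration only; see the caveat above."""
    return {hashlib.sha256((salt + i).encode()).hexdigest(): i for i in ids}

salt = "shared-secret-salt"            # agreed out of band (hypothetical)
bank_ids = {"alice", "bob", "carol"}
retailer_ids = {"bob", "carol", "dave"}

blinded_bank = blind(bank_ids, salt)
blinded_retail = blind(retailer_ids, salt)

# Intersect the blinded values; each party maps matches back to its own IDs.
common = blinded_bank.keys() & blinded_retail.keys()
matched = sorted(blinded_bank[h] for h in common)
print(matched)  # ['bob', 'carol']
```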
Similar to HFL, if we consider the potential for dishonesty from the associated party, we cannot assume that the training data is secure.
A key distinction of VFL is that the inference of the trained model must also be performed collaboratively, which creates further interdependence between the parties and increases the risk of leakage of newly incoming information.
Thus, by removing the need to centralize data, federated learning makes it possible to train models with comparable quality metrics.
However, there exists an extensive list of threats that necessitates a comprehensive approach to ensuring data privacy during federated learning.
This includes employing various cryptographic methods such as encryption, integrity checks, and authentication of updates, as well as implementing mechanisms to detect and block suspicious activities during the training process.
Implementing effective solutions requires expertise in tailoring training scenarios so that the method delivers practical value without compromising data security.