Medical Confidentiality and Data Privacy in Machine Learning

Date

11 July 2024

Q: What encryption techniques protect medical data during ML training?

Five techniques are commonly used. Fully Homomorphic Encryption (FHE) allows computation directly on encrypted data — slow but mathematically airtight. Federated Learning (FL) moves the computation to the data so raw records never leave the source. Functional Encryption allows specific functions to be computed on ciphertext while keeping the data hidden. Secure Multi-Party Computation (MPC) splits data across parties so no single party sees the whole. Functional Transformation (Guardora's Veils) is a lightweight non-cryptographic alternative for specific tabular use cases. Real systems combine several.

Machine learning in healthcare must protect patient health information (PHI) at every stage — transfer, storage, training, fine-tuning, inference, and the trained model itself.

Standard anonymization is sufficient when the training data cannot be traced back to individuals, but many clinical ML scenarios require both privacy and access to sensitive raw data.

Guardora combines five techniques: Functional Transformation (Veils), Fully Homomorphic Encryption, Federated Learning, Functional Encryption, and Secure Multi-Party Computation, to train models on confidential medical data without exposing it.

Healthcare is one of the fields where machine learning and confidential computing are in high demand. Medical organizations, as custodians of protected health information, bear legal and ethical responsibilities. They must prevent unauthorized access to data, since any such access would breach patient confidentiality.

This responsibility, along with the risk of financial and reputational consequences, has created an environment where data custodians are highly reluctant to share patient information or allow access to it.

What We Don't Do at Guardora

In MedTech, it is well known that AI and privacy-enhancing techniques are often applied separately.

Indeed, in many cases, personally identifiable information (PII) is simply anonymized and not used for training ML models. The data used for training ML models generally cannot be traced back to individual patients. In such instances, it's sufficient to clean the textual information containing personal data like names, cities, and test collection sites. There are market solutions available for this kind of anonymization.

What We Do at Guardora

At Guardora, we specialize in combining these two approaches, where sensitive data must first be protected and then used for training ML models.

There are cases where such protection is required at every stage of data handling:

Transfer
Storage
ML model training
Quality validation of the trained models
Model fine-tuning
Returning results (inference)
Protecting the models themselves as intellectual property

What Techniques Do We Use?

Healthcare is perhaps the most diverse field in terms of the types of data used: unstructured medical records, anamnesis notes, diagnoses, and dates; numerical, categorical, and binary features; texts, tables, and numerical and time series; DICOM files, images, video, and audio. This variety, together with the range of ML architectures involved, requires a combination of methods and privacy-enhancing computing protocols.

Guardora's solutions are based on employing a wide array of approaches:

Functional Transformation (Veils)
Fully Homomorphic Encryption
Federated Learning
Functional Encryption
Secure Multi-Party Computation

Case Studies

Here are several case studies from the healthcare sector that require maintaining the confidentiality of sensitive data while training ML models. These are some of the scenarios we have encountered at Guardora:

Developing, enhancing, and validating clinical ML algorithms on datasets owned by different entities.
Securely using data in cost-effective, universal cloud solutions, as opposed to lengthy, complex, and expensive on-premises implementations.
Accessing high-quality, diverse datasets representing global patient populations to ensure that algorithms provide equally accurate results regardless of the data collection equipment, patient demographics, clinical settings, or other social factors. To meet this standard, algorithm developers must have access to data that is representative of the scenarios they will encounter during deployment in various clinical environments.
Protecting intellectual property and ML algorithms of potential competitors during drug discovery research. New technologies, such as CRISPR, are revolutionizing gene editing research for diseases like diabetes and cancer. However, these innovations bring about new security challenges, necessitating full encryption of data even during processing.
Human genome data is increasingly being protected as personal information worldwide, making confidential computing with such data potentially a mandatory legal requirement.
Fetal biometry. Predictive analysis of fetal ultrasound images and videos.
Analysis, diagnosis, and predictive models for radiology, MRI, and fMRI, including the detection of semiotic signs.
Text extraction and classification.
Predictive and diagnostic models that directly interpret the extracted data.
Predictive analytics.
Clinical decision support systems.
Management decision support systems.
Systems for extracting data from unstructured medical records.
Systems for creating digital patient profiles.
Real-world clinical practice research.
Sale of protected datasets or data enrichment (when Company “A” uploads data and Company “B” receives an enriched segment, or reverse enrichment when Company “C” adds to the benefit of Companies “A” and “B”).
Linking electronic health records with geolocation as a significant predictor of cancer development, since carcinogens can be geographically specific. Harmful emissions from certain enterprises can increase cancer rates among the local population.
Telemedicine and remote patient monitoring. Intelligent patient safety monitoring and quality of care assessment using computer vision algorithms.
Classification and counting of cells in digitized peripheral blood and bone marrow smears.
Detection of diabetic retinopathy symptoms in fundus images.
Dental health analysis and monitoring progression.
Computer vision tasks: segmentation, regression, reconstruction, depending on the type of pathology.

Challenges

The market still needs to address the following challenges in order to create high-quality products:

Developing an ML solution that integrates data from multiple owners while ensuring the security of each owner's data.
Ensuring data security during ML model training outside the trusted environment of the data owner, such as in the cloud.
Ensuring data security during the use (inference) of an ML model deployed outside the trusted environment of the data owner.
Protecting ML models hosted on public resources from unauthorized use and parameter theft.

Not all market participants can guarantee security at the network and physical levels. Therefore, at Guardora, we offer solutions at the algorithmic and protocol levels.

If this topic interests you as a data owner or developer, join our community on Discord and participate in the discussion of these pressing issues.

Frequently Asked Questions

How can machine learning be used on patient data without violating HIPAA?

HIPAA (Health Insurance Portability and Accountability Act) restricts disclosure of protected health information (PHI) but does not prohibit ML on healthcare data.

Compliant approaches include: training on properly de-identified datasets (Safe Harbor or Expert Determination methods); using HIPAA-compliant cloud infrastructure with Business Associate Agreements; and applying privacy-preserving ML techniques — federated learning so PHI never leaves the covered entity's perimeter, homomorphic encryption so PHI is processed only in encrypted form, and secure multi-party computation so no single party sees combined PHI.
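To make the de-identification option above more concrete, here is a minimal Python sketch in the spirit of the Safe Harbor method. It is an illustration only: the field names (`name`, `zip`, `admission_date`, and so on) are hypothetical, it handles only a few of the identifier categories Safe Harbor lists, and it ignores the population condition attached to three-digit ZIP codes.

```python
from datetime import date

# Hypothetical field names for illustration; a real record schema will differ.
DIRECT_IDENTIFIERS = {
    "name", "ssn", "phone", "email", "street_address", "medical_record_number",
}


def deidentify(record: dict) -> dict:
    """Return a copy with direct identifiers dropped and quasi-identifiers coarsened."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

    # Dates directly related to the individual are reduced to the year.
    if "admission_date" in out:
        out["admission_year"] = out.pop("admission_date").year

    # Ages above 89 are grouped into a single "90+" category.
    if "age" in out and out["age"] > 89:
        out["age"] = "90+"

    # ZIP codes are truncated to the first three digits
    # (the small-population exception is not handled in this sketch).
    if "zip" in out:
        out["zip3"] = out.pop("zip")[:3]

    return out


record = {
    "name": "Jane Doe",
    "age": 93,
    "zip": "02139",
    "admission_date": date(2024, 7, 11),
    "diagnosis": "E11.9",
}
clean = deidentify(record)
```

In this sketch the clinical field (`diagnosis`) passes through untouched, which is the point of de-identification: the model still sees the signal it needs, while the fields that identify the patient are removed or coarsened.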

What is privacy-preserving machine learning (PPML) in healthcare?

Privacy-preserving machine learning (PPML) in healthcare is a set of cryptographic and architectural techniques that let teams train and deploy ML models on patient data without exposing the raw records to ML engineers, cloud providers, or partner organizations.

The core PPML techniques are federated learning, homomorphic encryption, functional encryption, secure multi-party computation, and functional transformation (such as Guardora's Veils). Most production healthcare ML systems combine two or more of these techniques.

How do hospitals share medical data for ML without privacy risks?

Hospitals can collaborate on ML without sharing raw records by using federated learning (each hospital trains a local model, only model updates leave the perimeter) or secure multi-party computation (data is split into encrypted shares across parties and joint computation reveals only the result).

Vertical federated learning works when hospitals hold different feature types about the same patients — for example, a primary care provider with diagnostic history and a wearables vendor with vitals. Cryptographic protocols protect both raw data and inference-time queries.
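The secret-sharing idea behind secure multi-party computation can be sketched in a few lines of Python. This is a toy example under stated assumptions, not Guardora's protocol: it uses additive secret sharing over an arbitrarily chosen modulus, the function names are hypothetical, and all "parties" are simulated in one process. Each hospital splits its private count into random shares, the parties add up the shares they hold, and only the combined total is ever reconstructed.

```python
import secrets

MODULUS = 2**61 - 1  # illustrative choice; all share arithmetic is done modulo this value


def share(value: int, n_parties: int) -> list[int]:
    """Split `value` into n additive shares that sum to `value` modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares: list[int]) -> int:
    """Recombine additive shares into the value they encode."""
    return sum(shares) % MODULUS


def secure_sum(private_values: list[int]) -> int:
    """Compute the sum of all private values while revealing only the total."""
    n = len(private_values)
    # all_shares[i][j] is the share of party i's value that is sent to party j.
    all_shares = [share(v, n) for v in private_values]
    # Party j locally adds up the shares it received from every party.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    # Publishing the partial sums reveals the total but not any single input.
    return reconstruct(partial_sums)


# Three hospitals jointly count patients matching a cohort definition.
total = secure_sum([120, 85, 240])
```

Any single share is a uniformly random number, so a party that sees only its own shares learns nothing about another hospital's count. Real deployments add authenticated channels and protection against parties that deviate from the protocol.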

Can federated learning be used for DICOM and medical imaging?

Yes. DICOM (Digital Imaging and Communications in Medicine) files contain both pixel data and embedded PHI in headers. Federated learning for DICOM-based tasks (radiology, MRI, fMRI, CT, fundus imaging, ultrasound including fetal biometry) is one of the most active areas of healthcare ML, because medical imaging models require diverse training data across institutions, but image transfer triggers privacy and copyright concerns. Each hospital trains a local CNN or transformer on its own DICOM archive; only model weights are aggregated centrally.
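The central aggregation step described above can be sketched as follows. The sketch assumes that aggregation is a sample-weighted average of the hospitals' weights (the FedAvg-style rule that is common in federated learning); the text does not say which aggregation rule is actually used, and the function name and flat weight lists are simplifications for illustration.

```python
def federated_average(client_weights, client_sizes):
    """Average per-hospital weight vectors, weighting each by its local sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
        for k in range(n_params)
    ]


# Two hospitals report locally trained weights; images never leave either site.
hospital_a = [1.0, 2.0]  # trained on 100 studies
hospital_b = [3.0, 4.0]  # trained on 300 studies
global_weights = federated_average([hospital_a, hospital_b], [100, 300])
```

Weighting by sample count lets a hospital with a larger archive contribute proportionally more to the shared model, while the server only ever handles weight vectors.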

What healthcare ML use cases require confidential computing?

Confidential computing is required when training data cannot be anonymized without losing utility, or when inference involves PHI. Examples include: cross-institution clinical algorithm development; cloud-based ML on protected datasets; access to global patient populations for demographic fairness; CRISPR and drug-discovery research (intellectual property + sensitive data); genome data analysis (increasingly classified as personal data under GDPR/152-FZ); fetal ultrasound and biometric analysis; radiology and MRI interpretation; electronic health record extraction; cancer prediction with geolocation; telemedicine; diabetic retinopathy detection; dental progression monitoring.

How is genome data protected when used for ML?

Genome data is uniquely identifying — even partial DNA sequences can re-identify an individual. Modern privacy frameworks (GDPR, Russia's 152-FZ, evolving US state laws) increasingly classify genome data as personal data requiring strict protection. ML on genome data therefore typically uses: federated learning across research institutions (each consortium member keeps its sequences locally); homomorphic encryption for cloud-based variant analysis; differential privacy when releasing aggregate statistics. Guardora supports cryptographic and protocol-level protections suitable for genome research workflows.
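As an illustration of differential privacy for aggregate statistics, the following Python sketch releases a noisy count using the Laplace mechanism. It is a minimal example under stated assumptions: the function name and the `epsilon` value are hypothetical, the query is a simple count (sensitivity 1), and the floating-point sampling is not hardened for production use.

```python
import random


def dp_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise; a counting query has sensitivity 1."""
    scale = 1.0 / epsilon
    # The difference of two independent exponential samples with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


rng = random.Random(2024)
# Number of study participants carrying a given variant, released with noise.
noisy = dp_count(true_count=100, epsilon=0.5, rng=rng)
```

A smaller `epsilon` means more noise and stronger privacy. The released value stays useful for population-level statistics, but it no longer reveals with certainty whether any one person's genome is in the dataset.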

How does Guardora protect patient data in ML projects?

Guardora provides infrastructure at the algorithmic and protocol levels — independent of (and complementary to) network and physical security. The platform combines five techniques: Functional Transformation (Veils), Fully Homomorphic Encryption, Federated Learning, Functional Encryption, and Secure Multi-Party Computation. Healthcare scenarios we've worked on include cross-institution clinical algorithm validation, cloud-based ML on protected datasets, genome research, fetal biometry, radiology AI, EHR extraction, telemedicine analytics, and protected dataset commerce. Solutions can be deployed on-premise or in compliant cloud environments.
