FAQ

What problem does your product solve?
The widespread adoption of machine learning (ML) algorithms is hindered by the reluctance or inability of data owners to share their data with ML developers due to potential threats such as data leaks, theft, and misuse. Data privacy in AI will be a key issue in the coming decade.
Who can benefit from your product?
Data owners and ML teams training models on sensitive data, as well as cloud service providers and participants in multi-party computations.
What do you mean by sensitive data?
Confidential, personal, and other sensitive information (intellectual property, know-how, proprietary field data) used in training machine learning models, as well as the trained model itself, which is a product of intellectual labor.
What data privacy enhancement approaches do you use?
We employ a combination of three secure computation techniques and two privacy-preserving protocols to safeguard sensitive data throughout the machine learning process.

Secure Computation Techniques:
• Fully Homomorphic Encryption
• Functional Transformation
• Functional Encryption

Privacy-Preserving Protocols:
• Federated Learning
• Secure Multi-Party Computation
At what stages of the data pipeline does your product provide protection?
The level of protection varies depending on the chosen approach. The minimum protection covers data transmission, model training, and model quality evaluation. The maximum protection encompasses data transmission, storage, model training, model quality evaluation, and result retrieval.
What types of data does your product work with?
Our product supports a wide range of data formats, including text, numbers, tables, time series, images, audio, video, and geolocation data.
Which ML architectures does your product work with?
Our product seamlessly integrates with various ML architectures, including logistic regression, decision trees, neural networks, and generalized linear models.
How long would it take to break your protection using conventional means?
The computational effort required to break our security measures far exceeds what could be carried out within the estimated lifetime of the universe, rendering them practically impenetrable.
Why don't you publish test results based on open-source datasets from the web?
We offer a range of privacy-enhancing approaches with varying trade-offs between performance and security. Some methods preserve model performance and training time but return the model's results in the clear, while others may increase training and inference time but keep the results confidential.

The choice of approach depends on the client's specific requirements. Testing all solutions on open datasets and standard network architectures provides limited insights, as the performance will inevitably vary based on the client's model architecture, desired security level, input data characteristics, and other factors.
Do you provide data preparation, collection, and labeling services?
Data preparation (collection and labeling) is the responsibility of the data owner seeking to implement an ML solution. Our focus lies in ensuring the security of sensitive data during transmission to external computing resources or third parties.
What is the impact of dataset size?
Dataset size plays a significant but not decisive role.
How do you stand out from the competition?
When evaluating competitors, it's crucial to compare specific companies and their offerings. To make an informed decision, consider the following questions:

a) Practical Value vs. Theoretical Research:
Do the company's solutions address real-world problems or are they primarily focused on theoretical research?

b) Media Content Processing Efficiency:
Can the company's products handle media content without compromising processing time or increasing data volume?

c) Adaptability and Applicability:
How easily can the company's solutions be adapted for practical applications and real-world neural network modeling? Are they compatible with a wide range of neural network architectures?

d) Open-Source Licensing:
If the company claims to offer open-source solutions, does the licensing truly allow for commercial use?

e) Diversity of Methods and Protocols:
Does the company employ a variety of methods and protocols, or does it rely on a single approach for all cases? Do they adhere to principles of flexibility, effective combination, and avoiding redundancy?

f) Suitability of Data Protection:
Is the company's data protection approach tailored to machine learning? Does it involve actual encryption or merely data substitution within documents? Is it simply additional traffic encryption, with data decrypted before machine learning? Or is encrypted data excluded from model training altogether?

Comprehensive comparative analysis can only be conducted after addressing all these questions.
How do you differ from simple hashing (anonymization, depersonalization, ID assignment)?
Hashing is a technique for anonymizing personal data in textual information. However, hashed data cannot be used directly in machine learning and serves only as a linking key. Our product, in contrast, protects sensitive data that is actively used in machine learning processes.

Classic hashing applies a cryptographic hash function to arbitrary data, producing a fixed-length hash value. The transformation is irreversible, akin to passing data through a meat grinder. There are currently no known ways to use hashed data in machine learning: hashing simply lacks the mathematical properties that our protection methods are designed to preserve.

In contrast to hashed data's inapplicability for ML model training, encryption using Fully Homomorphic Encryption (FHE) enables operations on fully protected data throughout all stages: transmission, training, storage, quality evaluation, and result retrieval.
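For illustration, the following minimal Python sketch (illustrative only, not part of our product) shows that a cryptographic hash preserves no arithmetic structure: the hash of a sum bears no relation to the sum of the hashes, which is exactly the kind of structure an FHE scheme is built to preserve.

```python
import hashlib

def h(value: int) -> int:
    # SHA-256 of the value's decimal representation, interpreted as an integer.
    return int(hashlib.sha256(str(value).encode()).hexdigest(), 16)

a, b = 3, 4
print(h(a) + h(b) == h(a + b))  # False: no arithmetic structure survives hashing
```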
Do your methods require additional protection in the form of Trusted Execution Environments (TEEs)?
Our data protection methods are designed to function effectively even without relying on Trusted Execution Environments (TEEs).
Why do I need additional encryption between the customer and me as a data processor?
Additional encryption between the customer and the data processor is crucial in scenarios where:

• The customer is unwilling or unable to share data openly, especially when dealing with sensitive personal information such as facial images, passport scans, or medical records.
• The data processor, as the developer, has concerns about transmitting data to external computing environments (cloud services) due to potential data leakage risks.
Why do we need you if major cloud providers are already prohibited from storing datasets?
Data breaches can occur during processing, not just during storage. Relying solely on storage restrictions is insufficient for data protection.
What operations are supported by FHE: linear layers, activation functions, neural networks, or something else?
Fully Homomorphic Encryption (FHE) is applicable to neural networks and other machine learning algorithms that can be represented as a composition of polynomial functions. For instance, neural networks composed of convolutional layers, fully connected layers, average pooling layers, and polynomial activation functions (or activation functions that can be approximated by polynomials) can be used in conjunction with FHE. Logistic regression and Fisher's linear discriminant analysis are examples of other machine learning algorithms that can be applied in combination with FHE. We also see potential in the synergy between FHE and the recently proposed neural network architecture called KAN, where splines (piecewise polynomial functions defined by different polynomials over different intervals) form the basis of the network.
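For example, a non-polynomial activation such as the sigmoid can be replaced by a low-degree polynomial fitted over the expected input range; the resulting polynomial uses only additions and multiplications and can therefore be evaluated on FHE ciphertexts. The sketch below is a generic NumPy illustration, not our production code.

```python
import numpy as np

# Fit a degree-3 polynomial to the sigmoid on the interval [-5, 5],
# the range in which pre-activation values are assumed to fall.
x = np.linspace(-5.0, 5.0, 1000)
sigmoid = 1.0 / (1.0 + np.exp(-x))
coeffs = np.polyfit(x, sigmoid, deg=3)
poly = np.poly1d(coeffs)

print(coeffs)                             # coefficients of the FHE-friendly polynomial
print(np.max(np.abs(poly(x) - sigmoid)))  # worst-case approximation error on the interval
```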
How does the FL protocol work?
Federated Learning (FL) enables collaborative training of a neural network (NN) by multiple data owners. Each data owner trains the shared NN architecture on their own trusted resources using their private data, and they interactively exchange auxiliary information with a publicly accessible resource to construct a unified NN model.
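The sketch below illustrates the core idea with a FedAvg-style round on a toy linear model; it is hypothetical NumPy code for illustration only, and a real FL deployment adds secure aggregation, communication, and many other details.

```python
import numpy as np

def local_update(global_w, X, y, lr=0.1, epochs=5):
    # One data owner's local training: plain gradient descent for a linear
    # model y ~ X @ w, run entirely on the owner's private data.
    w = global_w.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(global_w, clients):
    # One FL round: every owner trains locally; the shared resource only
    # sees the returned weights and averages them, weighted by dataset size.
    local_ws = [local_update(global_w, X, y) for X, y in clients]
    sizes = [len(y) for _, y in clients]
    return np.average(local_ws, axis=0, weights=sizes)

# Three data owners, each holding a private dataset drawn from the same true model.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):
    X = rng.normal(size=(100, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=100)
    clients.append((X, y))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, clients)
print(w)  # approaches [2.0, -1.0] without any owner sharing raw data
```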
How does the SMPC protocol work?
Secure Multi-Party Computation (SMPC) is a specialized protocol for collaboration among multiple parties to jointly train an ML algorithm using data owned by multiple parties. In this protocol, parties exchange information derived from their data, which does not reveal the original data itself but allows for training the ML algorithm. Training can be conducted on the resources of one or more (possibly all) participating parties or on external resources not owned by any party.
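A minimal building block of such protocols is additive secret sharing: each party splits its private value into random shares, computations are performed on the shares, and no single party ever sees the original values. The Python sketch below is purely illustrative and omits the machinery of a real SMPC framework.

```python
import random

PRIME = 2**61 - 1  # all arithmetic is done modulo a large prime

def share(secret, n_parties):
    # Split a secret into n additive shares that sum to it modulo PRIME.
    shares = [random.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    return sum(shares) % PRIME

# Two parties hold private values and want the joint sum
# without revealing their individual inputs.
a, b = 42, 1337
a_shares = share(a, 3)  # distributed among three compute parties
b_shares = share(b, 3)

# Each compute party adds the shares it holds; no party sees a or b.
sum_shares = [(x + y) % PRIME for x, y in zip(a_shares, b_shares)]
print(reconstruct(sum_shares))  # 1379 == a + b
```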
How are new encrypted data integrated into existing unencrypted data?
The approach for integrating encrypted data with unencrypted data depends on the specific solution used:
• Veils-based Approach: if our Veils approach is employed, models trained on Veils-protected data can be applied to unencrypted data if the same Veils transformation is pre-applied to the unencrypted data.
• FHE-based Approach: for other approaches, particularly when an existing solution for unencrypted data needs to be adapted for encrypted data, Fully Homomorphic Encryption (FHE) can be utilized.
• Separate Model Training and Aggregation: alternatively, separate models can be trained for encrypted and unencrypted data, and their outputs can be aggregated in a suitable manner.
Direct integration of encrypted data into unencrypted data is not feasible due to the fundamental differences in their numerical representation domains.
What are the security parameters of your solutions, in terms of the impossibility of recovering the original data?
Our software enables the creation of different security levels for different confidential data.
Did you use your own implementation of homomorphic encryption? Or did you use one of the known ones?
Currently, we use available open-source implementations of homomorphic encryption schemes. However, we may develop our own custom implementation in the near future.
Have you been able to solve the speed problems of homomorphic encryption? This can be a serious limitation for use in production environments.
Encryption with homomorphic schemes is indeed slower than with conventional encryption algorithms, but the dominant cost comes from performing operations on the encrypted data. The overall speed can therefore be acceptable when the complexity of the required operations is low.
To our knowledge, homomorphic encryption does not support all types of mathematical operations. Have you made any progress in this area?
This remains a significant constraint: functions that cannot be represented as a composition of supported operations, or acceptably approximated by such a composition, are not supported.
Is it possible to encrypt a training set in a way that preserves the internal structure of the data or allows for parameter ranking relative to each other for use in model training?
Generating synthetic data appears to be the most promising approach for protecting original data while preserving its structure and ranking.
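As a simplified illustration (real generators use far richer models), the NumPy sketch below fits the mean and covariance of "original" data and samples synthetic records that reproduce that structure without containing any original record.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for sensitive records: two correlated numeric features.
original = rng.multivariate_normal(
    mean=[10.0, 5.0], cov=[[4.0, 1.5], [1.5, 2.0]], size=5000
)

# Fit a simple parametric model of the data's structure...
mean = original.mean(axis=0)
cov = np.cov(original, rowvar=False)

# ...and sample synthetic records that reproduce that structure.
synthetic = rng.multivariate_normal(mean, cov, size=5000)

print(np.round(mean, 2), np.round(synthetic.mean(axis=0), 2))
print(np.round(cov, 2))
print(np.round(np.cov(synthetic, rowvar=False), 2))
```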
There is a concern about the amount of data that can be used for initial analysis. For example, if it is passport photos, the security department says that it is okay to use 1000 passports for the training set, but 10,000 is not.
When using our Veils-based solution, both 1000 passports and 10,000 passports, or any number of passports, will be protected.
How does your service integrate into the customer's MLOps pipeline?
If data scientists currently work with unprotected data and plan to switch to protected data, the existing pipeline will need some modification: slightly different network architectures and different ways of pre- and post-processing the data. This is the price of the added protection, but the changes do not dramatically alter the overall workflow.
Do you provide ML and DS services?
No, we only provide a product that helps protect the data. You can use this data in ML and DS on your own afterwards.
Can your system encrypt only sensitive data within a document instead of the entire document?
Our product is designed to protect sensitive data during machine learning and data science processes. It is not intended for data anonymization, which involves removing or masking sensitive data from datasets that will not be used for training ML models.
What do we need to deploy on our side to encrypt videos for transmission?
You will need to deploy our Docker container, specify a folder for input videos and a folder for output data, and generate a secret conversion key.
What computing resources are required for encryption?
A regular computer or laptop is sufficient; a GPU is welcome but not required.
Will models trained on synthetic data perform well on real data?
Yes, synthetic data is generated to ensure that the trained model performs well on real data.
How do you ensure the security of your product against vulnerabilities?
We proactively monitor and apply classic and innovative vulnerability scanning techniques.
How are you better than crypto enclaves?
Unlike crypto enclaves, our data protection methods do not depend on hardware isolation or access-control systems; the protection is applied to the data itself and works independently of the execution environment.
Is it possible to perform feature engineering on protected data?
There are two options here:

1. Providing the data scientist with a small sample of real data so they can get acquainted with it, evaluate its properties, and use it to tune weights, logic, features, and so on.

2. Providing the data scientist with synthetically generated data that reproduces the statistical properties of the original data.
Data scientists often need to visualize data to understand relationships before training models. How does data protection work in this case?
If the data scientist is a representative of the data owner, the data scientist works with open data.
If the data scientist is a representative of a third-party organization without the right to access the original data, our product will not allow the data scientist to view the data.
Can your product work with LLM (Large Language Models)?
We are doing research on supporting self-learning LLMs.

In the case of closed-architecture LLMs, there is no way to quantize the model for homomorphic inference or to retrain it in order to fuse it with Veils.
Do I need to know about encryption to use your product?
No, our product is designed to be user-friendly and accessible to individuals with varying levels of technical expertise, including those without a background in encryption.
Is your product applicable for scenarios where the data owner does not need ML models?
Yes, in this case the owner's data can be used in a secure form to train an ML algorithm for a third party, who will then use the algorithm.
How does it work overall?
The data is transformed in a special way so that it cannot be recovered, identified, or interpreted, yet remains usable for training ML algorithms.
How does the FHE method work?
Fully homomorphic encryption (FHE) algorithms allow addition and multiplication to be performed on ciphertexts such that, after decryption, the results equal the sum and product of the corresponding plaintext data.
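The snippet below demonstrates this property with the open-source TenSEAL library and its CKKS scheme (one of several available open-source implementations, used here purely as an example); the parameters are illustrative, not the ones used in production.

```python
import tenseal as ts

# CKKS context for approximate arithmetic over encrypted real numbers.
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2**40

x = ts.ckks_vector(context, [1.0, 2.0, 3.0])
y = ts.ckks_vector(context, [4.0, 5.0, 6.0])

enc_sum = x + y    # homomorphic addition on ciphertexts
enc_prod = x * y   # homomorphic multiplication on ciphertexts

print(enc_sum.decrypt())   # approximately [5.0, 7.0, 9.0]
print(enc_prod.decrypt())  # approximately [4.0, 10.0, 18.0]
```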
Does your product protect the entire model?
The Fully Homomorphic Encryption and Functional Transformation (Veils) approaches protect the trained model as well.
How are you better than blockchain?
Comparing our solutions and blockchain is not meaningful because blockchain technology is designed to ensure data integrity and has nothing to do with ensuring data privacy and training ML algorithms.
Will you have access to the data?
Absolutely not! The point of the solution is to ensure that no one else has access to the data except the data owner.
Who has access to the data under your methods and protocols?
Only the data owner. Our methods and protocols are designed so that no one else has access to the data.