According to McKinsey, by 2030, the adoption of cloud technologies could contribute up to US$3 trillion in EBITDA across industries such as retail, pharmaceuticals, and energy [1]. Gartner also predicts that by 2028, the public cloud services market will reach US$1.28 trillion [2].
The growing popularity of cloud computing is evident, and for those deciding whether to train ML models on their own infrastructure or move to the cloud, it's crucial to weigh the pros and cons.
Cloud service providers typically emphasize the advantages while downplaying potential drawbacks. However, the significance of these pros and cons will vary depending on the specific needs and use cases of different users.
At Guardora, we specialize in data protection for machine learning, and this article has two objectives:
- To outline a high-level map of considerations for those facing the choice between local infrastructure and the cloud.
- To help clarify which drawbacks can be reasonably overlooked when applying Privacy-Preserving Machine Learning techniques, freeing up mental space for other important decisions.
Pros and Cons of Cloud Services for Machine Learning
Feature | Advantages | Disadvantages |
---|---|---|
Scalability and Deployment for Real-World Workloads | Cloud services allow rapid scaling of resources based on demand, whether you're dealing with small models or complex deep neural networks. Even if a user has an in-house team capable of developing algorithms, deploying models in production and scaling them for real-world workloads presents a separate challenge, often requiring large computational clusters. | There is a risk of requested resources being unavailable, which could result in either waiting for them to become free or facing the difficulty of switching providers. |
Infrastructure and Control | There’s no need to purchase and maintain expensive specialized equipment and servers. You can simply rent computational resources in the cloud. The expertise required to build, train, and deploy machine learning models in enterprise applications increases the cost of labor, development, and infrastructure. | In the long run, with continuous usage and large data volumes, renting cloud resources may become more expensive than buying your own hardware. There’s less control over the physical infrastructure in the cloud. You are forced to tolerate technical failures and scheduled maintenance. Additionally, there is no way to fully optimize hardware resources, which may be crucial for specific tasks. This also creates the need to rely on resource administrators to resolve issues as they arise. |
Performance and Reduced Training Time for Models | The immediate availability of high-performance GPUs and TPUs in the cloud accelerates model training. | The need to upload training datasets to cloud storage can negate the benefits of fast access to computational resources. |
Team Expertise | Cloud services, along with AIaaS platforms, provide access to machine learning capabilities without the need to hire highly skilled AI or data science professionals.[3] This frees users from the need to maintain a large team to manage the infrastructure and its associated tasks. | However, even cloud-managed machine learning systems still require human oversight and optimization. There are practical limits to what AI can accomplish without human supervision and intervention. Algorithms do not fully understand every situation or know how to respond to every possible input. |
Payment Flexibility | You can pay only for the resources you use, without needing significant upfront investments. Cloud services offer various pricing plans, including pay-per-minute options, allowing for cost optimization. | There is a dependency on inflation-driven price adjustments, which can impact long-term partnerships. |
Automation | There is an option to automate the deployment and management of models. | |
Global Accessibility and Ease of Collaboration | Engineers can work remotely without being tied to a single server or computer. Multiple users can collaborate on the same model or project simultaneously. | |
Dependence on Internet Connectivity | Cloud servers typically have faster connections, which is beneficial for remote data downloads or transfers. | A stable internet connection is required to access cloud resources, which can be a problem during network outages. |
Quick Environment Setup | Using the cloud does not require complex hardware configurations or software installations—everything is available "out of the box". | When launching machine learning models in the cloud, transferring systems from one cloud provider or service to another can be challenging. This requires moving data in a way that does not impact model performance. Machine learning models are often sensitive to small changes in input data. For instance, a model may perform incorrectly if you need to change the format or size of your data. |
Support for Modern Tools | Cloud platforms offer a wide selection of tools and services for various tasks, including support for popular libraries and frameworks such as Scikit Learn, TensorFlow, PyTorch, and others. | |
Integration with Other Services | Cloud providers often offer ready-made solutions for data storage, analytics, and related tasks, along with seamless integration with other services and tools. | Compatibility issues may arise between different cloud services. |
Updates and Support | Cloud service providers regularly update their infrastructure without requiring manual intervention. They offer 24/7 technical support and extensive documentation, providing a single channel for any inquiries. | |
Bandwidth Limitations and Slow Uploads of Large Datasets | | Transferring large datasets to the cloud can be time-consuming and resource-intensive. Processing large volumes of data may take longer due to network bandwidth limitations. |
Technical Failures at the Provider | | Cloud services may experience outages, which can impact project availability and model training. |
Dependence on the Provider | | If you switch providers or discontinue service, you may need to migrate your data and rebuild or retrain on new infrastructure. There are also other risks associated with reliance on a single service provider. |
Data and ML Models Security | Cloud service providers implement dedicated measures to protect data, whereas not all users can keep pace with the latest advancements in privacy and security technologies on their own. That gap can lead to vulnerabilities, since maintaining top-level security is increasingly complex and requires specialized knowledge. Machine learning security in cloud computing is particularly noteworthy[4]. | Data transmitted and stored in the cloud is exposed to a wider range of cyberattack risks, despite the security measures implemented by providers[5]. Cloud-based machine learning faces the same challenges as any cloud computing platform: cloud ML systems are often exposed to public networks and can be compromised by malicious actors who might manipulate ML outcomes or escalate infrastructure costs. Additionally, cloud ML models are vulnerable to denial-of-service (DoS) attacks. Many of these threats are mitigated when models are deployed behind a corporate firewall[6]. |
Privacy Limitations | | Cloud services may not be suitable for handling sensitive data due to legal or corporate restrictions[7]. |
Legal and Regulatory Issues with Data Localization | | Some cloud providers may store data in other countries, which can lead to legal or regulatory challenges. |
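The long-run cost trade-off in the table above can be made concrete with a back-of-the-envelope break-even calculation. All figures below are illustrative assumptions, not real quotes from any provider:

```python
# Hypothetical break-even sketch: after how many months of continuous use
# does renting cloud compute cost more than buying equivalent hardware?

def breakeven_months(hardware_cost, monthly_upkeep, monthly_cloud_rent):
    """Months of continuous use after which owning becomes cheaper than renting.

    Returns None when renting never overtakes the cost of owning
    (i.e. rent does not exceed your own upkeep costs).
    """
    if monthly_cloud_rent <= monthly_upkeep:
        return None
    return hardware_cost / (monthly_cloud_rent - monthly_upkeep)

# Example with made-up numbers: a $30,000 server, $500/month for power and
# administration, versus $2,000/month of cloud rent for similar capacity.
months = breakeven_months(30_000, 500, 2_000)
print(f"Owning pays off after ~{months:.0f} months of continuous use")  # ~20 months
```

The point of the sketch is that the answer flips with utilization: hardware idle half the time doubles the effective break-even horizon, which is why bursty workloads tend to favor the cloud.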
Which of the Above Drawbacks Can Be Overlooked When Using Privacy-Preserving Machine Learning Techniques?
Data and ML Models Security: PPML methods specifically address data security and model integrity issues, rendering these concerns irrelevant when utilizing cloud services.
Privacy Limitations: Since PPML ensures that sensitive data remains non-interpretable and protected even in the event of a breach, worries about safeguarding sensitive information in the cloud become unfounded.
Legal and Regulatory Issues with Data Localization: Protection techniques can help transform personal data into anonymized data, freeing users from compliance with stringent regulations such as GDPR or restrictions on cross-border data transfers.
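As a minimal illustration of the kind of technique involved, here is a sketch of the Laplace mechanism from differential privacy, using only the Python standard library. The dataset, query, and epsilon value are made up for the example:

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Sample Laplace(0, scale) noise via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def private_count(records, predicate, epsilon: float) -> float:
    """Release a count under epsilon-differential privacy.

    A counting query has sensitivity 1 (adding or removing one record
    changes the true answer by at most 1), so Laplace(1/epsilon) noise
    is enough to satisfy the definition.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Made-up example: release how many patients are over 60 without
# exposing the exact number.
ages = [34, 61, 72, 45, 66, 58, 70]
print(f"noisy count: {private_count(ages, lambda a: a > 60, epsilon=0.5):.1f}")
```

Smaller epsilon values add more noise and give stronger privacy; the released value is useful in aggregate while any single individual's presence is masked.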
Special Attention to the Last Two Points
For instance, techniques such as homomorphic encryption, differential privacy, and synthetic data generation hold significant potential for data anonymization, and their use may eliminate the need to comply with personal data protection regulations. While no laws explicitly state this at the time of this article's publication, several precedents provide indirect support for this notion.
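To give a flavor of homomorphic encryption, here is a toy sketch of the Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of the plaintexts, so a server can add encrypted values without ever seeing them. The primes are deliberately tiny and insecure; this illustrates the principle, not a usable implementation:

```python
import math
import random

# Toy Paillier cryptosystem with tiny, wildly insecure primes --
# real deployments use primes of well over a thousand bits.
p, q = 293, 433
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)

def encrypt(m: int) -> int:
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(g, m, n2) * pow(r, n, n2) % n2

def decrypt(c: int) -> int:
    # L(x) = (x - 1) // n; mu is the modular inverse of L(g^lam mod n^2)
    mu = pow((pow(g, lam, n2) - 1) // n, -1, n)
    return (pow(c, lam, n2) - 1) // n * mu % n

a, b = encrypt(20), encrypt(22)
print(decrypt(a * b % n2))  # prints 42: the sum, computed on ciphertexts
```

This is why a cloud holding only ciphertexts and no key can still do useful work on the data without the data ever being interpretable to it.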
1. Judgment of the General Court of the EU in Case T-557/20 “Single Resolution Board Against the European Data Protection Supervisor.”[8]
This ruling emphasizes that determining whether data has been anonymized requires an assessment of the risk of re-identification based on risks and context.
It is important to note that anonymization is the process of removing personal identifiers from data so that individuals cannot be re-identified. Anonymized data is not considered personal under GDPR and, therefore, falls outside its scope (Article 4(1) and Recital 26 of the GDPR).[9]
This means that after applying PPML techniques to personal data, it ceases to be classified as "personal" in legal terms. Consequently, such data is no longer subject to various legal restrictions and data protection requirements regarding its use, dissemination, and even cross-border transfer.
As a result, companies can use data more freely, save costs on compliance, and implement projects that were previously unavailable due to legal data protection constraints.
The court ruled that in determining whether an individual is identifiable, all means that are likely to be used (including the costs and time required for identification, and the available technologies at the time of processing) should be considered. This assessment should be conducted from the perspective of the data recipient or owner.
2. Article 29 Data Protection Working Party 0829/14/EN WP216
Opinion 05/2014 on Anonymization Techniques, adopted on April 10, 2014
A.2. “Anonymisation” by randomization[10]
“... For as long as the key or the original data are available (even in the case of a trusted third party, contractually bound to provide secure key escrow service), the possibility to identify a data subject is not eliminated. …”
In other words, only once the decryption key and the original data are unavailable to any party can the data be considered anonymous.
3. Regulation (EU) 2018/1725 of the European Parliament and of the Council of 23 October 2018 on the protection of natural persons with regard to the processing of personal data by the Union institutions, bodies, offices and agencies and on the free movement of such data, and repealing Regulation (EC) No 45/2001 and Decision No 1247/2002/EC[11]
Recital 26
The principles of data protection should apply to any information concerning an identified or identifiable natural person.
Personal data which have undergone pseudonymisation, which could be attributed to a natural person by the use of additional information, should be considered to be information on an identifiable natural person.
To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person, to identify the natural person directly or indirectly.
To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.
The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.
This Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes.
4. Case C-582/14[12]
This judgment links identifiability to the level of identification risk: means of identification that would require disproportionate effort in terms of time, cost, and labour are not considered reasonably likely to be used.
5. Privacy-enhancing technologies guidance by the UK's Information Commissioner's Office (ICO)[13]
“Example … the hospital also shares information with researchers studying regional trends of COVID-19 cases. In this case, the hospital generates synthetic data for the researchers, possibly in combination with differential privacy to achieve effective anonymisation.”
“non-interactive differential privacy – this is where the level of identifiable information is a property of the information itself, which is set for a given privacy budget. This approach can be useful for publishing anonymous statistics to the world at large.”
“Both models of differential privacy are able to provide anonymous information as output, as long as a sufficient level of noise is added to the data. The local model adds noise to the individual (input) data points to provide strong privacy protection of sensitive attributes. As the noise is added to each individual contribution, this will result in less accurate and useful information than the global model.”
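The trade-off the guidance describes can be sketched numerically: in the global (central) model, noise is added once to the aggregate by a trusted curator, while in the local model each contribution is noised before it leaves the client, so the error of the aggregate grows with the number of participants. The query, bounds, and epsilon below are illustrative:

```python
import math
import random

def laplace(scale: float) -> float:
    """Sample Laplace(0, scale) noise via inverse transform sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def central_dp_sum(values, epsilon: float, upper: float) -> float:
    """Global model: noise is added once, to the aggregate held by a trusted curator."""
    return sum(values) + laplace(upper / epsilon)

def local_dp_sum(values, epsilon: float, upper: float) -> float:
    """Local model: each contribution is noised before leaving the client."""
    return sum(v + laplace(upper / epsilon) for v in values)

random.seed(0)
values = [random.uniform(0, 10) for _ in range(1000)]  # contributions bounded in [0, 10]
true_sum = sum(values)
print(f"true sum:   {true_sum:.0f}")
print(f"central DP: {central_dp_sum(values, 1.0, 10):.0f}")  # error independent of n
print(f"local DP:   {local_dp_sum(values, 1.0, 10):.0f}")    # error grows like sqrt(n)
```

This is exactly the ICO's point: the local model gives stronger protection of individual contributions at the price of noticeably less accurate aggregate statistics.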
To be fair, it's worth exploring the guidance's section "What are the different types of PETs?", as well as its less optimistic table on which PETs provide privacy at the input and output stages.
Also useful is the guidance's table of PET use cases, with information on the availability of standards and known limitations. Your goals may require a combination of methods to ensure the necessary protection at various stages of the data processing lifecycle, and that table is not an exhaustive list.
Globally, there are laws protecting personal and sensitive data. Properly anonymized data is exempt from these regulations, meaning it is not subject to privacy restrictions. "Proper anonymization" generally refers to the inability to reasonably re-identify individuals.
Delving deeper: to prove that data is properly anonymized, the data controller (or the party making the claim) must:
- conduct tests,
- take into account all means that are reasonably likely to be available and could be used,
- consider all objective factors such as the cost and time required for identification,
- account for the available technologies at the time of processing and technological advancements.
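A simplified version of such a re-identification test is checking how many records share each combination of quasi-identifiers (a k-anonymity check). The field names and records below are made up for illustration:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Smallest group size over all combinations of quasi-identifier values.

    If some combination occurs only once, that record can be singled out;
    k-anonymity requires every combination to appear at least k times.
    """
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Made-up records and field names, purely for illustration
records = [
    {"zip": "10115", "age_band": "30-39", "diagnosis": "A"},
    {"zip": "10115", "age_band": "30-39", "diagnosis": "B"},
    {"zip": "10117", "age_band": "40-49", "diagnosis": "C"},
]
print(k_anonymity(records, ["zip", "age_band"]))  # prints 1: the third record is unique
```

A k of 1 means at least one individual is singled out by the quasi-identifiers alone; in practice this uniqueness check is only one input to the broader risk-and-context assessment the rulings describe.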
Back to the Clouds
Personal or behavioral data, user interaction data, interests, and connections have long become valuable assets. They are collected, analyzed, utilized, and monetized in various ways.
Clouds require the trust not only of individuals but also of companies protecting their data, intellectual property, know-how, and machine learning models.
While NDAs and data protection laws are important, their effectiveness can only truly be assessed in a court of law, often long after incidents have occurred.
Compliance certificates also don’t offer active protection that can fully alleviate concerns about data leaks caused by human error.
This creates a need for users to have access to technologies that ensure paranoid-level protection while still remaining rational.
Some experts argue that trust in clouds today is akin to the joke about "a gambling club where everyone is trusted by their word." This club could shut down after the first major global privacy scandal.
Companies with the financial, human, and time resources often choose local solutions or build their own systems. However, even they go through a decision-making process weighing the "cloud pros and cons" mentioned earlier in this article.
Privacy-Preserving Machine Learning techniques don’t require blind trust in the cloud, as they ensure compliance with privacy constraints algorithmically, providing tangible and verifiable privacy protection.
A skillful combination of various methods, techniques, and PPML protocols ensures data opacity, protects against reverse engineering of sensitive data sources, and creates a robust privacy infrastructure, making it possible to process data according to the principle of "What You See Is What You Get".
Would you like to stay informed when certain Privacy-Preserving Machine Learning methods and protocols are formally recognized as anonymization methods by the legislation of a particular country?
Want to discuss cloud usage and think privacy-enhancing techniques might help you make that decision?
Or are you simply interested in the topics touched upon in this article, have questions, or disagree with something?
Join our Discord community, and let’s talk about it.
[1] Projecting the global value of cloud: $3 trillion is up for grabs for companies that go beyond adoption
[2] Forecast: Public Cloud Services, Worldwide, 2022-2028, 2Q24 Update
[3] Data science and Machine learning in the Clouds: A Perspective for the Future
[4] A Review of Machine Learning-based Security in Cloud Computing
[5] Securing Machine Learning in the Cloud: A Systematic Review of Cloud Machine Learning Security
[6] Machine Learning in the Cloud Complete Guide for 2023
[7] Research trends in deep learning and machine learning for cloud computing security
[8] EU General Court (Single Resolution Board v. European Data Protection Supervisor Case T-557/20)
[9] General Data Protection Regulation
[10] Article 29 Data Protection Working Party, 0829/14/EN WP216, Opinion 05/2014 on Anonymisation Techniques
[11] Regulation (EU) 2018/1725 of the European Parliament and of the Council of 23 October 2018
[12] Judgment of the Court (Second Chamber) of 19 October 2016, Case C-582/14 (Breyer)
[13] Privacy-enhancing technologies guidance by the UK's Information Commissioner's Office (ICO)