
High performance computing (HPC) is an ever-evolving domain that strives to solve complex compute problems as quickly as possible, down to shaving sub-millisecond latencies out of the communication path. The aim has always been to improve the communication layers between hardware components, whether at the network, compute or storage level, to remove the bottlenecks that slow scientific progress, typically using interconnect protocols under the Message Passing Interface (MPI) umbrella. The field has an estimated value of $36 billion, with a projection to reach $50 billion by 2027. The term 'high performance' usually refers to the average time to solution: the rate-limiting step for a hardware stack to complete a complex calculation or benchmark. The key players in the HPC market vary based on the solutions they provide; prime examples, in no particular order, include AMD, Intel, ARM, NVIDIA, HPE, Atos, DDN and IBM, among others.
HPC is used widely across industry, in genomics, pharmaceuticals and renewable energy to name a few, with use cases ranging from molecular dynamics to computational fluid dynamics and quantum computing. Given that the typical lifespan of an HPC cluster is between 5 and 7 years, computational problems tend to evolve faster than the hardware that runs them. The ability to replenish hardware to address more demanding problems is constrained by budget and economic factors. The result is an overall tendency to scale down use cases, rather than explore an end-to-end solution to a problem, in order to ease the burden on storage, compute and network resources.
Without the right expertise, building an HPC cluster can become a challenge, especially while trying to adhere to a defined budget and delivery timeframe. This is no different when referring to HPC in the cloud, be it a full migration or a hybrid model. Over the years it has become evident that the main factors driving HPC to the cloud are data centre costs and the cost of compute and storage. Moving to the cloud brings the benefits of an elastic compute model coupled with savings plans or reserved instances. The promise of the cloud gives an organisation the ability to build an infrastructure in minutes, and tearing down an old environment to replace it is faster than upgrading an on-premises one. The benefits also include cost management with marketplace or managed FinOps platforms, schedules, budget alerts, reproducibility with infrastructure as code (IaC), elasticity, and managed monitoring capabilities down to the packet level.
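As a small illustration of the cost-management point, the sketch below uses the AWS Budgets API through boto3 to create a monthly cost budget with an 80% alert. The account ID, budget amount and e-mail address are placeholders, and the same outcome can equally be achieved through the console or IaC tooling.

```python
import boto3

# Hypothetical example: a monthly cost budget with an 80% alert threshold.
# The account ID, amount and e-mail address below are placeholders.
budgets = boto3.client("budgets", region_name="us-east-1")

budgets.create_budget(
    AccountId="123456789012",
    Budget={
        "BudgetName": "hpc-storage-monthly",
        "BudgetLimit": {"Amount": "10000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[
        {
            "Notification": {
                "NotificationType": "ACTUAL",
                "ComparisonOperator": "GREATER_THAN",
                "Threshold": 80.0,
                "ThresholdType": "PERCENTAGE",
            },
            "Subscribers": [
                {"SubscriptionType": "EMAIL", "Address": "hpc-team@example.com"}
            ],
        }
    ],
)
```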
The cloud HPC market has been growing rapidly over the past decade; it was valued at around $7.06 billion in 2022 and is projected to reach $14.4 billion by 2028. While there are several cloud solution providers (CSPs), only two currently vie for dominance in the HPC market: AWS and Azure, with GCP recently layering services around this space. AWS and Azure offer competing but broadly similar compute, network and storage solutions; the difference comes down to the offers that entice cost savings or increased availability for storage and compute. This makes cloud adoption worth considering for many HPC-intensive organisations, but it can feel overwhelming to evaluate which CSP to opt for given all the offerings available. This article focuses on storage, with the aim of simplifying the offerings that AWS and Azure provide for running HPC in the cloud.
General View
Generally, a cloud provider's HPC storage offering is delivered through links to third-party vendors. With Azure, customers are given the choice between Azure NetApp Files, IBM Spectrum Scale (through Sycomp) or DDN EXAScaler Cloud. Azure is working towards adding Lustre-as-a-Service to its managed storage offerings, but there is no indication yet of when it will be generally available. With AWS, there is more variety to choose from under the FSx managed services family: Lustre, NetApp ONTAP, Windows File Server and OpenZFS. Each of these choices comes with some peace of mind around high availability and disaster recovery: they can be deployed in a single availability zone or across multiple zones, each with an associated cost. This is part of an abstraction layer that AWS manages for deployments; AWS takes care of data replication for the customer. If a single-zone deployment is chosen, the data residing on the storage layer is copied across several storage racks within the data centre; if a multi-zone deployment is chosen, the data is copied across multiple availability zones in the region of choice.
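As an illustration of how the single- versus multi-AZ choice surfaces in practice, the minimal boto3 sketch below provisions an FSx for NetApp ONTAP file system with a multi-AZ deployment type. The subnet IDs, capacity and throughput are placeholder values, and the equivalent single-zone deployment simply swaps the DeploymentType.

```python
import boto3

fsx = boto3.client("fsx", region_name="eu-west-1")

# Multi-AZ FSx for NetApp ONTAP file system: AWS replicates the data across
# two availability zones and manages failover on the customer's behalf.
# Subnet IDs, capacity and throughput below are placeholders.
response = fsx.create_file_system(
    FileSystemType="ONTAP",
    StorageCapacity=102400,  # GiB, roughly the 100 TB sizing used in Table 1
    StorageType="SSD",
    SubnetIds=["subnet-aaaa1111", "subnet-bbbb2222"],  # one subnet per AZ
    OntapConfiguration={
        "DeploymentType": "MULTI_AZ_1",  # "SINGLE_AZ_1" for a single-zone deployment
        "PreferredSubnetId": "subnet-aaaa1111",
        "ThroughputCapacity": 512,       # MB/s
    },
)

print(response["FileSystem"]["FileSystemId"])
```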
In Azure, things are done differently: high availability is an extra feature to turn on and is not enabled by default. In the case of Lustre, for example, high availability is tied to RAID if enabled at creation time, and there is also the option of interoperability syncs with Azure Blob Storage accounts for further redundancy and data protection. In the case of Azure NetApp Files, zone replication is not an option; customers are instead offered cross-region replication, with write access disabled on the replicated side until a failover occurs. The Azure NetApp Files team is working towards object tiering for release at some point in the future, and it should be noted that some options may still be in preview at the time of writing. For Spectrum Scale and DDN, an active-passive option is available for both, provided that object storage (through a storage account) is provisioned as part of the deployment strategy. For Azure Lustre, a zone preference can be chosen for the VMs' deployment; this is only applicable to non-managed solutions in Azure.
Assessment
Table 1 assumes a 100 TB SSD deployment of each AWS and Azure service, with feature sets applied where applicable to the cloud provider. For Azure, most storage options are tied to the compute layer provisioned (a 32-core LSv2/LSv3 SKU), and that is what is assumed for the 100 TB example where applicable. This is to simplify the cost of provisioning high-performance storage in both Azure and AWS:
| AWS storage solution | PayG $/TB per month | Comment | Azure storage solution | PayG $/TB per month | Comment |
|---|---|---|---|---|---|
| FSx for Lustre | 307.23 | Full SSD, 50% compression | Azure NetApp Files | 301.25 | Premium storage |
| FSx for NetApp ONTAP | 129.53 | Full SSD, 50% compression/de-duplication, multi-AZ | IBM Spectrum Scale (1 yr subscription with encryption) | 11,291.64 | 3 yr reserved instances (58% discount), inclusive of Spectrum Scale licences and fees; excludes Data Mobility fees |
| FSx for OpenZFS | 48.74 | Full SSD, 50% compression/de-duplication | Azure Lustre | 2,807.33 | 3 yr reserved VM instances (58% discount from PayG pricing) |
| FSx for Windows File Server | 163.84 | 50% compression/de-duplication, multi-AZ | DDN EXAScaler Cloud | 2,680.64 | 3 yr reserved VM instances (58% discount from PayG pricing) |
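To put the per-TB rates in context, the short sketch below multiplies two of the table's figures by the assumed 100 TB capacity to produce indicative monthly totals. The rates are taken straight from Table 1 and carry the same assumptions about compression, discounts and feature sets.

```python
# Indicative monthly totals at the 100 TB sizing assumed in Table 1.
# The $/TB per month rates are copied from the table above and inherit
# its assumptions (SSD, compression/de-duplication, discounts where noted).
PRICE_PER_TB_MONTH = {
    "AWS FSx for OpenZFS": 48.74,
    "Azure NetApp Files (Premium)": 301.25,
}

CAPACITY_TB = 100

for solution, rate in PRICE_PER_TB_MONTH.items():
    monthly = rate * CAPACITY_TB
    print(f"{solution:30s} ~${monthly:,.2f} per month at {CAPACITY_TB} TB")
```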
Remarks
Of course, customers of either provider can expand beyond these offerings and set up their own parallel file system across chained VMs running a bespoke file system such as ZFS, Ceph or anything of their choosing, but they would then need to oversee data lifecycle management and disaster recovery strategies themselves. With that said, AWS has some advantage over Azure in terms of managed storage capabilities and cost optimisations for HPC storage. An example of this is the freedom AWS gives to optimise throughput and security integrations with its Key Management Service (KMS): customers can opt for AWS-managed keys or bring their own keys for encryption when using anything from the FSx family. In the case of Azure's only fully managed solution, Azure NetApp Files, the volumes' throughput capacity is almost linearly correlated with volume size; i.e., to get more throughput one must provision a larger volume or move to a higher service tier. In terms of security, encryption is managed by Azure at the FIPS 140-2 standard and customers do not have the freedom to choose otherwise. There is of course the ability to choose between POSIX authentication layers such as Kerberos for NFS or Active Directory for SMB, but that is not the same as controlling volume encryption.
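To illustrate the throughput/size coupling on the Azure NetApp Files side, the sketch below applies the per-TiB throughput figures commonly documented for the Standard, Premium and Ultra service levels (16, 64 and 128 MiB/s per TiB respectively). These numbers come from Azure's public documentation rather than from this article and should be verified against the current limits.

```python
# Approximate Azure NetApp Files throughput ceiling as a function of
# provisioned volume size. The per-TiB figures are the commonly documented
# values for each service level and may change; verify before relying on them.
ANF_MIB_PER_SEC_PER_TIB = {
    "Standard": 16,
    "Premium": 64,
    "Ultra": 128,
}

def anf_throughput_mib_s(size_tib: float, service_level: str) -> float:
    """Throughput scales roughly linearly with volume size within a tier."""
    return size_tib * ANF_MIB_PER_SEC_PER_TIB[service_level]

# A 100 TiB volume: more throughput means a bigger volume or a higher tier.
for level in ANF_MIB_PER_SEC_PER_TIB:
    print(f"{level:8s}: {anf_throughput_mib_s(100, level):,.0f} MiB/s at 100 TiB")
```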
Conclusion
There are trade-offs to consider from both cloud providers in terms of pricing models, discounts, and options around throughput; it largely depends on whether there is a budget for implementing these options long term. Generally speaking, however, the future of cloud storage for HPC looks very promising: new services are added almost every quarter and improvements to existing ones are introduced regularly. Customers gain more options and capabilities to accelerate research and development compared with their on-premises environments. HPC enthusiasts should therefore be pleased in both the short and long term as cloud adoption expands and demand introduces more service offerings. While storage costs remain a burden for many organisations due to macroeconomic factors, the benefits are bound to outweigh the costs in the long term, and it becomes a FinOps exercise to cherry-pick where to store data and which workloads to run in the cloud.
If you would like to inquire about options for migrating your HPC to the cloud, please do not hesitate to get in touch here.
About the author:
Fouad is an HPC specialist at Cloudreach, an Atos company. Fouad has over 10 years of experience in the HPC and AI domains, tracking the latest developments in the market and designing and developing HPC solutions for customers in Azure and, more recently, in AWS.