HPC and AWS - A Match Made In Heaven?

Cloud Architect Jeremy Bartosiewicz observes the benefits of running High-Performance Computing workloads on AWS and the opportunities HPC can unlock for your business.

I recently attended an AWS partner HPC (High-Performance Computing) boot camp and was very excited and honoured to be invited to such an event.

Historically, I haven’t really focused on HPC-specific workloads, but when I stop to consider the many projects I have been involved in over my time at Cloudreach, I realise many of them have included HPC-like workloads, just without the formal titles.

The more I consider the benefits of running such workloads on AWS, the more it seems like a match made in heaven ... let me explain.

Hitting the spot

One of my early ventures into cloud was with an architecture and design agency back in 2014. Their design process for buildings required analysis and simulation runs during the initial work. Each simulation attempt would consume any spare compute capacity in the data centre for days on end and slow down other iterations waiting for resources.

Sound familiar? Yes! A very bread-and-butter example of a requirement for on-demand extension of compute into cloud, for which AWS works well. Did you hear “spot instances” screaming in the back of your mind? If not, you did just now!

We helped bring those runs down to 18 minutes per run, used little to none of the on-premises capacity, and still made the system work end-to-end by extending the existing system into AWS over VPN. We actually profiled the workloads against various instance types to ensure our runs would complete in 55 minutes, to make use of the full hourly billing unit at the time. VPN worked well as the input files were small, the processing data (which stayed in AWS) was very large, and the output files were, again, quite small. It was the myriad of instance types and the bottomless pit of storage available, alongside spot pricing, that solved the burst requirement whilst introducing even greater savings.

Back then, spot was very primitive (despite being a cutting-edge concept). I can only make this claim thanks to hindsight. Specific to spot, there are some reasonably recent (2018) features worth shouting about which help you make use of and manage spot instances effectively. 

Easily drowned out by the waterfall of announcements AWS makes every year, Spot Fleet allows users to request a pool of instances, fully spot or partially on-demand, with various allocation strategies (based on price, instance type or capacity). This helps solve the classic problem of availability: in the worst-case scenario, on-demand instances are used, or a baseline pool of on-demand resource (an absolute number or a percentage) can be specified.
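
To make that concrete, here is a minimal sketch of such a request using boto3 (Python). The AMI, subnet and IAM fleet role ARN are placeholders you would swap for your own, and the instance types are only examples:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

response = ec2.request_spot_fleet(
    SpotFleetRequestConfig={
        # Placeholder role ARN - the fleet needs an IAM role it can use.
        "IamFleetRole": "arn:aws:iam::123456789012:role/my-spot-fleet-role",
        "AllocationStrategy": "diversified",  # spread capacity across the pools below
        "TargetCapacity": 10,                 # total instances wanted
        "OnDemandTargetCapacity": 2,          # baseline guaranteed via on-demand
        "Type": "maintain",                   # replace interrupted instances automatically
        "LaunchSpecifications": [
            {
                "ImageId": "ami-0123456789abcdef0",      # placeholder AMI
                "InstanceType": instance_type,
                "SubnetId": "subnet-0123456789abcdef0",  # placeholder subnet
            }
            for instance_type in ("c5n.9xlarge", "c3.4xlarge", "z1d.metal")
        ],
    }
)
print(response["SpotFleetRequestId"])
```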

The age-old practice of bidding for spot is no longer valid thanks to a new pricing model, which puts to rest any horror stories of vastly overpriced instances being snapped up by automated processes. Spot interruption notices now provide a way for you to safely pause or shut down workloads when instances are about to be reclaimed, allowing you to place a marker and resume where you left off with new instances - workload permitting, of course.
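
As a rough sketch of what acting on those notices can look like from inside an instance: the instance metadata service exposes a spot/instance-action document that returns HTTP 404 until a two-minute interruption warning has been issued. The checkpoint hook below is hypothetical and stands in for whatever “place a marker” means for your workload:

```python
import time
import urllib.error
import urllib.request

NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def interruption_pending() -> bool:
    """Return True once a spot interruption warning has been issued."""
    try:
        with urllib.request.urlopen(NOTICE_URL, timeout=1) as response:
            return response.status == 200
    except urllib.error.URLError:
        return False  # 404 (or no metadata service): nothing scheduled yet


def checkpoint_and_drain() -> None:
    """Hypothetical hook: persist progress so a replacement instance can resume."""


while not interruption_pending():
    time.sleep(5)  # poll every few seconds; the warning gives roughly two minutes

checkpoint_and_drain()
```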

My number one tip for making the best use of spot, however, is to throw away the tendency to favour only the latest and greatest AWS hardware. With the older generations, you can save a significant amount. I write that with great conviction as a very large cynic. Sometimes bigger is better, and cheaper too. See the table below.

At the time of writing, the following instance types were available on spot in eu-west-1, and their spot prices had not shifted at all over the previous month. For the latest prices, see the AWS pricing pages online.

 

| Type | On-demand price (per hour) | Spot price (per hour) | Zones |
|------|----------------------------|-----------------------|-------|
| c3.4xlarge | $0.9569 | $0.2493 | eu-west-1a, eu-west-1b, eu-west-1c |
| c5n.9xlarge | $1.9440 | $0.6184 | eu-west-1a, eu-west-1b, eu-west-1c |
| g2.8xlarge | $2.8080 | $0.8424 | eu-west-1a, eu-west-1b, eu-west-1c |
| f1.4xlarge | $3.6300 | $1.0890 | eu-west-1a, eu-west-1b, eu-west-1c |
| p2.16xlarge | $15.5520 | $4.6656 | eu-west-1b, eu-west-1c |
| z1d.metal | $4.9920 | $1.4976 | eu-west-1a, eu-west-1b, eu-west-1c |

 

This is a massive amount of compute that you could use to run those nightly batch jobs or ad hoc simulations without breaking the bank. You could even run asynchronous processing of messages held in queues, or tagging of videos and images for the AI and ML applications used by recommendation engines. The lack of price movement over the period checked suggests how few people are really making good use of the spare spot capacity on offer.

Worthy of note: a p2.8xlarge has an on-demand price higher than the spot price of a p2.16xlarge! In the table above, the p2.16xlarge spot price of $4.6656 is 70% below its $15.5520 on-demand price. By diversifying your spot fleet across different instance types, regions and availability zones, you genuinely can make savings in excess of 60%!

How can AWS help?

In 2019, HPC-like workloads are even more commonplace. The world generates immense amounts of data, and someone (or something) needs to churn through it all in good time, ideally in parallel. So how can AWS help?

For those just getting started with grid-like HPC computing on AWS, AWS ParallelCluster has been put together to help with that initial foray. It really does allow you to launch an infrastructure set that makes instance selection and consumption easy. The tool is based on Chef and AWS CloudFormation, and supports a number of schedulers (SGE, Torque, Slurm and AWS Batch) to help send jobs to those instances. It also comes with a rather nice CLI to make life even easier.
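
ParallelCluster itself is driven by its configuration file and CLI rather than code, but since AWS Batch is one of the supported schedulers, it is worth showing how little is needed to push work at a queue once one exists. A minimal boto3 sketch, assuming a hypothetical job queue and job definition have already been created:

```python
import boto3

batch = boto3.client("batch", region_name="eu-west-1")

job = batch.submit_job(
    jobName="overnight-simulation-042",
    jobQueue="simulation-queue",          # placeholder: an existing Batch job queue
    jobDefinition="simulation-job-def",   # placeholder: an existing job definition
    containerOverrides={
        # Placeholder command and input location for the container to run.
        "command": ["./run_simulation.sh", "--input", "s3://my-bucket/inputs/run-042"],
    },
)
print(job["jobId"])
```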

“Replace my enterprise scheduler with open source?” I hear you scream in anger. No, that’s not my suggestion here (rewriting those apps will take time). However, why not give something new and a bit more cloud-friendly a go in that next POC? (Remember, we are here to help!)

Some of you may be worried that AWS networking is just too slow for HPC workloads. However, with recent ENA enhancements and support for up to 100-gigabit network speeds (depending on the instance type), you should be able to find a configuration which meets your needs.

Do take a look at some of the benchmarks available online, and be mindful that placement groups will likely make sense for many cloud HPC implementations.
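
For tightly coupled jobs (MPI and friends), a cluster placement group keeps instances close together on the network. A minimal boto3 sketch, with placeholder AMI and subnet IDs:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Cluster placement groups pack instances onto nearby hardware, reducing
# inter-node latency for tightly coupled workloads.
ec2.create_placement_group(GroupName="hpc-cluster-pg", Strategy="cluster")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",       # placeholder AMI
    InstanceType="c5n.9xlarge",
    MinCount=4,
    MaxCount=4,
    SubnetId="subnet-0123456789abcdef0",   # placeholder subnet
    Placement={"GroupName": "hpc-cluster-pg"},
)
```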

With all those wonderful instances and parallel compute capabilities, what should you do with storage? Maybe you could give Amazon FSx for Lustre a go?

It is impressively fast, without any tuning, spin-up delay or expertise required. You also miss out on the pain of long-term management and get over 1,000 MB per second over the virtual wire in parallel reads/writes without even trying (very nice performance evaluation article here). You can now implement smaller file systems with the service, which broadens the range of use cases whilst making POCs even more cost-effective to begin with. It does require the installation of a small client, but with enough attention, maybe it will make its way back into the kernel.
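
Creating a file system is similarly undramatic. A minimal boto3 sketch, linking the file system to an S3 bucket as its data repository; the subnet, security group and bucket names are placeholders, and minimum sizes and pricing are worth checking before you copy this:

```python
import boto3

fsx = boto3.client("fsx", region_name="eu-west-1")

file_system = fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=1200,                        # GiB; sizes come in fixed increments
    SubnetIds=["subnet-0123456789abcdef0"],      # placeholder subnet
    SecurityGroupIds=["sg-0123456789abcdef0"],   # placeholder security group
    LustreConfiguration={
        "ImportPath": "s3://my-simulation-bucket/inputs",   # lazy-load inputs from S3
        "ExportPath": "s3://my-simulation-bucket/outputs",  # write results back to S3
    },
)
print(file_system["FileSystem"]["DNSName"])  # what the Lustre client mounts
```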

HPC workloads

So what are some of the current HPC-inspired workloads I see on the horizon? Being based in London, with recent experience in the financial services industry, I see FRTB and IFRS 17 as big drivers for new cloud-based HPC workloads, amongst others. If you allow us to take a seat at the table with some of your “quants”, we would be more than happy to help implement their ideas in cloud.

A match made in heaven?

Are HPC and AWS a match made in heaven? Definitely more so now than before, with such a large and diverse pool of instance types and countless performance tweaks available. A managed Lustre service, 100Gbps networking and a way to significantly reduce the price of running workloads also help make a very compelling argument. 

Can AWS be leveraged for all HPC use cases? … there may be some exceptions. Where it makes sense to install a Cray, such as the one the Met Office uses for global weather simulations, there may still be some challenges ahead in trying to leverage AWS. These are often circumstances where extreme performance, cost and security considerations are in play; the relevant features may quite simply not be available in cloud yet.

Only by challenging HPC architecture norms and the status quo of installing rigs, and by developing applications in a cloud-first way, will we find out whether AWS can cater to the full spectrum of HPC use cases. There are examples of large simulations being attempted, such as the one by Western Digital which consumed 2.5 million HPC tasks and 40K EC2 Spot Instances, so there is hope for those willing to adopt different methods.

The question to my reader is, are you brave enough to try and challenge the norm?

 

Remember, if you want to talk about compiling MPI libs, re-implementing enterprise grid servers in cloud to migrate workloads, re-writing and re-architecting systems to make use of some of these features (and others I failed to mention), do give us a call. Just please don't ask us to write a custom scheduler ;)

 
