FPGAs and the Future of Parallelized Computation in the Cloud

James Muldoon, 20th December 2017

Abstract:

Field-programmable gate arrays (FPGAs) are a type of mutable hardware that can be reprogrammed and repurposed. This allows quicker development, faster time to market, and shorter turnaround for hardware developers who would otherwise target application-specific integrated circuits (ASICs). “The nonrecurring engineering (NRE) expense of custom ASIC design far exceeds that of FPGA-based hardware solutions”, especially considering that the cost of incremental changes to an FPGA design is negligible compared to respinning an ASIC [1]. As a technology, FPGAs have been around since 1985, but only since 2016 have they been made available in the cloud by large providers such as Amazon Web Services (AWS) [2,3], Alibaba Cloud, and Huawei Cloud.

Graphics processing unit (GPU) acceleration has grown rapidly in popularity as machine learning technologies continue to mature, but the increasing need for true parallelization puts FPGAs forward as a leader among accelerator technologies [4]. The traditionally high barrier to entry has been significantly diminished by AWS removing the need to purchase physical hardware, as well as by the development of newer technologies such as SDAccel and OpenCL [5]. The future of parallelization and accelerator technologies consistently points toward FPGAs as cloud providers adopt them more widely and offer services built around them.

 

1. CPUs vs GPUs vs FPGAs

 

“There are advantages to each type of compute engine. CPUs offer high capacity at low latency. GPUs have the highest per-pin bandwidth. [Lastly,] FPGAs are designed to be very general” [6]. In their ideal use cases, FPGAs straddle the middle ground between CPUs and GPUs.

CPUs are the generic workhorse of computing. They are good at many varied tasks and can be programmed with as much variance and control as the user desires; however, their ability to parallelize at high throughput is limited by the architecture. A CPU has a fixed number of cores that can run processes in parallel, and that number is bound to the hardware’s original configuration. GPUs therefore become a more attractive option when control paths are no longer the focus and high-throughput data paths are.
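To make that limitation concrete, the following minimal sketch (illustrative only, and not drawn from any of the cited sources) splits a vector addition across however many hardware threads the host CPU happens to expose; the degree of parallelism is fixed by the chip, not by the problem.

    /* Minimal sketch: CPU parallelism is capped by the fixed core/thread count.
     * Assumes a POSIX system; illustrative only, not taken from a cited source. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define N 1048576
    static float a[N], b[N], c[N];

    typedef struct { long start, end; } slice_t;

    static void *vec_add_slice(void *arg) {
        slice_t *s = (slice_t *)arg;
        for (long i = s->start; i < s->end; i++)
            c[i] = a[i] + b[i];
        return NULL;
    }

    int main(void) {
        /* The hardware decides how many workers we can usefully run. */
        long ncores = sysconf(_SC_NPROCESSORS_ONLN);
        pthread_t threads[64];
        slice_t slices[64];
        if (ncores < 1) ncores = 1;
        if (ncores > 64) ncores = 64;

        for (long t = 0; t < ncores; t++) {
            slices[t].start = t * N / ncores;
            slices[t].end   = (t + 1) * N / ncores;
            pthread_create(&threads[t], NULL, vec_add_slice, &slices[t]);
        }
        for (long t = 0; t < ncores; t++)
            pthread_join(threads[t], NULL);

        printf("added %d elements on %ld hardware threads\n", N, ncores);
        return 0;
    }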

High-throughput data paths are the ideal use case for GPUs, which is why graphically intensive operations are tasked to them. Graphics computation is a field of vector problems ideally suited to parallelization, at which the GPU’s inherent pipelines excel. The GPU architecture thus accelerates vectorized data through its pipelines, data that would otherwise have to pass through many other logic blocks within a CPU before a result was produced.
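By way of illustration, the same vector addition can be expressed in the GPU’s data-parallel model. The sketch below uses OpenCL C (mentioned later in this paper) purely as an example: each work-item computes one element, and the device schedules thousands of them across its pipelines.

    /* Illustrative OpenCL C kernel (not from any cited source): vector addition
     * in the data-parallel model. Each work-item handles one element. */
    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c,
                          const unsigned int n)
    {
        size_t i = get_global_id(0);   /* this work-item's index in the 1-D range */
        if (i < n)
            c[i] = a[i] + b[i];
    }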

FPGAs, as stated earlier, occupy a middle ground in what they can do. They can be made to handle control paths as well as data paths, but unlike the CPU and GPU, their gate-level hardware logic can be reprogrammed to suit the problem at hand. This added flexibility removes steps from the pipeline, yielding gains in computational speed and energy use, along with the ability to run multiple vector operations in parallel.
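As a hedged sketch of what this looks like in practice, a high-level synthesis flow such as SDAccel (discussed in Section 3) can turn a simple loop into a dedicated hardware pipeline in the fabric. The xcl_pipeline_loop attribute below is a vendor-specific hint included only for illustration; it is an assumption of this example rather than something taken from the sources cited here.

    /* Hedged sketch of an FPGA-style kernel for a high-level synthesis flow.
     * A single work-item streams the whole array through a loop that the tool
     * can unroll and pipeline into dedicated logic. The attribute is a
     * vendor-specific hint; consult the toolchain documentation before use. */
    __kernel void vec_add_fpga(__global const float *a,
                               __global const float *b,
                               __global float *c,
                               const unsigned int n)
    {
        __attribute__((xcl_pipeline_loop))
        for (unsigned int i = 0; i < n; i++)
            c[i] = a[i] + b[i];   /* one result per clock once the pipeline fills */
    }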

“In general, FPGAs provide the best expectation of performance, flexibility and low overhead, while GPUs tend to be easier to program and require less hardware resources” [7]. That held true until cloud-based FPGA solutions such as AWS’s F1 instance types became available. CPUs remained the general processing unit suited to diverse workloads, but they could not match GPU or FPGA solutions for parallelization and throughput. This led to hybrid CPU/GPU and CPU/FPGA architectures, in which parallelizable sections are offloaded to the on-board GPU or FPGA for acceleration while the non-accelerated sections run on the CPU, achieving higher overall performance through the synergy of the two different processing units.
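The offload pattern itself is straightforward on the host side. The sketch below uses the standard OpenCL host API (error handling and device selection omitted, and the vec_add kernel from the earlier example assumed): the CPU keeps the control path and hands the data-parallel section to whichever accelerator, GPU or FPGA, backs the command queue.

    /* Hedged host-side sketch of the CPU + accelerator offload pattern using
     * the standard OpenCL host API. Error handling is abbreviated; the kernel
     * object k is assumed to have been built from the vec_add example above. */
    #include <CL/cl.h>

    void offload_vec_add(cl_context ctx, cl_command_queue q, cl_kernel k,
                         const float *a, const float *b, float *c, unsigned int n)
    {
        size_t bytes = n * sizeof(float);

        /* Move the parallel section's data to the device. */
        cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   bytes, (void *)a, NULL);
        cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   bytes, (void *)b, NULL);
        cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

        clSetKernelArg(k, 0, sizeof(cl_mem), &da);
        clSetKernelArg(k, 1, sizeof(cl_mem), &db);
        clSetKernelArg(k, 2, sizeof(cl_mem), &dc);
        clSetKernelArg(k, 3, sizeof(unsigned int), &n);

        /* Offload: the accelerator processes the data path in parallel while
         * the CPU remains free for control-heavy, sequential work. */
        size_t global = n;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

        /* Blocking read brings the results back when the CPU needs them. */
        clEnqueueReadBuffer(q, dc, CL_TRUE, 0, bytes, c, 0, NULL, NULL);

        clReleaseMemObject(da);
        clReleaseMemObject(db);
        clReleaseMemObject(dc);
    }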

To reiterate, the key difference is that an FPGA board is not restricted to the fixed hardware architecture of a CPU or GPU; it offers true parallelism through reprogrammable hardware logic. This places trade-offs directly in the hands of the developer that a CPU/GPU-centric solution could not offer: improved performance and power efficiency, at the cost of straightforward recompilation and higher upfront development effort.

 

2. Parallelizable, Highly Efficient Throughput of FPGA F1 Instances: A Case Study

 

On November 30th, 2017, AWS and Edico Genome delivered a joint presentation serving as a case study for the use of FPGAs in genomics compute acceleration [8]. AWS led the presentation with a brief, high-level overview of the differences between, and characteristics of, FPGAs and GPUs, followed by what AWS specifically offers in its F1 instance types (the FPGA-optimized instance family). F1 instances, as of this writing, “include 16 nm Xilinx UltraScale Plus FPGA. Each FPGA includes local 64 GiB DDR4 ECC protected memory, with a dedicated PCIe x16 connection. Each FPGA contains approximately 2.5 million logic elements and approximately 6,800 Digital Signal Processing (DSP) engines” [9]. In combination, these features make for a powerful instance type, which Edico Genome leverages to display the two companies’ joint potential.

Highlighting this, Edico Genome was featured for its Guinness World Record, set in collaboration with the Children’s Hospital of Philadelphia, establishing a “new scientific world standard in rapidly processing whole human genomes into data files usable [by] researchers aiming to bring precision medicine into mainstream clinical practice” [10]. The metrics shown roughly 34 minutes into the presentation report, on average, a 40x or greater performance increase over conventional CPU options (over 100x in the comparison against a Java implementation) by utilizing F1 instances [8].

Additionally, the presentation covered how Edico Genome migrated its existing solution to a cloud-ready one in order to leverage AWS and streamline its workflow. Using AWS services to stream data continuously, rather than downloading and uploading flat files, reduced the overhead external to the execution time itself, letting the team focus heavily on optimizing their existing FPGA solution. Overall, the presentation was a compelling demonstration of how thoroughly FPGAs can outclass the competition on important, non-trivial problems.

 

3. FPGAs’ Barrier to Entry, and AWS’s Solution to the Problem

 

FPGAs’ high entry costs are now a thing of the past. With the performance increases, the partnership between Xilinx and AWS, and ever-increasing regional support from AWS, F1 instances bring highly available computational power to the fingertips of developers for whom it would otherwise have been out of budget. Furthermore, Xilinx’s presence in the Asian market, supporting the other major cloud providers such as Alibaba and Huawei, brings to bear a community of register-transfer level (RTL) programmers on the future of accelerated computing that might otherwise not have been captured [11].

AWS and Xilinx have invested in making FPGA devices accessible to program, which previously required an extensive background in hardware and in the SystemC/SystemVerilog/Verilog/VHDL languages, by supporting the SDAccel programming environment. Accelerated computing is no longer a massive upfront investment in hardware and engineering expertise. With F1 instances and Xilinx’s FPGA Developer Amazon Machine Image (AMI), the ability to rapidly and repeatedly (re)program and test is simple and increasingly affordable.

Furthermore, TSMC’s Chairman Morris Chang has mentioned that a 3 nm fabrication plant, the likely future of highly performant hardware technologies, could cost as much as $15-20 billion USD [12]. Such costs of actually spinning an ASIC would inevitably be passed down through the contracts purchased to use those lines. FPGAs alleviate this burden of ASIC NRE expense [1] by allowing iterative development life cycles at comparatively negligible cost.

 

4. Future of Parallelized Computation and the Cloud

 

Intel is at the forefront of innovative design and quality hardware where CPU technologies are concerned. Its acquisition of Altera (an FPGA technology leader) on the 28th of December 2015 [13] put it firmly in place to create its recently announced hybrid CPU-FPGA architectures [14,15], which allow developers to leverage the Intel Acceleration Stack and an OpenCL-based programming environment. Along with this, speculation that the FPGA market will be worth nearly $10 billion by 2023 [15], and AWS’s constant improvement in meeting the demand for highly parallelized, high-throughput systems for neural networks, machine learning, genomics, finance, and cryptography (to name a few), has catapulted AWS’s position to that of one of the most critical providers for the future of numerical acceleration.

 

Bibliography:

[1]
National Instruments, “Introduction to FPGA Technology: Top 5 Benefits”, National Instruments, 2012.
[2]
Amazon, AWS re:Invent 2016: Introducing Amazon EC2 F1 Instances with Custom FPGAs for Hardware Acceleration. 2016.
[3]
J. Barr, “EC2 F1 Instances with FPGAs – Now Generally Available”, AWS News Blog, 2017.
[4]
K. Freund, “Amazon’s Xilinx FPGA Cloud: Why This May Be A Significant Milestone”, Forbes, 2016.
[5]
“SDAccel Development Environment”, Xilinx.com, 2017. [Online]. Available: https://www.xilinx.com/products/design-tools/software-zone/sdaccel.html. [Accessed: 04- Dec- 2017].
[6]
J. Dorsch, “CPU, GPU, or FPGA?”, Semiconductor Engineering, 2016. [Online]. Available: https://semiengineering.com/cpu-gpu-or-fpga/. [Accessed: 18- Dec- 2017].
[7]
S. Che, J. Li, J. W. Sheaffer, K. Skadron and J. Lach, “Accelerating Compute-Intensive Applications with GPUs and FPGAs,” 2008 Symposium on Application Specific Processors, Anaheim, CA, 2008, pp. 101-107.
[8]
Amazon, AWS re:Invent 2017: FPGA Accelerated Computing Using Amazon EC2 F1 Instances (CMP308). 2017.
[9]
“Amazon EC2 F1 Instances”, Amazon Web Services, Inc., 2017. [Online]. Available: https://aws.amazon.com/ec2/instance-types/f1/. [Accessed: 18- Dec- 2017].
[10]
Edico Genome, “Children’s Hospital of Philadelphia And Edico Genome Achieve Fastest-Ever Analysis Of 1,000 Genomes”, 2017.
[11]
K. Freund, “Amazon AWS And Xilinx: A Progress Report”, Forbes Tech Big Data, 2017.