High Performance Computing (HPC) – An Introduction

When we talk about high performance computing (HPC), we are typically trying to solve some type of problem. These problems generally fall into one of four categories:

  • Compute Intensive – A single problem requiring a large amount of computation.
  • Memory Intensive – A single problem requiring a large amount of memory.
  • Data Intensive – A single problem operating on a large data set.
  • High Throughput – Many unrelated problems that are computed in bulk.

 

In this post, I will provide a detailed introduction to High Performance Computing (HPC) that can help organizations address the types of problems listed above.

Compute Intensive Workloads

First, let us take a look at compute intensive problems. The goal is to distribute the work for a single problem across multiple CPUs to reduce the execution time as much as possible. In order to do this, we need to execute steps of the problem in parallel. Each process (or thread) takes a portion of the work and performs the computations concurrently. The CPUs typically need to exchange information rapidly, requiring specialized communication hardware. Examples of these types of problems include financial modeling and risk-exposure analysis, in both traditional business and healthcare use cases. This is probably the largest portion of HPC problem sets and is the traditional domain of HPC.

When attempting to solve compute intensive problems, we may think that adding more CPUs will reduce our execution time. This is not always true. Most parallel code bases have what we call a “scaling limit”. This is in no small part due to the system overhead of managing more copies, but also to more basic constraints.

This is summed up brilliantly in Amdahl’s law.

In computer architecture, Amdahl’s law is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. It is named after computer scientist Gene Amdahl, and was presented at the AFIPS Spring Joint Computer Conference in 1967.

Amdahl’s law is often used in parallel computing to predict the theoretical speedup when using multiple processors. For example, if a program needs 20 hours using a single processor core, and a particular part of the program which takes one hour to execute cannot be parallelized, while the remaining 19 hours (p = 0.95) of execution time can be parallelized, then regardless of how many processors are devoted to a parallelized execution of this program, the minimum execution time cannot be less than that critical one hour. Hence, the theoretical speedup is limited to at most 20 times (1/(1 − p) = 20). For this reason, parallel computing with many processors is useful only for very parallelizable programs.

– Wikipedia

Amdahl’s law can be formulated the following way:

S_latency(s) = 1 / ((1 − p) + p / s)

where

  • S_latency is the theoretical speedup of the execution of the whole task.
  • s is the speedup of the part of the task that benefits from improved system resources.
  • p is the proportion of execution time that the part benefiting from improved resources originally occupied.

Chart example: if 95% of the program can be parallelized, the theoretical maximum speedup using parallel computing would be 20 times.
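To make that limit concrete, here is a minimal Python sketch of Amdahl's law using the formula above; the parallel fraction of 0.95 and the CPU counts are purely illustrative values.

```python
def amdahl_speedup(p, s):
    """Theoretical speedup when a fraction p of the work
    is sped up by a factor of s (e.g., s = number of CPUs)."""
    return 1.0 / ((1.0 - p) + p / s)

# Illustrative values: 95% of the program can be parallelized.
for cpus in (2, 8, 64, 1024):
    print(f"{cpus:>5} CPUs -> {amdahl_speedup(0.95, cpus):5.2f}x speedup")

# As the CPU count grows, the speedup approaches 1 / (1 - 0.95) = 20x and no more.
```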

Bottom line: the more of your problem you can make run concurrently, the more processors you can split the work across and the more benefit you gain. However, due to complexity and overhead, adding more CPUs eventually hurts performance rather than helping.

There are libraries that help with parallelization, like OpenMP or Open MPI, but before reaching for these libraries, we should first strive to optimize performance on a single CPU and then make p as large as possible.
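As a rough illustration of splitting a single compute intensive problem across CPUs, here is a minimal Python sketch using the standard multiprocessing module; OpenMP and Open MPI provide analogous (and far richer) constructs for C, C++, and Fortran codes. The workload here, summing squares over chunks of a range, is purely hypothetical.

```python
from multiprocessing import Pool, cpu_count

def partial_sum(bounds):
    """Do one chunk of the work: sum of squares over [start, end)."""
    start, end = bounds
    return sum(i * i for i in range(start, end))

if __name__ == "__main__":
    n = 10_000_000
    workers = cpu_count()
    # Split the single problem into one chunk per CPU.
    step = n // workers
    chunks = [(i * step, n if i == workers - 1 else (i + 1) * step)
              for i in range(workers)]
    with Pool(workers) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total)
```

Note that this toy example has almost no serial portion, so it scales well; real applications usually have setup, I/O, and coordination steps that cap the achievable speedup, exactly as Amdahl's law predicts.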

Memory Intensive Workloads

Memory intensive workloads require large pools of memory rather than many CPUs. In my opinion, these are some of the hardest problems to solve and typically require great care when building the machines for your system. Coding and porting are easier because memory appears seamless, allowing for a single system image. Optimization becomes harder over time, however, because component uniformity erodes: traditionally, you don't replace every server in the data center every three years, so as you add resources to the cluster, memory access becomes non-uniform and introduces real latency. We also have to think about the interconnect between the CPU and the memory.

Nowadays, many of these concerns have been eliminated by commodity servers. We can ask for thousands of the same instance type with the same specs and hardware, and companies like Amazon Web Services are happy to let us use them.

Data Intensive Workloads

This is probably the most common workload we find today, and probably the type with the most buzz; these are the “Big Data” workloads. Data intensive workloads are suited to software frameworks like Hadoop and the MapReduce programming model. We distribute the data for a single problem across multiple CPUs to reduce the overall execution time. The same work is often performed on each data segment, though that is not always the case. This is essentially the inverse of a memory intensive workload, in that rapid movement of data to and from disk matters more than the interconnect. In academia these problems tend to come from the life sciences (genomics in particular), and they have a wide reach in commercial applications, especially around user data and interactions.
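To illustrate distributing the data rather than the computation, here is a MapReduce-style word count sketched in Python; the in-memory “shards” are hypothetical stand-ins, and real frameworks such as Hadoop add distributed storage, scheduling, and fault tolerance on top of this basic pattern.

```python
from collections import Counter
from multiprocessing import Pool

def map_shard(lines):
    """Map step: count words within a single shard of the data."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

if __name__ == "__main__":
    # Hypothetical data shards; in practice these live on distributed storage.
    shards = [
        ["the quick brown fox", "jumps over the lazy dog"],
        ["the dog barks", "the fox runs"],
    ]
    with Pool() as pool:
        partials = pool.map(map_shard, shards)
    # Reduce step: merge the per-shard counts into a single result.
    totals = sum(partials, Counter())
    print(totals.most_common(3))
```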

High Throughput Workloads

Batch processing jobs (jobs with almost trivial operations to perform in parallel, as well as jobs with little to no inter-CPU communication) are considered High Throughput workloads. In high throughput workloads, we emphasize throughput over a period of time rather than performance on any single problem. We distribute multiple, independent problems across multiple CPUs to reduce overall execution time. These workloads should (as the sketch after this list illustrates):

  • Break up naturally into independent pieces.
  • Have little or no inter-CPU communication.
  • Be performed concurrently in separate processes or threads, each on its own CPU.
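As a sketch of this pattern, the following Python snippet submits many unrelated jobs to a process pool; each job shares nothing with the others, and the job body itself is only a placeholder.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def run_job(job_id):
    """One self-contained job; it communicates with no other job."""
    return job_id, sum(i * i for i in range(100_000 + job_id))

if __name__ == "__main__":
    jobs = range(50)  # many unrelated problems, computed in bulk
    with ProcessPoolExecutor() as pool:
        futures = [pool.submit(run_job, j) for j in jobs]
        for future in as_completed(futures):
            job_id, result = future.result()
            print(f"job {job_id} finished: {result}")
```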

 

Compute intensive workloads can often be broken into high throughput jobs; however, a high throughput job is not necessarily compute intensive.

HPC On Amazon Web Services

Amazon Web Services (AWS) provides on-demand scalability and elasticity for a wide variety of computational and data-intensive workloads, including workloads that represent many of the world’s most challenging computing problems: engineering simulations, financial risk analyses, molecular dynamics, weather prediction, and many more.   

– AWS: An Introduction to High Performance Computing on AWS

Amazon literally has everything you could possibly want in an HPC platform. For every type of workload listed here, AWS has one or more instance classes to match and numerous sizes in each class, allowing you to get very granular in the provisioning of your clusters.

Speaking of provisioning, there is even a tool called CfnCluster that builds and manages High Performance Computing (HPC) clusters on AWS. Once the cluster is created, you can log into it via the master node, where you will have access to standard HPC tools such as schedulers, shared storage, and an MPI environment.

For data intensive workloads, there are a number of options to help get your data closer to your compute resources:

  • S3
  • Redshift
  • DynamoDB
  • RDS

 

EBS is even a viable option for creating large-scale parallel file systems to meet the high-volume, high-performance, and high-throughput requirements of these workloads.

HPC Workloads & 2nd Watch

2nd Watch can help you solve complex science, engineering, and business problems using applications that require high bandwidth, enhanced networking, and very high compute capabilities.

Increase the speed of research by running high performance computing (HPC) in the cloud and reduce costs by paying for only the resources that you use, without large capital investments. With 2nd Watch, you have access to a full-bisection, high-bandwidth network for tightly coupled, IO-intensive workloads, which enables you to scale out across thousands of cores for throughput-oriented applications. Contact us today to learn more about High Performance Computing (HPC).

2nd Watch Customer Success

Celgene is an American biotechnology company that manufactures drug therapies for cancer and inflammatory disorders. Read more about their cloud journey and how they went from doing research jobs that previously took weeks or months, to just hours. Read the case study.

We have also helped a global finance & insurance firm prove their liquidity time and time again in the aftermath of the 2008 recession. By leveraging the batch computing solution that we provided for them, they are now able to scale out their computations across 120,000 cores while validating their liquidity with no CAPEX investment. Read the case study.

 

– Lars Cromley, Director of Engineering, Automation, 2nd Watch


Batch Computing in the Cloud with Amazon SQS & SWF

Batch computing isn’t necessarily the most difficult thing to design a solution around, but there are a lot of moving parts to manage, and building in elasticity to handle fluctuations in demand certainly cranks up the complexity.  It might not be particularly exciting, but it is one of those things that almost every business has to deal with in some form or another.

The on-demand and ephemeral nature of the Cloud makes batch computing a pretty logical use of the technology, but how do you best architect a solution that will take care of this?  Thankfully, AWS has a number of services geared towards just that.  Amazon SQS (Simple Queue Services) and SWF (Simple Workflow Service) are both very good tools to assist in managing batch processing jobs in the Cloud.  Elastic Transcoder is another tool that is geared specifically around transcoding media files.  If your workload is geared more towards analytics and processing petabyte scale big data, then tools like EMR (Elastic Map Reduce) and Kinesis could be right up your alley (we’ll cover that in another blog).  In addition to not having to manage any of the infrastructure these services ride on, you also benefit from the streamlined integration with other AWS services like IAM for access control, S3, SNS, DynamoDB, etc.

For this article, we’re going to take a closer look at using SQS and SWF to handle typical batch computing demands.

Simple Queue Services (SQS), as the name suggests, is relatively simple.  It provides a queuing system that allows you to reliably populate and consume queues of data.  Queued items in SQS are called messages and are either a string, number, or binary value.  Messages are variable in size but can be no larger than 256KB (at the time of this writing).  If you need to queue data/messages larger than 256KB in size the best practice is to store the data elsewhere (e.g. S3, DynamoDB, Redis, MySQL) and use the message data field as a linker to the actual data.  Messages are stored redundantly by the SQS service, providing fault tolerance and guaranteed delivery.  SQS doesn’t guarantee delivery order or that a message will be delivered only once, which seems like something that could be problematic except that it provides something called Visibility Timeout that ensures once a message has been retrieved it will not be resent for a given period of time.  You (well, your application really) have to tell SQS when you have consumed a message and issue a delete on that message.  The important thing is to make sure you are doing this within the Visibility Timeout, otherwise you may end up processing single messages multiple times.  The reasoning behind not just deleting a message once it has been read from the queue is that SQS has no visibility into your application and whether the message was actually processed completely, or even just successfully read for that matter.
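As a minimal sketch of that consume-then-delete pattern, the following Python snippet uses boto3 to long-poll an existing queue, process one message at a time, and delete each message before its visibility timeout expires. The queue URL and the process_message function are hypothetical placeholders.

```python
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-batch-queue"  # hypothetical

def process_message(body):
    """Placeholder for whatever work your batch job actually performs."""
    print("processing:", body)

while True:
    # Long-poll for a message; once retrieved, it is hidden from other
    # consumers for the duration of the visibility timeout.
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,
        VisibilityTimeout=60,
    )
    for msg in resp.get("Messages", []):
        process_message(msg["Body"])
        # Delete only after successful processing, and before the visibility
        # timeout expires, so the message is not delivered again.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```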

Where SQS is designed to be data-centric and remove the burden of managing a queuing application and infrastructure, Simple Workflow Service (SWF) takes it a step further and allows you to better manage the entire workflow around the data.  While SWF implies simplicity in its name, it is a bit more complex than SQS (though that added complexity buys you a lot).  With SQS you are responsible for managing the state of your workflow and processing of the messages in the queue, but with SWF, the workflow state and much of its management is abstracted away from the infrastructure and application you have to manage.  The initiators, workers, and deciders have to interface with the SWF API to trigger state changes, but the state and logical flow are all stored and managed on the backend by SWF.  SWF is quite flexible too in that you can use it to work with AWS infrastructure, other public and private cloud providers, or even traditional on-premise infrastructure.  SWF supports both sequential and parallel processing of workflow tasks.
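To give a feel for how a worker interfaces with the SWF API, here is a heavily simplified Python activity worker using boto3; the domain, task list, and do_work function are hypothetical, and a real workflow also needs a registered domain, workflow and activity types, and a decider to drive the logical flow.

```python
import boto3

swf = boto3.client("swf")
DOMAIN = "batch-demo"                 # hypothetical, must already be registered
TASK_LIST = {"name": "batch-tasks"}   # hypothetical activity task list

def do_work(task_input):
    """Placeholder for the actual activity logic."""
    return f"processed: {task_input}"

while True:
    # Long-poll SWF for an activity task; SWF tracks the workflow state for us.
    task = swf.poll_for_activity_task(domain=DOMAIN, taskList=TASK_LIST)
    token = task.get("taskToken")
    if not token:
        continue  # the poll timed out with no work to hand out
    result = do_work(task.get("input", ""))
    # Report the result back; the decider schedules whatever comes next.
    swf.respond_activity_task_completed(taskToken=token, result=result)
```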

Note: if you are familiar with or are already using JMS, you may be interested to know that SQS provides a JMS interface through the Amazon SQS Java Messaging Library.

One major thing SWF buys you over SQS is that the execution state of the entire workflow is stored by SWF itself, separate from the initiators, workers, and deciders.  So not only do you not have to concern yourself with maintaining the workflow execution state, it is completely abstracted away from your infrastructure.  This makes the SWF architecture highly scalable in nature and inherently very fault-tolerant.

There are a number of good SWF examples and use-cases available on the web.  The SWF Developer Guide uses a classic e-commerce customer order workflow (i.e. place order, process payment, ship order, record completed order).  The SWF console also has a built in demo workflow that processes an image and converts it to either grayscale or sepia (requires AWS account login).  Either of these are good examples to walk through to gain a better understanding of how SWF is designed to work.

Contact 2nd Watch today to get started with your batch computing workloads in the cloud.

-Ryan Kennedy, Sr. Cloud Architect