Snowflake vs Amazon Redshift: What Is the Difference Between Snowflake and Amazon Redshift?

The modern business world is data-centric. As more businesses turn to cloud computing, they must evaluate and choose the right data warehouse to support their digital modernization efforts and business outcomes. Data warehouses can increase the bottom line, improve analytics, enhance the customer experience, and optimize decision-making. 

A data warehouse is a large repository of data businesses utilize for deep analytical insights and business intelligence. This data is collected from multiple data sources. A high-performing data warehouse can collect data from different operational databases and apply a uniform format for better analysis and quicker insights.

Two of the most popular data warehouse solutions are Snowflake and Amazon Web Services (AWS) Redshift. Let’s look at how these two data warehouses stack up against one another. 

What is Snowflake?

Snowflake is a cloud-based data warehousing solution that uses third-party cloud compute resources, such as Azure, Google Cloud Platform, or Amazon Web Services (AWS). It is designed to provide users with a fully managed, cloud-native database solution that can scale up or down as needed for different workloads. Snowflake separates compute from storage, a non-traditional approach to data warehousing. With this method, data remains in a central repository while compute instances are managed, sized, and scaled independently.

Snowflake is a good choice for companies that are conscious of their operational overhead and need to quickly deploy applications into production without worrying about managing hardware or software. It is also the ideal platform to use when query loads are lighter and the workload requires frequent scaling.
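To make the separation of compute and storage concrete, here is a minimal sketch in Snowflake SQL of creating and resizing a virtual warehouse (the compute layer) without touching stored data. The warehouse name and sizes are hypothetical.

    -- Create an independent compute cluster (virtual warehouse); storage is unaffected.
    CREATE WAREHOUSE IF NOT EXISTS reporting_wh
      WITH WAREHOUSE_SIZE = 'XSMALL'
           AUTO_SUSPEND   = 300    -- pause after 5 minutes of inactivity to stop compute billing
           AUTO_RESUME    = TRUE;  -- wake automatically when a query arrives

    -- Scale compute up for a heavier workload without moving or copying any data.
    ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';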

The benefits of Snowflake include:

  • Easy integration with most components of data ecosystems
  • Minimal operational overhead: companies are not responsible for installing, configuring, or managing the underlying warehouse platform
  • Simple setup and use
  • Abstracted configuration for storage and compute instances
  • Robust and intuitive SQL interface

What is Amazon Redshift?

Amazon Redshift is an enterprise data warehouse built on Amazon Web Services (AWS). It provides organizations with a scalable, secure, and cost-effective way to store and analyze large amounts of data in the cloud. Its cloud-based compute nodes enable businesses to perform large-scale data analysis and storage. 

Amazon Redshift is ideal for enterprises that require quick query outputs on large data sets. Additionally, Redshift has several options for efficiently managing its clusters using AWS CLI/Amazon Redshift Console, Amazon Redshift Query API, and AWS Software Development Kit. Redshift is a great solution for companies already using AWS services and running applications with a high query load. 

The benefits of Amazon Redshift include:

  • Seamless integration with the AWS ecosystem
  • Multiple data output formatting support
  • Easy console to extract analytics and run queries
  • Customizable data and security models

Comparing Data Warehouse Solutions

Snowflake and Amazon Redshift both offer impressive performance capabilities, like scalability across multiple servers and high availability with minimal downtime. There are some differences between the two that will determine which one is the best fit for your business.

Performance

Both data warehouse solutions harness massively parallel processing (MPP) and columnar storage, which enables advanced analytics and efficiency on massive jobs. Snowflake boasts a unique architecture that supports structured and semi-structured data. Storage, compute, and cloud services are abstracted to optimize independent performance. Redshift recently unveiled concurrency scaling, coupled with machine learning, to compete with Snowflake's approach to handling concurrent workloads.

Maintenance

Snowflake is a pure SaaS platform that doesn’t require any maintenance from the customer. All software and hardware maintenance is handled by Snowflake. Amazon Redshift’s clusters require manual maintenance from the user.

Data and Security Customization

Snowflake supports fewer customization choices in data and security. Snowflake’s security model relies on always-on encryption and strict, built-in security checks. Redshift supports data flexibility via partitioning and distribution. Additionally, Redshift allows you to tailor its end-to-end encryption and set up your own identity management system to manage user authentication and authorization.

Pricing

Both platforms offer on-demand pricing but are packaged differently. Snowflake doesn’t bundle usage and storage in its pricing structure and treats them as separate entities. Redshift bundles the two in its pricing. Snowflake tiers its pricing based on what features you need. Your company can select a tier that best fits your feature needs. Redshift rewards businesses with discounts when they commit to longer-term contracts. 

Which data warehouse is best for my business?

To determine the best fit for your business, ask yourself the following questions in these specific areas:

  • Do I want to bundle my features? Snowflake splits compute and storage, and its tiered pricing provides more flexibility to your business to purchase only the features you require. Redshift bundles compute and storage to unlock the immediate potential to scale for enterprise data warehouses. 
  • Do I want a customizable security model? Snowflake grants security and compliance options geared toward each tier, so your company’s level of protection is relevant to your data strategy. Redshift provides fully customizable encryption solutions, so you can build a highly tailored security model. 
  • Do I need JSON storage? Snowflake’s JSON storage support wins over Redshift’s. With Snowflake, you can store and query JSON with native functions (see the example after this list). With Redshift, JSON is stored as strings, making it more difficult to query and work with. 
  • Do I need more automation? Snowflake automates issues like data vacuuming and compression. Redshift requires hands-on maintenance for these sorts of tasks. 
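As a brief illustration of the JSON point above, the sketch below uses Snowflake’s VARIANT type to store raw JSON and query nested fields natively; the table and field names are hypothetical.

    -- Store raw JSON documents in a VARIANT column (hypothetical table).
    CREATE TABLE clickstream_events (payload VARIANT);

    -- Query nested JSON fields directly with path notation and casts.
    SELECT
        payload:user.id::STRING     AS user_id,
        payload:event_type::STRING  AS event_type
    FROM clickstream_events
    WHERE payload:event_type::STRING = 'purchase';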

Conclusion

A data warehouse is necessary to stay competitive in the modern business world. The two major data warehouse players – Snowflake and Amazon Redshift – are both best-in-class solutions. One product is not superior to the other, so choosing the right one for your business means identifying the one best for your data strategy.

2nd Watch is an AWS Certified Partner and an Elite Snowflake Consulting Partner. We can help you choose the right data warehouse solution for you and support your business regardless of which data warehouse you choose.

We have been recognized by AWS as a Premier Partner since 2012, as well as an audited and approved Managed Service Provider and Data and Analytics Competency partner for our outstanding customer experiences, depth and breadth of our products and services, and our ability to scale to meet customer demand. Our engineers and architects are 100% certified on AWS, holding more than 200 AWS certifications.

Our full team of certified SnowPros has proven expertise to help businesses implement modern data solutions using Snowflake. From creating a simple proof of concept to developing an enterprise data warehouse to customized Snowflake training programs, 2nd Watch will help you to utilize Snowflake’s powerful cloud-based data warehouse for all of your data needs.

Contact 2nd Watch today to help you choose the right data warehouse for your business!


Comparing Modern Data Warehouse Options

To remain competitive, organizations are increasingly moving towards modern data warehouses, also known as cloud-based data warehouses or modern data platforms, instead of traditional on-premise systems. Modern data warehouses differ from traditional warehouses in the following ways:

    • There is no need to purchase physical hardware.
    • They are less complex to set up.
    • It is much easier to prototype and provide business value without having to build out the ETL processes right away.
    • There is no capital expenditure and a low operational expenditure.
    • It is quicker and less expensive to scale a modern data warehouse.
    • Modern cloud-based data warehouse architectures can typically perform complex analytical queries much faster because of how the data is stored and their use of massively parallel processing (MPP).

Modern data warehousing is a cost-effective way for companies to take advantage of the latest technology and architectures without the upfront cost to purchase, install, and configure the required hardware, software, and infrastructure.

Comparing Modern Data Warehousing Options

  • Traditional data warehouse deployed on infrastructure as a service (IaaS): Requires the customer to install traditional data warehouse software on computers provided by a cloud provider (e.g., Azure, AWS, Google).
  • Platform as a service (PaaS): The cloud provider manages the hardware deployment, software installation, and software configuration. However, the customer is responsible for managing the environment, tuning queries, and optimizing the data warehouse software.
  • A true SaaS data warehouse: In a SaaS approach, software and hardware upgrades, security, availability, data protection, and optimization are all handled for you. The cloud provider supplies all hardware and software as part of its service, as well as aspects of managing the hardware and software.

With all of the above scenarios, the tasks of purchasing, deploying, and configuring the hardware to support the data warehouse environment fall on the cloud provider instead of the customer.

IaaS, PaaS, and SaaS – What Is the Best Option for My Organization?

Infrastructure as a service (IaaS) is an instant computing infrastructure, provisioned and managed over the internet. It helps you avoid the expense and complexity of buying and managing your own physical servers and other data center infrastructure. In other words, if you’re prepared to buy the engine and build the car around it, the IaaS model may be for you.

In the scenario of platform as a service (PaaS), a cloud provider merely supplies the hardware and its traditional software via the cloud; the solution is likely to resemble its original, on-premise architecture and functionality. Many vendors offer a modern data warehouse that was originally designed and deployed for on-premises environments. One such technology is Amazon Redshift. Amazon acquired rights to ParAccel, named it Redshift, and hosted it in the AWS cloud environment. Redshift is a highly successful modern data warehouse service. It is easy in AWS to instantiate a Redshift cluster, but then you need to complete all of the administrative tasks.

You have to reclaim space after rows are deleted or updated (the process of vacuuming in Redshift), manage capacity planning, provision compute and storage nodes, determine your distribution keys, and so on. All of the things you had to do with ParAccel (or with any traditional architecture), you have to do with Redshift.
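To give a sense of what that ongoing administration looks like, here is a minimal sketch of routine Redshift housekeeping in SQL, assuming a hypothetical sales table:

    -- Reclaim space and re-sort rows after heavy deletes or updates.
    VACUUM FULL sales;

    -- Refresh table statistics so the query planner makes good decisions.
    ANALYZE sales;

    -- Spot tables that need attention: high "unsorted" or "stats_off" values mean more maintenance.
    SELECT "table", unsorted, stats_off
    FROM svv_table_info
    ORDER BY unsorted DESC;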

Alternatively, any data warehouse solution built for the cloud using a true software as a service (SaaS) data warehouse architecture allows the cloud provider to include all hardware and software as part of its service, as well as aspects of managing the hardware and software. One such technology, which requires no management and features separate compute, storage, and cloud services that can scale and change independently, is Snowflake. It differentiates itself from IaaS and PaaS cloud data warehouses because it was built from the ground up on cloud architecture.

All administrative tasks, tuning, patching, and management of the environment fall on the vendor. In contrast to the IaaS and PaaS architectures we have seen in the market today, Snowflake has a new architecture, called multi-cluster, shared data, that essentially makes the administrative headache of maintaining the solution go away. However, that doesn’t mean it’s the absolute right choice for your organization – that’s where an experienced consulting partner like 2nd Watch comes in.

If you depend on your data to better serve your customers, streamline your operations, and lead (or disrupt) your industry, a modern data platform built on the cloud is a must-have for your organization. Contact us to learn what a modern data warehouse would look like for your organization.


Federate Amazon Redshift Access with Azure Active Directory

Single sign-on (SSO) is a tool that solves fundamental problems, especially in mid-size and large organizations with lots of users.

End users do not want to have to remember too many username and password combinations. IT administrators do not want to have to create and manage too many different login credentials across enterprise systems. It is a far more manageable and secure approach to federate access and authentication through a single identity provider (IdP).

As today’s enterprises rely on a wide range of cloud services and legacy systems, they have increasingly adopted SSO via an IdP as a best practice for IT management. All access and authentication essentially flow through the IdP wherever it is supported. Employees do not have to remember multiple usernames and passwords to access the tools they need to do their jobs. Just as important, IT teams prevent an administrative headache. They manage a single identity per user, which makes tasks like removing access when a person leaves the organization much simpler and less prone to error.

The same practice extends to AWS. As we see more customers migrate to the cloud platform, we hear a growing need for the ability to federate access to Amazon Redshift when they use it for their data warehouse needs.

Database administration used to be a more complex effort. Administrators had to figure out which groups a user belonged to, which objects a user or group was authorized to use, and other needs—in manual fashion. These user and group lists—and their permissions—were traditionally managed within the database itself, and there was often a lot of drift between the database and the company directory.

Amazon Redshift administrators face similar challenges if they opt to manage everything within Redshift itself. There is a better way, though. They can use an enterprise IdP to federate Redshift access, managing users and groups within the IdP and passing the credentials to Amazon Redshift at login.

We increasingly hear from our clients, “We use Azure Active Directory (AAD) for identity management—can we essentially bring it with us as our IdP to Amazon Redshift?”

They want to use AAD with Redshift the way they use it elsewhere, to manage their users and groups in a single place to reduce administrative complexity. With Redshift, specifically, they also want to be able to continue managing permissions for those groups in the data warehouse itself. The good news is you can do this and it can be very beneficial.

Without a solution like this, you would approach database administration in one of two alternative ways:

  1. You would provision and manage users using AWS Identity and Access Management (IAM). This means, however, you will have another identity provider to maintain—credentials, tokens, and the like—separate from an existing IdP like AAD.
  2. You would do all of this within Redshift itself, creating users (and their credentials) and groups and doing database-level management. But this creates similar challenges to legacy database management, and when you have thousands of users, it simply does not scale.
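With the federated approach, by contrast, only groups and their privileges live in the database, while the IdP asserts group membership at login. Here is a minimal sketch of that database side in Redshift SQL; the group and schema names are hypothetical.

    -- Groups are created once in Redshift; membership is asserted by the IdP at login.
    CREATE GROUP bi_readers;
    CREATE GROUP data_engineers;

    -- Permissions stay in the data warehouse, managed per group (hypothetical schema).
    GRANT USAGE ON SCHEMA analytics TO GROUP bi_readers;
    GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO GROUP bi_readers;
    GRANT ALL ON SCHEMA analytics TO GROUP data_engineers;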

Learn more about our AWS expertise here.


How to Federate Amazon Redshift Access with Okta

Single sign-on (SSO) is a tool that solves fundamental problems, especially in midsize and large organizations with lots of users.

End users do not want to have to remember too many username and password combinations. IT administrators do not want to have to create and manage too many different login credentials across enterprise systems. It is a far more manageable and secure approach to federate access and authentication through a single identity provider (IdP).

As today’s enterprises rely on a wide range of cloud services and legacy systems, they have increasingly adopted SSO via an IdP as a best practice for IT management. All access and authentication essentially flows through the IdP wherever it is supported. Employees do not have to remember multiple usernames and passwords to access the tools they need to do their jobs. Just as importantly, IT teams prevent an administrative headache: They manage a single identity per user, which makes tasks like removing access when a person leaves the organization much simpler and less prone to error.

The same practice extends to AWS. As we see more customers migrate to the cloud platform, we hear a growing need for the ability to federate access to Amazon Redshift when they use it for their data warehouse needs.

Database administration used to be a more complex effort. Administrators had to figure out which groups a user belonged to, which objects a user or group was authorized to use, and other needs—in manual fashion. These user and group lists—and their permissions—were traditionally managed within the database itself, and there was often a lot of drift between the database and the company directory.

Amazon Redshift administrators face similar challenges if they opt to manage everything within Redshift itself. There is a better way, though. They can use an enterprise IdP to federate Redshift access, managing users and groups within the IdP and passing the credentials to Amazon Redshift at login.

We increasingly hear from our clients, “We use Okta for identity management—can we essentially bring it with us as our IdP to Amazon Redshift?” They want to use Okta with Redshift the way they use it elsewhere, to manage their users and groups in a single place to reduce administrative complexity. With Redshift, specifically, they also want to be able to continue managing permissions for those groups in the data warehouse itself. The good news is you can do this and it can be very beneficial.

Without a solution like this, you would approach database administration in one of two alternative ways:

  1. You would provision and manage users using AWS Identity and Access Management (IAM). This means, however, you will have another identity provider to maintain—credentials, tokens, and the like—separate from an existing IdP like Okta.
  2. You would do all of this within Redshift itself, creating users (and their credentials) and groups and doing database-level management. But this creates similar challenges to legacy database management, and when you have thousands of users, it simply does not scale.

Our technical white paper covers how to federate access to Amazon Redshift using Okta as your IdP, passing user and group information through to the database at login. We outline the step-by-step process we follow when we implement this solution for 2nd Watch clients, including the modifications we found were necessary to ensure everything worked properly. We explain how to set up a trial account at Okta.com, build users and groups within the organization’s directory, and enable single sign-on (SSO) into Amazon Redshift.

Download the technical white paper

-Rob Whelan, Data & Analytics Practice Director


Cloud Crunch Podcast: 5 Strategies to Maximize Your Cloud’s Value – Create Competitive Advantage from your Data

AWS Data Expert, Saunak Chandra, joins today’s episode to break down the first of five strategies used to maximize your cloud’s value – creating competitive advantage from your data. We look at tactics including Amazon Redshift, RA3 node type, best practices for performance, data warehouses, and varying data structures. Listen now on Spotify, iTunes, iHeart Radio, Stitcher, or wherever you get your podcasts.

We’d love to hear from you! Email us at CloudCrunch@2ndwatch.com with comments, questions and ideas.


Amazon Redshift Stands Strong Despite Maintenance Challenges

AWS says Amazon Redshift is the world’s fastest cloud data warehouse, allowing customers to analyze petabytes of structured and semi-structured data at high speeds that allow for exploratory analysis. According to a 2018 Forrester report, Redshift is the most popular cloud data warehouse for enterprises.

To better understand how enterprises are using Redshift, 2nd Watch surveyed Redshift users at large companies. A majority of respondents (57%) said their Redshift implementation had delivered on corporate expectations, while another 26% said it had “somewhat” delivered.

With all the benefits Redshift enables, it’s no wonder tens of thousands of customers use it. Benefits like three times the performance of other cloud data warehouses and costs up to 50% lower make it an attractive service to Fortune 500 companies and startups alike, including McDonald’s, Lyft, Comcast, and Yelp, among others.

Overall Findings:

Despite its apparent success in the market, not all Redshift deployments have gone according to plan. 45% of respondents said queries stacking up in queues was a recurring problem in their Redshift deployment; 30% said some of their data analysts’ time was unproductive as a result of tuning Redshift queries; and 34% said queries were taking more than one minute to return results. Meanwhile, 33% said they were struggling to manage requests for permissions, and 25% said their Redshift costs were higher than anticipated.

Query and Queuing Learnings:

Queuing of queries is not a new problem. Redshift has a long-underutilized feature called Workload Management queues, or WLM. These queues are like different entrances to a baseball stadium. They all go to the same baseball game, but with different ways to get in. WLM queues divvy up compute and processing power among groups of users so no single “heavy” user ends up dominating the database and preventing others from accessing it. It’s common to have queries stack up in the default WLM queue. A better pattern is to have at least three or four different workload management queues (a brief sketch of routing queries to a queue follows this list):

  1. ETL processes
  2. Administration
  3. Ad hoc exploration
  4. Data loading and unloading
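As a rough sketch of how queries land in those queues, sessions can be routed to a queue by query group, and queue pressure is visible in Redshift’s system tables. The queue label below is hypothetical and assumes the cluster’s WLM configuration defines a matching query-group condition.

    -- Route the queries in this session to the WLM queue whose condition matches 'etl'
    -- (assumes the cluster's WLM configuration defines that query group).
    SET query_group TO 'etl';

    -- ... run ETL statements here ...

    RESET query_group;

    -- Check how many queries are queued versus executing right now.
    SELECT service_class, state, COUNT(*) AS queries
    FROM stv_wlm_query_state
    GROUP BY service_class, state
    ORDER BY service_class;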

As for time lost due to performance tuning, this is a tradeoff with Redshift: it is inexpensive on the compute side but takes some care and attention on the human side. Redshift is extremely high-performing when designed and implemented correctly for your use case. It’s common for Redshift users to design tables at the beginning of a data load, then not return to the design until there is a problem, after other data sets enter the warehouse. It’s a best practice to routinely run ANALYZE and have auto-vacuum turned on, and to know how your most common queries are structured, so you can sort tables accordingly.

If queries are taking a long time to run, you need to ask whether the latency is due to the heavy processing needs of the query, or whether the tables are designed inefficiently with respect to the query. For example, if a query aggregates sales by date, but the timestamp for sales is not a sort key, the query planner may have to scan far more data than necessary just to make sure it has all the right rows, therefore taking a long time. On the other hand, if your data is already nicely sorted but you have to aggregate terabytes of data into a single value, then waiting a minute or more for data is not unusual.
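For example, a minimal sketch of a table designed for that date-based aggregation (table and column names are hypothetical) declares the sale timestamp as the sort key, so the scan is limited to the relevant blocks:

    -- Sorting on the sale timestamp lets Redshift skip blocks outside the requested date range.
    CREATE TABLE daily_sales (
        sale_id  BIGINT,
        sale_ts  TIMESTAMP,
        amount   DECIMAL(12,2)
    )
    SORTKEY (sale_ts);

    -- Aggregating sales by date now scans only the blocks for the dates requested.
    SELECT TRUNC(sale_ts) AS sale_date, SUM(amount) AS total_sales
    FROM daily_sales
    WHERE sale_ts >= '2020-01-01' AND sale_ts < '2020-02-01'
    GROUP BY TRUNC(sale_ts)
    ORDER BY sale_date;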

Permissions

Some survey respondents mentioned that permissions were difficult to manage. There are several options for configuring access to Redshift. Some users create database users and groups internal to Redshift and manage authentication at the database level (for example, logging in via SQL Workbench). Others delegate permissions with an identity provider like Active Directory.

Implementation and Cost Savings

Enterprise IT directors are working to overcome their Redshift implementation challenges. 30% said they are rewriting queries, and 28% said they have compressed their data in S3 as part of a LakeHouse architecture. Respondents reported that query tuning had the greatest impact on the performance of their Redshift clusters.

When Redshift costs exceed the plan, it is a good practice to assess where the costs are coming from. Is it from storage, compute, or something else? Generally, if you are looking to save on Redshift spend, you should explore a LakeHouse architecture, which is a storage pattern that shifts data between S3 and your Redshift cluster. When you need lots of data for analysis, data is loaded into Redshift. When you don’t need that data anymore, it is moved back to S3 where storage is much cheaper. However, the tradeoff is that analysis is slower when data is in S3.
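A minimal sketch of that movement in Redshift SQL, with hypothetical table, bucket, and IAM role names:

    -- Move colder data out of the cluster to cheaper S3 storage (Parquet keeps it query-friendly).
    UNLOAD ('SELECT * FROM sales_history WHERE sale_ts < ''2019-01-01''')
    TO 's3://example-lakehouse/sales_history/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-s3'
    FORMAT AS PARQUET;

    -- When that data is needed for heavy analysis again, load it back into the cluster
    -- (assumes sales_history_restored already exists with a matching column layout).
    COPY sales_history_restored
    FROM 's3://example-lakehouse/sales_history/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-s3'
    FORMAT AS PARQUET;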

Another place to look for cost savings is in the instance size. It is possible to have over-provisioned your Redshift nodes. Look for metrics like CPU utilization; if it is consistently 25% or even 30% or lower, then you have too much headroom and might be over-provisioned.

Popular Features

Challenges aside, enterprise IT directors seem to love Redshift. The top three Redshift features, according to our survey, are query monitoring rules (cited by 44% of respondents), federated queries (35%), and custom-built ETL workflows (33%).

Query Monitoring Rules are custom rules that track bad or slow queries. Customers love Query Monitoring Rules because they are simple to write and give you great visibility into queries that will disrupt operations. You can choose obvious metrics like query_execution_time, or more subtle things like query_blocks_read, which would be a proxy for how much searching the query planner has to do to get data. Customers like these features because the reporting is central, and it frees them from having to manually check queries themselves.
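For illustration, when a rule fires, Redshift records the event, and the hits can be reviewed centrally; the rule named in the comment below is hypothetical.

    -- Review which queries tripped a monitoring rule (e.g., a hypothetical 'long_running_query'
    -- rule on query_execution_time) and what action was taken (log, hop, or abort).
    SELECT query, service_class, rule, action, recordtime
    FROM stl_wlm_rule_action
    ORDER BY recordtime DESC
    LIMIT 20;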

Federated queries allow you to bring in live, external data to join with your internal Redshift data. You can query, for example, an RDS instance in the same SQL statement as a query against your Redshift cluster. This allows for dynamic and powerful analysis that normally would take many time-consuming steps to get the data in the same place.
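For example, here is a minimal sketch of a federated query against an RDS PostgreSQL instance; the endpoint, secret, role, schema, and table names are all hypothetical.

    -- Map a live RDS PostgreSQL database into Redshift as an external schema.
    CREATE EXTERNAL SCHEMA IF NOT EXISTS orders_live
    FROM POSTGRES
    DATABASE 'ordersdb' SCHEMA 'public'
    URI 'example-rds.abc123.us-east-1.rds.amazonaws.com' PORT 5432
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-federated'
    SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:example-rds-credentials';

    -- Join live operational data with warehouse data in a single SQL statement.
    SELECT w.customer_id, w.lifetime_value, o.status
    FROM analytics.customer_summary AS w
    JOIN orders_live.orders AS o ON o.customer_id = w.customer_id
    WHERE o.created_at > CURRENT_DATE - 1;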

Finally, custom-built ETL workflows have become popular for several reasons. One, the sheer compute power sitting in Redshift makes it a natural place to run transformations: unused compute can be put to work on ongoing ETL, and you pay for that compute whether or not you use it. Two, and this is an interesting twist, Redshift has become a popular ETL tool because of its capabilities in processing SQL statements. Yes, ETL written in SQL has become popular, especially for complicated transformations and joins that would be cumbersome to write in Python, Scala, or Java.
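As a small sketch of what SQL-based ETL inside Redshift can look like (staging and target table names are hypothetical), a transformation that would be cumbersome in application code is often just a few set-based statements:

    -- Build a cleaned, deduplicated dimension directly in the warehouse.
    CREATE TABLE dim_customer_new AS
    SELECT customer_id,
           TRIM(UPPER(email))  AS email,
           MAX(last_seen_at)   AS last_seen_at
    FROM staging.raw_customers
    WHERE email IS NOT NULL
    GROUP BY customer_id, TRIM(UPPER(email));

    -- Swap the new build in atomically.
    BEGIN;
    DROP TABLE IF EXISTS dim_customer;
    ALTER TABLE dim_customer_new RENAME TO dim_customer;
    COMMIT;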

Conclusion

Redshift’s place in the enterprise IT stack seems secure, though how IT departments use the solution will likely change over time – significantly, perhaps. The reason for persisting in all the maintenance tasks listed above is that Redshift is increasingly becoming the centerpiece of a data-driven analytics program. Data volume is not shrinking; it is always growing. If you take advantage of these performance features, you will make the most of your Redshift cluster and therefore your analytics program.

Download the infographic on our survey findings.

-Rob Whelan, Data Engineering & Analytics Practice Director

 


2nd Watch Uses Redshift to Improve Client Optimization

Improving our use of Redshift: Then and now

Historically, and common among enterprise IT processes, the 2nd Watch optimization team was pulling in cost usage reports from Amazon and storing them in S3 buckets. The data was then loaded into Redshift, Amazon’s cloud data warehouse, where it could be manipulated and analyzed for client optimization. Unfortunately, the Redshift cluster filled up quickly and regularly, forcing us to spend unnecessary time and resources on maintenance and cleanup. Additionally, Redshift requires a large cluster to work with, so the process for accessing and using data became slow and inefficient.

Of course, to solve for this we could have doubled the size, and therefore the cost, of our Redshift usage, but that went against our commitment to provide cost-effective options for our clients. We also could have considered moving to a different type of node that is storage optimized, instead of compute optimized.

Lakehouse Architecture for speed improvements and cost savings

The better solution we uncovered, however, was to follow the Lakehouse Architecture pattern to improve our use of Redshift, moving faster and with more visibility, without additional storage fees. The Lakehouse Architecture is a way to strike a balance between cost and agility by selectively moving data in and out of Redshift depending on the processing speed needed for the data. Now, after a data dump to S3, we use AWS Glue crawlers to create external tables in the Glue Data Catalog. The external tables or schemas are linked to the Redshift cluster, allowing our optimization team to read the S3 data from Redshift using Redshift Spectrum.
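A minimal sketch of the Redshift side of that setup (catalog database, role, table, and column names are hypothetical): once the Glue crawler has registered the S3 data as catalog tables, one external schema makes them queryable from the cluster.

    -- Link the Glue Data Catalog database (populated by the crawler) to the cluster.
    CREATE EXTERNAL SCHEMA IF NOT EXISTS cur_spectrum
    FROM DATA CATALOG
    DATABASE 'cost_usage_reports'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-spectrum';

    -- Read cost and usage data straight from S3 via Redshift Spectrum, no loading required.
    SELECT line_item_product_code, SUM(line_item_unblended_cost) AS spend
    FROM cur_spectrum.cur_2020_12
    GROUP BY line_item_product_code
    ORDER BY spend DESC;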

Our cloud data warehouse remains tidy without dedicated clean-up resources, and we can query the data in S3 via Redshift without having to move anything. Even though we’re using the same warehouse, we’ve optimized its use for the benefit of both our clients and 2nd Watch best practices. In fact, our estimated savings are $15,000 per month, or 100% of our previous Redshift cost.

How we’re using Redshift today

With our new model and the benefits afforded to clients, 2nd Watch is applying Redshift for a variety of optimization opportunities.

Discover new opportunities for optimization. By storing and organizing data related to our clients’ AWS, Azure, and/or Google Cloud usage versus spend data, the 2nd Watch optimization team can see where further optimization is possible. Improved data access and visibility enables a deeper examination of cost history, resource usage, and any known RIs or savings plans.

Increase automation and reduce human error. The new model allows us to use DBT (data build tool) to complete SQL transforms on all data models used to feed reporting. These reports go into our dashboards and are then presented to clients for optimization. DBT empowers analysts to transform warehouse data more efficiently, and with less risk, by relying on automation instead of spreadsheets.
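For illustration, a dbt model is simply a SELECT statement saved as a .sql file; dbt handles materialization and dependency order. The model, source, and column names below are hypothetical.

    -- models/optimization/client_monthly_spend.sql (hypothetical dbt model)
    {{ config(materialized='table') }}

    SELECT
        client_id,
        DATE_TRUNC('month', usage_date) AS usage_month,
        SUM(unblended_cost)             AS total_spend
    FROM {{ ref('stg_cost_usage') }}
    GROUP BY client_id, DATE_TRUNC('month', usage_date)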

Improve efficiency from raw data to client reporting. Raw data that lives in a data lake in S3 is transformed and organized into a structured data lake that is prepared to be defined in AWS Glue Catalog tables. This gives analysts the ability to query the data from Redshift and use DBT to format the data into useful tables. From there, the optimization team can make data-based recommendations and generate complete reports for clients.

In the future, we plan on feeding a Power BI dashboard directly from Redshift, further increasing efficiency for both our optimization team and our clients.

Client benefits with Redshift optimization

  • Cost savings: Only pay for the S3 storage you use, without any storage fees from Redshift.
  • Unlimited data access: Large amounts of old data are available in the data lake, which can be joined across tables and brought into Redshift as needed.
  • Increased data visibility: Greater insight into data enables us to provide more optimization opportunities and supports decision making.
  • Improved flexibility and productivity: Analysts can get historical data within one hour, rather than waiting 1-2 weeks for requests to be fulfilled.
  • Reduced compute cost: The compute cost of loading data is shifted to Amazon EKS.

-Spencer Dorway, Data Engineer


High Performance Computing (HPC) – An Introduction

When we talk about high performance computing (HPC), we are typically trying to solve some type of problem. These problems will generally fall into one of four types:

  • Compute Intensive – A single problem requiring a large amount of computation.
  • Memory Intensive – A single problem requiring a large amount of memory.
  • Data Intensive – A single problem operating on a large data set.
  • High Throughput – Many unrelated problems that can be computed in bulk.

 

In this post, I will provide a detailed introduction to High Performance Computing (HPC) that can help organizations solve the common issues listed above.

Compute Intensive Workloads

First, let us take a look at compute intensive problems. The goal is to distribute the work for a single problem across multiple CPUs to reduce the execution time as much as possible. In order for us to do this, we need to execute steps of the problem in parallel. Each process (or thread) takes a portion of the work and performs the computations concurrently. The CPUs typically need to exchange information rapidly, requiring specialized communication hardware. Examples of these types of problems include financial modeling and risk-exposure analysis in both traditional business and healthcare use cases. This is probably the largest portion of HPC problem sets and is the traditional domain of HPC.

When attempting to solve compute intensive problems, we may think that adding more CPUs will reduce our execution time. This is not always true. Most parallel code bases have what we call a “scaling limit”. This is in no small part due to the system overhead of managing more copies, but also to more basic constraints.

This is summed up brilliantly in Amdahl’s law.

In computer architecture, Amdahl’s law is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. It is named after computer scientist Gene Amdahl, and was presented at the AFIPS Spring Joint Computer Conference in 1967.

Amdahl’s law is often used in parallel computing to predict the theoretical speedup when using multiple processors. For example, if a program needs 20 hours using a single processor core, and a particular part of the program which takes one hour to execute cannot be parallelized, while the remaining 19 hours (p = 0.95) of execution time can be parallelized, then regardless of how many processors are devoted to a parallelized execution of this program, the minimum execution time cannot be less than that critical one hour. Hence, the theoretical speedup is limited to at most 20 times (1/(1 − p) = 20). For this reason, parallel computing with many processors is useful only for very parallelizable programs.

– Wikipedia

Amdahl’s law can be formulated the following way:

S_latency(s) = 1 / ((1 − p) + p / s)

where

  • S_latency(s) is the theoretical speedup of the execution of the whole task.
  • s is the speedup of the part of the task that benefits from improved system resources.
  • p is the proportion of execution time that the part benefiting from improved resources originally occupied.

Chart Example: If 95% of the program can be parallelized, the theoretical maximum speedup using parallel computing would be 20 times.
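As a quick worked check with a finite number of processors (p = 0.95, s = 8; the value of s is illustrative):

    S_latency(8) = 1 / ((1 − 0.95) + 0.95 / 8) ≈ 5.9

So even a highly parallel program sees well under an 8x speedup on 8 processors; the serial 5% dominates long before the 20x ceiling is reached.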

Bottom line: As you create more sections of your problem that are able to run concurrently, you can split the work between more processors and thus, achieve more benefits. However, due to complexity and overhead, eventually using more CPUs becomes detrimental instead of actually helping.

There are libraries that help with parallelization, like OpenMP or Open MPI, but before moving to these libraries, we should strive to optimize performance on a single CPU, then make p as large as possible.

Memory Intensive Workloads

Memory intensive workloads require large pools of memory rather than multiple CPUs. In my opinion, these are some of the hardest problems to solve and typically require great care when building machines for your system. Coding and porting are easier because memory will appear seamless, allowing for a single system image. Optimization becomes harder, however, as machines age and drift apart, because component uniformity is lost. Traditionally, in the data center, you don’t replace every single server every three years. If we want more resources in our cluster, and we want performance to be uniform, non-uniform memory access introduces real latency. We also have to think about the interconnect between the CPU and the memory.

Nowadays, many of these concerns have been eliminated by commodity servers. We can ask for thousands of the same instance type with the same specs and hardware, and companies like Amazon Web Services are happy to let us use them.

Data Intensive Workloads

This is probably the most common workload we find today, and probably the type with the most buzz. These are known as “Big Data” workloads. Data intensive workloads are the type of workloads suitable for software packages like Hadoop or MapReduce. We distribute the data for a single problem across multiple CPUs to reduce the overall execution time. The same work may be done on each data segment, though that is not always the case. This is essentially the inverse of a memory intensive workload in that rapid movement of data to and from disk is more important than the interconnect. The problems being solved in these workloads tend to be in the life sciences (genomics) on the academic side and have a wide reach in commercial applications, particularly around user data and interactions.

High Throughput Workloads

Batch processing jobs (jobs with almost trivial operations to perform in parallel, as well as jobs with little to no inter-CPU communication) are considered high throughput workloads. In high throughput workloads, the emphasis is on throughput over a period of time rather than performance on any single problem. We distribute multiple problems independently across multiple CPUs to reduce overall execution time. These workloads should:

  • Break up naturally into independent pieces.
  • Have little or no inter-CPU communication.
  • Be performed in separate processes or threads on a separate CPU (concurrently).

 

Workloads that are compute intensive can likely be broken into high throughput jobs; however, high throughput jobs are not necessarily compute intensive.

HPC On Amazon Web Services

Amazon Web Services (AWS) provides on-demand scalability and elasticity for a wide variety of computational and data-intensive workloads, including workloads that represent many of the world’s most challenging computing problems: engineering simulations, financial risk analyses, molecular dynamics, weather prediction, and many more.   

– AWS: An Introduction to High Performance Computing on AWS

Amazon literally has everything you could possibly want in an HPC platform. For every type of workload listed here, AWS has one or more instance classes to match and numerous sizes in each class, allowing you to get very granular in the provisioning of your clusters.

Speaking of provisioning, there is even a tool called CfnCluster that builds and manages High Performance Computing (HPC) clusters on AWS. Once created, you can log into your cluster via the master node, where you will have access to standard HPC tools such as schedulers, shared storage, and an MPI environment.

For data intensive workloads, there are a number of options to help get your data closer to your compute resources:

  • S3
  • Redshift
  • DynamoDB
  • RDS

 

EBS is even a viable option for creating large scale parallel file systems to meet high-volume, high-performance, and throughput requirements of workloads.

HPC Workloads & 2nd Watch

2nd Watch can help you solve complex science, engineering, and business problems using applications that require high bandwidth, enhanced networking, and very high compute capabilities.

Increase the speed of research by running high performance computing (HPC) in the cloud and reduce costs by paying for only the resources that you use, without large capital investments. With 2nd Watch, you have access to a full-bisection, high-bandwidth network for tightly coupled, IO-intensive workloads, which enables you to scale out across thousands of cores for throughput-oriented applications. Contact us today to learn more about high performance computing (HPC).

2nd Watch Customer Success

Celgene is an American biotechnology company that manufactures drug therapies for cancer and inflammatory disorders. Read more about their cloud journey and how research jobs that previously took weeks or months now finish in just hours. Read the case study.

We have also helped a global finance & insurance firm prove their liquidity time and time again in the aftermath of the 2008 recession. By leveraging the batch computing solution that we provided for them, they are now able to scale out their computations across 120,000 cores while validating their liquidity with no CAPEX investment. Read the case study.

 

– Lars Cromley, Director of Engineering, Automation, 2nd Watch