AWS re:Invent 2020 kicked off virtually last week, and there’s a lot to unpack! Andy Jassy’s much-anticipated keynote focused on the increased need for rapid iteration and transformation, with product and service announcements that did not disappoint. Here is your recap of week 1 at AWS re:Invent 2020.
Registration was free and open to everyone for the first time this year. The featured content for the first week was Andy Jassy’s keynote and the Partner Keynote with Doug Yeum, both of which stressed the increased need for rapid iteration and transformation.
With the announcements of ECS and EKS Anywhere, AWS itself has seemingly begun to transform from attempting to be all things to all customers to taking a more multi-cloud approach. These two services are in the same vein as Anthos on Google Cloud Platform and Arc on Microsoft Azure: all of them allow customers to run containers in the environments of their choosing. AWS also announced strong read-after-write consistency for S3, which brings it in line with GCP and Azure. Andy Jassy did make sure to differentiate Amazon from Microsoft by specifically calling them out as an incumbent that customers “are fed up with and sick of.” This was part of the announcement for Babelfish, an open-source translation layer that lets PostgreSQL understand Microsoft SQL Server’s T-SQL.
Approximately 30 products and features were announced in these two keynotes, but the one that will impact almost every AWS customer is EBS gp3. gp3 lets you “provision performance apart from capacity.” On gp2, the only way to get more baseline performance is to increase the size of the volume. By switching EBS volumes to the new gp3 volume type, customers can provision IOPS separately and pay only for the volume size they need. gp3 volumes start at a baseline of 3,000 IOPS and 125 MB/s and cost roughly 20% less per GB than gp2. Most customers are planning to switch all supported volumes to gp3 as soon as possible, though there are still some questions about support for boot volumes.
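To see why the switch is so attractive, here is a back-of-envelope cost comparison. The prices and the gp2 3-IOPS-per-GB rule below are our assumptions based on late-2020 us-east-1 list pricing; check the current EBS pricing page before relying on them.

```python
# Hypothetical late-2020 us-east-1 list prices (assumptions -- verify
# against the current EBS pricing page).
GP2_PER_GB = 0.10          # $/GB-month
GP3_PER_GB = 0.08          # $/GB-month
GP3_FREE_IOPS = 3000       # baseline IOPS included in the gp3 GB price
GP3_PER_IOPS = 0.005       # $/provisioned-IOPS-month above the baseline

def gp2_cost(size_gb):
    """Monthly cost of a gp2 volume (IOPS scale with size, not price)."""
    return size_gb * GP2_PER_GB

def gp3_cost(size_gb, iops=3000):
    """Monthly cost of a gp3 volume with separately provisioned IOPS."""
    extra_iops = max(0, iops - GP3_FREE_IOPS)
    return size_gb * GP3_PER_GB + extra_iops * GP3_PER_IOPS

# A 1 TB gp2 volume gets 3,000 IOPS (3 IOPS/GB) purely from its size.
print(gp2_cost(1024))             # ~102.40/month
print(gp3_cost(1024))             # ~81.92/month for the same capacity and IOPS
print(gp3_cost(1024, iops=6000))  # doubling IOPS adds only ~15.00/month
```

Even with double the IOPS of the equivalent gp2 volume, the gp3 volume in this sketch still comes out cheaper, which is the heart of the “provision performance apart from capacity” pitch.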
Another important storage announcement was io2 Block Express volumes, which can provide up to 256,000 IOPS and 4,000 MB/s of throughput. Some customers have been waiting for a cloud storage solution that can compete with the Storage Area Networks they have used in on-premises environments, but, as AWS critic Corey Quinn pointed out, two of these volumes transferring at that throughput across two Availability Zones would cost somewhere in the neighborhood of a dollar per second.
Amazon SageMaker Pipelines, Feature Store, and Data Wrangler (not to be confused with aws-data-wrangler by AWS Labs) were also announced. These tools will be welcomed by companies that need to regularly clean data, store and retrieve associated metadata, and consistently re-deploy machine learning models. QuickSight Q was demoed and appears to be a natural language processing marvel: it can answer business intelligence questions in QuickSight without pre-defined data models.
AWS Glue has been enhanced with Elastic Views, materialized views that can be created with standard SQL and replicated across multiple data stores. Elastic Views are serverless, monitor source data for changes, and keep the target views up to date.
The big Partner Keynote highlight was the introduction of the Amazon RDS Delivery Partner program as part of the AWS Service Delivery Program. Now customers can easily find partners with the database expertise to ensure Disaster Recovery, High-availability, Cost Optimization, and Security.
There is a lot more to come in the following weeks of re:Invent, and we look forward to doing deeper dives on all of these announcements here on our blog and in our podcast, Cloud Crunch! Check back next week for a Week 2 recap.
-Joey Brown, Sr Cloud Consultant
Bo Jackson has a message for all AWS re:Invent 2020 attendees. Watch his Cameo, then visit our re:Invent 2020 sponsor page to get your free 2nd Watch sweatpants.
There is a saying in meteorology that you can be accurate more often than not if you predict tomorrow’s weather to be the same as today’s weather. Of course, that is not always the case unless you live in a place like San Diego or if you use data to make your predictions.
Forecasting in business requires data, lots of data, and it requires specialized data science skills, time, and tools to wrangle, prepare, and analyze the data.
Cloud solution providers such as AWS are enabling organizations to collect and host all that business data and provide tools to seamlessly integrate and analyze data for analytics and forecasting. Amazon Forecast is a managed service that consumes time series data and makes predictions without requiring the user to have any machine learning knowledge or experience.
Determine Use Case and Acceptable Accuracy
It is important to identify use cases and accuracy criteria before generating forecasts. As machine learning tools become easier to use, hasty predictions that are ostensibly accurate become more common.
Without a prepared use case and a definition of acceptable accuracy, the usefulness of any data analysis will be in question. A common use case is predicting customer demand for inventory items, where you want the predicted demand to exceed the actual demand 90% of the time, ensuring adequate supply without excessive overstocking.
In statistical terms, that target corresponds to a quantile loss at the 90th percentile, denoted P90.
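To make the P90 idea concrete, here is a minimal sketch of the quantile (pinball) loss that forecasting services report per quantile. The data values are made up for illustration; the key point is the asymmetry: at q=0.9, under-forecasting is penalized nine times more heavily than over-forecasting, which is why a P90 forecast tends to sit above actual demand.

```python
def quantile_loss(actuals, forecasts, q=0.9):
    """Average pinball loss at quantile q; lower is better."""
    total = 0.0
    for y, y_hat in zip(actuals, forecasts):
        if y >= y_hat:                  # under-forecast: heavy penalty at high q
            total += q * (y - y_hat)
        else:                           # over-forecast: light penalty at high q
            total += (1 - q) * (y_hat - y)
    return total / len(actuals)

actual   = [100, 120, 90]
forecast = [110, 130, 100]              # over-forecasts every period by 10
print(quantile_loss(actual, forecast, q=0.9))   # ~1.0 (cheap mistakes)
print(quantile_loss(forecast, actual, q=0.9))   # ~9.0 (costly mistakes)
```

Consistently over-forecasting by 10 units costs far less at P90 than under-forecasting by the same amount, matching the inventory use case above.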
How Amazon Forecast Works
To get started, you need to collect the historical and related time series data and upload it to Amazon Forecast. Amazon Forecast automatically inspects the data and identifies the key attributes and selects the appropriate machine learning algorithm, trains the model, and generates the forecasts. Forecasts can be visualized or the data can be exported for downstream processing.
Aggregate and Prepare Data
Time series data is often more granular than necessary for many use cases. If transaction data is collected across multiple locations (or device readings or inventory items) and the use case requires only a prediction of the total amount, the data will need to be aggregated before attempting any predictions.
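A quick sketch of that aggregation step, using hypothetical per-store transaction records and only the standard library:

```python
from collections import defaultdict

# Hypothetical raw records: (timestamp, location, units_sold)
records = [
    ("2020-11-01", "store-1", 12),
    ("2020-11-01", "store-2", 7),
    ("2020-11-02", "store-1", 9),
    ("2020-11-02", "store-2", 11),
]

totals = defaultdict(int)
for ts, _location, units in records:
    totals[ts] += units              # collapse the location dimension

series = sorted(totals.items())      # one total-demand point per day
print(series)  # [('2020-11-01', 19), ('2020-11-02', 20)]
```

The forecast then models total daily demand rather than per-store noise, which is usually easier to predict accurately.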
Inconsistencies in time series data are common and should be analyzed and corrected as much as possible before attempting any predictions. In many cases, perfect corrections are impossible due to missing or inaccurate data and methods to smooth, fill, or interpolate the data will need to be employed.
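One simple smoothing strategy is linear interpolation between the nearest known neighbors. The sketch below assumes gaps are interior (the first and last values are present); note that Amazon Forecast also offers its own built-in missing-value filling options, so this is only one way to do the prep work yourself.

```python
def interpolate_gaps(values):
    """Replace interior None entries with linearly interpolated values.

    Assumes the first and last entries are present.
    """
    filled = list(values)
    for i, v in enumerate(filled):
        if v is None:
            lo = i - 1                     # nearest known value to the left
            hi = i + 1
            while filled[hi] is None:      # nearest known value to the right
                hi += 1
            step = (filled[hi] - filled[lo]) / (hi - lo)
            filled[i] = filled[lo] + step * (i - lo)
    return filled

print(interpolate_gaps([10.0, None, None, 16.0]))  # [10.0, 12.0, 14.0, 16.0]
```

Whether interpolation, zero-fill, or mean-fill is appropriate depends on what a gap means in your data (a missed reading versus a day with no sales are very different things).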
Amazon Forecast Forecasts
Generating a forecast from Amazon Forecast is much easier than doing the prerequisite work. Amazon Forecast provides half a dozen predefined algorithms and an option for AutoML, which will evaluate all algorithms and choose one it determines to fit best.
Simple CSV files are uploaded, a Predictor is trained, and a Forecast is created. The end-to-end process usually takes several hours depending on the size of the data and the parameter settings. Once generated, you can see the results in a Forecast Lookup or export them back to CSV to be consumed by a data visualization service such as Amazon QuickSight.
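As a rough illustration of the input shape, here is how a headerless target time series CSV might be assembled with the standard library. The timestamp/item_id/value column order shown is an assumption for illustration; the actual field order is whatever you declare in your dataset’s schema.

```python
import csv
import io

# Hypothetical target time series rows: (timestamp, item_id, value)
rows = [
    ("2020-11-01 00:00:00", "widget-a", 19),
    ("2020-11-02 00:00:00", "widget-a", 20),
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)   # headerless: column meaning comes from the schema
csv_text = buf.getvalue()
print(csv_text)
```

In practice this file would be written to S3 and referenced by a dataset import job, but the schema-matching shown here is where most import errors originate.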
If you skipped the prerequisites, you would look at the Forecast results and ask, “Now what?” If your results satisfy your use case and accuracy requirements, you can start working on other use cases and/or create an Amazon Forecast pipeline that delivers regular predictions.
Improving Forecast Accuracy
The most important factor affecting forecast accuracy is the quality and quantity of the data. Larger datasets are the first thing that should be tried. Data analysis might also be needed to ensure that the data is consistent.
If the generated Forecast has not satisfied the accuracy requirements you defined, it’s time to adjust some of the (hyper)parameters, include additional data, or both.
Parameters and Hyperparameters
Reducing the forecast horizon can increase accuracy; it is easier to make shorter-term predictions. Manually setting the Predictor’s algorithm to DeepAR+ enables an advanced option called HPO, or Hyperparameter Optimization. Enabling HPO causes the Predictor to train multiple times with different hyperparameter values in an attempt to increase accuracy.
Related Time Series and Metadata
Related Time Series data (e.g., weather data, holidays) and Metadata (e.g., sub-categories of inventory items) can be added to the Dataset Group to attempt to increase accuracy. Matching item_ids and making sure beginning and ending timestamps align with the dataset adds overhead that may not be necessary, depending on your accuracy requirements.
For more details on using Amazon Forecast, watch this video on how to build accurate forecasting models.
-Joey Brown, Sr Cloud Consultant
Catch our re:Invent Breakout Session ‘Reality Check: Moving the Data Lake from Storage to Strategic’
Watch our AWS re:Invent session ANT283-S “Reality Check: Moving the Data Lake from Storage to Strategic” for your chance to win a Sony PlayStation 5!
Many organizations have created data lakes to store both relational and non-relational data and enable faster decision making. All too often, these data lakes move from proof-of-concept to production and quickly become just another data repository, never achieving the strategic business relevance they were meant to have. Watch Reality Check: Moving Your Data Lake from Storage to Strategic to learn how common data lake management approaches lead to failure, discover the steps needed to build strategic importance, and see how cloud native creates efficiency along with a long-term competitive advantage.
Winning is easy:
- Watch our breakout session, ‘ANT283-S Reality Check: Moving the Data Lake from Storage to Strategic’
- Share what you learned from the session on social by 12/18
- Tag @2nd Watch, and you’ll be entered into the drawing held on 12/22!
AWS re:Invent 2020 is off to a great virtual start, and we want to meet you here! Visit the 2nd Watch re:Invent Sponsor Page now through December 18 to speak with one of our cloud experts, watch our session “ANT283-S Reality Check: Moving the Data Lake from Storage to Strategic” (and don’t forget to comment on the session on social for your chance to win a Sony PlayStation 5), access a ton of downloadable content, and claim your free 2nd Watch re:Invent sweatpants.
Your Trusted Cloud Advisor
As a cloud native AWS Premier Partner, we orchestrate your cloud transformation from strategy to execution, fueling business growth. Our focus is on enabling accelerated cloud migration, application modernization, IT optimization and data engineering to facilitate true business transformation.
When you deployed Redshift a few years ago, your new data lake was going to allow your organization to make better, faster, more informed business decisions. It would break down data silos allowing your Data Scientists to have greater access to all data sources, quickly, enabling them to be more efficient in delivering consumable data insights.
Now that some time has passed, though, there is a good chance your data lake is no longer returning the value it initially did. It has turned into a catch-all for your data, maybe even a giant data mess, with clusters filling up so quickly that you constantly have to delete data or scale up. Teams blame one another for consuming too many resources, even though their workloads are split and shouldn’t impact one another. Queries run slowly because the table structure chosen at initial deployment no longer fits the business and the data you generate today. All of this leaves your expensive Data Scientists and Analysts less productive than when you first deployed Redshift.
Keep in mind, though, that the Redshift you deployed a few years ago is not the same Redshift today. We all know that AWS is continuously innovating, but over the last 2 years they have added more than 200 new features to Redshift that can address many of these problems, such as:
- Utilizing AQUA nodes, which can deliver a 10x performance improvement
- Refreshing instance families that can lower your overall spend
- Federated query, which allows you to query across Redshift, S3, and relational database services to come up with aggregated data sets, which can then be put back into the data lakes to be consumed by other analytic services
- Concurrency scaling, which automatically adds and removes capacity to handle unpredictable demand from thousands of concurrent users, so you do not take a performance hit
- The ability to take advantage of machine learning with automatic workload management (WLM) to dynamically manage memory and concurrency, helping maximize query throughput
As a matter of fact, clients repeatedly tell us there have been so many innovations in Redshift that it’s hard for them to determine which ones will benefit them, let alone keep track of them all.
Having successfully deployed and maintained AWS Redshift for years here at 2nd Watch, we have packaged our best practice learnings to deliver the AWS Redshift Health Assessment. The AWS Redshift Health Assessment is designed to ensure your Redshift Cluster is not inhibiting the productivity of your valuable and costly specialized resources.
At the end of our 2-3 week engagement, we deliver a lightweight prioritized roadmap of the best enhancements to be made to your Redshift cluster that will deliver immediate impact to your business. We will look for ways to not only improve performance but also save you money where possible, as well as analyze your most important workloads to ensure you have an optimal table design deployed utilizing the appropriate and optimal Redshift features to get you the results you need.
AWS introduced the concept of a Lake House analogy to better describe what Redshift has become. A Lake House is prime real estate that everyone wants because it gives you a view of something beautiful, with limitless opportunities of enjoyment. With the ability to use a common query or dashboard across your data warehouses and multiple data lakes, like a lake house, Redshift provides you the beautiful sight of all your data and limitless possibilities. However, every lake house needs ongoing maintenance to ensure it brings you the enjoyment you desired when you first purchased it and a lake house built with Redshift is no different.
Contact 2nd Watch today to maximize the value of your data, like you intended when you deployed Redshift.
-Rob Whelan, Data Engineering & Analytics Practice Manager
AWS says Amazon Redshift is the world’s fastest cloud data warehouse, allowing customers to analyze petabytes of structured and semi-structured data at high speeds that allow for exploratory analysis. According to a 2018 Forrester report, Redshift is the most popular cloud data warehouse for enterprises.
To better understand how enterprises are using Redshift, 2nd Watch surveyed Redshift users at large companies. A majority of respondents (57%) said their Redshift implementation had delivered on corporate expectations, while another 26% said it had “somewhat” delivered.
With all the benefits Redshift enables, it’s no wonder tens of thousands of customers use it. Claimed benefits like three times the performance of any other cloud data warehouse and costs up to 50% lower than other cloud data warehouses make it an attractive service to Fortune 500 companies and startups alike, including McDonald’s, Lyft, Comcast, and Yelp, among others.
Despite its apparent success in the market, not all Redshift deployments have gone according to plan. 45% of respondents said queries stacking up in queues was a recurring problem in their Redshift deployment; 30% said some of their Data Analyst’s time was unproductive as a result of tuning Redshift queries; and 34% said queries were taking more than one minute to return results. Meanwhile, 33% said they were struggling to manage requests for permissions, and 25% said their Redshift costs were higher than anticipated.
Query and Queuing Learnings:
Queuing of queries is not a new problem. Redshift has a long-underutilized feature called Workload Management queues, or WLM. These queues are like different entrances to a baseball stadium: they all go to the same game, but with different ways to get in. WLM queues divvy up compute and processing power among groups of users so no single “heavy” user ends up dominating the database and locking others out. It’s common to see queries stack up in the default WLM queue. A better pattern is to define at least three or four workload management queues:
- ETL processes
- Ad hoc exploration
- Data loading and unloading
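A sketch of what a manual WLM configuration along those lines might look like. The queue names here are made up, and the field names mirror the documented WLM JSON shape as we understand it; verify against the Redshift WLM documentation before applying anything via a parameter group.

```python
import json

# Illustrative manual WLM configuration: three named queues plus the
# default queue (the last entry, with no query_group, catches the rest).
wlm_config = [
    {"query_group": ["etl"],   "query_concurrency": 3, "memory_percent_to_use": 40},
    {"query_group": ["adhoc"], "query_concurrency": 5, "memory_percent_to_use": 30},
    {"query_group": ["loads"], "query_concurrency": 2, "memory_percent_to_use": 20},
    {"query_concurrency": 2, "memory_percent_to_use": 10},
]

wlm_json = json.dumps(wlm_config)
print(wlm_json)
```

Sessions then opt in with `SET query_group TO 'etl';` so their queries land in the right entrance to the stadium, and no single workload can starve the others of memory or concurrency slots.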
As for time lost due to performance tuning, this is a tradeoff with Redshift: it is inexpensive on the compute side but takes some care and attention on the human side. Redshift is extremely high-performing when designed and implemented correctly for your use case. It’s common for Redshift users to design tables at the beginning of a data load, then not return to the design until there is a problem, after other data sets enter the warehouse. It’s a best practice to routinely run ANALYZE and have auto-vacuum turned on, and to know how your most common queries are structured, so you can sort tables accordingly.
If queries are taking a long time to run, you need to ask whether the latency is due to the heavy processing needs of the query, or if the tables are designed inefficiently with respect to the query. For example, if a query aggregates sales by date, but the timestamp for sales is not a sort key, the query planner might have to traverse many different tables just to make sure it has all the right data, therefore taking a long time. On the other hand, if your data is already nicely sorted but you have to aggregate terabytes of data into a single value, then waiting a minute or more for data is not unusual.
Some survey respondents mentioned that permissions were difficult to manage. There are several options for configuring access to Redshift. Some users create database users and groups internal to Redshift and manage authentication at the database level (for example, logging in via SQL Workbench). Others delegate permissions with an identity provider like Active Directory.
Implementation and Cost Savings
Enterprise IT directors are working to overcome their Redshift implementation challenges: 30% said they are rewriting queries, and 28% said they have compressed their data in S3 as part of a LakeHouse architecture. Respondents reported that query tuning had the greatest impact on the performance of their Redshift clusters.
When Redshift costs exceed the plan, it is a good practice to assess where the costs are coming from. Is it from storage, compute, or something else? Generally, if you are looking to save on Redshift spend, you should explore a LakeHouse architecture, which is a storage pattern that shifts data between S3 and your Redshift cluster. When you need lots of data for analysis, data is loaded into Redshift. When you don’t need that data anymore, it is moved back to S3 where storage is much cheaper. However, the tradeoff is that analysis is slower when data is in S3.
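The arithmetic behind the LakeHouse tradeoff is simple. The prices below are assumptions (roughly late-2020 us-east-1 list prices, with the Redshift figure a rough effective per-GB number that bundles in compute); plug in your own numbers from the current pricing pages.

```python
# Assumed prices -- replace with figures from the current pricing pages.
S3_STANDARD_PER_GB  = 0.023   # $/GB-month, S3 Standard
REDSHIFT_EFF_PER_GB = 0.25    # rough effective $/GB-month on a Redshift cluster

cold_data_tb = 10             # infrequently queried data kept "just in case"
gb = cold_data_tb * 1024

monthly_savings = gb * (REDSHIFT_EFF_PER_GB - S3_STANDARD_PER_GB)
print(f"Moving {cold_data_tb} TB of cold data to S3 saves ~${monthly_savings:,.0f}/month")
```

The catch, as noted above, is that queries against S3-resident data (via Spectrum or reload) are slower, so the pattern pays off only for data you rarely touch.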
Another place to look for cost savings is in the instance size. It is possible to have over-provisioned your Redshift nodes. Look for metrics like CPU utilization; if it is consistently 25% or even 30% or lower, then you have too much headroom and might be over-provisioned.
Challenges aside, enterprise IT directors seem to love Redshift. The top three Redshift features, according to our survey, are query monitoring rules (cited by 44% of respondents), federated queries (35%), and custom-built ETL workflows (33%).
Query Monitoring Rules are custom rules that track bad or slow queries. Customers love Query Monitoring Rules because they are simple to write and give you great visibility into queries that will disrupt operations. You can choose obvious metrics like query_execution_time, or more subtle things like query_blocks_read, which would be a proxy for how much searching the query planner has to do to get data. Customers like these features because the reporting is central, and it frees them from having to manually check queries themselves.
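To show the shape of such a rule, here is a toy evaluation of a QMR-style predicate in plain Python. Real Query Monitoring Rules live inside the WLM configuration and are enforced by Redshift itself; the metric names below come from the discussion above, but the exact rule structure should be confirmed against the QMR documentation.

```python
# Illustrative rule: abort any query running longer than 60 seconds.
rule = {"metric_name": "query_execution_time",
        "operator": ">", "value": 60, "action": "abort"}

def rule_fires(rule, query_metrics):
    """Return True if the query's observed metric trips the rule."""
    observed = query_metrics[rule["metric_name"]]
    if rule["operator"] == ">":
        return observed > rule["value"]
    raise ValueError("unsupported operator in this sketch")

print(rule_fires(rule, {"query_execution_time": 95}))  # True  -> take action
print(rule_fires(rule, {"query_execution_time": 4}))   # False -> leave it alone
```

The appeal is exactly what respondents described: one central, declarative place to catch runaway queries instead of eyeballing the console.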
Federated queries allow you to bring in live, external data to join with your internal Redshift data. You can query, for example, an RDS instance in the same SQL statement as a query against your Redshift cluster. This allows for dynamic and powerful analysis that normally would take many time-consuming steps to get the data in the same place.
Finally, custom-built ETL workflows have become popular for several reasons. One, the sheer compute power sitting in Redshift makes it a very popular source for compute resources. Unused compute can be used for ongoing ETL. You would have to pay for this compute whether or not you use it. Two, and this is an interesting twist, Redshift has become a popular ETL tool because of its capabilities in processing SQL statements. Yes, ETL written in SQL has become popular, especially for complicated transformations and joins that would be cumbersome to write in Python, Scala, or Java.
Redshift’s place in the enterprise IT stack seems secure, though how IT departments use the solution will likely change over time – significantly, perhaps. The reason for persisting in all the maintenance tasks listed above is that Redshift is increasingly becoming the centerpiece of a data-driven analytics program. Data volume is not shrinking; it is always growing. If you take advantage of these performance features, you will make the most of your Redshift cluster and therefore your analytics program.
-Rob Whelan, Data Engineering & Analytics Practice Director
Well, it’s that time of year again. Where I live, the leaves are changing color, temperature is diving, and turkeys are starting to fear for their lives. These signs all point to AWS re:Invent being right around the corner. This year, AWS re:Invent will kick off its 9th annual conference on November 30th, 2020 with a couple major caveats. It will be 3 weeks long, 100% virtual, and free to all. This year will be a marathon, not a sprint, so make sure to pace yourself. As always, 2nd Watch is here to help prepare you with what we think you can expect this year, so let’s get to it!
At the time I am writing this article, things are a bit unclear on how everything will work at re:Invent this year. We can definitely count on live keynotes from the likes of Andy Jassy, Peter DeSantis, Werner Vogels and more. For the hundreds of sessions, it’s unclear if the sessions will be broadcast live at a scheduled time and then rebroadcast, or if everything will be on-demand. We do know sponsor-provided sessions will be pre-recorded and available on-demand on the sponsor pages. I am sure this will be fleshed out once the session catalog is released in mid-November. Regardless, all sessions are recorded and then posted later on various platforms such as the AWS YouTube channel. Per the AWS re:Invent 2020 FAQs, “You can continue logging in to the re:Invent platform and viewing sessions until the end of January 2021. After January, you will be able to view the sessions on the AWS YouTube channel.”
AWS expects a mind-boggling 250,000+ people to register this year, so we can all temper our expectations for getting that famous re:Invent hoodie. The event will be content-focused, and each sponsor will get its own sponsor page, the equivalent of a sponsor booth. Sponsor pages are sure to have downloadable content, on-demand videos, and other goodies available to attendees, but again, how you’re going to fill up your swag bag is yet to be seen. Let’s move on to our advice and predictions, and then we will take the good with the bad to wrap it up.
- Be humble – Hold off on boasting to your colleagues this year that you are part of the elite that get sent to re:Invent. News flash: they are going this year too, along with 250,000+ other people.
- Pace yourself – You will not be able to attend every session that you are interested in. Pick one learning track and try to get the most out of it.
- No FOMO – Fear not, all the sessions are recorded and posted online for you to view on-demand, at your convenience.
- Stay connected – Take advantage of any virtual interactive sessions that you can to meet new people and make new connections in the industry.
- Get hands on – Take advantage of the Jams and GameDays to work with others and get hands-on experience with AWS services.
Let’s take a quick look at some of the predictions our experts at 2nd Watch have for service release announcements this year at re:Invent.
- AWS Glue will have some serious improvements around the graphical user interface.
- Better datatype detection and automatic handling of datatype conflicts.
- Glue job startup will also speed up significantly.
- Amazon SageMaker algorithms will become easier to use – data integration will be smoother and less error-prone.
- AWS will release managed implementations of text generation algorithms like GPT-2.
- Some kind of automatic visualization or analysis feature for Amazon Redshift will be released, so you don’t have to build analyses from scratch every time.
- Expanded/enhanced GPU instance type offerings will be made available.
- Lambda integration with AWS Outposts will be made available.
- Aurora Serverless goes fully on-demand with pay-per-request and, potentially, global deployment.
Make sure to check back here on the 2nd Watch blog or the 2nd Watch Facebook, LinkedIn and Twitter pages for weekly re:Invent recaps. We’ll also be live-tweeting AWS announcements during the keynotes, so keep your eye on our Twitter feed for all the highlights!
Finally, we thought it would be fun to highlight some of the good that comes with the changes this year.
Take the Good with the Bad
Nothing is as we are used to this year, and re:Invent falls right in line with that sentiment. We are eager with anticipation of a great event, nevertheless, and hope you are too. Since we won’t get to see you in-person at our booth this year, please visit our pre-re:Invent site at offers.2ndwatch.com/aws-reinvent-2020 now to pre-schedule a meeting with us and find out about all the fun giveaways and contests we have this year. Don’t miss out on your free 2nd Watch re:Invent sweatpants, a chance to win a Sony PlayStation 5, a great virtual session on taking your data lake from storage to strategic, and a lot more! Then, make sure to visit our re:Invent sponsor page 11/30-12/18 on the re:Invent portal.
We would love to meet you and discuss all the possibilities for your cloud journey. Have a fantastic re:Invent 2020 and stay safe!
-Dustin Snyder, Director of Cloud Infrastructure & Architecture
A colleague of mine postulated that the IT department would eventually go the way of the dinosaur. He put forward that as the Everything-as-a-Service model becomes the norm, IT would no longer provide meaningful value to the business. My flippant response was to point out that they have been saying mainframes are dead for decades.
This of course doesn’t get to the heart of the conversation. What is the future role of IT as we move towards the use of Everything-as-a-Service? Will marketing, customer services, finance and other departments continue to look to IT for their application deployment? Will developers and engineers move to containerization to build and release code, turning to a DevOps model where the Ops are simply a cloud provider?
We’ve already proven that consumers can adapt to very complex applications. Every day when you deploy and use an application on your phone, you are operating at a level of complexity that once required IT assistance. And yes, the development of intuitive UXs has enabled this trend; however, the same principle is occurring at the enterprise level. Cloud, in many ways, has already brought this simplification forward. It has democratized IT.
So, what is the future of IT? What significant disruptions to operations processes will occur through democratization? I liken it to the evolution of eSports (Madden NFL). You don’t manage each player on the field. You choose the skill players for the team, then run the plays. The only true decision you make is which offensive play to run, or which defensive scheme to set. In IT terms, you review the field (operations), orchestrate the movement of resources, and ensure the continuation of the applications looking for any potential issues and resolving them before they become an issue. This is the future of IT.
What are the implications? I believe IT evolves into a higher-order (read: more business value) function. They enable digital transformation, not from a resource perspective, but from a strategic business empowerment perspective. They get out of the job that keeps them from being strategic (the tactical day-to-day of managing resources) and move to enabling and implementing business strategy. However, that takes a willingness to allocate, at a granular level, how IT contributes to business value. Achieving this might require reengineering teams, architectures, and budgets to tightly link specific IT contributions to specific business outputs. The movement to modern cloud technology supports this fundamental shift and, over time, will start to solve chronic problems of underfunding and lack of support for ongoing improvement. IT is not going the way of the dinosaur. It is becoming the fuel that enables business to grow strategically.
Want more tips on how to empower IT to contribute to growing your business strategy? Contact us
-Michael Elliott, Sr Director of Product Marketing