Google Cloud, Open-Source and Enterprise Solutions

In 2020, a year where enterprises had to rethink their business models to stay alive, Google Cloud was able to grow 47% and capture market share. If you are not already looking at Google Cloud as part of your cloud strategy, you probably should.

Google has made conscious choices about not locking in customers with proprietary technology. Open-source technology has, for many years, been a core focus for Google, and many of Google Cloud’s solutions can integrate easily with other cloud providers.

Kubernetes (GKE), Knative (Cloud Functions), TensorFlow (Machine Learning), and Apache Beam (Data Pipelines) are some examples of cloud-agnostic tools that Google has open-sourced and which can be deployed to other clouds as well as on-premises, if you ever have a reason to do so.

Specifically, some of Google Cloud’s services and its go-to-market strategy set Google Cloud apart. Modern and scalable solutions like BigQuery, Looker, and Anthos fall into this category. They are best of class tools for each of their use cases, and if you are serious about your digital transformation efforts, you should evaluate their capabilities and understand what they can do for your business.

Three critical challenges we see from our enterprise clients here at 2nd Watch repeatedly include:

  1. How to get started with public cloud
  2. How to better leverage their data
  3. How to take advantage of multiple clouds

Let’s dive into each of these.

Foundation

Ask any architect if they would build a house without a foundation, and they would undisputedly tell you “No.” Unfortunately, many companies new to the cloud do precisely that. The most crucial step in preparing an enterprise to adopt a new cloud platform is to set up the foundation.

Future standards are dictated in the foundation, so building it incorrectly will cause unnecessary pain and suffering to your valuable engineering resources. The proper foundation, that includes your project structure aligned with your project lifecycle and environments, and a CI/CD pipeline to push infrastructure changes through code will enable your teams to become more agile while managing infrastructure in a modern way.

A foundation’s essential blocks include project structure, network segmentation, security, IAM, and logging. Google has a multi-cloud tool called Cloud Operations for logs management, reporting, and alerting, or you can ingest logs into existing tools or set up the brand of firewalls you’re most familiar and comfortable with from the Google Cloud Marketplace. Depending on your existing tools and industry regulations, compliance best practices might vary slightly, guiding you in one direction or another.

DataOps

Google has, since its inception, been an analytics powerhouse. The amount of data moving through Google’s global fiber network at any given time is incredible. Why does this matter to you? Google has now made some of its internal tools that manage large amounts of data available to you, enabling you to better leverage your data. BigQuery is one of these tools.

Being serverless, you can get started with BigQuery on a budget, and it can scale to petabytes of data without breaking a sweat. If you have managed data warehouses, you know that scaling them and keeping them performant is a task that is not easy. With BigQuery, it is.

Another valuable tool, Looker, makes visualizing your data easy. It enables departments to share a single source of truth, which breaks down data silos and enables collaboration between departments with dashboards and views for data science and business analysis.

Hybrid Cloud Solutions

Google Cloud offers several services for multi-cloud capabilities, but let’s focus on Anthos here. Anthos provides a way to run Kubernetes clusters on Google Cloud, AWS, Azure, on-premises, or even on the edge while maintaining a single pane of glass for deploying and managing your containerized applications.

With Anthos, you can deploy applications virtually anywhere and serve your users from the cloud datacenter nearest them, across all providers, or run apps at the edge – like at local franchise restaurants or oil drilling rigs – all with the familiar interfaces and APIs your development and operations teams know and love from Kubernetes.

Currently in preview, soon Google Cloud will release BigQuery Omni to the public. BigQuery Omni lets you extend the capabilities of BigQuery to the other major cloud providers. Behind the scenes, BigQuery Omni runs on top of Anthos and Google takes care of scaling and running the clusters, so you only have to worry about writing queries and analyzing data, regardless of where your data lives. For some enterprises that have already adopted BigQuery, this can mean a ton of cost savings in data transfer charges between clouds as your queries run where your data lives.

Google Cloud offers some unmatched open-source technology and solutions for enterprises you can leverage to gain competitive advantages. 2nd Watch has helped organizations overcome business challenges and meet objectives with similar technology, implementations, and strategies on all major cloud providers, and we would be happy to assist you in getting to the next level on Google Cloud.

2nd Watch is here to serve as your trusted cloud data and analytics advisor. When you’re ready to take the next step with your data, contact Us.

Learn more

Webinar: 6 Essential Tactics for your Data & Analytics Strategy

Webinar:  Building an ML foundation for Google BigQuery ML & Looker

-Aleksander Hansson, 2nd Watch Google Cloud Specialist

Amazon Forecast: Best Practices

In part one of this article, we offered an overview of Amazon Forecast and how to use it. In part two, we get into Amazon Forecast best practices:

Know your business goal

In our data and analytics practice, business value comes first. We want to know and clarify use cases before we talk about technology. Using amazon Forecast is no different. When creating a forecast, do you want to make sure you always have enough inventory on hand? Or do you want to make sure that all your inventory gets used all the time? This will drive which “quartile” you look at.

Each quartile – the defaults are 10%, 50%, and 90% – is important for its own reasons and should be looked at to give a range. What is the 50% quartile? The forecast at this quartile has a 50-50 chance of being right. The real number has a 50% chance of being higher and a 50% chance of being lower than the actual value. The forecast at the 90% quartile has a 90% chance of being higher than the actual value, while the forecast at the 10% quartile has only a 10% chance of being higher. So, if you want to make sure you sell all your inventory, use the 10% quartile forecast.

Use related time series

Amazon has made Forecast so easy to use with related time series, you really have nothing to lose to make your forecast more robust. All you have to do is make the time series time units the same as your target time series.

One way to create a related dataset is to use categorical or binary data whose future values are already known – for example, whether the future time is on a weekend or a holiday or there is a concert playing – anything that is on a schedule that you can rely on.

Even if you don’t know if something will happen, you can create multiple forecasts where you vary the future values. For example, if you want to forecast attendance at a baseball game this Sunday, and you want to model the impact of weather, you could create a feature is_raining and try one forecast with “yes, it’s raining” and another with “no, it’s not raining.”

Look at a range of forecasted values, not a singular forecasted value

Don’t expect the numbers to be precise. One of the biggest values from a forecast is knowing what the likely range of actual values will be. Then, take some time to analyze what drives that range. Can it be made smaller (more accurate) with more related data? If so, can you control any of that related data?

Visualize the results

Show historical and forecast values on one chart. This will give you a sense of how the forecast is trending. You can backfill the chart with actuals as they come in, so you can learn more about your forecast’s accuracy.

Choose a “medium term” time horizon

Your time horizon – how far in the future your forecast looks – is either 500 timesteps or ⅓ of your time series data, whichever is smaller. We recommend choosing up to a 10% horizon for starters. This will give you enough forward-looking forecasts to evaluate the usefulness of your results without taking too long.

Save your data prep code

Save the code you use to stage your data for the forecast for the future. Because you will be doing this again, you don’t want to repeat yourself. An efficient way to do this is to use PySpark code inside a Sagemaker notebook. If you end up using your forecast in production, you will eventually place that code into a Glue ETL pipeline (using PySpark), so it is best to just use PySpark out of the box.

Another advantage of using PySpark is that the utilities to load and drop csv-formatted data to/from S3 are dead simple. You will be using CSV for Forecasting work.

Interpret the results!

The guide to interpret results is here, but admittedly it is a little dense if you are not a statistician. One easy metric to look at, especially if you use multiple algorithms, is Root Mean Squared Error (RMSE). You want this as low as possible, and, in fact, Amazon will choose its winning algorithm mostly on this value.

It will take some time

How long will it take? If you do select AutoML, expect model training to take a while – at least 20 minutes for even the smallest datasets. If your dataset is large, it can take an hour or several hours. The same is true when you generate the actual forecast. So, start it in the beginning of the day so you can work with it before lunch, or near the end of your day so you can look at it in the morning.

Data prep details (for your data engineer)

  • Match the ‘forecast frequency’ to the frequency of your observation timestamps.
  • Set the demand datatype to a float prior to import (it might be an integer).
  • Get comfortable with `striptime` and `strftime` – you have only two options for timestamp format.
  • Assume all data are from the same time zone. If they are not, make them that way. Use python datetime methods.
  • Split out a validation set like this: https://github.com/aws-samples/amazon-forecast-samples/blob/master/notebooks/1.Getting_Data_Ready.ipynb
  • If using pandas dataframes, do not use the index when writing to csv.

Conclusion

If you’re ever asked to produce a forecast or predict some number in the future, you now have a robust method at your fingertips to get there. With Amazon Forecast, you have access to Amazon.com’s optimized algorithms for time series forecasting. If you can get your target data into CSV format, then you can use a forecast. Before you start, have a business goal in mind – it is essential to think about ranges of possibilities rather than a discrete number. And be sure to keep in mind our best practices for creating a forecast, such as using a “medium term” time horizon, visualizing the results, and saving your data preparation code.

If you’re ready to make better, data-driven decisions, trust your dashboards and reports, confidently bring in new sources for enhanced analysis, create a culture of DataOps, and become AI-ready, contact us to schedule a demo of our DataOps Foundation.

-Rob Whelan, Practice Director, Data & Analytics

How to Use Amazon Forecast for your Business

How to use Amazon Forecast: What Is it Good For?

How many times have you been asked to predict revenue for next month or next quarter? Do you mostly rely on your gut? Have you ever been asked to support your numbers? Cue sweaty palms frantically churning out spreadsheets.

Maybe you’ve suffered from the supply chain “bullwhip” effect: you order too much inventory, which makes your suppliers hustle, only to deliver a glut of product that you won’t need to replace for a long time, which makes your suppliers sit idle.

Wouldn’t it be nice to plan for your supply chain as tightly as Amazon.com does? With Amazon Forecast, you can do exactly that. In part one of this two-part article, I’ll provide an overview of the Amazon Forecast service and how to get started. Part two of the article will focus on best practices for using Amazon Forecast.

Amazon Forecast: The backstory

Amazon knows a thing or two about inventory planning, given its intense focus on operations. Over the years, it has used multiple algorithms for accurate forecasting. It even fine-tuned them to run in an optimized way on its cloud compute instances. Forecasting demand is important, if nothing else to get a “confidence interval” – a range where it’s fairly certain reality will fall, say, 80% of the time.

In true Amazon Web Services fashion, Amazon decided to provide its forecasting service for sale in Amazon Forecast, a managed service that takes your time series data in CSV format and spits out a forecast into the future. Amazon Forecast gives you a customizable confidence interval that you can set to 95%, 90%, 80%, or whatever percentage you need. And, you can re-use and re-train the model with actuals as they come in.

When you use Amazon Forecast, you can tell it to run up to five different state-of-the-art algorithms and pick a winner. This saves you the time of deliberating over which algorithm to use.

The best part about Amazon Forecast is that you can make the forecast more robust by adding in “related” time series – any data that you think is correlated to your forecast. For example, you might be predicting electricity demand based on macro scales such as season, but also on a micro level such as whether or not it rained that day.

Amazon Forecast: How to use

Amazon Forecast is considered a serverless service. You don’t have to manage any compute instances to use it. Since it is serverless, you can create multiple scenarios simultaneously – up to three at once. There is no reason to do this in series; you can come up with three scenarios and fire them off all at once. Additionally, Amazon Forecast is low-cost , so it is worth trying and experimenting with often. As is generally the case with AWS, you end up paying mostly for the underlying compute and storage, rather than any major premium for using the service. Like any other machine learning task, you have a huge advantage if you have invested in keeping your data orderly and accessible.

Here is a general workflow for using Amazon Forecast:

  1. Create a Dataset Group. This is just a logical container for all the datasets you’re going to use to create your predictor.
  2. Import your source datasets. A nice thing here is that Amazon Forecast facilitates the use of different “versions” of your datasets. As you go about feature engineering, you are bound to create different models which will be based on different underlying datasets. This is absolutely crucial for the process of experimentation and iteration.
  3. Create a predictor. This is another way of saying “create a trained model on your source data.”
  4. Create a forecast using the predictor. This is where you actually generate a forecast looking into the future.

To get started, stage your time series data in a CSV file in S3. You have to follow AWS’s naming convention for the column names. You also can optionally use your domain knowledge to enrich the data with “related time series.” Meaning, if you think external factors drive the forecast, you should add those data series, too. You can add multiple complementary time series.

When your datasets are staged, you create a Predictor. A Predictor is just a trained machine learning model. If you choose the “AutoML” option, Amazon will make up to five algorithms compete. It will save the results of all of the models that trained successfully (sometimes an algorithm clashes with the underlying data).

Finally, when your Predictor is done, you can generate a forecast, which will be stored on S3, which can be easily shared with your organization or with any Business Intelligence tool. It’s always a good idea to visualize the results to give them a reality check.

In part two of this article, we’ll dig into best practices for using Amazon Forecast. And if you’re interested in learning even more about transforming your organization to be more data-driven, check out our DataOps Foundation service that helps you transform your data analytics processes.

-Rob Whelan, Practice Director, Data & Analytics