Build State-of-the-Art Forecasts with Amazon Forecast in 15 Minutes

There is a saying in meteorology that you can be right more often than not simply by predicting that tomorrow’s weather will be the same as today’s. Of course, that approach only works consistently if you live in a place like San Diego, or if you use data to make your predictions.

Forecasting in business requires data, lots of data, along with specialized data science skills, time, and tools to wrangle, prepare, and analyze that data.

Cloud solution providers such as AWS enable organizations to collect and host all that business data, and they provide tools to seamlessly integrate and analyze it for analytics and forecasting. Amazon Forecast is a managed service that consumes time series data and makes predictions without requiring the user to have any machine learning knowledge or experience.

Prerequisites

Determine Use Case and Acceptable Accuracy

It is important to identify use cases and accuracy criteria before generating forecasts. As machine learning tools become easier to use, hasty predictions that are ostensibly accurate become more common.

Without a prepared use case and a definition of acceptable accuracy, the usefulness of any data analysis will be in question. A common use case is predicting customer demand for inventory items to ensure adequate supply. For inventory, an acceptable accuracy criterion might be that the predicted demand exceeds the actual demand 90% of the time, ensuring adequate supply without excessive overstocking.

In statistical terms, that target corresponds to the 90th percentile, denoted P90, and models are evaluated against it with a quantile loss.
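To make that concrete, here is a minimal sketch, in plain Python with illustrative numbers, of the quantile (“pinball”) loss that sits behind percentile targets like P90:

```python
def quantile_loss(actual: float, predicted: float, tau: float) -> float:
    """Penalize under-prediction by tau and over-prediction by (1 - tau)."""
    diff = actual - predicted
    return max(tau * diff, (tau - 1) * diff)

# At tau = 0.9, under-predicting demand costs 9x more than over-predicting,
# so the optimal P90 forecast sits above actual demand ~90% of the time.
print(quantile_loss(actual=100, predicted=80, tau=0.9))   # 18.0 (understock: costly)
print(quantile_loss(actual=100, predicted=120, tau=0.9))  # 2.0  (overstock: cheap)
```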

How Amazon Forecast Works

To get started, you collect your historical and related time series data and upload it to Amazon Forecast. The service automatically inspects the data, identifies its key attributes, selects an appropriate machine learning algorithm, trains the model, and generates the forecasts. Forecasts can be visualized, or the data can be exported for downstream processing.

Aggregate and Prepare Data

Time series data is often more granular than necessary for many use cases. If transaction data is collected across multiple locations (or device readings or inventory items) and the use case requires only a prediction of the total amount, the data will need to be aggregated before attempting any predictions.

Inconsistencies in time series data are common and should be analyzed and corrected as much as possible before attempting any predictions. In many cases, perfect corrections are impossible due to missing or inaccurate data, and methods to smooth, fill, or interpolate the data will need to be employed.
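As a sketch of both steps, the pandas snippet below aggregates transaction-level records into daily totals and linearly interpolates any missing days; the file and column names are illustrative assumptions, not Amazon Forecast requirements:

```python
import pandas as pd

# Load transaction-level data (illustrative file and column names).
df = pd.read_csv("transactions.csv", parse_dates=["timestamp"])

# Aggregate: collapse individual transactions into one daily total.
# min_count=1 leaves empty days as NaN instead of 0 so they can be filled.
daily = df.resample("D", on="timestamp")["demand"].sum(min_count=1)

# Fill: linearly interpolate the days with no recorded transactions.
daily = daily.interpolate(method="linear")

daily.reset_index().to_csv("daily_demand.csv", index=False)
```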

Generating Forecasts with Amazon Forecast

Generating a forecast from Amazon Forecast is much easier than doing the prerequisite work. Amazon Forecast provides half a dozen predefined algorithms and an option for AutoML, which evaluates all of the algorithms and chooses the one it determines fits best.

Simple CSV files are uploaded, a Predictor is trained, and a Forecast is created. The end-to-end process usually takes several hours depending on the size of the data and the parameter settings. Once generated, you can see the results in a Forecast Lookup or export them back to CSV to be consumed by a data visualization service such as Amazon QuickSight.

If you skipped the prerequisites, you would look at the Forecast results and ask, “Now what?” If your results satisfy your use case and accuracy requirements, you can start working on other use cases and/or create an Amazon Forecast pipeline that delivers regular predictions.

Improving Forecast Accuracy

The most important factor affecting forecast accuracy is the quality and quantity of the data. Training on a larger dataset is the first thing to try. Data analysis might also be needed to ensure that the data is consistent.

If the generated Forecast has not satisfied the accuracy requirements you defined, it’s time to adjust some of the (hyper)parameters, include additional data, or both.

Parameters and Hyperparameters

Reducing the forecast horizon can increase accuracy; it’s easier to make shorter-term predictions. Manually setting the Predictor’s algorithm to DeepAR+ enables an advanced option called HPO, which stands for hyperparameter optimization. Enabling HPO causes the Predictor to train multiple times with different hyperparameter settings in an attempt to increase accuracy.
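Sketched with boto3, manually selecting DeepAR+ with HPO enabled looks roughly like this; the predictor name, horizon, and dataset group ARN are illustrative placeholders:

```python
import boto3

forecast = boto3.client("forecast")

forecast.create_predictor(
    PredictorName="demand_deepar_hpo",  # illustrative name
    AlgorithmArn="arn:aws:forecast:::algorithm/Deep_AR_Plus",
    PerformAutoML=False,  # we are choosing the algorithm ourselves
    PerformHPO=True,      # run extra training jobs to tune hyperparameters
    ForecastHorizon=30,
    InputDataConfig={
        "DatasetGroupArn": "arn:aws:forecast:us-east-1:123456789012:dataset-group/demand_dsg"
    },
    FeaturizationConfig={"ForecastFrequency": "D"},
)
```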

Related Time Series and Metadata

Related Time Series data (e.g., weather data, holidays) and Metadata (e.g., sub-categories of inventory items) can be added to the Dataset Group to attempt to increase accuracy. Matching item_ids and aligning beginning and ending timestamps with the target dataset adds overhead that may not be worthwhile, depending on your accuracy requirements.
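Sanity-checking that alignment before importing can save a failed training run. A minimal pandas sketch, with illustrative file and column names:

```python
import pandas as pd

target = pd.read_csv("target.csv", parse_dates=["timestamp"])
related = pd.read_csv("related_weather.csv", parse_dates=["timestamp"])

# Every item_id in the target should appear in the related series.
missing = set(target["item_id"]) - set(related["item_id"])
assert not missing, f"related series is missing items: {missing}"

# The related series must start no later than the target series and, for most
# algorithms, must extend through the end of the forecast horizon.
assert related["timestamp"].min() <= target["timestamp"].min()
assert related["timestamp"].max() >= target["timestamp"].max()
```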

For more details on using Amazon Forecast, watch this video on how to build accurate forecasting models.

-Joey Brown, Sr Cloud Consultant


Amazon Forecast: Best Practices

In part one of this article, we offered an overview of Amazon Forecast and how to use it. In part two, we get into Amazon Forecast best practices:

Know your business goal

In our data and analytics practice, business value comes first. We want to know and clarify use cases before we talk about technology. Using Amazon Forecast is no different. When creating a forecast, do you want to make sure you always have enough inventory on hand? Or do you want to make sure that all your inventory gets used all the time? This will drive which quantile you look at.

Each quantile – the defaults are 10%, 50%, and 90% – is important for its own reasons and should be examined to establish a range. What is the 50% quantile? The actual value has a 50% chance of coming in above the forecast at this quantile and a 50% chance of coming in below it. The forecast at the 90% quantile has a 90% chance of being higher than the actual value, while the forecast at the 10% quantile has only a 10% chance of being higher. So, if you want to make sure you sell all your inventory, use the 10% quantile forecast.

Use related time series

Amazon has made Forecast so easy to use with related time series that you have nothing to lose by making your forecast more robust. All you have to do is make the related series use the same time units as your target time series.

One way to create a related dataset is to use categorical or binary data whose future values are already known – for example, whether the future time is on a weekend or a holiday or there is a concert playing – anything that is on a schedule that you can rely on.

Even if you don’t know if something will happen, you can create multiple forecasts where you vary the future values. For example, if you want to forecast attendance at a baseball game this Sunday, and you want to model the impact of weather, you could create a feature is_raining and try one forecast with “yes, it’s raining” and another with “no, it’s not raining.”
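A hedged sketch of that scenario trick: generate two related-series files that differ only in the assumed future value of is_raining, then train and compare forecasts. All names and dates here are illustrative:

```python
import pandas as pd

# Seven future days covering the forecast horizon (dates are illustrative).
future = pd.date_range("2021-06-01", periods=7, freq="D")

for scenario, raining in [("rain", 1), ("no_rain", 0)]:
    df = pd.DataFrame({
        "timestamp": future,
        "item_id": "stadium_attendance",
        "is_raining": raining,  # the only value that differs between scenarios
    })
    # In practice you would append these rows to the historical related series
    # before importing each file and generating one forecast per scenario.
    df.to_csv(f"related_{scenario}.csv", index=False)
```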

Look at a range of forecasted values, not a singular forecasted value

Don’t expect the numbers to be precise. One of the biggest benefits of a forecast is knowing the likely range of actual values. Then, take some time to analyze what drives that range. Can it be made smaller (more accurate) with more related data? If so, can you control any of that related data?

Visualize the results

Show historical and forecast values on one chart. This will give you a sense of how the forecast is trending. You can backfill the chart with actuals as they come in, so you can learn more about your forecast’s accuracy.
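A minimal matplotlib sketch of that chart, assuming the forecast has been exported to CSV with the usual p10/p50/p90 columns; file and column names are illustrative:

```python
import pandas as pd
import matplotlib.pyplot as plt

history = pd.read_csv("daily_demand.csv", parse_dates=["timestamp"])
forecast = pd.read_csv("forecast_export.csv", parse_dates=["date"])

plt.plot(history["timestamp"], history["demand"], label="actuals")
plt.plot(forecast["date"], forecast["p50"], label="P50 forecast")
plt.fill_between(forecast["date"], forecast["p10"], forecast["p90"],
                 alpha=0.3, label="P10-P90 range")
plt.legend()
plt.title("Historical demand vs. forecast")
plt.show()
```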

Choose a “medium term” time horizon

Your time horizon – how far into the future your forecast looks – is capped at 500 timesteps or ⅓ the length of your time series data, whichever is smaller. We recommend starting with a horizon of up to 10% of your series length. This will give you enough forward-looking forecasts to evaluate the usefulness of your results without taking too long.
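Worked through for a hypothetical three years of daily data:

```python
n_observations = 3 * 365                     # 1,095 daily data points
max_horizon = min(500, n_observations // 3)  # service cap: 365 timesteps here
starter_horizon = n_observations // 10       # the ~10% suggestion: 109 days
```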

Save your data prep code

Save the code you use to stage your data for the forecast. Because you will be doing this again, you don’t want to repeat yourself. An efficient way to do this is to use PySpark code inside a SageMaker notebook. If you end up using your forecast in production, you will eventually move that code into a Glue ETL pipeline (which uses PySpark), so it is best to use PySpark from the start.

Another advantage of using PySpark is that the utilities for reading and writing CSV-formatted data to and from S3 are dead simple. You will be using CSV for all your Forecast work.
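A minimal PySpark sketch of that S3 round trip; the bucket, paths, and column names are illustrative:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("forecast-prep").getOrCreate()

# Read raw transactions straight from S3 as headered CSV.
raw = spark.read.csv("s3://my-bucket/raw/transactions/",
                     header=True, inferSchema=True)

# Aggregate to one row per item per day.
daily = (raw.groupBy("item_id", F.to_date("timestamp").alias("timestamp"))
            .agg(F.sum("demand").alias("demand")))

# Write back to S3, ready for a Forecast dataset import job.
daily.write.mode("overwrite").option("header", True) \
     .csv("s3://my-bucket/staged/demand/")
```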

Interpret the results!

The guide to interpreting results is here, but admittedly it is a little dense if you are not a statistician. One easy metric to check, especially if you use multiple algorithms, is Root Mean Squared Error (RMSE). You want it as low as possible, and, in fact, Amazon chooses its winning algorithm largely on this value.
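RMSE is also easy to compute yourself when spot-checking an exported forecast against actuals; a quick pure-Python sketch with made-up numbers:

```python
import math

def rmse(actuals, predictions):
    """Root Mean Squared Error: large misses are penalized disproportionately."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actuals, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(rmse([100, 120, 90], [110, 115, 95]))  # ~7.07, in the units of the data
```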

It will take some time

How long will it take? If you select AutoML, expect model training to take a while – at least 20 minutes for even the smallest datasets. If your dataset is large, it can take an hour or several hours. The same is true when you generate the actual forecast. So, start it at the beginning of the day so you can work with it before lunch, or near the end of your day so you can look at it in the morning.

Data prep details (for your data engineer)

  • Match the ‘forecast frequency’ to the frequency of your observation timestamps.
  • Set the demand datatype to a float prior to import (it might be an integer).
  • Get comfortable with `strptime` and `strftime` – you have only two options for timestamp format (`yyyy-MM-dd` or `yyyy-MM-dd HH:mm:ss`).
  • Assume all data are from the same time zone. If they are not, make them so using Python datetime methods.
  • Split out a validation set like this: https://github.com/aws-samples/amazon-forecast-samples/blob/master/notebooks/1.Getting_Data_Ready.ipynb
  • If using pandas DataFrames, do not use the index when writing to CSV (see the sketch below).
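A pandas sketch covering most of this checklist; the file names, column names, and the 90/10 split point are illustrative assumptions:

```python
import pandas as pd

df = pd.read_csv("raw_demand.csv")

# Normalize timestamps to one of the two accepted formats (here, the full
# "yyyy-MM-dd HH:mm:ss" form) after converting everything to UTC.
df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True)
df["timestamp"] = df["timestamp"].dt.tz_localize(None).dt.strftime("%Y-%m-%d %H:%M:%S")

# Demand must be a float, not an integer.
df["demand"] = df["demand"].astype(float)

# Simple chronological split: hold out the most recent 10% for validation.
df = df.sort_values("timestamp")
split = int(len(df) * 0.9)
df.iloc[:split].to_csv("train.csv", index=False)       # no index column
df.iloc[split:].to_csv("validation.csv", index=False)
```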

Conclusion

If you’re ever asked to produce a forecast or predict some number in the future, you now have a robust method at your fingertips to get there. With Amazon Forecast, you have access to Amazon.com’s optimized algorithms for time series forecasting. If you can get your target data into CSV format, then you can use a forecast. Before you start, have a business goal in mind – it is essential to think about ranges of possibilities rather than a discrete number. And be sure to keep in mind our best practices for creating a forecast, such as using a “medium term” time horizon, visualizing the results, and saving your data preparation code.

If you’re ready to make better, data-driven decisions, trust your dashboards and reports, confidently bring in new sources for enhanced analysis, create a culture of DataOps, and become AI-ready, contact us to schedule a demo of our DataOps Foundation.

-Rob Whelan, Practice Director, Data & Analytics


How to Use Amazon Forecast for your Business

How to Use Amazon Forecast: What Is It Good For?

How many times have you been asked to predict revenue for next month or next quarter? Do you mostly rely on your gut? Have you ever been asked to support your numbers? Cue sweaty palms frantically churning out spreadsheets.

Maybe you’ve suffered from the supply chain “bullwhip” effect: you order too much inventory, which makes your suppliers hustle, only to deliver a glut of product that you won’t need to replace for a long time, which makes your suppliers sit idle.

Wouldn’t it be nice to plan for your supply chain as tightly as Amazon.com does? With Amazon Forecast, you can do exactly that. In part one of this two-part article, I’ll provide an overview of the Amazon Forecast service and how to get started. Part two of the article will focus on best practices for using Amazon Forecast.

Amazon Forecast: The backstory

Amazon knows a thing or two about inventory planning, given its intense focus on operations. Over the years, it has used multiple algorithms for accurate forecasting, and it has even fine-tuned them to run in an optimized way on its cloud compute instances. Forecasting demand is important, if only to get a “confidence interval” – a range within which it’s fairly certain reality will fall, say, 80% of the time.

In true Amazon Web Services fashion, Amazon decided to offer this forecasting capability as a service: Amazon Forecast, a managed service that takes your time series data in CSV format and produces a forecast of the future. Amazon Forecast gives you a customizable confidence interval that you can set to 95%, 90%, 80%, or whatever percentage you need. And you can re-use and re-train the model with actuals as they come in.

When you use Amazon Forecast, you can tell it to run up to five different state-of-the-art algorithms and pick a winner. This saves you the time of deliberating over which algorithm to use.

The best part about Amazon Forecast is that you can make the forecast more robust by adding in “related” time series – any data that you think is correlated to your forecast. For example, you might be predicting electricity demand based on macro factors such as the season, but also on micro factors such as whether or not it rained that day.

Amazon Forecast: How to use

Amazon Forecast is considered a serverless service: you don’t have to manage any compute instances to use it. Since it is serverless, you can create multiple scenarios simultaneously – up to three at once. There is no reason to run them in series; come up with three scenarios and fire them all off at once. Additionally, Amazon Forecast is low-cost, so it is worth trying and experimenting with often. As is generally the case with AWS, you end up paying mostly for the underlying compute and storage, rather than any major premium for using the service. Like any other machine learning task, you have a huge advantage if you have invested in keeping your data orderly and accessible.

Here is a general workflow for using Amazon Forecast:

  1. Create a Dataset Group. This is just a logical container for all the datasets you’re going to use to create your predictor.
  2. Import your source datasets. A nice thing here is that Amazon Forecast facilitates the use of different “versions” of your datasets. As you go about feature engineering, you are bound to create different models which will be based on different underlying datasets. This is absolutely crucial for the process of experimentation and iteration.
  3. Create a predictor. This is another way of saying “create a trained model on your source data.”
  4. Create a forecast using the predictor. This is where you actually generate a forecast looking into the future.

To get started, stage your time series data in a CSV file in S3. You have to follow AWS’s naming convention for the column names. You can also optionally use your domain knowledge to enrich the data with “related time series”: if you think external factors drive the forecast, add those data series, too. You can add multiple complementary time series.

When your datasets are staged, you create a Predictor. A Predictor is just a trained machine learning model. If you choose the “AutoML” option, Amazon will make up to five algorithms compete. It will save the results of all of the models that trained successfully (sometimes an algorithm clashes with the underlying data).

Finally, when your Predictor is done training, you can generate a forecast and export it to S3, where it can easily be shared with your organization or consumed by any business intelligence tool. It’s always a good idea to visualize the results to give them a reality check.
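Exporting is its own API call; a hedged boto3 sketch with illustrative ARNs and paths:

```python
import boto3

forecast = boto3.client("forecast")

forecast.create_forecast_export_job(
    ForecastExportJobName="demand_forecast_export",
    ForecastArn="arn:aws:forecast:us-east-1:123456789012:forecast/demand_forecast",
    Destination={"S3Config": {
        "Path": "s3://my-bucket/exports/demand_forecast/",
        "RoleArn": "arn:aws:iam::123456789012:role/ForecastRole",
    }},
)
# The export lands as CSV files with item_id, date, and quantile (e.g.,
# p10/p50/p90) columns, ready for QuickSight or any other BI tool.
```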

In part two of this article, we’ll dig into best practices for using Amazon Forecast. And if you’re interested in learning even more about transforming your organization to be more data-driven, check out our DataOps Foundation service that helps you transform your data analytics processes.

-Rob Whelan, Practice Director, Data & Analytics