In part one of this article, we offered an overview of Amazon Forecast and how to use it. In part two, we get into Amazon Forecast best practices:
Know your business goal
In our data and analytics practice, business value comes first. We want to know and clarify use cases before we talk about technology. Using amazon Forecast is no different. When creating a forecast, do you want to make sure you always have enough inventory on hand? Or do you want to make sure that all your inventory gets used all the time? This will drive which “quartile” you look at.
Each quartile – the defaults are 10%, 50%, and 90% – is important for its own reasons and should be looked at to give a range. What is the 50% quartile? The forecast at this quartile has a 50-50 chance of being right. The real number has a 50% chance of being higher and a 50% chance of being lower than the actual value. The forecast at the 90% quartile has a 90% chance of being higher than the actual value, while the forecast at the 10% quartile has only a 10% chance of being higher. So, if you want to make sure you sell all your inventory, use the 10% quartile forecast.
Use related time series
Amazon has made Forecast so easy to use with related time series, you really have nothing to lose to make your forecast more robust. All you have to do is make the time series time units the same as your target time series.
One way to create a related dataset is to use categorical or binary data whose future values are already known – for example, whether the future time is on a weekend or a holiday or there is a concert playing – anything that is on a schedule that you can rely on.
Even if you don’t know if something will happen, you can create multiple forecasts where you vary the future values. For example, if you want to forecast attendance at a baseball game this Sunday, and you want to model the impact of weather, you could create a feature is_raining and try one forecast with “yes, it’s raining” and another with “no, it’s not raining.”
Look at a range of forecasted values, not a singular forecasted value
Don’t expect the numbers to be precise. One of the biggest values from a forecast is knowing what the likely range of actual values will be. Then, take some time to analyze what drives that range. Can it be made smaller (more accurate) with more related data? If so, can you control any of that related data?
Visualize the results
Show historical and forecast values on one chart. This will give you a sense of how the forecast is trending. You can backfill the chart with actuals as they come in, so you can learn more about your forecast’s accuracy.
Choose a “medium term” time horizon
Your time horizon – how far in the future your forecast looks – is either 500 timesteps or ⅓ of your time series data, whichever is smaller. We recommend choosing up to a 10% horizon for starters. This will give you enough forward-looking forecasts to evaluate the usefulness of your results without taking too long.
Save your data prep code
Save the code you use to stage your data for the forecast for the future. Because you will be doing this again, you don’t want to repeat yourself. An efficient way to do this is to use PySpark code inside a Sagemaker notebook. If you end up using your forecast in production, you will eventually place that code into a Glue ETL pipeline (using PySpark), so it is best to just use PySpark out of the box.
Another advantage of using PySpark is that the utilities to load and drop csv-formatted data to/from S3 are dead simple. You will be using CSV for Forecasting work.
Interpret the results!
The guide to interpret results is here, but admittedly it is a little dense if you are not a statistician. One easy metric to look at, especially if you use multiple algorithms, is Root Mean Squared Error (RMSE). You want this as low as possible, and, in fact, Amazon will choose its winning algorithm mostly on this value.
It will take some time
How long will it take? If you do select AutoML, expect model training to take a while – at least 20 minutes for even the smallest datasets. If your dataset is large, it can take an hour or several hours. The same is true when you generate the actual forecast. So, start it in the beginning of the day so you can work with it before lunch, or near the end of your day so you can look at it in the morning.
Data prep details (for your data engineer)
- Match the ‘forecast frequency’ to the frequency of your observation timestamps.
- Set the demand datatype to a float prior to import (it might be an integer).
- Get comfortable with `striptime` and `strftime` – you have only two options for timestamp format.
- Assume all data are from the same time zone. If they are not, make them that way. Use python datetime methods.
- Split out a validation set like this: https://github.com/aws-samples/amazon-forecast-samples/blob/master/notebooks/1.Getting_Data_Ready.ipynb
- If using pandas dataframes, do not use the index when writing to csv.
If you’re ever asked to produce a forecast or predict some number in the future, you now have a robust method at your fingertips to get there. With Amazon Forecast, you have access to Amazon.com’s optimized algorithms for time series forecasting. If you can get your target data into CSV format, then you can use a forecast. Before you start, have a business goal in mind – it is essential to think about ranges of possibilities rather than a discrete number. And be sure to keep in mind our best practices for creating a forecast, such as using a “medium term” time horizon, visualizing the results, and saving your data preparation code.
If you’re ready to make better, data-driven decisions, trust your dashboards and reports, confidently bring in new sources for enhanced analysis, create a culture of DataOps, and become AI-ready, contact us to schedule a demo of our DataOps Foundation.
-Rob Whelan, Practice Director, Data & Analytics