It can be fun to tinker around with shiny, new technology toys, but without specific goals, the organization suffers. Time and resources are wasted, and without proof of value-added, the buy-in necessary from leadership won’t happen. Why are you implementing this solution, and what do you hope to get out of the data you put in?
ML projects can support data-driven decisions and surface insights into customer buying behavior, which can be used to optimize the sales cycle with new marketing campaigns. Other uses include improving user experience with predictive search, streamlining warehouse inventory with image processing, real-time fraud detection, predictive maintenance, and elevating customer service with voice-to-text speech recognition.
ML projects are typically led by a data scientist who is responsible for understanding the business requirements. The data scientist uses very large volumes of data to train a computer model that learns patterns, predicts outcomes, and improves those outcomes over time.
Successful ML solutions can generate 4-5% higher profit margins, so identify benchmarks, set growth goals, and integrate regular progress measurements to make sure you’re always on track with your purpose in mind.
Step 2: Apply Machine Learning
The revolutionary appeal of ML is that it does not require an explicit computer program to deliver analytics and predictions. Instead, it leverages a computer model that can be trained to predict and improve the outcomes. After the data scientist’s analysis defines the business requirements, they wrangle the necessary data to train the ML model by leveraging an algorithm, which is the engine that turns the data into a model.
Data preparation is critical to the success of the ML project because it is the foundation of everything that follows. Garbage in equals garbage out, but value in produces more value.
Raw data can be tempting, but data that isn’t clean, governed, and appropriate for business use corrupts the model and invalidates the outcome. Data needs to be prepared and ready, meaning it has been reviewed for accuracy, and it’s available and accessible to all users. Data is typically stored in a cloud data warehouse or data lake and it must be maintained with ongoing governance.
A common mistake organizations make is relying on data scientists to clean the data. Studies have found that data scientists spend 70% of their time wrangling data and only 30% of the time implementing the solution and delivering business value. These highly paid and skilled professionals are scarce resources trained for innovation and analyzing data, not cleaning data. Only after the data is clean should data scientists start their analysis.
The data scientist’s core expertise is in selecting the appropriate algorithm to process and analyze the data. The science in ML is figuring out which algorithm to use and how to optimize it to deliver accurate and reliable results.
Thankfully, ML algorithms are available today in all the major service provider platforms, and many Python and R libraries. The general use cases within reach include:
Classification (is this a cat or is this not a cat), used for anomaly detection, marketing segmentation, and recommendation engines.
NLP (natural language processing), used for autocomplete, sentiment analysis, and language understanding (e.g., chatbots).
Time series, used for forecasting.
Algorithms are either supervised or unsupervised. Supervised learning algorithms start with training data and correct answers. Labeled data trains the model using the algorithm and feedback. Think texting and autocorrect – the algorithm is always learning new words based on your interaction with autocorrect. That feedback is delivered to the live model for updates and the feedback loop never ends.
Unsupervised learning algorithms start with unlabeled data. The algorithm divides the data into meaningful clusters used to make inferences about the records. These algorithms are useful for segmentation of click stream data or email lists.
Some popular algorithms include CNN (convolutional neural network, a deep learning algorithm), K-Means Clustering, PCA (principal component analysis), Support Vector Machine, Decision Trees, and Logistic Regression.
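To make the idea of unsupervised clustering concrete, here is a minimal sketch of K-Means on one-dimensional data, written in plain Python. The data (emails opened per subscriber), the function name `k_means_1d`, and the way the starting centroids are chosen are all assumptions for illustration. A real project would more likely use an established library.

```python
# Minimal 1-D k-means sketch (illustrative only).

def k_means_1d(values, k, iterations=20):
    """Group numbers into k clusters by repeatedly moving each centroid
    to the mean of the points closest to it."""
    ordered = sorted(values)
    # Assumption: spread the starting centroids evenly across the sorted data.
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical data: emails opened per month by each subscriber.
opens_per_month = [0, 1, 1, 2, 14, 15, 16, 30, 31, 33]
centroids, clusters = k_means_1d(opens_per_month, k=3)
for center, members in zip(centroids, clusters):
    print(f"cluster around {center:.1f}: {sorted(members)}")
```

No labels are supplied anywhere. The algorithm separates low, medium, and high engagement groups purely from the shape of the data, which is the kind of segmentation described above.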
With everything in place, it’s time to see if the model is doing what you need it to do. When evaluating model quality, consider bias and variance. Bias quantifies how much the algorithm’s limited flexibility keeps it from learning the true pattern. Variance quantifies the algorithm’s sensitivity to the specific set of training data it was given.
Three things can happen when optimizing the model:
Over-fitting: Low bias + high variance. The model is too tightly fitted to the training data, and it won’t generalize to data it hasn’t seen before.
Under-fitting: High bias + low variance. The model is too simple or too new and hasn’t reached a point of accuracy. Get to over-fitting first, then back up and iterate until the model fits.
Limiting/preventing under/over-fitting: If there are too many features in the model (i.e., the data points used to build the model), you need to either reduce them or create new features from existing features.
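The trade-off between over-fitting and under-fitting can be sketched with a tiny nearest-neighbor model. This is only an illustration under stated assumptions: the data points are made up, and k-nearest-neighbors is used as a stand-in because its flexibility is controlled by one number, `k`.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mean_abs_error(train, data, k):
    """Average absolute gap between predictions and the known answers."""
    return sum(abs(knn_predict(train, x, k) - y) for x, y in data) / len(data)

# Hypothetical (x, y) pairs: y is roughly 2 * x plus some noise.
train = [(1, 2.5), (2, 3.6), (3, 6.4), (4, 7.7),
         (5, 10.6), (6, 11.5), (7, 14.3), (8, 15.8)]
# Held-out points the model never sees during training.
test = [(1.5, 3.0), (3.5, 7.0), (5.5, 11.0), (7.5, 15.0)]

for k in (1, 2, len(train)):
    print(f"k={k}: train error {mean_abs_error(train, train, k):.2f}, "
          f"test error {mean_abs_error(train, test, k):.2f}")
```

With `k=1` the model memorizes the training data, noise included, so its training error is zero while its error on held-out data is worse than with `k=2` (over-fitting). With `k` equal to the full training set, every prediction is the overall average, which misses the pattern entirely (under-fitting).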
Before unleashing your ML project on customers, experiment first with employees. Customer-facing solutions like virtual assistants and chatbots can jeopardize your reputation if they don’t add value to interactions with customers. Because ML influences decision-making, accuracy is a must before real-world implementation.
Step 3: Experiment and Push into Production
With software projects, it either works or it crashes. With data science projects, you have to see, touch, and feel the results to know if it’s working. Reach out to users for feedback and to ensure any changes to user experience are positive. Luckily, with the cloud, the cost of experimentation is low, so don’t be afraid to beta test before a full launch.
Once the model fits and you’ve pushed the project into production, make noise about it around the organization. Promote that you’re implementing something new and garner the attention of executive leadership. Unfortunately, 70% of data projects fail because they don’t have an executive champion.
Share your learnings internally using data, charts, and results, and emphasize company-wide impact. You’re not going to get buy-in on day one, but as you move up the chain of command, earning more and more supporters, your budget will allow for more machine learning solutions. Utilize buzzwords and visual representations of the project – remember data science needs to be seen, touched, and felt.
Ensure ML and data science success with best practices for introducing, completing, and repeating implementation. 2nd Watch Data and Analytic Solutions help your organization realize the power of ML with proper data cleaning, the right algorithm selection, and quality model deployment. Contact Us to see how you can do more with the data you have.
-Sam Tawfik, Sr Marketing Manager, Data & Analytics
Amazing possibilities are available in data science with artificial intelligence (AI) and machine learning (ML). Large sets of data, inexpensive storage options, and cloud processing capabilities are enabling computers to make human-like decisions. Across industries, businesses are leveraging these algorithm-based models to save time, reduce costs, enable users, and grow profits.
What’s the difference?
Data science, AI, and ML can get lumped together, but there are some distinctions to understand. Simply put, AI is a computer doing things that typically would require human scrutiny or reasoning. ML is the application of statistical learning techniques to automatically learn patterns in data. These patterns are used to develop a model to make more accurate predictions about the world. And both terms utilize data science to accomplish outcomes.
With these central terms defined, we recommend using ‘machine learning’ or ‘ML’ to describe data science projects internally because there is sometimes an aura of fear around AI that “the robots are going to take my job.” Although joking (a bit), buy-in from executives is critical to a successful data project, so ML is recommended over AI.
Utilizing ML for profit growth
A recent study showed 78% of companies have already deployed ML, and 90% of them have made more money as a result. Manufacturing and supply-chain management are experiencing the largest average cost decrease, and marketing, sales, product and service development are reaching the highest average revenue gains. Additionally, a McKinsey survey revealed that organizations with a high diffusion of ML had 4-5% higher profit margins than their peers with no ML. Not only can ML reduce your overall costs, but it also enables you to grow your bottom line. If your organization is not utilizing ML, now is the time to start.
From data to model
Machine learning is already a staple in many of the functions we utilize daily. Predictive search in Google and within catalogues, fraud detection on suspicious credit card purchases, near-instant credit approval, social network suggestions via mutual connections, and voice recognition are all common today. Behind these intelligent decisions is a model that acts as a function or program. The model is trained on sample data using a machine learning algorithm to learn patterns. Based on the information learned about the sample data, the model is applied to inputs it may or may not have seen before and predicts an outcome.
Traditional programming depends on the written program and the input data it’s fed. The computer runs the program against the data, and you get an output directly tied to the logic or function of the program. Only the data that can be processed by the program gets analyzed, and outliers are removed.
In ML, the computer is still given input data (for example, what you know about your customer: time stamps, demographics, spend, etc.), but it doesn’t have a written program. Instead, it’s given the output you desire. For example, you might want to know which customers churn. Then you build a model by training programmed algorithms to analyze input data and predict an output. Essentially, the model recognizes the correlation between the output results and the input data. Here, the model utilizes algorithms to identify patterns in data that heavily influence the customer churn score.
In this example, an organization might discover that most customers stop doing business with them after a certain promo ends, or a high percentage of customers who come in through a specific lead gen pipeline don’t stay for long. Using this information, the organization can make informed and specific decisions about how to reduce churn based on known patterns. All relevant data is taken into account in ML to deliver a more comprehensive story about why things are happening in your organization. Machine learning can quickly affirm or discredit intuition and allow organizations to fail faster, and in the right direction, to meet overall goals more efficiently.
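The churn example can be sketched in a few lines of Python. Instead of hand-writing a rule, the code below is given inputs and the known outcome, and it searches for the single rule that best explains the outcome. The customer records, the feature names, and the `learn_rule` helper are hypothetical, and a one-rule model is far simpler than what a production project would use.

```python
# Hypothetical customer records: (signed_up_via_promo, months_as_customer, churned)
customers = [
    (1, 2, True), (1, 3, True), (1, 1, True), (0, 2, False),
    (0, 14, False), (1, 18, False), (0, 24, False), (0, 9, False),
    (1, 4, True), (0, 3, True),
]
FEATURES = {"signed_up_via_promo": 0, "months_as_customer": 1}

def learn_rule(rows):
    """Try every 'feature > threshold' and 'feature <= threshold' rule and
    keep the one that labels the most training rows correctly."""
    best = None
    for name, idx in FEATURES.items():
        for threshold in sorted({row[idx] for row in rows}):
            for churn_if_above in (True, False):
                # A row is predicted to churn when its side of the threshold
                # matches the direction being tested.
                correct = sum(
                    ((row[idx] > threshold) == churn_if_above) == row[2]
                    for row in rows
                )
                if best is None or correct > best[0]:
                    best = (correct, name, threshold, churn_if_above)
    return best

correct, feature, threshold, churn_if_above = learn_rule(customers)
direction = ">" if churn_if_above else "<="
print(f"Learned rule: churn if {feature} {direction} {threshold} "
      f"({correct}/{len(customers)} training rows correct)")
```

On this made-up data, the search settles on a tenure-based rule rather than the promo flag, which is the kind of pattern a team might not have guessed in advance.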
There is a saying in meteorology that you can be accurate more often than not if you predict tomorrow’s weather to be the same as today’s weather. Of course, that is not always the case, unless you live in a place like San Diego or you use data to make your predictions.
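The "same as today" approach is sometimes called a persistence or naive forecast, and it makes a useful baseline for judging any more sophisticated model. A minimal sketch, using made-up temperatures and hypothetical function names:

```python
def persistence_forecast(history):
    """Predict each day's value to be the same as the previous day's value."""
    return history[:-1]

def mean_abs_error(actuals, predictions):
    """Average absolute gap between what happened and what was predicted."""
    return sum(abs(a - p) for a, p in zip(actuals, predictions)) / len(actuals)

# Hypothetical daily high temperatures (degrees F) for one week.
highs = [71, 72, 72, 70, 73, 74, 72]
predictions = persistence_forecast(highs)  # forecasts for day 2 through day 7
actuals = highs[1:]                        # what actually happened on those days
print(f"Persistence baseline MAE: {mean_abs_error(actuals, predictions):.2f}")
```

If a data-driven forecast cannot beat this baseline on your own data, it is not yet adding value.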
Forecasting in business requires data, lots of data, and it requires specialized data science skills, time, and tools to wrangle, prepare, and analyze the data.
Cloud solution providers such as AWS are enabling organizations to collect and host all that business data and provide tools to seamlessly integrate and analyze data for analytics and forecasting. Amazon Forecast is a managed service that consumes time series data and makes predictions without requiring the user to have any machine learning knowledge or experience.
Determine Use Case and Acceptable Accuracy
It is important to identify use cases and accuracy criteria before generating forecasts. As machine learning tools become easier to use, hasty predictions that are ostensibly accurate become more common.
Without a prepared use case and a definition of acceptable accuracy, the usefulness of any data analysis will be in question. A common use case is predicting the customer demand for inventory items to ensure adequate supply. For inventory, a common accuracy target is for the predicted demand to be higher than the actual demand 90% of the time, which ensures adequate supply without overstocking.
In statistical terms, that target is the 90th percentile (quantile) forecast, denoted P90, and its accuracy is measured with a quantile loss.
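As a rough sketch of what a P90 target means in practice, the snippet below computes the standard quantile (pinball) loss for a single forecast and checks how often a forecast landed at or above actual demand. The demand numbers and function names are made up for illustration, and Amazon Forecast reports its own accuracy metrics, which may be scaled differently.

```python
def quantile_loss(actual, forecast, quantile=0.9):
    """Pinball loss: under-forecasts are weighted by `quantile`,
    over-forecasts by (1 - quantile)."""
    if actual > forecast:
        return quantile * (actual - forecast)
    return (1 - quantile) * (forecast - actual)

def coverage(actuals, forecasts):
    """Fraction of periods in which the forecast was at or above actual demand."""
    return sum(f >= a for a, f in zip(actuals, forecasts)) / len(actuals)

# Hypothetical weekly demand and a P90 forecast for one inventory item.
actual_demand = [100, 120, 90, 110, 130, 95, 105, 115, 100, 125]
p90_forecast = [115, 125, 105, 120, 128, 110, 118, 124, 112, 135]

total_loss = sum(quantile_loss(a, f) for a, f in zip(actual_demand, p90_forecast))
print(f"coverage: {coverage(actual_demand, p90_forecast):.0%}")
print(f"total P90 quantile loss: {total_loss:.1f}")
```

At the 0.9 quantile, running short by ten units costs nine times as much as overshooting by ten, which is why a P90 forecast sits above typical demand.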
How Amazon Forecast Works
To get started, you need to collect the historical and related time series data and upload it to Amazon Forecast. Amazon Forecast automatically inspects the data, identifies the key attributes, selects the appropriate machine learning algorithm, trains the model, and generates the forecasts. Forecasts can be visualized or the data can be exported for downstream processing.
Aggregate and Prepare Data
Time series data is often more granular than necessary for many use cases. If transaction data is collected across multiple locations (or device readings or inventory items) and the use case requires only a prediction of the total amount, the data will need to be aggregated before attempting any predictions.
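A minimal sketch of that aggregation step, assuming transactions arrive as (date, store, units) records. The record layout and store names are hypothetical.

```python
from collections import defaultdict

# Hypothetical transactions: (date, store, units_sold)
transactions = [
    ("2021-03-01", "store_a", 12), ("2021-03-01", "store_b", 7),
    ("2021-03-02", "store_a", 9), ("2021-03-02", "store_b", 11),
    ("2021-03-02", "store_c", 4),
]

def aggregate_by_date(rows):
    """Sum units across all stores so there is one total per date."""
    totals = defaultdict(int)
    for date, _store, units in rows:
        totals[date] += units
    # Sort by date so the result reads as a time series.
    return dict(sorted(totals.items()))

daily_totals = aggregate_by_date(transactions)
print(daily_totals)
```

The same pattern applies to device readings or inventory items: drop the dimension the use case does not need and sum (or average) over it.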
Inconsistencies in time series data are common and should be analyzed and corrected as much as possible before attempting any predictions. In many cases, perfect corrections are impossible due to missing or inaccurate data and methods to smooth, fill, or interpolate the data will need to be employed.
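One simple way to fill gaps is linear interpolation between the nearest known values. The sketch below is one possible approach, not a recommendation for every dataset. The readings and the `interpolate_gaps` helper are made up for illustration.

```python
def interpolate_gaps(values):
    """Fill None entries by drawing a straight line between the nearest
    known neighbors; gaps at the edges copy the closest known value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("need at least one known value")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            fraction = (i - left) / (right - left)
            filled[i] = values[left] + fraction * (values[right] - values[left])
    return filled

# Hypothetical hourly sensor readings with missing values.
readings = [None, 10.0, None, None, 16.0, 18.0, None]
print(interpolate_gaps(readings))
```

Whether interpolation, forward-filling, or zero-filling is appropriate depends on what a missing value means in your data. A missing sales record may be a true zero, while a missing sensor reading usually is not.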
Amazon Forecast Forecasts
Generating a forecast from Amazon Forecast is much easier than doing the prerequisite work. Amazon Forecast provides half a dozen predefined algorithms and an option for AutoML, which will evaluate all algorithms and choose one it determines to fit best.
Simple CSV files are uploaded, a Predictor is trained, and a Forecast is created. The end-to-end process usually takes several hours depending on the size of the data and the parameter settings. Once generated, you can see the results in a Forecast Lookup or export them back to CSV to be consumed by a data visualization service such as Amazon QuickSight.
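Producing the CSV itself needs nothing beyond the Python standard library. In the sketch below, the column names, their order, and the choice to omit a header row are assumptions for illustration. They need to match whatever schema and import settings you define for your dataset.

```python
import csv
import io

# Hypothetical daily demand records to export.
records = [
    {"item_id": "sku_001", "timestamp": "2021-03-01", "demand": 19},
    {"item_id": "sku_001", "timestamp": "2021-03-02", "demand": 24},
]

def to_csv_text(rows, columns=("item_id", "timestamp", "demand")):
    """Write rows as CSV text with a fixed column order and no header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[column] for column in columns])
    return buffer.getvalue()

csv_text = to_csv_text(records)
print(csv_text)
```

Keeping the column order in one place makes it harder for the file and the dataset schema to drift apart.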
If you skipped the prerequisites, you would look at the Forecast results and ask, “Now what?” If your results satisfy your use case and accuracy requirements, you can start working on other use cases and/or create an Amazon Forecast pipeline that delivers regular predictions.
Improving Forecast Accuracy
The most important factor affecting forecast accuracy is the quality and quantity of the data. Larger datasets are the first thing that should be tried. Data analysis might also be needed to ensure that the data is consistent.
If the generated Forecast has not satisfied the accuracy requirements you defined, it’s time to adjust some of the (hyper)parameters, include additional data, or both.
Parameters and Hyperparameters
Reducing the forecast horizon can increase accuracy; it’s easier to make shorter term predictions. Manually setting the Predictor’s algorithm to DeepAR+ will enable an advanced option called HPO, which stands for Hyperparameter Optimization. Enabling HPO will cause the Predictor to run multiple times with different tweaks to attempt to increase the accuracy.
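The idea behind hyperparameter optimization can be shown with a toy example: train the same model several times with different settings, score each run on held-back data, and keep the best. This is not how Amazon Forecast implements it internally. The model here (simple exponential smoothing), the demand numbers, and the candidate values are all assumptions for illustration.

```python
def smoothing_forecast(history, alpha):
    """Simple exponential smoothing: returns a one-step-ahead forecast."""
    level = history[0]
    for value in history[1:]:
        level = alpha * value + (1 - alpha) * level
    return level

def backtest_error(series, alpha, holdout):
    """Average absolute error when forecasting each of the last `holdout`
    points using only the points that came before it."""
    errors = []
    for i in range(len(series) - holdout, len(series)):
        errors.append(abs(smoothing_forecast(series[:i], alpha) - series[i]))
    return sum(errors) / holdout

# Hypothetical weekly demand.
demand = [20, 22, 21, 25, 27, 26, 30, 31, 33, 32, 36, 38]
candidates = [0.1, 0.3, 0.5, 0.7, 0.9]  # the hyperparameter values to try

results = {alpha: backtest_error(demand, alpha, holdout=4) for alpha in candidates}
best_alpha = min(results, key=results.get)
for alpha, error in results.items():
    print(f"alpha={alpha}: backtest error {error:.2f}")
print(f"best alpha: {best_alpha}")
```

Each extra candidate means another training run, which is why enabling this kind of search lengthens the time it takes to produce a Predictor.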
Related Time Series and Metadata
Related Time Series data (e.g., weather data, holidays) and Metadata (e.g. sub-categories of inventory items) can be added to the Dataset Group to attempt to increase the accuracy. Matching item_ids and making sure beginning and ending timestamps match the dataset can add additional overhead that may not be necessary depending on your accuracy requirements.
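A quick pre-upload check can catch mismatched item_ids and date ranges before they cost you a failed import. The sketch below assumes both datasets have been loaded into dictionaries mapping each item_id to a list of ISO-formatted date strings. The data and the `check_alignment` helper are hypothetical.

```python
def check_alignment(target, related):
    """Compare two {item_id: [timestamps]} mappings and report mismatches."""
    problems = []
    for item_id in sorted(set(target) - set(related)):
        problems.append(f"{item_id}: missing from related data")
    for item_id in sorted(set(target) & set(related)):
        t, r = target[item_id], related[item_id]
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        if min(r) > min(t):
            problems.append(f"{item_id}: related data starts after target data")
        if max(r) < max(t):
            problems.append(f"{item_id}: related data ends before target data")
    return problems

target_series = {
    "sku_001": ["2021-03-01", "2021-03-02", "2021-03-03"],
    "sku_002": ["2021-03-01", "2021-03-02"],
}
related_series = {
    "sku_001": ["2021-03-02", "2021-03-03", "2021-03-04"],
}

for problem in check_alignment(target_series, related_series):
    print(problem)
```

Running a check like this on every refresh is cheap compared with the hours a Predictor can take to train on misaligned data.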
For more details on using Amazon Forecast, watch this video on how to build accurate forecasting models.
-Joey Brown, Sr Cloud Consultant