Simple & Secure Data Lakes with AWS Lake Formation

Data is the lifeblood of business. Visualizing that data, guiding business decisions, and enhancing business operations increasingly requires machine learning services. But where to begin? Today, tremendous amounts of data are created by companies worldwide, often in disparate systems.

These large amounts of data, while helpful, don’t necessarily need to be processed immediately, yet they need to be consolidated into a single source of truth to enable business value. Companies are faced with the issue of finding the best way to securely store their raw data for later use. One popular type of data store, the “data lake,” is very different from the traditional data warehouse.

Use Case: Data Lakes and McDonald’s

McDonald’s brings in about 1.5 million customers each day, creating 20-30 new data points with each of their transactions. The restaurant’s data comes from multiple data sources including a variety of data vendors, mobile apps, loyalty programs, CRM systems, etc. With all this data coming from various sources, the company wanted to build a complete view of customer lifetime value (CLV) and other useful analytics. To meet their needs for data collection and analytics, McDonald’s France partnered with 2nd Watch to build a data lake. The data lake allowed McDonald’s to ingest data into one source, reducing the effort required to manage and analyze their large amounts of data.

Due to their transition from a data warehouse to a data lake, McDonald’s France has greater visibility into the speed of service, customer lifetime value, and conversion rates. With an enhanced view of their data, the company can make better business decisions to improve their customers’ experience. So, what exactly is a data lake, how does it differ from a data warehouse, and how do they store data for companies like McDonald’s France?

What is a Data Lake?

A data lake is a centralized storage repository that holds a vast amount of raw data in its native format until it is needed for use. A data lake can include any combination of:

  • Structured data: highly organized data from relational databases
  • Semi-structured data: data with some organizational properties, such as HTML
  • Unstructured data: data without a predefined data model, such as email

Data lakes are often mistaken for data warehouses, but the two data stores cannot be used interchangeably. Data warehouses, the more traditional data store, process and store your data for analytical purposes: incoming data is filtered and structured as it arrives, and it can come from multiple sources. Data lakes, on the other hand, centralize data as it comes in without processing it, so there is no need to identify a specific purpose for the data up front as there is in a data warehouse environment. Your data, whether in its original or curated form, can be stored in a data lake. Companies often choose a data lake for its flexibility in supporting any type of data, its scalability, its analytics and machine learning capabilities, and its low cost.

While Data Warehouses are appealing for their element of automatically curated data and fast results, data lakes can lead to several areas of improvement for your data and business including:

  • Improved customer interactions
  • Improved R&D innovation choices
  • Increased operational efficiencies

Essentially, a piece of information stored in a data lake will seem like a small drop in a big lake. Due to the lack of organization and security that can occur when storing large quantities of data, data lakes have received some criticism. Additionally, setting up a data lake can be time- and labor-intensive, often taking months to complete. This is because, when built the traditional way, a series of steps must be completed and then repeated for each new data set.

Even once fully architected, there can be errors in the setup due to your data lakes being manually configured over an extended period. An important piece to your data lake is a data catalog, which uses machine learning capabilities to recognize data and create a universal schema when new datasets come into your data lake. Without defined mechanisms and proper governance, your data lake can quickly become a “data swamp”, where your data becomes hard to manage, analyze, and ultimately becomes unusable. Fortunately, there is a solution to all these problems. You can build a well-architected data lake in a short amount of time with AWS Lake Formation.

AWS Lake Formation & its Benefits

Traditionally, data lakes were set up as on-premises deployments before people realized the value and security provided by the cloud. These on-premises environments required continual adjustments for things like optimization and capacity planning—which is now easier due to cloud services like AWS Lake Formation. Deploying data lakes in the cloud provides scalability, availability, security, and faster time to build and deploy your data lake.

AWS Lake Formation is a service that makes it easy to set up a secure data lake in days, saving time and effort your business can spend elsewhere. While AWS Lake Formation significantly cuts down the time it takes to set up your data lake, the lake is still built and deployed securely. Additionally, AWS Lake Formation enables you to break down data silos and combine a variety of analytics to gain data insights and ultimately guide better business decisions. The benefits delivered by this AWS service are:

  • Build data lakes quickly: To build a data lake in Lake Formation, you simply import data from databases already in AWS, other AWS sources, or external sources. Data stored in Amazon S3, for example, can be moved into your data lake, where you crawl, catalog, and prepare the data for analytics. Lake Formation also helps transform data with AWS Glue to prepare it for quality analytics. Additionally, with AWS Glue’s FindMatches transform, data can be cleaned and deduplicated to simplify your data.
  • Simplify security management: Security management is simpler with Lake Formation because it provides automatic server-side encryption, giving your data a secure foundation. Security settings and access controls can also be configured to ensure high-level security. Once configured with rules, Lake Formation enforces your access controls. With Lake Formation, your security and governance standards will be met.
  • Provide self-service access to data: With large amounts of data in your data lake, finding the data you need for a specific purpose can be difficult. Through Lake Formation, your users can search for relevant data using custom fields such as name, contents, and sensitivity to make discovering data easier. Lake Formation can also be paired with AWS analytics services, such as Amazon Athena, Amazon Redshift, and Amazon EMR. For example, queries can be run through Amazon Athena using data that is registered with Lake Formation.
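As a rough illustration of that pairing, the sketch below uses boto3 to grant a role SELECT access through Lake Formation and then run a query through Athena against the registered table. The account ID, role, database, table, and bucket names are all placeholders.

```python
import boto3

lakeformation = boto3.client("lakeformation")
athena = boto3.client("athena")

# Grant an analyst role SELECT on a catalog table governed by Lake Formation.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analyst"},
    Resource={"Table": {"DatabaseName": "sales", "Name": "orders"}},
    Permissions=["SELECT"],
)

# Query the same table through Athena; Lake Formation enforces the permissions above.
athena.start_query_execution(
    QueryString="SELECT order_date, SUM(amount) FROM orders GROUP BY order_date",
    QueryExecutionContext={"Database": "sales"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```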

Building a data lake is one hurdle but building a well-architected and secure data lake is another. With Lake Formation, building and managing data lakes is much easier. On a secure cloud environment, your data will be safe and easy to access.

2nd Watch has been recognized as a Premier Consulting Partner by AWS for nearly a decade and our engineers are 100% certified on AWS. Contact us to learn more about AWS Lake Formation or to get assistance building your data lake.

-Tessa Foley, Marketing


Build State-of-the-Art Forecasts with Amazon Forecast in 15 Minutes

There is a saying in meteorology that you can be accurate more often than not if you predict tomorrow’s weather to be the same as today’s weather. Of course, that is not always the case unless you live in a place like San Diego or if you use data to make your predictions.

Forecasting in business requires data, lots of data, and it requires specialized data science skills, time, and tools to wrangle, prepare, and analyze the data.

Cloud solution providers such as AWS are enabling organizations to collect and host all that business data and provide tools to seamlessly integrate and analyze data for analytics and forecasting. Amazon Forecast is a managed service that consumes time series data and makes predictions without requiring the user to have any machine learning knowledge or experience.

Prerequisites

Determine Use Case and Acceptable Accuracy

It is important to identify use cases and accuracy criteria before generating forecasts. As machine learning tools become easier to use, hasty predictions that are ostensibly accurate become more common.

Without a prepared use case and a definition of acceptable accuracy, the usefulness of any data analysis will be in question. A common use case is predicting customer demand for inventory items to ensure adequate supply. For inventory, a reasonable accuracy target is for the predicted demand to be higher than the actual demand 90% of the time, ensuring adequate supply without overstocking.

In statistical terms, that corresponds to a quantile loss at the 90th percentile, denoted P90.
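To make that concrete, the generic quantile (or “pinball”) loss below, evaluated at tau = 0.9, penalizes under-forecasting nine times as heavily as over-forecasting, which is what pushes the optimal prediction above actual demand about 90% of the time. This is the generic form, not Amazon Forecast’s exact weighted metric.

```python
def pinball_loss(actual: float, forecast: float, tau: float = 0.9) -> float:
    """Quantile (pinball) loss: under-forecasts are weighted by tau,
    over-forecasts by (1 - tau)."""
    if actual >= forecast:
        return tau * (actual - forecast)        # forecast came in too low
    return (1 - tau) * (forecast - actual)      # forecast came in too high

# Under-forecasting by 10 units costs 9.0; over-forecasting by 10 costs only 1.0.
print(pinball_loss(actual=110, forecast=100))   # 9.0
print(pinball_loss(actual=100, forecast=110))   # 1.0
```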

How Amazon Forecast Works

To get started, you collect your historical and related time series data and upload it to Amazon Forecast. Amazon Forecast automatically inspects the data, identifies the key attributes, selects an appropriate machine learning algorithm, trains the model, and generates the forecasts. Forecasts can be visualized, or the data can be exported for downstream processing.

Aggregate and Prepare Data

Time series data is often more granular than necessary for many use cases. If transaction data is collected across multiple locations (or device readings or inventory items) and the use case requires only a prediction of the total amount, the data will need to be aggregated before attempting any predictions.
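A minimal pandas sketch of that kind of roll-up, assuming a hypothetical transaction file with timestamp, location_id, and demand columns:

```python
import pandas as pd

# Hypothetical per-transaction data: timestamp, location_id, demand
df = pd.read_csv("transactions.csv", parse_dates=["timestamp"])

# Roll transaction-level records up to one daily total across all locations,
# which is the granularity this use case actually needs.
daily_total = (
    df.set_index("timestamp")
      .resample("D")["demand"]
      .sum()
      .reset_index()
)
daily_total.to_csv("daily_demand.csv", index=False)
```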

Inconsistencies in time series data are common and should be analyzed and corrected as much as possible before attempting any predictions. In many cases, perfect corrections are impossible due to missing or inaccurate data and methods to smooth, fill, or interpolate the data will need to be employed.
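For gaps and bad readings, a hedged pandas approach (file and column names assumed) is to make missing days explicit, interpolate short runs, and forward-fill whatever remains:

```python
import pandas as pd

# Assumed input: one row per day with timestamp and demand columns; gaps allowed.
series = (
    pd.read_csv("daily_demand.csv", parse_dates=["timestamp"])
      .set_index("timestamp")["demand"]
)

# Reindex to a complete daily range so missing days show up as NaN,
# interpolate gaps of up to three days, then forward-fill anything left.
full_range = pd.date_range(series.index.min(), series.index.max(), freq="D")
series = series.reindex(full_range).interpolate(limit=3).ffill()
series.rename_axis("timestamp").to_frame("demand").to_csv("daily_demand_clean.csv")
```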

Amazon Forecast Forecasts

Generating a forecast from Amazon Forecast is much easier than doing the prerequisite work. Amazon Forecast provides half a dozen predefined algorithms and an option for AutoML, which evaluates all of the algorithms and chooses the one it determines fits best.

Simple CSV files are uploaded, a Predictor is trained, and a Forecast is created. The end-to-end process usually takes several hours depending on the size of the data and the parameter settings. Once generated, you can see the results in a Forecast Lookup or export them back to CSV to be consumed by a data visualization service such as Amazon QuickSight.
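Automating that flow with boto3 might look roughly like the sketch below. The names, S3 path, and IAM role are placeholders, and each create_* call is asynchronous, so in practice you would wait for each resource to become ACTIVE before moving on.

```python
import boto3

forecast = boto3.client("forecast")

# 1. A dataset group and a target time series dataset (daily demand per item).
dsg = forecast.create_dataset_group(DatasetGroupName="demand", Domain="RETAIL")
ds = forecast.create_dataset(
    DatasetName="demand_target",
    Domain="RETAIL",
    DatasetType="TARGET_TIME_SERIES",
    DataFrequency="D",
    Schema={"Attributes": [
        {"AttributeName": "timestamp", "AttributeType": "timestamp"},
        {"AttributeName": "item_id", "AttributeType": "string"},
        {"AttributeName": "demand", "AttributeType": "float"},
    ]},
)
forecast.update_dataset_group(
    DatasetGroupArn=dsg["DatasetGroupArn"], DatasetArns=[ds["DatasetArn"]]
)

# 2. Import the prepared CSV from S3 (the role must allow Forecast to read the bucket).
forecast.create_dataset_import_job(
    DatasetImportJobName="demand_import",
    DatasetArn=ds["DatasetArn"],
    DataSource={"S3Config": {
        "Path": "s3://my-bucket/daily_demand_clean.csv",
        "RoleArn": "arn:aws:iam::123456789012:role/ForecastS3Access",
    }},
    TimestampFormat="yyyy-MM-dd",
)

# 3. Train a Predictor with AutoML, then generate the Forecast.
predictor = forecast.create_predictor(
    PredictorName="demand_automl",
    PerformAutoML=True,
    ForecastHorizon=30,
    InputDataConfig={"DatasetGroupArn": dsg["DatasetGroupArn"]},
    FeaturizationConfig={"ForecastFrequency": "D"},
)
forecast.create_forecast(ForecastName="demand_forecast",
                         PredictorArn=predictor["PredictorArn"])
```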

If you skipped the prerequisites, you would look at the Forecast results and ask, “Now what?” If your results satisfy your use case and accuracy requirements, you can start working on other use cases and/or create an Amazon Forecast pipeline that delivers regular predictions.

Improving Forecast Accuracy

The most important factor affecting forecast accuracy is the quality and quantity of the data. Larger datasets are the first thing that should be tried. Data analysis might also be needed to ensure that the data is consistent.

If the generated Forecast has not satisfied the accuracy requirements you defined, it’s time to adjust some of the (hyper)parameters, include additional data, or both.

Parameters and Hyperparameters

Reducing the forecast horizon can increase accuracy; it’s easier to make shorter-term predictions. Manually setting the Predictor’s algorithm to DeepAR+ enables an advanced option called HPO, which stands for hyperparameter optimization. Enabling HPO causes the Predictor to run multiple times with different tweaks to attempt to increase accuracy.
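In boto3 terms, that roughly corresponds to pinning the algorithm to DeepAR+ and turning on HPO when creating the Predictor. The ARNs and names below are placeholders.

```python
import boto3

forecast = boto3.client("forecast")

forecast.create_predictor(
    PredictorName="demand_deepar_hpo",
    AlgorithmArn="arn:aws:forecast:::algorithm/Deep_AR_Plus",  # pin DeepAR+ instead of AutoML
    PerformHPO=True,               # run multiple training trials with different hyperparameters
    ForecastHorizon=14,            # a shorter horizon is generally easier to predict accurately
    ForecastTypes=["0.5", "0.9"],  # include the P90 quantile from the inventory use case
    InputDataConfig={"DatasetGroupArn":
                     "arn:aws:forecast:us-east-1:123456789012:dataset-group/demand"},
    FeaturizationConfig={"ForecastFrequency": "D"},
)
```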

Related Time Series and Metadata

Related Time Series data (e.g., weather data, holidays) and Metadata (e.g., sub-categories of inventory items) can be added to the Dataset Group to attempt to increase accuracy. Matching item_ids and making sure beginning and ending timestamps line up with the target dataset adds overhead that may not be necessary, depending on your accuracy requirements.

For more details on using Amazon Forecast, watch this video on how to build accurate forecasting models.

-Joey Brown, Sr Cloud Consultant


Keep up your Redshift Lake House Property Values

When you deployed Redshift a few years ago, your new data lake was going to allow your organization to make better, faster, more informed business decisions.  It would break down data silos allowing your Data Scientists to have greater access to all data sources, quickly, enabling them to be more efficient in delivering consumable data insights.

Now that some time has passed, though, there is a good chance your data lake may no longer be returning the value it initially did. It has turned into a catch-all for your data, and maybe even a giant data mess, with clusters filling up too quickly and forcing you to constantly delete data or scale up. Teams blame one another for consuming too many resources, even though workloads are split and shouldn’t impact one another. Slow queries result from a table structure that was chosen at initial deployment and no longer fits the business and data you generate today. All of this leaves your expensive data scientists and analysts less productive than when you first deployed Redshift.

Keep in mind, though, that the Redshift you deployed a few years ago is not the same Redshift today.  We all know that AWS is continuously innovating, but over the last 2 years they have added more than 200 new features to Redshift that can address many of these problems, such as:

  • Utilizing AQUA nodes, which can deliver a 10x performance improvement
  • Refreshing instance families that can lower your overall spend
  • Federated query, which allows you to query across Redshift, S3, and relational database services to produce aggregated data sets, which can then be put back into the data lake to be consumed by other analytics services (see the sketch after this list)
  • Concurrency scaling, which automatically adds and removes capacity to handle unpredictable demand from thousands of concurrent users, so you do not take a performance hit
  • The ability to take advantage of machine learning with automatic workload management (WLM) to dynamically manage memory and concurrency, helping maximize query throughput
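As a hedged sketch of the federated query item above, the snippet below uses the Redshift Data API (boto3 "redshift-data") to run a query that joins a local Redshift table with a table in an external schema. All identifiers, including the secret ARN, are placeholders, and the external schema would be created separately (for example over an Aurora PostgreSQL database for federated query, or over S3 for Spectrum).

```python
import boto3

redshift_data = boto3.client("redshift-data")

sql = """
    SELECT s.region, SUM(o.amount) AS revenue
    FROM analytics.orders AS o          -- local Redshift table
    JOIN federated_pg.stores AS s       -- table in an external schema
      ON o.store_id = s.store_id
    GROUP BY s.region
"""

resp = redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="prod",
    SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-creds",
    Sql=sql,
)

# The Data API is asynchronous: poll describe_statement before fetching results.
status = redshift_data.describe_statement(Id=resp["Id"])
```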

As a matter of fact, clients repeatedly tell us there have been so many innovations with Redshift, it’s hard for them to determine which ones will benefit them, let alone stay aware of them all.

Having successfully deployed and maintained AWS Redshift for years here at 2nd Watch, we have packaged our best practice learnings to deliver the AWS Redshift Health Assessment.  The AWS Redshift Health Assessment is designed to ensure your Redshift Cluster is not inhibiting the productivity of your valuable and costly specialized resources.

At the end of our 2-3 week engagement, we deliver a lightweight prioritized roadmap of the best enhancements to be made to your Redshift cluster that will deliver immediate impact to your business.  We will look for ways to not only improve performance but also save you money where possible, as well as analyze your most important workloads to ensure you have an optimal table design deployed utilizing the appropriate and optimal Redshift features to get you the results you need.

AWS introduced the Lake House analogy to better describe what Redshift has become. A lake house is prime real estate that everyone wants because it gives you a view of something beautiful, with limitless opportunities for enjoyment. With the ability to use a common query or dashboard across your data warehouse and multiple data lakes, Redshift likewise provides the beautiful sight of all your data and limitless possibilities. However, every lake house needs ongoing maintenance to ensure it brings you the enjoyment you desired when you first purchased it, and a lake house built with Redshift is no different.

Contact 2nd Watch today to maximize the value of your data, like you intended when you deployed Redshift.

-Rob Whelan, Data Engineering & Analytics Practice Manager


5 Questions You Need to Answer to Maximize Your Data Use

Businesses have been collecting data for decades, but we’re only just starting to understand how best to apply new technologies, like machine learning and AI, for analysis. Fortunately, the cloud offers tools to maximize data use. When starting any data project, the best place to begin is by exploring common data problems to gain valuable insights that will help create a strategy for accomplishing your overall business goal.

Why do businesses need data?

The number one reason enterprise organizations need data is for decision support. Business moves faster today than it ever has, and to keep up, leaders need more than a ‘gut feeling’ on which to base decisions. Data doesn’t make decisions for us, but rather augments and influences which path forward will yield the results we desire.

Another reason we all need data is to align strategic initiatives from the top down. When C-level leaders decide to pursue company-wide change, managers need data-based goals and incentives that run parallel with the overall objectives. For change to be successful, there need to be metrics in place to chart progress. Benchmarks, monthly or quarterly goals, department-specific stats, and so on are all used to facilitate achievement and identify intervention points.

We’ve never before had more data available to us than we do today. While making the now necessary decision to utilize your data for insights is the first step, finding data, cleaning it, understanding why you want it, and analyzing the value and application can be intensive. Ask yourself these five questions before diving into a data project to gain clarity and avoid productivity-killing data issues.

1. Is your data relevant?

  • What kind of value are you getting from your data?
  • How will you apply the data to influence your decision?

2. Can you see your data?

  • Are you aware of all the data you have access to?
  • What data do you need that you can’t see?

3. Can you trust your data?

  • Do you feel confident making decisions based on the data you have?
  • If you’re hesitant to use your data, why do you doubt its authenticity?

4. Do you know the recency of your data?

  • When was the data collected? How does that influence relevancy?
  • Are you getting the data you need, when you need it?

5. Where is your data siloed?

  • What SaaS applications do different departments use? (For example: Workday for HR, HubSpot for marketing, Salesforce for Sales, MailChimp, Trello, Atlassian, and so on.)
  • Do you know where all of your data is being collected and stored?

Cloud to the rescue! But only with accurate data

The cloud is the most conducive environment for data analysis because of the plethora of analysis tools available there. More and more tools, like plug-and-play machine learning algorithms, are developed every day, and they are widely and easily available in the cloud.

But tools can’t do all the work for you. Tools cannot unearth the value of data. It’s up to you to know why you’re doing what you’re doing. What is the business objective you’re trying to get to? Why do you care about the data you’re seeking? What do you need to get out of it?

A clearly defined business objective is incredibly important to any cloud initiative involving data. Once that’s been identified, it’s important for that goal to serve as the guiding force behind the tools you use in the cloud. Because tools are really for developers and engineers, you want to pair them with someone engaged with the business value of the effort as well. Maybe it’s a business analyst or a project manager, but the team should include someone who is in touch with the business objective.

However, you can’t completely rely on cloud tools to solve data problems because you probably have dirty data, or data that isn’t correct or in the specified format. If your data isn’t accurate, all the tools in the world won’t help you accomplish your objectives. Dirty data interferes with analysis and creates a barrier to your data providing any value.

To cleanse your data, you need to validate the data coming in with quality checks. Typically, there are issues with dates and time stamps, spelling errors from form fields, and other human error in data entry. Formatting date-entry fields and using calendar pickers can help users uniformly complete date information. Drop down menus on form fields will reduce spelling errors and allow you to filter more easily. Small design changes like these can significantly help the cleanliness of your data and your ability to maximize the impact of cloud tools.
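A small pandas sketch of those validation checks, with a hypothetical form export and column names: coerce bad timestamps instead of letting them slip through, normalize a free-text field, and surface failing rows for review.

```python
import pandas as pd

df = pd.read_csv("form_submissions.csv")  # hypothetical export of form data

# Coerce unparseable timestamps to NaT rather than letting bad dates slip through.
df["submitted_at"] = pd.to_datetime(df["submitted_at"], errors="coerce")

# Normalize a free-text field that should have been a drop-down, then validate it.
df["state"] = df["state"].str.strip().str.upper()
allowed_states = {"IL", "WI", "MN"}              # assumed list of valid values
df["state_is_valid"] = df["state"].isin(allowed_states)

# Surface rows that fail either check for review instead of silently dropping them.
issues = df[df["submitted_at"].isna() | ~df["state_is_valid"]]
print(f"{len(issues)} of {len(df)} rows failed validation")
```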

Are you ready for data-driven decision making? Access and act on trustworthy data with the Data and Analytics services provided by 2nd Watch to enable smart, fast, and effective decisions that support your business goals. Contact Us to learn more about how to maximize your data use.

-Robert Whelan, Data Engineering & Analytics Practice Manager


3 Productivity-Killing Data Problems and How to Solve Them

With the typical enterprise using over 1,000 Software as a Service applications (source: Kleiner Perkins), each with its own private database, it’s no wonder people complain their data is siloed. Picture a thousand little silos, all locked up!

[Figure: Number of cloud applications used per enterprise, by industry vertical]

Then, imagine you start building a dashboard out of all those data silos. You’re squinting at it and wondering, can I trust this dashboard? You placate yourself because at least you have data to look at, but this creates more questions for which data doesn’t yet exist.

If you’re in a competitive industry, and we all are, you need to take your data analysis to the next level. You’re either gaining an advantage over your competition or being left behind.

As a business leader, you need data to support your decisions. These three data complexities are at the core of every leader’s difficulties with gaining business advantages from data:

  1. Siloed data
  2. Untrustworthy data
  3. No data

 

  1. Siloed data

Do you have trouble seeing your data at all? Are you mentally scanning your systems and realizing just how many different databases you have? A recent customer of ours was collecting reams of data from their industrial operations but couldn’t derive the data’s value due to the siloed nature of their datacenter database. The data couldn’t reach any dashboard in any meaningful way. It is a common problem. With enterprise data doubling every few years, it takes modern tools and strategies to keep up with it.

For our customer, we started with defining the business purpose of their industrial data – to predict demand in the coming months so they didn’t have a shortfall. That business purpose, which had team buy-in at multiple corporate levels, drove the entire engagement. It allowed us to keep the technology simple and focused on the outcome.

One month into the engagement, they had clean, trustworthy, valuable data in a dashboard. Their data was unlocked from the database and published.

Siloed data takes some elbow grease to access, but it becomes a lot easier if you have a goal in mind for the data. It cuts through noise and helps you make decisions more easily if you know where you are going.

  2. Untrustworthy data

Do you have trouble trusting your data? You have a dashboard, yet you’re pretty sure the data is wrong, or lots of it is missing. You can’t take action on it, because you hesitate to trust it. Data trustworthiness is a prerequisite for making your data action-oriented. But most data has problems – missing values, invalid dates, duplicate values, and meaningless entries. If you don’t trust the numbers, you’re better off without the data.

Data is there for you to take action on, so you should be able to trust it. One key strategy is to not bog down your team with maintaining systems, but rather use simple, maintainable, cloud-based systems that use modern tools to make your dashboard real.

  3. No data

Often you don’t even have the data you need to make a decision. “No data” comes in many forms:

  • You don’t track it. For example, you’re an ecommerce company that wants to understand how email campaigns can help your sales, but you don’t have a customer email list.
  • You track it but you can’t access it. For example, you start collecting emails from customers, but your email SaaS system doesn’t let you export your emails. Your data is so “siloed” that it effectively doesn’t exist for analysis.
  • You track it but need to do some calculations before you can use it. For example, you have a full customer email list, a list of product purchases, and you just need to join the two together. This is a great place to be and is where we see the vast majority of customers.
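For that last scenario, the “calculation” is often just a join. A minimal pandas sketch, assuming hypothetical exports of the two lists:

```python
import pandas as pd

# Hypothetical exports: one row per customer email, one row per purchase.
emails = pd.read_csv("customer_emails.csv")   # columns: customer_id, email
purchases = pd.read_csv("purchases.csv")      # columns: customer_id, product, amount

# Join on customer_id to connect email campaign targets to order history.
campaign_base = purchases.merge(emails, on="customer_id", how="inner")
print(campaign_base.groupby("email")["amount"].sum().sort_values(ascending=False).head())
```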

Solving the “no data” problem means finding patterns and insights not just within datasets, but across datasets. This is only possible with a modern, cloud-native data lake.

The solution: define your business need and build a data lake

Step one for any data project – today, tomorrow and forever – is to define your business need.

Do you need to understand your customer better? Whether it is click behavior, email campaign engagement, order history, or customer service, your customer generates more data today than ever before that can give you clues as to what she cares about.

Do you need to understand your costs better? Most enterprises have hundreds of SaaS applications generating data from internal operations. Whether it is manufacturing, purchasing, supply chain, finance, engineering, or customer service, your organization is generating data at a rapid pace.

(Source: AWS, “What is a Data Lake?”)

Don’t be overwhelmed. You can cut through the noise by defining your business case.

The second step in your data project is to take that business case and make it real in a cloud-native data lake. Yes, a data lake. I know the term has been abused over the years, but a data lake is very simple; it’s a way to centrally store all (all!) of your organization’s data, cheaply, in open source formats to make it easy to access from any direction.

Data lakes used to be expensive, difficult to manage, and bulky. Now, all major cloud providers (AWS, Azure, GCP) have established best practices to keep storage dirt-cheap and data accessible and very flexible to work with. But data lakes are still hard to implement and require specialized, focused knowledge of data architecture.

How does a data lake solve these three problems?

  1. Data lakes de-silo your data. Since the data stored in your data lake is all in the same spot, in open-source formats like JSON and CSV, there aren’t any technological walls to overcome. You can query everything in your data lake from a single SQL client. If you can’t, then that data is not in your data lake and you should bring it in.
  2. Data lakes give you visibility into data quality. Modern data lakes and expert consultants build in a variety of checks for data validation, completeness, lineage, and schema drift. These are all important concepts that together tell you if your data is valuable or garbage. These sorts of patterns work together nicely in a modern, cloud-native data lake.
  3. Data lakes welcome data from anywhere and allow for flexible analysis across your entire data catalog. If you can format your data into CSV, JSON, or XML, then you can put it in your data lake. This solves the problem of “no data.” It is very easy to create the relevant data, either by finding it in your organization, or engineering it by analyzing across your data sets. An example would be joining data from Sales (your CRM) and Customer Service (Zendesk) to find out which product category has the best or worst customer satisfaction scores.

The 2nd Watch DataOps Foundation Platform

You should only build a data lake if you have clear business outcomes in mind. Most cloud consulting partners will robotically build a bulky data lake without any thought to the business outcome. What sets 2nd Watch apart is our focus on your business needs. Do you need to make better decisions? Speed up a process? Reduce costs somewhere? We keep your goal front and center throughout the entire engagement. We’ve deployed data lakes dozens of times for enterprises with this unique focus in mind.

Our ready-to-deploy data lake captures years of cloud experience and best practices, with integration from governance to data exploration and storage. We explain the reasons behind the decisions and make changes based on your requirements, while ingesting data from multiple sources and exploring it as soon as possible. The core of the data lake is three zones, each backed by its own S3 bucket.

Here is a tour of each zone:

  • Drop Zone: As the “single source of truth,” this is a copy of your data in its most raw format, always available to verify what the actual truth is. Place data here with minimal or no formatting. For example, you can take a daily “dump” of a relational database in CSV format.
  • Analytics Zone: To support general analytics, data in the Analytics Zone is compressed and reformatted for fast analytics. From here, you can use a single SQL Client, like Athena, to run SQL queries over your entire enterprise dataset — all from a single place. This is the core value add of your data lake.
  • Curated Zone: The “golden” or final, polished, most-valued datasets for your company go here. This is where you save and refresh data that will be used for dashboards or turned into visualizations.

Our classic three-zone data lake on S3 features immutable data by default. You’ll never lose data, nor do you have to configure a lot of settings to accomplish this. Using AWS Glue, data is automatically compressed and archived to minimize storage costs, and convenient search backed by an always-up-to-date data catalog lets you easily discover all your enterprise datasets.
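As a hedged illustration of that Glue step, the sketch below reads a raw CSV table registered over the Drop Zone and writes it to the Analytics Zone as compressed Parquet. The database, table, and bucket names are placeholders.

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw CSV table that a crawler registered over the Drop Zone bucket.
raw = glue_context.create_dynamic_frame.from_catalog(
    database="drop_zone", table_name="orders_csv"
)

# Write it to the Analytics Zone as Parquet: columnar and compressed, so it is
# cheaper to store and much faster to scan with Athena.
glue_context.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={"path": "s3://my-analytics-zone/orders/"},
    format="parquet",
)

job.commit()
```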

In the Curated Zone, only the most important “data marts” – approved datasets – get loaded into more costly Redshift or RDS, minimizing costs and complexity. And with Amazon SageMaker, tapping into your Analytics and Curated Zone, you are prepared for effective machine learning. One of the most overlooked aspects of machine learning and advanced analytics is the great importance of clean, available data. Our data lake solves that issue.

If you’re struggling with one of these three core data issues, the solution is to start with a crisp definition of your business need, and then build a data lake to execute on that need. A data lake is just a central repository for flexible and cheap data storage. If you focus on keeping your data lake simple and geared towards the analysis you need for your business, these three core data problems will be a thing of the past.

If you want more information on creating a data lake for your business, download our DataOps Foundation datasheet to learn about our 4-8 week engagement that helps you build a flexible, scalable data lake for centralizing, exploring and reporting on your data.

-Rob Whelan, Practice Manager, Data Engineering & Analytics

 

 


Cloud Crunch Podcast: 5 Strategic IT Business Drivers CXOs are Contemplating Now

What is the new normal for life and business after COVID-19, and how does that impact IT? We dive into the 5 strategic IT business drivers CXOs are contemplating now and the motivation behind those drivers. Read the corresponding blog article at https://www.2ndwatch.com/blog/five-strategic-business-drivers-cxos-contemplating-now/. We’d love to hear from you! Email us at CloudCrunch@2ndwatch.com with comments, questions and ideas. Listen now on Spotify, iTunes, iHeart Radio, Stitcher, or wherever you get your podcasts.