With the typical enterprise using over 1,000 Software as a Service applications (source: Kleiner Perkins), each with its own private database, it’s no wonder people complain their data is siloed. Picture a thousand little silos, all locked up!
Figure: Number of cloud applications used per enterprise, by industry vertical
Then, imagine you start building a dashboard out of all those data silos. You’re squinting at it and wondering, can I trust this dashboard? You placate yourself because at least you have data to look at, but this creates more questions for which data doesn’t yet exist.
If you’re in a competitive industry, and we all are, you need to take your data analysis to the next level. You’re either gaining an advantage over your competitors or being left behind.
As a business leader, you need data to support your decisions. These three data complexities are at the core of every leader’s difficulties with gaining business advantages from data:
- Siloed data
- Untrustworthy data
- No data
Siloed data
Do you have trouble seeing your data at all? Are you mentally scanning your systems and realizing just how many different databases you have? A recent customer of ours was collecting reams of data from their industrial operations but couldn’t derive the data’s value due to the siloed nature of their datacenter database. The data couldn’t reach any dashboard in any meaningful way. It is a common problem. With enterprise data doubling every few years, it takes modern tools and strategies to keep up with it.
For our customer, we started with defining the business purpose of their industrial data – to predict demand in the coming months so they didn’t have a shortfall. That business purpose, which had team buy-in at multiple corporate levels, drove the entire engagement. It allowed us to keep the technology simple and focused on the outcome.
One month into the engagement, they had clean, trustworthy, valuable data in a dashboard. Their data was unlocked from the database and published.
Siloed data takes some elbow grease to access, but the work gets a lot easier when you have a goal in mind for the data. Knowing where you are going cuts through the noise and makes decisions easier.
Untrustworthy data
Do you have trouble trusting your data? You have a dashboard, yet you’re pretty sure the data is wrong, or lots of it is missing. You can’t take action on it, because you hesitate to trust it. Data trustworthiness is a prerequisite for making your data action-oriented. But most data has problems: missing values, invalid dates, duplicate values, and meaningless entries. If you don’t trust the numbers, you’re better off without the data.
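To make those four problems concrete, here is a minimal Python sketch that counts them in a small batch of records. The field names, the sample rows, and the list of placeholder strings are all hypothetical, chosen only for illustration; real checks would be tuned to your own data.

```python
from datetime import datetime

# Hypothetical order records; field names and values are illustrative only.
rows = [
    {"order_id": "1001", "order_date": "2021-03-04", "customer": "Acme Co"},
    {"order_id": "1002", "order_date": "2021-02-30", "customer": "Globex"},   # invalid date
    {"order_id": "1003", "order_date": "2021-03-05", "customer": ""},         # missing value
    {"order_id": "1001", "order_date": "2021-03-04", "customer": "Acme Co"},  # duplicate
    {"order_id": "1004", "order_date": "2021-03-06", "customer": "N/A"},      # meaningless entry
]

# Assumed set of strings that carry no information; adjust for your data.
PLACEHOLDERS = {"n/a", "na", "none", "null", "test", "-"}


def is_valid_date(text):
    """Return True if text is a real calendar date in YYYY-MM-DD form."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def quality_report(records):
    """Count records with missing values, invalid dates, duplicates, or placeholders."""
    report = {"missing": 0, "invalid_date": 0, "duplicate": 0, "placeholder": 0}
    seen = set()
    for record in records:
        key = tuple(sorted(record.items()))
        if key in seen:
            # An exact repeat of an earlier record; count it once and move on.
            report["duplicate"] += 1
            continue
        seen.add(key)
        if any(value.strip() == "" for value in record.values()):
            report["missing"] += 1
        if not is_valid_date(record["order_date"]):
            report["invalid_date"] += 1
        if any(value.strip().lower() in PLACEHOLDERS for value in record.values()):
            report["placeholder"] += 1
    return report


report = quality_report(rows)
print(report)
```

A report like this does not fix anything by itself, but it tells you how much of a dataset you can act on before it ever reaches a dashboard.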
Data is there for you to take action on, so you should be able to trust it. One key strategy is to avoid bogging your team down with system maintenance, and instead rely on simple, maintainable, cloud-based systems built with modern tools to power your dashboard.
No data
Often you don’t even have the data you need to make a decision. “No data” comes in many forms:
- You don’t track it. For example, you’re an ecommerce company that wants to understand how email campaigns can help your sales, but you don’t have a customer email list.
- You track it but you can’t access it. For example, you start collecting emails from customers, but your email SaaS system doesn’t let you export your emails. Your data is so “siloed” that it effectively doesn’t exist for analysis.
- You track it but need to do some calculations before you can use it. For example, you have a full customer email list, a list of product purchases, and you just need to join the two together. This is a great place to be and is where we see the vast majority of customers.
Getting value from the data you already track means finding patterns and insights not just within datasets, but across them. That kind of cross-dataset analysis is what a modern, cloud-native data lake is built for.
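The third case, where the data exists and only needs a calculation, can be as small as the sketch below. It assumes two hypothetical extracts, a customer email list and a purchase list, that share a `customer_id` field; all names and values here are made up for illustration.

```python
from collections import defaultdict

# Hypothetical extracts from two separate systems, sharing a customer_id.
customers = [
    {"customer_id": 1, "email": "ana@example.com"},
    {"customer_id": 2, "email": "raj@example.com"},
    {"customer_id": 3, "email": "li@example.com"},
]
purchases = [
    {"customer_id": 1, "product": "kettle", "amount": 40.0},
    {"customer_id": 1, "product": "mug", "amount": 12.5},
    {"customer_id": 3, "product": "teapot", "amount": 30.0},
]


def spend_by_email(customer_rows, purchase_rows):
    """Join purchases to customers and return total spend per email address."""
    totals = defaultdict(float)
    for purchase in purchase_rows:
        totals[purchase["customer_id"]] += purchase["amount"]
    # Keep every customer, including those with no purchases (a left join).
    return {c["email"]: totals.get(c["customer_id"], 0.0) for c in customer_rows}


spend = spend_by_email(customers, purchases)
print(spend)
```

Customers with no purchases still appear, with a total of zero, which is often exactly the group an email campaign wants to reach.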
The solution: define your business need and build a data lake
Step one for any data project – today, tomorrow and forever – is to define your business need.
Do you need to understand your customer better? Whether it is click behavior, email campaign engagement, order history, or customer service interactions, your customer generates more data today than ever before, and that data can give you clues as to what she cares about.
Do you need to understand your costs better? Most enterprises have hundreds of SaaS applications generating data from internal operations. Whether it is manufacturing, purchasing, supply chain, finance, engineering, or customer service, your organization is generating data at a rapid pace.
Don’t be overwhelmed. You can cut through the noise by defining your business case.
The second step in your data project is to take that business case and make it real in a cloud-native data lake. Yes, a data lake. I know the term has been abused over the years, but a data lake is very simple: it’s a way to centrally store all (all!) of your organization’s data, cheaply, in open formats that make it easy to access from any direction.
Data lakes used to be expensive, difficult to manage, and bulky. Now, all major cloud providers (AWS, Azure, GCP) have established best practices to keep storage dirt-cheap and data accessible and very flexible to work with. But data lakes are still hard to implement and require specialized, focused knowledge of data architecture.
How does a data lake solve these three problems?
- Data lakes de-silo your data. Since the data stored in your data lake is all in the same spot, in open formats like JSON and CSV, there aren’t any technological walls to overcome. You can query everything in your data lake from a single SQL client. If you can’t, then that data is not in your data lake and you should bring it in.
- Data lakes give you visibility into data quality. Modern data lakes and expert consultants build in a variety of checks for data validation, completeness, lineage, and schema drift. These are all important concepts that together tell you if your data is valuable or garbage. These sorts of patterns work together nicely in a modern, cloud-native data lake.
- Data lakes welcome data from anywhere and allow for flexible analysis across your entire data catalog. If you can format your data into CSV, JSON, or XML, then you can put it in your data lake. This solves the problem of “no data.” It becomes much easier to create the relevant data, either by finding it in your organization or by engineering it through analysis across your datasets. An example would be joining data from Sales (your CRM) and Customer Service (Zendesk) to find out which product category has the best or worst customer satisfaction scores.
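The Sales and Customer Service example above boils down to a single SQL query once both datasets sit in one place. The sketch below uses Python's built-in `sqlite3` module as a local stand-in for the data lake's SQL client; the table names, column names, and scores are hypothetical.

```python
import sqlite3

# An in-memory SQLite database stands in for the data lake's SQL client.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crm_sales (order_id INTEGER, product_category TEXT);
    CREATE TABLE support_tickets (order_id INTEGER, satisfaction INTEGER);
    INSERT INTO crm_sales VALUES (1, 'kettles'), (2, 'kettles'), (3, 'mugs'), (4, 'mugs');
    INSERT INTO support_tickets VALUES (1, 5), (2, 3), (3, 2), (4, 2);
""")

# Join the two sources on order_id and rank categories by average satisfaction.
query = """
    SELECT s.product_category, AVG(t.satisfaction) AS avg_satisfaction
    FROM crm_sales AS s
    JOIN support_tickets AS t ON t.order_id = s.order_id
    GROUP BY s.product_category
    ORDER BY avg_satisfaction DESC
"""
results = conn.execute(query).fetchall()
for category, score in results:
    print(category, score)
conn.close()
```

The point is less the query itself than the fact that neither source system could answer the question alone.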
The 2nd Watch DataOps Foundation Platform
You should only build a data lake if you have clear business outcomes in mind. Most cloud consulting partners will robotically build a bulky data lake without any thought to the business outcome. What sets 2nd Watch apart is our focus on your business needs. Do you need to make better decisions? Speed up a process? Reduce costs somewhere? We keep your goal front and center throughout the entire engagement. We’ve deployed data lakes dozens of times for enterprises with this unique focus in mind.
Our ready-to-deploy data lake captures years of cloud experience and best practices, with integration from governance to data exploration and storage. We explain the reasons behind the decisions and make changes based on your requirements, while ingesting data from multiple sources and exploring it as soon as possible. In the above image, the core of the data lake is the three zones, represented by green S3 bucket squares.
Here is a tour of each zone:
- Drop Zone: As the “single source of truth,” this is a copy of your data in its most raw format, always available to verify what the actual truth is. Place data here with minimal or no formatting. For example, you can take a daily “dump” of a relational database in CSV format.
- Analytics Zone: To support general analytics, data in the Analytics Zone is compressed and reformatted for fast analytics. From here, you can use a single SQL client, such as Amazon Athena, to run SQL queries over your entire enterprise dataset, all from a single place. This is the core value add of your data lake.
- Curated Zone: The “golden” or final, polished, most-valued datasets for your company go here. This is where you save and refresh data that will be used for dashboards or turned into visualizations.
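To show how data might move through the three zones, here is a rough local sketch. It is not the platform's actual pipeline: plain directories stand in for the S3 buckets, gzip-compressed CSV stands in for whatever compressed format the Analytics Zone really uses, and the file names, columns, and helper functions are hypothetical.

```python
import csv
import gzip
import tempfile
from pathlib import Path


def promote_to_analytics(drop_file, analytics_dir):
    """Write a tidied, gzip-compressed copy of a raw CSV; the raw file is left untouched."""
    analytics_dir.mkdir(parents=True, exist_ok=True)
    target = analytics_dir / (drop_file.name + ".gz")
    with open(drop_file, newline="") as src, gzip.open(target, "wt", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            # Light reformatting only: trim stray whitespace around each cell.
            writer.writerow([cell.strip() for cell in row])
    return target


def build_curated(analytics_file, curated_dir):
    """Aggregate the analytics copy into a small, dashboard-ready dataset."""
    curated_dir.mkdir(parents=True, exist_ok=True)
    totals = {}
    with gzip.open(analytics_file, "rt", newline="") as src:
        for row in csv.DictReader(src):
            totals[row["region"]] = totals.get(row["region"], 0) + int(row["units"])
    target = curated_dir / "units_by_region.csv"
    with open(target, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["region", "units"])
        for region in sorted(totals):
            writer.writerow([region, totals[region]])
    return target


# Local directories stand in for the three zone buckets.
root = Path(tempfile.mkdtemp())
drop_dir = root / "drop"
drop_dir.mkdir()

# Drop Zone: a raw daily dump, stored exactly as it arrived.
raw_text = "region,units\neast, 3\nwest,5\neast,4\n"
raw_file = drop_dir / "orders_dump.csv"
raw_file.write_text(raw_text)

analytics_file = promote_to_analytics(raw_file, root / "analytics")
curated_file = build_curated(analytics_file, root / "curated")
curated_text = curated_file.read_text()
print(curated_text)
```

Each step writes a new file rather than editing the previous one, so the raw copy in the Drop Zone stays available to check any number that looks wrong downstream.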
Our classic three-zone data lake on S3 features immutable data by default. You’ll never lose data, and you don’t have to configure a lot of settings to accomplish this. Using AWS Glue, data is automatically compressed and archived to minimize storage costs. A searchable, always-up-to-date data catalog lets you easily discover all your enterprise datasets.
In the Curated Zone, only the most important “data marts” (approved datasets) get loaded into more costly Redshift or RDS, minimizing costs and complexity. And with Amazon SageMaker tapping into your Analytics and Curated Zones, you are prepared for effective machine learning. One of the most overlooked aspects of machine learning and advanced analytics is the importance of clean, available data. Our data lake solves that issue.
If you’re struggling with one of these three core data issues, the solution is to start with a crisp definition of your business need, and then build a data lake to execute on that need. A data lake is just a central repository for flexible and cheap data storage. If you focus on keeping your data lake simple and geared towards the analysis you need for your business, these three core data problems will be a thing of the past.
If you want more information on creating a data lake for your business, download our DataOps Foundation datasheet to learn about our 4-to-8-week engagement that helps you build a flexible, scalable data lake for centralizing, exploring, and reporting on your data.
-Rob Whelan, Practice Manager, Data Engineering & Analytics