Modern Data Warehouses and Machine Learning: A Powerful Pair

Artificial intelligence (AI) technologies like machine learning (ML) have changed how we handle and process data. However, AI adoption isn’t simple. Most companies apply AI to only a small fraction of their data because scaling AI is challenging. Typically, enterprises cannot harness the power of predictive analytics because they don’t have a fully mature data strategy.

To scale AI and ML, companies must have a robust information architecture that executes a company-wide data and predictive analytics strategy. This requires businesses to apply their data beyond cost reduction and operational needs. Fully embracing AI will require enterprises to make judgment calls and face challenges in assembling a modern information architecture that readies company data for predictive analytics.

A modern data warehouse is the catalyst for AI adoption and can accelerate a company’s data maturity journey. It’s a vital component of a unified data and AI platform: it collects and analyzes data to prepare the data for later stages in the AI lifecycle. Utilizing your modern data warehouse will propel your business past conventional data management problems and enable your business to transform digitally with AI innovations.

What is a modern data warehouse?

On-premise or legacy data warehouses are not sufficient for a competitive business. Today’s market demands that organizations rely on massive amounts of data to best serve customers, optimize business operations, and increase their bottom lines. On-premise data warehouses are not designed to handle this volume, velocity, and variety of data and analytics.

If you want to remain competitive in the current landscape, your business must have a modern data warehouse built on the cloud. A modern data warehouse automates data ingestion and analysis, which closes the loop that connects data, insight, and analysis. It can run complex queries whose results can be shared with AI technologies, supporting seamless ML and better predictive analytics. As a result, organizations can make smarter decisions because the modern data warehouse captures and makes sense of organizational data to deliver actionable insights company-wide.

How does a modern data warehouse work with machine learning?

A modern data warehouse operates at different levels to collect, organize, and analyze data to be utilized for artificial intelligence and machine learning. These are the key characteristics of a modern data warehouse:

Multi-Model Data Storage

Data is stored in the warehouse across multiple models, optimizing performance and integration for each type of business data.

Data Virtualization

Data that is not stored in the data warehouse is accessed and analyzed at the source, which reduces complexity, risk of error, cost, and time in data analysis. 

Mixed Workloads

This is a key feature of a modern data warehouse: mixed workloads support real-time warehousing. Modern data warehouses can concurrently and continuously ingest data and run analytic workloads.

Hybrid Cloud Deployment

Enterprises choose hybrid cloud infrastructure to move workloads seamlessly between private and public clouds for optimal compliance, security, performance, and costs. 

A modern data warehouse can collect and process the data to make the data easily shareable with other predictive analytics and ML tools. Moreover, these modern data warehouses offer built-in ML integrations, making it seamless to build, train, and deploy ML models.

What are the benefits of using machine learning in my modern data warehouse?

Modern data warehouses employ machine learning to adjust and adapt to new patterns quickly. This empowers data scientists and analysts to receive actionable insights and real-time information, so they can make data-driven decisions and improve business models throughout the company. 

Let’s look at how this applies to the age-old question, “how do I get more customers?” We’ll discuss two different approaches to answering this common business question.

The first methodology is the traditional approach: develop a marketing strategy that appeals to a specific audience segment. Your business can determine the segment to target based on your customers’ buying intentions and your company’s strength in providing value. Coming to this conclusion requires asking inductive questions about the data:

  • What is the demand curve?
  • What product does our segment prefer?
  • When do prospective customers buy our product?
  • Where should we advertise to connect with our target audience?

There is no shortage of business intelligence tools and services designed to help your company answer these questions. This includes ad hoc querying, dashboards, and reporting tools.

The second approach utilizes machine learning within your data warehouse. With ML, you can harness your existing modern data warehouse to discover the inputs that impact your KPIs most. You simply feed information about your existing customers into a statistical model, and the algorithms profile the characteristics that define an ideal customer. We can then ask questions about specific inputs:

  • How do we advertise to women with annual income between $100,000 and $200,000 who like to ski?
  • What are the indicators of churn in our self-service customer base?
  • Which frequently seen characteristics define a meaningful market segment?

ML builds models within your data warehouse to enable you to discover your ideal customer from your inputs. For example, you can describe your target customer to the model, and it will find potential customers that fall under that segment. Or, you can feed the model data on your existing customers and have the machine learn the most important characteristics.
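To make this concrete, here is a minimal sketch of the kind of labeled training set you might assemble inside the warehouse before handing it to an ML tool. The schema, table names, and the definition of an “ideal” customer here are all hypothetical:

```sql
-- Hypothetical schema: a customers table plus an orders table in the warehouse.
-- The view assembles one labeled row per customer for an ML tool to train on.
CREATE OR REPLACE VIEW analytics.customer_training_set AS
SELECT
    c.customer_id,
    c.annual_income,
    c.region,
    DATEDIFF('day', c.signup_date, CURRENT_DATE()) AS tenure_days,
    COUNT(o.order_id)                              AS lifetime_orders,
    COALESCE(SUM(o.order_total), 0)                AS lifetime_revenue,
    -- Label: "ideal" customers, defined here as repeat buyers above a revenue threshold
    CASE WHEN COUNT(o.order_id) >= 3
          AND COALESCE(SUM(o.order_total), 0) >= 1000
         THEN 1 ELSE 0 END                         AS is_ideal_customer
FROM warehouse.customers c
LEFT JOIN warehouse.orders o
       ON o.customer_id = c.customer_id
GROUP BY c.customer_id, c.annual_income, c.region, c.signup_date;
```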

Conclusion

A modern data warehouse is essential for ingesting and analyzing data in our data-heavy world. AI and predictive analytics become more effective the more data they can consume, making your modern data warehouse the ideal environment for those algorithms to run. Data science technologies like artificial intelligence and machine learning take it one step further, allowing you to leverage that data to make smarter enterprise-wide decisions.

2nd Watch offers a Data Science Readiness Assessment to provide you with a clear vision of how data science will make the greatest impact on your business. Our assessment will get you started on your data science journey, harnessing solutions such as advanced analytics, ML, and AI. We’ll review your goals and current state, then design preliminary models to discover how data science will provide the most value to your enterprise.

-Ryan Lewis | Managing Consultant at 2nd Watch

Get started with your Data Science Readiness Assessment today to see how you can stay competitive by automating processes, improving operational efficiency, and uncovering ROI-producing insights.


What Is the Difference Between Snowflake and Amazon Redshift?

The modern business world is data-centric. As more businesses turn to cloud computing, they must evaluate and choose the right data warehouse to support their digital modernization efforts and business outcomes. Data warehouses can increase the bottom line, improve analytics, enhance the customer experience, and optimize decision-making. 

A data warehouse is a large repository of data businesses utilize for deep analytical insights and business intelligence. This data is collected from multiple data sources. A high-performing data warehouse can collect data from different operational databases and apply a uniform format for better analysis and quicker insights.

Two of the most popular data warehouse solutions are Snowflake and Amazon Web Services (AWS) Redshift. Let’s look at how these two data warehouses stack up against one another. 

What is Snowflake?

Snowflake is a cloud-based data warehousing solution that uses third-party cloud-compute resources, such as Azure, Google Cloud Platform, or Amazon Web Services (AWS). It is designed to provide users with a fully managed, cloud-native database solution that can scale up or down as needed for different workloads. Snowflake separates compute from storage: a non-traditional approach to data warehousing. With this method, data remains in a central repository while compute instances are managed, sized, and scaled independently.

Snowflake is a good choice for companies that are conscious about their operational overhead and need to quickly deploy applications into production without worrying about managing hardware or software. It is also the ideal platform to use when query loads are lighter, and the workload requires frequent scaling. 

The benefits of Snowflake include:

  • Easy integration with most components of data ecosystems
  • Minimal operational overhead: companies are not responsible for installing, configuring, or managing the underlying warehouse platform
  • Simple setup and use
  • Abstracted configuration for storage and compute instances
  • Robust and intuitive SQL interface

What is Amazon Redshift?

Amazon Redshift is an enterprise data warehouse built on Amazon Web Services (AWS). It provides organizations with a scalable, secure, and cost-effective way to store and analyze large amounts of data in the cloud. Its cloud-based compute nodes enable businesses to perform large-scale data analysis and storage. 

Amazon Redshift is ideal for enterprises that require quick query outputs on large data sets. Additionally, Redshift has several options for efficiently managing its clusters using AWS CLI/Amazon Redshift Console, Amazon Redshift Query API, and AWS Software Development Kit. Redshift is a great solution for companies already using AWS services and running applications with a high query load. 

The benefits of Amazon Redshift include:

  • Seamless integration with the AWS ecosystem
  • Multiple data output formatting support
  • Easy console to extract analytics and run queries
  • Customizable data and security models

Comparing Data Warehouse Solutions

Snowflake and Amazon Redshift both offer impressive performance capabilities, like scalability across multiple servers and high availability with minimal downtime. There are some differences between the two that will determine which one is the best fit for your business.

Performance

Both data warehouse solutions harness massively parallel processing (MPP) and columnar storage, which enables advanced analytics and efficiency on massive jobs. Snowflake boasts a unique architecture that supports structured and semi-structured data, with storage, compute, and cloud services abstracted so each can perform and scale independently. Redshift recently unveiled its own concurrency scaling, coupled with machine learning, to compete with Snowflake’s elastic compute model.

Maintenance

Snowflake is a pure SaaS platform that doesn’t require any maintenance: all software and hardware upkeep is handled by Snowflake. Amazon Redshift’s clusters require manual maintenance from the user.
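For illustration, this is the sort of housekeeping a Redshift administrator has traditionally scheduled by hand (recent Redshift releases automate some of it); the table name is hypothetical, and Snowflake performs the equivalent work behind the scenes:

```sql
-- Routine Amazon Redshift housekeeping, run manually or on a schedule:
VACUUM FULL sales;   -- reclaim space and re-sort rows after heavy deletes/updates
ANALYZE sales;       -- refresh table statistics so the query planner stays accurate

-- The Snowflake equivalent: nothing. Compaction, clustering maintenance,
-- and statistics are handled automatically by the service.
```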

Data and Security Customization

Snowflake supports fewer customization choices in data and security. Its security model relies on always-on encryption and enforces strict security checks. Redshift supports data flexibility via partitioning and distribution. Additionally, Redshift allows you to tailor its end-to-end encryption and set up your own identity management system to manage user authentication and authorization.

Pricing

Both platforms offer on-demand pricing but are packaged differently. Snowflake doesn’t bundle compute and storage in its pricing structure; it treats them as separate entities. Redshift bundles the two in its pricing. Snowflake tiers its pricing based on which features you need, so your company can select the tier that best fits. Redshift rewards businesses with discounts when they commit to longer-term contracts.

Which data warehouse is best for my business?

To determine the best fit for your business, ask yourself the following questions in these specific areas:

  • Do I want to bundle my features? Snowflake splits compute and storage, and its tiered pricing provides more flexibility to your business to purchase only the features you require. Redshift bundles compute and storage to unlock the immediate potential to scale for enterprise data warehouses. 
  • Do I want a customizable security model? Snowflake grants security and compliance options geared toward each tier, so your company’s level of protection is relevant to your data strategy. Redshift provides fully customizable encryption solutions, so you can build a highly tailored security model. 
  • Do I need JSON storage? Snowflake’s JSON storage support wins over Redshift’s. With Snowflake, you can store and query JSON with native functions (see the sketch after this list). With Redshift, JSON is split into strings, making it difficult to query and work with.
  • Do I need more automation? Snowflake automates issues like data vacuuming and compression. Redshift requires hands-on maintenance for these sorts of tasks. 
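As a rough illustration of the JSON point above, Snowflake lets you land raw JSON in a VARIANT column and query it with native path syntax. The table and field names here are hypothetical:

```sql
-- Load raw JSON into a single VARIANT column, then query it with path syntax.
CREATE OR REPLACE TABLE raw_events (payload VARIANT);

SELECT
    payload:customer.id::STRING AS customer_id,
    payload:order.total::NUMBER AS order_total
FROM raw_events
WHERE payload:order.status::STRING = 'completed';
```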

Conclusion

A data warehouse is necessary to stay competitive in the modern business world. The two major data warehouse players – Snowflake and Amazon Redshift – are both best-in-class solutions. One product is not superior to the other, so choosing the right one for your business means identifying the one best for your data strategy.

2nd Watch is an AWS Certified Partner and an Elite Snowflake Consulting Partner. We can help you choose the right data warehouse solution for you and support your business regardless of which data warehouse you choose.

We have been recognized by AWS as a Premier Partner since 2012, as well as an audited and approved Managed Service Provider and Data and Analytics Competency partner for our outstanding customer experiences, depth and breadth of our products and services, and our ability to scale to meet customer demand. Our engineers and architects are 100% certified on AWS, holding more than 200 AWS certifications.

Our full team of certified SnowPros has proven expertise to help businesses implement modern data solutions using Snowflake. From creating a simple proof of concept to developing an enterprise data warehouse to customized Snowflake training programs, 2nd Watch will help you to utilize Snowflake’s powerful cloud-based data warehouse for all of your data needs.

Contact 2nd Watch today to help you choose the right data warehouse for your business!


How a Dedicated Data Warehouse Yields Better Insight than Your CRM or ERP

What percent of your enterprise data goes completely untapped? It’s far more than most organizations realize. Research suggests that as much as 68% of global enterprise data goes unused. The reasons are varied (we can get to the root cause with a current state assessment), but one growing problem stems from misconceptions about CRMs, ERPs, EHRs, and similar operational software systems.

The right operational software systems are valuable tools with their own effective reporting functions. However, the foundation of any successful reporting or analytics initiative depends on two factors: a centralized source of truth and a unified data format. Operational software systems struggle to satisfy either criterion.

Believe it or not, one of the most strategic systems for data-driven decision-making is still a dedicated data warehouse. Here is the value a data warehouse brings to your organization, along with the implementation steps that enhance the accuracy and insight of your analytics.

Download Now: Modern Data Warehouse Comparison Guide [Snowflake, Redshift, Azure Synapse, and Google BigQuery]

CRMs and ERPs Are Data Silos with Disparate Formats

Operational software systems are often advertised as offering a unified view, but that’s only true for their designed purpose. CRMs offer a comprehensive view of customers, ERPs of operations, and EHRs of patient or member medical history. Outside of their defined parameters, these systems are data silos.

In an HBR blog post, Edd Wilder-James captures the conundrum perfectly: “You can’t cleanly separate the data from its intended use. Depending on your desired application, you need to format, filter, and manipulate the data accordingly.”

Some platforms are enabled to integrate outside data sources, but even that provides you with a filtered view of your data, not the raw and centralized view necessary to generate granular and impactful reports. It’s the difference between abridged and unabridged books – you might glean chunks of the big picture but miss entire sections or chapters that are crucial to the overall story.

Building a dedicated data warehouse removes the question of whether your data sets are complete. You can extract, transform, and load data from source systems into star schemas with a unified format optimized for business users to leverage. The data is formatted around the business process rather than the limitations of the tool. That way, you can run multifaceted reports or conduct advanced analytics when you need it – without anchoring yourself to any specific technology.

Tracking Down Your Data Sources

In all honesty, organizations not familiar with the process often overlook vital information sources. There might be a platform used to track shipping that only one member of your team uses. Maybe there’s a customer service representative who logs feedback in an ad hoc document. Or it’s possible there’s HIPAA-compliant software in use that isn’t automatically loading into your EHR. Regardless of your industry, there are likely gaps in your knowledge well outside of the CRMs, ERPs, EHRs, and other ostensibly complete data sources.

How do you build a single source of truth? It’s not as simple as shifting around a few sources. Implementing a dedicated data warehouse requires extensive planning and preparation. The journey starts with finding the invisible web of sources outside of your primary operational software systems. Those organizations that choose to forgo a full-fledged current state assessment to identify those hidden sources only achieve fragmentary analytics at best.

Data warehouse implementations need guidance and buy-in at the corporate level. That starts with a well-defined enterprise data strategy. Before you can create your strategy, you need to ask yourself questions such as these:

  • What are your primary business objectives?
  • What are your key performance indicators?
  • Which source systems contribute to those goals?
  • Which source systems are we currently using across the enterprise?

By obtaining the answers to these and other questions from decision-makers and end users, you can clarify the totality of your current state. Otherwise, hunting down those sources is an uphill battle.

Creating Data Warehouse Value that Lasts

Consolidating your dispersed data sources is just a starting point. Next, you need to extract the data from each source system and populate it within the data warehouse framework itself. A key component of this step is testing the data within your warehouse to verify quality and completeness.

If data loss occurs during the ETL process, the impact of your work and veracity of your insights will be at risk. Running a variety of different tests (e.g., data accuracy, data completeness, data transformation, etc.) will reduce the possibility of any unanticipated biases in your single source of truth.
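As a sketch of what such a test can look like, the following query (with hypothetical staging and warehouse table names) reconciles row counts after a load; a non-zero difference flags potential data loss:

```sql
-- Completeness check: compare row counts between the staged source copy
-- and the warehouse target after an ETL run.
SELECT
    (SELECT COUNT(*) FROM staging.orders)   AS source_rows,
    (SELECT COUNT(*) FROM warehouse.orders) AS warehouse_rows,
    (SELECT COUNT(*) FROM staging.orders)
  - (SELECT COUNT(*) FROM warehouse.orders) AS missing_rows;
```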

What about maintaining a healthy and dynamic data warehouse? How often should you load new data? The answer depends on the frequency of your reporting needs. As a rule of thumb, think in terms of freshness. If your data has gone stale by the time you’re loading it into your data warehouse, increase the frequency of your data refresh. Opt for real-time analytics if it will provide you with a strategic advantage, not because you want to keep current with the latest buzzword.

Improving Your Results with an Outsourced Partner

Each step in the process comes with its own complications. It’s easy to fall into common data warehousing pitfalls unless you have internal resources with experience pinpointing hidden data sources, selecting the right data model, and maintaining your data warehouse post-implementation.

One of our clients in the healthcare software space was struggling to transition to a dynamic data warehousing model that could enhance their sales. Previously, they had a reporting application that they were using on a semi-annual basis. Though they wanted to increase the frequency of their reporting and enable multiple users to run reports simultaneously, they didn’t have the internal expertise to confidently navigate these challenges.

Working with 2nd Watch made a clear difference. Our client was able to leverage a data warehouse architecture that provided daily data availability (in addition to the six-month snapshot) and self-service dashboards that didn’t require changes or updates on their part. We also set them on the right path to leverage a single source of truth through future developments.

Our strategies in that project prioritized our client’s people instead of a specific technology. We considered the reporting and analytics needs of their business users rather than pigeonholing their business into a specific tool. Through our tech-agnostic approach, we guided them toward a future state that provided strategic advantage and a clear ROI that might have otherwise gone unachieved.

Want your data warehouse to provide you with a single source of truth? Schedule a whiteboard session to review your options and consolidate your data into actionable insight.


Where Does a Modern Data Warehouse Fit in an Organization?

In part 1 and part 2 of our modern data warehouse series, we laid out the benefits of a data warehouse and compared the different types of modern data warehouses available. In part 3, we take a step back and see how the modern data warehouse fits in your overall data architecture.

A modern data warehouse is just one piece of the puzzle of a modern data architecture that will ultimately provide insights to the business via reporting, dashboarding, and advanced analytics.

There are many factors to consider when it comes to modern data warehousing, and it’s important to understand upfront that it’s a huge endeavor. With that in mind, a well-designed modern data warehouse will help your organization grow and stay competitive in our ever-changing world.

Download Now: Modern Data Warehouse Comparison Guide [Snowflake, Redshift, Azure Synapse, and Google BigQuery]

The ultimate goal of modern architecture is to facilitate the movement of data not only to the data warehouse but also to other applications in the enterprise. The truth of the matter is that a modern data architecture is designed very similarly to how we at 2nd Watch would design an on-premise or traditional data architecture, though with some major differences. Some of the benefits of a modern data architecture are as follows:

  • Tools and technology available today allow the development process to speed up tremendously.
  • Newer data modeling methodologies can be used to track the history of data efficiently and cost-effectively.
  • Implementing near real-time scenarios is much more cost-effective and easier when utilizing cloud technologies.
  • With some SaaS providers, you can worry much less about the underlying hardware, indexing, backups, and database maintenance and more about the overall business solution.
  • While technology advances have removed some of the technical barriers experienced in on-premises systems, data must still be modeled in a way that supports goals, business needs, and specific use cases.

Below you will find a high-level diagram of a modern data architecture we use at 2nd Watch, along with a description of the core components of the architecture:

Technical details aside, 2nd Watch’s architecture provides key benefits that will add value to any business seeking a modern data warehouse. The raw data layer enables the ingestion of all forms of data, including unstructured data. In addition, the raw layer keeps your data safe by eliminating direct user access and creating historical backups of your source data. This historical record of data can be accessed for data science use cases as well as modeled for reports and dashboards to show historical trends over time.

The transformation-focused data hub enables easy access to data from various source systems. For example, imagine you have one customer that can be tracked across several subsidiary companies. The business layer would enable you to track their activity across all of your business lines by conforming the various data points into one source of truth. Furthermore, the business layer allows your organization to add additional data sources without disrupting your current reporting and solutions.

The enterprise data warehouse provides a data layer structured with reporting in mind. It ensures that any reports and dashboards update quickly and reliably, and it provides data scientists with reliable data structured for use in models. Overall, the modern data warehouse architecture enables you to provide your end users with near real-time reporting, allowing them to act on insights as they occur. Each component of the architecture provides unique business value that translates into a competitive advantage.
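To make the layering concrete, here is a simplified sketch of a business-layer view that conforms customer records from two hypothetical subsidiaries’ raw layers into one source of truth; all schema and column names are illustrative:

```sql
-- Business layer: conform customers from two subsidiaries' raw layers.
CREATE OR REPLACE VIEW business.customer AS
SELECT
    customer_id,
    UPPER(TRIM(email))      AS email_key,    -- conformed matching key
    'subsidiary_a'          AS source_system
FROM raw_subsidiary_a.customers
UNION ALL
SELECT
    cust_no                 AS customer_id,
    UPPER(TRIM(email_addr)) AS email_key,
    'subsidiary_b'          AS source_system
FROM raw_subsidiary_b.clients;
```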

If you depend on your data to better serve your customers, streamline your operations, and lead (or disrupt) your industry, a modern data platform built on the cloud is a must-have for your organization.

Contact us for a complimentary whiteboarding session to learn what a modern data warehouse would look like for your organization.


3 Options for Getting Started with a Modern Data Warehouse

In previous blog posts, we laid out the benefits of a modern data warehouse, explored the different types of modern data warehouses available, and discussed where a modern data warehouse fits in your overall data architecture.

Download Now: Modern Data Warehouse Comparison Guide [Snowflake, Amazon Redshift, Azure Synapse, and Google BigQuery]

There is no such thing as a one-size-fits-all data warehouse, and likewise no singular approach to getting started. The right starting point depends on your goals, needs, and where you are today. In this blog post, we’ll outline a few options 2nd Watch offers for getting started with a modern data warehouse and the details for each.

  • Option 1: 60-Minute Data Architecture Assessment
  • Option 2: Modern Data Warehouse Strategy
  • Option 3: Modern Data Platform Quickstart

Option 1: 60-Minute Data Architecture Assessment

A 60-minute data architecture assessment is a great option to see how a modern data warehouse would fit in your current environment and what would be involved to get from where you are now to where you want to be.

During this session, we will outline a plan to achieve your goals and help you understand the tools, technologies, timeline, and cost to get there.

Who is this for? Organizations in the very early planning stages

Duration: 60 minutes

More Information

Option 2: Modern Data Warehouse Strategy

In order to see ROI and business value from your modern data warehouse, you must have a clear plan on how you are going to use it. During a modern data warehouse strategy project, our team will work with your stakeholders to understand your business goals and design a tech strategy to ensure business value and ROI from your data environment.

Who is this for? Organizations in the early planning stages looking to establish the business use case, cost benefits, and ROI of a modern data warehouse before getting started

Duration: 2-, 4-, 6-, or 8-week strategies are available

More Information

Option 3: Modern Data Platform Quickstart

You have your strategy laid out and are ready to get started ASAP. The modern data platform quickstart is a great option to get your modern data warehouse up and running in as few as six weeks.

During this quickstart, we’ll create a scalable data warehouse; clean, normalize, and ingest data; and even provide reports for predefined use cases.

Who is this for? Organizations that have outlined their strategy and are ready to start seeing the benefits of a modern data warehouse

Duration: 6 weeks

More Information

Not sure where to begin? We recommend beginning with a 60-minute data architecture assessment. This session allows us to walk through your current architecture, understand your organization’s pain points and goals for analytics, brainstorm on a future state architecture based on your goals, and then come up with next steps. Furthermore, the assessment allows us to determine if your organization needs to make a change, what those changes are, and how you might go about implementing them. Simply put, we want to understand the current state, learn about the future state of what you want to build toward, and help you create a plan so you can successfully execute on a modern data warehouse project.

A Word of Warning

Modern data warehouses are a big step forward from traditional on-premise architectures. They allow organizations to innovate quicker and provide value to the business much faster. An organization has many options in the cloud and many vendors offer a cloud data warehouse, but be careful: building a modern data warehouse architecture is highly involved and may require multiple technologies to get you to the finish line.

The most important thing to do when embarking on a modern data warehouse initiative is to have an experienced partner to guide you through the process the right way: from establishing why a cloud data warehouse is important to your organization, to outlining what the future state vision should be, to developing a plan to get you there.

Data warehouse architecture is changing; don’t fall behind your competition! With multiple options for getting started, there is no reason to wait.

We hope you found this information valuable. If you have any questions or would like to learn more, please contact us and we’ll schedule a time to connect.


The Data Supply Chain for Marketers

In this blog post, we’ll explore the basics of the data supply chain and why they matter for marketers, including:

  • Data ingestion
  • The differences between ETL and ELT
  • Data pipelines
  • Data storage options

If you haven’t read Data 101 for Marketers, start with that blog post here.

Data Ingestion

Data ingestion is the first step in any analytical undertaking. It’s a process where data from one or many sources are gathered and imported into one place. Data can be imported in real time (like POS data) or in batches (like billing systems).

Why It Matters for Marketers:

The process of data ingestion consolidates all of the relevant information from across your data sources into a single, centralized storage system. Through this process, you can begin to convert disparate data created in your CRM, POS, and other source systems into a unified format that is ready for real-time or batch analysis.

Real-World Examples:

Marketing teams pull data from a wide variety of resources, including Salesforce, Marketo, Facebook, Twitter, Google, Stripe, Zendesk, Shopify, Mailchimp, mobile devices, and more. It’s incredibly time-consuming to manually combine these data sources, but by using tools to automate some of these processes you can get data into the hands of your team faster.

This empowers marketers to answer more sophisticated questions about customer behavior, such as:

  • Why are customers leaving a specific product in their online shopping carts?
  • What is the probability that we’ll lose a customer early in the customer journey?
  • Which messaging pillar is resonating most with customers in the middle of the sales funnel who live in Germany?

Image 1: In this image, three source systems with varying formats and content are ingested into a central location in the data warehouse.

ETL vs. ELT

ETL and ELT are both data integration methods that make it possible to take data from various sources and move it into a singular storage space like a data warehouse. The difference is in when the transformation of data takes place.

Real-World Examples:

As your business scales, ELT tools are better-equipped to handle the volume and variety of marketing data on hand. However, a robust data plan will make use of both ELT and ETL tools.

For example, a member of your team wants to know which marketing channels are the most effective at converting customers with the highest average order value. The data you need to answer that question is likely spread across multiple structured data sources (e.g., referral traffic from Google Analytics, transaction history from your POS or e-commerce system, and customer data from your CRM).

Through your ETL process, you can extract relevant data from the above sources, transform it (e.g., updating customer contact info across files for uniformity and accuracy), and load the clean data into one final location. This enables your team to run your query in a streamlined way with limited upfront effort.

In comparison, your social media marketing team wants to see whether email click-through rates or social media interactions lead to more purchases. The ELT process allows them to extract and load all of the raw data in real time from the relevant source systems and run ad-hoc analytics reports, making adjustments to campaigns on the fly.

Extract, Transform, Load (ETL)

This method of data movement first copies data from the original database, then converts it into a singular format in a staging area. Lastly, the transformed data is loaded into the data warehouse for analytics.

When You Should Use ETL:

ETL processes are preferable for moving small amounts of structured data with no rush on when that data is available for use. A robust ETL process would clean and integrate carefully selected data sources to provide a single source of truth that delivers faster analytics and makes understanding and using the data extremely simple.

Image 2: This image shows four different data sources with varying data formats being extracted from their sources, transformed to all be formatted the same, and then loaded into a data warehouse. Having all the data sources formatted the same way allows you to have consistent and accurate data in the chart that is built from the data in the data warehouse.

Extract, Load, Transform (ELT)

Raw data is read from the source databases and then loaded into the target system in its raw form. That raw data is usually stored in a cloud-based data lake or data warehouse, allowing you to transform only the data you need.

When You Should Use ELT:

ELT processes shine when there are large amounts of complex structured and unstructured data that need to be made available more immediately. ELT processes also upload and store all of your data in its raw format, making data ingestion faster. However, performing analytics on that raw data is a more complex process because cleaning and transformation happen post-upload.

Image 3: This image is showing four different data sources with the data formatted in different ways. The data is being extracted from the various sources, loaded into the data warehouse, and then transformed within the data warehouse to all be formatted the same. This allows for accurate reporting of the data in the chart seen above.
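For a concrete (and simplified) picture of the ELT pattern, the following Snowflake sketch lands raw JSON first and transforms it afterward with a view; the stage, table, and field names are hypothetical:

```sql
-- ELT: land the raw JSON as-is (no upfront transformation)...
CREATE TABLE IF NOT EXISTS raw.social_events (payload VARIANT);

COPY INTO raw.social_events
FROM @marketing_stage/social/
FILE_FORMAT = (TYPE = 'JSON');

-- ...then transform inside the warehouse, only for the fields the analysis needs.
CREATE OR REPLACE VIEW analytics.engagement AS
SELECT
    payload:campaign_id::STRING                AS campaign_id,
    payload:interaction.type::STRING           AS interaction_type,
    payload:interaction.occurred_at::TIMESTAMP AS occurred_at
FROM raw.social_events;
```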

Data Pipeline

A data pipeline is a series of steps in an automated process that moves data from one system to another, typically using ETL or ELT practices.

Why It Matters for Marketers:

The automatic nature of a data pipeline removes the burden of data manipulation from your marketing team. There’s no need to chase down the IT team or manually download files from your marketing automation tool, CRM, or other data sources to answer a single question. Instead, you can focus on asking the questions and homing in on strategy while the technology takes away the burden of tracking down, manipulating, and refreshing the information.

Real-World Examples:

Say that, under your current infrastructure, your sales data is split between your e-commerce platform and your in-store POS systems. The different data formats are an obstacle to proper analysis, so you decide to move them to a new target system (such as a data warehouse).

A data pipeline would automate the process of selecting data sources, prioritizing the datasets that are most important, and transforming the data without any micromanagement of the tool. When you’re ready for analysis, the data will already be available in one destination and validated for accuracy and uniformity, enabling you to start your analysis without delay.
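As one concrete way to picture this automation, here is a sketch using Snowflake tasks (one of several tooling options) to chain a daily refresh with a downstream summary rebuild; all object names and the schedule are hypothetical:

```sql
-- Step 1: refresh the consolidated sales table daily at 6:00 UTC.
CREATE OR REPLACE TASK refresh_sales
  WAREHOUSE = etl_wh
  SCHEDULE  = 'USING CRON 0 6 * * * UTC'
AS
  INSERT INTO warehouse.sales
  SELECT * FROM staging.ecommerce_sales
  UNION ALL
  SELECT * FROM staging.pos_sales;

-- Step 2: rebuild the reporting summary only after the refresh finishes.
CREATE OR REPLACE TASK rebuild_sales_summary
  WAREHOUSE = etl_wh
  AFTER refresh_sales
AS
  CREATE OR REPLACE TABLE reporting.daily_sales_summary AS
  SELECT sale_date, channel, SUM(amount) AS revenue
  FROM warehouse.sales
  GROUP BY sale_date, channel;

-- Tasks are created suspended; resume the child before the root.
ALTER TASK rebuild_sales_summary RESUME;
ALTER TASK refresh_sales RESUME;
```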

Data Storage Options

Databases, data warehouses, and data lakes are all systems for storing and using data, but there are differences to consider when choosing a solution for your marketing data.

Definitions:

  • A database is a central, organized place where structured data can be stored and then accessed by various applications, such as MailChimp, Rollworks, Marketo, or tools for more traditional campaigns like direct mail. It is not meant for large-scale analytics.
  • A data warehouse is a specific way of structuring your data in database tables so that it is optimized for analytics. A data warehouse brings together all of your various data sources under one roof and structures them for analytics.
  • A data lake is a vast repository of structured and unstructured data. It handles all types of data, and there is no hierarchy or organization to the storage.

Why It Matters for Marketers:

There are benefits and drawbacks to each type of data structure, and marketers should have a say in how data gets managed throughout the organization. For example, with a data lake, you will need to have a data scientist or other technical resource on staff to help make sense of all the data, but your marketing team can be more self-sufficient with a database or data warehouse.

Database

Benefits:

Without organization and structure, the insights your data holds can be unreliable and hard to find. Pulling data from various source systems is often time-consuming and requires tedious and error-prone reformatting of the data in order to tell a story or answer a question. A database can help to store data from multiple sources in an organized central location.

Real-World Example:

Without databases, your team would have to use multiple Excel sheets and manual manipulation to store the data needed for analysis. This means your team would have to manually match up or copy/paste each Excel sheet’s data in order to create one place to analyze all of your data.

Data Warehouse

Benefits:

A data warehouse delivers an extra layer of organization across all databases throughout your business. Your CRM, sales platform, and social media data differ in format and complexity but often contain data about similar subjects. A data warehouse brings together all of those varying formats into a standardized and holistic view structured to optimize reporting. When that data is consolidated from across your organization, you can obtain a complete view of your customers, their spending habits, and their motivations.

Real-World Example:

You might hear people say “enterprise data warehouse” or “EDW” when they talk about data. This is a way to structure data that makes answering questions via reports quick and easy. More importantly, EDWs often contain information from the entire company, not just your function or department. Not only can you answer questions about your customer or marketing-specific topics, but you can understand other concepts such as the inventory flow of your products. With that knowledge, you can determine, for example, how inventory delays are correlated to longer shipping times, which often result in customer churn.

Data Lake

Benefits:

A data lake is a great option for organizations that need more flexibility with their data. The ability for a data lake to hold all data—structured, semi-structured, or unstructured—makes it a good choice when you want the agility to configure and reconfigure models and queries as needed. Access to all the raw data also makes it easier for data scientists to manipulate the data.

Real-World Example:

You want to get real-time reports from each step of your SMS marketing campaign. Using a data lake enables you to perform real-time analytics on the number of messages sent, the number of messages opened, how many people replied, and more. Additionally, you can save the content of the messages for later analysis, delivering a more robust view of your customer and enabling you to increase personalization of future campaigns.

So, how do you choose?

You might not have to pick just one solution. In fact, it might make sense to use a combination of these systems. Remember, the most important thing is that you’re thinking about your marketing data, how you want to use it, what makes sense for your business, and the best way to achieve your results.

Hopefully this information has helped you better understand your options for data ingestion and storage. Feel free to contact us with any questions or to learn more about data ingestion and storage options for your marketing data.


How Snowflake Fits Into Your Existing Azure Environment

To get your source data ingested and loaded, or for a deep dive into how you can build a fully automated data integration process in Snowflake on Azure, schedule a Snowflake whiteboarding session with our team of data architects.

If you have a Microsoft ecosystem but have been wanting to take advantage of more tools on the market, Snowflake on Azure gives you an opportunity to upgrade your analytics platform without throwing away your investment in your current environment or sacrificing its uniformity.

For those who are not familiar, Snowflake is a cloud-based, massively parallel processing (MPP), columnar storage database. It’s a newer option for data warehousing that is set up for more efficient querying of large data volumes. It also consumes structured and semi-structured data in a way that traditional relational databases are not designed to do as effectively – think “big data” without the “big overhead.”

With this release, Microsoft companies can evaluate using Snowflake to increase the performance and flexibility of their analytics environment by adding it on top of their existing data integration process. To determine where Snowflake might be a good fit, our 2nd Watch consultants took a dive into where Snowflake could sit in our current and prospective Microsoft clients’ ecosystems.

Where does Snowflake fit in your current Azure environment?

Snowflake is best used as the home of an analytical layer (or dimensional model, for the more technical) that enables reporting. Think of this as the substitute for products like SQL Server Analysis Services (SSAS).

While we still recommend that you maintain a data integration hub and related process for all of your consuming application needs (i.e., sending consolidated and cleansed data back to each source system to keep them in sync), Snowflake can sit right at the end of that process. Because it’s optimized for read activities, it complements the needs of business intelligence tools like Power BI, Looker, Tableau, etc., making it faster for business users to grab the information they need via those tools. Integration and ETL are possible with many tools and services, including Azure Data Factory.

Example architecture for adding Snowflake to the end of your data integration process 

When would you want to use Snowflake?

There are two primary ideas behind Snowflake’s competitive advantage when it comes to data warehousing platforms: its automatic optimization of query execution and the hands-off nature of its maintenance.

We recommend Snowflake for two main use cases:

  1. Developing a more efficient analytics platform
  2. Creating a platform for flexible data discovery

Our clients that fit within the above use cases usually had:

  • Multiple business users – The more people you have querying your database at one time, the more the database has to be configured to handle that load so as not to lock up other processes. Traditional databases can be scaled up to handle larger reads (think of a query that produces data), but this takes a decent amount of time and effort to achieve, and they are more often optimized for writes (think of loading data into a table). In Snowflake, a user can spin up resources for just one query, then spin them back down right after, allowing for a more modular use of higher-power resources (see the sketch after this list).
  • Lots of data – If you’ve ever tried to perform a huge query on your current system, you likely noticed a slowdown from your usual processing. Traditional databases are not as optimized for read activities as columnar databases are. This makes options like Snowflake more attractive to those performing heavier analytical queries on a regular basis.

“Database systems have traditionally optimized performance for write-intensive workloads. Recently, there has been renewed interest in architectures that optimize read performance by using column-oriented data representation and light-weight compression. This previous work has shown that under certain broad classes of workloads, column-based systems can outperform row-based systems.” – MIT Computer Science and Artificial Intelligence Laboratory

  • A mix of structured and semi-structured data – Though many traditional databases offer options for consuming semi-structured data (e.g., JSON, XML, etc.), they aren’t optimized to do so. If you have a mix of structured and semi-structured data, options like Snowflake might be more efficient for your process.
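To illustrate the “spin up, spin down” point from the list above, here is a minimal Snowflake sketch; the warehouse name, size, and query are hypothetical:

```sql
-- Create a dedicated warehouse for one heavy analytical query...
CREATE WAREHOUSE IF NOT EXISTS adhoc_wh
  WAREHOUSE_SIZE = 'XLARGE'
  AUTO_SUSPEND   = 60      -- pause itself after 60 idle seconds
  AUTO_RESUME    = TRUE;

USE WAREHOUSE adhoc_wh;

SELECT region, SUM(order_total) AS revenue
FROM warehouse.orders
GROUP BY region;

-- ...and spin it back down immediately (auto-suspend would also catch this).
ALTER WAREHOUSE adhoc_wh SUSPEND;
```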

What’s the tradeoff?

Snowflake can be a great option for clients who need a true analytical platform where they can perform data discovery or need to read out a lot of data at once. That said, the following are situations where an alternative to Snowflake might make more sense:

  • You need a data hub to keep your various applications in sync (application integration and API endpoints).
  • You aren’t trying to perform complex aggregations or data discovery.
  • You’re invested in an existing solution that doesn’t have many performance or management overhead concerns.

So what’s the gist?

Snowflake makes it easy to kick off a data warehouse project with its ease of use and start-small, scale-big approach. Even with this innovative architecture, we still recommend applying the same data warehousing fundamentals that you would have in the past.

Yes, Snowflake will allow you to create views on top of messy raw source data, but you will ultimately lose the battle on performance as your data grows in complexity. We suggest approaching a new Snowflake deployment with a vision of your data platform as a whole – not just the analytics layer. Performing all your complex business logic in Snowflake is possible, but due to the columnar architecture, integration use cases are better served in complementary cloud database platforms, such as Azure SQL.

With a roadmap and strategy in place for both your integration and analytics needs, you’ll be better able to select the right mix of tools and technologies for your solution. Learn more about Snowflake deployment best practices in our eBook.


A CTO’s Guide to a Modern Data Platform: How to Build a Data Warehouse in 6-8 Weeks

It’s true that in a very short period of time, you can get an operational data warehouse loaded with all your source data. Wondering how to build a data warehouse that yields these results? Snowflake and its technology partner ecosystem, including partners like Fivetran, allow you to replicate your databases and SaaS applications to Snowflake and see results in a significantly shorter timeframe than we’ve seen before. Write some SQL views in Snowflake against this data, point your favorite BI tool at it, and get lightning-fast results.

With the right plan in place, you can (aggressively) deliver that first business “wow” in six to eight weeks. 2nd Watch typically recommends at least two analytical virtual warehouses in enterprise deployments: one for data discovery and one for more structured and governed reporting.

The practice of having both a data discovery layer and a governed layer for your Snowflake deployment not only saves you time in the initial build, but it creates a continuous integration and deployment pattern. It finally makes a dent in the notion that a data warehouse cannot be “agile.”

Future State Snowflake Data Architecture

With this approach, you not only achieve governance and speed-to-insight, but you’ve also cut down your Snowflake consumption costs. Running complex queries at run-time, every time, can get expensive. Snowflake’s caching can help here, but if you’re constantly running a complex join across more than 20 tables, that physical layer might help you move from an XL-sized Snowflake warehouse to an L or M. In the long run, those cost savings will add up. When “best practice” or risk avoidance isn’t enough to justify this approach, the dollar savings might speak for themselves.
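One way to build that physical layer is simply to persist the expensive join as a table on a schedule rather than computing it at every dashboard load. A minimal sketch, with hypothetical object names:

```sql
-- Persist the expensive join once (e.g., nightly) instead of at every query.
CREATE OR REPLACE TABLE reporting.customer_360 AS
SELECT
    c.customer_id,
    c.segment,
    s.lifetime_revenue,
    t.last_touchpoint
FROM conformed.customers     c
JOIN conformed.sales_summary s ON s.customer_id = c.customer_id
JOIN conformed.touchpoints   t ON t.customer_id = c.customer_id;
-- ...plus however many additional joins the logical views currently perform.
```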

The initial six- to eight-week path assumes a focus on the data discovery layer, as depicted below. Loading all your data into a data discovery layer should be the primary development activity in your Snowflake proof of concept (POC) or pilot. Here are some tips:

  • Find several source systems that have the data your stakeholders need.
  • Begin the process of rapidly loading into your Snowflake data discovery layer.
  • Write iterative SQL in Snowflake views to build your business logic (see the sketch after this list).
  • Connect your BI tool to Snowflake and build a very simple dashboard.
  • Get feedback from the business, with the new dashboard as a facilitation tool.
  • Repeat this process – more feedback is better.
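As a sketch of the “iterative SQL views” step above, a first-pass discovery-layer view might look like this (source schemas and columns are hypothetical):

```sql
-- Discovery-layer view: raw tables in, business logic expressed as iterative SQL.
CREATE OR REPLACE VIEW discovery.monthly_revenue AS
SELECT
    DATE_TRUNC('month', o.order_date) AS order_month,
    r.region_name,
    SUM(o.order_total)                AS revenue
FROM raw_erp.orders  o
JOIN raw_erp.regions r ON r.region_id = o.region_id
GROUP BY 1, 2;
```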

You’ve not only increased the speed to market, but you’ve also enabled developers and business users to execute an analytics project in a completely new way.

Highlighted: Speed-to-insight approach, with transformations in Snowflake

This all comes with a caveat: Yes, you can write all your logic in Snowflake and do everything you need from an analytics standpoint, but it will limit you in the long run. Every data project has sacrifices to be made, whether it’s time, features, cost, or flexibility. You need to balance these with a long-term vision for how your data will be used across the organization. Snowflake will get you part of the way there, but a good plan and strategy will take you even further.

In the meantime, if you have any questions or want to learn more about how to build a data warehouse using Snowflake, contact us today.


The Business Leader’s Guide to ETL

As a business leader, you want to get to impactful analytics ASAP. You want faster reporting, easy-to-use dashboards, and quick and accurate answers to your questions. Whether you’re the CFO, head of operations, or marketing lead, you want to use your data to make better, more insightful business decisions.

However, you can’t implement analytics without a solid data strategy. A common component of implementing a data management architecture is ETL, or “extract, transform, load” logic. We understand that your CEO wants to hear, “We’ve implemented robust analytics,” not “We’re working on ETL.” ETL doesn’t sound exciting. But to get you where you want to go, ETL should be a part of your data story.

At 2nd Watch, we aim to help you feel confident through all phases of your data and analytics journey, so today we’re answering “What is ETL?”, including where ETL fits into data management and why it may be an essential precursor to your organization’s analytics goals.

What is ETL?

ETL involves collecting the data that is most relevant to the key performance indicators you are trying to analyze – which could be all of the data across your organization or just pieces of the data stored in various source systems – centralizing, transforming, and modeling it, likely in a data warehouse.

As previously mentioned, ETL stands for “extract, transform, load.” Let’s break that down:

Extract

Data is gathered from the relevant data sources (e.g., ERP, CRM, application databases, SaaS applications, flat files, and/or Excel spreadsheets). This could be financials, customer details, and internal HR information. Or maybe your marketing department, in particular, needs an ETL solution; they may store customer data across multiple systems, such as your enterprise CRM, email marketing software, and a digital advertising provider.

Extraction can be done manually, or it can be automated using orchestration workflows and ETL tooling that eliminate the manual process, making it much faster and more reliable to analyze your business data.

Transform

The data transformation stage is really “where the magic happens” as this is where you make your data usable. Data is cleaned, deduped, and transformed to align with how the business wants to analyze the data in your data warehouse, ensuring data quality and ease of use. This includes reformatting for uniformity, such as restructuring IL, Il, Ill., and Illinois to all read “Illinois.” Data is also cleansed of inconsistencies and missing values, duplicate values are removed, and data is prepared for the business to easily interact with it.
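As a sketch of a single transformation rule like the one above, applied in SQL against a hypothetical staging table:

```sql
-- Standardize the many spellings of a state to a single canonical value.
UPDATE staging.customers
SET state = 'Illinois'
WHERE UPPER(TRIM(state)) IN ('IL', 'ILL', 'ILL.', 'ILLINOIS');
```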

Most of the time spent on the ETL process will be used to come up with the right calculations, determine data quality rules, and model the data for analytics, all key pieces of data transformation. This stage is only becoming more crucial as organizations are seeing growing amounts of disparate data from many different sources.

Load

In the last stage of the ETL process, the transformed data is loaded into your data warehouse, so all data is easily accessible for future use in analytical dashboards and reporting. Depending on your business needs, this process can be done in batch or in near real time, allowing you to access and analyze data shortly after it appears in your source systems.

Ultimately, ETL helps your data and modern business intelligence tools such as Looker, Qlik, Power BI, and/or Tableau work in harmony, giving you richer and deeper analytical capabilities.

What is ELT?

You may be wondering why this section has the same header as the previous section. If you look closely, you’ll notice two different acronyms: ETL and ELT. If you guessed that ELT stands for “extract, load, transform,” you would be correct. Why might we switch the order of the process?

ELT is most useful when speed is the primary concern. Data is loaded more quickly, and you can decide which data to transform at which point. This lets you access the exact data you need more quickly, though you will have to wait for additional data transformation if more of your data becomes necessary.

Of course, this approach can come with security concerns as large amounts of data are loaded into a system prior to being transformed. Additionally, ELT can impede analytics based on large data sets if your system isn’t capable of transforming the required data quickly enough.

An ETL/ELT tool, or a custom ETL solution built with multiple technologies, can displace some of the concerns you may have about either method as they’re built precisely for this purpose.

What ETL tools (and ELT tools) should you know about?

There are two ways to approach ETL: building a custom ETL solution (which 2nd Watch can help with) or using a prepackaged ETL tool. The decision to buy or build your ETL tool depends on multiple factors, but a few tools stand out if you decide to go the “buy” route.

Fivetran

Fivetran is cloud-based and offers more than 130 ready-to-use data sources, plus the ability to add custom integrations. In our experience, the Fivetran ETL process can take less than five minutes (but keep in mind, it’s streamlined). Fivetran deals primarily with data integration, not so much transformation. Fivetran’s pricing plan is consumption-based.

Matillion

Matillion is a cloud ETL platform with more than 100 integrated data sources. Pricing is based on the time the program is turned on, not based on usage.

Qlik Replicate

Formerly known as Attunity Replicate, Qlik Replicate works on-prem and in the cloud, supporting all major data sources. Pricing is based on the number of cores on your database source servers.

Talend

Talend is compatible with both on-prem and cloud data sources, and it includes hundreds of pre-built source integrations. There’s an open-source version, as well as a paid platform more suited to larger organizations that would benefit from additional tools and features.

Not sure where to start after all of this information? Set up a complimentary data strategy whiteboarding session with 2nd Watch. We’ll walk through the current state of your data and your data and analytics goals to determine the best path to the insights you need.


A High-Level Overview of Snowflake

Using a modern data warehouse, like Snowflake, can give your organization improved access to your data and dramatically improved analytics. When paired with a BI tool, like Tableau, or a data science platform, like Dataiku, you can gain even faster access to impactful insights that help your organization fuel innovation and drive business decisions.

In this post, we’ll provide a high-level overview of Snowflake, including a description of the tool, why you should use it, pros and cons, and complementary tools and technologies.

Overview of Snowflake

Snowflake was built from the ground up for the cloud, initially starting on AWS and scaling to Azure and GCP. With no servers to manage and near-unlimited scale in compute, Snowflake separates compute from storage and charges based on the size and length of time that compute clusters (known as “virtual warehouses”) are running queries.

Value Prop:

  • Cross-cloud availability lets organizations choose which cloud provider to use
  • Dynamic compute scaling saves on cost
  • Micro-partitioned storage with automatic maintenance

Scalability:

  • Rapid auto-scaling of compute nodes allows for increased cost savings and high concurrency on demand, and compute and storage are separated

Performance:

  • Built for MPP (massive parallel processing)
  • Optimized for read via a columnar backend
  • Dedicated compute means no concurrency issues

Features:

  • Ability to assign dedicated compute
  • High visibility into spend
  • Native support for JSON, XML, Avro, Parquet, and ORC semi-structured data formats
  • SnowSQL has slight syntax differences
  • Introduction of Snowpark for Snowflake native development

Security:

  • Full visibility into queries executed, by whom, and how long they ran
  • Precision point-in-time restore available via “time-travel” feature
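For example, the time-travel feature can be exercised with standard Snowflake syntax like the following (table names hypothetical):

```sql
-- Query the table as it looked one hour ago...
SELECT * FROM warehouse.orders AT (OFFSET => -3600);

-- ...or materialize that earlier state with a zero-copy clone.
CREATE TABLE warehouse.orders_restored
  CLONE warehouse.orders AT (OFFSET => -3600);
```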

Why Use Snowflake

Because Snowflake is decoupled from any single cloud vendor, it allows a true multi-cloud experience. You can deploy on Azure, AWS, GCP, or any combination of those cloud services. With near-unlimited scale and minimal management, it offers a best-in-class data platform with a pay-for-what-you-use consumption model.

Pros of Snowflake

  • Allows for a multi-cloud experience built on top of existing AWS, Azure, or GCP resources, depending on your preferred platform
  • Highly-performant queries utilizing uniquely provisioned pay-as-you-go compute and automatically derived partitioning
  • Easy implementation of security and role definitions for a less frustrating user experience and easier delineation of cost while keeping data secure
  • Integrated ability to share data to partners or other consumers outside of an organization and supplement data with publicly available datasets within Snowflake

Cons of Snowflake

  • Ecosystem of tooling continues to grow as adoption expands, but some features are not readily available
  • Due to the paradigm shift in a cloud-born architecture, taking full advantage of Snowflake’s advanced features requires a good understanding of cloud data architecture

Select Complementary Tools and Technologies for Snowflake

  • Apache Kafka
  • AWS Lambda
  • Azure Data Factory
  • Dataiku
  • Power BI
  • Tableau

We hope you found this high-level overview of Snowflake helpful. If you’re interested in learning more about Snowflake or other modern data warehouse tools like Amazon Redshift, Azure Synapse, and Google BigQuery, contact us to learn more.

The content of this blog is an excerpt of our Modern Data Warehouse Comparison Guide. Click here to download a copy of that guide.