The modern business world is data-centric. As more businesses turn to cloud computing, they must evaluate and choose the right data warehouse to support their digital modernization efforts and business outcomes. Data warehouses can increase the bottom line, improve analytics, enhance the customer experience, and optimize decision-making.
A data warehouse is a large repository of data businesses utilize for deep analytical insights and business intelligence. This data is collected from multiple data sources. A high-performing data warehouse can collect data from different operational databases and apply a uniform format for better analysis and quicker insights.
Snowflake is a cloud-based data warehousing solution that uses third-party cloud-compute resources, such as Azure, Google Cloud Platform, or Amazon Web Services (AWS.) It is designed to provide users with a fully managed, cloud-native database solution that can scale up or down as needed for different workloads. Snowflake separates compute from storage: a non-traditional approach to data warehousing. With this method, data remains in a central repository while compute instances are managed, sized, and scaled independently.
Snowflake is a good choice for companies that are conscious about their operational overhead and need to quickly deploy applications into production without worrying about managing hardware or software. It is also the ideal platform to use when query loads are lighter, and the workload requires frequent scaling.
The benefits of Snowflake include:
Easy integration with most components of data ecosystems
Minimal operational overhead: companies are not responsible for installing, configuring, or managing the underlying warehouse platform
Simple setup and use
Abstracted configuration for storage and compute instances
Robust and intuitive SQL interface
What is Amazon Redshift?
Amazon Redshift is an enterprise data warehouse built on Amazon Web Services (AWS). It provides organizations with a scalable, secure, and cost-effective way to store and analyze large amounts of data in the cloud. Its cloud-based compute nodes enable businesses to perform large-scale data analysis and storage.
Amazon Redshift is ideal for enterprises that require quick query outputs on large data sets. Additionally, Redshift has several options for efficiently managing its clusters using AWS CLI/Amazon Redshift Console, Amazon Redshift Query API, and AWS Software Development Kit. Redshift is a great solution for companies already using AWS services and running applications with a high query load.
The benefits of Amazon Redshift include:
Seamless integration with the AWS ecosystem
Multiple data output formatting support
Easy console to extract analytics and run queries
Customizable data and security models
Comparing Data Warehouse Solutions
Snowflake and Amazon Redshift both offer impressive performance capabilities, like scalability across multiple servers and high availability with minimal downtime. There are some differences between the two that will determine which one is the best fit for your business.
Both data warehouse solutions harness massively parallel processing (MPP) and columnar storage, which enables advanced analytics and efficiency on massive jobs. Snowflake boasts a unique architecture that supports structured and semi-structured data. Storage, computing, and cloud services are abstracted to optimize independent performance. Redshift recently unveiled concurrency scaling coupled with machine learning to compete with Snowflake’s concurrency scaling.
Snowflake is a pure SaaS platform that doesn’t require any maintenance. All software and hardware maintenance is handled by Snowflake. Amazon Redshift’s clusters require manual maintenance from the user.
Data and Security Customization
Snowflake supports fewer customization choices in data and security. Snowflake’s security utilizes always-on encryption enforcing strict security checks. Redshift supports data flexibility via partitioning and distribution. Additionally, Redshift allows you to tailor its end-to-end encryption and set up your own identity management system to manage user authentication and authorization.
Both platforms offer on-demand pricing but are packaged differently. Snowflake doesn’t bundle usage and storage in its pricing structure and treats them as separate entities. Redshift bundles the two in its pricing. Snowflake tiers its pricing based on what features you need. Your company can select a tier that best fits your feature needs. Redshift rewards businesses with discounts when they commit to longer-term contracts.
Which data warehouse is best for my business?
To determine the best fit for your business, ask yourself the following questions in these specific areas:
Do I want to bundle my features? Snowflake splits compute and storage, and its tiered pricing provides more flexibility to your business to purchase only the features you require. Redshift bundles compute and storage to unlock the immediate potential to scale for enterprise data warehouses.
Do I want a customizable security model? Snowflake grants security and compliance options geared toward each tier, so your company’s level of protection is relevant to your data strategy. Redshift provides fully customizable encryption solutions, so you can build a highly tailored security model.
Do I need JSON storage? Snowflake’s JSON storage support wins over Redshift’s support. With Snowflake, you can store and query JSON with native functions. With Redshift, JSON is split into strings, making it difficult to query and work with.
Do I need more automation? Snowflake automates issues like data vacuuming and compression. Redshift requires hands-on maintenance for these sorts of tasks.
A data warehouse is necessary to stay competitive in the modern business world. The two major data warehouse players – Snowflake and Amazon Redshift – are both best-in-class solutions. One product is not superior to the other, so choosing the right one for your business means identifying the one best for your data strategy.
2nd Watch is an AWS Certified Partner and an Elite Snowflake Consulting Partner. We can help you choose the right data warehouse solution for you and support your business regardless of which data warehouse your choose.
We have been recognized by AWS as a Premier Partner since 2012, as well as an audited and approved Managed Service Provider and Data and Analytics Competency partner for our outstanding customer experiences, depth and breadth of our products and services, and our ability to scale to meet customer demand. Our engineers and architects are 100% certified on AWS, holding more than 200 AWS certifications.
Our full team of certified SnowPros has proven expertise to help businesses implement modern data solutions using Snowflake. From creating a simple proof of concept to developing an enterprise data warehouse to customized Snowflake training programs, 2nd Watch will help you to utilize Snowflake’s powerful cloud-based data warehouse for all of your data needs.
In our previous blog post on how to build a data warehouse in 6-8 weeks, we showed you how to get lightning-fast results and effectively create a working data warehouse with Snowflake. Future state integrations and governance needs are coming, though. This is why 2nd Watch highly recommends executing a data strategy and governance project in parallel with your Snowflake proof-of-concept. Knowing how to leverage Snowflake’s strengths to avoid common pitfalls will save you time, money, and re-work.
Consider one company that spent a year using the data discovery layer-only approach. With data sources all centralized in the data warehouse and all transformations occurring at run-time in the BI tool, the data team was able to deliver a full analytical platform to its users in less time than ever before. Users were happy, at first, until the logic became more mature and more complex and ultimately required more compute power (translating to higher cost) to keep the same performance expectations. For some, however, this might not be a problem but an expected outcome.
For this company, enabling analytics and reporting was the only need for the first year, but integration of data across applications was coming full steam ahead. The primary line of business applications needed to get near-real-time updates from the others. For example, marketing automation didn’t rely 100% on humans; it needed data to execute its rules, from creating ad campaigns to sending email blasts based on events occurring in other systems.
This one use case poked a big hole in the architecture – you can’t just have a data warehouse in your enterprise data platform. There’s more to it. Even if it’s years away, you need to effectively plan for it or you’ll end up in a similar, costly scenario. That starts with data strategy and governance.
ETL vs. ELT in Snowflake
Identify where your transformations occur and how they impact your downstream systems.
The new paradigm is that you no longer need ETL (Extract, Transform, Load) – you need ELT (Extract, Load, Transform). This is true, but sometimes misleading. Some will interpret ELT as no longer needing to build and manage the expensive pipelines and business logic that delay speed-to-insight, are costly to maintain, and require constant upkeep for changing business rules. In effect, it’s interpreted as removing the “T” and letting Snowflake solve for this. Unfortunately, someone has to write the code and business logic, and it’s best to not have your business users trying to do this when they’re better served working on your organization’s core goals.
In reality, you are not removing the “T” – you are moving it to a highly scalable and performant database after the data has been loaded. This is still going to require someone to understand how your customer data in Salesforce ties to a customer in Google Analytics that corresponds to a sale in your ERP. You still need someone who knows both the data structures and the business rules. Unfortunately, the “T” will always need a place to go – you just need to find the right place.
Ensure your business logic is defined only once in the entire flow. If you’ve written complex transformation code to define what “customer” means, when that business logic inevitably changes, you’ll be guaranteed that this definition of “customer” will flow the same way to your BI users as it does to your ERP and CRM. When data science and machine learning enter the mix, you’ll also avoid time spent in data prep and instead focus on delivering predictive insights.
You might be thinking that this all sounds even more similar to the data warehouse you’ve already built and are trying to replace. There’s some good news: Snowflake does make this easier, and ELT is still exactly the right approach.
Defining and Adjusting the Business Logic and Views
Snowflake enables an iterative process of data discovery, proof-of-concept, business value, and long-term implementation.
Perhaps you’ve defined a sales hierarchy and a salesperson compensation metric. The developer can take that logic, put it into SQL against the raw data, and refresh the dashboard, all while the business user is sitting next to them. Is the metric not quite what the user expected, or is the hierarchy missing something they hadn’t thought of in advance? Tweak the SQL in Snowflake and refresh. Iterate like this until the user is happy and signs off, excited to start using the new dashboard in their daily routine.
By confirming the business logic in the salesperson compensation example above, you’ve removed a major part of what made ETL so painful in the past: developing, waiting for a load to finish, and showing business users. That gap between load finishing and the next development cycle is a considerable amount of lost time and money. With this approach, however, you’ve confirmed the business logic is correct and you have the SQL already written in Snowflake’s data discovery views.
Developing your initial logic in views in Snowflake’s data discovery layer allows you to validate and “certify” it for implementation into the physical model. When you’ve completed the physical path, you can change the BI tool for each completed subject area to point to the physical layer instead of the data discovery layer.
If you have any questions about data strategy and governance, or if you want to learn more about how Snowflake can fit into your organization, contact us today.
This blog originally appeared as a section of our eBook, “Snowflake Deployment Best Practices: A CTO’s Guide to a Modern Data Platform.” Click here to download the full eBook.
Snowflake is a powerhouse in the data world, but it can become even more powerful when paired with other technologies that suit your organization’s needs. 2nd Watch is a tech-agnostic firm, meaning we only recommend tools we truly believe are the best for your business. Because Snowflake partners with such a wide range of technologies, we have lots of options to create a tech stack uniquely suited to you.
Among the many tools that work in conjunction with Snowflake, the following Snowflake-adjacent technologies enhance Snowflake in different ways and are all noteworthy in their own right.
HVR (a Fivetran company)
HVR, which became part of Fivetran as of late 2021, is a cloud data replication tool that natively supports Snowflake. It executes real-time data replication by reading directly from database transaction logs, which allows for high performance and low impact on database servers.
After connecting your organization with the right MDM tool for your situation, like Riversand, 2nd Watch would oversee the MDM implementation and provide guidance on next steps. This includes sharing insights on best-in-class data warehouses, like Snowflake’s Data Cloud. Together, Riversand and Snowflake can prepare you for downstream analytics and long-term data governance.
Sisu accelerates and expands your analytics capabilities. Using Sisu, business users are able to self-serve and find the answers to increasingly complex questions.
Sisu connects directly to Snowflake’s Data Cloud, so you always have access to your most up-to-date data for comprehensive, actionable analytics. Sisu and Snowflake partner to empower you to make quick, decisive judgments.
These four technologies are just a taste of the many tools that work with Snowflake to improve your organization’s data and analytics capabilities. To discuss your specific data and analytics needs – and to determine the tools Snowflake partners with that will help you meet those needs – contact 2nd Watch today.
To get your source data ingested and loaded, or for a deep dive into how you can build a fully automated data integration process in Snowflake on Azure, schedule a Snowflake whiteboarding session with our team of data architects.
If you have a Microsoft ecosystem but have been wanting to take advantage of more tools on the market, Snowflake on Azure means additional opportunities for you to upgrade your analytics platform while not throwing away the investment in and keeping the uniformity of your current environment.
For those who are not familiar, Snowflake is a cloud-based, massively parallel processing (MPP), columnar storage database. It’s a newer option for data warehousing that is set up for more efficient querying of large data volumes. It also consumes structured and semi-structured data in a way that traditional relational databases are not designed to do as effectively – think “big data” without the “big overhead.”
With this release, Microsoft companies can evaluate using Snowflake to increase the performance and flexibility of their analytics environment by adding it on top of their existing data integration process. To determine where Snowflake might be a good fit, our 2nd Watch consultants took a dive into where Snowflake could sit in our current and prospective Microsoft clients’ ecosystems.
Where does Snowflake fit in your current Azure environment?
Snowflake is best used as the home of an analytical layer (or dimensional model, for the more technical) that enables reporting. Think of this as the substitute for products like SQL Server Analysis Services (SSAS).
While we still recommend that you maintain a data integration hub and related process for all of your consuming application needs (i.e., sending consolidated and cleansed data back to each source system to keep them in sync), Snowflake can sit right at the end of that process. Because it’s optimized for read activities, it complements the needs of business intelligence tools like Power BI, Looker, Tableau, etc., making it faster for business users to grab the information they need via those tools. Integration and ETL are possible with many tools and services, including Azure Data Factory.
When would you want to use Snowflake?
There are two primary ideas behind Snowflake’s competitive advantage when it comes to data warehousing platforms: its automatic optimization of query execution and the hands-off nature of its maintenance.
We recommend Snowflake for two main use cases:
Developing a more efficient analytics platform
Creating a platform for flexible data discovery
Our clients that fit within the above use cases usually had:
Multiple business users – The more people you have querying your database at one time, the more the database has to be configured to handle that load so as to not lock up other processes. Traditional databases can be scaled up to handle larger reads (think of a query that produces data), but this takes a decent amount of time and effort to achieve, and they are more often optimized for writes (think of loading data into a table). In Snowflake, a user can spin up resources for just the one query, then spin it back down right after. This allows for a more modular use of higher power resources.
Lots of data – If you’ve ever tried to perform a huge query on your current system, you likely noticed a slowdown from your usual processing. Traditional databases are not as optimized for read activities as columnar databases are. This makes options like Snowflake more attractive to those performing heavier analytical queries on a regular basis.
“Database systems have traditionally optimized performance for write-intensive workloads. Recently, there has been renewed interest in architectures that optimize read performance by using column-oriented data representation and light-weight compression. This previous work has shown that under certain broad classes of workloads, column-based systems can outperform row-based systems. ” – MIT Computer Science and Artificial Intelligence Laboratory
A mix of structured and semi-structured data – Though many traditional databases offer options for consuming semi-structured data (e.g., JSON, XML, etc.), they aren’t optimized to do so. If you have a mix of structured and semi-structured data, options like Snowflake might be more efficient for your process.
What’s the tradeoff?
Snowflake can be a great option for clients who need a true analytical platform where they can perform data discovery or need to read out a lot of data at once. That said, the following are situations where an alternative to Snowflake might make more sense:
You need a data hub to keep your various applications in sync (application integration and API endpoints).
You aren’t trying to perform complex aggregations or data discovery.
You’re invested in an existing solution that doesn’t have many performance or management overhead concerns.
So what’s the gist?
Snowflake makes it easy to kick off a data warehouse project with its ease of use and start-small, scale-big approach. Despite this innovative architecture, we still recommend applying the same data warehousing fundamentals that you would have in the past.
Yes, Snowflake will allow you to create views on top of messy raw source data, but you will ultimately lose the battle on performance as your data grows in complexity. We suggest approaching a new Snowflake deployment with a vision of your data platform as a whole – not just the analytics layer. Performing all your complex business logic in Snowflake is possible, but due to the columnar architecture, integration use cases are better served in complementary cloud database platforms, such as Azure SQL.
With a roadmap and strategy in place for both your integration and analytics needs, you’ll be better able to select the right mix of tools and technologies for your solution. Learn more about Snowflake deployment best practices in our eBook.
It’s true that in a very short period of time, you can get an operational data warehouse loaded with all your source data. Wondering how to build a data warehouse that yields these results? Snowflake and its technology partner ecosystem, such as partners like Fivetran, allow you to take your databases and SaaS applications, replicate them to Snowflake, and see results in a significantly shorter timeframe than we’ve seen before. Write some SQL views in Snowflake against this data, point your favorite BI tool at it, and get lightning-fast results.
With the right plan in place, you can (aggressively) deliver that first business “wow” in six to eight weeks. 2nd Watch typically recommends at least two analytical virtual warehouses in enterprise deployments: one for data discovery and one for more structured and governed reporting.
The practice of having both a data discovery layer and a governed layer for your Snowflake deployment not only saves you time in the initial build, but it creates a continuous integration and deployment pattern. It finally makes a dent in the notion that a data warehouse cannot be “agile.”
Future State Snowflake Data Architecture
Future State Snowflake Data Architecture
With this approach, you not only achieve governance and speed-to-insight, but you’ve also cut down your Snowflake consumption costs. Running complex queries at run-time, every time, can get expensive. Snowflake’s caching can help here, but if you’re constantly running a complex join across over 20 tables, that physical layer might help you move from an XL-sized Snowflake warehouse to a L or M. In the long run, those cost savings will add up. When “best practice” or risk avoidance isn’t enough to justify this approach, it’s possible the dollar savings might speak for themselves.
The initial six- to eight-week path assumes a focus on the data discovery layer, as depicted below. Loading all your data into a data discovery layer should be the primary development activity in your Snowflake proof of concept (POC) or pilot. Here are some tips:
Find several source systems that have the data your stakeholders need.
Begin the process of rapidly loading into your Snowflake data discovery layer.
Write iterative SQL in Snowflake views to build your business logic.
Connect your BI tool to Snowflake and build a very simple dashboard.
Get feedback from business, with the new dashboard as a facilitation tool.
Repeat this process – more feedback is better.
You’ve not only increased the speed to market, but you’ve also enabled developers and business users to execute an analytics project in a completely new way.
Highlighted: Speed-to-insight approach, with transformations in Snowflake
This all comes with a caveat: Yes, you can write all your logic in Snowflake and do everything you need from an analytics standpoint, but it will limit you in the long run. Every data project has sacrifices to be made, whether it’s time, features, cost, or flexibility. You need to balance these with a long-term vision for how your data will be used across the organization. Snowflake will get you part of the way there, but a good plan and strategy will take you even further.
In the meantime, if you have any questions or want to learn more about how to build a data warehouse using Snowflake, contact us today.
Data governance is a broad-ranging discipline that affects everyone in an organization, whether directly or indirectly. It is most often employed to improve and consistently manage data through deduplication and standardization, among other activities, and can have a significant and sustained effect on reducing operational costs, increasing sales, or both.
Data governance can also be part of a more extensive master data management (MDM) program. The MDM program an organization chooses and how they implement it depends on the issues they face and both their short- and long-term visions.
For example, in the insurance industry, many companies sell various types of insurance policies renewing annually over a number of years, such as industrial property coverages and workers’ compensation casualty coverages. Two sets of underwriters will more than likely underwrite the business. Having two sets of underwriters using data systems specific to their lines of business is an advantage when meeting the coverage needs of their customers but often becomes a disadvantage when considering all of the data — but it doesn’t have to be.
The disadvantage arises when an agent or account executive needs to know the overall status of a client, including long-term profitability during all the years of coverage. This involves pulling data from policy systems, claims systems, and customer support systems. An analyst may be tasked with producing a client report for the agent or account executive to truly understand their client and make better decisions on both the client and company’s behalf. But the analyst may not know where the data is stored, who owns the data, or how to link clients across disparate systems.
Fifteen years ago, this task was very time-consuming and even five years ago was still quite cumbersome. Today, however, this issue can be mitigated with the correct data governance plan. We will go deeper into data governance and MDM in upcoming posts; but for this one, we want to show you how innovators like Snowflake are helping the cause.
What is data governance?
Data governance ensures that data is consistent, accurate, and reliable, which allows for informed and effective decision-making. This can be achieved by centralizing the data into one location from few or many siloed locations. Ensuring that data is accessible in one location enables data users to understand and analyze the data to make effective decisions. One way to accomplish this centralization of data is to implement the Snowflake Data Cloud.
Snowflake not only enables a company to store their data inexpensively and query the data for analytics, but it can foster data governance. Dynamic data masking and object tagging are two new features from Snowflake that can supplement a company’s data governance initiative.
What is dynamic data masking?
Dynamic data masking is a Snowflake security feature that selectively omits plain-text data in table or view columns based on predefined policies for masking. The purpose of data masking or hiding data in specific columns is to ensure that data is accessed on a need-to-know basis. This kind of data is most likely sensitive and doesn’t need to be accessed by every user.
When is dynamic data masking used?
Data masking is usually implemented to protect personally identifiable information (PII), such as a person’s social security number, phone number, home address, or date of birth. An insurance company would likely want to reduce risk by hiding data pertaining to sensitive information if they don’t believe access to the data is necessary for conducting analysis.
However, data masking can also be used for non-production environments where testing needs to be conducted on an application. The users testing the environment wouldn’t need to know specific data if their role is just to test the environment and application. Additionally, data masking may be used to adhere to compliance requirements like HIPAA.
What is object tagging?
Another resource for data governance within Snowflake is object tagging. Object tagging enables data stewards to track sensitive data for compliance and discovery, as well as grouping desired objects such as warehouses, databases, tables or views, and columns.
When a tag is created for a table, view, or column, data stewards can determine if the data should be fully masked, partially masked, or unmasked. When tags are associated with a warehouse, a user with the tag role can view the resource usage of the warehouse to determine what, when, and how this object is being utilized.
When is object tagging used?
There are several instances where object tagging can be useful; one use would be tagging “PII” to a column and adding extra text to describe the type of PII data located there. For example, a tag can be created for a warehouse dedicated to the sales department, enabling you to track usage and deduce why a specific warehouse is being used.
Where can data governance be applied?
Data governance applies to many industries that maintain a vast amount of data from their systems, including healthcare, supply chain and logistics, and insurance; and an effective data governance strategy may use data masking and object tagging in conjunction with each other.
As previously mentioned, one common use case for data masking is for insurance customers’ PII. Normally, analysts wouldn’t need to analyze the personal information of a customer to uncover useful information leading to key business decisions. Therefore, the administrator would be able to mask columns for the customer’s name, phone number, address, social security number, and account number without interfering with analysis.
Object tagging is also valuable within the insurance industry as there is such a vast amount of data collected and consumed. A strong percentage of that data is sensitive information. Because there is so much data and it can be difficult to track those individual pieces of information, Snowflake’s object tagging feature can help with identifying and tracking the usage of those sensitive values for the business user.
Using dynamic data masking and object tagging together, you will be able to gain insights into the locations of your sensitive data and the amount specific warehouses, tables, or columns are being used.
Think back to the situation we mentioned earlier where the property coverage sales department is on legacy system X. During that same time period, the workers’ compensation sales department is on another legacy system Y. How are you supposed to create a report to understand the profitability of these two departments?
One option is to use Snowflake to store all of the data from both legacy systems. Once the information is in the Snowflake environment, object tagging would allow you to tag the databases or tables that involve data about their respective departments. One tag can be specified for property coverage and another tag can be set for workers’ compensation data. When you’re tasked with creating a report of profitability involving these two departments, you can easily identify which information can be used. Because the tag was applied to the database, it will also be applied to all of the tables and their respective columns. You would be able to understand what columns are being used. After the data from both departments is accessible within Snowflake, data masking can then be used to ensure that the new data is only truly accessible to those who need it.
This was just a small introduction to data governance and the new features that Snowflake has available to enable this effort. Don’t forget that this data governance effort can be a part of a larger, more intricate MDM initiative. In other blog posts, we touch more on MDM and other data governance capabilities to maintain and standardize your data, helping you make the most accurate and beneficial business decisions. If you have any questions in the meantime, feel free to get in touch.