How Snowflake Fits Into Your Existing Azure Environment

To get your source data ingested and loaded, or for a deep dive into how you can build a fully automated data integration process in Snowflake on Azure, schedule a Snowflake whiteboarding session with our team of data architects.

If you run a Microsoft ecosystem but have been wanting to take advantage of more tools on the market, Snowflake on Azure gives you additional opportunities to upgrade your analytics platform without throwing away the investment in, or the uniformity of, your current environment.

For those who are not familiar, Snowflake is a cloud-based, massively parallel processing (MPP), columnar storage database. It’s a newer option for data warehousing that is set up for more efficient querying of large data volumes. It also consumes structured and semi-structured data in a way that traditional relational databases are not designed to do as effectively – think “big data” without the “big overhead.”

With this release, Microsoft-centric organizations can evaluate Snowflake as a way to increase the performance and flexibility of their analytics environment by adding it on top of their existing data integration process. To determine where Snowflake might be a good fit, our 2nd Watch consultants examined where Snowflake could sit in our current and prospective Microsoft clients’ ecosystems.

Where does Snowflake fit in your current Azure environment?

Snowflake is best used as the home of an analytical layer (or dimensional model, for the more technical) that enables reporting. Think of this as the substitute for products like SQL Server Analysis Services (SSAS).

While we still recommend that you maintain a data integration hub and related process for all of your consuming application needs (i.e., sending consolidated and cleansed data back to each source system to keep them in sync), Snowflake can sit right at the end of that process. Because it’s optimized for read activities, it complements the needs of business intelligence tools like Power BI, Looker, Tableau, etc., making it faster for business users to grab the information they need via those tools. Integration and ETL are possible with many tools and services, including Azure Data Factory.

Example architecture for adding Snowflake to the end of your data integration process

When would you want to use Snowflake?

There are two primary ideas behind Snowflake’s competitive advantage when it comes to data warehousing platforms: its automatic optimization of query execution and the hands-off nature of its maintenance.

We recommend Snowflake for two main use cases:

  1. Developing a more efficient analytics platform
  2. Creating a platform for flexible data discovery

Our clients that fit within the above use cases usually had:

  • Multiple business users – The more people you have querying your database at one time, the more the database has to be configured to handle that load so it doesn’t lock up other processes. Traditional databases can be scaled up to handle larger reads (think of a query that produces data), but this takes a decent amount of time and effort to achieve, and they are more often optimized for writes (think of loading data into a table). In Snowflake, a user can spin up a warehouse for just one query, then spin it back down right after. This allows for a more modular use of higher-powered resources.
  • Lots of data – If you’ve ever tried to perform a huge query on your current system, you likely noticed a slowdown from your usual processing. Traditional databases are not as optimized for read activities as columnar databases are. This makes options like Snowflake more attractive to those performing heavier analytical queries on a regular basis.

“Database systems have traditionally optimized performance for write-intensive workloads. Recently, there has been renewed interest in architectures that optimize read performance by using column-oriented data representation and light-weight compression. This previous work has shown that under certain broad classes of workloads, column-based systems can outperform row-based systems.” – MIT Computer Science and Artificial Intelligence Laboratory

  • A mix of structured and semi-structured data – Though many traditional databases offer options for consuming semi-structured data (e.g., JSON, XML, etc.), they aren’t optimized to do so. If you have a mix of structured and semi-structured data, options like Snowflake might be more efficient for your process.
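To make the semi-structured point concrete, here is a minimal sketch of landing JSON in a VARIANT column and querying it with Snowflake’s path notation. The table and field names (raw_events, payload, order.line_items) are hypothetical placeholders rather than any specific client environment:

    -- Land raw JSON as-is in a VARIANT column
    CREATE OR REPLACE TABLE raw_events (
        event_id  NUMBER AUTOINCREMENT,
        payload   VARIANT,
        loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
    );

    -- Traverse the JSON with path notation, cast to typed columns, and
    -- explode nested arrays with FLATTEN
    SELECT
        payload:customer.id::STRING       AS customer_id,
        payload:order.total::NUMBER(10,2) AS order_total,
        f.value:sku::STRING               AS line_item_sku
    FROM raw_events,
         LATERAL FLATTEN(input => payload:order.line_items) f
    WHERE payload:order.status::STRING = 'SHIPPED';

No up-front shredding or schema definition step is required, which is what makes this pattern attractive for mixed structured and semi-structured workloads.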

What’s the tradeoff?

Snowflake can be a great option for clients who need a true analytical platform where they can perform data discovery or need to read out a lot of data at once. That said, the following are situations where an alternative to Snowflake might make more sense:

  • You need a data hub to keep your various applications in sync (application integration and API endpoints).
  • You aren’t trying to perform complex aggregations or data discovery.
  • You’re invested in an existing solution that doesn’t have many performance or management overhead concerns.

So what’s the gist?

Snowflake’s ease of use and start-small, scale-big approach make it simple to kick off a data warehouse project. Despite this innovative architecture, we still recommend applying the same data warehousing fundamentals that you would have in the past.

Yes, Snowflake will allow you to create views on top of messy raw source data, but you will ultimately lose the battle on performance as your data grows in complexity. We suggest approaching a new Snowflake deployment with a vision of your data platform as a whole – not just the analytics layer. Performing all your complex business logic in Snowflake is possible, but due to the columnar architecture, integration use cases are better served in complementary cloud database platforms, such as Azure SQL.

With a roadmap and strategy in place for both your integration and analytics needs, you’ll be better able to select the right mix of tools and technologies for your solution. Learn more about Snowflake deployment best practices in our eBook.


A CTO’s Guide to a Modern Data Platform: How to Build a Data Warehouse in 6-8 Weeks

It’s true that in a very short period of time, you can get an operational data warehouse loaded with all your source data. Wondering how to build a data warehouse that yields these results? Snowflake and its technology partner ecosystem, including partners like Fivetran, allow you to take your databases and SaaS applications, replicate them to Snowflake, and see results in a significantly shorter timeframe than we’ve seen before. Write some SQL views in Snowflake against this data, point your favorite BI tool at it, and get lightning-fast results.
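As a deliberately simplified illustration of that last step, a reporting view over replicated raw data might look like the following. The object names (raw.orders, raw.customers, analytics.v_daily_sales) are hypothetical and would vary by source system and naming convention:

    -- A reporting view over raw tables replicated by a tool like Fivetran
    CREATE OR REPLACE VIEW analytics.v_daily_sales AS
    SELECT
        o.order_date,
        c.region,
        SUM(o.order_total)         AS total_sales,
        COUNT(DISTINCT o.order_id) AS order_count
    FROM raw.orders    o
    JOIN raw.customers c ON c.customer_id = o.customer_id
    GROUP BY o.order_date, c.region;

    -- Point Power BI, Tableau, or Looker at analytics.v_daily_sales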

With the right plan in place, you can (aggressively) deliver that first business “wow” in six to eight weeks. 2nd Watch typically recommends at least two analytical virtual warehouses in enterprise deployments: one for data discovery and one for more structured and governed reporting.
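In Snowflake terms, that recommendation is simply two (or more) virtual warehouses. Here is a minimal sketch; the names, sizes, and suspend settings are assumptions you would tune to your own workload:

    -- One warehouse for ad hoc data discovery, one for governed reporting
    CREATE WAREHOUSE IF NOT EXISTS discovery_wh
        WAREHOUSE_SIZE = 'MEDIUM'
        AUTO_SUSPEND   = 60      -- suspend after 60 seconds idle to control cost
        AUTO_RESUME    = TRUE;

    CREATE WAREHOUSE IF NOT EXISTS reporting_wh
        WAREHOUSE_SIZE = 'SMALL'
        AUTO_SUSPEND   = 300
        AUTO_RESUME    = TRUE;

Because compute is separated from storage, analysts exploring data on discovery_wh never compete for resources with the dashboards running on reporting_wh.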

The practice of having both a data discovery layer and a governed layer for your Snowflake deployment not only saves you time in the initial build, but also creates a continuous integration and deployment pattern. It finally makes a dent in the notion that a data warehouse cannot be “agile.”

Future State Snowflake Data Architecture

With this approach, you not only achieve governance and speed-to-insight, but you’ve also cut down your Snowflake consumption costs. Running complex queries at run-time, every time, can get expensive. Snowflake’s caching can help here, but if you’re constantly running a complex join across more than 20 tables, a physical governed layer might help you move from an XL-sized Snowflake warehouse to an L or M. In the long run, those cost savings will add up. When “best practice” or risk avoidance isn’t enough to justify this approach, the dollar savings might speak for themselves.
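One way to build that physical layer, sketched here with hypothetical object names, is to persist the expensive view into a table and refresh it on a schedule with a Snowflake task, so dashboards read a precomputed table rather than re-running the join on every refresh:

    -- Persist the result of the complex discovery-layer view
    CREATE OR REPLACE TABLE governed.fct_client_profitability AS
    SELECT * FROM discovery.v_client_profitability;

    -- Rebuild it nightly with a task instead of paying for the join at every query
    CREATE OR REPLACE TASK refresh_client_profitability
        WAREHOUSE = reporting_wh
        SCHEDULE  = 'USING CRON 0 5 * * * UTC'
    AS
        INSERT OVERWRITE INTO governed.fct_client_profitability
        SELECT * FROM discovery.v_client_profitability;

    ALTER TASK refresh_client_profitability RESUME;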

The initial six- to eight-week path assumes a focus on the data discovery layer, as depicted below. Loading all your data into a data discovery layer should be the primary development activity in your Snowflake proof of concept (POC) or pilot. Here are some tips:

  • Find several source systems that have the data your stakeholders need.
  • Begin the process of rapidly loading that data into your Snowflake data discovery layer (see the sketch after this list).
  • Write iterative SQL in Snowflake views to build your business logic.
  • Connect your BI tool to Snowflake and build a very simple dashboard.
  • Get feedback from the business, using the new dashboard as a facilitation tool.
  • Repeat this process – more feedback is better.
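For the loading step above, a bare-bones pattern is a named stage plus COPY INTO. Every object name here is a hypothetical placeholder, and a replication tool like Fivetran can stand in for this step entirely:

    -- Stage for files exported from a source system
    CREATE OR REPLACE STAGE discovery.orders_stage
        FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

    -- Upload local files from SnowSQL or another client, for example:
    -- PUT file:///tmp/orders*.csv @discovery.orders_stage;

    -- Bulk-load into a raw table in the discovery layer
    COPY INTO discovery.raw_orders
    FROM @discovery.orders_stage
    ON_ERROR = 'CONTINUE';  -- tolerate bad rows during a first-pass discovery load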

You’ve not only increased the speed to market, but you’ve also enabled developers and business users to execute an analytics project in a completely new way.

Highlighted: Speed-to-insight approach, with transformations in Snowflake

This all comes with a caveat: Yes, you can write all your logic in Snowflake and do everything you need from an analytics standpoint, but it will limit you in the long run. Every data project has sacrifices to be made, whether it’s time, features, cost, or flexibility. You need to balance these with a long-term vision for how your data will be used across the organization. Snowflake will get you part of the way there, but a good plan and strategy will take you even further.

In the meantime, if you have any questions or want to learn more about how to build a data warehouse using Snowflake, contact us today.



Snowflake’s Role in Data Governance for Insurance: Data Masking and Object Tagging Features

Data governance is a broad-ranging discipline that affects everyone in an organization, whether directly or indirectly. It is most often employed to improve and consistently manage data through deduplication and standardization, among other activities, and can have a significant and sustained effect on reducing operational costs, increasing sales, or both.


Data governance can also be part of a more extensive master data management (MDM) program. The MDM program an organization chooses and how they implement it depends on the issues they face and both their short- and long-term visions.

For example, in the insurance industry, many companies sell various types of insurance policies that renew annually over a number of years, such as industrial property coverages and workers’ compensation casualty coverages. Two sets of underwriters will more than likely underwrite the business. Having two sets of underwriters using data systems specific to their lines of business is an advantage when meeting the coverage needs of their customers, but it often becomes a disadvantage when the company needs to look at all of its data together. It doesn’t have to be that way.

The disadvantage arises when an agent or account executive needs to know the overall status of a client, including long-term profitability during all the years of coverage. This involves pulling data from policy systems, claims systems, and customer support systems. An analyst may be tasked with producing a client report for the agent or account executive to truly understand their client and make better decisions on both the client and company’s behalf. But the analyst may not know where the data is stored, who owns the data, or how to link clients across disparate systems.

Fifteen years ago, this task was very time-consuming, and even five years ago it was still quite cumbersome. Today, however, this issue can be mitigated with the correct data governance plan. We will go deeper into data governance and MDM in upcoming posts, but for this one, we want to show you how innovators like Snowflake are helping the cause.

What is data governance?

Data governance ensures that data is consistent, accurate, and reliable, which allows for informed and effective decision-making. This can be achieved by centralizing the data into one location from a few or many siloed locations. Ensuring that data is accessible in one location enables data users to understand and analyze it so they can make effective decisions. One way to accomplish this centralization of data is to implement the Snowflake Data Cloud.

Snowflake not only enables a company to store their data inexpensively and query the data for analytics, but it can foster data governance. Dynamic data masking and object tagging are two new features from Snowflake that can supplement a company’s data governance initiative.

What is dynamic data masking?

Dynamic data masking is a Snowflake security feature that selectively masks plain-text data in table or view columns at query time, based on predefined masking policies. The purpose of masking, or hiding data in specific columns, is to ensure that data is accessed on a need-to-know basis. This kind of data is most likely sensitive and doesn’t need to be accessed by every user.

When is dynamic data masking used?

Data masking is usually implemented to protect personally identifiable information (PII), such as a person’s social security number, phone number, home address, or date of birth. An insurance company would likely want to reduce risk by hiding data pertaining to sensitive information if they don’t believe access to the data is necessary for conducting analysis.

However, data masking can also be used for non-production environments where testing needs to be conducted on an application. The users testing the environment wouldn’t need to know specific data if their role is just to test the environment and application. Additionally, data masking may be used to adhere to compliance requirements like HIPAA.
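As a minimal sketch of what this looks like in practice (the role, table, and column names are hypothetical), a masking policy is defined once and then attached to any sensitive column:

    -- Only a privileged role sees full social security numbers
    CREATE OR REPLACE MASKING POLICY mask_ssn AS (val STRING) RETURNS STRING ->
        CASE
            WHEN CURRENT_ROLE() IN ('PII_ADMIN') THEN val
            ELSE 'XXX-XX-' || RIGHT(val, 4)   -- partial mask for everyone else
        END;

    -- Attach the policy to the column; queries are masked automatically
    ALTER TABLE customers MODIFY COLUMN ssn SET MASKING POLICY mask_ssn;

Analysts querying the customers table still see the column, but the values are masked unless their current role is allowed to see them.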

What is object tagging?

Another resource for data governance within Snowflake is object tagging. Object tagging enables data stewards to track sensitive data for compliance and discovery, as well as grouping desired objects such as warehouses, databases, tables or views, and columns.

When a tag is created for a table, view, or column, data stewards can determine if the data should be fully masked, partially masked, or unmasked. When tags are associated with a warehouse, a user with the tag role can view the resource usage of the warehouse to determine what, when, and how this object is being utilized.

When is object tagging used?

  • There are several instances where object tagging can be useful. One is tagging a column as “PII” and adding extra text to describe the type of PII data located there. Another is creating a tag for a warehouse dedicated to the sales department, enabling you to track usage and understand why a specific warehouse is being used.
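A short sketch of both uses, with hypothetical tag and object names, might look like this:

    -- Create tags for PII classification and departmental ownership
    CREATE TAG IF NOT EXISTS pii_type COMMENT = 'Type of personally identifiable information';
    CREATE TAG IF NOT EXISTS department COMMENT = 'Business unit that owns the resource';

    -- Tag a sensitive column and a departmental warehouse
    ALTER TABLE customers MODIFY COLUMN ssn SET TAG pii_type = 'social security number';
    ALTER WAREHOUSE sales_wh SET TAG department = 'sales';

    -- Data stewards can then locate tagged objects through the account usage views
    SELECT *
    FROM snowflake.account_usage.tag_references
    WHERE tag_name = 'PII_TYPE';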

Where can data governance be applied?

Data governance applies to many industries that maintain vast amounts of data in their systems, including healthcare, supply chain and logistics, and insurance. An effective data governance strategy may use data masking and object tagging in conjunction with each other.

As previously mentioned, one common use case for data masking is for insurance customers’ PII. Normally, analysts wouldn’t need to analyze the personal information of a customer to uncover useful information leading to key business decisions. Therefore, the administrator would be able to mask columns for the customer’s name, phone number, address, social security number, and account number without interfering with analysis.

Object tagging is also valuable within the insurance industry because of the vast amount of data collected and consumed, a significant percentage of which is sensitive information. Because there is so much data and it can be difficult to track individual pieces of information, Snowflake’s object tagging feature helps business users identify and track the usage of those sensitive values.

Using dynamic data masking and object tagging together, you will be able to gain insight into where your sensitive data lives and how much specific warehouses, tables, or columns are being used.

Think back to the situation we mentioned earlier, where the property coverage sales department is on legacy system X while the workers’ compensation sales department is on legacy system Y. How are you supposed to create a report to understand the profitability of these two departments?

One option is to use Snowflake to store all of the data from both legacy systems. Once the information is in the Snowflake environment, object tagging would allow you to tag the databases or tables that involve data about their respective departments. One tag can be specified for property coverage and another tag can be set for workers’ compensation data. When you’re tasked with creating a report of profitability involving these two departments, you can easily identify which information can be used. Because the tag was applied to the database, it will also be applied to all of the tables and their respective columns. You would be able to understand what columns are being used. After the data from both departments is accessible within Snowflake, data masking can then be used to ensure that the new data is only truly accessible to those who need it.

This was just a small introduction to data governance and the new features that Snowflake has available to enable this effort. Don’t forget that this data governance effort can be a part of a larger, more intricate MDM initiative. In other blog posts, we touch more on MDM and other data governance capabilities to maintain and standardize your data, helping you make the most accurate and beneficial business decisions. If you have any questions in the meantime, feel free to get in touch.


A High-Level Overview of Snowflake

Using a modern data warehouse, like Snowflake, can give your organization improved access to your data and dramatically improved analytics. When paired with a BI tool, like Tableau, or a data science platform, like Dataiku, you can gain even faster access to impactful insights that help your organization fuel innovation and drive business decisions.

In this post, we’ll provide a high-level overview of Snowflake, including a description of the tool, why you should use it, pros and cons, and complementary tools and technologies.

Overview of Snowflake


Snowflake was built from the ground up for the cloud, initially starting on AWS and scaling to Azure and GCP. With no servers to manage and near-unlimited scale in compute, Snowflake separates compute from storage and charges based on the size and length of time that compute clusters (known as “virtual warehouses”) are running queries.

Value Prop:

  • Cross-cloud support lets organizations choose which cloud provider to use
  • Dynamic compute scaling saves on cost
  • Micro-partitioned storage with automatic maintenance

Scalability:

  • Rapid auto-scaling of compute nodes allows for increased cost savings and high concurrency on demand, and compute and storage are separated

Performance:

  • Built for MPP (massively parallel processing)
  • Optimized for read via a columnar backend
  • Dedicated compute means no concurrency issues

Features:

  • Ability to assign dedicated compute
  • High visibility into spend
  • Native support for JSON, XML, Avro, Parquet, and ORC semi-structured data formats
  • SnowSQL has slight syntax differences
  • Introduction of Snowpark for Snowflake native development

Security:

  • Full visibility into queries executed, by whom, and how long they ran
  • Precision point-in-time restore available via “time-travel” feature
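Both of these are available directly in SQL. The following sketch (table names hypothetical) shows a simple audit query against the account usage share and a point-in-time read and restore with time travel:

    -- Who ran what over the last week, and how long did it take?
    SELECT user_name, query_text, total_elapsed_time / 1000 AS elapsed_seconds
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
    ORDER BY total_elapsed_time DESC
    LIMIT 20;

    -- Time travel: read a table as it was an hour ago, or restore it with a clone
    SELECT * FROM orders AT (OFFSET => -3600);
    CREATE TABLE orders_restored CLONE orders AT (OFFSET => -3600);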

Why Use Snowflake

Snowflake is decoupled from any single cloud vendor, allowing a true multi-cloud experience. You can deploy on Azure, AWS, GCP, or any combination of those cloud services. With near-unlimited scale and minimal management, it offers a best-in-class data platform with a pay-for-what-you-use consumption model.

Pros of Snowflake

  • Allows for a multi-cloud experience built on top of existing AWS, Azure, or GCP resources, depending on your preferred platform
  • Highly-performant queries utilizing uniquely provisioned pay-as-you-go compute and automatically derived partitioning
  • Easy implementation of security and role definitions for a less frustrating user experience and easier delineation of cost while keeping data secure
  • Integrated ability to share data to partners or other consumers outside of an organization and supplement data with publicly available datasets within Snowflake

Cons of Snowflake

  • Ecosystem of tooling continues to grow as adoption expands, but some features are not readily available
  • Due to the paradigm shift in a cloud-born architecture, taking full advantage of Snowflake’s advanced features requires a good understanding of cloud data architecture

Select Complementary Tools and Technologies for Snowflake

  • Apache Kafka
  • AWS Lambda
  • Azure Data Factory
  • Dataiku
  • Power BI
  • Tableau

We hope you found this high-level overview of Snowflake helpful. If you’re interested in learning more about Snowflake or other modern data warehouse tools like Amazon Redshift, Azure Synapse, and Google BigQuery, contact us to learn more.

The content of this blog is an excerpt of our Modern Data Warehouse Comparison Guide. Click here to download a copy of that guide.


A CTO’s Guide to a Modern Data Platform: Methods to Implement a Snowflake Project

Now that you’re prepared to build your modern data platform with Snowflake, you need to determine the best path to get there. Here are two recommended methods to implement a Snowflake project that will get you to the same end state:

  1. End-to-End Path: Build your end-to-end solution, from raw layer to data hub to data warehouse.
  2. Analytics-First Path: Build your Snowflake data warehouse and analytics environment first, and work your way back in future phases.

This blog post originally appeared as a section of our eBook, “Snowflake Deployment Best Practices: A CTO’s Guide to a Modern Data Platform.” Click here to download the full eBook.

End-to-End Path

The end-to-end path will be a higher initial investment, but you will enable both integration and analytics by the time you are finished. When you know what you want and have a fairly stable business environment, this path will ensure you avoid accruing technical debt by continuously putting off work that needs to be done.

Modern Data Platform

Identify a key subject area, such as financial reporting, and bring in only that data throughout the data flow. By the time you finish your first phase, you will have enabled financial analytics, integration of financial data into other applications, and a full end-to-end solution that can be expanded upon with more data in future iterative phases.

Analytics-First Path

The analytics-first path takes a different approach and is likely the most common one you will see in Snowflake POCs. This path is a lower initial cost from a services and work standpoint, and it can produce business value in a much shorter time frame. If you need to show results to get buy-in to a larger project, the analytics-first approach will allow you to load raw data from your source applications and leverage Snowflake views for analytics tools to consume.

Implement a Snowflake Project

The trade-off with this approach comes in the form of an incomplete solution. By identifying several key subject areas, such as both financial and sales reporting, you will be able to get buy-in from multiple stakeholders in the organization. The temptation will be high to continue to add more functionality and more data as business demand grows. Working your way back to create a full end-to-end solution, however, is the right next step. The implementation costs will be higher, but you will be able to create a better-governed data platform, reduce Snowflake costs, and allow for application integration by following the plan.

Both options will lead you to success. If you have buy-in from business leaders and feel comfortable with your vision, building the end-to-end path will allow you to create a strong foundation at the beginning. If you need to show fast results or need to pivot quickly, the analytics-first path will help you sell the value of the solution by showing stakeholders results in dashboards before making a bigger investment up front. Both require a plan to see the entire vision executed but take different routes to get to the end state.

Conclusion

Knowing where you’re going in your journey is worth the investment in planning.

Snowflake implementation partners, such as 2nd Watch, know how to leverage Snowflake for its strengths and how to complement and augment it with other systems. We know the ecosystems, such as Azure and AWS, and we know how to help you put a strategy in place so Snowflake doesn’t become an analytics silo. We will help you select the right path that creates success for both you and your organization.

If you’re in the process of implementing Snowflake, wondering how it might fit in your organization, or just dreading your next daily ETL load, we would be happy to schedule a free whiteboarding session to brainstorm some ideas together. Click here to get started.



Snowflake Summit 2022 Keynote Recap: Disrupting Application Development

Back in 2014, Snowflake disrupted analytics by introducing the Snowflake Elastic Data Warehouse, the first data warehouse built from the ground up for the cloud, with patented architecture that would revolutionize the data platform landscape. Four years later, Snowflake continued to disrupt the data warehouse industry by introducing Snowflake Data Sharing, an innovation for data collaboration that eliminated the barriers of traditional data sharing methods in favor of enabling enterprises to easily share live data in real time without moving the shared data. This year, in 2022, under the bright Sin City lights, Snowflake intends to disrupt application development by unveiling a unified platform for developing data applications, from coding to monetization.


Currently, data teams looking to add value, whether by improving their team’s analytical efficiency or reducing costs in their enterprise’s processes, develop internal data products such as ML-powered product categorization models or C-suite dashboards in whichever technology their team is savvy in. However, for an external data product to bring value to the enterprise, there is only one metric executives truly care about: revenue.

To bridge the value gap between internal and external data products comes the promise of the Snowflake Native Application Framework. This framework will now enable developers to build, distribute, and deploy applications natively in the Data Cloud landscape through Snowflake. Moreover, these applications can be monetized on the Snowflake Marketplace, where consumers can securely purchase, install, and run these applications natively in their Snowflake environments, with no data movement required. It’s important to note that Snowflake’s goal is not to compete with OLTP Oracle DB workloads, but rather to disrupt how cloud applications are built by seamlessly blending the transactional and analytical capabilities Snowflake has to offer.

To round out the Snowflake Native Application Framework, a series of product announcements were made at the Summit:

  • Unistore (Powered by Hybrid Tables): To bridge transactional and analytical data in a single platform, Snowflake developed a new workload called Unistore. At its core, the new workload enables customers to unify their datasets across multiple solutions and streamline application development by incorporating all the same simplicity, performance, security, and governance customers expect from the Snowflake Data Cloud platform. To power the core, Snowflake developed Hybrid Tables. This new table type supports fast single-row operations driven by a ground-breaking row-based storage engine that will allow transactional applications to be built entirely in Snowflake. Hybrid Tables will also support primary key enforcement to protect against duplicate record inserts (see the sketch after this list).
  • Snowpark for Python: Snowpark is a development framework designed to bridge the skill sets of engineers, data scientists, and developers. Previously, Snowpark only supported Java and Scala, but Snowflake knew what the people wanted – the language of choice across data engineers/scientists and application developers, Python. Allowing Python workloads into Snowflake removes much of the security burden developers often face within their enterprises.
  • Snowflake Worksheets for Python: Though currently in private preview, Snowflake will support Python development natively within Snowsight Worksheets for developing pipelines, ML models, and applications. This brings streamlined development features like auto-complete and the ability to code custom Python logic in seconds.
  • Streamlit Acquisition: To democratize access to data, a vision both Snowflake and Streamlit share, Snowflake acquired Streamlit, an open-source Python project for building data-based applications. Streamlit helps fill the void of bringing a simplified app-building experience for data scientists who want to quickly translate an ML model into a visual application that anyone within their enterprise can access.
  • Large Memory Warehouses: Still in development (out for preview in AWS in EU-Ireland), Snowflake will soon allow consumers to access 5x and 6x larger warehouses. These larger warehouses will enable developers to execute memory-intensive operations such as training ML models on large datasets through open-source Python libraries that will be natively available through Anaconda integration.
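For a feel of what Unistore development could look like, here is a minimal Hybrid Table sketch. The syntax reflects the preview described at the Summit and may change before general availability; the table and values are hypothetical:

    -- Row-based storage with an enforced primary key, built for fast single-row operations
    CREATE HYBRID TABLE order_status (
        order_id     NUMBER PRIMARY KEY,
        status       STRING,
        last_updated TIMESTAMP_NTZ
    );

    -- Transactional-style point writes sit alongside Snowflake's analytical queries
    INSERT INTO order_status VALUES (1001, 'PICKED', CURRENT_TIMESTAMP());
    UPDATE order_status
    SET status = 'SHIPPED', last_updated = CURRENT_TIMESTAMP()
    WHERE order_id = 1001;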

On top of all those features released for application development, Snowflake also released key innovations to improve data accessibility, such as:

  • Snowpipe Streaming: To eliminate the boundaries between batch and streaming pipelines, Snowflake introduced Snowpipe Streaming. This new feature will simplify stitching together both real-time and batch data into one single system. Users can now ingest aggregated log data from IoT devices via a client API endpoint without adding event hubs, and even ingest CDC streams at lower latency.
  • External Apache Iceberg Tables: Developed by Netflix, Apache Iceberg tables are open-source tables that can support a variety of file formats (e.g., Parquet, ORC, Avro). Snowflake will now allow consumers to query Iceberg tables in place, without moving the table data or existing metadata. This translates to being able to access customer-supplied storage buckets with Iceberg tables without compromising on security and taking advantage of the consistent governance of the Snowflake platform.
  • External On-Prem Storage Tables: For many enterprises, moving data into the Data Cloud is not a reality due to a variety of reasons, including size, security concerns, cost, etc. To overcome this setback, Snowflake has released in private preview the ability to create External Stages and External Tables on storage systems such as Dell or Pure Storage that can expose a highly compliant S3 API. This will allow customers to access a variety of storage devices using Snowflake without worrying about concurrency issues or the effort of maintaining compute platforms.

Between the Native Application Framework and the new additions for data accessibility, Snowflake has taken a forward-thinking approach on how to effectively disrupt the application framework. Developers should be keen to take advantage of all the new features this year while understanding that some key features such as Unistore and Snowpipe Streaming will have bumps along the road as they are still under public/private preview.

Wondering if Snowflake is the right choice for your organization? Read our high-level overview of Snowflake here or contact us to discuss your organization’s data and analytics needs.
