Top 4 Data Management Solutions for Snowflake Success

The Data Insights practice at 2nd Watch saw the potential of Snowflake from the time it became a tech unicorn in 2015. Its innovative approach to storing and aggregating data is a game-changer in the industry! On top of that, Snowflake's value proposition to its customers complements the data management expertise that 2nd Watch has been developing since its inception. Whether you're a mid-sized insurance carrier or a Fortune 500 manufacturer, Snowflake and 2nd Watch know how to build scalable, tailored solutions for your business problems.

On top of skills in AI and machine learning, app development, and data visualization, here are the top four data engineering services 2nd Watch uses to deploy a successful cloud data platform initiative using a tool like the Snowflake Data Cloud.

Data Warehousing 

Snowflake offers powerful features in the data warehousing space that allow 2nd Watch delivery teams to stay laser-focused on business outcomes. Its innovative cloud architecture optimizes your data for storage, movement, and active use, and its ever-increasing array of valuable tools significantly improves an organization's ability to enrich and share large amounts of data with other companies.

But it doesn’t happen by magic…

2nd Watch can leverage our vast industry and technical experience to create a data warehouse for your organization that provides a fast, accurate, and consistent view of your data from multiple sources. Using best practices and well-established methodologies, 2nd Watch combines data from different sources into a centralized repository, creating a single version of the truth and a unified view.

The final design delivers a user-friendly enterprise data warehouse that connects with both legacy and modern business intelligence tools to help you analyze data across your organization. The data warehouse is optimized for performance, scaling, and ease of use by downstream applications.

Potential Deliverables

  • Conceptual and physical data models for dimensional and analytical systems
  • Deployment of three semantic layers for tracking data in a central hub: raw, business (modeled with data vault), and a data warehouse optimized for visualizations
  • Design and development of departmental data marts of curated data
  • Training of end users for the cloud-based data solution and critical data applications and tools

Data Integration 

Snowflake offers a lot of flexibility in the data integration process: the Snowflake Data Cloud allows companies to go beyond traditional extract, transform, and load (ETL) data flows. Within the Snowflake ecosystem, companies can leverage data integration solutions that handle everything from data preparation and migration to movement and management, all in an automated and scalable way.

The consultants at 2nd Watch will partner with you every step of the way and guide the entire team in the right direction to meet your decision-makers' specific goals and your organization's business data needs. These are some of the popular data integration tools and technologies that 2nd Watch can help integrate with Snowflake:

  • Azure Data Factory
  • AWS Glue and Lambda
  • Google Cloud Data Fusion
  • Fivetran/HVR
  • Etlworks 
  • IBM DataStage 
  • SnapLogic 
  • Plus, all the classics, including SQL Server Integration Services (SSIS) and Informatica

Potential Deliverables

  • Integration of any number of sources to a centralized data hub
  • Establishment of a custom system that operates well with niche sources
  • Acceleration of the ingestion process and improved auditability
  • End-game integration to a data warehouse and other target systems

Data Modernization

Snowflake is a paradigm-shifting platform. Micro-partition storage, decentralized compute, and cross-cloud sharing open up new opportunities for companies to solve pain points in their analytics processing. Our consultants at 2nd Watch are trained in the latest technologies and have the technical expertise to tackle the challenges posed by making your legacy systems "just work" in modern ecosystems like Snowflake.

Using supplemental tools like dbt or sqlDBM, this process will transform your data platform by eliminating complexities, reducing latency, generating documentation, integrating siloed sources, and unlocking the ability to scale and upgrade your existing data solutions.

Potential Deliverables

  • Migration to Snowflake from existing high-maintenance deployments
  • Refactoring, redesigning, and performance tuning of data architecture 
  • Deploying Snowpark API for integrating with Scala or Python applications 
  • Supporting modern tool selection and integration

Data Governance 

Data governance is critical to organizations hoping to achieve and maintain long-term success. Snowflake offers outstanding features such as object tagging or data classification that improve the security, quality, and value of the data. Additionally, when you work with 2nd Watch, we can help your organization establish a data governance council and program.

2nd Watch will assist you in identifying and coaching early adopters and champions. We will help with establishing roles and responsibilities (e.g., business owners, stewards, custodians), as well as creating and documenting principles, policies, processes, and standards. Finally, we will identify the right technology to help automate these processes and improve your data governance maturity level.

Potential Deliverables

  • Data governance strategy
  • Change management: identification of early adopters and champions
  • Master data management implementation
  • Data quality: data profiling, cleansing, and standardization
  • Data security and compliance (e.g., PII, HIPAA, GRC)

2nd Watch will make sure your team is equipped to make the most of your Snowflake ecosystem and analytics tools, guiding the entire process through deployment of a successful initiative. Get started with our Snowflake Value Accelerator.


26 Quick Tips to Save You Money on Your Snowflake Deployment

One of the benefits of 2nd Watch's partnership with Snowflake is access to advanced trainings and certifications. Combined with our Snowflake project work, these trainings help us become Snowflake experts and find opportunities to assist our clients in making Snowflake work better for them. The most recent training helped us identify some important tactics for solving one of our clients' biggest concerns: How do I optimize for cost? Here is a list of actions you should take to make sure you are not overspending on your Snowflake computation or storage.

First, for context, here is an extremely simplified diagram of how Snowflake is functioning in the background:

[Diagram: simplified view of a Snowflake deployment]

Since the most expensive part of any Snowflake deployment is compute, we have identified some useful tactics to store data strategically for efficient reads, write supercharged SQL scripts, and balance performance against cost.

Loading

Although loading into Snowflake is very different from storing data on traditional disk, there are many benefits to loading Snowflake data strategically.

1. Sort on ingestion: Data is automatically partitioned in SF on natural ingestion order.

– Sorting an S3 bucket (using something like syncsort) before bulk load via copy could be way faster than inserting with an order by

2. CSV (Gzipped) is the best format for loading to SF (2-3x faster than Parquet or ORC).

3. Use COPY INTO instead of INSERT because it utilizes the more efficient bulk loading processes.
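
As a minimal sketch of tips 2 and 3 combined, the statements below bulk load gzipped CSV files from an external stage; the file format, stage, and table names (gzip_csv, my_s3_stage, sales_raw) are hypothetical placeholders, not part of any specific deployment.

```sql
-- Hypothetical file format, stage, and table names, for illustration only.
CREATE OR REPLACE FILE FORMAT gzip_csv
  TYPE = CSV
  COMPRESSION = GZIP
  SKIP_HEADER = 1;

-- Bulk load gzipped CSV files from an external stage instead of row-by-row INSERTs.
COPY INTO sales_raw
  FROM @my_s3_stage/sales/
  FILE_FORMAT = (FORMAT_NAME = 'gzip_csv')
  ON_ERROR = 'ABORT_STATEMENT';
```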

Sizing

Take advantage of the native cloud ability to scale, create, and optimize your compute resources.

4. Scale up or out appropriately.

– As seen above, when you run a query, Snowflake will:

+ Find required FDN files.

+ Pull files down into SSD VMs. (Note: If >160 GB for AWS or >400 GB for Azure, will spill over to remote IO.)

+ Perform compute.

+ Keep files on VM until DW is suspended.

1 big query = increase size of data warehouse

Lots of small queries = queries are queuing = increase # of DWs or # of clusters (if enterprise, you can enable multi-cluster)
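
As a rough sketch of the two options (the warehouse names and sizes are hypothetical, and multi-cluster settings require Enterprise edition):

```sql
-- One big query: scale the warehouse up.
ALTER WAREHOUSE transform_wh SET WAREHOUSE_SIZE = 'XLARGE';

-- Many small, queuing queries: scale out with a multi-cluster warehouse (Enterprise edition).
ALTER WAREHOUSE bi_wh SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  SCALING_POLICY = 'STANDARD';
```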

5. Turn your virtual warehouse on and off for certain workloads.

Turn on for batch, then immediately turn off (no reason to wait for auto-suspend).

Use auto-resume when it makes sense.
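
A minimal sketch of that pattern, assuming a hypothetical batch_wh warehouse:

```sql
-- Resume only for the batch window, then suspend immediately (no need to wait for auto-suspend).
ALTER WAREHOUSE batch_wh RESUME;
-- ... run the batch load here ...
ALTER WAREHOUSE batch_wh SUSPEND;

-- Where auto-management makes sense, set a short auto-suspend (in seconds) and enable auto-resume.
ALTER WAREHOUSE batch_wh SET AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;
```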

6. Control query processing and concurrency with parameters.

MAX_CONCURRENCY_LEVEL

STATEMENT_QUEUED_TIMEOUT_IN_SECONDS

STATEMENT_TIMEOUT_IN_SECONDS
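
For example, these parameters can be set at the warehouse level; the warehouse name and values below are illustrative, not recommendations.

```sql
ALTER WAREHOUSE bi_wh SET
  MAX_CONCURRENCY_LEVEL = 8
  STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 300
  STATEMENT_TIMEOUT_IN_SECONDS = 3600;
```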

7. Use warehouse monitoring to size and limit cost per workload (not per database → this is a shift from the on-prem mentality).

If your workload is queuing, then add more clusters.

If your workload is slow with no queuing, then size up.

Data Modeling

Often overlooked, organizing your information into a mature data model will allow for high-performance SQL scripting and better caching potential. Shameless plug: this is 2nd Watch's bread and butter. Please reach out for a free working session to discuss data modeling for your company.

8. Build a data model for analytics.

Star Schema, 3NF, and data vault are optimal for SF.

Snowflake is NOT ideal for OLTP workloads.

9. Bake your constraints into design because SF DOES NOT enforce them.

Build queries to check for violations.
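
A sketch of what those checks might look like, using hypothetical dim_customer and fact_orders tables:

```sql
-- Primary key check: customer_id values that appear more than once.
SELECT customer_id, COUNT(*) AS row_count
FROM dim_customer
GROUP BY customer_id
HAVING COUNT(*) > 1;

-- Foreign key check: orders that reference a customer that doesn't exist.
SELECT f.customer_id
FROM fact_orders f
LEFT JOIN dim_customer d ON f.customer_id = d.customer_id
WHERE d.customer_id IS NULL;
```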

10. Build a process to alert you of loading issues (use an ETL framework).

Information_Schema.load_history

2nd Watch ETL Toolkit
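
As one possible starting point for that alerting process, a query along these lines surfaces recent loads that reported errors (the one-day window is an assumption, not a prescription):

```sql
-- Recent loads into this database that reported errors, most recent first.
SELECT table_name, last_load_time, status, row_count, error_count, first_error_message
FROM information_schema.load_history
WHERE error_count > 0
  AND last_load_time >= DATEADD(day, -1, CURRENT_TIMESTAMP())
ORDER BY last_load_time DESC;
```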

Tracking Usage

Snowflake preserves a massive amount of usage data for analysis. At the very least, it allows you to see which workflows are the most expensive.

11. Use Account Usage views (e.g., WAREHOUSE_METERING_HISTORY) for tracking history, performance, and cost.
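
For example, a query like the one below shows which warehouses burned the most credits recently (the 30-day window is just an illustration):

```sql
SELECT warehouse_name,
       SUM(credits_used) AS total_credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY total_credits DESC;
```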

12. Don't use the ACCOUNTADMIN or PUBLIC roles for creating objects or accessing data (reserve them for looking at costs). Create securable objects with the "correct" role and integrate new roles into the existing hierarchy.

– Create roles by business functions to track spending by line of business.

13. Use Resource Monitors to cut off DWs when you hit predefined credit amount limits.

Create one resource monitor per DW.

Enable notifications.
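
A minimal sketch of one monitor attached to one warehouse; the names, quota, and thresholds are placeholders, and creating resource monitors requires the ACCOUNTADMIN role.

```sql
CREATE OR REPLACE RESOURCE MONITOR etl_wh_monitor
  WITH CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS
    ON 80 PERCENT DO NOTIFY     -- send a warning notification
    ON 100 PERCENT DO SUSPEND;  -- cut off the warehouse at the limit

ALTER WAREHOUSE etl_wh SET RESOURCE_MONITOR = etl_wh_monitor;
```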

Performance Tuning

The history profiler is the primary tool to observe poorly written queries and make the appropriate changes.

14. Use history profiler to optimize queries.

The goal is to push the most expensive node to the bottom right-hand corner of the profiler diagram.

SYSTEM$CLUSTERING_DEPTH shows how effective the partitions are – the smaller the average depth, the better clustered the table is with regards to the specified columns.

+ Hot tip: You can add a new automatic reclustering service, but I don’t think it is worth the money right now.
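
To check how well a table is clustered on a given key, you can call SYSTEM$CLUSTERING_DEPTH directly, as sketched below; the table and column names are hypothetical, and the related SYSTEM$CLUSTERING_INFORMATION function returns broader statistics.

```sql
-- Average clustering depth for fact_orders on order_date; smaller is better.
SELECT SYSTEM$CLUSTERING_DEPTH('fact_orders', '(order_date)');

-- Broader clustering statistics, including the depth histogram and overlap counts.
SELECT SYSTEM$CLUSTERING_INFORMATION('fact_orders', '(order_date)');
```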

15. Analyze Bytes Scanned: remote vs cache.

Aim for the Bytes Scanned column to show "Cache" or "Local" memory most of the time; otherwise, consider creating a clustering key so scans prune more efficiently.

16. Make the ratio of partitions scanned to total partitions as small as possible through pruning.

SQL Coding

The number one issue driving costs in a Snowflake deployment is poorly written code! Resist the tendency to just increase the power (and therefore the cost) and focus some time on improving your SQL scripts.

17. Drop temporary and transient tables when done using.

18. Don't use "CREATE TABLE AS" with manual truncate-and-reload steps; Snowflake handles truncates and reloads poorly because of Time Travel. Instead, use "CREATE OR REPLACE."

Again, use COPY INTO, not INSERT INTO.

Use staging tables to manage transformation of imported data.

Validate the data BEFORE loading into SF target tables.
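
One hedged sketch of that staging pattern, with hypothetical table names; the TRANSIENT keyword is an assumption here, used because transient tables keep Time Travel and Fail-safe storage to a minimum.

```sql
-- Rebuild the staging table in one step instead of TRUNCATE + INSERT.
CREATE OR REPLACE TRANSIENT TABLE stg_orders AS
SELECT * FROM raw_orders WHERE load_date = CURRENT_DATE();

-- Validate BEFORE loading into the Snowflake target table.
SELECT COUNT(*) AS bad_rows FROM stg_orders WHERE order_id IS NULL;
```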

19. Use ANSI Joins because they are better for the optimizer.

Use “JOIN ON a.id = b.id” format.

NOT the “WHERE a.id=b.id”.
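
For example, with hypothetical tables:

```sql
-- Preferred: explicit ANSI join syntax.
SELECT o.order_id, c.customer_name
FROM fact_orders o
JOIN dim_customer c
  ON o.customer_id = c.customer_id;

-- Avoid: the implicit comma-join with the join condition in the WHERE clause.
-- SELECT o.order_id, c.customer_name
-- FROM fact_orders o, dim_customer c
-- WHERE o.customer_id = c.customer_id;
```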

20. Use “WITH” clauses for windowing instead of temp tables or sub-selects.
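
A small sketch of that pattern, using a hypothetical fact_orders table: the window function lives in a named WITH (CTE) block rather than a temp table or nested sub-select.

```sql
WITH ranked_orders AS (
    SELECT customer_id,
           order_id,
           order_total,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_total DESC) AS rn
    FROM fact_orders
)
-- Largest order per customer.
SELECT customer_id, order_id, order_total
FROM ranked_orders
WHERE rn = 1;
```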

21. Don’t use ORDER BY. Sorting is very expensive!

Use integers over strings if you must order.

22. Don’t handle duplicate data using DISTINCT or GROUP BY.
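
If duplicates do need to be resolved in SQL, one common Snowflake-native alternative (an assumption on our part, not a prescription from the tip above) is QUALIFY with a window function, which keeps a single row per key without a full DISTINCT or GROUP BY; the table and column names below are hypothetical.

```sql
-- Keep only the most recent row per order_id.
SELECT *
FROM stg_orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY load_timestamp DESC) = 1;
```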

Storing

Finally, set up the Snowflake deployment to work well in your entire data ecosystem.

23. Locate your S3 buckets in the same geographic region as your Snowflake deployment.

24. Set up the buckets to match how the files are coming across (e.g., by date or application).

25. Keep files between 60 and 100 MB to take advantage of parallelism.

26. Don’t use materialized views except in specific use cases (e.g., pre-aggregating).

Snowflake is shifting the paradigm when it comes to data warehousing in the cloud. However, because it processes data fundamentally differently than other solutions, Snowflake comes with a whole new set of implementation challenges.

Whether you’re looking for support implementing Snowflake or need to drive better performance, 2nd Watch’s Snowflake Value Accelerator will help save you money on your Snowflake investment. Click here to learn more.


A CTO’s Guide to a Modern Data Platform: What is Snowflake, How is it Different, and Where Does it Fit in Your Ecosystem?

Chances are, you’ve been here before – a groundbreaking new data and analytics technology has started making waves in the market, and you’re trying to gauge the right balance between marketing hype and reality. Snowflake promises to be a self-managing data warehouse that can get you speed-to-insight in weeks, as opposed to years. Does Snowflake live up to the hype? Do you still need to approach implementation with a well-defined strategy? The answer to both of these questions is “yes.”


What Is Snowflake and How Is It Different?

Massive Scale, Low Overhead

Snowflake is one of the few enterprise-ready cloud data warehouses that brings simplicity without sacrificing features. It automatically scales, both up and down, to get the right balance of performance vs. cost. Snowflake’s claim to fame is that it separates compute from storage. This is significant because almost every other database, Redshift included, combines the two together, meaning you must size for your largest workload and incur the cost that comes with it.

With Snowflake, you can store all your data in a single place and size your compute independently. For example, if you need near-real-time data loads for complex transformations, but have relatively few complex queries in your reporting, you can script a massive Snowflake warehouse for the data load, and scale it back down after it’s completed – all in real time. This saves on cost without sacrificing your solution goals.

Elastic Development and Testing Environments

Development and testing environments no longer require duplicate database environments. Rather than creating multiple clusters for each environment, you can spin up a test environment as you need it, point it at the Snowflake storage, and run your tests before moving the code to production. With Redshift, you’re feeling the maintenance and cost impact of three clusters all running together. With Snowflake, you stop paying as soon as your workload finishes because Snowflake charges by the second.

With the right DevOps processes in place for CI/CD (Continuous Integration/Continuous Delivery), testing each release becomes closer to a modern application development approach than it does a traditional data warehouse. Imagine trying to do this in Redshift.

Avoiding FTP with External Data Sharing

The separated storage and compute also enables some other differentiating features, such as data sharing. If you’re working with external vendors, partners, or customers, you can share your data, even if the recipient is not a Snowflake customer. Behind the scenes, Snowflake is creating a pointer to your data (with your security requirements defined). If you commonly write scripts to share your data via FTP, you now have a more streamlined, secure, and auditable path for accessing your data outside the organization. Healthcare organizations, for example, can create a data share for their providers to access, rather than cumbersome manual processes that can lead to data security nightmares.
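
As a rough sketch of what creating such a share looks like (the database, schema, table, share, and account names are all hypothetical):

```sql
-- Create a share and grant read access to the objects a provider should see.
CREATE SHARE provider_share;
GRANT USAGE ON DATABASE analytics_db TO SHARE provider_share;
GRANT USAGE ON SCHEMA analytics_db.claims TO SHARE provider_share;
GRANT SELECT ON TABLE analytics_db.claims.claims_summary TO SHARE provider_share;

-- Add the consumer account; recipients who aren't Snowflake customers can be served via a reader account.
ALTER SHARE provider_share ADD ACCOUNTS = partner_account;
```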

Where Snowflake Fits Into Your Ecosystem

Snowflake is a part of your data ecosystem, but it's not in a silo.

Always keep this at the top of your mind. A modern data platform involves not only analytics, but application integration, data science, machine learning, and many other components that will evolve with your organization. Snowflake solves the analytics side of the house, but it’s not built for the rest.

When you’re considering your Snowflake deployment, be sure to draw out the other possible components, even if future tools are not yet known. Knowing which Snowflake public cloud flavor to choose (Azure or AWS) will be the biggest decision you will make. Do you see SQL Server, Azure ML, or other Azure PaaS services in the mix; or is the AWS ecosystem more likely to fit better in the organization?

As a company, Snowflake has clearly recognized that they aren’t built for every type of workload. Snowflake partnered with Databricks to allow heavy data science and other complex workloads to run against your data. The recent partnership with Microsoft will ensure Azure services continue to expand their Snowflake native integrations – expect to see a barrage of new partnership announcements during the next 12 months.

[Diagram: Snowflake data sources]

If you have any questions or want to learn more about how Snowflake can fit into your organization, contact us today.
