26 Quick Tips to Save You Money on Your Snowflake Deployment

One of the benefits of 2nd Watch’s partnership with Snowflake is access to advanced trainings and certifications. Combined with our Snowflake project work, these trainings help us become Snowflake experts and find opportunities to make Snowflake work better for our clients. The most recent training helped us identify some important tactics for solving one of our clients’ biggest concerns: How do I optimize for cost? Here is a list of actions you should take to make sure you are not overspending on Snowflake compute or storage.

First, for context, here is an extremely simplified diagram of how Snowflake functions behind the scenes:

Since the most expensive part of any Snowflake deployment is compute, we have identified some useful tactics to store data strategically for efficient reads, write supercharged SQL scripts, and balance performance against cost.

Loading

Although loading data into Snowflake is very different from writing to traditional disk, there are many benefits to loading your data strategically.

1. Sort on ingestion: Data is automatically micro-partitioned in Snowflake in natural ingestion order.

– Sorting the data in your S3 bucket (using something like Syncsort) before a bulk load via COPY can be much faster than inserting with an ORDER BY.

2. CSV (Gzipped) is the best format for loading to SF (2-3x faster than Parquet or ORC).

3. Use COPY INTO instead of INSERT because it utilizes the more efficient bulk loading processes.
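
To make tips 1–3 concrete, here is a minimal sketch of a bulk load from gzipped CSV files; the stage, file format, and table names (my_stage, csv_gz_format, my_table) are placeholders, not part of any real deployment.

    -- Define a reusable file format for gzipped CSV (placeholder names throughout).
    CREATE OR REPLACE FILE FORMAT csv_gz_format
      TYPE = CSV
      FIELD_OPTIONALLY_ENCLOSED_BY = '"'
      COMPRESSION = GZIP;

    -- Bulk load with COPY INTO rather than row-by-row INSERTs; if the files in
    -- the stage are already sorted, the natural ingestion order is preserved.
    COPY INTO my_table
      FROM @my_stage/daily/
      FILE_FORMAT = (FORMAT_NAME = 'csv_gz_format')
      ON_ERROR = 'ABORT_STATEMENT';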

Sizing

Take advantage of the native cloud ability to scale, create, and optimize your compute resources.

4. Scale up or out appropriately.

– As seen above, when you run a query, Snowflake will:

+ Find the required FDN (micro-partition) files.

+ Pull files down into SSD VMs. (Note: If >160 GB for AWS or >400 GB for Azure, will spill over to remote IO.)

+ Perform compute.

+ Keep files on VM until DW is suspended.

One big query = scale up (increase the warehouse size).

Lots of small queries = queries are queuing = scale out (increase the number of warehouses or, on Enterprise edition, enable multi-cluster warehouses and increase the number of clusters). See the sketch below.
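
As a rough sketch of the two scaling paths (warehouse names are placeholders):

    -- Scale up for one big query: resize the warehouse.
    ALTER WAREHOUSE batch_wh SET WAREHOUSE_SIZE = 'XLARGE';

    -- Scale out for many small, queuing queries (multi-cluster requires Enterprise):
    ALTER WAREHOUSE bi_wh SET
      MIN_CLUSTER_COUNT = 1
      MAX_CLUSTER_COUNT = 4
      SCALING_POLICY = 'STANDARD';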

5. Turn your virtual warehouse on and off for certain workloads.

Turn on for batch, then immediately turn off (no reason to wait for auto-suspend).

Use auto-resume when it makes sense.
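
A minimal sketch of that pattern, assuming a dedicated batch warehouse (batch_wh) and an ad hoc warehouse (adhoc_wh) as placeholder names:

    -- Resume right before the batch window, suspend the moment it finishes.
    ALTER WAREHOUSE batch_wh RESUME;
    -- ... run the batch load ...
    ALTER WAREHOUSE batch_wh SUSPEND;

    -- For ad hoc workloads, let Snowflake manage it instead.
    ALTER WAREHOUSE adhoc_wh SET
      AUTO_SUSPEND = 60
      AUTO_RESUME = TRUE;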

6. Control query processing and concurrency with parameters.

MAX_CONCURRENCY_LEVEL

STATEMENT_QUEUED_TIMEOUT_IN_SECONDS

STATEMENT_TIMEOUT_IN_SECONDS
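
These parameters can be set per warehouse. A hedged example with arbitrary values (tune them to your own workloads):

    -- Cap concurrency and stop statements from queuing or running forever.
    ALTER WAREHOUSE bi_wh SET
      MAX_CONCURRENCY_LEVEL = 8
      STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 300
      STATEMENT_TIMEOUT_IN_SECONDS = 3600;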

7. Use warehouse monitoring to size and limit cost per workload (not per database → this is a shift from the on-prem mentality).

If your workload is queuing, then add more clusters.

If your workload is slow with no queuing, then size up.
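
One way to tell queuing from plain slowness is the WAREHOUSE_LOAD_HISTORY account usage view; this query is a sketch, and the warehouse name is a placeholder:

    -- Average running vs. queued load per hour for one warehouse, last 7 days.
    SELECT start_time,
           avg_running,
           avg_queued_load
    FROM snowflake.account_usage.warehouse_load_history
    WHERE warehouse_name = 'BI_WH'
      AND start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP());

Sustained avg_queued_load points toward adding clusters; high avg_running with no queuing points toward sizing up.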

Data Modeling

Often overlooked, organizing your information into a mature data model will allow for high-performance SQL scripting and better caching potential. Shameless plug: this is 2nd Watch’s bread and butter. Please reach out for a free working session to discuss data modeling for your company.

8. Do a data model for analytics.

Star Schema, 3NF, and data vault are optimal for SF.

Snowflake is NOT ideal for OLTP workloads.

9. Bake your constraints into design because SF DOES NOT enforce them.

Build queries to check for violations
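
For example, a scheduled check like this (table and key names are placeholders) can catch primary key violations that Snowflake will not enforce for you:

    -- Any business key appearing more than once violates the intended primary key.
    SELECT customer_id, COUNT(*) AS row_count
    FROM dim_customer
    GROUP BY customer_id
    HAVING COUNT(*) > 1;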

10. Build a process to alert you of loading issues (use an ETL framework).

INFORMATION_SCHEMA.LOAD_HISTORY

2nd Watch ETL Toolkit
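
As a sketch, the LOAD_HISTORY view in each database’s INFORMATION_SCHEMA can feed a simple alert; the time window and filters here are arbitrary:

    -- Recent COPY INTO activity that failed or loaded with errors.
    SELECT table_name,
           last_load_time,
           status,
           row_count,
           error_count,
           first_error_message
    FROM information_schema.load_history
    WHERE last_load_time >= DATEADD(day, -1, CURRENT_TIMESTAMP())
      AND (error_count > 0 OR status <> 'LOADED');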

Tracking Usage

Snowflake preserves a massive amount of usage data for analysis. At the very least, it allows you to see which workflows are the most expensive.

11. Use Account Usage views (e.g., WAREHOUSE_METERING_HISTORY) for tracking history, performance, and cost.
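
For example, a quick cost breakdown per warehouse (a sketch; the 30-day window is arbitrary):

    -- Credits consumed per warehouse over the last 30 days.
    SELECT warehouse_name,
           SUM(credits_used) AS credits_used
    FROM snowflake.account_usage.warehouse_metering_history
    WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
    GROUP BY warehouse_name
    ORDER BY credits_used DESC;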

12. Don’t use the ACCOUNTADMIN or PUBLIC roles for creating objects or accessing data (use them only for looking at costs). Create securable objects with the “correct” role and integrate new roles into the existing hierarchy.

– Create roles by business functions to track spending by line of business.

13. Use Resource Monitors to cut off DWs when you hit predefined credit amount limits.

Create one resource monitor per DW.

Enable notifications.
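
A minimal sketch of one monitor per warehouse (quota, thresholds, and names are placeholders):

    -- Notify at 80% of the monthly quota, suspend the warehouse at 100%.
    CREATE RESOURCE MONITOR bi_wh_monitor WITH
      CREDIT_QUOTA = 100
      FREQUENCY = MONTHLY
      START_TIMESTAMP = IMMEDIATELY
      TRIGGERS ON 80 PERCENT DO NOTIFY
               ON 100 PERCENT DO SUSPEND;

    ALTER WAREHOUSE bi_wh SET RESOURCE_MONITOR = bi_wh_monitor;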

Performance Tuning

The history profiler is the primary tool for spotting poorly written queries and making the appropriate changes.

14. Use history profiler to optimize queries.

The goal is to push the most expensive node to the bottom right-hand corner of the profiler diagram.

SYSTEM$CLUSTERING_DEPTH shows how effective the partitioning is – the smaller the average depth, the better clustered the table is with regard to the specified columns.

+ Hot tip: You can add a new automatic reclustering service, but I don’t think it is worth the money right now.
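
Checking clustering health is a one-liner; the table and column names below are placeholders:

    -- Smaller average depth = better clustering on the listed columns.
    SELECT SYSTEM$CLUSTERING_DEPTH('sales_fact', '(sale_date, region_id)');

    -- More detail, including micro-partition overlap statistics.
    SELECT SYSTEM$CLUSTERING_INFORMATION('sales_fact', '(sale_date, region_id)');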

15. Analyze Bytes Scanned: remote vs cache.

Aim for the Bytes Scanned column to show “Cache” or “Local” most of the time; otherwise, consider creating a cluster key so scans prune more efficiently.

16. Keep the ratio of partitions scanned to total partitions as small as possible through pruning.
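
If pruning is poor, a cluster key on the columns most queries filter by is the usual lever; this is a sketch with placeholder names:

    -- Cluster on the common filter columns so scans can skip micro-partitions.
    ALTER TABLE sales_fact CLUSTER BY (sale_date, region_id);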

SQL Coding

The number one issue driving costs in a Snowflake deployment is poorly written code! Resist the tendency to just increase the power (and therefore the cost) and focus some time on improving your SQL scripts.

17. Drop temporary and transient tables when done using.

18. Don’t use “CREATE TABLE AS”; Snowflake penalizes truncate-and-reload patterns with Time Travel storage overhead. Instead, use “CREATE OR REPLACE TABLE.”

Again, use COPY INTO, not INSERT INTO.

Use staging tables to manage transformation of imported data.

Validate the data BEFORE loading into SF target tables.
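
A rough before/after sketch of tip 18, with placeholder table names:

    -- Instead of truncate-and-reload:
    --   TRUNCATE TABLE dim_customer;
    --   INSERT INTO dim_customer SELECT ... ;
    -- swap the table in a single statement, sourcing from a validated staging table:
    CREATE OR REPLACE TABLE dim_customer AS
    SELECT *
    FROM stg_customer
    WHERE is_valid = TRUE;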

19. Use ANSI Joins because they are better for the optimizer.

Use “JOIN ON a.id = b.id” format.

NOT the “WHERE a.id=b.id”.

20. Use “WITH” clauses for windowing instead of temp tables or sub-selects.
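
A small sketch covering tips 19 and 20 together (table and column names are placeholders):

    -- ANSI join syntax plus a WITH clause wrapping the window logic.
    WITH ranked_orders AS (
        SELECT o.customer_id,
               o.order_id,
               o.order_total,
               ROW_NUMBER() OVER (PARTITION BY o.customer_id
                                  ORDER BY o.order_date DESC) AS rn
        FROM orders o
        JOIN customers c
          ON c.customer_id = o.customer_id
    )
    SELECT customer_id, order_id, order_total
    FROM ranked_orders
    WHERE rn = 1;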

21. Don’t use ORDER BY. Sorting is very expensive!

Use integers over strings if you must order.

22. Don’t handle duplicate data using DISTINCT or GROUP BY.
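
If duplicates cannot be prevented upstream, one pattern (our sketch, not a Snowflake requirement) is to remove them once in the staging layer so downstream queries never need DISTINCT or GROUP BY just for correctness:

    -- Keep only the latest row per business key in staging (placeholder names).
    CREATE OR REPLACE TABLE stg_customer_dedup AS
    SELECT *
    FROM stg_customer
    QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id
                               ORDER BY loaded_at DESC) = 1;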

Storing

Finally, set up the Snowflake deployment to work well in your entire data ecosystem.

23. Locate your S3 buckets in the same geographic region as your Snowflake account.

24. Set up the buckets to match how the files are coming across (eg by date or application).

25. Keep files between 60 and 100 MB to take advantage of parallelism.

26. Don’t use materialized views except in specific use cases (e.g., pre-aggregating).
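
For the pre-aggregation case, a minimal sketch (table and column names are placeholders; materialized views are an Enterprise edition feature):

    -- Pre-aggregate a large fact table; Snowflake keeps the view in sync.
    CREATE MATERIALIZED VIEW daily_sales_mv AS
    SELECT sale_date,
           region_id,
           SUM(order_total) AS total_sales,
           COUNT(*) AS order_count
    FROM sales_fact
    GROUP BY sale_date, region_id;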

Snowflake is shifting the paradigm when it comes to data warehousing in the cloud. However, because it fundamentally processes data differently from other solutions, Snowflake brings a whole new set of implementation challenges.

Whether you’re looking for support implementing Snowflake or need to drive better performance, 2nd Watch’s Snowflake Value Accelerator will help save you money on your Snowflake investment. Click here to learn more.


A CTO’s Guide to a Modern Data Platform: What is Snowflake, How is it Different, and Where Does it Fit in Your Ecosystem?

Chances are, you’ve been here before – a groundbreaking new data and analytics technology has started making waves in the market, and you’re trying to gauge the right balance between marketing hype and reality. Snowflake promises to be a self-managing data warehouse that can get you speed-to-insight in weeks, as opposed to years. Does Snowflake live up to the hype? Do you still need to approach implementation with a well-defined strategy? The answer to both of these questions is “yes.”

What Is Snowflake and How Is It Different?

Massive scale, low overhead

Snowflake is one of the few enterprise-ready cloud data warehouses that brings simplicity without sacrificing features. It automatically scales, both up and down, to get the right balance of performance vs. cost. Snowflake’s claim to fame is that it separates compute from storage. This is significant because almost every other database, Redshift included, combines the two together, meaning you must size for your largest workload and incur the cost that comes with it.

With Snowflake, you can store all your data in a single place and size your compute independently. For example, if you need near-real-time data loads for complex transformations, but have relatively few complex queries in your reporting, you can spin up a massive Snowflake warehouse for the data load and scale it back down once it completes – all in real time. This saves on cost without sacrificing your solution goals.

Elastic Development and Testing Environments

Development and testing environments no longer require duplicate database environments. Rather than creating multiple clusters for each environment, you can spin up a test environment as you need it, point it at the Snowflake storage, and run your tests before moving the code to production. With Redshift, you’d feel the maintenance and cost impact of three clusters (development, test, and production) all running together. With Snowflake, you stop paying as soon as your workload finishes because Snowflake charges by the second.
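
As a hedged sketch of that workflow (database and warehouse names are placeholders, and the zero-copy clone is one illustrative way to “point at” production data):

    -- A test database that shares production's storage via zero-copy cloning.
    CREATE DATABASE analytics_test CLONE analytics_prod;

    -- A small warehouse that suspends itself the moment tests finish.
    CREATE WAREHOUSE test_wh WITH
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_SUSPEND = 60
      AUTO_RESUME = TRUE;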

With the right DevOps processes in place for CI/CD (Continuous Integration/Continuous Delivery), testing each release looks closer to modern application development than to a traditional data warehouse process. Imagine trying to do this in Redshift.

Avoiding FTP with External Data Sharing

The separated storage and compute also enables some other differentiating features, such as data sharing. If you’re working with external vendors, partners, or customers, you can share your data, even if the recipient is not a Snowflake customer. Behind the scenes, Snowflake is creating a pointer to your data (with your security requirements defined). If you commonly write scripts to share your data via FTP, you now have a more streamlined, secure, and auditable path for accessing your data outside the organization. Healthcare organizations, for example, can create a data share for their providers to access, rather than cumbersome manual processes that can lead to data security nightmares.
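
Behind the scenes, setting up a share amounts to a handful of grants; this sketch uses placeholder database, schema, table, and account names:

    -- Create the share, grant read access, and add the consumer account.
    CREATE SHARE provider_share;
    GRANT USAGE ON DATABASE analytics_prod TO SHARE provider_share;
    GRANT USAGE ON SCHEMA analytics_prod.claims TO SHARE provider_share;
    GRANT SELECT ON TABLE analytics_prod.claims.claim_summary TO SHARE provider_share;
    ALTER SHARE provider_share ADD ACCOUNTS = partner_account;
    -- Recipients without their own Snowflake account can be given a managed reader account instead.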

Where Snowflake Fits Into Your Ecosystem

Snowflake is a part of your data ecosystem, but it’s not in a silo.

Always keep this at the top of your mind. A modern data platform involves not only analytics, but application integration, data science, machine learning, and many other components that will evolve with your organization. Snowflake solves the analytics side of the house, but it’s not built for the rest.

When you’re considering your Snowflake deployment, be sure to draw out the other possible components, even if future tools are not yet known. Knowing which Snowflake public cloud flavor to choose (Azure or AWS) will be the biggest decision you will make. Do you see SQL Server, Azure ML, or other Azure PaaS services in the mix; or is the AWS ecosystem more likely to fit better in the organization?

As a company, Snowflake has clearly recognized that they aren’t built for every type of workload. Snowflake partnered with Databricks to allow heavy data science and other complex workloads to run against your data. The recent partnership with Microsoft will ensure Azure services continue to expand their Snowflake native integrations – expect to see a barrage of new partnership announcements during the next 12 months.

If you have any questions or want to learn more about how Snowflake can fit into your organization, contact us today.



Comparing Modern Data Warehouse Options

To remain competitive, organizations are increasingly moving towards modern data warehouses, also known as cloud-based data warehouses or modern data platforms, instead of traditional on-premise systems. Modern data warehouses differ from traditional warehouses in the following ways:

    • There is no need to purchase physical hardware.
    • They are less complex to set up.
    • It is much easier to prototype and provide business value without having to build out the ETL processes right away.
    • There is no capital expenditure and a low operational expenditure.
    • It is quicker and less expensive to scale a modern data warehouse.
    • Modern cloud-based data warehouse architectures can typically perform complex analytical queries much faster because of how the data is stored and their use of massively parallel processing (MPP).

Modern data warehousing is a cost-effective way for companies to take advantage of the latest technology and architectures without the upfront cost to purchase, install, and configure the required hardware, software, and infrastructure.

Comparing Modern Data Warehousing Options

  • Traditional data warehouse deployed on infrastructure as a service (IaaS): Requires customers to install traditional data warehouse software on computers provided by a cloud provider (e.g., Azure, AWS, Google).
  • Platform as a service (PaaS): The cloud provider manages the hardware deployment, software installation, and software configuration. However, the customer is responsible for managing the environment, tuning queries, and optimizing the data warehouse software.
  • Software as a service (SaaS): In a SaaS approach, software and hardware upgrades, security, availability, data protection, and optimization are all handled for you. The cloud provider supplies all hardware and software as part of its service, as well as aspects of managing the hardware and software.

With all of the above scenarios, the tasks of purchasing, deploying, and configuring the hardware to support the data warehouse environment fall on the cloud provider instead of the customer.

IaaS, PaaS, and SaaS – What Is the Best Option for My Organization?

Infrastructure as a service (IaaS) is an instant computing infrastructure, provisioned and managed over the internet. It helps you avoid the expense and complexity of buying and managing your own physical servers and other data center infrastructure. In other words, if you’re prepared to buy the engine and build the car around it, the IaaS model may be for you.

In the scenario of platform as a service (PaaS), a cloud provider merely supplies the hardware and its traditional software via the cloud; the solution is likely to resemble its original, on-premise architecture and functionality. Many vendors offer a modern data warehouse that was originally designed and deployed for on-premises environments. One such technology is Amazon Redshift. Amazon acquired rights to ParAccel, named it Redshift, and hosted it in the AWS cloud environment. Redshift is a highly successful modern data warehouse service. It is easy in AWS to instantiate a Redshift cluster, but then you need to complete all of the administrative tasks.

You have to reclaim space after rows are deleted or updated (the process of vacuuming in Redshift), manage capacity planning, provisioning compute and storage nodes, determine your distribution keys, etc. All of the things you had to do with ParAccel (or with any traditional architecture), you have to do with Redshift.

Alternatively, a data warehouse built for the cloud on a true software as a service (SaaS) architecture lets the cloud provider include all hardware and software as part of its service, along with the work of managing them. One such technology, which requires no management and features separate compute, storage, and cloud services that can scale and change independently, is Snowflake. It differentiates itself from IaaS and PaaS cloud data warehouses because it was built from the ground up on cloud architecture.

All administrative tasks, tuning, patching, and management of the environment fall on the vendor. Unlike the IaaS and PaaS architectures we have seen in the market today, Snowflake has a new multi-cluster, shared data architecture that essentially makes the administrative headache of maintaining those solutions go away. However, that doesn’t mean it’s the absolute right choice for your organization – that’s where an experienced consulting partner like 2nd Watch comes in.

If you depend on your data to better serve your customers, streamline your operations, and lead (or disrupt) your industry, a modern data platform built on the cloud is a must-have for your organization. Contact us to learn what a modern data warehouse would look like for your organization.