One of the benefits to 2nd Watch’s partnership with Snowflake is access to advanced trainings and certifications. Combined with our Snowflake project work, these trainings help us become Snowflake experts and find opportunities to assist our clients in making Snowflake work better for them. The most recent training helped us identify some important tactics for solving one of clients’ biggest concerns: How do I optimize for cost? Here is a list of actions that you should take to make sure you are not overspending on your Snowflake computation or storage.
First, for context, here is an extremely simplified diagram of how Snowflake is functioning in the background:
Since the most expensive part of any Snowflake deployment is compute, we have identified some useful tactics to store data strategically for efficient reads, write supercharged SQL scripts, and balance your performance vs cost.
Although very different than storing data on traditional disk, there are many benefits to loading Snowflake data strategically.
1. Sort on ingestion: Data is automatically partitioned in SF on natural ingestion order.
– Sorting an S3 bucket (using something like syncsort) before bulk load via copy could be way faster than inserting with an order by
2. CSV (Gzipped) is the best format for loading to SF (2-3x faster than Parquet or ORC).
3. Use COPY INTO instead of INSERT because it utilizes the more efficient bulk loading processes.
Take advantage of the native cloud ability to scale, create, and optimize your compute resources.
4. Scale up or out appropriately.
– As seen above, when you run a query, Snowflake will:
+ Find required FDN files.
+ Pull files down into SSD VMs. (Note: If >160 GB for AWS or >400 GB for Azure, will spill over to remote IO.)
+ Perform compute.
+ Keep files on VM until DW is suspended.
– 1 big query = increase size of data warehouse
– Lots of small queries = queries are queuing = increase # of DWs or # of clusters (if enterprise, you can enable multi-cluster)
5. Turn your VW on and off for certain workloads.
–Turn on for batch, then immediately turn off (no reason to wait for auto-suspend).
–Use auto-resume when it makes sense.
6. Control query processing and concurrency with parameters.
– Statement queued timeout in seconds
– Statement timeout in seconds
7. Use warehouse monitoring to size and limit cost per workload (not per database → this is a shift from the on-prem mentality).
– If your workload is queuing, then add more clusters.
– If your workload is slow with no queuing, then size up.
Often overlooked, organizing your information into a mature data model will allow for high-performance SQL scripting better caching potential. Shameless plug, this is 2nd Watch’s bread and butter. Please reach out for a free working session to discuss data modeling for your company.
8. Do a data model for analytics.
– Star Schema, 3NF, and data vault are optimal for SF.
– Snowflake is NOT ideal for OLTP workloads.
9. Bake your constraints into design because SF DOES NOT enforce them.
– Build queries to check for violations
10. Build a process to alert you of loading issues (use an ETL framework).
Snowflake preserves a massive amount of usage data for analysis. At the very least, it allows you to see which workflows are the most expensive.
11. Use Account Usage Views (eg warehouse_metering_history) for tracking history, performance, and cost.
12. Don’t use AccountAdmin or Public roles for creating objects or accessing data (only for looking at costs)…create securable objects with the “correct” role and integrate new roles into the existing hierarchy.
– Create roles by business functions to track spending by line of business.
13. Use Resource Monitors to cut off DWs when you hit predefined credit amount limits.
– Create one resource monitor per DW.
– Enable notifications.
The history profiler is the primary tool to observe poorly written queries and make the appropriate changes.
14. Use history profiler to optimize queries.
– Goal is to put the most expensive node in the bottom right hand corner of profiler diagram.
– SYSTEM$CLUSTERING_DEPTH shows how effective the partitions are – the smaller the average depth, the better clustered the table is with regards to the specified columns.
+ Hot tip: You can add a new automatic reclustering service, but I don’t think it is worth the money right now.
15. Analyze Bytes Scanned: remote vs cache.
– Make your Bytes Scanned column use “Cache” or “Local” memory most of the time, otherwise consider creating a cluster key to scan more efficiently.
16. Make the ratio of partitions scanned to partition used as small as possible by pruning.
The number one issue driving costs in a Snowflake deployment is poorly written code! Resist the tendency to just increase the power (and therefore the cost) and focus some time on improving your SQL scripts.
17. Drop temporary and transient tables when done using.
18. Don’t use “CREATE TABLE AS”; SF hates trunc and reloads for time travel issues. Instead, use “CREATE OR REPLACE.”
– Again, Use COPY INTO not INSERT INTO.
– Use staging tables to manage transformation of imported data.
– Validate the data BEFORE loading into SF target tables.
19. Use ANSI Joins because they are better for the optimizer.
– Use “JOIN ON a.id = b.id” format.
– NOT the “WHERE a.id=b.id”.
20. Use “WITH” clauses for windowing instead of temp tables or sub-selects.
21. Don’t use ORDER BY. Sorting is very expensive!
– Use integers over strings if you must order.
22. Don’t handle duplicate data using DISTINCT or GROUP BY.
Finally, set up the Snowflake deployment to work well in your entire data ecosystem.
23. Locate your S3 buckets in the same geographic region.
24. Set up the buckets to match how the files are coming across (eg by date or application).
25. Keep files between 60-100 MB to take advantage of parallelism.
26. Don’t use materialized views except in specific use cases (e.g., pre-aggregating).
Snowflake is shifting the paradigm when it comes to data warehousing in the cloud. However, by fundamentally processing data differently than other solutions, Snowflake has a whole new set of challenges for implementation.
Whether you’re looking for support implementing Snowflake or need to drive better performance, 2nd Watch’s Snowflake Value Accelerator will help save you money on your Snowflake investment. Click here to learn more.