Snowpark: Streamlining Workflow in Big Data Processing and Analysis

The Snowflake Data Cloud’s utility expanded further with the introduction of its Snowpark API in June of 2021. Snowflake has staked its claim as a significant player in cloud data storage and accessibility, enabling workloads including data engineering, data science, data sharing, and everything in between.

Snowflake provides a unique single engine with instant elasticity that is interoperable across different clouds and regions so users can focus on getting value out of their data, rather than trying to manage it. In today’s data-driven world, businesses must be able to quickly analyze, process, and derive insights from large volumes of data. This is where Snowpark comes in.

Snowpark expands Snowflake’s functionality, enabling users to leverage the full power of programming languages and libraries within the Snowflake environment. The Snowpark API provides a new framework for developers to bring DataFrame-style programming to common programming languages like Python, Java, and Scala. By integrating Snowpark into Snowflake, users can perform advanced data transformations, build complex data pipelines, and execute machine learning algorithms seamlessly.

This interoperability empowers organizations to extract greater value from their data, accelerating their speed of innovation.

What is Snowpark?

Snowpark’s API enables data scientists, data engineers, and software developers to perform complex data processing tasks efficiently and seamlessly. It eliminates the need to move data out of Snowflake by providing a high-level programming interface that lets users write and execute code in their preferred programming language, all within the Snowflake platform. Snowpark comprises a client-side library and a server-side sandbox, so users can work with their preferred tools and languages while leveraging the benefits of Snowflake virtual warehouses.

When developing applications, users can leverage the capabilities of Snowpark’s DataFrame API to process and analyze complex data structures and support various data processing operations such as filtering, aggregations, and sorting. In addition, users can create User Defined Functions (UDFs), whose code the Snowpark library uploads to an internal stage in Snowflake and which, when called, execute on the server side.

This lets users create custom functions to process and transform data according to their specific needs, with greater flexibility and customization in data processing and analysis. DataFrames are evaluated lazily, meaning they only run when an action to retrieve, store, or view the data they represent is invoked. Users write code against the client-side API, but execution happens inside Snowflake, so no data leaves the platform unless the application explicitly asks for it.
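To make this concrete, here is a minimal sketch of the pattern in Snowpark for Python. It assumes an existing Snowflake account; the connection parameters, the ORDERS table, and its columns are hypothetical placeholders.

    from snowflake.snowpark import Session
    from snowflake.snowpark.functions import col, sum as sum_, udf
    from snowflake.snowpark.types import FloatType

    # Connection parameters are placeholders; supply your own account details.
    session = Session.builder.configs({
        "account": "<account_identifier>",
        "user": "<user>",
        "password": "<password>",
        "warehouse": "<warehouse>",
        "database": "<database>",
        "schema": "<schema>",
    }).create()

    # Build a DataFrame lazily: nothing has run in Snowflake yet.
    orders = session.table("ORDERS")
    totals_by_region = (
        orders
        .filter(col("ORDER_AMOUNT") > 1000)                 # filtering
        .group_by("REGION")                                 # aggregation
        .agg(sum_(col("ORDER_AMOUNT")).alias("TOTAL"))
        .sort(col("TOTAL").desc())                          # sorting
    )

    # A Python UDF: Snowpark uploads this code to a stage and runs it server-side.
    @udf(return_type=FloatType(), input_types=[FloatType()])
    def apply_discount(amount: float) -> float:
        return amount * 0.9

    discounted = totals_by_region.select(col("REGION"), apply_discount(col("TOTAL")))

    # Only when an action is called does Snowpark generate SQL and execute it in Snowflake.
    discounted.show()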

Moreover, users can build queries within the DataFrame API, providing an easy way to work with data in the Structured Query Language (SQL) framework while using common languages like Python, Java, and Scala. Those queries are converted to SQL by Snowpark, and computation is distributed through Snowflake’s Elastic Performance Engine, which works across multiple clouds and regions.

With its support for the DataFrame API and UDFs, and its seamless integration with data in Snowflake, Snowpark is an ideal tool for data scientists, data engineers, and software developers who need to work with big data quickly and efficiently.

Snowpark for Python

With the growth in data science and machine learning (ML) in past years, Python is closing the gap on SQL as a popular choice for data processing. Both are powerful in their own right, but they’re most valuable when they’re able to work together. Knowing this, Snowflake built Snowpark for Python “to help modern analytics, data engineering, data developers, and data science teams generate insights without complex infrastructure management for separate languages” (Snowflake, 2022). Snowpark for Python enables users to build scalable data pipelines and machine-learning workflows while utilizing the performance, elasticity, and security benefits of Snowflake.

Furthermore, with Snowflake virtual warehouses optimized for Snowpark, machine learning training is now possible: these warehouses can process larger data sets by providing additional resources such as CPU, memory, and temporary storage. This supports Snowpark functions including the execution of SQL statements that require compute resources (e.g., retrieving rows from tables) and Data Manipulation Language (DML) operations such as updating rows in tables, loading data into tables, and unloading data from tables.
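As an illustration, and reusing the hypothetical session from the sketch above, a Snowpark-optimized warehouse can be created and used through plain SQL issued from the API. The warehouse name, size, and the DML statement are assumptions for illustration only.

    # Create a Snowpark-optimized warehouse for memory-hungry work such as ML training.
    session.sql(
        "CREATE WAREHOUSE IF NOT EXISTS ML_TRAINING_WH "
        "WITH WAREHOUSE_SIZE = 'MEDIUM' WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'"
    ).collect()

    # Point the session at it, then run a DML statement that needs compute resources.
    session.use_warehouse("ML_TRAINING_WH")
    session.sql(
        "UPDATE ORDERS SET STATUS = 'ARCHIVED' WHERE ORDER_DATE < '2020-01-01'"
    ).collect()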

With the compute infrastructure to execute memory-intensive operations, data scientists and teams can further streamline ML pipelines at scale with the interoperability of Snowpark and Snowflake.

Snowpark and Apache Spark 

If you’re familiar with the world of big data, you may know a thing or two about Apache Spark. In short, Spark is a distributed system used for big data processing and analysis.

While Apache Spark and Snowpark share similar utilities, there are some distinct differences and advantages to leveraging Snowpark over Apache Spark. Within Snowpark, users can manage all data within Snowflake, as opposed to needing to transfer data to Spark. This not only streamlines workflows but also eliminates the potential adverse effects of moving sensitive data out of the databases you’re working in and into a new ecosystem.

Additionally, the ability to remain in the Snowflake ecosystem simplifies processing by reducing the complexity of setup and management. While Spark requires significant hands-on time due to its more complicated setup, working with Snowpark requires essentially none: you simply choose a warehouse and are ready to run commands within the database of your choosing.

Another major advantage Snowpark offers against its more complex counterpart is the simplified security measures. Leveraging the same security architecture that is in place within Snowflake eliminates the need to build out a specific complex security protocol like what is necessary within Spark.

The interoperability of Snowpark within the Snowflake ecosystem provides an assortment of advantages when compared with Apache Spark. Being a stand-alone processing engine, Spark comes with a significant amount of complexity from setup, ongoing management, transference of data, and creating specific security protocols. By choosing Snowpark, you opt out of the unnecessary complexity and into a streamlined functional process that can improve the efficiency and accuracy of any actions surrounding the big data you are handling – two things that are front of mind for any business in any industry whose decisions are derived from their ability to process and analyze complex data.

Why It Matters

Regardless of the industry, there is a growing need to process big data and understand how to leverage it for maximum value. Looking specifically at Snowpark’s API, a simplified programming interface with support for UDFs makes it far easier to process large data volumes in the user’s programming language of choice. Uniting that simplicity with all the benefits of the Snowflake Data Cloud platform creates a unique opportunity for businesses to take advantage of.

As a proud strategic Snowflake consulting partner, 2nd Watch recognizes the unique value that Snowflake provides. We have a team of certified SnowPros to help businesses implement and utilize their powerful cloud-based data warehouse and all the possibilities that their Snowpark API has to offer.

In a data-rich world, the ability to democratize data across your organization and make data-driven decisions can accelerate your continued growth. To learn more about implementing the power of Snowflake with the help of the 2nd Watch team, contact us and start extracting all the value your data has to offer.


Why the Healthcare Industry Needs to Modernize Analytics

It’s difficult to achieve your objectives when the goalposts are always in motion. Yet that’s often the reality for the healthcare industry. Ongoing changes in competition, innovation, regulation, and care standards demand real-time insight. Otherwise, it’s all too easy to miss watershed moments to change, evolve, and thrive.

Advanced or modernized analytics are often presented as the answer to reveal these hidden patterns, trends, or predictive insights. Yet when spoken about in an abstract or technical way, it’s hard to imagine the tangible impact that unspecified data can have on your organization. Here are some of the real-world use cases of big data analytics in healthcare, showing the valuable and actionable intelligence within your reach.

Improve Preventative Care

It’s been reported that six in ten Americans suffer from chronic diseases that impact their quality of life – many of which are preventable. Early identification and intervention reduce the risk of long-term health problems, but only if organizations can accurately identify vulnerable patients or members. The success of risk scoring depends on a tightrope walk between population-level overviews and individual specifics – a feat that requires a holistic view of each patient or member.

A wide range of data contributes to risk scoring (e.g., patient/member records, social health determinants, etc.) and implementation (e.g., service utilization, outreach results, etc.). With data contained in an accessible, centralized infrastructure, organizations can pinpoint at-risk individuals and determine how best to motivate their participation in their preventive care. This can reduce instances of diabetes, heart disease, and other preventable ailments.

Encouraging healthy choices and self-care is just one potential example. Big data analytics has also proven to be an effective solution for preventing expensive 30-day hospital readmissions. Researchers at the University of Washington Tacoma used a predictive analytics model on clinical data and demographic metrics to predict which congestive heart failure patients would return, with accurate results.

From there, other organizations have repurposed the same algorithmic framework to identify other preventable health issues and reduce readmission-related costs. One Chicago-based health system implemented a data-driven nutrition risk assessment that identified those patients at risk for readmissions. With that insight, they employed programs that combated patient malnutrition, cut readmissions, and saved $4.8 million. Those are huge results from one data set.

Boost Operational Efficiency

It’s well known that healthcare administrative costs in the United States are excessive. But it’s hard to keep your jaw from hitting the floor when you learn Canadian practices spend 27% of what U.S. organizations do for the same claims processing. That’s a clear sign of operational waste, yet one that doesn’t automatically illuminate the worst offenders. Organizations can shine a light on wastage with proper healthcare analytics and data visualizations.

For instance, the right analytics and BI platform is capable of accelerating improvements. It can cross-reference patient intake data, record-keeping habits, billing- and insurance-related costs, supply chain expenses, employee schedules, and other data points to extract hidden insight. With BI visualization tools, you can obtain actionable insight and make adjustments in a range of different functions and practices.

Additionally, predictive analytics solutions can help improve forecasting for both provider and payer organizations. For healthcare providers, a predictive model can help anticipate fluctuations in patient flow, enabling an appropriate workforce response to patient volume. Superior forecasting at this level manages to reduce two types of waste: labor dollars from overscheduling and diminished productivity from under-scheduling.

Enhance Insurance Plan Designs

There is a distinct analytics opportunity for payers, third-party administrators, and brokers: enhancing their insurance plan designs. Whether you want to retain or acquire customers, your organization’s ability to provide a more competitive and customized plan than the competition will be a game-changer.

All of the complicated factors that contribute to the design of an effective insurance plan can be streamlined. Though most organizations have lots of data, it can be difficult to discern the big picture. But machine learning programs have the ability to take integrated data sources such as demographics, existing benefit plans, medical and prescription claims, risk scoring, and other attributes to build an ideal individualized program. The result? Organizations are better at catering to members and controlling costs.

Plenty of Other Use Cases Exist

And these are just a sample of what’s possible. Though there are still new and exciting ways you can analyze your data, there are also plenty of pre-existing roadmaps to elicit incredible results for your business. To get the greatest ROI, your organization needs guidance through the full potential of these groundbreaking capabilities.

Want to explore the possibilities of data analytics in healthcare situations? Learn more about our healthcare data analytics services and schedule a no-cost strategy session.


Healthcare Big Data Use Cases and Benefits

Improving Care Quality, Reducing Nurse Turnover, and More

Hospital patient satisfaction (HCAHPS scores) and nurse turnover are two big issues hospitals are currently facing. Both issues have major implications with regard to patient outcomes and hospital efficiency and profitability. For example: On top of an already-high nurse turnover rate of 19.1%, the US healthcare system is facing an alarming rate of nurse retirements by the end of 2022. The cost of turnover is also incredibly high – it is estimated that each percentage point of nurse turnover costs an average hospital almost $350,000 in terms of recruiting, onboarding, and training. And of course, high nurse turnover has a negative impact on patient care.

To reduce the rate of nurse turnover, care facilities and hospitals must understand the causes of turnover, address those issues, and track their progress to adjust as needed. Ideally, hospitals would survey patients and nurses to receive feedback directly. However, response rates are typically very low. The alternative? Harness the power of healthcare big data. Healthcare big data can be used in a variety of use cases with several key benefits:

Big Data Use Cases in Healthcare

  • Capture publicly available data across social media and online review sites to aggregate sentiment and find patterns.
  • Understand the root causes of various issues, including such issues as HCAHPS (Hospital Consumer Assessment of Healthcare Providers and Systems) quality scores, nurse turnover, and patient satisfaction.
  • Benchmark your results against national data for similar organizations.
  • Forecast how changes in the healthcare industry will affect sentiment in the future.

Benefits of Big Data in Healthcare

  • Improve patient care quality.
  • Reduce nurse turnover rates.
  • Increase patient acquisition.
  • Ultimately, help improve patient care and healthcare outcomes.

There’s a wealth of information on social media and online review sites like Glassdoor, Yelp, Google reviews, Becker’s, Facebook, and WebMD. We investigated how a Chicago-area hospital could use this publicly available data in conjunction with HCAHPS survey results and nurse satisfaction surveys to address the above healthcare big data use cases and reap the benefits.

Download the white paper to learn about our findings.


3 Common Limitations with Business Intelligence Tools and How to Fix Them with Data Science

If you’re like most modern data-driven organizations, you’re probably already using business intelligence tools such as Power BI, Tableau, or Looker to visualize various KPIs, trends, and other detailed information related to daily functions. When implemented correctly, these tools help you quickly answer questions about what is CURRENTLY happening and make your day-to-day tasks and vital business decisions much more informed and effective. One of the most common limitations with business intelligence tools, however, is that they don’t often enable you to predict what’s likely to happen in the FUTURE.

Case Study: See the benefits of data science in action in this case study on how we used a machine learning solution to predict customer lifetime value (CLV) for a Chicago retailer.

While understanding your past and current state is a great start, I’ll let you in on a little-known secret: adding predictive analytics and automated insights into your existing dashboards can give you better insights, can help you predict future outcomes, and is not as difficult as you may think. Data science tactics such as statistical modeling and machine learning make identifying what may happen just as accessible as understanding what has happened or is currently happening. This allows you to respond quickly to changing conditions or get in front of potential business challenges.

 

Common Limitations with Business Intelligence Tools and How Data Science Can Help

 

Insights from BI Tools

  • What has happened in the past
  • What is currently happening

Insights from BI Tools + Data Science

  • What has happened in the past
  • What is currently happening
  • What may happen in the future

3 Common Challenges with Business Intelligence (BI) Implementations

Typical BI implementations allow business users to easily consume data specific to their goals and daily tasks. The ability to analyze both past and present events unlocks information about the current state and is essential for remaining competitive in today’s data forward market. With that in mind, there are some common limitations that many organizations encounter when relying on these tools alone.

Limitation 1: Useful Insights, Trends, and Patterns Arise Only When Looking at the Right Data, within the Proper Context

The good news is that a modern data warehouse eliminates the risk of reporting on inaccurate or untimely data by organizing information in a manner that enables fast and reliable reporting. That being said, you must also rely on your business users to ask the right questions to develop helpful reports, which often results in delayed discovery of vital insights and overlooked key data. You also have a higher chance of missing key insights due to human error and the inability of even efficient reporting to cover every segment of detailed data. Even the best dashboards can exclude important information because they focus solely on specific business questions.

For example, at a logistics company, dashboarding shows every detail of the supply chain and warehouse inventory. With so many variables that could affect the timeliness of your orders (number of employees, truck availability windows, congestion in areas of warehouses, etc.), it’s nearly impossible to combine all of the information and see the bigger picture in a timely manner, especially when changes are happening in real time. It is also hard for a single person to separate individual events from their overall effects. With machine learning, you can ingest large amounts of data to identify orders at risk of being late based on key variables. Using statistical techniques, you can trace the sources of inefficiency by cutting through the noise in your data to find systemic issues.
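A minimal sketch of that idea, assuming historical orders have already been exported with a binary was_late outcome; the file names, feature columns, and risk threshold are hypothetical.

    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    # Historical orders with operational features and a known outcome (was_late: 0/1).
    orders = pd.read_csv("historical_orders.csv")
    features = ["staff_on_shift", "trucks_available", "warehouse_congestion", "order_size"]

    X_train, X_test, y_train, y_test = train_test_split(
        orders[features], orders["was_late"], test_size=0.2, random_state=42
    )
    model = GradientBoostingClassifier().fit(X_train, y_train)
    print(f"Holdout accuracy: {model.score(X_test, y_test):.2f}")

    # Score open orders and flag the ones most at risk of missing their window.
    open_orders = pd.read_csv("open_orders.csv")
    open_orders["late_risk"] = model.predict_proba(open_orders[features])[:, 1]
    at_risk = open_orders[open_orders["late_risk"] > 0.7]
    print(at_risk.sort_values("late_risk", ascending=False).head())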

Limitation 2: There Is a Reliance on Static, and Sometimes Arbitrary Business Rules

Many effective dashboards use benchmark metrics to show whether a department is doing well. For example, a sales organization has BI tools that use data to track engagement with their leads. Under their current business rules, a lead is considered “cold” if there hasn’t been communication in five days. When a lead goes cold, the sales and management teams are alerted so action can be taken to re-engage the lead. A good dashboard would present the number of cold leads and the number of leads at risk of becoming cold. But how do you really know that five days is the appropriate amount of time? What if millions of data points show that leads are likely to go cold if you haven’t contacted them in two days? That could be a lot of missed opportunities. In a fast-changing environment, business rules set by people may be misunderstood, inaccurate, or outdated depending on the context of your business questions.
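Here is one way to sanity-check a five-day rule against your own history, as a rough sketch; the CSV and its columns are hypothetical.

    import pandas as pd

    # One row per closed lead: the longest gap (in days) between contacts, and whether it converted.
    leads = pd.read_csv("lead_history.csv")  # columns: days_since_contact, converted (0/1)

    # Conversion rate by gap length, straight from the data rather than a hand-set rule.
    conversion_by_gap = (
        leads.groupby("days_since_contact")["converted"]
        .mean()
        .rename("conversion_rate")
    )
    print(conversion_by_gap.head(10))

    # A data-driven candidate for the "cold" threshold: the first gap at which
    # conversion falls below half of the day-zero rate.
    baseline = conversion_by_gap.iloc[0]
    cold_threshold = conversion_by_gap[conversion_by_gap < 0.5 * baseline].index.min()
    print(f"Leads look cold after roughly {cold_threshold} days without contact")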

Limitation 3: Since Most BI Tools Utilize Historical Data, They Lend Themselves to Highlighting Past Events Rather Than Future Ones

Visualizations based on this information are framed around questions of what has happened or is happening. While there is no doubt that understanding the past is essential to improving future decision-making, adding on a layer of predictive analytics would enable a culture of reactive, data-driven decisions to shift toward more forward-thinking and innovative choices.

Using advanced analytics to look toward the future is a practice all businesses should employ. To illustrate the significant impact this practice can have, we will look to the healthcare industry. Many medical providers track their readmission rates – how often a patient returns with related health problems after being discharged. This metric helps evaluate the quality of care, among other factors. Using data science, providers can zero in on the subsets of patients who pose a high risk of readmission. This gives healthcare providers real-time knowledge of their most at-risk patients, allowing them to take proactive action so their patients leave healthy and with less chance of readmission. This proactive approach is much more effective than looking back at historical data to later figure out which subsets of patients had higher readmission rates.
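As a hedged sketch of how that might look, assume a discharge history with a 30-day readmission flag; the data source, features, and model choice are illustrative assumptions.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    # Historical discharges with a 30-day readmission outcome (readmitted_30d: 0/1).
    history = pd.read_csv("discharge_history.csv")
    features = ["age", "length_of_stay", "num_prior_admissions", "num_chronic_conditions"]

    model = LogisticRegression(max_iter=1000).fit(history[features], history["readmitted_30d"])

    # Score patients currently in the hospital and surface the highest-risk cases to care
    # teams before discharge, instead of reviewing readmissions after the fact.
    current = pd.read_csv("current_inpatients.csv")
    current["readmission_risk"] = model.predict_proba(current[features])[:, 1]
    watch_list = current.sort_values("readmission_risk", ascending=False).head(20)
    print(watch_list[["patient_id", "readmission_risk"]])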

How to Solve These Challenges with Data Science

Upgrading Your BI Tool with Data Science Enhances Your Current Information by Answering “What Will Happen”

Data science helps businesses extract insights from large amounts of data and create outputs to automatically detect significant changes that may arise from patterns spotted in data. In many cases, it is because of the benefits of data science initiatives that companies begin to see significant ROI on their data investments. This is because data science better equips you to:

  • Make predictions for future events based on trends in historical data
  • Detect significant changes in business events and determine their outcomes
  • Assess potential results of business decisions
  • Analyze broad sets of data with many inputs to find key insights
  • Understand data points that impact the whole company rather than a specific siloed department

While the benefits of data science are undisputed, for many organizations, data science initiatives seem unapproachable. Whether it’s because your data science team finds it difficult to consistently communicate insights, there is a lack of understanding as to how a prediction is being reached, or you don’t know where to start because the process seems so large-scale, your company is not alone. One of the easiest ways to tackle these barriers is to combine your current BI tool and analytics practices with data science.

4 Benefits of Adding Data Science to Your Business Intelligence

Combining data science and business intelligence helps overcome many of the previously discussed challenges of both data science and BI tools by:

Incorporating Data Science Findings with Current BI Tools

This enables consistent communication of data science findings in a clear-cut and approachable manner. Rather than requiring a data scientist to explain each output via a presentation, results can be communicated to business users through clear visualizations in a tool they are already familiar with. This drives higher levels of adoption by business users and can help data science teams find additional use cases to bolster trust and collaboration.

Answering Questions of Trends and Patterns with Quantifiable Values

As mentioned before, the use of arbitrary and static business rules can have a major, negative impact on your decisions. Data science allows business users to identify dynamic and definite metrics against which they can measure success. Incorporating these insights with existing dashboards and benchmarks furthers the value of this form of decision-making.

Enabling Dashboards to Pull Key Information from a Broader Data Set

Ingesting big data and picking up patterns at the deepest level of business data through data science practices increases operational efficiency and insight generation. Data science algorithms learn from detailed historical data, then forecast on new data to detect the most relevant points to communicate to business users in real time.

Shifting the Focus of Certain Dashboards to be More Forward-Looking

By integrating the outputs of custom-built forecasting models with existing dashboards, business users can compare the current trajectory and history with predicted future values. This enables forward-thinking decision-making and provides a natural platform for applying advanced insights to existing analysis.
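As a rough sketch of that pattern, assume a table of monthly revenue actuals and a deliberately simple trend model; in practice you would swap in whichever forecasting approach fits your data, and write the result wherever your dashboard reads from.

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Monthly actuals the dashboard already shows (columns: month, revenue).
    history = pd.read_csv("monthly_revenue.csv", parse_dates=["month"])
    t = np.arange(len(history)).reshape(-1, 1)

    # Fit a simple trend and project the next six months.
    model = LinearRegression().fit(t, history["revenue"])
    future_t = np.arange(len(history), len(history) + 6).reshape(-1, 1)
    future_months = pd.date_range(history["month"].max(), periods=7, freq="MS")[1:]

    forecast = pd.DataFrame({
        "month": future_months,
        "revenue": model.predict(future_t),
        "type": "forecast",
    })
    history["type"] = "actual"

    # One combined table the BI tool can plot: history and predicted future side by side.
    combined = pd.concat([history[["month", "revenue", "type"]], forecast], ignore_index=True)
    combined.to_csv("revenue_with_forecast.csv", index=False)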

Getting Started with Data Science Enhanced Business Intelligence

It’s hard to believe how accessible it is to gain predictive insights by updating your BI tool with data science capabilities. If you have a strong strategy already, it just takes implementing a data science solution and a bit of dashboard reworking. 2nd Watch has helped numerous organizations across multiple industries gain the benefits of data science through similar solutions using BI tools like Tableau, Power BI, Looker, and more.

Whether you have ideas on where to start and want to incorporate your results with your BI tool or you need help understanding how to best incorporate data science, reach out to our consultants at 2nd Watch to unlock these insights as quickly as possible.


3 Productivity-Killing Data Problems and How to Solve Them

With the typical enterprise using over 1,000 Software as a Service applications (source: Kleiner Perkins), each with its own private database, it’s no wonder people complain their data is siloed. Picture a thousand little silos, all locked up!

(Chart: Number of cloud applications used per enterprise, by industry vertical)

Then, imagine you start building a dashboard out of all those data silos. You’re squinting at it and wondering, can I trust this dashboard? You placate yourself because at least you have data to look at, but this creates more questions for which data doesn’t yet exist.

If you’re in a competitive industry, and we all are, you need to take your data analysis to the next level. You’re either gaining competitive advantage over your competition or being left behind.

As a business leader, you need data to support your decisions. These three data complexities are at the core of every leader’s difficulties with gaining business advantages from data:

  1. Siloed data
  2. Untrustworthy data
  3. No data

 

  1. Siloed data

Do you have trouble seeing your data at all? Are you mentally scanning your systems and realizing just how many different databases you have? A recent customer of ours was collecting reams of data from their industrial operations but couldn’t derive the data’s value due to the siloed nature of their datacenter database. The data couldn’t reach any dashboard in any meaningful way. It is a common problem. With enterprise data doubling every few years, it takes modern tools and strategies to keep up with it.

For our customer, we started with defining the business purpose of their industrial data – to predict demand in the coming months so they didn’t have a shortfall. That business purpose, which had team buy-in at multiple corporate levels, drove the entire engagement. It allowed us to keep the technology simple and focused on the outcome.

One month into the engagement, they had clean, trustworthy, valuable data in a dashboard. Their data was unlocked from the database and published.

Siloed data takes some elbow grease to access, but it becomes a lot easier if you have a goal in mind for the data. It cuts through noise and helps you make decisions more easily if you know where you are going.

  2. Untrustworthy data

Do you have trouble trusting your data? You have a dashboard, yet you’re pretty sure the data is wrong, or lots of it is missing. You can’t take action on it, because you hesitate to trust it. Data trustworthiness is a prerequisite for making your data action oriented. But, most data has problems – missing values, invalid dates, duplicate values, and meaningless entries. If you don’t trust the numbers, you’re better off without the data.

Data is there for you to take action on, so you should be able to trust it. One key strategy is to not bog down your team with maintaining systems, but rather use simple, maintainable, cloud-based systems that use modern tools to make your dashboard real.

  3. No data

Often you don’t even have the data you need to make a decision. “No data” comes in many forms:

  • You don’t track it. For example, you’re an ecommerce company that wants to understand how email campaigns can help your sales, but you don’t have a customer email list.
  • You track it but you can’t access it. For example, you start collecting emails from customers, but your email SaaS system doesn’t let you export your emails. Your data is so “siloed” that it effectively doesn’t exist for analysis.
  • You track it but need to do some calculations before you can use it. For example, you have a full customer email list, a list of product purchases, and you just need to join the two together. This is a great place to be and is where we see the vast majority of customers.

That last case means finding patterns and insights not just within datasets, but across them. This is only possible with a modern, cloud-native data lake.

The solution: define your business need and build a data lake

Step one for any data project – today, tomorrow and forever – is to define your business need.

Do you need to understand your customer better? Whether it is click behavior, email campaign engagement, order history, or customer service, your customer generates more data today than ever before that can give you clues as to what she cares about.

Do you need to understand your costs better? Most enterprises have hundreds of SaaS applications generating data from internal operations. Whether it is manufacturing, purchasing, supply chain, finance, engineering, or customer service, your organization is generating data at a rapid pace.

(AWS: What is a Data Lake?)

Don’t be overwhelmed. You can cut through the noise by defining your business case.

The second step in your data project is to take that business case and make it real in a cloud-native data lake. Yes, a data lake. I know the term has been abused over the years, but a data lake is very simple; it’s a way to centrally store all (all!) of your organization’s data, cheaply, in open source formats to make it easy to access from any direction.

Data lakes used to be expensive, difficult to manage, and bulky. Now, all major cloud providers (AWS, Azure, GCP) have established best practices to keep storage dirt-cheap and data accessible and very flexible to work with. But data lakes are still hard to implement and require specialized, focused knowledge of data architecture.

How does a data lake solve these three problems?

  1. Data lakes de-silo your data. Since the data stored in your data lake is all in the same spot, in open-source formats like JSON and CSV, there aren’t any technological walls to overcome. You can query everything in your data lake from a single SQL client. If you can’t, then that data is not in your data lake and you should bring it in.
  2. Data lakes give you visibility into data quality. Modern data lakes and expert consultants build in a variety of checks for data validation, completeness, lineage, and schema drift. These are all important concepts that together tell you if your data is valuable or garbage. These sorts of patterns work together nicely in a modern, cloud-native data lake.
  3. Data lakes welcome data from anywhere and allow for flexible analysis across your entire data catalog. If you can format your data into CSV, JSON, or XML, then you can put it in your data lake. This solves the problem of “no data.” It is very easy to create the relevant data, either by finding it in your organization, or engineering it by analyzing across your data sets. An example would be joining data from Sales (your CRM) and Customer Service (Zendesk) to find out which product category has the best or worst customer satisfaction scores.
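To make that last example concrete, here is a minimal sketch using the AWS SDK for pandas (awswrangler) to run a cross-dataset Athena query; the Glue database and the table and column names are hypothetical.

    import awswrangler as wr

    # Both extracts (CRM opportunities and Zendesk tickets) already sit in the data lake in
    # open formats and are registered in the Glue catalog, so one SQL statement can span them.
    sql = """
        SELECT s.product_category,
               AVG(z.satisfaction_score) AS avg_satisfaction
        FROM crm_sales AS s
        JOIN zendesk_tickets AS z
          ON s.customer_id = z.customer_id
        GROUP BY s.product_category
        ORDER BY avg_satisfaction DESC
    """

    # Athena queries the files in place; results come back as a pandas DataFrame.
    satisfaction_by_category = wr.athena.read_sql_query(sql, database="analytics")
    print(satisfaction_by_category)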

The 2nd Watch DataOps Foundation Platform

You should only build a data lake if you have clear business outcomes in mind. Most cloud consulting partners will robotically build a bulky data lake without any thought to the business outcome. What sets 2nd Watch apart is our focus on your business needs. Do you need to make better decisions? Speed up a process? Reduce costs somewhere? We keep your goal front and center throughout the entire engagement. We’ve deployed data lakes dozens of times for enterprises with this unique focus in mind.

Our ready-to-deploy data lake captures years of cloud experience and best practices, with integration from governance to data exploration and storage. We explain the reasons behind our decisions and make changes based on your requirements, while ingesting data from multiple sources and making it explorable as soon as possible. The core of the data lake is three zones, each implemented as an S3 bucket.

Here is a tour of each zone:

  • Drop Zone: As the “single source of truth,” this is a copy of your data in its most raw format, always available to verify what the actual truth is. Place data here with minimal or no formatting. For example, you can take a daily “dump” of a relational database in CSV format.
  • Analytics Zone: To support general analytics, data in the Analytics Zone is compressed and reformatted for fast analytics. From here, you can use a single SQL client, such as Amazon Athena, to run SQL queries over your entire enterprise dataset — all from a single place. This is the core value add of your data lake.
  • Curated Zone: The “golden” or final, polished, most-valued datasets for your company go here. This is where you save and refresh data that will be used for dashboards or turned into visualizations.
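As an illustration of how data moves between the first two zones, here is a rough sketch using the AWS SDK for pandas (awswrangler); the bucket names, Glue database, table, and partition column are hypothetical.

    import awswrangler as wr

    # Read a raw daily dump from the Drop Zone exactly as it was delivered (CSV, untouched).
    raw_orders = wr.s3.read_csv("s3://example-drop-zone/orders/2023-01-15/orders.csv")

    # Land it in the Analytics Zone compressed and columnar (Parquet), partitioned by date,
    # and register it in the Glue catalog so it is immediately queryable with Athena,
    # as in the earlier sketch.
    wr.s3.to_parquet(
        df=raw_orders,
        path="s3://example-analytics-zone/orders/",
        dataset=True,
        partition_cols=["order_date"],
        database="analytics",
        table="orders",
        mode="append",
    )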

Our classic three-zone data lake on S3 features immutable data by default. You’ll never lose data, nor do you have to configure a lot of settings to accomplish this. Using AWS Glue, data is automatically compressed and archived to minimize storage costs. Convenient search against an always-up-to-date data catalog allows you to easily discover all of your enterprise datasets.

In the Curated Zone, only the most important “data marts” – approved datasets – get loaded into more costly Redshift or RDS, minimizing costs and complexity. And with Amazon SageMaker, tapping into your Analytics and Curated Zone, you are prepared for effective machine learning. One of the most overlooked aspects of machine learning and advanced analytics is the great importance of clean, available data. Our data lake solves that issue.

If you’re struggling with one of these three core data issues, the solution is to start with a crisp definition of your business need, and then build a data lake to execute on that need. A data lake is just a central repository for flexible and cheap data storage. If you focus on keeping your data lake simple and geared towards the analysis you need for your business, these three core data problems will be a thing of the past.

If you want more information on creating a data lake for your business, download our DataOps Foundation datasheet to learn about our 4-8 week engagement that helps you build a flexible, scalable data lake for centralizing, exploring and reporting on your data.

-Rob Whelan, Practice Manager, Data Engineering & Analytics

 

 


Cloud Crunch Podcast: 5 Strategic IT Business Drivers CXOs are Contemplating Now

What is the new normal for life and business after COVID-19, and how does that impact IT? We dive into the 5 strategic IT business drivers CXOs are contemplating now and the motivation behind those drivers. Read the corresponding blog article at https://www.2ndwatch.com/blog/five-strategic-business-drivers-cxos-contemplating-now/. We’d love to hear from you! Email us at CloudCrunch@2ndwatch.com with comments, questions and ideas. Listen now on Spotify, iTunes, iHeart Radio, Stitcher, or wherever you get your podcasts.


Cloud Crunch Podcast: Examining the Cloud Center of Excellence

What is a Cloud Center of Excellence (CCOE), and how can you ensure its success? Joe Kinsella, CTO of CloudHealth, talks with us today about the importance of a CCOE, the steps to cloud maturity, and how to move through the cloud maturity journey. We’d love to hear from you! Email us at CloudCrunch@2ndwatch.com with comments, questions and ideas. Listen now on Spotify, iTunes, iHeart Radio, Stitcher, or wherever you get your podcasts.


Cloud Crunch Podcast: Diving into Data Lakes and Data Platforms

Data Engineering and Analytics expert, Rob Whelan, joins us today to dive into all things data lakes and data platforms. Data is the key to unlocking the path to better business decisions. What do you need data for? We look at the top 5 problems customers have with their data, how the cloud has helped solve these challenges, and how you can leverage the cloud for your data use. We’d love to hear from you! Email us at CloudCrunch@2ndwatch.com with comments, questions and ideas. Listen now on Spotify, iTunes, iHeart Radio, Stitcher, or wherever you get your podcasts.


What to Expect at AWS re:Invent 2017

The annual Amazon Web Services (AWS) re:Invent conference is just around the corner (the show kicks off November 27 in Las Vegas). Rest assured, there will be lots of AWS-related products, partners, and customer news. Not to mention, more than a few parties. Here’s what to expect at AWS re:Invent 2017—and a few more topics we hope to hear about.

1.)  Focus on IoT, Machine Learning, and Big Data

IoT, Machine Learning, and Big Data are top of mind with much of the industry—insert your own Mugatu “so hot right now” meme here – and we expect all three to be front and center at this year’s re:Invent conference. These Amazon Web Services offerings are ripe for adoption, as most IT shops lack the capabilities to deploy these types of services on their own. We expect to see advancements in AWS IoT usability and features. We’ve already seen some early enhancements to AWS Greengrass, most notably support for additional programming languages, and would expect additional progress to be displayed at re:Invent. Other products where we expect to see advancements are AWS Athena and AWS Glue.

In the Machine Learning space, we were certainly excited about the recent partnership between Amazon Web Services and Microsoft around Gluon, and expect a number of follow-up announcements geared toward making it easier to adopt ML into one’s applications. As for Big Data, we expect Amazon Web Services to continue picking up open source tools that can be developed into compelling services. We also would be eager to see more use of AWS Lambda for in-flight ETL work, and perhaps a long-running Lambda option for batch jobs.

2.)  Enterprise Security

To say that data security has been a hot topic these past several months would be a gross understatement. From ransomware to the Experian breach to the unsecured storage of private keys, data security has certainly been in the news. In our September Enterprise Security Survey, we found that 73% of IT professional respondents don’t fully understand the public cloud shared responsibility model.

Last month, we announced our collaboration with Palo Alto Networks to help enterprises realize the business and technical benefits of securely moving to the public cloud. The 2nd Watch Enterprise Cloud Security Service blends 2nd Watch’s Amazon Web Services expertise and architectural guidance with Palo Alto Networks’ industry-leading VM-Series security products. The combination delivers a proven enterprise cloud security offering designed to protect customer organizations from cyberattacks, in hybrid or cloud architectures. 2nd Watch is recognized as the first public cloud-native managed security provider to join the Palo Alto Networks NextWave Channel Partner Program. To learn more about security and compliance, join our re:Invent breakout session, Continuous Compliance on AWS at Scale, by registering for session ID SID313 in the AWS re:Invent Session Catalogue. We are truly excited about this new service and collaboration, and hope you will visit our booth (#1104) or Palo Alto Networks (#2409) to learn more.

As for Amazon Web Services, we fully expect to see a raft of announcements. Consistent with our expectations around ML and Big Data, we expect to hear about enhanced ML-based anomaly detection, logging and log analytics, and the like. We also expect to see advancements to AWS Shield and AWS Organizations, which were both announced at last year’s show. Similarly, we wouldn’t be surprised by announced functionality to their web app firewall, AWS WAF. A few things we know customers would like are easier, less labor-intensive management and even greater integration into SecDevOps workflows. Additionally, customers are looking for better integration with third-party and in-house security technologies – especially application scanning and SIEM solutions – for a more cohesive security monitoring, analysis, and compliance workflow.

The dynamic nature of the cloud creates specific challenges for security. Better security and visibility for ephemeral resources such as containers, and especially for AWS Lambda, are a particular challenge, and we would be extremely surprised not to see some announcements in this area.

Lastly, the General Data Protection Regulation (GDPR) will be kicking in soon, and it is critical that companies get on top of this. We expect Amazon Web Services to make several announcements about improved, secure storage and access, especially with respect to data sovereignty. More broadly, we expect that Amazon Web Services will announce improved tools and services around compliance and governance, particularly with respect to mapping deployed or planned infrastructure against the control matrices of various regulatory schemes.

3.)  Parties!

We don’t need to tell you that AWS’ re:Play Party is always an amazing, veritable visual and auditory playground. Last year, we played classic Street Fighter II while listening to Martin Garrix bring the house down (Coin might have gotten ROFLSTOMPED playing Ken, but it was worth it!). Amazon Web Services always pulls out all the stops, and we expect this year to be the best yet.

2nd Watch will be hosting its annual party for customers at the Rockhouse at the Palazzo.  There will be great food, an open bar, an awesome DJ, and of course, a mechanical bull. If you’re not yet on the guest list, request your invitation TODAY! We’d love to connect with you, and it’s a party you will not want to miss.

Bonus: A wish list of things 2nd Watch would like to see released at AWS re:Invent 2017

Blockchain – Considering the growing popularity of blockchain technologies, we wouldn’t be surprised if Amazon Web Services launched a Blockchain as a Service (BaaS) offering, or at least signaled their intent to do so, especially since Azure already has a BaaS offering.

Multi-region Database Option – This is something that would be wildly popular but is incredibly hard to accomplish. Having an active-active database strategy across regions is critical for production workloads that operate nationwide and require high uptime. Azure already offers it with their Cosmos DB (think of it as a synchronized, multi-region DynamoDB), and we doubt Amazon Web Services will let that challenge stand much longer. It is highly likely that Amazon Web Services has this pattern operating internally, and customer demand is how new Amazon Web Services offerings are born.

NiFi – The industry interest in NiFi data-flow orchestration, often analogized to the way parcel services move and track packages, has been accelerating for many reasons, including its applicability to IoT and its powerful capabilities around provenance. We would love to see AWS Data Pipeline re-released as NiFi, but with all the usual Amazon Web Services provider integrations built in.

If even half our expectations for this year’s re:Invent are met, you can easily see why the 2nd Watch team is truly excited about what Amazon Web Services has in store for everyone. We are just as excited about what we have to offer to our customers, and so we hope to see you there!

Schedule a meeting with one of our AWS Professional Certified Architects, DevOps or Engineers and don’t forget to come visit us in booth #1104 in the Expo Hall!  See you at re:Invent 2017!

 

— Coin Graham, Senior Cloud Consultant and John Lawler, Senior Product Manager, 2nd Watch


High Performance Computing (HPC) – An Introduction

When we talk about high performance computing (HPC), we are typically trying to solve some type of problem. These problems will generally fall into one of four types:

  • Compute Intensive – A single problem requiring a large amount of computation.
  • Memory Intensive – A single problem requiring a large amount of memory.
  • Data Intensive – A single problem operating on a large data set.
  • High Throughput – Many unrelated problems to be computed in bulk.

 

In this post, I will provide a detailed introduction to High Performance Computing (HPC) that can help organizations solve the common issues listed above.

Compute Intensive Workloads

First, let us take a look at compute intensive problems. The goal is to distribute the work for a single problem across multiple CPUs to reduce the execution time as much as possible. In order to do this, we need to execute steps of the problem in parallel. Each process (or thread) takes a portion of the work and performs the computations concurrently. The CPUs typically need to exchange information rapidly, requiring specialized communication hardware. Examples of these types of problems include financial modeling and risk-exposure analysis in both traditional business and healthcare use cases. This is probably the largest portion of HPC problem sets and is the traditional domain of HPC.

When attempting to solve compute intensive problems, we may think that adding more CPUs will reduce our execution time. This is not always true. Most parallel code bases have what we call a “scaling limit”. This is in no small part due to the system overhead of managing more copies, but also to more basic constraints.

This is summed up brilliantly in Amdahl’s law.

In computer architecture, Amdahl’s law is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. It is named after computer scientist Gene Amdahl, and was presented at the AFIPS Spring Joint Computer Conference in 1967.

Amdahl’s law is often used in parallel computing to predict the theoretical speedup when using multiple processors. For example, if a program needs 20 hours using a single processor core, and a particular part of the program which takes one hour to execute cannot be parallelized, while the remaining 19 hours (p = 0.95) of execution time can be parallelized, then regardless of how many processors are devoted to a parallelized execution of this program, the minimum execution time cannot be less than that critical one hour. Hence, the theoretical speedup is limited to at most 20 times (1/(1 − p) = 20). For this reason, parallel computing with many processors is useful only for very parallelizable programs.

– Wikipedia

Amdahl’s law can be formulated the following way:

S_latency(s) = 1 / ((1 − p) + p / s)

where

  • S_latency is the theoretical speedup of the execution of the whole task.
  • s is the speedup of the part of the task that benefits from improved system resources.
  • p is the proportion of execution time that the part benefiting from improved resources originally occupied.

(Chart example: if 95% of the program can be parallelized, the theoretical maximum speedup using parallel computing would be 20 times.)

Bottom line: As you create more sections of your problem that are able to run concurrently, you can split the work between more processors and thus, achieve more benefits. However, due to complexity and overhead, eventually using more CPUs becomes detrimental instead of actually helping.

There are libraries that help with parallelization, like OpenMP or Open MPI, but before moving to these libraries, we should strive to optimize performance on a single CPU, then make p as large as possible.
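To see the scaling limit in numbers, here is a quick sketch of Amdahl’s law in Python using the terms defined above.

    def amdahl_speedup(p: float, s: float) -> float:
        """Theoretical speedup when a fraction p of the work is sped up by a factor s."""
        return 1.0 / ((1.0 - p) + p / s)

    # Treat s as the number of CPUs working on the parallel portion.
    for cpus in (2, 4, 16, 64, 256, 1024):
        print(f"p=0.95, {cpus:>4} CPUs -> {amdahl_speedup(0.95, cpus):5.2f}x speedup")
    # The speedup approaches, but never exceeds, 1 / (1 - 0.95) = 20x, however many CPUs you add.

    # Raising p pays off far more than adding hardware:
    for p in (0.50, 0.90, 0.95, 0.99):
        print(f"p={p:.2f}, 256 CPUs -> {amdahl_speedup(p, 256):6.2f}x speedup")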

Memory Intensive Workloads

Memory intensive workloads require large pools of memory rather than many CPUs. In my opinion, these are some of the hardest problems to solve and typically require great care when building the machines in your system. Coding and porting is easier because memory appears seamless, allowing for a single system image. Optimization becomes harder, however, as we get further away from the original creation date of your machines, because component uniformity is lost. Traditionally, in the data center, you don’t replace every single server every three years, so if we want more resources in our cluster while keeping performance uniform, mixing hardware generations introduces non-uniform memory access and real latency. We also have to think about the interconnect between the CPU and the memory.

Nowadays, many of these concerns have been eliminated by commodity servers. We can ask for thousands of the same instance type with the same specs and hardware, and companies like Amazon Web Services are happy to let us use them.

Data Intensive Workloads

This is probably the most common workload we find today, and probably the type with the most buzz. These are known as “Big Data” workloads. Data intensive workloads are the type of workloads suitable for software packages like Hadoop or MapReduce. We distribute the data for a single problem across multiple CPUs to reduce the overall execution time. The same work may be done on each data segment, though that is not always the case. This is essentially the inverse of a memory intensive workload in that rapid movement of data to and from disk is more important than the interconnect. The problems being solved in these workloads tend to be life sciences (genomics) in the academic field, with a wide reach in commercial applications, particularly around user data and interactions.

High Throughput Workloads

Batch processing jobs (jobs with almost trivial operations to perform in parallel as well as jobs with little to no inter-CPU communication) are considered high throughput workloads. In high throughput workloads, the emphasis is on aggregate throughput over a period of time rather than on performance on any single problem. We distribute multiple problems independently across multiple CPUs to reduce overall execution time. These workloads should:

  • Break up naturally into independent pieces.
  • Have little or no inter-CPU communication.
  • Be performed in separate processes or threads on a separate CPU (concurrently).

 

Compute intensive workloads can often be broken into high throughput jobs; however, high throughput jobs are not necessarily compute intensive.
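To make the distinction concrete, here is a minimal sketch of a high throughput pattern in Python: many independent work items, no communication between them, farmed out across CPUs. The work function is a stand-in for whatever independent job you actually run.

    from multiprocessing import Pool

    def process_item(item: int) -> int:
        # Stand-in for one independent unit of work (rendering a frame, scoring a record,
        # converting a file, ...). It shares nothing with the other items.
        return sum(i * i for i in range(item))

    if __name__ == "__main__":
        work_items = list(range(10_000, 10_100))   # 100 unrelated jobs
        with Pool() as pool:                       # one worker process per CPU by default
            results = pool.map(process_item, work_items)
        print(f"Processed {len(results)} independent items")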

HPC On Amazon Web Services

Amazon Web Services (AWS) provides on-demand scalability and elasticity for a wide variety of computational and data-intensive workloads, including workloads that represent many of the world’s most challenging computing problems: engineering simulations, financial risk analyses, molecular dynamics, weather prediction, and many more.   

– AWS: An Introduction to High Performance Computing on AWS

Amazon literally has everything you could possibly want in an HPC platform. For every type of workload listed here, AWS has one or more instance classes to match and numerous sizes in each class, allowing you to get very granular in the provisioning of your clusters.

Speaking of provisioning, there is even a tool called CfnCluster for building and managing HPC clusters on AWS. Once a cluster is created, you can log into it via the master node, where you will have access to standard HPC tools such as schedulers, shared storage, and an MPI environment.

For data intensive workloads, there are a number of options to help get your data closer to your compute resources:

  • S3
  • Redshift
  • DynamoDB
  • RDS

 

EBS is even a viable option for creating large-scale parallel file systems to meet the high-volume, high-performance, and high-throughput requirements of workloads.

HPC Workloads & 2nd Watch

2nd Watch can help you solve complex science, engineering, and business problems using applications that require high bandwidth, enhanced networking, and very high compute capabilities.

Increase the speed of research by running high performance computing (HPC) in the cloud, and reduce costs by paying for only the resources that you use, without large capital investments. With 2nd Watch, you have access to a full-bisection, high bandwidth network for tightly coupled, IO-intensive workloads, which enables you to scale out across thousands of cores for throughput-oriented applications. Contact us today to learn more about High Performance Computing (HPC).

2nd Watch Customer Success

Celgene is an American biotechnology company that manufactures drug therapies for cancer and inflammatory disorders. Read more about their cloud journey and how they went from doing research jobs that previously took weeks or months, to just hours. Read the case study.

We have also helped a global finance & insurance firm prove their liquidity time and time again in the aftermath of the 2008 recession. By leveraging the batch computing solution that we provided for them, they are now able to scale out their computations across 120,000 cores while validating their liquidity with no CAPEX investment. Read the case study.

 

– Lars Cromley, Director of Engineering, Automation, 2nd Watch