The widespread adoption of the value-based care model is encouraging more healthcare organizations to revisit the management of their data. Increased emphasis on the quality of service, elevating care outcomes along the way, means that organizations depend more than ever on consistent, accessible, and high-quality data.
The problem is that the current state of data management is inconsistent and disorganized. Less than half of healthcare CIOs trust the current quality of their clinical, operational, and financial data. In turn, the low credibility of their data sources calls into question their reporting and analytics, which ripples outward, inhibiting the entirety of their decision-making. Clinical diagnoses, operational assessments, insurance policy designs, and patient/member satisfaction reports all suffer under poor data governance.
Fortunately, most healthcare organizations can take straightforward steps to improve their data governance – if they are aware of what’s hindering their reporting and analytics. With that goal in mind, here are some of the most common challenges and oversights for data governance and what your organization can do to overcome them.
The Return of Data Silos
Most healthcare organizations are now aware of the idea of data silos. As a whole, the industry has made commendable progress breaking down these barriers and unifying large swaths of raw data into centralized repositories. Yet the ongoing addition of new data sources can lead to the return of analytical blind spots if your organization doesn’t create permanent protocols to prevent them.
Consider this situation: Your billing department just implemented a live chat feature on your website or app, providing automated answers to a variety of patient or member questions. If there is not an established protocol automatically integrating data from these interactions into your unified view, then you’ll miss valuable pieces of each patient or member’s overall story. The lack of data might result in missed opportunities for outreach campaigns or even expanded services.
Adding any new technology (e.g., live chat, healthcare diagnostic devices, virtual assistants) creates a potential threat to the comprehensiveness of your insights. Yet by creating a data pipeline and a data-centric culture, you can prevent data siloing from reasserting itself. Remember that your data ecosystem is dynamic, and your data governance practices should be too.
Lack of Uniformity
None of the data within a healthcare organization exists in a vacuum. Even if the data within your EHR or medical practice management (MPM) software is held to the highest quality standards, a lack of consistency between these or other platforms can diminish the overall accuracy of analytics. Worst of all, this absence of standardization can impact your organization in a number of ways.
When most people think of inconsistencies, they probably think of the accuracy of the data itself. There are obviously harmful clinical inconsistencies (e.g., a pathology report indicates cancerous cells are acute while a clinical report labels them chronic) and less glaring but damaging organizational inconsistencies (e.g., two or more different contact numbers that hamper communication). In these examples and others, data inaccuracies muddy the waters and impair the credibility of your analytics. The other issue is more subtle, sneaking under the radar: mismatched vocabulary, terminology, or representations.
Here’s an example. Let’s say a healthcare provider is trying to analyze data from two different sources, their MPM and their EHR. Both deal with patient demographics, but might have different definitions of what constitutes their demographics. Their age brackets might vary (one might set a limit at ages 18 to 29 and another might draw the line at 18 to 35), which can prevent seamless integration for demographic analysis. Though less harmful, this lack of uniformity can curtail the ability of departments to have a common understanding and derive meaningful business intelligence from their shared data.
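One remedy is to stop reconciling each system’s bracket labels and instead re-derive a single agreed-upon set of brackets from the raw age values. Here is a minimal Python sketch; the bracket boundaries and field names are illustrative assumptions, not taken from any particular MPM or EHR:

```python
# Map raw ages onto one standardized set of brackets so records from
# different systems (e.g., an MPM and an EHR) can be compared directly.
# The bracket boundaries below are illustrative assumptions.
STANDARD_BRACKETS = [(0, 17, "0-17"), (18, 29, "18-29"),
                     (30, 44, "30-44"), (45, 64, "45-64"),
                     (65, 200, "65+")]

def standardize_bracket(age: int) -> str:
    """Return the standardized bracket label for a raw age value."""
    for low, high, label in STANDARD_BRACKETS:
        if low <= age <= high:
            return label
    raise ValueError(f"Age out of range: {age}")

# Records from two systems that originally used different brackets
# now land in the same bucket once re-derived from the raw age.
mpm_record = {"patient_id": "A1", "age": 27}
ehr_record = {"patient_id": "A1", "age": 27}
assert standardize_bracket(mpm_record["age"]) == standardize_bracket(ehr_record["age"])
```

The key design choice is to standardize on the lowest-level raw value (the age itself) rather than trying to translate between the two systems’ incompatible bracket labels.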
In all of the above instances, establishing a single source of truth with standardized information and terminology is essential if you’re going to extract accurate and meaningful insights during your analyses.
To combat these problems, your organization needs to decide upon a standardized representation of core data entities that create challenges upon analysis. Then, rather than cleansing the data in their respective source systems, you can use an ELT process to extract and load structured and unstructured data into a centralized repository. Once the data has been centralized, you can evaluate the data for inaccuracies, standardize the data by applying data governance rules against it, and finally normalize the data so your organization can analyze it with greater uniformity.
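That extract-load-then-standardize sequence can be sketched in a few lines of Python. Everything here is a toy illustration: the source records, the governance rules, and the in-memory "repository" stand in for real source systems and a centralized store:

```python
# ELT sketch: extract records from multiple sources, load them raw into
# a central store, then standardize in that central layer rather than in
# each source system. All data and rules here are illustrative assumptions.

def extract_and_load(sources, repository):
    """Load raw records as-is; cleansing happens later, centrally."""
    for system_name, records in sources.items():
        for rec in records:
            repository.append({"source": system_name, **rec})

def standardize(repository, rules):
    """Apply data governance rules (canonical value mappings) in place."""
    for rec in repository:
        for field, mapping in rules.items():
            if field in rec:
                rec[field] = mapping.get(rec[field], rec[field])

sources = {
    "mpm": [{"patient_id": "A1", "sex": "F"}],
    "ehr": [{"patient_id": "A1", "sex": "Female"}],
}
rules = {"sex": {"F": "Female", "female": "Female", "M": "Male", "male": "Male"}}

repository = []
extract_and_load(sources, repository)
standardize(repository, rules)
assert all(rec["sex"] == "Female" for rec in repository)
```

Note that the source systems are never modified; the governance rules live alongside the centralized repository, which is what makes them auditable and reusable across every downstream analysis.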
Overly Restricted Data Access
Even when your data is high-quality and consistent, your organization might still fall short of data governance best practices. Why? The accessibility of your data might not thread the needle between HIPAA compliance and appropriate end-user authorization.
Some organizations, dedicated to protecting the protected health information (PHI) of their patients or members, clip their own wings when the time comes to analyze data. In an attempt to avoid expensive HIPAA violations, they restrict stakeholders, analysts, or other covered entities from accessing the data. Though it’s essential to remain HIPAA compliant, data analysis can be conducted in ways that safeguard PHI while also improving treatment quality or reducing the cost of care.
Your organization can de-identify records (removing names, geographic indicators, contact info, social security numbers, etc.) in a specific data warehouse. Presenting scrubbed files to authorized users can help them gain a wide range of insights that can transform care outcomes, reduce patient outmigration, reduce waste, and more.
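As a minimal illustration, de-identification can be as simple as dropping identifier fields before records reach analysts. The field list below covers only a subset of the HIPAA Safe Harbor identifiers and is illustrative, not a complete compliance implementation:

```python
# Strip direct identifiers from records before they reach analysts.
# PHI_FIELDS is an illustrative subset of HIPAA Safe Harbor identifiers,
# not an exhaustive or authoritative list.
PHI_FIELDS = {"name", "address", "zip_code", "phone", "email", "ssn"}

def deidentify(record: dict) -> dict:
    """Return a copy of the record with direct identifiers removed."""
    return {k: v for k, v in record.items() if k not in PHI_FIELDS}

record = {
    "name": "Jane Doe", "ssn": "000-00-0000", "zip_code": "60601",
    "diagnosis_code": "E11.9", "visit_count": 4,
}
scrubbed = deidentify(record)
# Analytical fields survive; direct identifiers do not.
assert scrubbed == {"diagnosis_code": "E11.9", "visit_count": 4}
```

In practice this scrubbing would run as part of the load into the dedicated de-identified warehouse, so authorized users only ever query the scrubbed copy.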
Elevating Your Overall Data Governance
With all of these challenges in sight, it’s easy to get overwhelmed about the next steps. Though we’ve provided some actions your organization can take, it’s important to recognize that effective data governance is as much a change in your mindset as it is a series of best practices. Here are some additional considerations to keep in mind as you work to improve your data governance:
You Need a Defined Data Governance Strategy.
An ad hoc approach to data governance will fail in the long run. There needs to be agreement among your data stakeholders about data availability, consistency, and quality. Often, it helps to start with a pilot project on a single line of business or department to ensure that all of the kinks of the transition are ironed out before your data governance strategy is taken enterprise-wide.
Even then, compromise between standardization and distributed action is important so users within your organization are following the same best practices as they conduct dispersed analytics.
Your Culture Likely Needs to Change.
Eliminating data inconsistencies or adjusting inaccuracies are temporary fixes if only your executives are committed to making a change. Employees across your organization need to embrace the ideals of effective data governance if your organization is going to gain useful and accurate intelligence from your data.
Is your organization suffering from poor data governance? Find out the ways you can improve your data management by scheduling a whiteboard session with a member of the 2nd Watch team.
Jim Anfield – Principal, Healthcare Practice Leader 2nd Watch
To get your source data ingested and loaded, or for a deep dive into how you can build a fully automated data integration process in Snowflake on Azure, schedule a Snowflake whiteboarding session with our team of data architects.
If you have a Microsoft ecosystem but have been wanting to take advantage of more tools on the market, Snowflake on Azure gives you additional opportunities to upgrade your analytics platform without throwing away your investment in, or the uniformity of, your current environment.
For those who are not familiar, Snowflake is a cloud-based, massively parallel processing (MPP), columnar storage database. It’s a newer option for data warehousing that is set up for more efficient querying of large data volumes. It also consumes structured and semi-structured data in a way that traditional relational databases are not designed to do as effectively – think “big data” without the “big overhead.”
With this release, Microsoft companies can evaluate using Snowflake to increase the performance and flexibility of their analytics environment by adding it on top of their existing data integration process. To determine where Snowflake might be a good fit, our 2nd Watch consultants took a dive into where Snowflake could sit in our current and prospective Microsoft clients’ ecosystems.
Where does Snowflake fit in your current Azure environment?
Snowflake is best used as the home of an analytical layer (or dimensional model, for the more technical) that enables reporting. Think of this as the substitute for products like SQL Server Analysis Services (SSAS).
While we still recommend that you maintain a data integration hub and related process for all of your consuming application needs (i.e., sending consolidated and cleansed data back to each source system to keep them in sync), Snowflake can sit right at the end of that process. Because it’s optimized for read activities, it complements the needs of business intelligence tools like Power BI, Looker, Tableau, etc., making it faster for business users to grab the information they need via those tools. Integration and ETL are possible with many tools and services, including Azure Data Factory.
When would you want to use Snowflake?
Two primary ideas underpin Snowflake’s competitive advantage among data warehousing platforms: its automatic optimization of query execution and the hands-off nature of its maintenance.
We recommend Snowflake for two main use cases:
Developing a more efficient analytics platform
Creating a platform for flexible data discovery
Our clients that fit within the above use cases usually had:
Multiple business users – The more people you have querying your database at one time, the more the database has to be configured to handle that load so as not to lock up other processes. Traditional databases can be scaled up to handle larger reads (think of a query that produces data), but this takes a decent amount of time and effort to achieve, and they are more often optimized for writes (think of loading data into a table). In Snowflake, a user can spin up resources for just one query, then spin them back down right after. This allows for a more modular use of higher-power resources.
Lots of data – If you’ve ever tried to perform a huge query on your current system, you likely noticed a slowdown from your usual processing. Traditional databases are not as optimized for read activities as columnar databases are. This makes options like Snowflake more attractive to those performing heavier analytical queries on a regular basis.
“Database systems have traditionally optimized performance for write-intensive workloads. Recently, there has been renewed interest in architectures that optimize read performance by using column-oriented data representation and light-weight compression. This previous work has shown that under certain broad classes of workloads, column-based systems can outperform row-based systems.” – MIT Computer Science and Artificial Intelligence Laboratory
A mix of structured and semi-structured data – Though many traditional databases offer options for consuming semi-structured data (e.g., JSON, XML, etc.), they aren’t optimized to do so. If you have a mix of structured and semi-structured data, options like Snowflake might be more efficient for your process.
What’s the tradeoff?
Snowflake can be a great option for clients who need a true analytical platform where they can perform data discovery or need to read out a lot of data at once. That said, the following are situations where an alternative to Snowflake might make more sense:
You need a data hub to keep your various applications in sync (application integration and API endpoints).
You aren’t trying to perform complex aggregations or data discovery.
You’re invested in an existing solution that doesn’t have many performance or management overhead concerns.
So what’s the gist?
Snowflake makes it easy to kick off a data warehouse project with its ease of use and start-small, scale-big approach. Despite this innovative architecture, we still recommend applying the same data warehousing fundamentals that you would have in the past.
Yes, Snowflake will allow you to create views on top of messy raw source data, but you will ultimately lose the battle on performance as your data grows in complexity. We suggest approaching a new Snowflake deployment with a vision of your data platform as a whole – not just the analytics layer. Performing all your complex business logic in Snowflake is possible, but due to the columnar architecture, integration use cases are better served in complementary cloud database platforms, such as Azure SQL.
With a roadmap and strategy in place for both your integration and analytics needs, you’ll be better able to select the right mix of tools and technologies for your solution. Learn more about Snowflake deployment best practices in our eBook.
It’s true that in a very short period of time, you can get an operational data warehouse loaded with all your source data. Wondering how to build a data warehouse that yields these results? Snowflake and its technology partner ecosystem (partners like Fivetran) allow you to take your databases and SaaS applications, replicate them to Snowflake, and see results in a significantly shorter timeframe than we’ve seen before. Write some SQL views in Snowflake against this data, point your favorite BI tool at it, and get lightning-fast results.
With the right plan in place, you can (aggressively) deliver that first business “wow” in six to eight weeks. 2nd Watch typically recommends at least two analytical virtual warehouses in enterprise deployments: one for data discovery and one for more structured and governed reporting.
The practice of having both a data discovery layer and a governed layer for your Snowflake deployment not only saves you time in the initial build, but it creates a continuous integration and deployment pattern. It finally makes a dent in the notion that a data warehouse cannot be “agile.”
Future State Snowflake Data Architecture
With this approach, you not only achieve governance and speed-to-insight, but you’ve also cut down your Snowflake consumption costs. Running complex queries at run-time, every time, can get expensive. Snowflake’s caching can help here, but if you’re constantly running a complex join across more than 20 tables, that physical layer might help you move from an XL-sized Snowflake warehouse to an L or M. In the long run, those cost savings will add up. When “best practice” or risk avoidance isn’t enough to justify this approach, the dollar savings might speak for themselves.
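To put rough numbers on that, here is a back-of-the-envelope comparison in Python. It uses Snowflake’s standard credit doubling per warehouse size (an XL consumes 16 credits per hour, an L consumes 8); the daily usage hours are a hypothetical figure you should replace with your own:

```python
# Back-of-the-envelope Snowflake cost comparison. Credit rates follow
# Snowflake's standard doubling per warehouse size; the hours-per-day
# figure is hypothetical and should be replaced with your own numbers.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}

def monthly_credits(size: str, hours_per_day: float, days: int = 30) -> float:
    """Credits consumed by one warehouse running hours_per_day each day."""
    return CREDITS_PER_HOUR[size] * hours_per_day * days

# Hypothetical: the same reporting load runs 4 hours per day.
xl_cost = monthly_credits("XL", 4)  # complex joins computed at query time
l_cost = monthly_credits("L", 4)    # same load served from a physical layer
print(f"XL: {xl_cost} credits/month, L: {l_cost} credits/month")
print(f"Savings: {xl_cost - l_cost} credits/month")
```

Multiply the credit difference by your contracted price per credit to translate the savings into dollars.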
The initial six- to eight-week path assumes a focus on the data discovery layer, as depicted below. Loading all your data into a data discovery layer should be the primary development activity in your Snowflake proof of concept (POC) or pilot. Here are some tips:
Find several source systems that have the data your stakeholders need.
Begin the process of rapidly loading into your Snowflake data discovery layer.
Write iterative SQL in Snowflake views to build your business logic.
Connect your BI tool to Snowflake and build a very simple dashboard.
Get feedback from business, with the new dashboard as a facilitation tool.
Repeat this process – more feedback is better.
You’ve not only increased the speed to market, but you’ve also enabled developers and business users to execute an analytics project in a completely new way.
Highlighted: Speed-to-insight approach, with transformations in Snowflake
This all comes with a caveat: Yes, you can write all your logic in Snowflake and do everything you need from an analytics standpoint, but it will limit you in the long run. Every data project has sacrifices to be made, whether it’s time, features, cost, or flexibility. You need to balance these with a long-term vision for how your data will be used across the organization. Snowflake will get you part of the way there, but a good plan and strategy will take you even further.
In the meantime, if you have any questions or want to learn more about how to build a data warehouse using Snowflake, contact us today.
Any functioning company has a vision, goals, and direction. A successful company follows a strategy – set forth by its executives – to realize its vision, achieve its goals, and determine its direction. Executing executive strategy requires cross-functional collaboration amongst teams across the organization.
Having a data strategy is crucial to achieving your greater business goals. To set attainable and tangible business objectives, you need to utilize data. Your IT department will implement and oversee the technologies powering your data strategy; therefore, business processes and goals must factor in IT resources and capabilities. Your organization’s business unit should work closely with your IT team to build an effective data strategy.
Utilizing Data Analysis to Pursue Strategies
When it comes to business goals, each goal should have a quantitative KPI or metric that is extracted from data and analytics. Of course, this is easier said than done! Even the simplest of business questions can spawn dozens of data threads needed for analysis. For example, if you are looking to track business profitability, you’ll have to look at key components such as revenue, COGS, and overhead. Each of these data sources must be broken down even further to fully understand the several metrics that make up the greater question of profitability. The ultimate goal of having a data strategy is to define all the inputs required to calculate key metrics and incorporate them into your production processes.
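As a toy illustration of how one KPI decomposes into multiple data inputs, here is a short Python sketch; all figures and source attributions are hypothetical:

```python
# A single "profitability" KPI decomposes into several inputs, each of
# which may come from a different source system. All figures hypothetical.
def profit_margin(revenue: float, cogs: float, overhead: float) -> float:
    """Net margin as a fraction of revenue."""
    if revenue == 0:
        raise ValueError("Revenue must be non-zero")
    return (revenue - cogs - overhead) / revenue

# Each input might itself be an aggregate from a different system:
revenue = sum([120_000, 95_000, 110_000])   # e.g., invoices from the ERP
cogs = sum([60_000, 50_000, 55_000])        # e.g., procurement data
overhead = 45_000                           # e.g., the general ledger

margin = profit_margin(revenue, cogs, overhead)
print(f"Profit margin: {margin:.1%}")
```

The point of the data strategy is to define every one of those upstream inputs (and its source system) so the final metric is reproducible in production, not assembled by hand each quarter.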
Your Data Strategy Checklist
Before engaging in any data management project, it’s important to outline your to-do list. Below is a checklist with the foundational steps to building out a data strategy:
Scope the project (what business problem are you trying to solve?)
Generate a work breakdown structure for strategy.
Create a timeline with milestones.
Define teams and individual role definitions.
Determine how you collaborate with others.
Schedule meetings (kick-off, status, sprint planning, and steering meetings).
Begin research of your technology stack.
Budget for your data strategy project.
Kick off with the team!
Let’s Get Started!
As you work toward building a data strategy, there are four deliverables you should focus on: a current state understanding, a future state architecture, a gap analysis, and a road map.
Current State Understanding
Your data strategy should reinforce and advance your overall business strategy, which refers to the processes you use to operate and improve your business. To grow, you must establish your baseline to measure your growth. To gain an understanding of your current state, review your current data architecture, data flows, and source systems by:
Collecting existing diagrams and analyzing existing discovery
Verifying source systems (you have more than you think!)
Diagramming and visualizing
Having a logical model (what are the existing business domains?)
Future State Architecture
Once you’ve established a baseline, envision what you’d like the future state architecture to look like. Your future state architecture should support the data platform at your company, including a logical model of the data warehouse and technology decisions. To determine your future state, define your data threads, which requires you to identify your source systems and domains.
Gap Analysis
What changes need to happen to move your company from the current state to the desired future state? Establish your technical framework and identify the gaps. This framework should be built for moving, curating, and delivering data to your business use cases.
Road Map
Road mapping is the most time-consuming and the most important deliverable. Your data strategy roadmap will outline how and when to implement your data vision and mission. This will help you manage change and ensure that IT goals are aligned with business goals. A well-thought-out roadmap reveals true dependencies so you can adjust as priorities shift. There are several important steps to building out a road map, which we will cover below.
Step 1: Build and estimate a backlog.
Your backlog should be based on your intended future state and what business deliverables you want to generate. Questions to answer:
What data threads need to be engineered?
How are you going to deliver the inputs to the artifact?
How are you going to productionalize the result to create the action?
Step 2: Define your vision and interim goals.
Before you can embark on any project, you’ll need to establish your end goals and the goals along the way.
Questions to answer:
What does success look like?
What priorities should you target?
Step 3: Identify your KPIs.
Determine the data and metrics that will be the most useful for operational and strategic decisions.
Questions to answer:
Who needs what information?
How valuable is that information to guide your decisions?
Step 4: Understand everything about your data.
This means understanding your sources of data, data relationships, and data quality issues.
Questions to answer:
What data are you sourcing from?
What data issues can be resolved by code versus process changes?
How can you map this data to a new, clean, holistic data model for reporting and analytics?
Step 5: Recommend a selection of technologies and tools.
What architecture considerations are required to best use your technology stack and ecosystem?
What orchestration tools will you need?
What additional tools will you need to consider in the future as the platform matures?
Step 6: Design a high-level solution architecture.
Base your data marts on this architecture to enable quicker access to data and faster insights.
Questions to answer:
How should raw data, data collection, data storage, and access be structured for ease of use?
How frequently – and with what method – should this data be refreshed?
Who will own and support this in the future?
Step 7: Generate a specific backlog of tasks for implementation and ongoing development.
Questions to answer:
How will you build out the end-to-end foundation?
How will you coordinate with groups to gain the most immediate benefit?
Step 8: Develop a go-forward plan with a budget, milestones, and a timeline.
These components of your roadmap will make your future state a reality.
Questions to answer:
When can you complete this project?
How will you allocate the work?
How much will it all cost?
What internal and external resources are needed to put everything into action?
A Timeline of Your Data Strategy Road Map
At 2nd Watch, our approach is based on a 5- to 7-month data strategy timeline. Our data experts and consultants think about implementing a data strategy in phases.
Phase 1: Foundation and process – 3 to 4 months
Set up your ETL and audit framework.
Ingest your primary data set into a persistent storage layer.
Engineer a business model for Phase 1 scope.
Create a data dictionary, source to target, and support documentation for scaling the platform.
Develop business use cases based on Phase 1 scope.
Establish a strategic foundation for ongoing maturity and add future data sources.
Assign a data engineer and architect to support model development.
Phase 2: Development analytics – 2 to 3 months
Ingest your primary data sources for development.
Add integration dataflows to increase and expand the business layer for Phase 2 scope.
Identify SMEs and data champions to drive QA and identify new report requirements.
Phases 3 and 4: Secondary sources
Ingest all other data sources to supplement data warehouse development.
Add integration dataflows to increase and expand the business layer for Phase 3 and 4 scope.
Integrate new dataflows into existing reporting and create opportunities for new audiences and enterprise level roll-ups.
Phase 5 and beyond: Support
Continue to model data for Phases 5 and beyond.
Expand advanced analytics capabilities for predictive and clustering models.
Eliminate downstream dependencies on legacy systems to allow sunsetting applications.
Integrate additional data consumers and data applications for expanded functionality.
2nd Watch’s consultants work with your team to understand your business goals and design a tech strategy to ensure success throughout the organization. From planning to technology selection to user adoption, our team’s data advisory services enhance business optimization via data-driven insights for the long haul. Contact us today to learn how 2nd Watch can build a modern data foundation for your business intelligence and predictive analytics to ensure your organization can make better decisions now and in the future.
As a business leader, you want to get to impactful analytics ASAP. You want faster reporting, easy-to-use dashboards, and quick and accurate answers to your questions. Whether you’re the CFO, head of operations, or marketing lead, you want to use your data to make better, more insightful business decisions.
However, you can’t implement analytics without a solid data strategy. A common component of implementing a data management architecture is ETL, or “extract, transform, load” logic. We understand that your CEO wants to hear, “We’ve implemented robust analytics,” not “We’re working on ETL.” ETL doesn’t sound exciting. But to get you where you want to go, ETL should be a part of your data story.
At 2nd Watch, we aim to help you feel confident through all phases of your data and analytics journey. So today we’re answering the question “What is ETL?”, including where ETL fits into data management and why it may be an essential precursor to your organization’s analytics goals.
What is ETL?
ETL involves collecting the data that is most relevant to the key performance indicators you are trying to analyze – which could be all of the data across your organization or just pieces of the data stored in various source systems – centralizing, transforming, and modeling it, likely in a data warehouse.
As previously mentioned, ETL stands for “extract, transform, load.” Let’s break that down:
Extract
Data is gathered from the relevant data sources (e.g., ERP, CRM, application databases, SaaS applications, flat files, and/or Excel spreadsheets). This could be financials, customer details, and internal HR information. Or maybe your marketing department, in particular, needs an ETL solution; they may store customer data across multiple systems, such as your enterprise CRM, email marketing software, and a digital advertising provider.
Extraction could be done manually, or it can be automated using orchestration workflows and ETL tooling that eliminates the manual process, making it much faster and more reliable to analyze your business data.
Transform
The data transformation stage is really “where the magic happens,” as this is where you make your data usable. Data is cleaned, deduped, and transformed to align with how the business wants to analyze the data in your data warehouse, ensuring data quality and ease of use. This includes reformatting for uniformity, such as restructuring IL, Il, Ill., and Illinois to all read “Illinois.” Data is also cleansed of inconsistencies and missing values, duplicate values are removed, and data is prepared for the business to easily interact with it.
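The cleansing steps described here (standardizing variant values and removing duplicates) can be sketched in Python. The state variants, field names, and dedup key are illustrative assumptions:

```python
# Transform-stage sketch: standardize inconsistent values and remove
# duplicate rows. Field names and the variant list are illustrative.
STATE_VARIANTS = {"IL": "Illinois", "Il": "Illinois", "Ill.": "Illinois"}

def transform(records):
    cleaned, seen = [], set()
    for rec in records:
        rec = dict(rec)  # copy so the raw input is left untouched
        rec["state"] = STATE_VARIANTS.get(rec["state"], rec["state"])
        key = rec["customer_id"]        # dedupe on the business key
        if key not in seen:
            seen.add(key)
            cleaned.append(rec)
    return cleaned

raw = [
    {"customer_id": 1, "state": "IL"},
    {"customer_id": 1, "state": "Illinois"},  # duplicate of the same customer
    {"customer_id": 2, "state": "Ill."},
]
rows = transform(raw)
assert rows == [{"customer_id": 1, "state": "Illinois"},
                {"customer_id": 2, "state": "Illinois"}]
```

Real pipelines express the same logic in SQL or an ETL tool, but the shape is identical: map variants to canonical values, then collapse duplicates on a business key.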
Most of the time spent on the ETL process will be used to come up with the right calculations, determine data quality rules, and model the data for analytics, all key pieces of data transformation. This stage is only becoming more crucial as organizations are seeing growing amounts of disparate data from many different sources.
Load
In the last stage of the ETL process, the transformed data is loaded into your data warehouse, so all data is easily accessible for future use in analytical dashboards and reporting. Depending on your business needs, this process can be done in batch or near-real-time, allowing you to access and analyze your data soon after it’s loaded from your source systems.
Ultimately, ETL helps your data and modern business intelligence tools such as Looker, Qlik, Power BI, and/or Tableau work in harmony, giving you richer and deeper analytical capabilities.
What is ELT?
You may be wondering why this section has the same header as the previous section. If you look closely, you’ll notice two different acronyms: ETL and ELT. If you guessed that ELT stands for “extract, load, transform,” you would be correct. Why might we switch the order of the process?
ELT is most useful when speed is the primary concern. Data is loaded more quickly, and you can decide which data to transform at which point. This lets you access the exact data you need more quickly, though you will have to wait for additional data transformation if more of your data becomes necessary.
Of course, this approach can come with security concerns as large amounts of data are loaded into a system prior to being transformed. Additionally, ELT can impede analytics based on large data sets if your system isn’t capable of transforming the required data quickly enough.
An ETL/ELT tool, or a custom ETL solution built with multiple technologies, can address some of the concerns you may have about either method, as these tools are built precisely for this purpose.
What ETL tools (and ELT tools) should you know about?
There are two ways to approach ETL: building a custom ETL solution (which 2nd Watch can help with) or using a prepackaged ETL tool. The decision to buy or build your ETL tool depends on multiple factors, but a few tools stand out if you decide to go the “buy” route.
Fivetran is cloud-based and offers more than 130 ready-to-use data sources, plus the ability to add custom integrations. In our experience, the Fivetran ETL process can take less than five minutes (but keep in mind, it’s streamlined). Fivetran deals primarily with data integration, not so much transformation. Fivetran’s pricing plan is consumption-based.
Matillion is a cloud ETL platform with more than 100 integrated data sources. Pricing is based on the time the program is turned on, not based on usage.
Qlik Replicate (formerly known as Attunity Replicate) works on-prem and in the cloud, supporting all major data sources. Pricing is based on the number of cores on your database source servers.
Talend is compatible with both on-prem and cloud data sources, and it includes hundreds of pre-built source integrations. There’s an open-source version, as well as a paid platform more suited to larger organizations that would benefit from additional tools and features.
Not sure where to start after all of this information? Set up a complimentary data strategy whiteboarding session with 2nd Watch. We’ll walk through the current state of your data and your data and analytics goals to determine the best path to the insights you need.
When using a modern data warehouse, your organization is likely to see improved access to your data and more impactful analytics. One such data warehouse is Azure Synapse, a Microsoft service. When paired with a powerful BI tool, like Looker, or a data science platform, like Dataiku, your organization can more quickly gain access to impactful insights that will help you drive business decisions across the enterprise.
In this post, we’ll provide a high-level overview of Azure Synapse, including a description of the tool, why you should use it, pros and cons, and complementary tools and technologies.
Overview of Azure Synapse
Azure Synapse is Microsoft’s service that puts an umbrella around various existing and new offerings, including Azure DW, Azure Databricks, and on-demand SQL querying, but it currently lacks tight integration across these services. Similar to Redshift, Azure DW is charged by instance size and time running, while other Synapse services offer more of a consumption-based model.
Once data is stored within Azure Data Lake, no need to stage data again within the warehouse
Easy to scale up or down on the fly with use of Azure
Increases in pricing tiers only increase concurrent queries by 4 at each level
Built for MPP (massive parallel processing)
Performance optimal for data volumes larger than 1TB
Not suitable for running high volumes of concurrent queries (four concurrent requests per service level)
Requires active performance tuning (indexes, etc.)
Native connection with Power BI
Can select either a serverless SQL pool or a dedicated SQL pool based on the needs of the organization
Supports the ability to run Spark on Databricks
Core product still relies on Azure DW, an older technology
Supports row-level and column-level security, multi-factor authentication, and Azure AD integration
Why Use Azure Synapse
The Microsoft SQL Server ecosystem is familiar, with tighter integrations into Azure’s data ecosystem, including Azure Databricks and the MPP version of SQL Server, Azure DW – just don’t expect a turnkey solution quite yet.
Pros of Azure Synapse
Can be easily provisioned with existing Azure subscription and provides pay-as-you-go pricing
Integration with Azure Active Directory and Azure Purview can provide an easy way to manage user roles and insights into data
Transferable knowledge from an on-premises Microsoft SQL Server background
Cons of Azure Synapse
“Synapse” is largely a marketing umbrella of technologies, with Azure DW at its core, requiring management of disparate services
Difficulty managing high volumes of concurrent queries due to tuning and cost of higher service tiers
Requires complex database administration tasks, including performance tuning, which other cloud data solutions have made more turnkey
Serverless capabilities are limited to newer Azure services; Azure DW itself lacks on-demand, frictionless sizing of compute
Select Complementary Tools and Technologies for Azure Synapse
Azure Analysis Services
Azure Data Factory
We hope you found this high-level overview of Azure Synapse helpful. If you’re interested in learning more about Azure Synapse or other modern data warehouse tools like Amazon Redshift, Google BigQuery, and Snowflake, contact us to learn more.
Master data management, commonly called MDM, is an increasingly hot topic. You may have heard the term thrown around and been wondering, “What is MDM?” and, “Does my business need it?” We’re sharing a crash course in MDM to cover those questions, and we may even answer some you haven’t thought of asking yet.
What is MDM?
MDM allows for successful downstream analytics as well as synchronization of data to systems across your business. The process involves three major steps:
Ingest all relevant data in a repository.
Use an MDM tool (such as Riversand or Semarchy) to “goldenize” the data. In other words, create one current, complete, and accurate record.
Send the goldenized data downstream for analytics and back to the original source.
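To make step 2 concrete, here is a minimal Python sketch of one common survivorship rule for goldenizing matched records: for each field, keep the most recently updated non-null value. The source systems, field names, and dates below are hypothetical, and real MDM tools apply far richer matching and survivorship logic:

```python
from datetime import date

# Hypothetical records for the same customer from three source systems.
records = [
    {"source": "oracle", "email": "j.doe@example.com", "phone": None,
     "name": "Jane Doe", "updated": date(2021, 3, 1)},
    {"source": "salesforce", "email": "j.doe@example.com", "phone": "555-0100",
     "name": "Jane D.", "updated": date(2021, 6, 15)},
    {"source": "hubspot", "email": None, "phone": "555-0100",
     "name": "Jane Doe", "updated": date(2020, 11, 30)},
]

def goldenize(matched_records):
    """Build one golden record: for each field, take the most recently
    updated non-null value (a simple survivorship rule)."""
    ordered = sorted(matched_records, key=lambda r: r["updated"], reverse=True)
    golden = {}
    for field in ("email", "phone", "name"):
        golden[field] = next(
            (r[field] for r in ordered if r[field] is not None), None)
    return golden

golden = goldenize(records)
print(golden)
# {'email': 'j.doe@example.com', 'phone': '555-0100', 'name': 'Jane D.'}
```

The resulting record is current, complete, and consistent, which is what makes it safe to send downstream for analytics and back to the source systems.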
Let’s say one part of your business stores customer data in Oracle and a different area has customer data in Salesforce. And maybe you’re acquiring a business that stores customer data in HubSpot in some instances. You want to be able to access all of this information, understand what you’re accessing, and then analyze the data to help you make better business decisions. This is when you would turn to MDM.
Do we need an MDM solution?
First, answer a few questions:
This chart is a bit of an oversimplification but serves as a starting point. If you want to better understand how MDM could impact your business and ensure your MDM solution is customized to your needs, you can work with a data and analytics consulting firm like 2nd Watch.
When should I get help implementing an MDM solution?
An MDM company will be skilled at creating the “golden record” of your data. (Reminder: This is one current, complete, and accurate record.) However, they typically lack the ability to provide guidance going forward.
A data and analytics consulting firm like 2nd Watch will take a broad view of the current state of your business and its individual entities, while also considering where you want your organization to go. We partner with leading MDM providers like Riversand to ensure you get the best possible MDM solution. We can then guide the MDM implementation and set you up for next steps like creating reports and developing a data governance strategy.
Having an advocate for your future state goals during an MDM implementation is particularly important because there are four types of MDM architecture styles: registry, consolidation, coexistence, and centralized. You don’t need to understand the nuances of these MDM styles; that’s our job. 2nd Watch’s MDM experience has helped us develop recommendations on which style works best for different needs, so we can quickly generate value but also follow data best practices.
Even more important is how an MDM solution evolves over time. As your business changes, your data will change with it. 2nd Watch can develop an MDM solution that keeps up with your needs, moving through the progressively complex styles of MDM architecture as your organization grows and expands.
Data governance is a broad-ranging discipline that affects everyone in an organization, whether directly or indirectly. It is most often employed to improve and consistently manage data through deduplication and standardization, among other activities, and can have a significant and sustained effect on reducing operational costs, increasing sales, or both.
Data governance can also be part of a more extensive master data management (MDM) program. The MDM program an organization chooses and how they implement it depends on the issues they face and both their short- and long-term visions.
For example, in the insurance industry, many companies sell various types of insurance policies that renew annually over a number of years, such as industrial property coverages and workers’ compensation casualty coverages. Two sets of underwriters will more than likely underwrite the business. Having two sets of underwriters using data systems specific to their lines of business is an advantage when meeting the coverage needs of their customers, but it often becomes a disadvantage when the organization needs to look across all of its data. It doesn’t have to be, though.
The disadvantage arises when an agent or account executive needs to know the overall status of a client, including long-term profitability during all the years of coverage. This involves pulling data from policy systems, claims systems, and customer support systems. An analyst may be tasked with producing a client report for the agent or account executive to truly understand their client and make better decisions on both the client and company’s behalf. But the analyst may not know where the data is stored, who owns the data, or how to link clients across disparate systems.
Fifteen years ago, this task was very time-consuming and even five years ago was still quite cumbersome. Today, however, this issue can be mitigated with the correct data governance plan. We will go deeper into data governance and MDM in upcoming posts; but for this one, we want to show you how innovators like Snowflake are helping the cause.
What is data governance?
Data governance ensures that data is consistent, accurate, and reliable, which allows for informed and effective decision-making. This can be achieved by centralizing data from any number of siloed locations into one repository. Ensuring that data is accessible in one location enables data users to understand and analyze it to make effective decisions. One way to accomplish this centralization of data is to implement the Snowflake Data Cloud.
Snowflake not only enables a company to store their data inexpensively and query the data for analytics, but it can foster data governance. Dynamic data masking and object tagging are two new features from Snowflake that can supplement a company’s data governance initiative.
What is dynamic data masking?
Dynamic data masking is a Snowflake security feature that selectively masks plain-text data in table or view columns based on predefined masking policies. The purpose of masking, or hiding the data in specific columns, is to ensure that data is accessed on a need-to-know basis. This kind of data is most likely sensitive and doesn’t need to be accessed by every user.
When is dynamic data masking used?
Data masking is usually implemented to protect personally identifiable information (PII), such as a person’s social security number, phone number, home address, or date of birth. An insurance company would likely want to reduce risk by hiding data pertaining to sensitive information if they don’t believe access to the data is necessary for conducting analysis.
However, data masking can also be used for non-production environments where testing needs to be conducted on an application. The users testing the environment wouldn’t need to know specific data if their role is just to test the environment and application. Additionally, data masking may be used to adhere to compliance requirements like HIPAA.
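In Snowflake, masking policies are written in SQL and attached to columns, with the policy deciding at query time what each role sees. As a language-neutral illustration, this Python sketch mimics the effect of a role-based policy; the role names and masked format are hypothetical:

```python
# Roles allowed to see unmasked PII (hypothetical role names).
PRIVILEGED_ROLES = {"COMPLIANCE_OFFICER", "PII_ADMIN"}

def mask_ssn(value: str, current_role: str) -> str:
    """Mimic a masking policy: return the full value for privileged
    roles and a partially masked value for everyone else."""
    if current_role in PRIVILEGED_ROLES:
        return value
    return "***-**-" + value[-4:]

print(mask_ssn("123-45-6789", "ANALYST"))             # ***-**-6789
print(mask_ssn("123-45-6789", "COMPLIANCE_OFFICER"))  # 123-45-6789
```

The key property, which this sketch preserves, is that the underlying stored value never changes; only what a given role is shown does.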
What is object tagging?
Another resource for data governance within Snowflake is object tagging. Object tagging enables data stewards to track sensitive data for compliance and discovery, as well as grouping desired objects such as warehouses, databases, tables or views, and columns.
When a tag is created for a table, view, or column, data stewards can determine if the data should be fully masked, partially masked, or unmasked. When tags are associated with a warehouse, a user with the tag role can view the resource usage of the warehouse to determine what, when, and how this object is being utilized.
When is object tagging used?
There are several instances where object tagging can be useful; one use would be tagging “PII” to a column and adding extra text to describe the type of PII data located there. For example, a tag can be created for a warehouse dedicated to the sales department, enabling you to track usage and deduce why a specific warehouse is being used.
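The bookkeeping behind tagging can be sketched outside Snowflake as a simple registry of key-value tags on named objects. The object names and tag keys below are hypothetical; in Snowflake, tags are schema-level objects applied with SQL:

```python
from collections import defaultdict

# Hypothetical tag registry: maps an object name (warehouse, table,
# or column) to its key-value tags.
tags = defaultdict(dict)

def set_tag(obj, key, value):
    tags[obj][key] = value

# Tag sensitive columns with "PII" plus a description of the data,
# and tag a warehouse with its owning department.
set_tag("customers.ssn", "PII", "social security number")
set_tag("customers.phone", "PII", "phone number")
set_tag("wh_sales", "department", "sales")

def objects_with_tag(key):
    """Discovery: list every object carrying a given tag key."""
    return sorted(obj for obj, t in tags.items() if key in t)

print(objects_with_tag("PII"))  # ['customers.phone', 'customers.ssn']
```

That discovery query is the point of tagging: once sensitive objects are labeled, compliance and usage questions become simple lookups instead of manual audits.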
Where can data governance be applied?
Data governance applies to many industries that maintain a vast amount of data from their systems, including healthcare, supply chain and logistics, and insurance; and an effective data governance strategy may use data masking and object tagging in conjunction with each other.
As previously mentioned, one common use case for data masking is for insurance customers’ PII. Normally, analysts wouldn’t need to analyze the personal information of a customer to uncover useful information leading to key business decisions. Therefore, the administrator would be able to mask columns for the customer’s name, phone number, address, social security number, and account number without interfering with analysis.
Object tagging is also valuable within the insurance industry as there is such a vast amount of data collected and consumed. A strong percentage of that data is sensitive information. Because there is so much data and it can be difficult to track those individual pieces of information, Snowflake’s object tagging feature can help with identifying and tracking the usage of those sensitive values for the business user.
Using dynamic data masking and object tagging together, you can gain insight into where your sensitive data lives and how heavily specific warehouses, tables, or columns are being used.
Think back to the situation we mentioned earlier where the property coverage sales department is on legacy system X. During that same time period, the workers’ compensation sales department is on another legacy system Y. How are you supposed to create a report to understand the profitability of these two departments?
One option is to use Snowflake to store all of the data from both legacy systems. Once the information is in the Snowflake environment, object tagging would allow you to tag the databases or tables that involve data about their respective departments. One tag can be specified for property coverage and another tag can be set for workers’ compensation data. When you’re tasked with creating a report of profitability involving these two departments, you can easily identify which information can be used. Because the tag was applied to the database, it will also be applied to all of the tables and their respective columns. You would be able to understand what columns are being used. After the data from both departments is accessible within Snowflake, data masking can then be used to ensure that the new data is only truly accessible to those who need it.
This was just a small introduction to data governance and the new features that Snowflake has available to enable this effort. Don’t forget that this data governance effort can be a part of a larger, more intricate MDM initiative. In other blog posts, we touch more on MDM and other data governance capabilities to maintain and standardize your data, helping you make the most accurate and beneficial business decisions. If you have any questions in the meantime, feel free to get in touch.
It is commonly understood that data science can bring tremendous value to an organization. That being said, a pitfall companies often fall into when pursuing data science initiatives is hiring data scientists without having a clear vision around their goals, business impact, and expected results.
Before embarking on the lengthy (and expensive) journey of hiring a data scientist, take a step back and make sure your organization is data science ready. This includes developing a concrete, results-focused data science strategy and auditing your underlying data to ensure your data is accurate, consistent, and complete enough to support reliable analysis.
Step 1: Develop your data science strategy.
The process of hiring a data scientist requires an immense amount of time, money, and effort. It could cost your company up to $30,000 just to find a candidate with the desired skill set and personality fit. Add in the steadily increasing salary, which currently averages around $113,000 (not including benefits), and it is a huge investment. If you hire a data scientist without a clearly defined business goal for data science, you run the risk of burning through that investment and burning out the talent.
Showing a candidate that you have a strategy will inspire confidence in your organization and help them determine if they are up for the challenge. If you plan to hire and onboard a data scientist, you should not leave it up to them to determine their mission and where they fit. To get started on developing a strategy, have your IT team and business leaders join forces to find answers to some of the questions below:
What are our business problems and opportunities? Do the goals of our data science initiatives match the goals of our organization?
What data do we have to support analytics?
Which business or metric definitions vary across departments in our organization? Why do these knowledge silos exist, and how can they be overcome?
Can our current infrastructure support data science needs?
Are we prepared to change as an organization based on data science initiatives?
How can we effectively communicate data science results?
Step 2: Evaluate your company’s data science readiness.
Accurate and readily available data is essential for any data science project. The quality of the data you use for analysis directly impacts your outcomes: if nobody trusts your data, they will not trust your results, they will not use those insights to inform their decision-making, and your entire data science strategy will flop. Set your data science team up for success by providing clean and centralized data so they can hit the ground running.
While your data does not need to be perfect, you should at least ensure that your data is centralized and does not contain duplicated records or large amounts of missing information. Centralizing key information in a data warehouse eliminates time wasted on searching for the data or finding ways to work around data silos. Creating a system that cleans, organizes, and standardizes your data guarantees reliable information for everyone. It will not only help your new data scientist produce results faster, but it will also increase trust in their results around your organization and save hours of menial data cleansing done by your IT team. While the steps to achieve data science readiness are different for every company, they should all consider the same objectives.
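A basic readiness check like the one described above can be automated. This Python sketch, using hypothetical patient rows and field names, flags duplicate keys and counts missing values per field; a real audit would run against the warehouse itself:

```python
from collections import Counter

# Hypothetical rows pulled into a central repository.
rows = [
    {"id": 1, "email": "a@example.com", "dob": "1980-02-11"},
    {"id": 2, "email": None, "dob": "1975-07-04"},
    {"id": 1, "email": "a@example.com", "dob": "1980-02-11"},  # duplicate
]

def audit(rows, key="id"):
    """Report duplicated key values and per-field missing-value counts."""
    counts = Counter(r[key] for r in rows)
    duplicates = [k for k, c in counts.items() if c > 1]
    missing = {
        field: sum(1 for r in rows if r[field] is None)
        for field in rows[0]
    }
    return {"duplicate_keys": duplicates, "missing_by_field": missing}

print(audit(rows))
# {'duplicate_keys': [1], 'missing_by_field': {'id': 0, 'email': 1, 'dob': 0}}
```

Running a report like this before hiring gives you a concrete, measurable picture of how far your data is from "science ready," rather than a gut feeling.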
Step 3: Define clear and actionable business cases for data science.
A massive part of a successful data science strategy is understanding the insights data science can provide and how your business can act on that information. Start by brainstorming a variety of use cases, then determine which ones are the most actionable and relevant, and which provide the best competitive edge. If any of your ideas could save money, that is another great place to start. During this process there are no wrong answers. Identifying use cases can seem intimidating at first, but there are some very easy ways to get started:
Ask employees which common business questions go unanswered.
Look into what industry leaders (and your competitors) are doing. Whether it’s personalizing marketing messaging to customers or using models to identify insurance fraud, data science has use cases in any industry.
Find out what executives wish they could predict about your organization.
Reach out to experts. Many organizations (consulting companies and vendors) have implemented data science solutions at a variety of clients.
Identify time-consuming and complicated manual processes. Data scientists can likely automate these and make them more reliable.
So, you’ve been tasked with building an analytics dashboard. It’s tempting to jump into development straight away, but hold on a minute! There are numerous pitfalls that are easy to fall into and can ruin your plans for an attractive, useful dashboard. Here are five important principles for dashboard development to keep in mind every time you open up Power BI, Tableau, Looker, or any other BI tool.
1. Keep it focused and defined.
Before you start answering questions, you need to know exactly what you’re trying to find out. The starting point of almost any dashboarding project should be a whiteboarding session with the end users; the dashboard becomes a collection of visuals that can answer their questions.
For every single visual you create, make sure you’re answering a specific question. Each graph needs to be intentional and purposeful, and it’s very important to have your KPIs clearly defined well before you start building. If you don’t include your stakeholders from the very beginning, you’ll almost certainly have a lot more reworking to do after initial production is complete.
(Image courtesy of discourse.looker.com)
2. A good data foundation is key.
Generating meaningful visualizations is nearly impossible without a good data foundation. Unclean data means holes and problems will need to be patched and fixed further down the pipeline. Many BI tools have functions that can format/prepare your data and generate some level of relational modeling for building your visualizations. However, too much modeling and logic in the tool itself will lead to large performance issues, and most BI tools aren’t specifically built with data wrangling in mind. A well-modeled semantic layer in a separate tool that handles all the necessary business logic is often essential for performance and governance.
Don’t undervalue the semantic layer!
The semantic layer is the step in preparation where the business logic is performed, joins are defined, and data is formatted from its raw form so it’s understandable and logical for users going forward. For Power BI users, for example, you would likely generate tabular models within SSAS. With a strong semantic layer in place before you even get to the BI tool, there will be little to no data management to be done in the tool itself. This means there is less processing the BI tool needs to handle and a much cleaner governance system.
In many BI tools, you can load in a raw dataset and have a functional dashboard in 10 minutes. However, building a semantic layer forces you to slow down and put some time in upfront for definition, development, and reflection about what the data is and what insights you’re trying to get for your business. This ensures you’re actually answering the right questions.
This is one of the many strengths of Looker, which is built specifically to handle the semantic layer as well as create visualizations. It forces you to define the logic in the tool itself before you start creating visuals.
It’s often tempting to skip the data prep steps in favor of putting out a finished product quickly, but remember: Your dashboard is only as good as the data underneath it.
3. PLEASE de-clutter.
There are numerous obvious problems with the dashboard below, but there is one lesson to learn that many developers forget: Embrace white space! White space wants to be your friend. As in web development, trying to pack too many visuals into the same dashboard is a recipe for disaster. Edward Tufte calls this the “data-ink ratio” in his book The Visual Display of Quantitative Information, one of the first and most impactful resources on data visualization.
Basically, remove anything that isn’t essential, and move important but less immediately relevant information to a different page of the dashboard or report.
4. Think before using that overly complicated visual.
About to use a tree-map to demonstrate relationships among three variables at once? What about a 3-D, three-axis representation of sales? Most of the time: don’t. Visualizing data isn’t about making something flashy – it’s about creating something simple that someone can gain insight from at a glance. For almost any complex visualization, there is a simpler solution available, like splitting up the graph into multiple, more focused graphs.
5. Keep your interface clean, understandable, and consistent.
In addition to keeping your data clean and your logic well-defined, it’s important to make sure everything is understandable from start to finish and is easy to interpret by the end users. This starts with simply defining dimensions and measures logically and uniformly, as well as hiding excess and unused columns in the end product. A selection panel with 10 well-named column options is much easier than one with 30, especially if end-users will be doing alterations and exploration themselves.
You may notice a theme with most of these principles for dashboard development: Slow down and plan. It’s tempting to jump right into creating visuals, but never underestimate the value of planning and defining your steps first. Doing that will help ensure your dashboard is clean, consistent, and most important, valuable.
If you need help planning, implementing, or finding insights in your dashboards, the 2nd Watch team can help. Our certified consultants have the knowledge, training, and experience to help you drive the most value from your dashboard tool. Contact us today to learn about our data visualization starter pack.