Blockchain is one of those once-in-a-generation technologies that has the potential to really change the world around us. Despite this, blockchain is something that a lot of people still know nothing about. Part of that, of course, is because it’s such a new piece of technology that really only became mainstream within the past few years. The main reason, though, (and to address the elephant in the room) is because blockchain is associated with what some describe as “fake internet money” (i.e., Bitcoin). The idea of a decentralized currency with no guarantor is intimidating, but let’s not let that get in the way of what could be a truly revolutionary technology. So, before we get started, let’s remove the Bitcoin aspect and simply focus on blockchain. (Don’t worry, we’ll pick it back up later on.)
Blockchain, at its very core, is a database. But blockchains are different from traditional databases in that they are immutable, unable to be changed. Imagine this: Once you enter information into your shiny new blockchain, you don’t have to worry about anybody going in and messing up all your data. “But how is this possible?” you might ask.
Blockchains operate by taking data and structuring it into blocks (think of a block like a record in a database). This can be any kind of information, from names and numbers all the way to executable code scripts. There are a few essential pieces of information that should be placed in every block: an index (the block number), a timestamp, and the hash (more on this later) of the previous block. All of this data is compiled into a block, and a hashing algorithm is applied to the information.
After the hash is computed, the information is locked and you can’t change information without re-computing the hash. This hash is then passed on to the next block where it gets included in its data, creating a chain. The second block then compiles all of its own data and, including the hash of the previous block, creates a new hash and sends it to the next block in the chain. In this way, a blockchain is created by “chaining” together blocks by means of a block’s unique hash. In other words, the hash of one block is reliant on the hash of the previous block, which is reliant on that of the one before it, ad infinitum.
And there you go, you have a blockchain! Before we move on to the next step (which will really blow your mind), let’s recap:
You have Block-0. Information is packed into Block-0 and hashed, giving you Hash-0. Hash-0 is passed to Block-1, where it is combined with Block-1’s own data. So, Block-1’s data now includes its own information and Hash-0. This is then hashed to produce Hash-1, which is passed to the next block.
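As a toy sketch (not a production blockchain), the block structure and chaining described above might look like this in Python:

```python
import hashlib
import json
import time

def hash_block(block):
    """Serialize the block's contents and compute a SHA-256 hash."""
    encoded = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()

def make_block(index, data, previous_hash):
    """Bundle the essential fields: index, timestamp, data, and the previous block's hash."""
    return {
        "index": index,
        "timestamp": time.time(),
        "data": data,
        "previous_hash": previous_hash,
    }

# Block-0 (the genesis block) has no predecessor, so a placeholder hash is used.
block0 = make_block(0, "genesis data", "0" * 64)
hash0 = hash_block(block0)

# Block-1 includes Hash-0 in its own data, chaining it to Block-0.
block1 = make_block(1, "more data", hash0)
hash1 = hash_block(block1)

print(block1["previous_hash"] == hash0)  # True: this is the chain link
```

Change any field of Block-0 and `hash_block(block0)` no longer matches the `previous_hash` stored in Block-1, which is exactly the property that makes the chain tamper-evident.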
The second major aspect of blockchain is that it is distributed. This means that the entire protocol is operated across a network of nodes at the same time. All of the nodes in the network store the entire chain, along with all new blocks, at the same time and in real time.
Secure Data Is Good Data
Remember earlier when we said a blockchain is immutable? Let’s go back to that.
Suppose you have a chain 100 blocks long running on 100 nodes at once. Now let’s say you want to stage an attack on this blockchain to change Block-75. Because the chain is run and stored across 100 nodes simultaneously, you have to change Block-75 in all 100 nodes at the same time. Let’s imagine you are somehow able to hack into those other nodes to do this; now you have to rehash everything from Block-75 to Block-100 (and remember, rehashing is computationally expensive). So while you (the singular malicious node) are trying to rehash all of those blocks, the other 99 nodes in the network are working to hash new blocks, thereby extending the chain. This makes it practically impossible for a compromised chain to become valid, because it will never reach the length of the original chain.
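The validity check that defeats this attack can be sketched as a toy Python example: rehash each block and compare the result against the link stored in its successor.

```python
import hashlib
import json

def hash_block(block):
    """Serialize the block's contents and compute a SHA-256 hash."""
    encoded = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()

def build_chain(values):
    """Build a toy chain where each block stores the hash of its predecessor."""
    chain, prev_hash = [], "0" * 64
    for i, value in enumerate(values):
        block = {"index": i, "data": value, "previous_hash": prev_hash}
        chain.append(block)
        prev_hash = hash_block(block)
    return chain

def is_valid(chain):
    """Verify every stored previous_hash against a fresh rehash of its predecessor."""
    for prev, curr in zip(chain, chain[1:]):
        if curr["previous_hash"] != hash_block(prev):
            return False
    return True

chain = build_chain(["a", "b", "c", "d"])
print(is_valid(chain))         # True

chain[1]["data"] = "tampered"  # change Block-1 after the fact
print(is_valid(chain))         # False: the link into Block-2 no longer matches
```

An attacker would have to rehash every block after the one they changed, on every node, faster than the honest nodes extend the chain.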
About That Bitcoin Thing…
Now, there are two types of blockchains. Most popular blockchains are public, in which anybody in the world is able to join and contribute to the network. This requires some incentive, as without it nobody would join the network, and this comes in the form of “tokens” or “coins” (i.e., Bitcoin). In other words, Bitcoin is an incentive for people to participate and ensure the integrity of the chain. Then there are permissioned chains, which are run by individuals, organizations, or conglomerates for their own reasons and internal uses. In permissioned chains, only nodes with certain permissions are able to join and be involved in the network.
And there you go, you have the basics of blockchain. At a fundamental level, it’s an extremely simple yet ingenious idea with applications for supply chains, smart contracts, auditing, and many more to come. However, like any promising new technology, there are still questions, pitfalls, and risks to be explored. If you have any questions about this topic or want to discuss the potential for blockchain in your organization, contact us here.
Data ingestion is the first step in any analytical undertaking. It’s a process where data from one or many sources are gathered and imported into one place. Data can be imported in real time (like POS data) or in batches (like billing systems).
Why It Matters for Marketers:
The process of data ingestion consolidates all of the relevant information from across your data sources into a single, centralized storage system. Through this process, you can begin to convert disparate data created in your CRM, POS, and other source systems into a unified format that is ready for real-time or batch analysis.
Marketing teams pull data from a wide variety of resources, including Salesforce, Marketo, Facebook, Twitter, Google, Stripe, Zendesk, Shopify, Mailchimp, mobile devices, and more. It’s incredibly time-consuming to manually combine these data sources, but by using tools to automate some of these processes you can get data into the hands of your team faster.
This empowers marketers to answer more sophisticated questions about customer behavior, such as:
Why are customers leaving a specific product in their online shopping carts?
What is the probability that we’ll lose a customer early in the customer journey?
Which messaging pillar is resonating most with customers in the middle of the sales funnel who live in Germany?
Image 1: In this image, three source systems with varying formats and content are ingested into a central location in the data warehouse.
ETL vs. ELT
ETL and ELT are both data integration methods that make it possible to take data from various sources and move it into a singular storage space like a data warehouse. The difference is in when the transformation of data takes place.
As your business scales, ELT tools are better equipped to handle the volume and variety of marketing data on hand. However, a robust data plan will make use of both ELT and ETL tools.
For example, a member of your team wants to know which marketing channels are the most effective at converting customers with the highest average order value. The data you need to answer that question is likely spread across multiple structured data sources (e.g., referral traffic from Google Analytics, transaction history from your POS or e-commerce system, and customer data from your CRM).
Through your ETL process, you can extract relevant data from the above sources, transform it (e.g., updating customer contact info across files for uniformity and accuracy), and load the clean data into one final location. This enables your team to run your query in a streamlined way with limited upfront effort.
In comparison, your social media marketing team wants to see whether email click-through rates or social media interactions lead to more purchases. The ELT process allows them to extract and load all of the raw data in real time from the relevant source systems and run ad-hoc analytics reports, making adjustments to campaigns on the fly.
Extract, Transform, Load (ETL)
This method of data movement first copies data out of the original database, then converts it into a singular format. Lastly, the transformed data is loaded into a data warehouse for analytics.
When You Should Use ETL:
ETL processes are preferable for moving small amounts of structured data with no rush on when that data is available for use. A robust ETL process would clean and integrate carefully selected data sources to provide a single source of truth that delivers faster analytics and makes understanding and using the data extremely simple.
Image 2: This image shows four different data sources with varying data formats being extracted from their sources, transformed to all be formatted the same, and then loaded into a data warehouse. Having all the data sources formatted the same way allows you to have consistent and accurate data in the chart that is built from the data in the data warehouse.
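A minimal, toy sketch of the ETL pattern in Python (the sources and field names here are hypothetical stand-ins for real CRM and POS systems):

```python
# ETL sketch: extract from two hypothetical sources, transform to a
# uniform schema, then load into a single destination (here, just a list).

def extract():
    # In practice these would be API or database reads; here they are stubs.
    crm_rows = [{"Name": "Ada@example.com", "Spend": "120.50"}]
    pos_rows = [{"customer_email": "ada@example.com", "total": 75.0}]
    return crm_rows, pos_rows

def transform(crm_rows, pos_rows):
    # Normalize both sources to one schema *before* loading.
    unified = []
    for row in crm_rows:
        unified.append({"email": row["Name"].lower(), "amount": float(row["Spend"])})
    for row in pos_rows:
        unified.append({"email": row["customer_email"].lower(), "amount": float(row["total"])})
    return unified

def load(rows, warehouse):
    warehouse.extend(rows)

warehouse = []
load(transform(*extract()), warehouse)
print(warehouse[0])  # {'email': 'ada@example.com', 'amount': 120.5}
```

The key point is the ordering: by the time anything lands in the warehouse, it is already clean and uniform.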
Extract, Load, Transform (ELT)
Raw data is read from source databases, then loaded into the target system in its raw form. Raw data is usually stored in a cloud-based data lake or data warehouse, allowing you to transform only the data you need.
When You Should Use ELT:
ELT processes shine when there are large amounts of complex structured and unstructured data that need to be made available more immediately. ELT processes also upload and store all of your data in its raw format, making data ingestion faster. However, performing analytics on that raw data is a more complex process because cleaning and transformation happen post-upload.
Image 3: This image shows four different data sources with the data formatted in different ways. The data is extracted from the various sources, loaded into the data warehouse, and then transformed within the data warehouse so it is all formatted the same. This allows for accurate reporting of the data in the chart seen above.
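A toy ELT sketch in Python makes the reversed ordering visible: raw rows are loaded untouched, and transformation is deferred to query time (the sources and fields are hypothetical):

```python
# ELT sketch: load everything raw first; transform only the slice a query needs.

raw_lake = []

def load_raw(source_name, rows):
    # Load rows as-is, tagged with their source; no upfront cleaning.
    for row in rows:
        raw_lake.append({"source": source_name, "raw": row})

load_raw("email", [{"clicks": 42, "sends": 100}])
load_raw("social", [{"interactions": 310}])

def email_click_through_rate():
    # Transformation happens here, at analysis time, on just the email rows.
    email = [r["raw"] for r in raw_lake if r["source"] == "email"]
    return sum(r["clicks"] for r in email) / sum(r["sends"] for r in email)

print(email_click_through_rate())  # 0.42
```

Ingestion is fast because nothing blocks the load, but every report pays the transformation cost when it runs.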
A data pipeline is a series of steps in an automated process that moves data from one system to another, typically using ETL or ELT practices.
Why It Matters for Marketers:
The automatic nature of a data pipeline removes the burden of data manipulation from your marketing team. There’s no need to chase down the IT team or manually download files from your marketing automation tool, CRM, or other data sources to answer a single question. Instead, you can focus on asking the questions and homing in on strategy while the technology takes away the burden of tracking down, manipulating, and refreshing the information.
Say under the current infrastructure, your sales data is split between your e-commerce platform and your in-store POS systems. The different data formats are an obstacle to proper analysis, so you decide to move them to a new target system (such as a data warehouse).
A data pipeline would automate the process of selecting data sources, prioritizing the datasets that are most important, and transforming the data without any micromanagement of the tool. When you’re ready for analysis, the data will already be available in one destination and validated for accuracy and uniformity, enabling you to start your analysis without delay.
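As a toy illustration, that kind of pipeline can be sketched as an ordered series of Python functions; the source functions and field names below are hypothetical stand-ins for real systems.

```python
# A toy data pipeline: each step feeds the next, so refreshing the
# destination is a single automated call instead of manual file wrangling.

def extract_ecommerce():
    return [{"order_id": 1, "total": "19.99"}]  # stub for an e-commerce API read

def extract_pos():
    return [{"order_id": 2, "total": "5.00"}]   # stub for an in-store POS export

def combine(*sources):
    return [row for source in sources for row in source]

def validate(rows):
    # Coerce totals to floats for uniformity; drop rows that fail validation.
    clean = []
    for row in rows:
        try:
            clean.append({**row, "total": float(row["total"])})
        except (TypeError, ValueError):
            pass
    return clean

def run_pipeline():
    return validate(combine(extract_ecommerce(), extract_pos()))

print(run_pipeline())
```

In a real deployment each step would be a managed task in a scheduler or pipeline tool, but the shape is the same: extract, combine, validate, deliver.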
Data Storage Options
Databases, data warehouses, and data lakes are all systems for storing and using data, but there are differences to consider when choosing a solution for your marketing data.
A database is a central place where a structured and organized collection of data is stored on a computer and accessed via various applications, such as MailChimp, Rollworks, or Marketo, or even used for more traditional campaigns like direct mail. It is not meant for large-scale analytics.
A data warehouse is a specific way of structuring your data in database tables so that it is optimized for analytics. A data warehouse brings together all your various data sources under one roof and structures it for analytics.
A data lake is a vast repository of structured and unstructured data. It handles all types of data, and there is no hierarchy or organization to the storage.
Why It Matters for Marketers:
There are benefits and drawbacks to each type of data structure, and marketers should have a say in how data gets managed throughout the organization. For example, with a data lake, you will need to have a data scientist or other technical resource on staff to help make sense of all the data, but your marketing team can be more self-sufficient with a database or data warehouse.
Without organization and structure, the insights your data holds can be unreliable and hard to find. Pulling data from various source systems is often time-consuming and requires tedious and error-prone reformatting of the data in order to tell a story or answer a question. A database can help to store data from multiple sources in an organized central location.
Without databases, your team would have to use multiple Excel sheets and manual manipulation to store the data needed for analysis. This means your team would have to manually match up or copy/paste each Excel sheet’s data in order to create one place to analyze all of your data.
A data warehouse delivers an extra layer of organization across all databases throughout your business. Your CRM, sales platform, and social media data differ in format and complexity but often contain data about similar subjects. A data warehouse brings together all of those varying formats into a standardized and holistic view structured to optimize reporting. When that data is consolidated from across your organization, you can obtain a complete view of your customers, their spending habits, and their motivations.
You might hear people say “enterprise data warehouse” or “EDW” when they talk about data. This is a way to structure data that makes answering questions via reports quick and easy. More importantly, EDWs often contain information from the entire company, not just your function or department. Not only can you answer questions about your customer or marketing-specific topics, but you can understand other concepts such as the inventory flow of your products. With that knowledge, you can determine, for example, how inventory delays are correlated to longer shipping times, which often result in customer churn.
A data lake is a great option for organizations that need more flexibility with their data. The ability for a data lake to hold all data—structured, semi-structured, or unstructured—makes it a good choice when you want the agility to configure and refigure models and queries as needed. Access to all the raw data also makes it easier for data scientists to manipulate the data.
You want to get real-time reports from each step of your SMS marketing campaign. Using a data lake enables you to perform real-time analytics on the number of messages sent, the number of messages opened, how many people replied, and more. Additionally, you can save the content of the messages for later analysis, delivering a more robust view of your customer and enabling you to increase personalization of future campaigns.
So, how do you choose?
You might not have to pick just one solution. In fact, it might make sense to use a combination of these systems. Remember, the most important thing is that you’re thinking about your marketing data, how you want to use it, what makes sense for your business, and the best way to achieve your results.
Hopefully this information has helped you better understand your options for data ingestion and storage. Feel free to contact us with any questions or to learn more about data ingestion and storage options for your marketing data.
Building a data pipeline can be daunting due to the complexities involved in safely and efficiently transferring data. Companies create tons of disparate data throughout their organizations through applications, databases, files and streaming sources. Moving the data from one data source to another is a complex and tedious process.
Ingesting different types of data into a common platform requires extensive skill and knowledge of both the data types in use and their sources.
Due to these complexities, this process can be faulty, leading to inefficiencies like bottlenecks or the loss or duplication of data. As a result, data analytics becomes less accurate and less useful and, in many instances, provides inconclusive or just plain inaccurate results.
For example, a company might be looking to pull raw data from a database or CRM system and move it to a data lake or data warehouse for predictive analytics. To ensure this process is done efficiently, a comprehensive data strategy needs to be deployed necessitating the creation of a data pipeline.
What is a Data Pipeline?
A data pipeline is a set of actions organized into processing steps that integrates raw data from multiple sources to one destination for storage, business intelligence (BI), data analysis, and visualization.
There are three key elements to a data pipeline: source, processing, and destination. The source is the starting point for a data pipeline. Data sources may include relational databases and data from SaaS applications. There are two different methods for processing or ingesting models: batch processing and stream processing.
Batch processing: Occurs when the source data is collected periodically and sent to the destination system. Batch processing enables the complex analysis of large datasets. Because batch processing occurs periodically, the insights gained from this type of processing are from information and activities that occurred in the past.
Stream processing: Occurs in real-time, sourcing, manipulating, and loading the data as soon as it’s created. Stream processing may be more appropriate when timeliness is important because it takes less time than batch processing. Additionally, stream processing comes with lower cost and lower maintenance.
The destination is where the data is stored, such as an on-premises or cloud-based location like a data warehouse, a data lake, a data mart, or a certain application. The destination may also be referred to as a “sink”.
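The difference between the two ingestion models can be sketched with a toy Python example: the batch function sees a whole period's records at once, while the stream handler updates state the moment each record arrives.

```python
# Batch: collect a period's worth of records, then process them together.
def process_batch(records):
    return sum(r["amount"] for r in records)

# Stream: update results as each record arrives.
class RunningTotal:
    def __init__(self):
        self.total = 0

    def on_event(self, record):
        self.total += record["amount"]

sales = [{"amount": 10}, {"amount": 25}, {"amount": 5}]

print(process_batch(sales))  # 40, but only after the batch window closes

stream = RunningTotal()
for record in sales:  # simulate events arriving one at a time
    stream.on_event(record)
print(stream.total)   # 40, and it was up to date after every single event
```

Both paths reach the same answer; the difference is when the answer becomes available.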
Data Pipeline vs. ETL Pipeline
One popular subset of a data pipeline is an ETL pipeline, which stands for extract, transform, and load. While popular, the term is not interchangeable with the umbrella term of “data pipeline”. An ETL pipeline is a series of processes that extract data from a source, transform it, and load it into a destination. The source might be business systems or marketing tools with a data warehouse as a destination.
There are a few key differentiators between an ETL pipeline and a data pipeline. First, ETL pipelines always involve data transformation and are processed in batches, while data pipelines often ingest in real time and do not always involve data transformation. Additionally, an ETL pipeline ends with loading the data into its destination, while a data pipeline doesn’t always end with the loading step. Instead, the load can activate new processes by triggering webhooks in other systems.
Uses for Data Pipelines:
To move, process, and store data
To perform predictive analytics
To enable real-time reporting and metric updates
Uses for ETL Pipelines:
To centralize your company’s data
To move and transform data internally between different data stores
To enrich your CRM system with additional data
9 Popular Data Pipeline Tools
Although a data pipeline helps organize the flow of your data to a destination, managing the operations of your data pipeline can be overwhelming. For efficient operations, there are a variety of useful tools that serve different pipeline needs. Some of the best and most popular tools include:
AWS Data Pipeline: Easily automates the movement and transformation of data. The platform helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available.
Azure Data Factory: A data integration service that allows you to visually integrate your data sources with more than 90 built-in, maintenance-free connectors.
Etleap: A Redshift data pipeline tool that’s analyst-friendly and maintenance-free. Etleap makes it easy for businesses to move data from disparate sources to a Redshift data warehouse.
Fivetran: A platform that emphasizes the ability to unlock faster time to insight, rather than having to focus on ETL using robust solutions with standardized schemas and automated pipelines.
Google Cloud Dataflow: A unified stream and batch data processing platform that simplifies operations and management and reduces the total cost of ownership.
Keboola: A SaaS platform that starts for free and covers the entire pipeline operation cycle.
Segment: A customer data platform used by businesses to collect, clean, and control customer data to help them understand the customer journey and personalize customer interactions.
Stitch: A cloud-first platform that rapidly moves data to your business’s analysts within minutes so that it can be used according to your requirements. Instead of making you focus on your pipeline, Stitch helps reveal valuable insights.
Xplenty: A cloud-based platform for ETL that is beginner-friendly, simplifying the ETL process to prepare data for analytics.
When we talk about high performance computing (HPC) we are typically trying to solve some type of problem. These problems will generally fall into one of four types:
Compute Intensive – A single problem requiring a large amount of computation.
Memory Intensive – A single problem requiring a large amount of memory.
Data Intensive – A single problem operating on a large data set.
High Throughput – Many unrelated problems that can be computed in bulk.
In this post, I will provide a detailed introduction to High Performance Computing (HPC) that can help organizations solve the common issues listed above.
Compute Intensive Workloads
First, let us take a look at compute intensive problems. The goal is to distribute the work for a single problem across multiple CPUs to reduce the execution time as much as possible. In order to do this, we need to execute steps of the problem in parallel. Each process (or thread) takes a portion of the work and performs the computations concurrently. The CPUs typically need to exchange information rapidly, requiring specialized communication hardware. Examples include financial modeling and risk-exposure analysis, in both traditional business and healthcare use cases. This is probably the largest portion of HPC problem sets and is the traditional domain of HPC.
When attempting to solve compute intensive problems, we may think that adding more CPUs will reduce our execution time. This is not always true. Most parallel code bases have what we call a “scaling limit”. This is in no small part due to the system overhead of managing more copies, but also to more basic constraints.
This is summed up brilliantly in Amdahl’s law.
In computer architecture, Amdahl’s law is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. It is named after computer scientist Gene Amdahl, and was presented at the AFIPS Spring Joint Computer Conference in 1967.
Amdahl’s law is often used in parallel computing to predict the theoretical speedup when using multiple processors. For example, if a program needs 20 hours using a single processor core, and a particular part of the program which takes one hour to execute cannot be parallelized, while the remaining 19 hours (p = 0.95) of execution time can be parallelized, then regardless of how many processors are devoted to a parallelized execution of this program, the minimum execution time cannot be less than that critical one hour. Hence, the theoretical speedup is limited to at most 20 times (1/(1 − p) = 20). For this reason, parallel computing with many processors is useful only for very parallelizable programs.
Amdahl’s law can be formulated the following way:
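Using the quantities defined just below, the standard formulation is:

```latex
S_\text{latency}(s) = \frac{1}{(1 - p) + \frac{p}{s}}
```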
S_latency is the theoretical speedup of the execution of the whole task.
s is the speedup of the part of the task that benefits from improved system resources.
p is the proportion of execution time that the part benefiting from improved resources originally occupied.
Chart Example: If 95% of the program can be parallelized, the theoretical maximum speed up using parallel computing would be 20 times.
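To make the numbers concrete, here is a small Python sketch of the formula 1 / ((1 − p) + p/s) applied to the 95% example:

```python
def amdahl_speedup(p, s):
    """Theoretical overall speedup when a fraction p of the work
    is sped up by a factor of s (Amdahl's law)."""
    return 1.0 / ((1.0 - p) + p / s)

# The 20-hour example from above: p = 0.95 of the runtime parallelizes.
print(round(amdahl_speedup(0.95, 2), 2))     # 1.9 with 2 processors
print(round(amdahl_speedup(0.95, 4096), 1))  # 19.9 with 4,096 processors

# As s grows without bound, the speedup approaches 1 / (1 - p) = 20.
```

Notice how quickly the curve flattens: going from 2 to 4,096 processors buys roughly a 10x improvement, and no number of processors gets past 20x.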
Bottom line: As you create more sections of your problem that are able to run concurrently, you can split the work between more processors and thus, achieve more benefits. However, due to complexity and overhead, eventually using more CPUs becomes detrimental instead of actually helping.
There are libraries that help with parallelization, like OpenMP or Open MPI, but before moving to these libraries, we should strive to optimize performance on a single CPU, then make p as large as possible.
Memory Intensive Workloads
Memory intensive workloads require large pools of memory rather than multiple CPUs. In my opinion, these are some of the hardest problems to solve, and they typically require great care when building machines for your system. Coding and porting is easier because the memory will appear seamless, allowing for a single system image. Optimization becomes harder, however, as machines age, because component uniformity is lost over time: traditionally, in the data center, you don’t replace every single server every three years. If we want more resources in our cluster and we want performance to be uniform, non-uniform memory introduces real latency. We also have to think about the interconnect between the CPU and the memory.
Nowadays, many of these concerns have been eliminated by commodity servers. We can ask for thousands of the same instance type with the same specs and hardware, and companies like Amazon Web Services are happy to let us use them.
Data Intensive Workloads
This is probably the most common workload we find today, and probably the type with the most buzz. These are known as “Big Data” workloads. Data intensive workloads are the type of workloads suitable for software packages like Hadoop or MapReduce. We distribute the data for a single problem across multiple CPUs to reduce the overall execution time. The same work may be done on each data segment, though this is not always the case. This is essentially the inverse of a memory intensive workload in that rapid movement of data to and from disk is more important than the interconnect. In academia, the problems being solved in these workloads tend to be in the life sciences (genomics); they also have a wide reach in commercial applications, particularly around user data and interactions.
High Throughput Workloads
Batch processing jobs (jobs with almost trivial operations to perform in parallel, as well as jobs with little to no inter-CPU communication) are considered high throughput workloads. In high throughput workloads, the emphasis is on throughput over a period of time rather than on performance on any single problem. We distribute multiple problems independently across multiple CPUs to reduce overall execution time. These workloads should:
Break up naturally into independent pieces.
Have little or no inter-CPU communication.
Be performed in separate processes or threads on separate CPUs (concurrently).
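A minimal sketch of such a workload in Python, using a thread pool as a toy stand-in for a real HPC scheduler fanning independent tasks out across workers:

```python
from concurrent.futures import ThreadPoolExecutor

def analyze(record):
    # Each task is fully independent: no communication with other tasks.
    return record * record

records = list(range(100))  # many unrelated problems to compute in bulk

# Distribute the independent pieces across a pool of workers; map returns
# results in input order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(analyze, records))

print(results[:5])  # [0, 1, 4, 9, 16]
```

Because the tasks never talk to each other, adding workers scales throughput almost linearly, which is exactly what distinguishes this class from tightly coupled compute intensive jobs.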
Compute intensive workloads can often be broken into high throughput jobs; however, a high throughput job is not necessarily compute intensive.
HPC On Amazon Web Services
Amazon Web Services (AWS) provides on-demand scalability and elasticity for a wide variety of computational and data-intensive workloads, including workloads that represent many of the world’s most challenging computing problems: engineering simulations, financial risk analyses, molecular dynamics, weather prediction, and many more.
– AWS: An Introduction to High Performance Computing on AWS
Amazon literally has everything you could possibly want in an HPC platform. For every type of workload listed here, AWS has one or more instance classes to match and numerous sizes in each class, allowing you to get very granular in the provisioning of your clusters.
Speaking of provisioning, there is even a tool called CfnCluster that builds and manages High Performance Computing (HPC) clusters on AWS. Once a cluster is created, you can log into it via the master node, where you will have access to standard HPC tools such as schedulers, shared storage, and an MPI environment.
For data intensive workloads, there are a number of options to help get your data closer to your compute resources.
EBS is even a viable option for creating large-scale parallel file systems to meet the high-volume, high-performance, and throughput requirements of workloads.
HPC Workloads & 2nd Watch
2nd Watch can help you solve complex science, engineering, and business problems using applications that require high bandwidth, enhanced networking, and very high compute capabilities.
Increase the speed of research by running high performance computing (HPC) in the cloud and reduce costs by paying for only the resources that you use, without large capital investments. With 2nd Watch, you have access to a full-bisection, high-bandwidth network for tightly coupled, IO-intensive workloads, which enables you to scale out across thousands of cores for throughput-oriented applications. Contact us today to learn more about high performance computing (HPC).
2nd Watch Customer Success
Celgene is an American biotechnology company that manufactures drug therapies for cancer and inflammatory disorders. Read more about their cloud journey and how they went from doing research jobs that previously took weeks or months, to just hours. Read the case study.
We have also helped a global finance & insurance firm prove their liquidity time and time again in the aftermath of the 2008 recession. By leveraging the batch computing solution that we provided for them, they are now able to scale out their computations across 120,000 cores while validating their liquidity with no CAPEX investment. Read the case study.
– Lars Cromley, Director of Engineering, Automation, 2nd Watch
The Product Development team at 2nd Watch is responsible for many technology environments that support our software and solutions—and ultimately, our customers. These environments need to be easily built, maintained, and kept in sync. In 2016, 2nd Watch performed an analysis on the amount of AWS billing data that we had collected and the number of payer accounts we had processed over the course of the previous year. Our analysis showed that these measurements had more than tripled from 2015, and projections showed that we would continue to grow at the same rapid pace, with AWS usage and client onboarding increasing daily. Knowing that the storage of data is critical for many systems, our Product Development team undertook an evaluation of the database architecture used to house our company’s billing data—a single SQL Server instance running a Web edition of SQL Server with the maximum number of EBS volumes attached.
During the evaluation, areas such as performance, scaling, availability, maintenance, and cost were considered and deemed most important for future success. The evaluation revealed that our current billing database architecture could not meet the criteria laid out to keep pace with growth. Considerations were made to increase the VM to the maximum size in its instance family or to potentially upgrade to MS SQL Enterprise. In either scenario, the cost of the MS SQL instance doubled. The only option for scaling without substantially increasing our cost was to scale vertically; however, doing so would result in diminishing performance gains. Maintenance of the database had also become a full-time job that was increasingly difficult to manage.
Ultimately, we chose the cloud-native solution, Amazon Aurora, for its scalable, low-risk, easy-to-use technology. Amazon Aurora is a MySQL-compatible relational database that provides speed and reliability at a lower cost. It offers greater than 99.99% availability and can store up to 64TB of data. Aurora is self-healing and fully managed, which, along with its other key features, made Amazon Aurora an easy choice as we continue to meet the AWS billing usage demands of our customers and prepare for future growth.
The conversion from MS SQL to Amazon Aurora was successfully completed in early 2017 and, with the benefits and features that Amazon Aurora offers, gains were made in multiple areas. Product Development can now reduce the complexity of database schemas because of the way Aurora stores data. For example, a database with one hundred tables and hundreds of stored procedures was reduced to one table with 10 stored procedures. Gains were made in performance as well: the billing system produces thousands of queries per minute, and Amazon Aurora handles the load with the ability to scale to accommodate the increasing number of queries. Maintenance of the Amazon Aurora system now requires virtually no manual management. Tasks such as database backups are automated, without the complicated task of managing disks. Additionally, data is copied across six replicas in three availability zones, which ensures availability and durability.
With Amazon Aurora, every environment is now easily built and set up using Terraform. All infrastructure is automatically provisioned—from the web tier to the database tier—with Amazon CloudWatch logs to alert the company when issues occur. Data can easily be imported using automated processes and even anonymized if there is sensitive data or if the environment is used for customer demos. With the conversion of our database architecture from a single MS SQL Server instance to Amazon Aurora, our Product Development team can now focus on accelerating development instead of maintaining its data storage system.