Businesses have been collecting data for decades, but we’re only just starting to understand how best to apply new technologies, like machine learning and AI, for analysis. Fortunately, the cloud offers tools to maximize data use. When starting any data project, the best place to begin is by exploring common data problems to gain valuable insights that will help create a strategy for accomplishing your overall business goal.
Why do businesses need data?
The number one reason enterprise organizations need data is for decision support. Business moves faster today than it ever has, and to keep up, leaders need more than a ‘gut feeling’ on which to base decisions. Data doesn’t make decisions for us, but rather augments and influences which path forward will yield the results we desire.
Another reason we all need data is to align strategic initiatives from the top down. When C-level leaders decide to pursue company-wide change, managers need data-based goals and incentives that run parallel with the overall objectives. For change to be successful, there need to be metrics in place to chart progress. Benchmarks, monthly or quarterly goals, department-specific stats, and so on are all used to facilitate achievement and identify intervention points.
We’ve never before had more data available to us than we do today. While making the now necessary decision to utilize your data for insights is the first step, finding data, cleaning it, understanding why you want it, and analyzing the value and application can be intensive. Ask yourself these five questions before diving into a data project to gain clarity and avoid productivity-killing data issues.
1. Is your data relevant?
- What kind of value are you getting from your data?
- How will you apply the data to influence your decision?
2. Can you see your data?
- Are you aware of all the data you have access to?
- What data do you need that you can’t see?
3. Can you trust your data?
- Do you feel confident making decisions based on the data you have?
- If you’re hesitant to use your data, why do you doubt its authenticity?
4. Do you know the recency of your data?
- When was the data collected? How does that influence relevancy?
- Are you getting the data you need, when you need it?
5. Where is your data siloed?
- What SaaS applications do different departments use? (For example: Workday for HR, HubSpot for marketing, Salesforce for Sales, MailChimp, Trello, Atlassian, and so on.)
- Do you know where all of your data is being collected and stored?
Cloud to the rescue! But only with accurate data
The cloud is the most conducive environment for data analysis because of the plethora of analysis tools it makes available. More and more tools, like plug-and-play machine learning algorithms, are developed every day, and they are widely and easily available in the cloud.
But tools can’t do all the work for you. Tools cannot unearth the value of data. It’s up to you to know why you’re doing what you’re doing. What is the business objective you’re trying to get to? Why do you care about the data you’re seeking? What do you need to get out of it?
A clearly defined business objective is incredibly important to any cloud initiative involving data. Once that’s been identified, it’s important for that goal to serve as the guiding force behind the tools you use in the cloud. Because tools are really for developers and engineers, you want to pair them with someone engaging in the business value of the effort as well. Maybe it’s a business analyst or a project manager, but the team should include someone who is in touch with the business objective.
However, you can’t completely rely on cloud tools to solve data problems because you probably have dirty data, or data that isn’t correct or in the specified format. If your data isn’t accurate, all the tools in the world won’t help you accomplish your objectives. Dirty data interferes with analysis and creates a barrier to your data providing any value.
To cleanse your data, you need to validate the data coming in with quality checks. Typically, there are issues with dates and time stamps, spelling errors from form fields, and other human errors in data entry. Formatting date-entry fields and using calendar pickers can help users uniformly complete date information. Drop-down menus on form fields will reduce spelling errors and allow you to filter more easily. Small design changes like these can significantly help the cleanliness of your data and your ability to maximize the impact of cloud tools.
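As a rough illustration of the quality checks described above, the sketch below validates incoming records before they reach any analysis tool. The field names (`signup_date`, `department`), the accepted date formats, and the allowed department values are all assumptions made up for this example, not part of any particular system.

```python
from datetime import datetime

# Hypothetical rules for illustration only: the accepted date formats and
# the allowed department values are assumptions, not taken from a real system.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")
ALLOWED_DEPARTMENTS = {"HR", "Marketing", "Sales"}


def parse_date(value):
    """Return a date if the value matches an accepted format, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date()
        except ValueError:
            continue
    return None


def validate_record(record):
    """Return a list of problems found in one incoming record."""
    problems = []
    if parse_date(record.get("signup_date", "")) is None:
        problems.append("signup_date is missing or not in an accepted format")
    if record.get("department", "").strip() not in ALLOWED_DEPARTMENTS:
        problems.append("department is not one of the allowed values")
    return problems


records = [
    {"signup_date": "2016-03-14", "department": "Sales"},
    {"signup_date": "14th March", "department": "Slaes"},  # free-text entry errors
]
for record in records:
    print(record, "->", validate_record(record) or "ok")
```

Records that fail these checks can be rejected or routed for review, which is the programmatic counterpart of the calendar pickers and drop-down menus suggested above.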
Are you ready for data-driven decision making? Access and act on trustworthy data with the Data and Analytics services provided by 2nd Watch to enable smart, fast, and effective decisions that support your business goals. Contact Us to learn more about how to maximize your data use.
-Robert Whelan, Data Engineering & Analytics Practice Manager
Intellyx’s new Agile Digital Transformation Roadmap poster is here! The poster lays out the steps necessary for enterprises to align with customer preferences by implementing change as a core competency and features five main focus areas: customer experience, enterprise IT, agile architecture, devops, and big data.
“While digital transformation begins with a customer-focused technology transformation, in reality, it represents end-to-end business transformation as organizations establish change as a core competency,” says Jason Bloomberg, president of Intellyx and contributor to Forbes. “The Agile Digital Transformation Roadmap poster illustrates the complex, intertwined steps enterprises must take to achieve the benefits of digital transformation.”
The poster is the companion to Jason Bloomberg’s forthcoming book, Agile Digital Transformation, due in 2017. This book will lay out a practical approach for digitally transforming organizations to be more agile and innovative.
As an official sponsor of the poster, we’re giving you the download for free – enjoy!
Download the Poster
-Nicole Maus, Marketing Manager
Customized mobile device digital marketing gets a lot easier
When marketers think digital, they think mobile, but the best way to reach people on their smartphones is an app, not a website. Still, mobile apps are a double-edged sword for companies. They deliver more users with higher engagement but are also harder and more costly to develop and test. Given that mobile devices are inherently connected, the first cloud services emerged to simplify app development. Mobile backends and SDKs like Facebook Parse, Kumulos or AWS Mobile Services tackled backend services such as data management, synchronization, notification and analytics. Real-world testing is the latest service, courtesy of the AWS Device Farm, which provides virtual access to myriad mobile devices and operating environments. Device Farm, released in July, allows developers to easily test apps on hundreds of combinations of hardware and OS (with a constantly growing list) using either custom scripts or a standard AWS compatibility test suite. Although the service launched targeting the most acute problem, testing on fragmented Android, it now supports iOS as well. But the cloud service isn’t just able to provide instant access to a multitude of devices for hardware-specific tests – it also allows testing on multiple devices in parallel, which greatly cuts test time.
Bootstrapping mobile development with cloud services can yield huge dividends for organizations wanting to better connect with customers, employees and partners. Not only are there more mobile than desktop users, but their usage is heavier. The average adult in the US spends almost three hours per day consuming digital content on a mobile device, 11% more than just last year. This means that businesses without a mobile strategy don’t have any digital strategy.
The problem is that providing a richer, customized, differentiated experience requires building a custom mobile app – a task that’s made more daunting by the cornucopia of devices in use. It means supporting multiple versions of two operating systems and countless hardware variations. Although Apple users generally upgrade to the latest iOS release within months, the latest Android development stats show four versions with at least 13% usage. Worse yet, a 2015 OpenSignal survey of hundreds of thousands of Android devices found more than 24,000 distinct device types. Such diversity makes developing and thoroughly testing mobile apps vastly more complex than a website or PC application. One mobile app developer does QA testing on 400 different Android devices for every app – a testing nightmare that’s even worse when you consider that the mobile app release cycle is measured in weeks, not months. If ever a problem was in need of a virtualized cloud service, this is it; and AWS has delivered.
Device Farm takes an app archive (.apk file for Android or .ipa for iOS) and tests it against either custom scripts or an AWS compatibility suite using a fuzz test of random events. Test projects consist of the actual test suite (Device Farm supports five scripting languages), a device pool (specific hardware and OS versions) and any predefined device state such as other installed apps, required local data and device location. Aggregate results are presented on a summary screen with details, including any screenshots, performance data and log file output, available for each device.
Device Farm doesn’t replace the need for in-field beta testing and mobile app instrumentation to measure real-world usage, performance and failures. However, with thorough, well-crafted test suites and a diverse mix of device types, it promises to dramatically improve the end-user experience by eliminating problems that only manifest when running on actual hardware instead of an IDE simulator.
Developers can automate and schedule tests using the Device Farm API or via Jenkins using the AWS plugin. Like every AWS service, pricing is usage based, where the metric is the total test time for each device at $0.17 per device minute. Even so, with a judiciously selected device pool, it’s much cheaper than buying and configuring the actual hardware.
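To make the usage-based pricing concrete, here is a small back-of-the-envelope calculation using the $0.17 per device minute rate quoted above. The pool sizes and run length are hypothetical, and the sketch assumes every device in the pool is billed for the same number of minutes.

```python
PRICE_PER_DEVICE_MINUTE = 0.17  # USD, the rate quoted in this post


def estimate_run_cost(devices, minutes_per_device, price=PRICE_PER_DEVICE_MINUTE):
    """Estimate the cost of one run; each device is billed for its own minutes."""
    return round(devices * minutes_per_device * price, 2)


# Hypothetical pools: a curated set of 20 devices versus all 400 devices
# mentioned earlier, each running a 15-minute suite.
small_pool = estimate_run_cost(devices=20, minutes_per_device=15)
full_pool = estimate_run_cost(devices=400, minutes_per_device=15)
print(f"20 devices:  ${small_pool:.2f}")
print(f"400 devices: ${full_pool:.2f}")
```

Because cost scales linearly with the number of devices, trimming the pool to the devices your users actually own is the main lever for keeping each run inexpensive.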
Along with Mobile Services for backend infrastructure, Device Farm makes a compelling mobile app development platform, particularly for organizations already using AWS for website and app development.
To learn more about AWS Device Farm or to get started on your Digital Marketing initiatives, contact us.
-2nd Watch blog by Kurt Marko
The Amazon Web Services Cloud platform offers many unique advantages that can improve and expedite application development in ways traditional solutions cannot. Cloud computing eliminates the need for hardware procurement and makes resources available even to those with limited budgets. What once may have taken months to prepare can now be ready in weeks, days, or even hours. A huge advantage to developing in the Cloud is speed. You no longer have to worry about the infrastructure, storage, or computing capacity needed to build and deploy applications. Development teams can focus on what they do best – creating applications.
Server and networking infrastructure
Developing a new application platform from start to finish can be a lengthy process fraught with numerous hurdles from an operations and infrastructure perspective that cause unanticipated delays of all types. Issues such as budget restrictions, hardware procurement, datacenter capacity and network connectivity are some of the usual suspects when it comes to delays in the development process. Developers cannot develop a platform without the requisite server and networking hardware in place, and deployment of those resources traditionally can require a significant investment of money, time and people.
Items that may need to be taken into consideration for preparing an environment include:
- Networking hardware (e.g. switches and load balancers)
- Datacenter cabinets
- Power Distribution Units (PDU)
- Power circuits
- Cabling (e.g. power and network)
Delays related to any item on the above list can easily set back timeframes anywhere from a day to a few weeks. A short list of problems that can throw a wrench into plans includes:
- Potentially having to negotiate new agreements to lease additional datacenter space
- Hardware vendor inventory shortages (servers, switches, memory, disks, etc.)
- Bad network cross-connects, ports and transceivers
- Lack of hosting provider/datacenter cabinet space
- Over-provisioned power capacity requiring additional circuits and/or PDUs
- Defective hardware requiring RMA processing
- Long wait times for installation of hardware by remote-hands
In a perfect world, the ideal development and staging environments will exactly mirror production, providing zero variability across the various stacks, with the exception of perhaps endpoint connectivity. Maintenance of multiple environments can be very time consuming. Performing development and testing in the AWS Cloud can help to eliminate many of the above headaches.
AWS handles all hardware provisioning, allowing you to select the hardware you need when you want it and pay by the hour as you go. Eliminating up-front costs allows for testing on or near the exact same type of hardware as in production and with the desired capacity. No need to worry about datacenter cabinet capacity and power. No need to have a single server running multiple functions (e.g. database and caching) because there simply isn’t enough hardware to go around. For applications that are anticipated to handle significant amounts of processing, this can be extremely advantageous for testing at scale. This is a key area where compromises are usually made with regard to hardware platforms and capacity. It’s commonplace to have older hardware re-used for development purposes because that’s simply the only hardware available, or to have multiple stacks being developed on a single platform because of a lack of resources, whether from a budget or hardware perspective. In a nutshell, provisioning hardware can be expensive and time consuming.
Utilizing a Cloud provider such as AWS eliminates the above headaches and allows for quickly deploying an infrastructure at scale with the hardware resources you want, when you want them. There are no up-front hardware costs for servers or networking equipment, and you can ‘turn off’ the instances if and when they are not being used for additional savings. Eliminating up-front hardware costs also means shifting from capital expenses to operating expenses. Viewing hardware and resources from this perspective allows greater insight into expenses on a month-to-month basis, which can increase accountability and help to minimize and control spending. Beyond the issues of hardware and financing, there are numerous other benefits.
Ensuring software uniformity across stacks
This can be achieved by creating custom AMIs (Amazon Machine Images), allowing for the same OS and software packages to be included on deployed instances. Alternatively, User Data scripts which execute commands (e.g. software installation and configuration) can also be used for this purpose when provisioning instances. Being able to deploy multiple instances in minutes with just a few mouse clicks is now possible. Need to load test and find out if your platform can handle 50,000 transactions per second? Simply deploy additional instances that have been created using an AMI built from an existing instance and add them to a load balancer configuration. AWS also features the AWS Marketplace, which helps customers find, buy, and immediately start using the software and services they need to build products and run their businesses. Customers can select software from well-known vendors already packaged into AMIs that are ready to use upon instance launch.
Trying to duplicate a database store usually involves dumping a database and then re-inserting that data into another server, which can be very time consuming. First you have to dump the data, and then it has to be imported to the destination. A much faster method that can be utilized on AWS is to:
- Snapshot the datastore volume
- Create an EBS volume from the snapshot in the desired availability zone
- Attach the EBS volume to another instance
- Mount the volume and then start the database engine.
*Note that snapshots are region specific and will need to be copied if they are to be used in a region that differs from where they were originally created.
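The snapshot-based steps above can be scripted. The following is a minimal sketch, assuming the `ec2` argument behaves like a `boto3.client("ec2")` object; the function name, the description string, and the default device name are illustrative choices, and mounting the volume and starting the database engine still happen on the target instance itself.

```python
def clone_volume(ec2, source_volume_id, target_instance_id,
                 availability_zone, device="/dev/sdf"):
    """Copy a datastore volume to another instance by way of a snapshot.

    `ec2` is assumed to behave like boto3.client("ec2"); it is passed in
    so the sketch carries no import or credential requirements of its own.
    """
    # 1. Snapshot the datastore volume and wait until the snapshot completes.
    snapshot_id = ec2.create_snapshot(
        VolumeId=source_volume_id,
        Description="datastore clone",  # hypothetical label
    )["SnapshotId"]
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # 2. Create an EBS volume from the snapshot in the desired availability zone.
    volume_id = ec2.create_volume(
        SnapshotId=snapshot_id,
        AvailabilityZone=availability_zone,
    )["VolumeId"]
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # 3. Attach the new volume to the other instance.
    ec2.attach_volume(
        VolumeId=volume_id,
        InstanceId=target_instance_id,
        Device=device,
    )

    # 4. Mounting the volume and starting the database engine are done on
    #    the instance and are outside the scope of this sketch.
    return volume_id
```

A call might look like `clone_volume(boto3.client("ec2"), "vol-...", "i-...", "us-east-1a")`, with real identifiers substituted for the placeholders. The sketch does not cover copying the snapshot to another region.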
If utilizing the AWS RDS (Relational Database Service), the process is even simpler. All that’s needed is to create a snapshot of the RDS instance and deploy another instance from the snapshot. Again, if deploying in a different region, the snapshot will need to be copied between regions.
Because AWS is API driven, it allows for the easy deployment of infrastructure as code utilizing the CloudFormation service. CloudFormation utilizes JSON-formatted templates that allow for the provisioning of various resources such as S3 buckets, EC2 instances, auto scaling groups, security groups, load balancers and EBS volumes. VPCs (Virtual Private Clouds) can also be created using this same service, allowing for duplication of network environments between development, staging and production. Utilizing CloudFormation to deploy AWS infrastructure can greatly expedite the process of security validation. Once an environment has passed testing, the same CFTs (CloudFormation Templates) can be used to deploy an exact copy of your stack in another VPC or region. Properly investing the time during the development and test phases to refine CloudFormation code can reduce deployment of additional environments to a few clicks of a mouse button – try accomplishing that with a physical datacenter.
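For a sense of what such a template looks like, the sketch below builds a deliberately small one as a Python dictionary and prints it as JSON. The logical IDs (`AppBucket`, `AppSecurityGroup`), the description text, and the single HTTPS ingress rule are assumptions chosen for illustration; a real stack would define far more.

```python
import json


def build_template(environment):
    """Return a minimal CloudFormation template as a Python dict."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Example stack for the {environment} environment",
        "Resources": {
            # Logical IDs such as AppBucket are hypothetical names.
            "AppBucket": {"Type": "AWS::S3::Bucket"},
            "AppSecurityGroup": {
                "Type": "AWS::EC2::SecurityGroup",
                "Properties": {
                    "GroupDescription": "Allow inbound HTTPS",
                    "SecurityGroupIngress": [
                        {
                            "IpProtocol": "tcp",
                            "FromPort": 443,
                            "ToPort": 443,
                            "CidrIp": "0.0.0.0/0",
                        }
                    ],
                },
            },
        },
    }


# The same function yields an identical resource layout for every environment,
# which is the point of treating infrastructure as code.
templates = {env: build_template(env) for env in ("development", "staging", "production")}
print(json.dumps(templates["staging"], indent=2))
```

Generating the template from one function keeps development, staging and production structurally identical, with only the parameters you choose to vary differing between them.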
For those wishing to utilize higher-level services to further simplify the deployment of environments, AWS offers the Elastic Beanstalk and OpsWorks services. Both are designed to reduce the complexity of deploying and managing the hardware and network layers on which applications run. Elastic Beanstalk is an easy-to-use and highly simplified application management service for building web apps and web services with popular application containers such as Java, PHP, Python, Ruby and .NET. Customers upload their code and Elastic Beanstalk automatically does the rest. OpsWorks features an integrated management experience for the entire application lifecycle including resource provisioning (e.g. instances, databases, load balancers), configuration management (Chef), application deployment, monitoring, and access control. It will work with applications of any level of complexity and is independent of any particular architectural pattern. Compared to Elastic Beanstalk, it also provides more control over the various layers comprising application stacks such as:
- Application Layers: Ruby on Rails, PHP, Node.js, Java, and Nginx
- Data Layers: MySQL and Memcached
- Utility Layers: Ganglia and HAProxy
The examples above highlight some of the notable features and advantages AWS offers that can be utilized to expedite and assist application development. Public Cloud computing is changing the way organizations build, deploy and manage solutions. Notably, operating expenses now replace capital expenses, costs are lowered due to no longer having to guess capacity, and resources are made available on-demand. All of this adds up to reduced costs and shorter development times, which enables products and services to reach end-users much faster.
-Ryan Manikowski, Cloud Engineer
An oft-held misconception by many individuals and organizations is that AWS is great for Web services, big data processing, DR, and all of the other “Internet facing” applications but not for running your internal business applications. While AWS is absolutely an excellent fit for the aforementioned purposes, it is also an excellent choice for running the vast majority of business applications. Everything from email services, to BI applications, to ERP, and even your own internally built applications can be run in AWS with ease while virtually eliminating future IT capex spending.
Laying the foundation
One of the most foundational pieces of architecture for most businesses is the network that applications and services ride upon. In a traditional model, this will generally look like a varying number of switches in the datacenter that are interconnected with a core switch (e.g. a pair of Cisco Nexus 7000s). Then you have a number of routers and VPN devices (e.g. Cisco ASA 55XX) that interconnect the core datacenter with secondary datacenters and office sites. This is a gross oversimplification of what really happens on the business’s underlying network (and neglects to mention technologies like Fibre Channel and InfiniBand). But that further drives the point that migrating to AWS can greatly reduce the complexity and cost of a business in managing a traditional RYO (run your own) datacenter.
Anyone familiar with IT budgeting is more than aware of the massive capex costs associated with continually purchasing new hardware as well as the operational costs associated with managing it – maintenance agreements, salaries of highly skilled engineers, power, leased datacenter and network space, and so forth. Some of these costs can be mitigated by going to a “hosted” model where you are leasing rack space in someone else’s datacenter, but you are still going to be forking out a wad of cash on a regular basis to support the hosted model.
The AWS VPC (Virtual Private Cloud) is a completely virtual network that allows businesses to create private network spaces within AWS on which to run all of their applications, including internal business applications. Through the VGW (Virtual Private Gateway), the VPC inherently provides a pathway for businesses to interconnect their off-cloud networks with AWS. This can be done through traditional VPNs or by using AWS Direct Connect. Direct Connect provides a dedicated private connection from AWS to your off-cloud locations (e.g. on-prem, remote offices, colocation). The VPC is also flexible enough that it will allow you to run your own VPN gateways on EC2 instances if that is a desired approach. In addition, interconnecting with most MPLS providers is supported, as long as the MPLS provider hands off VLAN IDs.
Moving up the stack
The prior section showed how the VPC is a low cost and simplified approach to managing network infrastructure. We can proceed up the stack to the server, storage, and application layers. Another piece of the network layer that is generally heavily intertwined with the application architecture and the server’s hosting is load balancing. At a minimum, load balancing enables the application to run in a highly available and scalable manner while providing a single namespace/endpoint for the application client to connect to. Amazon’s ELB (Elastic Load Balancer) is a very cost effective, powerful, and easy to use solution to load balancing in AWS. A lot of businesses have existing load balancing appliances, like F5 BIG-IP, Citrix NetScaler, or A10, that they use to manage their applications. Many have also written a plethora of custom rules and configs, like F5 iRules, to do some layer 7 processing and logic on the application. All of the previously mentioned load balancing solution providers, and quite a few more, have AWS hosted options available, so there is an easy migration path if they decide the ELB is not a good fit for their needs. However, I have personally written migration tools for our customers to convert well over a thousand F5 Virtual IPs and pools (dumped to a CSV) into ELBs. It allowed for a quick and scripted migration of the entire infrastructure with an enormous cost savings to the customer. In addition to off-the-shelf appliances for load balancing, you can also roll your own with tools like HAProxy and Nginx, but we find that for most people the ELB is an excellent solution for meeting their load balancing needs.
Now we have laid the network foundation to run our servers and applications on. AWS provides several services for this. If you need, or desire, to manage your own servers and underlying operating system, EC2 (Elastic Compute Cloud) provides the foundational building blocks for spinning up virtual servers you can tailor to suit whatever need you have. A multitude of Linux and Windows-based operating systems are supported. If your application supports it, there are services like Elastic Beanstalk, OpsWorks, or Lambda, to name a few, that will manage the underlying compute resources for you and simply allow you to “deploy code” on completely managed compute resources in the VPC.
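To give a flavor of what a scripted conversion from a CSV export might involve, here is a simplified sketch. It is not the migration tool mentioned above: the CSV columns (`vip_name`, `protocol`, `vip_port`, `pool_port`) and the sample rows are invented for illustration, and the output only mirrors the shape of the listener parameters that boto3's Classic ELB `create_load_balancer` call accepts, without calling AWS.

```python
import csv
import io

# Hypothetical export format; a real appliance dump would have more columns.
SAMPLE_CSV = """vip_name,protocol,vip_port,pool_port
intranet-web,HTTP,80,8080
hr-portal,TCP,443,8443
"""


def elb_definitions(csv_text):
    """Turn exported virtual-IP rows into ELB-style load balancer definitions."""
    definitions = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        definitions.append({
            "LoadBalancerName": row["vip_name"],
            "Listeners": [{
                "Protocol": row["protocol"],
                "LoadBalancerPort": int(row["vip_port"]),
                # Assumes the pool members speak the same protocol as the VIP.
                "InstanceProtocol": row["protocol"],
                "InstancePort": int(row["pool_port"]),
            }],
        })
    return definitions


for definition in elb_definitions(SAMPLE_CSV):
    print(definition)
```

Each resulting dictionary would still need subnets or availability zones, security groups, and health checks before it could be used to create a load balancer, and any layer 7 rules would have to be reviewed by hand.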
What about my databases?
There are countless examples of people running internal business application databases in AWS. The RDS (Relational Database Service) provides a comprehensive, robust, and HA capable hosted solution for MySQL, PostgreSQL, Microsoft SQL Server, and Oracle. If your database platform isn’t supported by RDS, you can always run your own DB servers on EC2 instances.
NAS would be nice
AWS has always recommended a very ephemeral approach to application architectures and not storing data directly on an instance. Sometimes there is no getting away from needing shared storage, though, across multiple instances. Amazon S3 is a potential solution but is not intended to be used as attached storage, so the application must be capable of addressing and utilizing S3’s endpoints if that is to be a solution. There are a great many applications that aren’t compatible with that model.
Until recently your options were pretty limited for providing a NAS type of shared storage to Amazon EC2 instances. You could create a GlusterFS (AKA Red Hat Storage Server) or Ceph cluster out of EC2 instances spanned across multiple availability zones, but that is fairly expensive and has several client mounting issues. The Gluster client, for example, is a FUSE (filesystem in user space) client and has sub-optimal performance. Linus Torvalds has a famous and slightly amusing – depending upon the audience – rant about userspace filesystems (see: https://lkml.org/lkml/2011/6/9/462). To get around the FUSE problem you could always enable NFS server mode, but that breaks the ability of the client to dynamically connect to another GlusterFS server node if one fails, thus introducing a single point of failure. You could conceivably set up some sort of NFS Server HA cluster using Linux Heartbeat, but that is tedious, error prone, and places the burden of the storage ecosystem support on the IT organization, which is not desirable for most IT organizations. Not to mention that Heartbeat requires a shared static IP address, which could be jury-rigged in VPC, but you absolutely cannot share the same IP address across multiple Availability Zones, so you would lose multi-AZ protection.
Yes, there were “solutions,” but nothing that was as easy and slick as most everything else in AWS, and nothing that was ready for prime time. Then on April 9th, 2015 Amazon introduced us to EFS (Elastic File System). The majority of corporate IT AWS users have been clamoring for a shared file system solution in AWS for quite some time, and EFS is set to fill that need. EFS is a low latency, shared storage solution available to multiple EC2 instances simultaneously via NFSv4. It is currently in preview mode but should be released to GA in the near future. See more at https://aws.amazon.com/efs/.
Thinking outside the box
In addition to the AWS tools that are analogs of traditional IT infrastructure (e.g. VPC ≈ Network Layer, EC2 ≈ Physical server or VM) there are a large number of tools and SaaS offerings that add value above and beyond. Tools like SQS, SWF, SES, RDS – for hosted/managed RDBMS platforms – CloudTrail, CloudWatch, DynamoDB, Directory Service, WorkDocs, WorkSpaces, and many more make transitioning traditional business applications into the cloud easy, all the while eliminating capex costs, reducing operating costs, and increasing stability and reliability.
A word on architectural best practices
If it is at all possible, there are some guiding principles and best practices that should be followed when designing and implementing solutions in AWS. First and foremost, design for failure. The new paradigm in virtualized and cloud computing is that no individual system is sacred and nothing is impervious to potential failure. Having worked in a wide variety of high tech and IT organizations over the past 20 years, I can say this should really come as no surprise, because even when everything is running on highly redundant hardware and networks, equipment and software failures have ALWAYS been prevalent. IT and software design as a culture would have been much better off adopting this mantra years and years ago. However, overcoming some of the hurdles that designing for failure creates wasn’t fully realistic until virtualization and the Cloud were available.
AWS is by far the forerunner in providing services and technologies that allow organizations to decouple the application architecture from the underlying infrastructure. Tools like Route53, AutoScaling, CloudWatch, SNS, EC2, and configuration management allow you to design a high level of redundancy and automatic recovery into your infrastructure and application architecture. In addition to designing for failure, you should strive to decouple the application state from the architecture as a whole. The application state should not be stored on any individual component in the stack, nor should it be passed around between the layers. This way the loss of a single component in the chain will not destroy the state of the application. Having the state of the application stored in its own autonomous location, like a distributed NoSQL DB cluster, will allow the application to function without skipping a beat in the event of a component failure.
Finally, a DevOps, Continuous Integration, or Continuous Delivery methodology should be adopted for application development. This allows changes to be tested automatically before being pushed into production and also provides a high level of business agility: the same kind of business agility that running in the Cloud is meant to provide.
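The idea of keeping state out of individual components can be shown with a toy example. In the sketch below, `StateStore` is an in-memory stand-in for an external store such as a distributed NoSQL cluster, and the `Worker` class and shopping-cart scenario are hypothetical; the point is only that a replacement node picks up where a lost one left off.

```python
class StateStore:
    """In-memory stand-in for an external store (e.g. a NoSQL cluster)."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


class Worker:
    """A stateless application node: it keeps nothing between requests."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        # Read the session state, change it, and write it straight back.
        cart = list(self.store.get(session_id, []))
        cart.append(item)
        self.store.put(session_id, cart)
        return cart


store = StateStore()
worker_a = Worker("a", store)
worker_a.add_to_cart("session-1", "book")

del worker_a  # simulate losing the node that served the first request

worker_b = Worker("b", store)
print(worker_b.add_to_cart("session-1", "pen"))  # prints ['book', 'pen']
```

Because the cart lives in the store rather than in `worker_a`, the session survives the loss of that node, which is what lets an Auto Scaling group replace instances freely.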
-Ryan Kennedy, Senior Cloud Architect
The exponential growth of big data is pushing companies to process massive amounts of information as quickly as possible, which is oftentimes not realistic, practical or even achievable on standard CPUs. In a nutshell, High Performance Computing (HPC) allows you to scale performance to process and report on the data quicker and can be the solution to many of your big data problems.
However, this still relies on your cluster capabilities. By using AWS for your HPC needs, you no longer have to worry about designing and adjusting your job to meet the capabilities of your cluster. Instead, you can quickly design and change your cluster to meet the needs of your jobs. There are several tools and services available to help you do this, like the AWS Marketplace, AWS APIs, or AWS CloudFormation Templates.
Today, I’d like to focus on one aspect of running an HPC cluster in AWS that people tend to forget about – placement groups.
Placement groups are a logical grouping of instances in a single availability zone. This allows you to take full advantage of a low-latency 10 Gbps network, which in turn will allow you to transfer up to 4TB of data per hour between nodes. However, because of the low-latency 10 Gbps network, the placement groups cannot span multiple availability zones. This may scare some people away from using them, but it shouldn’t. You can create multiple placement groups in different availability zones as a work-around, and with enhanced networking you can also still connect between the different HPC clusters.
One of the greatest benefits of AWS HPC is that you can run your High Performance Computing clusters with no up-front costs and scale out to hundreds of thousands of cores within minutes to meet your computing needs. Learn more about Big Data and HPC solutions on AWS or Contact Us to get started with a workload workshop.
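The hourly transfer figure follows from the link speed with simple arithmetic. The sketch below assumes the network is a 10 gigabit-per-second link, uses decimal units (1 TB = 1,000 GB), and ignores protocol overhead, so it gives a theoretical ceiling rather than a measured rate.

```python
def terabytes_per_hour(link_gbps):
    """Theoretical data volume a link can move in one hour, in decimal TB."""
    gigabytes_per_second = link_gbps / 8          # 8 bits per byte
    gigabytes_per_hour = gigabytes_per_second * 3600
    return gigabytes_per_hour / 1000              # 1 TB = 1,000 GB


print(terabytes_per_hour(10))  # prints 4.5
```

A ceiling of 4.5 TB per hour is consistent with the roughly 4 TB per hour quoted above once real-world overhead is taken into account.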
-Shawn Bliesner, Cloud Architect