Data & AI Predictions in 2023

How do we know that artificial intelligence (AI) and large language models (LLMs) have reached a tipping point? They were the hot topic at most families’ dinner tables during the 2022 holiday break. As we reveal our data and AI predictions for 2023, join us at 2nd Watch to stay ahead of the curve and propel your business towards innovation and success.

AI has become mainstream and accessible. Most notably, OpenAI’s ChatGPT took the internet by storm, so much so that even our parents (and grandparents!) are talking about it. Since AI is here to stay beyond the Christmas Eve dinner discussion, we put together a list of 2023 predictions we expect to see regarding AI and data.

#1. Proactively handling data privacy regulations will become a top priority.

Regulatory changes can have a significant impact on how organizations handle data privacy: businesses must adapt to new policies to ensure their data is secure. Modifications to regulatory policies require governance and compliance teams to understand data within their company and the ways in which it is being accessed. 

To stay ahead of regulatory changes, organizations will need to prioritize their data governance strategies. Doing so will mitigate data privacy risks and prepare them for regulations still to come. As part of their data governance strategy, data privacy and compliance teams must increase their use of privacy, security, and compliance analytics to proactively understand how data is being accessed within the company and how it is being classified.
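As a concrete illustration, here is a minimal sketch of compliance analytics in Python, assuming access logs and a column-level classification map have already been exported from a warehouse or governance catalog; all table, column, and user names are illustrative.

```python
# Minimal sketch: flag access to columns classified as sensitive.
# The access log and classification map below are hypothetical inputs.
import pandas as pd

access_log = pd.DataFrame([
    {"user": "jsmith", "table": "customers", "column": "email", "ts": "2023-01-05"},
    {"user": "jsmith", "table": "orders", "column": "order_total", "ts": "2023-01-05"},
    {"user": "analyst2", "table": "customers", "column": "ssn", "ts": "2023-01-06"},
])
classification = {("customers", "email"): "PII", ("customers", "ssn"): "PII"}

# Tag each access event with the column's classification.
access_log["classification"] = access_log.apply(
    lambda r: classification.get((r["table"], r["column"]), "non-sensitive"), axis=1
)

# Report who touched sensitive data and how often -- a starting point for
# compliance reviews and access-policy tuning.
report = (
    access_log[access_log["classification"] == "PII"]
    .groupby(["user", "table", "column"])
    .size()
    .reset_index(name="access_count")
)
print(report)
```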

#2. AI and LLMs will require organizations to consider their AI strategy.

The rise of AI and LLM technologies will require businesses to adopt a broad AI strategy. AI and LLMs will open opportunities in automation, efficiency, and knowledge distillation. But, as the saying goes, “With great power comes great responsibility.” 

There is disruption and risk that comes with implementing AI and LLMs, and organizations must respond with a people- and process-oriented AI strategy. As more AI tools and start-ups crop up, companies should consider how to thoughtfully approach the disruptions that will be felt in almost every industry. Rather than being reactive to new and foreign territory, businesses should aim to educate, create guidelines, and identify ways to leverage the technology. 

Moreover, without a well-thought-out AI roadmap, enterprises will find themselves plateauing technologically, with teams unable to adapt to the new landscape and no return on investment: they won’t be able to scale or support the initiatives they put in place. Poor road mapping will lead to siloed and fragmented projects that don’t contribute to a cohesive AI ecosystem.

#3. AI technologies, like Document AI (or information extraction), will be crucial to tap into unstructured data.

According to IDC, 80% of the world’s data will be unstructured by 2025, and 90% of this unstructured data is never analyzed. Integrating unstructured and structured data opens up new use cases for organizational insights and knowledge mining.

Massive amounts of unstructured data – such as Word and PDF documents – have historically been a largely untapped data source for data warehouses and downstream analytics. New deep learning technologies, like Document AI, have addressed this issue and are more widely accessible. Document AI can extract previously unused data from PDF and Word documents, ranging from insurance policies to legal contracts to clinical research to financial statements. Additionally, vision and audio AI unlocks real-time video transcription insights and search, image classification, and call center insights.
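As a minimal sketch of the extraction pattern, the example below pulls text out of a PDF with the pypdf library and uses regular expressions as a stand-in for a full Document AI model; the file name and field patterns are illustrative assumptions.

```python
# Minimal sketch of information extraction: pull text out of a PDF and turn
# a few fields into structured records ready for a warehouse. A production
# Document AI model would replace the regexes with a layout-aware model.
import re
from pypdf import PdfReader  # pip install pypdf

def extract_fields(pdf_path: str) -> dict:
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # Illustrative field patterns for an insurance policy document.
    policy_no = re.search(r"Policy Number:\s*([A-Z0-9-]+)", text)
    effective = re.search(r"Effective Date:\s*([\d/]+)", text)
    return {
        "policy_number": policy_no.group(1) if policy_no else None,
        "effective_date": effective.group(1) if effective else None,
    }

# The resulting records can be loaded into a warehouse table alongside
# structured data for downstream analytics.
print(extract_fields("sample_policy.pdf"))
```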

Organizations can unlock brand-new use cases by integrating this extracted data with their existing data warehouses. Fine-tuning these models on domain-specific data adapts general-purpose models to a wide variety of use cases.

#4. “Data is the new oil.” Data will become the fuel for turning general-purpose AI models into domain-specific, task-specific engines for automation, information extraction, and information generation.

Snorkel AI coined the term “data-centric AI,” which is an accurate paradigm to describe our current AI lifecycle. The last time AI received this much hype, the focus was on building new models. Now, very few businesses need to develop novel models and algorithms. What will set their AI technologies apart is the data strategy.

Data-centric AI enables us to leverage existing models and calibrate them to an organization’s own data. Applying an enterprise’s data to this new paradigm will accelerate a company’s time to market, especially for those who have modernized their data and analytics platforms and data warehouses.
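Here is a minimal sketch of that data-centric workflow, assuming a Hugging Face-style fine-tuning setup; the base model, labels, and hyperparameters are illustrative, and a real project would use far more labeled examples.

```python
# Minimal sketch: start from an existing pre-trained model and calibrate it
# with a small amount of labeled enterprise data (examples are illustrative).
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical labeled examples exported from an enterprise system.
data = Dataset.from_dict({
    "text": ["Invoice overdue by 30 days", "Customer requested a refund"],
    "label": [0, 1],
})

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

tokenized = data.map(lambda x: tokenizer(x["text"], truncation=True, padding="max_length"))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=2),
    train_dataset=tokenized,
)
trainer.train()  # the general-purpose model is now adapted to the domain data
```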

#5. The popularity of data-driven apps will increase.

Snowflake recently acquired Streamlit, which makes application development more accessible to data engineers. Additionally, Snowflake introduced Unistore and hybrid tables (OLTP) to allow data science and app teams to work jointly off a single source of truth in Snowflake, eliminating silos and data replication.

Snowflake’s big moves demonstrate that companies are looking to fill gaps that traditional business intelligence (BI) tools leave behind. With tools like Streamlit, teams can automate data sharing and deployment, which is traditionally a manual, Excel-driven process. Most importantly, Streamlit can become the conduit that allows business users to work directly with AI-native and data-driven applications across the enterprise.
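To make this concrete, here is a minimal sketch of a Streamlit data app that reads from Snowflake, assuming credentials live in Streamlit’s secrets store; the account, warehouse, database, and table names are placeholders.

```python
# app.py -- run with: streamlit run app.py
# Minimal sketch of a data-driven app on top of a governed Snowflake table.
import snowflake.connector  # pip install "snowflake-connector-python[pandas]"
import streamlit as st

st.title("Daily Orders")

# Credentials come from .streamlit/secrets.toml; names are placeholders.
conn = snowflake.connector.connect(
    account=st.secrets["sf_account"],
    user=st.secrets["sf_user"],
    password=st.secrets["sf_password"],
    warehouse="ANALYTICS_WH",
    database="SALES",
    schema="PUBLIC",
)

# Query the shared source of truth instead of a local Excel extract.
df = conn.cursor().execute(
    "SELECT order_date, SUM(amount) AS total FROM orders GROUP BY order_date"
).fetch_pandas_all()

st.bar_chart(df.set_index("ORDER_DATE")["TOTAL"])
st.dataframe(df)
```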

#6. AI-native and cloud-native applications will win.

Customers will start expecting AI capabilities to be embedded into cloud-native applications. Harnessing domain-specific data, companies should prioritize building upon modular, data-driven application building blocks with AI and machine learning. AI-native applications will win over AI-retrofitted applications.

When applications are custom-built for AI, analytics, and data, they are more accessible to data and AI teams, enabling business users to interact with models and data warehouses in a new way. Teams can begin classifying and labeling data in a centralized, data-driven way, rather than manually and repeatedly in Excel, and the results can feed into a human-in-the-loop system for review, improving the overall accuracy and quality of models. Traditional BI tools like dashboards, on the other hand, often limit business users to consuming and viewing data in a “what happened?” fashion, rather than interacting with it in a more targeted way.
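As one illustration of what that human-in-the-loop flow might look like, here is a minimal sketch in Python; the confidence threshold and the prediction records are illustrative placeholders.

```python
# Minimal sketch of a human-in-the-loop review queue: predictions below a
# confidence threshold are routed to a reviewer, and confirmed labels become
# training data for the next model iteration.
from dataclasses import dataclass

@dataclass
class Prediction:
    record_id: str
    label: str
    confidence: float

def route_for_review(predictions, threshold=0.85):
    auto_accepted, needs_review = [], []
    for p in predictions:
        (auto_accepted if p.confidence >= threshold else needs_review).append(p)
    return auto_accepted, needs_review

preds = [
    Prediction("doc-001", "invoice", 0.97),
    Prediction("doc-002", "contract", 0.62),
]
accepted, review_queue = route_for_review(preds)
print(f"{len(accepted)} auto-labeled, {len(review_queue)} sent to human review")
# Reviewed labels are written back to the central store and used to retrain.
```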

#7. There will be technology disruption and market consolidation.

The AI race has begun. Microsoft’s strategic partnership with OpenAI and its integration into “everything,” Google’s introduction of Bard and investment in foundation model startup Anthropic, AWS’s own native models and partnership with Stability AI, and a wave of new AI-related startups are just a few of the major signals that the market is changing. The emerging AI technologies are driving market consolidation: smaller companies are being acquired by incumbents looking to take advantage of the developing technologies.

Mergers and acquisitions are key growth drivers, with larger enterprises leveraging their existing resources to acquire smaller, nimbler players to expand their reach in the market. This emphasizes the importance of data, AI, and application strategy. Organizations must stay agile and quickly consolidate data across new portfolios of companies. 

Conclusion

The AI ball is rolling. At this point, you’ve probably dabbled with AI or engaged in high-level conversations about its implications. The next step in the AI adoption process is to actually integrate AI into your work and understand the changes (and challenges) it will bring. We hope that our data and AI predictions for 2023 prime you for the ways it can have an impact on your processes and people.

Think you’re ready to get started? Find out with 2nd Watch’s data science readiness assessment.


Snowflake Summit 2022 Keynote Recap: Disrupting Application Development

In 2014, Snowflake disrupted analytics by introducing the Snowflake Elastic Data Warehouse, the first data warehouse built from the ground up for the cloud, with patented architecture that would revolutionize the data platform landscape. Four years later, Snowflake continued to disrupt the data warehouse industry by introducing Snowflake Data Sharing, an innovation for data collaboration that eliminated the barriers of traditional data sharing methods and enabled enterprises to easily share live data in real time without moving the shared data. This year, in 2022, under the bright Sin City lights, Snowflake intends to disrupt application development by unveiling a unified platform for developing data applications, from coding to monetization.

Currently, data teams looking to add value – whether by improving their team’s analytical efficiency or reducing costs in their enterprise’s processes – develop internal data products, such as ML-powered product categorization models or C-suite dashboards, in whichever flavor their team is savvy in. However, to produce an external data product that brings value to the enterprise, there is only one metric executives truly care about: revenue.

To bridge the value gap between internal and external data products comes the promise of the Snowflake Native Application Framework. This framework will now enable developers to build, distribute, and deploy applications natively in the Data Cloud landscape through Snowflake. Moreover, these applications can be monetized on the Snowflake Marketplace, where consumers can securely purchase, install, and run these applications natively in their Snowflake environments, with no data movement required. It’s important to note that Snowflake’s goal is not to compete with OLTP Oracle DB workloads, but rather to disrupt how cloud applications are built by seamlessly blending the transactional and analytical capabilities Snowflake has to offer.

To round out the Snowflake Native Application Framework, a series of product announcements were made at the Summit:

  • Unistore (Powered by Hybrid Tables): To bridge transactional and analytical data in a single platform, Snowflake developed a new workload called Unistore. At its core, the new workload enables customers to unify their datasets across multiple solutions and streamline application development by incorporating all the same simplicity, performance, security, and governance customers expect from the Snowflake Data Cloud platform. To power the core, Snowflake developed Hybrid Tables. This new table type supports fast single-row operations driven by a ground-breaking row-based storage engine that will allow transactional applications to be built entirely in Snowflake. Hybrid Tables will also support primary key enforcement to protect against duplicate record inserts.
  • Snowpark for Python: Snowpark is a development framework designed to bridge the skill sets of engineers, data scientists, and developers. Previously, Snowpark only supported Java and Scala, but Snowflake knew what the people wanted – the language of choice across data engineers, data scientists, and application developers: Python. Running Python workloads inside Snowflake also removes much of the security burden developers often face within their enterprises, since the code executes where the data already lives, under Snowflake’s governance. A minimal connection-and-query sketch follows this list.
  • Snowflake Worksheets for Python: Though currently in private preview, Snowflake will support Python development natively within Snowsight Worksheets for building pipelines, ML models, and applications. This will streamline development with features like auto-complete and the ability to write custom Python logic in seconds.
  • Streamlit Acquisition: To democratize access to data – a vision both Snowflake and Streamlit share – Snowflake acquired Streamlit, an open-source Python framework for building data applications. Streamlit fills a gap by offering a simplified app-building experience for data scientists who want to quickly turn an ML model into a visual application that anyone within their enterprise can access.
  • Large Memory Warehouses: Still in development (in preview in the AWS EU-Ireland region), Snowflake will soon allow consumers to access 5x and 6x larger warehouses. These larger warehouses will enable developers to execute memory-intensive operations such as training ML models on large datasets using open-source Python libraries that will be natively available through the Anaconda integration.
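Here is a minimal sketch of a Snowpark for Python session and query, assuming valid connection parameters; the warehouse, table, and column names are placeholders.

```python
# Minimal sketch of a Snowpark for Python workload; connection values,
# table, and column names are placeholders.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "COMPUTE_WH",
    "database": "SALES",
    "schema": "PUBLIC",
}
session = Session.builder.configs(connection_parameters).create()

# The dataframe operations below are pushed down and executed inside
# Snowflake, so the data never leaves the governed platform.
orders = session.table("ORDERS")
large_orders = orders.filter(col("AMOUNT") > 1000).group_by("REGION").count()
large_orders.show()
```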

On top of all those features released for application development, Snowflake also released key innovations to improve data accessibility, such as:

  • Snowpipe Streaming: To eliminate the boundaries between batch and streaming pipelines, Snowflake introduced Snowpipe Streaming. This new feature simplifies stitching together real-time and batch data in a single system. Users can now ingest aggregated log data from IoT devices via a client API endpoint without adding event hubs, and even ingest CDC streams at lower latency.
  • External Apache Iceberg Tables: Originally developed at Netflix, Apache Iceberg is an open-source table format that supports a variety of file formats (e.g., Parquet, ORC, Avro). Snowflake will now allow consumers to query Iceberg tables in place, without moving the table data or existing metadata. This means customer-supplied storage buckets containing Iceberg tables can be accessed without compromising on security, while taking advantage of the consistent governance of the Snowflake platform.
  • External On-Prem Storage Tables: For many enterprises, moving data into the Data Cloud is not a reality for a variety of reasons, including size, security concerns, and cost. To overcome this barrier, Snowflake has released, in private preview, the ability to create External Stages and External Tables on storage systems such as Dell or Pure Storage that expose a highly compliant S3 API. This will allow customers to access a variety of storage devices through Snowflake without worrying about concurrency issues or the effort of maintaining compute platforms.

Between the Native Application Framework and the new additions for data accessibility, Snowflake has taken a forward-thinking approach to disrupting application development. Developers should be keen to take advantage of all the new features this year while understanding that some key features, such as Unistore and Snowpipe Streaming, will have bumps along the road as they are still in public or private preview.

Wondering if Snowflake is the right choice for your organization? Read our high-level overview of Snowflake here or contact us to discuss your organization’s data and analytics needs.


Application Development in the Cloud

The Amazon Web Services Cloud platform offers many unique advantages that can improve and expedite application development in ways traditional solutions cannot. Cloud computing eliminates the need for hardware procurement and makes resources available to teams with far fewer financial resources. What once may have taken months to prepare can now be ready in weeks, days, or even hours. A huge advantage to developing in the Cloud is speed: you no longer have to worry about the infrastructure, storage, or computing capacity needed to build and deploy applications. Development teams can focus on what they do best – creating applications.

Server and networking infrastructure

Developing a new application platform from start to finish can be a lengthy process fraught with numerous hurdles from an operations and infrastructure perspective that cause unanticipated delays of all types. Issues such as budget restrictions, hardware procurement, datacenter capacity and network connectivity are some of the usual suspects when it comes to delays in the development process. Developers cannot develop a platform without the requisite server and networking hardware in place, and deployment of those resources traditionally can require a significant investment of money, time and people.

Items that may need to be taken into consideration for preparing an environment include:

  • Servers
  • Networking hardware (e.g. switches and load balancers)
  • Datacenter cabinets
  • Power Distribution Units (PDU)
  • Power circuits
  • Cabling (e.g. power and network)


Delays related to any item on the above list can easily set back timeframes anywhere from a day to a few weeks. A short list of problems that can throw a wrench into plans include:

  • Potentially having to negotiate new agreements to lease additional datacenter space
  • Hardware vendor inventory shortages (servers, switches, memory, disks, etc.)
  • Bad network cross-connects, ports, and transceivers
  • Lack of hosting provider/datacenter cabinet space
  • Over-subscribed power capacity requiring additional circuits and/or PDUs
  • Defective hardware requiring RMA processing
  • Long wait times for installation of hardware by remote-hands


In a perfect world, the ideal development and staging environments will exactly mirror production, providing zero variability across the various stacks, with the possible exception of endpoint connectivity. Maintenance of multiple environments can be very time consuming. Performing development and testing in the AWS Cloud can help eliminate many of the above headaches.

AWS handles all hardware provisioning, allowing you to select the hardware you need when you want it and pay by the hour as you go. Eliminating up-front costs allows for testing on or near the exact same type of hardware as in production, and with the desired capacity. There is no need to worry about datacenter cabinet capacity and power, and no need to have a single server running multiple functions (e.g. database and caching) because there simply isn’t enough hardware to go around. For applications that are anticipated to handle significant amounts of processing, this can be extremely advantageous for testing at scale. This is a key area where compromises are usually made with regard to hardware platforms and capacity. It’s commonplace to have older hardware re-used for development purposes because that’s simply the only hardware available, or to have multiple stacks being developed on a single platform because of a lack of resources, whether from a budget or hardware perspective. In a nutshell, provisioning hardware can be expensive and time consuming.

Utilizing a Cloud provider such as AWS eliminates the above headaches and allows for quickly deploying an infrastructure at scale with the hardware resources you want, when you want them. There are no up-front hardware costs for servers or networking equipment, and you can ‘turn off’ instances when they are not being used for additional savings. Along with the elimination of up-front hardware costs comes the shift from capital expenses to operating expenses. Viewing hardware and resources from this perspective allows greater insight into expenses on a month-to-month basis, which can increase accountability and help minimize and control spending. Beyond the issues of hardware and financing, there are numerous other benefits.
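For example, idle development capacity can be switched off programmatically. The boto3 sketch below stops running instances tagged as development environments; the region and tag values are illustrative.

```python
# Minimal sketch: stop tagged development instances outside working hours to
# avoid paying for idle capacity. Region and tag values are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Find running instances tagged as development environments.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:Environment", "Values": ["dev"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instance_ids = [i["InstanceId"] for r in reservations for i in r["Instances"]]
if instance_ids:
    ec2.stop_instances(InstanceIds=instance_ids)
    print(f"Stopping {len(instance_ids)} dev instance(s)")
```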

Ensuring software uniformity across stacks

This can be achieved by creating custom AMIs (Amazon Machine Images), allowing the same OS and software packages to be included on deployed instances. Alternatively, User Data scripts, which execute commands (e.g. software installation and configuration) when provisioning instances, can also be used for this purpose. Being able to deploy multiple instances in minutes with just a few mouse clicks is now possible. Need to load test and find out if your platform can handle 50,000 transactions per second? Simply deploy additional instances created from an AMI built from an existing instance and add them to a load balancer configuration. AWS also features the AWS Marketplace, which helps customers find, buy, and immediately start using the software and services they need to build products and run their businesses. Customers can select software from well-known vendors, already packaged into AMIs that are ready to use upon instance launch.
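Here is a minimal boto3 sketch of that AMI-based scale-out, assuming an already-configured source instance; the instance ID, instance type, and bootstrap script are placeholders.

```python
# Bake an AMI from a configured instance, then launch identical copies of it
# with a user-data bootstrap script. IDs and the script are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create an AMI from the "golden" instance and wait until it is usable.
image = ec2.create_image(InstanceId="i-0123456789abcdef0", Name="app-server-v1")
ec2.get_waiter("image_available").wait(ImageIds=[image["ImageId"]])

user_data = """#!/bin/bash
systemctl start myapp
"""

# Launch five load-test instances from the AMI; each runs the same bootstrap.
ec2.run_instances(
    ImageId=image["ImageId"],
    InstanceType="t3.medium",
    MinCount=5,
    MaxCount=5,
    UserData=user_data,
)
```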

Data consistency

Trying to duplicate a database store usually involves dumping a database and then re-inserting that data into another server, which can be very time consuming. First you have to dump the data, and then it has to be imported to the destination. A much faster method that can be utilized on AWS is to:

  1. Snapshot the datastore volume
  2. Create an EBS volume from the snapshot in the desired availability zone
  3. Attach the EBS volume to another instance
  4. Mount the volume and then start the database engine.

*Note that snapshots are region-specific and will need to be copied if they are to be used in a region that differs from where they were originally created. The same flow is sketched with boto3 below.
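This is a minimal sketch of the snapshot-clone steps above; the volume, instance, and device identifiers are placeholders.

```python
# Minimal boto3 sketch of cloning a datastore via an EBS snapshot.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# 1. Snapshot the datastore volume.
snap = ec2.create_snapshot(VolumeId="vol-0123456789abcdef0", Description="db clone")
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

# 2. Create an EBS volume from the snapshot in the desired availability zone.
vol = ec2.create_volume(SnapshotId=snap["SnapshotId"], AvailabilityZone="us-east-1a")
ec2.get_waiter("volume_available").wait(VolumeIds=[vol["VolumeId"]])

# 3. Attach the volume to the target instance (then mount it and start the
#    database engine from the OS).
ec2.attach_volume(
    VolumeId=vol["VolumeId"], InstanceId="i-0fedcba9876543210", Device="/dev/sdf"
)
```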

If utilizing the AWS RDS (Relational Database Service), the process is even simpler. All that’s needed is to create a snapshot of the RDS instance and deploy another instance from the snapshot. Again, if deploying in a different region, the snapshot will need to be copied between regions.

Infrastructure consistency

Because AWS is API-driven, it allows for the easy deployment of infrastructure as code using the CloudFormation service. CloudFormation uses JSON-formatted templates that allow for the provisioning of various resources such as S3 buckets, EC2 instances, auto scaling groups, security groups, load balancers, and EBS volumes. VPCs (Virtual Private Clouds) can also be created using this same service, allowing for duplication of network environments between development, staging, and production. Utilizing CloudFormation to deploy AWS infrastructure can greatly expedite the process of security validation. Once an environment has passed testing, the same CFTs (CloudFormation Templates) can be used to deploy an exact copy of your stack in another VPC or region. Properly investing the time during the development and testing phases to refine CloudFormation code can reduce deployment of additional environments to a few clicks of a mouse button – try accomplishing that with a physical datacenter.
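Here is a minimal sketch of driving CloudFormation from code with boto3; the stack name and the single-resource JSON template are placeholders standing in for a full environment definition.

```python
# Minimal sketch: deploy a tiny JSON CloudFormation template with boto3.
# Real templates would define VPCs, EC2 instances, load balancers, and so on.
import json
import boto3

template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "ArtifactBucket": {"Type": "AWS::S3::Bucket"},
    },
}

cfn = boto3.client("cloudformation", region_name="us-east-1")
cfn.create_stack(StackName="dev-stack", TemplateBody=json.dumps(template))

# The same template can be replayed to create an identical staging or
# production stack in another region or account.
cfn.get_waiter("stack_create_complete").wait(StackName="dev-stack")
```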

Higher-level services

For those wishing to utilize higher-level services to further simplify the deployment of environments, AWS offers the Elastic Beanstalk and OpsWorks services. Both are designed to reduce the complexity of deploying and managing the hardware and network layers on which applications run. Elastic Beanstalk is an easy-to-use and highly simplified application management service for building web apps and web services with popular application containers such as Java, PHP, Python, Ruby and .NET. Customers upload their code and Elastic Beanstalk automatically does the rest. OpsWorks features an integrated management experience for the entire application lifecycle including resource provisioning (e.g. instances, databases, load balancers), configuration management (Chef), application deployment, monitoring, and access control. It will work with applications of any level of complexity and is independent of any particular architectural pattern. Compared to Elastic Beanstalk, it also provides more control over the various layers comprising application stacks such as:

  • Application Layers: Ruby on Rails, PHP, Node.js, Java, and Nginx
  • Data Layers: MySQL and Memcached
  • Utility Layers: Ganglia and HAProxy


Summary

The examples above highlight some of the notable features and advantages AWS offers that can be utilized to expedite and assist application development. Public Cloud computing is changing the way organizations build, deploy and manage solutions. Notably, operating expenses now replace capital expenses, costs are lowered due to no longer having to guess capacity, and resources are made available on-demand. All of this adds up to reduced costs and shorter development times, which enables products and services to reach end-users much faster.

-Ryan Manikowski, Cloud Engineer