Serverless Aurora – Is it Production-Ready Yet?

In the last few months, AWS has made several announcements around its Aurora offering, such as:

  • The Data API, which removes the need for a persistent connection to query the database
  • A new 1 ACU minimum capacity setting
  • Increased connection limits for the smaller 2 and 4 ACU configurations

All of these features work towards the end goal of making serverless databases a production-ready solution. Even with the latest offerings, should you explore migrating to a serverless architecture? This blog highlights some considerations when looking to use Backend-as-a-Service (BaaS) at your data layer.

Aurora Models

Let's assume that you've either already made the necessary schema changes and migrated, or that you have a general familiarity with implementing a new database on Aurora Classic. Aurora currently comes in two models: Provisioned and Serverless. A traditional provisioned AWS database either runs on a self-managed EC2 instance or operates as a PaaS model on an AWS-managed RDS instance. In both cases, you have to allocate memory and CPU and create security groups so that applications can reach the database over a TCP connection.

In this pattern, issues can arise right at the connection. There is a limit to how many connections can access a database before you start to see performance degradation, or an inability to connect altogether once the limit is maxed out. On top of that, your application may receive varying degrees of traffic (e.g., a retail application used during a peak season or promotion). Even if you implement a caching layer in front, such as Memcached or Redis, you still hit scenarios where the instance eventually has to scale vertically to a more robust instance type or horizontally with replicas to distribute reads and writes.

This area is where serverless provides some value. It's worth recalling that a serverless database does not mean no servers. There are servers, but they are abstracted away from the user (or in this case, the application). Following recent compute trends, serverless focuses more on writing business logic and less on infrastructure management and provisioning, shortening the path from the requirements stage to production-ready. In the traditional database model, you are still responsible for securing the box, authentication, encryption, and other operations unrelated to the actual business functions.

How Aurora Serverless works

What serverless Aurora provides to help alleviate issues with scaling and connectivity is a Backend-as-a-Service solution. The application and the Aurora instance must be deployed in the same VPC and connect through endpoints that go through a network load balancer (NLB). Doing so allows connections to terminate at the load balancer rather than at the database instance itself.

By abstracting the connections, you no longer have to write logic to manage load-balancing algorithms or worry about making DNS changes to accommodate database endpoint changes. The NLB routes each request through a fleet of request routers to whichever instance is available at the time, which then maps to the underlying serverless database storage. If the serverless database needs to scale up, a pool of resources is always available and kept warm. If the instances scale down to zero, however, a connection cannot persist.

By having an available pool of warm instances, you get a pay-as-you-go model where you pay only for what you use. You can still run into the max-connections limit, which can't be modified, but the number allowed for the smaller 2 and 4 ACU configurations has increased since the initial release.

Note: Cooldowns are not instantaneous and can take up to five minutes after the instance is entirely idle, and you are still billed for that time. Also, even though the instances are kept warm, the connection to those instances still has to be initiated. If you query the database during that window, you can see wait times of 25 seconds or more before the query fully executes.
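
If your client gives up before the cluster resumes, that first query simply fails. One mitigation is to raise the client-side connection timeouts well above the resume window. Here is a minimal sketch in Go, assuming a MySQL-compatible cluster and the go-sql-driver/mysql driver; the endpoint and credentials are placeholders:

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/go-sql-driver/mysql" // driver registers itself via init()
    )

    func main() {
        // Placeholder DSN; a paused cluster can take ~25s to resume, so the
        // dial and read timeouts are set well above the driver defaults.
        dsn := "user:pass@tcp(my-cluster.cluster-abc123.us-east-1.rds.amazonaws.com:3306)/mydb" +
            "?timeout=60s&readTimeout=60s"

        db, err := sql.Open("mysql", dsn)
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // The first round trip may block while the cluster resumes.
        if err := db.Ping(); err != nil {
            log.Fatal(err)
        }
        log.Println("connected")
    }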

Cost considerations

Can you really scale down completely? Technically yes, if certain conditions are met:

  • CPU below 30 percent utilization
  • Less than 40 percent of connections being used

To achieve this and get the cost savings, the database must be completely idle. There can't be long-running queries or locked tables. Also, activity outside of the application can generate queries: open sessions, monitoring tools, health checks, and so on. The database only pauses when the conditions are met AND there is zero activity.
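
For reference, the pause-and-scale behavior lives in the cluster's scaling configuration. A minimal sketch with the AWS SDK for Go (the cluster identifier is a placeholder, and the capacity values are just examples):

    package main

    import (
        "log"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/rds"
    )

    func main() {
        sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
        client := rds.New(sess)

        // Placeholder cluster name; pause after five fully idle minutes.
        _, err := client.ModifyDBCluster(&rds.ModifyDBClusterInput{
            DBClusterIdentifier: aws.String("my-serverless-cluster"),
            ScalingConfiguration: &rds.ScalingConfiguration{
                MinCapacity:           aws.Int64(2),
                MaxCapacity:           aws.Int64(8),
                AutoPause:             aws.Bool(true),
                SecondsUntilAutoPause: aws.Int64(300),
            },
        })
        if err != nil {
            log.Fatal(err)
        }
    }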

Serverless Aurora, at $0.06 per ACU-hour, starts at a higher price than its provisioned predecessor at $0.041 per hour. Aurora Classic also charges hourly, where Serverless Aurora charges by the second with a five-minute minimum AND a five-minute cool-down period. We already discussed that cool-downs in many cases are not instantaneous, and on top of that, billing doesn't stop until an additional five minutes after that period. If you go with the traditional minimal setup of 2 ACU and never scale down the instances, the cost is higher by a factor of at least 3x. Therefore, to break even, your database would have to run only about a third of the time, which is achievable for dev/test boxes that are parked or apps only used during business hours in a single time zone. Serverless Aurora is supposed to be highly available by default, so if you are getting two instances at this price point, then you are getting a better bargain performance-wise than running a single provisioned instance at an only slightly lower price point.
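
To put rough numbers on that 3x claim (assuming the $0.041 figure is the hourly rate of the comparable provisioned instance):

    2 ACU × $0.06/ACU-hour = $0.12/hour for an always-on serverless cluster
    $0.12 / $0.041 ≈ 2.9x the provisioned rate
    Break-even: the serverless cluster must run roughly a third of the time or less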

Allowing for a minimum of 1 ACU gives you the option of scaling a serverless database down further still, and it makes the price point more comparable to RDS even without enabling pausing.

Migration and Data API

Migrating to Serverless Aurora is relatively simple, as you can just load in a snapshot from an existing database. With the Data API, you no longer need a persistent connection to query the database. In previous scenarios, a fetch could take 25 seconds or more if the query executed after a cool-down period. Now you can query the serverless database even if it has been idle for some time, for example from a Lambda function behind API Gateway, which works around the VPC requirement. AWS has mentioned it will provide performance metrics around the average time it takes to execute a query with the Data API in the coming months.
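
As a rough sketch of what a Data API call can look like from Go (the cluster and secret ARNs are placeholders; authentication goes through Secrets Manager, so no database connection or VPC access is needed):

    package main

    import (
        "fmt"
        "log"

        "github.com/aws/aws-sdk-go/aws"
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/rdsdataservice"
    )

    func main() {
        sess := session.Must(session.NewSession(aws.NewConfig().WithRegion("us-east-1")))
        client := rdsdataservice.New(sess)

        // Placeholder ARNs; ExecuteStatement is a plain HTTPS call, so it
        // works even when the cluster has been idle for a while.
        out, err := client.ExecuteStatement(&rdsdataservice.ExecuteStatementInput{
            ResourceArn: aws.String("arn:aws:rds:us-east-1:123456789012:cluster:my-serverless-cluster"),
            SecretArn:   aws.String("arn:aws:secretsmanager:us-east-1:123456789012:secret:my-db-secret"),
            Database:    aws.String("mydb"),
            Sql:         aws.String("SELECT id, name FROM users LIMIT 10"),
        })
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(out.Records)
    }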

Conclusion

With the creation of EC2, Docker, and Lambda functions, we've seen more innovation in the area of compute and not as much at the data layer. Traditional provisioned relational databases have difficulty scaling and have a finite limit on the number of connections. By eliminating the need to manage an instance, this level of abstraction presents a strong use case for unpredictable workloads. Kudos to AWS for engineering a solution at this layer.

The latest updates these last few months underscore AWS's willingness to solve complex problems. Running 1 ACU does bring the cost down to a rate comparable to RDS while providing a mechanism for better performance if you disable pauses. However, while it is now possible to run Aurora Serverless 24/7 more cost-effectively, this scenario contrasts with its signature use case of an on/off database.

Serverless still seems a better fit for databases that are rarely used and only see occasional spikes, or for applications primarily used during business hours. Administration time is still a cost, and serverless databases, despite the progress, still have many unknowns. It can take an administrator some time and patience to arrive at a configuration that is performant, highly available, and not overly expensive. Even though you don't have to rely on automation and can manually scale your Aurora Serverless cluster, it takes some effort to do so in a way that doesn't immediately terminate the connections.

Today, you can leverage ECS or Fargate with spot instances and implement a solution that yields similar or better results at a lower cost if a true serverless database is the desired goal. I would still recommend Serverless Aurora for dev/test workloads, and from there see if you can work your way up to smaller production workloads, as the tool still provides much value. Hopefully, AWS releases GA offerings for MySQL 5.7 and PostgreSQL soon.

Want more tips and info on Serverless Aurora or serverless databases? Contact our experts.

-Sabine Blair, Cloud Consultant


Logging and Monitoring in the era of Serverless – Part 1

Figuring out monitoring in a holistic sense is still a challenge for many companies, whether with conventional infrastructure or newer platforms like serverless and containers.

In most cases there are two aspects of monitoring an application:

  • System metrics, such as errors, invocations, latency, memory, and CPU usage
  • Business analytics, such as number of signups, number of emails sent, transactions processed, etc.

The former is fairly universal and applicable to a varying degree in any stack. This is what I would call the undifferentiated aspect of monitoring an application. The ability to detect errors and track performance metrics is absolutely necessary to operate an application.

Everything that is old is new again. I am a huge fan of the Twelve-Factor App. If you aren't familiar with it, I highly suggest taking a look. Drafted in 2011 by developers at Heroku, the Twelve-Factor App is a methodology and set of best practices designed to enable applications to be built with portability and resiliency when deployed to the web.

In the Twelve-Factor App manifesto, it is stated that applications should produce "logs as event streams" and leave it up to the execution environment to aggregate them. If we are going to gather information from our application, why not make it present in the logs? We can use our event stream (i.e., the application log) to create time-series metrics. Time-series metrics are just data points that have been sampled and aggregated over time, enabling developers and engineers to track performance and correlate events with specific points in time.

AWS Lambda works almost exactly this way by default, aggregating its logs via AWS CloudWatch. CloudWatch organizes logs based on function, version, and container, while Lambda adds metadata for each invocation. It is up to the developer to add application-specific logging to their function. CloudWatch, however, will only get you so far. If we want to track more information than just invocations, latency, or memory utilization, we need to analyze the logs more deeply. This is where tools like Splunk or Kibana come into play.

In order to get to the meat of our application and the value it delivers, we need to ensure that additional information (telemetry) goes to the logs as well, e.g.:

  • Timeouts
  • Configuration failures
  • Stack traces
  • Event objects

Logging these kinds of events and information enables tools with rich query languages to build dashboards with just about anything we want on them.
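
As a minimal sketch of that kind of telemetry (in Go, using the aws-lambda-go runtime; the handler shape here is just one option), a function can log the raw event object on every invocation so those tools can index it later:

    package main

    import (
        "context"
        "encoding/json"
        "log"

        "github.com/aws/aws-lambda-go/lambda"
    )

    // handler logs the raw event payload before doing any work, so failures
    // can be correlated with the exact input that triggered them.
    func handler(ctx context.Context, event json.RawMessage) error {
        log.Printf("event=%s", string(event))
        // ... business logic; log timeouts, config failures, and stack traces here too
        return nil
    }

    func main() {
        lambda.Start(handler)
    }

Metric-style log lines let us take this a step further.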

For instance, let's say we added the following line of code to our application to track an event from a specific invocation and pull out additional information about the execution:

log.Println(fmt.Sprintf("-metrics.%s.blob.%s", environment, method))

In a system that tracks time-series metrics in logs (e.g., Sumo Logic), we could build a query like this:

"-metrics.prod.blob." | parse "-metrics.prod.blob.*" as method | timeslice 5m | count(method) group by _timeslice, method | transpose row _timeslice column method

This would give us a nice breakdown of the different methods used in a CRUD or RESTful service, which can then be visualized in the very same tool.

While visualization is nice, particularly when taking a closer look at a known issue, it might not be immediately apparent that there is a problem in the first place. For that, we need some way to grab the attention of the engineers or developers working on the application. Many of the tools mentioned here support some level of monitoring and alerting.

In the next installment of this series we will talk about increasing visibility into your operations and battling dashboard apathy! Check back next week.

-Lars Cromley, Director, Cloud Advocacy and Innovation