AWS ECS & EC2 hibernation for cost savings

Matteo De Wint · Gorilla Tech Blog · May 27, 2021 · 5 min read

Introduction

Gorilla empowers B2B utilities with data-driven solutions for pricing, forecasting and reporting. Our product is cloud-native, and we’re all-in on Amazon Web Services (AWS).

We make heavy use of AWS ECS and Dask for scheduling long-running calculations. Many of our ECS clusters are only used during office hours, though, meaning they can be shut down at night for cost savings. More generally, periods of inactivity can be detected based on the arrival rate of jobs in a cluster’s queue, the cluster’s CPU usage, and other metrics.

This kind of cost optimization is a recommended practice, but we haven’t seen many examples of how to actually implement it. That’s understandable: every product is slightly different, so such examples may not generalize well.

In this blog post, we show how we built our in-house hibernation system for cost savings, implemented in Serverless Framework, and geared towards services running on AWS ECS. It likely won’t work as-is with your particular service, though. We simply want to share this as an example to be adapted.

Note: What we’re showing here should not be confused with EC2 hibernation. Although this feature looked promising to us, the documentation states that it cannot be combined with Auto Scaling groups or ECS:

You can’t hibernate an instance that is in an Auto Scaling group or used by Amazon ECS. If your instance is in an Auto Scaling group and you try to hibernate it, the Amazon EC2 Auto Scaling service marks the stopped instance as unhealthy, and might terminate it and launch a replacement instance.

Still, we consider “hibernation” a fitting name for what we’ve built instead.

Architecture overview

Architecture diagram of a compute cluster with hibernation support

Our typical compute cluster consists of a single Dask scheduler running on AWS Fargate, and a flexible number of Dask workers running as an ECS service backed by EC2 instances. The reason we don’t use Fargate for the workers is that our workloads demand a broader range of CPU and memory capacity than Fargate offers. The scheduler listens for jobs on an SQS queue and publishes notifications of job completion to an SNS topic.
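
For illustration, a minimal sketch of that job flow could look as follows; the queue URL, topic ARN, and run_job callback are hypothetical placeholders, and in reality the work is submitted to the Dask scheduler rather than run in a plain loop.

```python
import json
import boto3

sqs = boto3.client("sqs")
sns = boto3.client("sns")

# Hypothetical queue URL and topic ARN for illustration.
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/calculation-jobs"
TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:jobs-finished"


def poll_and_run(run_job):
    """Receive a job from the queue, run it, and publish a completion notification."""
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20
    )
    for message in response.get("Messages", []):
        job = json.loads(message["Body"])
        run_job(job)  # in our setup this submits work to the Dask cluster
        sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps({"job_id": job.get("id")}))
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=message["ReceiptHandle"])
```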

Cluster activity is expressed using CloudWatch metric math: activity = IF(cpu >= 3 OR jobs_finished, 1, 0), with cpu and jobs_finished representing metrics of (1) cluster CPU usage and (2) the number of SNS messages published for each job completed. The rationale behind this is that large jobs typically increase CPU usage (1), while small jobs might not, meaning they can only be detected through SNS activity (2). Whenever activity has remained 0 for a configurable period (typically 1 hour), a CloudWatch alarm invokes a Lambda function that shuts down the cluster by scaling the ECS services and underlying EC2 Auto Scaling group (ASG) to 0.
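
For reference, an inactivity alarm along those lines could be defined with boto3 roughly as follows. The cluster, topic, and alarm names are hypothetical, and how the alarm reaches the Lambda (here via an SNS topic in AlarmActions) depends on your setup.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="compute-cluster-inactive",  # hypothetical name
    # e.g. an SNS topic that triggers the sleep Lambda; wiring may differ in your setup
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:sleep-cluster"],
    ComparisonOperator="LessThanThreshold",
    Threshold=1,
    EvaluationPeriods=12,  # 12 x 5-minute periods = 1 hour of inactivity
    Metrics=[
        {
            "Id": "cpu",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ECS",
                    "MetricName": "CPUUtilization",
                    "Dimensions": [{"Name": "ClusterName", "Value": "compute-cluster"}],
                },
                "Period": 300,
                "Stat": "Average",
            },
            "ReturnData": False,
        },
        {
            "Id": "jobs_finished",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/SNS",
                    "MetricName": "NumberOfMessagesPublished",
                    "Dimensions": [{"Name": "TopicName", "Value": "jobs-finished"}],
                },
                "Period": 300,
                "Stat": "Sum",
            },
            "ReturnData": False,
        },
        {
            "Id": "activity",
            "Expression": "IF(cpu >= 3 OR jobs_finished, 1, 0)",
            "ReturnData": True,  # the alarm evaluates this expression
        },
    ],
)
```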

CloudWatch metric for cluster activity

When a user schedules a calculation while the cluster is asleep (typically in the morning), the job waits in the queue for 1 minute, which triggers another CloudWatch alarm that invokes a Lambda function. This function then wakes the cluster by scaling out the ECS services and EC2 ASG to 1. This chain of events typically takes around 5 minutes, which is acceptable for clusters with no strict uptime requirements.
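
A wake alarm in that spirit might be defined like this; we assume the SQS ApproximateAgeOfOldestMessage metric as the “a job has been waiting for a minute” signal, and the queue, topic, and alarm names are hypothetical.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="compute-cluster-wake",  # hypothetical name
    # e.g. an SNS topic that triggers the wake Lambda
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:wake-cluster"],
    Namespace="AWS/SQS",
    MetricName="ApproximateAgeOfOldestMessage",
    Dimensions=[{"Name": "QueueName", "Value": "calculation-jobs"}],  # hypothetical queue
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    Threshold=60,  # a job has been waiting for at least one minute
)
```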

Once the cluster is up with 1 worker, further auto-scaling is handled by the Dask scheduler. Based on scheduling demands, it adjusts the ECS service capacity and the EC2 ASG size. We’ve developed a custom integration with Dask’s adaptive scaling API to make this work, because we need to scale the ECS service and the EC2 ASG at the same time. We also tried using ECS capacity providers with target tracking to keep both sizes in sync, but that introduces a few minutes of lag, which we want to avoid. We do still use ECS capacity providers to manage instance scale-in protection, so that the right instances are terminated when scaling in; the Dask scheduler does not track the relationship between ECS tasks and the EC2 instances they run on (that would complicate things too much).
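
The core of that integration is a single scale operation that adjusts both resources together, something like the sketch below; the resource names are hypothetical, and the real code hooks this into Dask’s adaptive scaling API rather than calling it directly.

```python
import boto3

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

# Hypothetical resource names for illustration.
CLUSTER = "compute-cluster"
WORKER_SERVICE = "dask-workers"
WORKER_ASG = "dask-workers-asg"


def scale_workers(desired: int) -> None:
    """Scale the Dask worker ECS service and its EC2 ASG to the same size."""
    ecs.update_service(cluster=CLUSTER, service=WORKER_SERVICE, desiredCount=desired)
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=WORKER_ASG,
        DesiredCapacity=desired,
        HonorCooldown=False,
    )
```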

Sample code

Our Python code and serverless.yml are available in this Gist. The Lambda event handlers are fairly straightforward: they simply toggle the EC2 ASG size and ECS service capacities between 0 and 1. We’ve abstracted this logic somewhat, making it easy to reuse and customize.
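
As a rough illustration of that toggle (not the actual code from the Gist), a minimal pair of handlers could look like this, with hypothetical resource names injected through environment variables, e.g. from serverless.yml:

```python
import os
import boto3

ecs = boto3.client("ecs")
autoscaling = boto3.client("autoscaling")

# Hypothetical configuration, e.g. provided via serverless.yml environment variables.
CLUSTER = os.environ["ECS_CLUSTER"]
SERVICES = os.environ["ECS_SERVICES"].split(",")
ASG_NAME = os.environ["ASG_NAME"]


def _set_capacity(desired: int) -> None:
    for service in SERVICES:
        ecs.update_service(cluster=CLUSTER, service=service, desiredCount=desired)
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=ASG_NAME, MinSize=0, DesiredCapacity=desired
    )


def sleep_handler(event, context):
    """Invoked by the inactivity alarm: scale everything down to 0."""
    _set_capacity(0)


def wake_handler(event, context):
    """Invoked by the queue alarm: scale everything back up to 1."""
    _set_capacity(1)
```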

Monitoring

New Relic dashboard tracking EC2 instances in all of our AWS accounts

We use a New Relic dashboard to monitor the EC2 ASGs and ECS services across all of our AWS accounts. This dashboard clearly indicates whether hibernation and auto scaling are working as expected. This can also be achieved using CloudWatch dashboards, but New Relic makes it easier for us to aggregate metrics from multiple AWS accounts.

Additionally, to avoid surprises at the end of the month, we use AWS billing alarms to monitor the estimated charges in all of our accounts.
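
Such a billing alarm can be sketched as follows; note that the AWS/Billing EstimatedCharges metric is only published in us-east-1 (and requires billing alerts to be enabled), and the threshold and topic here are hypothetical.

```python
import boto3

# Billing metrics are only published in us-east-1.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="estimated-charges",  # hypothetical name
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # hypothetical topic
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,  # the metric is updated a few times per day
    EvaluationPeriods=1,
    ComparisonOperator="GreaterThanThreshold",
    Threshold=1000.0,  # hypothetical monthly budget in USD
)
```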

Conclusion

This has been working surprisingly well for such a simple system, although it does suffer from the occasional hiccup. For example, if a Lambda invocation doesn’t achieve the desired outcome (for whatever reason), its CloudWatch alarm may get stuck in the ALARM state. To work around that, we added another Lambda function that resets the wake alarm to OK after each invocation, allowing it to be re-triggered immediately.
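
That reset boils down to a single call to CloudWatch’s set_alarm_state API; the alarm name below is hypothetical.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")


def reset_wake_alarm(event, context):
    """Reset the wake alarm to OK so it can be triggered again immediately."""
    cloudwatch.set_alarm_state(
        AlarmName="compute-cluster-wake",  # hypothetical name
        StateValue="OK",
        StateReason="Reset after Lambda invocation so the alarm can re-trigger",
    )
```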

In addition, coordinating hibernation with ECS deployments is tricky. To ensure deployments can go through when the cluster is down, our system listens for SERVICE_TASK_PLACEMENT_FAILURE CloudWatch events, and wakes the cluster in response.
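
For reference, such a rule could be created with an event pattern along these lines; the rule name is hypothetical, and you could equally declare the rule in serverless.yml.

```python
import json
import boto3

events = boto3.client("events")

# Match ECS service action events that report task placement failures.
EVENT_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Service Action"],
    "detail": {"eventName": ["SERVICE_TASK_PLACEMENT_FAILURE"]},
}

events.put_rule(
    Name="wake-on-placement-failure",  # hypothetical rule name
    EventPattern=json.dumps(EVENT_PATTERN),
)
# A target pointing this rule at the wake Lambda would then be attached with put_targets().
```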

The Serverless Framework was a great fit for a project like this, which glues several AWS services together. We can deploy the system independently of our other services, and omit it for clusters with strict uptime requirements. As a bonus, we didn’t need to change a single line in our existing codebase to get hibernation to work; this new system is completely orthogonal!
