How Alto Reduced Time to Resolution for Production Incidents

Nov 16, 2021

By Alto Pharmacy

Production incidents caused by newly merged code happen to us all. When they do, most engineering organizations discuss the root cause and how to prevent similar incidents in the future. At Alto, we evaluate two additional metrics:

  • Time to detection: How quickly did we know something was wrong?

  • Time to resolution: How quickly did we restore service?

Time to resolution is especially critical for us as a pharmacy, where a long-running incident could block a user from getting their medication on time.

When Standard Deploys Aren't Fast Enough

Over 30 times a day, an engineer at Alto hits the merge button, deploying a new build to production. Our well-automated CI/CD pipeline, built by our Infrastructure team, handles the 3-step deployment process:

Alto Standard Deployment Pipeline

1. Build: install Ruby & JavaScript dependencies

2. Test: run test suite and linters

3. Deploy: push the build to production


The average runtime for our standard deployment pipeline is 30 minutes—reasonable for a CI/CD pipeline, but too long when shipping a fix for a production incident.

Rolling Forward: A Faster Way to Deploy

We introduced a “hotfix” workflow as a first pass at an expedited deployment pipeline for incident response. A hotfix is a commit that addresses the incident either by removing or reverting the buggy code that introduced the issue, or by shipping new code that fixes it.

The hotfix deployment pipeline reduces incident time to resolution by triggering a specific CI job that removes the test step from our standard pipeline, reducing the workflow to just the Build and Deploy steps.

1. Build: install Ruby & JavaScript dependencies

2. Deploy: push the build to production


Implementing a hotfix pipeline on top of our standard pipeline in CircleCI was relatively simple. In our `.circleci/config.yml`, any part of our CI workflow that should be skipped during a hotfix first runs `early_exit_if_hotfix`, which skips the current step if the last commit message of the code being built includes the keyword “[hotfix]”.

    early_exit_if_hotfix:
      steps:
        - run:
            name: Early exit if hotfix
            command: |
              # Read the message of the most recent commit.
              git_commit_message=$(git log --format=%B -n 1)
              # For a merge commit, also collect the messages of the merged commits.
              if [[ $git_commit_message =~ "Merge pull request #" ]]; then
                echo "This is a merge commit. Checking if this is a hotfix deploy"
                git_commit_sha=$(git log --format=%H -n 1)
                git_commit_message=$(git log --format=%B $git_commit_sha^-)
              fi
              # The quoted keyword is matched literally, brackets included.
              if [[ $git_commit_message =~ "[hotfix]" ]]; then
                echo "exiting early for hotfix deploy"
                circleci-agent step halt
              fi
              echo "not a hotfix deploy, continuing"

    run_ruby_tests:
      steps:
        - early_exit_if_hotfix
        - run_tests
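The keyword check is easy to try outside CI. The following is a minimal sketch, not part of Alto's pipeline: the function name `is_hotfix_message` is hypothetical, and it uses a portable `case` pattern in place of the `[[ ... =~ ... ]]` test. It also shows why the keyword must stay quoted: an unquoted `[hotfix]` would be read as a character class and would match almost any message.

```shell
# Hypothetical helper (an assumption, not Alto's code): succeed when a commit
# message contains the literal keyword "[hotfix]". Quoting the keyword makes
# the shell match the brackets literally instead of as a character class.
is_hotfix_message() {
  case "$1" in
    *"[hotfix]"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

It could be called locally as `is_hotfix_message "$(git log --format=%B -n 1)" && echo "hotfix deploy"`.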

The hotfix workflow reduces the time to ship a resolution, once we identify the fix, from 30 minutes to ~18 minutes. It’s faster than our standard pipeline but still requires considerable hands-on work from the engineer performing the hotfix.

Rolling Back: Our Fastest Way to Deploy

To further speed up deployment and reduce manual work, we introduced a workflow that simply rolls back to a previously built healthy deployment image. In this workflow, the entire pipeline is reduced to just the Deploy step.

Alto Rollback Deployment Pipeline

1. Deploy: push the build to production


To invoke a rollback, an engineer pushes a commit whose message contains the keyword “[rollback]”, followed by the SHA of the commit to roll back to. Similar to the hotfix approach, any build step that should be skipped during a rollback checks the commit message for the “[rollback]” keyword before running:

    early_exit_if_rollback:
      steps:
        - run:
            name: Early exit if rollback
            command: |
              # Read the message of the most recent commit.
              git_commit_message=$(git log --format=%B -n 1)
              # For a merge commit, also collect the messages of the merged commits.
              if [[ $git_commit_message =~ "Merge pull request #" ]]; then
                echo "This is a merge commit. Checking for rollback sha..."
                git_commit_sha=$(git log --format=%H -n 1)
                git_commit_message=$(git log --format=%B $git_commit_sha^-)
              fi
              # The quoted keyword is matched literally, brackets included.
              if [[ $git_commit_message =~ "[rollback]" ]]; then
                echo "exiting early for rollback deploy"
                circleci-agent step halt
              fi
              echo "not a rollback deploy, continuing"

    run_ruby_tests:
      steps:
        - early_exit_if_hotfix
        - early_exit_if_rollback
        - run_tests
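The rollback workflow also has to read the target SHA out of the commit message, a step the snippet above does not show. Here is a minimal sketch of how that could look, assuming the message format is `[rollback] <sha>` and that the SHA is 7 to 40 lowercase hex characters. The function name `extract_rollback_sha` is hypothetical.

```shell
# Hypothetical helper (an assumption, not Alto's code): print the SHA that
# follows "[rollback]" in a commit message, or return 1 if there is none.
# Assumes the format "[rollback] <sha>" with a 7-40 character hex SHA.
extract_rollback_sha() {
  sha=$(printf '%s\n' "$1" |
    sed -n 's/.*\[rollback\][[:space:]]\{1,\}\([0-9a-f]\{7,40\}\).*/\1/p' |
    head -n 1)
  # An empty result means the message did not request a rollback.
  [ -n "$sha" ] || return 1
  printf '%s\n' "$sha"
}
```

A caller might use it as `rollback_revision=$(extract_rollback_sha "$(git log --format=%B -n 1)")` and fall through to the normal pipeline when it returns non-zero.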

The rollback workflow looks for a build matching the specified SHA and posts a confirmation message for the engineer, then deploys that build to production. The rollback approach dropped our time to resolution even further, from 30 minutes to ~3 minutes.

Rollback Gotchas

Our rollback workflow also handles three edge cases: potentially stale images, database migrations, and blocking new merges and deployments.

Our script has a special block warning the engineer if a target image is older than an hour. The older the image, the riskier the rollback, as the delta between the image and the latest commit on main is larger.


    # Committer timestamp (seconds since the epoch) of the rollback target.
    time_rollback=$(git show -s --format=%ct ${rollback_revision})
    time_now=$(date '+%s')
    # Warn if the target commit is more than an hour (3600 seconds) old.
    if [ $(($time_now - $time_rollback)) -gt 3600 ]; then
      message="${message}\n\n:warning:The rollback commit is more than an hour old."
      message="${message}\nThe [hotfix workflow](https://www.notion.so/alto/Deploying-a-Hotfix-Rolling-Forward-bb8e9cea05b44c67a64e23ae652653b6) might be a better candidate for this deployment."
    fi

We also want to warn engineers if any database migrations have occurred between the current image and the target, which we detect by checking the diff of the db/migrate directory. Rolling back application code without making the corresponding changes to the database can leave the database and application code out of sync and cause further problems. Our script warns engineers when it detects a migration between the current and target images and provides instructions on how to roll back migrations if required.
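The migration check can be sketched with a `git diff` limited to the db/migrate directory. This is an illustration under stated assumptions, not Alto's script: the function names `migrations_between` and `warn_if_migrations` and the warning text are hypothetical.

```shell
# Hypothetical helpers (assumptions, not Alto's code).

# List migration files that differ between the rollback target ($1) and the
# currently deployed revision ($2). Must run inside the application's repo.
migrations_between() {
  git diff --name-only "$1" "$2" -- db/migrate
}

# Print a warning when the newline-separated file list in $1 is non-empty.
warn_if_migrations() {
  if [ -n "$1" ]; then
    printf ':warning: Migrations changed since the rollback target:\n%s\n' "$1"
  fi
}
```

For example, `warn_if_migrations "$(migrations_between "$rollback_revision" HEAD)"` prints nothing when no migration files changed between the two revisions.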

Because a rollback deploys an image that’s older than our latest build, we also need to block continuous deployment until we’ve restored the main branch to a safe state, typically by deploying a hotfix as soon as we can identify an appropriate fix. We accomplish this using a GitHub issue with a specific tag that, when open, triggers a job that causes any active CI job to fail. After the incident is resolved, the incident manager closes the issue, which retriggers the blocked jobs and resumes the normal build and deploy workflow.
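One way to approximate this gate is a CI step that counts open issues carrying the blocking label and fails when any exist. This sketch is not Alto's implementation: it assumes the GitHub CLI (`gh`) is installed and authenticated in CI, and the label name `deploy-block` and both function names are hypothetical.

```shell
# Hypothetical sketch (assumptions, not Alto's code).

# Count open issues carrying the blocking label, using the GitHub CLI.
count_open_blockers() {
  gh issue list --label "deploy-block" --state open --json number --jq 'length'
}

# Return 0 when deploys may proceed and 1 when a blocking issue is open.
# Takes the number of open blocking issues as $1.
assert_deploys_unblocked() {
  if [ "$1" -gt 0 ]; then
    echo "Deploys are blocked: $1 open deploy-block issue(s)." >&2
    return 1
  fi
  echo "No open deploy-block issues, continuing."
}
```

A pipeline step could then run `assert_deploys_unblocked "$(count_open_blockers)" || exit 1` before building. Unlike the approach described above, this polls for the issue instead of being triggered by it.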

Impact

The expedited rollback process has reduced our overall time to ship a resolution for incidents to about 3 minutes. That is a valuable improvement for our internal operations users, as well as for the customers and healthcare providers who rely on Alto’s service to get prescriptions filled. Branching these incident response workflows off our existing standard deploy pipeline keeps the maintenance cost low for our Infrastructure team, and robust automation reduces the risk of manual error when product engineers need to invoke these workflows.
