June 9-10, 2022

Overview

On June 9, 2022, The Palace Project suffered a service-wide server outage. The outage was first identified and reported to the technical team at 9:00 a.m. ET, and the team immediately began to investigate.

During the outage, patrons were unable to find their library in the app, making it impossible to search for, borrow, and read books. The administrative dashboard was also unavailable for library staff.

We notified our library partners via email and our community message board on the afternoon of June 9, 2022, once we had assessed the cause and scope.

The Palace app and administrative dashboard were back up and operational by the early afternoon of June 10, 2022. We then sent a follow-up email to our partners and posted an update on the community message board.

Assessment and Actions

We determined the root cause to be a bug in a Linux kernel update that Amazon Web Services (AWS) automatically deployed to some of our server instances. This update caused a kernel panic, which rendered many of the servers inoperable and effectively disabled our service. We developed a mitigation for the bug and, because of the nature of the problem, manually deployed it to the most severely impacted servers.

As a follow-up, we developed and tested an automated configuration change, which we deployed to all of our servers to prevent the problematic Linux kernel version from affecting them in the future. This change has also been applied to our server provisioning automation to prevent this bug from occurring in any new deployments.
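
This report does not describe the configuration change itself. As a rough illustration of the general idea only, the sketch below shows one way a provisioning step might screen kernel releases against a blocklist. The names `BLOCKED_KERNEL_RELEASES`, `is_blocked`, and `allowed_releases` are hypothetical, and the blocked release string is a placeholder because the affected kernel version is not named here.

```python
import platform

# Hypothetical placeholder: the affected kernel release is not named in this
# report, so this value is illustrative only.
BLOCKED_KERNEL_RELEASES = frozenset({"0.0.0-example-bad"})


def is_blocked(release, blocked=BLOCKED_KERNEL_RELEASES):
    """Return True if the given kernel release string is on the blocklist."""
    return release.strip() in blocked


def allowed_releases(candidates, blocked=BLOCKED_KERNEL_RELEASES):
    """Return the candidate releases that are not blocked, preserving order."""
    return [c for c in candidates if not is_blocked(c, blocked)]


if __name__ == "__main__":
    # platform.release() reports the kernel release of the running host.
    running = platform.release()
    if is_blocked(running):
        print(f"Running kernel {running} is blocked; mitigation required.")
    else:
        print(f"Running kernel {running} is not on the blocklist.")
```

A check like this could run both on existing servers and inside provisioning automation, so that new instances are screened in the same way as those already deployed.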