Listen

Description

"We no longer felt confident about what the exact operational boundaries of our cluster were supposed to be."

In early 2021, observability company Honeycomb dealt with a series of outages related to their Kafka architectural migration, culminating in a 12-hour incident, which is an extremely long outage for the company. In this episode, we chat with two engineers involved in these incidents, Liz Fong-Jones and Fred Hebert, about the backstory that is summarized in this meta-analysis they published in May. 

We cover a wide range of topics beyond the specific technical details of the incident (which we also discuss), including:

Resources mentioned in the episode:


Published in partnership with Indeed.