"We no longer felt confident about what the exact operational boundaries of our cluster were supposed to be."
In early 2021, observability company Honeycomb dealt with a series of outages related to their Kafka architectural migration, culminating in a 12-hour incident, which is an extremely long outage for the company. In this episode, we chat with two engineers involved in these incidents, Liz Fong-Jones and Fred Hebert, about the backstory that is summarized in this meta-analysis they published in May.
We cover a wide range of topics beyond the specific technical details of the incident (which we also discuss), including:
Resources mentioned in the episode:
Published in partnership with Indeed.