podcast
details
.com
Showing episodes and shows of
Tom Kleinpeter And Jamie Turner
Shows
The Downtime Project
7 Lessons From 10 Outages
After 10 post-mortems in their first season, Tom and Jamie reflect on the common issues they’ve seen. This is the last episode for a bit, but don’t worry, they’ll be back with more outages soon.
2021-06-22
46 min
The Downtime Project
Salesforce Publishes a Controversial Postmortem (and breaks their DNS)
On May 11, 2021, Salesforce had a multi-hour outage that affected numerous services. Their public writeup was somewhat controversial: it’s the first one we’ve done on this show that called out the actions of a single individual in a negative light. The latest SRE Weekly has a good list of some different articles […]
2021-05-31
40 min
The Downtime Project
Kinesis Hits the Thread Limit
During a routine addition of some servers to the Kinesis front end cluster in US-East-1 in November 2020, AWS ran into an OS limit on the max number of threads. That resulted in a multi-hour outage that affected a number of other AWS services, including ECS, EKS, Cognito, and CloudWatch. We probably won’t do […]
2021-05-25
44 min
The Downtime Project
How Coinbase Unleashed a Thundering Herd
In November 2020, Coinbase had a problem while rotating their internal TLS certificates and accidentally unleashed a huge amount of traffic on some internal services. This was a refreshingly non-database related incident that led to an interesting discussion about the future of infrastructure as code, the limits of human code review, and how many load […]
2021-05-17
38 min
The Downtime Project
Auth0’s Seriously Congested Database
Just one day after we released Episode 5 about Auth0’s 2018 outage, Auth0 suffered a 4 hour, 20 minute outage that was caused by a combination of several large queries and a series of database cache misses. This was a very serious outage, as many users were unable to log in to sites across the […]
2021-05-10
01 min
The Downtime Project
Talkin’ Testing with Sujay Jayakar
Tom was feeling under the weather after joining Team Pfizer last week, so today we have a special guest episode with Sujay Jayakar, Jamie’s co-founder and engineer extraordinaire. While it’s great to respond well to an outage, it’s even better to design and test systems in such a way that outages don’t happen. As we […]
2021-05-04
29 min
The Downtime Project
GitHub’s 43 Second Network Partition
In 2018, after 43 seconds of connectivity issues between their East and West coast datacenters and a rapid promotion of a new primary, GitHub ended up with unique data written to two different databases. As detailed in the postmortem, this resulted in 24 hours of degraded service. This episode spends a lot of time on […]
2021-04-26
53 min
The Downtime Project
Auth0 Silently Loses Some Indexes
Auth0 experienced multiple hours of degraded performance and increased error rates in November of 2018 after several unexpected events, including a migration that dropped some indexes from their database. The published post-mortem has a full timeline and a great list of action items, though it is curiously missing a few details, like exactly what database […]
2021-04-19
47 min
The Downtime Project
One Subtle Regex Takes Down Cloudflare
On July 2, 2019, a subtle issue in a regular expression took down Cloudflare (and with it, a large portion of the internet) for 30 minutes.
2021-04-12
54 min
The Downtime Project
Monzo’s 2019 Cassandra Outage
Monzo experienced some issues while adding servers to their Cassandra cluster on July 29th, 2019. Thanks to some good practices, the team recovered quickly and no data was permanently lost.
2021-04-05
43 min
The Downtime Project
Gitlab’s 2017 Postgres Outage
On January 31st, 2017, Gitlab experienced 24 hours of downtime and some data loss. After it was over, the team wrote a fantastic post-mortem about the experience. Listen to Tom and Jamie walk through the outage and opine on the value of having a different color prompt on machines in your production environment.
2021-03-28
50 min
The Downtime Project
Slack vs TGWs
Slack was down for about 1.5 hours on the first day everyone was back in their (virtual) office in 2021, Jan 4th. Listen to Tom and Jamie walk through the timeline, complain about Linux's default file descriptor limit, and talk about some lessons learned.
2021-03-20
49 min
The Downtime Project
Introduction
Welcome to The Downtime Project! Here is a quick episode where Tom and Jamie talk about why they created the show and what you can expect.
2021-03-20
04 min