Description

At Google, the job of a site reliability engineer involves building tools to automate infrastructure operations. If a server crashes, there is automation in place to create a new server. If a service starts to receive a high load of traffic, there is automation in place to scale up the instances of that service.
To automate a response to an infrastructure problem, a site reliability engineer needs insight into that infrastructure. Every service needs tools for monitoring, alerting, debugging, and distributed tracing.
One benefit of working at a large company like Google is that an engineer building a new product gets this kind of tooling by default. If I am hacking on a project at home, I have to set up all kinds of tools myself to diagnose and resolve problems. Setting up this tooling takes time and requires expertise.
Stackdriver is a set of tools and instrumentation that allows developers to monitor, debug, and inspect infrastructure. Stackdriver is based on the internal observability tools built for Google. Mark Carter is a group product manager at Google, and he joins the show to discuss site reliability engineering and the creation of Stackdriver.