Showing episodes and shows of
Ash Patel And Sebastian Vietz
Shows
Reliability Enablers
#67 Why the SRE Book Fails Most Orgs — Lessons from a Google Veteran
A new or growing SRE team. A copy of the book. A company that says it cares about reliability. What happens next? Usually… not much. In this episode, I sit down with Dave O’Connor, a 16-year Google SRE veteran, to talk about what happens when organizations cargo-cult reliability practices without understanding the context they were born in. You might know him for his self-deprecating wit and legendary USENIX blurb about being “complicit in the development of the SRE function.” This one’s a treat: less “here’s a shiny new tool” and more “here’s what...
2025-07-15
30 min
Reliability Enablers
#66 - Unpacking 2025 SRE Report’s Damning Findings
I know it’s already six months into 2025, but we recorded this almost three months ago. I’ve been busy with my foray into the world of tech consulting and training. And, well, editing these podcast episodes takes time and care. This episode was prompted by the 2025 Catchpoint SRE Report, which dropped some damning but all-too-familiar findings:
* 53% of orgs still define reliability as uptime only, ignoring degraded experience and hidden toil
* Manual effort is creeping back in, reversing five years of automation gains
* 41% of engineers feel pressure to ship fast, even when i...
2025-07-01
30 min
Reliability Enablers
#65 - In Critical Systems, 99.9% Isn’t Reliable — It’s a Liability
Most teams talk about reliability with a margin for error. “What’s our SLO? What’s our budget for failure?” But in the energy sector? There is no acceptable downtime. Not even a little. In this episode, I talk with Wade Harris, Director of FAST Engineering in Australia, who’s spent 15+ years designing and rolling out monitoring and control systems for critical energy infrastructure like power stations, solar farms, SCADA networks, you name it. What makes this episode different is that Wade isn’t a reliability engineer by title, but it’s baked into everythin...
2025-06-17
28 min
Reliability Enablers
#64 - Using AI to Reduce Observability Costs
Exploring how to manage observability tool sprawl, reduce costs, and leverage AI to make smarter, data-driven decisions. It's been a hot minute since the last episode of the Reliability Enablers podcast. Sebastian and I have been working on a few things in our realms. On a personal and work front, I’ve been to over 25 cities in the last 3 months and need a breather. Meanwhile, listen to this interesting vendor, Ruchir Jha from Cardinal, working on the cutting edge of o11y to help keep costs from spiraling out of control. (To...
2025-01-28
20 min
Reliability Enablers
#63 - Does "Big Observability" Neglect Mobile?
Andrew Tunall is a product engineering leader pushing the boundaries of reliability, currently in mobile observability. Drawing on his experience at AWS and New Relic, he’s vocal about the need for more user-focused observability, especially in mobile, where traditional practices fall short.
* Career Journey and Current Role: Andrew Tunall, now at Embrace, a mobile observability startup in Portland, Oregon, started his journey at AWS before moving to New Relic. He shifted to a smaller, Series B company to learn beyond what corporate America offered.
* Specialization in Mobile Ob...
2024-11-12
29 min
Reliability Enablers
#62 - Early YouTube SRE Shares Modern Reliability Strategy
Andrew Fong’s take on engineering cuts through the usual role labels, urging teams to start with the problem they’re solving instead of locking into rigid job titles. He sees reliability, inclusivity, and efficiency as the real drivers of good engineering. In his view, SRE is all about keeping systems reliable and healthy, while platform engineering is geared toward speed, developer enablement, and keeping costs in check. It’s a values-first, practical approach to tackling tough challenges that engineers face every day. Here’s a slightly deeper dive into the concepts we discussed:
* Care...
2024-11-05
35 min
Reliability Enablers
#61 Scott Moore on SRE, Performance Engineering, and More
This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
2024-10-22
38 min
Reliability Enablers
#60 How to NOT fail in Platform Engineering
Here’s what we covered:
Defining Platform Engineering
* Platform engineering: Building compelling internal products to help teams reuse capabilities with less coordination.
* Cloud computing connection: Enterprises can now compose platforms from cloud services, creating mature, internal products for all engineering personas.
Ankit’s career journey
* Didn't choose platform engineering; it found him.
* Early start in programming (since age 11).
* Transitioned from a product engineer mindset to building internal tools and platforms.
* Key experience across startups, the public sector, unicorn companies, and private cloud projects.
2024-10-01
30 min
Reliability Enablers
#59 Who handles monitoring in your team and how?
Why many copy Google’s monitoring team setup
* Google’s Influence. Google played a key role in defining the concept of software reliability.
* Success in Reliability. Few can dispute Google’s ability to ensure high levels of reliability and its ability to share useful ways to improve it in other settings.
BUT there’s a problem:
* It’s not always replicable. While Google's practices are admired, they may not be a perfect fit for every team.
What is Google’s monitoring approach within teams?
Here’s the thing th...
2024-09-24
08 min
Reliability Enablers
#58 Fixing Monitoring's Bad Signal-to-Noise Ratio
Monitoring in the software engineering world continues to grapple with poor signal-to-noise ratios. It’s a challenge that’s been around since the beginning of software development and will persist for years to come. The core issue is the overwhelming noise from non-essential data, which floods systems with useless alerts. This interrupts workflows, affects personal time, and even disrupts sleep. Sebastian dove into this problem, highlighting that the issue isn't just about having meaningless pages but also the struggle to find valuable information amidst the noise. When legitimate alerts get lost in a...
2024-09-17
08 min
Reliability Enablers
#57 How Technical Leads Support Software Reliability
The question then condenses down to: Can technical leads support reliability work? Yes, they can! Anemari has been a technical lead for years (even spending a few of them at the coveted consultancy, Thoughtworks) and now coaches others. She and I discussed the link between this role and software reliability.
2024-09-10
31 min
Reliability Enablers
#56 Resolving DORA Metrics Mistakes
We're already well into 2024 and it’s sad that people still have enough fuel to complain about various aspects of their engineering life. DORA seems to be turning into one of those problem areas. Not at every organization, but some places are turning it into a case of “hitting metrics” without caring for the underlying capabilities and conversations. Nathen Harvey is no stranger to this problem. He used to talk a lot about SRE at Google as a developer advocate. Then, he became the lead advocate for DORA when Google acquired it in...
2024-09-04
26 min
Reliability Enablers
#55 3 Uses for Monitoring Data Other Than Alerts and Dashboards
We’ll explore 3 use cases for monitoring data. They are:
* Analyzing long-term trends
* Comparing over time or experiment groups
* Conducting ad hoc retrospective analysis
Analyzing long-term trends
You can ask yourself a couple of simple questions as a starting point:
* How big is my database?
* How fast is the database growing?
* How quickly is my user count growing?
As you get comfortable with analyzing data for the simpler questions, you can start to analyze trends for less straightforward questions like:...
2024-08-27
11 min
PurePerformance
So you think you should Serverless? Things to know before you do with Sebastian Vietz!
Has one of the decision makers in your organization decided that you have to go "all in on technology X" because they saw a great presentation at a conference or got a great sales pitch from a vendor? If that is the case then this episode is for you and you should forward it to those decision makers. Sebastian Vietz, Director of Reliability Engineering and Host of the Reliability Enablers Podcast, shares his thoughts on considerations when picking a technology like Serverless. We discuss the importance of knowing limits, best fit architectural patterns and things that should influence your...
2024-08-26
1h 01
Reliability Enablers
#54 Becoming a Valuable Engineer Without Sacrificing Your Sanity
Shlomo Bielak is the Head of Engineering (Operational Excellence and Cloud) at Penn Interactive, an interactive gaming company. He’s dedicated much of his talk time at DevOps events to a topic less covered at such technical events. A lot of what he said alluded to ways to become a more valuable engineer. I’ve broken them down into the following areas:
* Avoid the heroic efforts
* Mind + heart > Mind alone
* Curiosity > Credentials
* Experience > Certifications
* Thinking for complexity
When I saw him in T...
2024-08-20
37 min
Reliability Enablers
#53 What's Missing in Incident Response Processes?
Incident response is an increasingly difficult area for organizations. Many teams end up paying a lot of money for incident management solutions. However, issues remain because processes supporting the incident response are not robust. Incident response software alone isn't going to fix bad incident processes. It's gonna help for sure. You need these incident management tools to manage the data and communications within the incident. But you also need to have effective processes and human-technology integration. Dr Ukis wrote in his Establishing SRE Foundations book about complex incident coordination and priority...
2024-08-15
09 min
Reliability Enablers
Can ITIL Benefit from Site Reliability Engineering?
According to Vlad Ukis, there are a lot of enterprises around whose IT functions are organized around ITIL. What you use SRE for is something completely different. SRE is not for setting up the IT function. It is for enabling the product organization to operate online services reliably at scale. However, the problem is that many in the industry are NOT using SRE principles but instead handing over complex services to a more traditional IT function. Dr. Vladislav Ukis is well qualified to talk about reliability, being at Siemens Healthineers and leading 250 people...
2024-08-13
29 min
Reliability Enablers
#52 Navigating Complexity within Incidents
Sonja Blignaut is a complexity expert. That might not sound relevant to incident response in reliability engineering. But it is! Our systems are becoming more complex and so are the resulting incidents. Learning about complexity can help reliability folk go into an incident with less anxiety, which we’ll explore in this episode. We'll explore the causes of complexity in incidents and how the Cynefin framework classifies incidents. We'll also deep dive into the concept of complexity itself and dispel a common issue where it gets mixed up with complicatedness....
2024-08-06
36 min
Reliability Enablers
#51 Whitebox vs Blackbox Monitoring
Have you got complete monitoring of your software in effect? Are you sure? Google's SREs break monitoring down to white box versus black box monitoring. It's not the same as internal versus external monitoring, which we'll explore further.
We'll cover topics like:
- (quickly) What is monitoring?
- What is whitebox monitoring?
- What is black box monitoring?
- The rising importance of blackbox monitoring
This is a concept from Chapter 6 (Monitoring Distributed Systems) of the Google SRE (2016) book. Chapter written by Rob Ewaschuk and...
2024-07-30
09 min
Reliability Enablers
#50 Making Better Sense of Observability Data
Jack Neely is a DevOps observability architect at Palo Alto Networks and has a few interesting ways of extracting value from o11y data. We crammed into just under 25 minutes ideas like these 7 takeaways:
* Reasserting the Need to Monitor Four Golden Signals: Focus on latency, traffic, errors, and saturation for effective system monitoring and management.
* Prioritize Customer Health: in Jack’s words, the 5th golden signal. Go beyond traditional metrics to monitor the health of your customers for a more comprehensive view of your system's impact.
* Apply Mathematical Techniques: Incorporate advanced ma...
2024-07-09
24 min
Reliability Enablers
#49 Alert Fatigue is Still an Issue - Here's How We Fix it
Alert noise is no joke and neither is the fatigue that results from it. I spoke with Dan Ravenstone who gave a talk at Monitorama about this very topic. He also happens to be an avid skateboarder! Here are 9 takeaways from our conversation:
* Regularly Review and Update Monitoring Systems: Don’t set up monitoring once and forget about it. Continuously assess and update your monitoring systems to ensure they remain relevant and effective.
* Focus on Relevant Alerts: Ensure your alerting system is tailored to indicate real problems. Avoid relying on outdated cr...
2024-07-02
30 min
Reliability Enablers
#48 Cutting Down "Toil" aka Manual Work in Software
Sebastian and I scoured Chapter 5 of the Site Reliability Engineering (2016) book to find nuggets of wisdom on how to reduce toil. We hit the jackpot with concepts like:
* what is toil according to a 5-point criteria
* why even care about toil?
* where you can find toil in your software system
* Google’s goal for how much work (%) should be toil
* the fact that toil isn’t always all that bad
Don’t have time to listen to what we learned or added to the concepts? C...
2024-06-25
44 min
Reliability Enablers
#47 How to Grow Team Impact Through Learning Culture
The common refrain after an incident is “We could and should learn from this”. To me, that alludes to the need for a robust learning culture. We might think we already have a good learning culture because we talk about problems and deep-dive into them in retrospectives. But how often do we explore the nuances of how we are learning? Sorrel Harriet is an expert in supporting software engineering teams to develop a stronger learning culture. She was a “Continuous Learning Lead” at Armakuni (software consultancy) and now does the same work under he...
2024-06-18
28 min
Reliability Enablers
#46 Platform Team Design According to Team Topologies
I continue my conversation with Manuel Pais, co-author of the seminal Team Topologies book, about team topologies suitable for reliability teams. In this second part, we will talk about platform teams.
A quick refresher on what platform teams do
In the team topologies context: Platform teams provide a curated set of self-service capabilities to enable stream-aligned teams (product or feature teams) to deliver work with greater speed and reduced complexity. They achieve this directive by abstracting away common infrastructure and operational concerns. By doing this, they aim...
2024-06-11
24 min
Reliability Enablers
#45 How Team Topologies Can Guide Enabling Teams
I got the inside word from Manuel Pais, co-author of the seminal Team Topologies book, who explains in a 2-part series 2 of the most relevant team topologies for reliability work. In this first part, we will talk about enabling teams.
A quick refresher on what enabling teams do
In the team topologies context: Enabling teams help stream-aligned teams (product or feature teams) to overcome obstacles and improve their capabilities in specific areas. This kind of team is available to provide expertise, guidance, and support to other...
2024-06-04
25 min
Reliability Enablers
#44 - Making SLOs Matter to Stakeholders
Bonus episode on SLOs because Sebastian and I felt that we did not cover the why of SLOs or how to make them relevant to stakeholders.
2024-05-30
20 min
Reliability Enablers
#43 - SLOs: a Deeper Dive into their Mechanics
This episode continues our coverage of Chapter 4 of the Site Reliability Engineering book (2016). In this second part, we take a deeper dive into the mechanics of SLOs. Here are 5 takeaways from the show:
* Start Small with SLOs: Begin with a limited number of SLOs and iteratively refine them based on experience and feedback. Avoid overwhelming teams with too many objectives at once.
* Defend and Enforce SLOs: Ensure that selected SLOs have real consequences attached to them. If conversations about priorities cannot be influenced by SLOs, reconsider their relevance and enforceability.
* Continuous...
2024-05-28
31 min
Reliability Enablers
#42 - Hitting Software SLA Targets through SLOs and SLIs
In this first part of a 2-part coverage, Sebastian Vietz and I work out how to meet SLAs through SLOs and SLIs. This episode covers Chapter 4 of the Site Reliability Engineering book (2016). Here are 7 takeaways from the show:
* Involve Technical Stakeholders Early: Ensure that technical stakeholders, such as SREs, are involved in discussions about SLAs and SLOs from the beginning. Their expertise can help ensure that objectives are feasible and aligned with the technical capabilities of the service.
* Differentiate Between SLAs and SLOs: Understand the distinction between SLAs, which are legal contracts, and...
2024-05-21
29 min
Reliability Enablers
#41 Curbing High Observability Costs
No one wants to get Coinbase’s $65 million observability bill in the future. Sure, observability comes with a necessary cost. But that cost cannot exceed the concrete and perceived value on balance sheets and in the minds of leaders. Sofia Fosdick shares practical insights on curbing high observability costs. She’s a senior account executive at Honeycomb.io and has held similar titles at Turbonomic, Dynatrace, and Grafana. Like always, this is not a sponsored episode! We tackled the cost issue by covering ideas like aligning cost with value, event-based systems, and dynamic sampling. You will not...
2024-05-14
24 min
Reliability Enablers
#40 How to Enable Observability for Success
Observability is more than a set of technologies. It’s a practice. Timothy Mahoney is no stranger to this practice, enabling many developer teams to take on better practices in observability. He’s a senior systems engineer at IKEA and is part of its observability enabling team. Tim highlighted the importance of developing and driving frameworks for observability. He also covered the antipattern of teams having a tool-driven mindset and the challenges of switching them out of this. You can connect with Timothy via LinkedIn.
2024-05-07
27 min
Reliability Enablers
#39 How Chaos Engineering Helps Reduce Incident Risk
Chaos Engineering is no longer a nice-to-have, as Ananth Movva explains in this episode of the SREpath podcast. His experience with it reduced the number and severity of serious incidents and outages. He’s been at the helm of reliability-focused decision-making at one of Canada’s largest banks, BMO, since 2020. Having completed 12 years at the bank, Ananth has seen the evolution of banking technology from archaic to user-centric, where incidents are taken seriously. Ananth highlighted the use of chaos principles and tooling to identify future points of failure well ahead of time. He a...
2024-04-30
24 min
Reliability Enablers
#38 The Real Cost of Software Reliability & Downtime
This episode covers Chapter 3 of the Site Reliability Engineering book (2016). In this second part, we talk about the costs behind reliability and choosing not to do it well or at all. Here are key takeaways from our conversation:
* Prioritize Risk Mitigation: Recognize SRE as a discipline focused on mitigating risks within your organization, including technology, reputation, and financial risks. Allocate resources accordingly to address these risks proactively.
* Consider Cost-Effectiveness: When aiming to improve reliability, consider the cost-effectiveness of incremental improvements. Evaluate the balance between investment in reliability and the value it brings to...
2024-04-23
23 min
Reliability Enablers
#37 An SRE Approach to Managing Technology Risk
This episode covers Chapter 3 of the Site Reliability Engineering book (2016). In this first part, we talk about embracing risk from the SRE perspective. We'll cover how it's very different to the typical IT risk management mindset. Here are key takeaways from our conversation:
* Embrace Risk with Velocity: Rather than being hindered by traditional governance models and change approval boards, consider embracing risk while maintaining development velocity. Strive to find a balance between risk management and the speed of innovation.
* Reevaluate Risk Management Approaches: Challenge traditional...
2024-04-16
30 min
Reliability Enablers
#36 Avoiding Critical Platform Engineering Mistakes
Platform engineering is replacing SRE and DevOps. Jokes aside, knowing the path to better platforms is key. Abby Bangser is the right person to tell us how to achieve greater maturity in this aspect of software operations. She's previously held SRE roles and currently works as Principal Engineer at Syntasso, the company behind the popular Kratix platform framework. Abby highlighted the need for concrete definitions and maturity models in platform engineering trends, cautioning against equating developer portals with fully functional platforms.
2024-04-09
26 min
Reliability Enablers
#35 Boosting Your Observability Data's Usability
The observability (o11y) data revolution is well underway, but are we getting the most from the data that is being collected? Richard Benwell thinks we have room for improvement, especially at the usage stage where we query and visualize the o11y data. He is the founder and CEO of SquaredUp, a dashboard software company based out of Maidenhead, UK with over 10 years of experience in the monitoring space. Richard highlighted the importance of converging human intuition with technical o11y implementations and moving from a narrow focus on...
2024-04-02
35 min
Reliability Enablers
#34 From Cloud to Concrete: Should You Return to On-Prem?
This episode continues our coverage of Chapter 2 of the Site Reliability Engineering book (2016). We talk about the age-old debate of cloud vs on-prem, which is analogous to that other debate we have in technology: build vs buy. Here are key takeaways from our conversation:
* Adapt your storage solutions to business needs: Understand the diverse storage options available and tailor them to specific business needs, considering factors like data type, access patterns, and scalability requirements.
* Optimize your load balancing: Implement global load balancing strategies to optimize user experience and performance by...
2024-03-26
22 min
Reliability Enablers
#33 Inside Google's Data Center Design
This episode covers Chapter 2 of the Site Reliability Engineering book (2016). In this first part, we talk about the intricacies of data center design outlined in the book. One thing is for sure. Building a data center for your own needs is HARD work with many considerations you must make. Here are key takeaways from our conversation:
* Importance of understanding data center fundamentals: Even if you're not operating at the scale of companies like Google, understanding the fundamentals behind data center infrastructure can help. This knowledge can inform decisions on cloud services, high availability...
2024-03-19
23 min
Reliability Enablers
#32 Clarifying Platform Engineering's Role (with Ajay Chankramath) BONUS EP
Will Platform Engineering replace DevOps or SRE or both? I don’t think this is the case at all. Neither does Ajay Chankramath. He is the Head of Platform Engineering at ThoughtWorks North America, an innovator consulting group. I’d take his word for it since he’s held senior leadership roles in release engineering and more since 2002. In this bonus episode of the SREpath podcast, Ajay shared his perspective on the debate about SRE vs DevOps vs Platform Engineering.
2024-03-14
16 min
Reliability Enablers
#31 Introduction to FinOps (with Ajay Chankramath)
FinOps is on the tip of many tongues in the software space right now, as we try to curb our cloud costs. Ajay Chankramath has given talks on FinOps at conferences like the DevOps Enterprise Summit (DOES) among others. He is the Head of Platform Engineering at ThoughtWorks North America, an innovator consulting group. His peers like Martin Fowler and Neal Ford have originated ideas like refactoring, microservices, and more. He shared practical advice for avoiding a harsh, restrictive cost control approach and instead taking a holistic financial view of your software operations.
2024-03-12
26 min
Reliability Enablers
#30 Clearing Delusions in Observability (with David Caudill)
Observability is going through interesting times. David Caudill believes that delusions are getting in the way of our success in this area. He's a senior engineering manager at Capital One, a US-based bank.
2024-03-07
37 min
Reliability Enablers
#29 - Reacting to Google's SRE book 2016 (Chapter 1 Part 2)
Sebastian and I continue our breakdown of notable passages from Chapter 1 of Google's Site Reliability Engineering (2016) book by Betsy Beyer, Jennifer Petoff, Niall Murphy, et al. We covered passages like:
* Monitoring is one of the primary means by which service owners keep track of a system's health and availability.
* Efficient use of resources is important anytime a service cares about money.
* Humans add latency. Even if a given system experiences more actual failures, a system that can avoid emergencies that require human intervention will have higher availability than a system that...
2024-02-27
31 min
Reliability Enablers
#28 - Reacting to Google's SRE Book 2016 (Chapter 1 Part 1)
Sebastian and I got together to react to and discuss 5 passages from Chapter 1 of Google's Site Reliability Engineering book (2016) by Betsy Beyer, Jennifer Petoff, Niall Murphy, et al. We covered passages like:
* The sysadmin approach and the accompanying development/ops split have a number of disadvantages and pitfalls.
* Google has chosen to run our systems with a different approach. Our Site Reliability Engineering teams focus on hiring software engineers to run our products.
* The term DevOps emerged in industry. One could equivalently view SRE as a specific implementation of DevOps with...
2024-02-20
25 min
Reliability Enablers
#27 - Growing as a Site Reliability Engineer (Part 3)
Third and final instalment of the Growing as an SRE series covering practical ideas for planning your career progression.
2024-02-13
16 min
Reliability Enablers
#26 - Growing as a Site Reliability Engineer (Part 2)
In part 1, we covered the first truth - that you don't grow in your career merely through tenure. That was a simple one. Let's explore 2 more truths that are somewhat trickier... Background music credit: Luna by KaizanBlue
2024-02-08
19 min
Reliability Enablers
#25 - DORA and the Pursuit of Engineering Excellence (with Tim Wheeler)
DORA metrics are a hot topic among technology executives in all kinds of enterprises. But there's more to engineering culture than solely relying on the numbers they give you. We have a rare treat for you because Ash got Tim Wheeler on the pod. He doesn't do much social media or many podcast episodes. Tim is Director of Engineering Excellence at SquaredUp where he follows the DORA metrics but emphasizes starting conversations around them rather than setting directives.
2024-01-30
37 min
Reliability Enablers
#24 - Growing as a Site Reliability Engineer (Part 1)
How can you grow as an SRE? You've probably thought about your career progression at some point. Ash put together his initial thoughts on this topic. Listen on to learn how he unpacks the first idea of "You don't get promotions with tenure".
2024-01-23
08 min
Reliability Enablers
#23 - The Danger of Unreliable Platforms (with Jade Rubick)
Jade Rubick needs no introduction in the reliability and observability space. He was VP of Engineering at New Relic from 2010 to 2019. It was my pleasure to take on his non-obvious ideas on managing expectations with teams, especially platform-based teams. We had a few spicy ideas to dive into. We also touched on topics like enhancing engineering practices, DORA metrics, and so much more. Be sure to listen all the way through to learn Jade's amazing insights.
2024-01-16
29 min
Reliability Enablers
#22 - How Google does SRE Consulting (with Yury Niño Roa)
I did not know that Google itself does consulting around its SRE practices. This is not a sponsored episode LOL! I wanted to talk with my SRE friend, Yury Niño Roa, about her drawings and SRE ideas, but we dove into a whole lot more than that. We spoke about her work at Google's PSO office, the antipatterns she's seen, and a whole lot more. Listen in for an engaging conversation. You can follow Yury and her amazing drawings via: https://www.linkedin.com/in/yurynino/
2024-01-09
35 min
Reliability Enablers
#21 - Better SRE in 2024 is all we can hope for
Sebastian is back for this episode to help set out direction for 2024. We reflected during the holidays on the problems SREs faced in 2023 in terms of job insecurity, burnout, and "that really shouldn't be my sole job". Sebastian and I talked about what we hope to bring to the community in 2024 to make SREs and SRE teams stronger, happier, and healthier at their work.
2024-01-02
32 min
Reliability Enablers
#20 Holiday Special with Stephen Townshend
Join Ash Patel and Stephen Townshend for a friendly chat about what they've learned in SRE as 2023 comes toward a wrap!
2023-12-19
29 min
Reliability Enablers
#19 How to Develop Early Career Engineers (with John Hyland)
Ash Patel talks with John Hyland who ran the Ignite Program at New Relic, which is dedicated to developing early career engineers. John shares insights about driving better outcomes for the organization and the early career professionals who join them.
2023-12-12
40 min
Reliability Enablers
#18 Winning at SRE in Banking and Telecom (with Troy Koss)
Ash Patel talks with Troy Koss who is the Director of SRE at Capital One, an early adopter of DevOps and SRE in the banking sector. He shares insights on working in regulated industries like banking and telecom, with his early work experience being at Verizon, a US telecom. Troy shares his thoughts on building stronger SRE individual contributors and emphasizes the importance of education as pivotal to ongoing reliability success.
2023-12-05
35 min
Reliability Enablers
#17 Lessons from SRE's Wild West Days (with Rick Boone)
Ash Patel talks with Rick Boone who is a pioneer in SRE, having been an early AppOps engineer at Facebook and Uber's first SRE hire. He shares amazing stories from those pioneering days. Rick also draws from his experience to share his insights on how to build stronger SRE teams, as well as support effective career progression for individual contributor SREs. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
2023-11-27
46 min
Reliability Enablers
#16 Acing Cloud Infra in Digital Media Giant (with Sreejith Chelanchery)
Ash Patel interviews Sreejith Chelanchery, who is SVP of Delivery and Infrastructure Engineering at Dotdash Meredith. Sreejith shares his journey from programming analyst in Bangalore, India, to now being an executive responsible for platform engineering, DevOps, and SRE at a media giant in New York City. He gives a glimpse into how his team saved his organization over $9 million in cloud computing costs, how they started an internal developer platform well before Backstage was around, and more. Sreejith also sheds light on how changemakers and advocates like SREs can win...
2023-11-21
39 min
Reliability Enablers
#15 Growing Reliability Engineering Across 5+ Companies (with Nash Seshan)
Ash Patel talks with Nash Seshan, who has supported reliability work in over 5 organizations, including Cisco, eBay, Dropbox, Lyft, Netflix, and Wayfair. He shares his learnings from reliability work at these big brands. Nash also draws from his experience as co-founder of a Y Combinator-funded startup on effective engineering leadership. He also gives his take on issues with ill-conceived automation. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
2023-11-14
42 min
Reliability Enablers
#14 Faster Incident Resolution through Data-Driven Notebooks (with Ivan Merrill)
Ash Patel talks with Ivan Merrill of Fiberplane about wrangling the big data that incidents and systems generate through collaborative notebooks. Ivan also touches on how open-source tools like Autometrics enable deeper observability of code by increasing the granularity of data used for incident response and retrospectives. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
2023-11-07
41 min
Reliability Enablers
#13 Making Sense of OpenTelemetry and Observability (with Adriana Villela)
Ash Patel talks with Adriana Villela (CNCF Ambassador, OpenTelemetry contributor, and senior developer advocate at Lightstep) about the promise of OpenTelemetry for observability teams, as well as the challenges of doing it right. She also touches on engineering leadership topics, recalling her experience as a leader of platform engineering and observability teams. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
2023-10-31
32 min
Reliability Enablers
#12 From Incident Firefighting to Reliability First (with Robert Ross)
Ash Patel talks with Robert Ross of FireHydrant about his experience in offering incident management software to SREs and other software incident responders. Highlights include defining the broader concept of reliability, making smarter choices for handling incidents, and more. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
2023-10-24
29 min
Reliability Enablers
#11 Rising to Staff Engineer in DevOps and SRE (with Rajesh Reddy N)
Ash Patel interviews Rajesh Reddy N about his experiences as a senior DevOps and SRE individual contributor. Rajesh shares his insights on having systems to minimize alert fatigue, the importance of security in DevOps, and more. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
2023-10-17
26 min
Reliability Enablers
#10 Using AI for Kubernetes troubleshooting self-service (with Kyle Forster)
Ash Patel interviews Kyle Forster of RunWhen about his experiences as an ex-Google director helping SREs and running an AI-based company that supports Kubernetes troubleshooting. Their conversation will cover themes like enabling junior SREs, the role of SRE in shift-left, and handling misaligned incentive models in organizations. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
2023-10-10
24 min
Reliability Enablers
#9 Inside Booking.com's Site Reliability Engineering practice (with Samuele Tonon and Yoann Fouquet)
In this episode of the SREpath Podcast, Ash Patel interviews two SRE managers from Booking.com, Samuele and Yoann, to gain insights into their experiences and strategies for developing a successful SRE practice within a large organization. Yoann is a senior manager responsible for managing SRE teams and serves as the SRE Craft lead. Samuele is an SRE engineering manager working in the Big Data department and manages a team of eight to nine people. Yoann officially began his journey in SRE in 2017, transitioning from a consultancy role to an engineer focused on...
2023-10-02
28 min
Reliability Enablers
#8 Software Reliability Ninja Who is NOT an SRE (with Pablo Bouzada)
Ash Patel interviews Pablo Bouzada about his beliefs on software reliability as a non-SRE leader. They discuss the importance of effective leadership in driving reliability changes in the software system, as well as the challenges of providing reliable service at video streaming giant Viaplay. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
2023-09-11
22 min
Reliability Enablers
What happened to the podcast?
We haven't hit hard times, just doing other things for the last 2 months including making plans for more interesting episodes on this podcast! This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
2023-09-05
03 min
Reliability Enablers
#7 Bringing HR onboard with SRE hiring and onboarding
In this episode, we highlight the importance of engaging with HR partners to establish an effective understanding of the SRE career model. This will allow them to help with recruiting, hiring, and onboarding tailored to the SRE function. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
2023-07-13
25 min
Reliability Enablers
#6 Building a successful SRE practice through capabilities
We discuss the need for a framework to guide the development of Site Reliability Engineers (SREs) and drive value for organizations. You will learn about our pillar view of areas like observability and service management, to identify areas for improvement and emphasize the importance of focusing on a few key areas at a time. We also discuss the challenges of hiring experienced SRE practitioners and suggest developing existing employees' skills and capabilities to become effective SREs. A capability view of SRE work can help establish a clear career path for SREs within an organization while aligning with acute organizational...
2023-06-29
15 min
Reliability Enablers
#5 Where does SRE fit into your organization's structure?
Throughout this episode, we discuss the different engagement models for Site Reliability Engineering (SRE) and how to fit SRE into an organization's structure. Sebastian Vietz, an experienced SRE practitioner, suggests five different engagement models for SRE and emphasizes the importance of considering the cost associated with each model. The hosts also discuss the different types of SREs that can exist within these engagement models, including SRE champions and unicorns. They stress the importance of considering organizational context when implementing SRE and tease a future episode where they will delve deeper into a framework for identifying the capabilities needed to...
2023-06-15
17 min
Reliability Enablers
#4 Should organizations care about SRE?
This episode discusses how Site Reliability Engineering (SRE) can be important to organizations. SRE can optimize software operations, reduce costs, support revenue-driving areas, mitigate risks, improve cybersecurity, and enhance customer experiences. We will also cover how to integrate SRE into the organization's culture for continuous improvement and innovation. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
2023-06-01
18 min
Reliability Enablers
#3 SRE vs DevOps vs Platform Engineering
In this episode of SREpath, Ash and Sebastian discuss the unnecessary debate surrounding Site Reliability Engineering (SRE), DevOps, and platform engineering. They argue that these disciplines should not be pitted against each other, but rather seen as complementary and able to coexist within an organization. The focus should be on continuous improvement, learning from failures, and making things better. The hosts emphasize that practitioners in all three areas share the common goal of improvement and should collaborate rather than compete. They briefly distinguish SRE as focusing on system reliability and scalability, DevOps on...
2023-05-17
22 min
Reliability Enablers
#2 What is Site Reliability Engineering (SRE) and what is not SRE?
In this episode of the SREpath podcast, Ash and Sebastian explore what Site Reliability Engineering (SRE) is and how it manifests in a highly functional organization. We also cover the controversial issue of what SRE is not. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
2023-05-04
23 min
Reliability Enablers
#1 Introducing the SREpath podcast
Welcome to the first episode of the SREpath podcast! In this episode, we'll introduce you to our podcast hosts and give you their broad-level view of Site Reliability Engineering (SRE). We'll also share some points about how we'll be running future episodes. Whether you're an SRE expert or new to the field, this episode will provide valuable insights into SRE and what you can expect from our podcast series. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
2023-04-20
21 min
Slight Reliability
Slight Reliability Episode 31 - I Still Wanna Know What SRE Is!
In this episode I reflect back on the very first episode of Slight Reliability, "What the heck is SRE anyway?", and see if my perspective has changed since then. I also tackle the confusion about what SRE is and is not. Shout out to Sebastian Vietz (https://www.linkedin.com/in/sebastianvietz/) for his "Service Reliability Engineering" terminology and Richard Benwell (https://www.linkedin.com/in/richard-benwell-ab887b11/) for highlighting the way SRE offers a different value proposition depending on the scale of the services in question. You can...
2022-11-07
09 min
Slight Reliability
Slight Reliability Episode 16 - Interview with Sebastian Vietz
In this episode I have a chat with Sebastian Vietz, an SRE lead based in Canada who has been leading the implementation of SRE across different teams and organisations for eight years. We discuss SLO adoption, SRE going mainstream, virtual teams, and many other topics. You can find Sebastian on LinkedIn: https://www.linkedin.com/in/sebastianvietz/ You can find me on LinkedIn: https://www.linkedin.com/in/stephentownshend/ and Twitter: https://twitter.com/the_kiwi_sre Music from Uppbeat (free for Creators!).
2022-07-11
41 min
FWK-Podcast
Kickers On Tape: Bündnis gegen Depression (Alliance Against Depression)
With Dr. Melanie Vietz and Sebastian Dürrnagel. On Kickers On Tape, our stadium announcer Tobi Grimm talks with the people behind the scenes at our Kickers. Today's guest is Sebastian Dürrnagel, our goalkeeping coach from the LZ. He talks not only about football but also about the depression he suffered from about a year ago. It is an important topic, which we also discuss with Dr. Melanie Vietz of the Bündnis gegen Depression. You can find more information at https://www.deutsche-depressionshilfe.de/regionale-angebote/wuerzburg/start
2022-02-28
34 min