Showing episodes and shows of

Rootly

Shows

Humans of Reliability
Balancing Reliability at the Crypto-Finance Frontier with Brian Shaw (Uphold)
Sylvain Kalache sits down with Brian Shaw, Senior Engineering Leader at Uphold, to explore the reliability challenges that arise when operating at the intersection of traditional finance and crypto markets. Brian shares how unexpected market events can create massive traffic spikes, how their platform architecture and Kubernetes setup help them stay resilient, and why Uphold's transparency and regulatory approach make them both trustworthy and a high-profile target. The conversation also touches on AI's emerging role in operations, lessons from major incidents, and the delicate balance between innovation and stringent compliance in financial engineering.
2025-07-03, 13 min

Humans of Reliability
Command Under Pressure: David Owczarek on Incident Leadership and Human-Centered Reliability
Incident response is as much about people as it is about systems. In this episode, David Owczarek, a veteran engineering leader and seasoned incident commander, joins Sylvain Kalache to unpack the human dynamics behind effective reliability leadership. Drawing on experiences across startups and global enterprises, David shares what really matters when everything breaks, including:
– How incident response strategies shift between small companies and large enterprises
– Why not every engineer should be an incident commander
– How empathy and transparency during outages can deepen customer trust instead of eroding it
– Where AI is sh...
2025-06-17, 23 min

Humans of Reliability
AI at the Frontlines of Healthcare Reliability with Ryan Lockard (CVS Health)
AI is transforming reliability work—from reactive firefighting to proactive engineering.
In this episode, Ryan Lockard, VP of Platform Engineering and AI Enablement at CVS Health, joins Sylvain Kalache to break down how AI is showing up on the frontlines of healthcare infrastructure and operations. From LLM copilots to cultural shifts in ownership, Ryan walks us through:
– How AI tools help troubleshoot legacy systems and assist during real-time incidents
– Why proactive reliability is finally within reach thanks to AI-enhanced tooling and workflows
– What MCP servers are, and how natural language interfaces are streamlining cloud operations
– How engineering culture and on...
2025-05-30, 24 min

Adventures in DevOps
Incident Vibing: The Self-Healing System - DevOps 242
Sylvain Kalache, Head of Developer Relations at Rootly, joins us to explore the new frontier of incident response powered by large language models. We dive into the evolution of DevRel and how we meet the new challenges impacting our systems. We explore Sylvain's origin story in self-healing systems, dating back to his SlideShare and LinkedIn days. From ingesting logs via Fluentd to building early ML-driven RCA tools, he shares a vision of self-healing infrastructure that targets root causes rather than just restarting boxes. Plus, we trace the historical arc of deterministic and non-deterministic tools. The conversation sh...
2025-05-29, 1h 10

Humans of Reliability
Trust Is the Product: Building Reliable Billing in the AI Era with Cosmo Wolfe (Metronome)
In this episode, we sit down with Cosmo Wolfe, Head of Technology at Metronome, to unpack how reliability, trust, and architecture intersect in one of the most critical and overlooked parts of the AI product stack: billing. As AI workloads introduce unpredictable usage patterns and nontraditional pricing models—from token-based to outcome-based—companies are navigating a new frontier of customer trust.
Cosmo explains why billing is more than just a backend function; it’s a key moment of truth in the product experience. We explore how event-sourced systems, rigorous monitoring, and internal accountability help avoid trust...
2025-05-26, 20 min

🧠
Foresight Over Firefighting: Being Proactive in a Reactive World | Rootly's JJ Tang
Podcast: Dev Interrupted
Episode: Foresight Over Firefighting: Being Proactive in a Reactive World | Rootly's JJ Tang
Pub date: 2025-05-13
Still stuck in a reactive loop with incident response, only fixing problems after they happen? JJ Tang, Co-founder and CEO of Rootly, joins host Andrew Zigler to reveal how to shift beyond reactive, leveraging powerful AI and an often-underestimated skill in engineering: genuine customer empathy. Discover how these elements are crucial for navi...
2025-05-25, 44 min

Humans of Reliability
The Golden Path to Nowhere: When Platforms Undermine Reliability with Chase Roberts (Northflank)
Internal platforms promise speed, consistency, and scale — but what happens when they become a distraction? In this episode, Chase Roberts, COO at Northflank, joins Sylvain Kalache to examine the quiet ways platforms erode developer experience when not planned carefully. From abandoned golden paths to shadow deployments and brittle YAML pipelines, Chase walks us through:
– Why early PaaS got developer experience right and what it missed
– The cultural bias toward building over buying (and its hidden costs)
– How complexity quietly kills productivity and reliability, even when everything “works”
– The three questions every team should ask before building an IDP
– What...
2025-05-14, 27 min

Dev Interrupted
Foresight Over Firefighting: Being Proactive in a Reactive World | Rootly's JJ Tang
Still stuck in a reactive loop with incident response, only fixing problems after they happen?
JJ Tang, Co-founder and CEO of Rootly, joins host Andrew Zigler to reveal how to shift beyond reactive, leveraging powerful AI and an often-underestimated skill in engineering: genuine customer empathy. Discover how these elements are crucial for navigating the complexities of modern infrastructure and shaping the future of incident management. JJ explores the forefront of incident response automation, discussing how to integrate shiny new tech like AI safely and why deep customer understanding is key to building trust and reliability. L...
2025-05-13, 44 min

Humans of Reliability
AI can boost developer productivity, if used right, with Justin Reock, Deputy CTO at DX
In this episode of Humans of Reliability, we sit down with Justin Reock, Deputy CTO at DX, to unpack the real impact of generative AI on developer productivity. Drawing from early data in DX’s GenAI Impact Report, he explains why time savings alone don’t tell the full story and why the real value might lie in shifting cognitive load toward meaningful work. We also explore how traditional productivity metrics like PR throughput can backfire, why teams need to move beyond DORA, and how modern frameworks like SPACE and the DX Core 4 offer a more complete view...
2025-04-30, 37 min

Humans of Reliability
Why Reliability in the AI Era Starts with the Network with Marino Wijay
In this episode, we explore how networking has shaped reliability as we know it. Marino Wijay, cloud networking expert and Staff Solutions Architect at Kong, shares how his journey began not as an SRE, but with cables, routers, and switches. Marino explains the evolution of the fabric holding systems together through virtualization and software-defined networking, which is now a key element of resilient applications. This episode also dives into the new challenges LLMs are introducing into networking.
Marino discusses how these workloads introduce new types of reliability challenges: longer response times, context preservation, model...
2025-04-17, 27 min

Humans of Reliability
Metrics That Matter: Measuring Developer Productivity in the AI Era
In this episode of Humans of Reliability, Ryan McDonald is joined by Mark Quigley, Head of Platform Engineering at 90, for a conversation that cuts through the noise around developer productivity metrics and AI. Mark dives deep into how teams can measure what matters—without falling into the trap of turning every measure into a target. He shares how tools like Developer NPS, DORA metrics, and balanced scorecards can help teams optimize for both output and well-being—but only when framed with the right intent. As AI tools like GitHub Copilot begin to shift how engineering work...
2025-04-09, 39 min

Humans of Reliability
Are AI and Platforms Making SRE Obsolete? With Kaspar von Grünberg, Humanitec’s CEO
Last year, over 89% of companies claimed to have adopted platform engineering. And, in the past month, LLMs have been disrupting how we think about software development. In this context, Kaspar asks whether the role of Site Reliability Engineers as we know it is becoming obsolete. Kaspar argues that while SREs aren’t going anywhere, their responsibilities are evolving—fast. We talk about:
– The need for the SRE role to be transformed
– How to build reliability as part of every golden path
– The role of AI and LLMs in Developer Experience
– The limits of LLMs for reliability and infrastructure
2025-03-24, 25 min

Entrepreneurs on Fire
Beyond the Blueprint: Redefining Incident Management with JJ Tang
JJ Tang is the co-founder and CEO of Rootly, a company redefining how organizations approach incident management and reliability. Rootly is powering 100s of customers like Figma, NVIDIA, Canva, and more.
Top 3 Value Bombs
1. Solve the problem from many angles.
Go out, source and find your customers where they are and work ruthlessly.
2. All good leaders need to be operators. If the CEO can’t do it, you can’t expect anyone to do it.
3. Efficiency is the best force and function for prioritization.
Visit their website and get a fr...
2025-03-19, 24 min

Humans of Reliability
Scientific Incident Management with Dan Slimmon
Dan Slimmon is an incident management veteran who's worked at Etsy, HashiCorp, and now leads consulting and training on pragmatic, non-bureaucratic incident response. In this episode, Dan shares his philosophy on "scientific incident response," the importance of hypothesis-driven troubleshooting, and why incidents should be seen as normal in complex systems. We also explore:
– Why asking the right questions is more important than knowing all the answers.
– How to use nerd sniping to unlock insights from engineers.
– Common failure patterns he sees across organizations.
EPISODE LINKS: Video and key takeaways D2E Incident Leaders...
2025-03-14, 37 min

Humans of Reliability
How AI broke serverless and what to do about it with Vercel’s Mariano Fernández Cocirio
Mariano, Staff Product Manager at Vercel, explains why serverless architectures are hitting unexpected limits—they’re too fast. The industry has spent millions optimizing serverless for speed, but AI workloads are changing the game. In the AI realm, slower execution often leads to better results. The challenge? Paying for all that idle compute time while waiting for AI responses. Mariano explains how Vercel Fluid is introducing a new execution model that blends the best of serverless and traditional servers—scaling efficiently while reducing costs. Mariano breaks down Fluid’s architecture, its built-in reliability features, and how it red...
2025-03-06, 13 min

Humans of Reliability
I Want My Shoes Fast! Observability, SRE Burnout, and OTel with Dynatrace’s Adriana Villela
In this episode, we sit down with Adriana Villela, Principal DevRel at Dynatrace and OpenTelemetry contributor, to break down how observability impacts reliability. We dive into what contributes to SRE burnout and how managers can create psychologically safer spaces for responders. Adriana also shares her perspective on AI as an observability-buddy to navigate incidents.
SHOW LINKS:
– Video and takeaways
– Adriana’s podcast: Geeking Out with Adriana
– Podcast with Hazel Weakly mentioned by Adriana
2025-02-27, 34 min

Code RED
#19 - Faster Incident Resolution: How Rootly is Redefining Reliability with JJ Tang
Rootly's Co-founder and CEO, JJ Tang, joins Dash0’s Mirko Novakovic to discuss how Rootly streamlines enterprise incident management. They explore how JJ’s experience at Instacart shaped Rootly, his vision for AI’s capabilities in the monitoring and incident response world, and how he’s building a tool designed to save teams time.
2025-02-20, 36 min

Humans of Reliability
AI in Production with GitHub’s Sean Goedecke
In this episode, we sit down with Sean Goedecke, Staff Software Engineer at GitHub, to discuss where LLMs fit into real-world development. Sean shares how he’s using LLMs and how he’s drawing the line for AI-assistance in the codebases he manages—though, as he says, this might all change by next summer. Sean also weighs in on how LLMs could assist SREs during outages—especially when you’re only half-awake at 3 a.m. after a rather inconvenient page. Tune in for a nuanced take on the future of AI in software engineering, “vibe coding,” a...
2025-02-18, 17 min

Humans of Reliability
The Reliability Diagnosis: Google’s Steve McGhee on Debugging and Incident Response
In this episode of Humans of Reliability, we sit down with Steve McGhee, Reliability Advocate at Google, to discuss his journey from early SRE work to advocating for reliability best practices.
Steve shares fascinating stories from his time at Google, the challenges of implementing SRE in enterprises, and what people often misunderstand about the discipline. He also offers valuable insights on incident response, distributed systems, and the underrated skill every reliability engineer should master. Whether you're new to SRE or a seasoned professional, this conversation is packed with wisdom and practical takeaways. This ep...
2025-02-10, 15 min

Build vs. Buy
JJ Tang, Co-Founder and CEO, Rootly (ex-Instacart, Strava)
Mang-Git interviews JJ Tang, co-founder and CEO of Rootly. They discuss Rootly's journey as an incident response management system, frameworks around building vs. buying software, and the challenges of differentiating internal tools from market solutions. JJ shares insights from his career, including his time at IBM and Instacart, and delves into Rootly’s unique approach to managing and learning from incidents. They also touch upon personal experiences, like cycling goals and technological inspirations, and highlight the significance of building strong business fundamentals.
00:00 Introduction and Guest Welcome
01:29 JJ's Background and Journey
03:19 The Birth of Ro...
2025-02-07, 37 min

Humans of Reliability
No CS Degree, No Problem: Building a Career in Tech Leadership
What does it take to lead service delivery at a company experiencing massive growth? Hannah Hammonds, Service Delivery Lead at Prolific, shares her journey from an IT networking apprentice to a tech leader shaping reliability and incident response. We discuss the evolving role of service delivery, the power of mentorship, and how confidence transforms careers. Plus, we debate hot dogs, spoilers, and The Office. Tune in for career insights, leadership lessons, and a few laughs!
🎙️🚀 This podcast episode is also available on YouTube if you want to see a video vers...
2025-02-05, 11 min

Humans of Reliability
Beyond SLOs: How an ex-Google SRE scaled reliability at the largest e-commerce in the nordics
What happens when a Google-trained SRE joins a fast-moving e-commerce company? Gastón Rial Saibene, SRE Lead at Boozt.com, joins Humans of Reliability to talk about adapting reliability practices for different company sizes, the limits of SLOs, and the importance of automation. We also dive into decision-making, his favorite books, and—just for fun—whether he’d survive a zombie apocalypse. Tune in for insights, laughs, and a fresh perspective on the world of reliability engineering!
2025-02-03, 07 min

Humans of Reliability
The Domino Effect of Outages with Nuno Tomás, Founder of isDown.app
🎙️ Humans of Reliability: Keeping systems up and the lights on isn’t just about technology—it’s about the people behind it. In this episode, we’re thrilled to chat with Nuno Tomás, founder of IsDown.app, a vendor outage monitoring tool transforming how teams handle third-party incidents. Nuno shares his journey from software engineer to entrepreneur, the pivotal 4 a.m. moment that inspired IsDown, and the challenges of balancing startup life with family. We dive into the complexities of incident communication, how to tackle alert fatigue, and why transparency is key to building trust in SaaS...
2025-01-24, 34 min

The Cloud Gambit
Founders Corner: The Psychology of Product-Market Fit with JJ Tang
Finding product-market fit is a psychological puzzle, especially for first-time founders launching during a pandemic. Meet JJ Tang, Co-Founder and CEO of Rootly. In this episode, JJ shares his journey—from growing up in China to attending university in Canada and founding a successful incident management platform used by giants like Cisco, NVIDIA, Canva, LinkedIn, and more.
We explore valuable lessons from Y Combinator’s S21 batch, the role of messaging in achieving product-market fit, and JJ’s fresh insights on building trust, understanding customer needs, and why incident management is as much about...
2025-01-21, 52 min

The MonkCast
JJ Tang on Incident Management, Cold Emailing CEOs, and the Origins of Rootly
In this RedMonk conversation, JJ Tang, CEO of Rootly, chats with senior analyst Kelly Fitzpatrick about the evolving landscape of incident management, the significance of consistency over speed, and the impact of AI on business operations. Tang shares his career journey, from landing a job at IBM (involving cold emailing the CEO) to building incident management tools at Instacart, to starting Rootly with cofounder Quentin Rousseau. He focuses on the importance of technical skills, networking, and understanding customer needs. Tang also reflects on the challenges of running a startup during economic shifts and offers valuable advice for aspiring entrepreneurs.
2025-01-15, 33 min

This is Fine! A podcast about resilience engineering and software
Episode 6 - Can You Buy Resilience? With Special Guest Steve McGhee
– Steve is the host of the Google SRE Prodcast, you should check it out!
– Colette got her chickens from Greenfire Farms, and her chicken coop from Carolina Coops, if anyone is wondering.
– The Chris Hayes podcast Colette mentioned about unconditional cash transfers is here.
– Iain M. Banks is an author of The Culture series, a set of fiction books based in a post-scarcity society
– If you didn’t get the Vizzini/Inigo Montoya references, you should probably find a way to see The Pr...
2025-01-08, 55 min

Detection at Scale
Rootly’s JJ Tang on Transforming Incident Management Culture
In this episode of Detection at Scale, Jack speaks to JJ Tang, CEO and Co-founder of Rootly, about revolutionizing incident management in tech organizations.
JJ shares his journey from practitioner to founder and emphasizes the importance of viewing incident management as a cultural and collaborative effort rather than just a tooling issue. JJ touches on breaking down silos between security and other teams to enhance communication and reliability, and empowering security practitioners to take on educator roles within their organizations. He also offers actionable insights on creating a culture of reliability and improving incident response st...
2024-11-14, 25 min

Internet Plumber
21: Florida gators, open source design tools, board game kickstarters, and Haley Ingram from C&C
In this episode of the Internet Plumber Podcast, we chat with our friend Haley Ingram — the founder and CEO of Coffee and Contracts. We chat everything from camper van conversions, gators, to SaaS marketing tools, this episode really has it all. We also dig into Hopsotch.io, and a few open source design tools that might start to give Figma a run for their money. Oh and Gavin backed a product in this pod — tune in to find out what he backed on kickstarter. Drop your comments & thoughts below, and join us weekly as we zip through our newsletter to s...
2024-06-04, 51 min

Code Story: Insights from Startup Tech Leaders
S9 E25: JJ Tang, Rootly
JJ Tang grew up in mainland China, though eventually he came to the US for University and stayed. He found a job, started working and eventually, moved to Canada, as the entrepreneur ecosystem was quite friendly. He has worked for companies like Instacart, IBM and Cisco, gaining vast experience in a myriad of roles. But outside of tech, he is engaged to be married in 2025, and is big into road cycling.
In addition, he has a dog named Nova, which his current venture centers their merch design around. At Instacart, JJ started to build a tool for...
2024-05-21, 30 min

10KMedia Podcast
Episode 46: Ashley Sawatsky, Incident Response & Reliability Advocate at Rootly
Adam sits down with Ashley to discuss the benefits of declaring more incidents, providing engineers with the right incentives, and why cobbling together Slack and Google Docs just won't cut it.
2024-04-03, 46 min

Geeking Out with Adriana Villela
The One Where We Geek Out on Reliability with Ashley Sawatsky of Rootly
About our guest:
As a founding member of Shopify's incident response program for nearly 7 years, Ashley Sawatsky led incident communications and processes. Currently, as Senior Incident Response Advocate at Rootly, she consults with tech giants like Canva, Cisco, NVIDIA, and more on incident response strategies.
Find our guest on:
– X (Twitter)
– LinkedIn
Find us on:
– All of our social channels are on bento.me/geekingout
– All of Adriana's social channels are on bento.me/adrianamvillela
Show Links:
– Shopify
– Rootly
– Windows 98
– Ruby on Rails
– Site Reliability Engineering (Book)
– Disney Interactive
– Working Effectively With Executives During an...
2024-02-20, 43 min

Hackers Archives - Software Engineering Daily
Cross-functional Incident Management with Ashley Sawatsky and Niall Murphy
Incident management is the process of managing and resolving unexpected disruptions or issues in software systems, especially those that are customer-facing or critical to business operations. Implementing a robust incident management system is often a key challenge in technical environments. Rootly is a platform to handle incident management directly from Slack, and is used by hundreds of leading companies including Canva, Grammarly, and Cisco. Ashley Sawatsky leads Developer Relations at Rootly and previously led Shopify’s Incident Communications team.
Niall Murphy is Co-founder and CEO at Stanza. He has written extensively about reliability engineering and is the co...
2023-08-17, 50 min

Software Engineering Daily
Cross-functional Incident Management with Ashley Sawatsky and Niall Murphy
Incident management is the process of managing and resolving unexpected disruptions or issues in software systems, especially those that are customer-facing or critical to business operations. Implementing a robust incident management system is often a key challenge in technical environments. Rootly is a platform to handle incident management directly from Slack, and is used by hundreds of leading companies including Canva, Grammarly, and Cisco. Ashley Sawatsky leads Developer Relations at Rootly and previously led Shopify's Incident Communications team. Niall Murphy is Co-founder and CEO at Stanza. He has written extensively about reliability engineering and is the co-author...
2023-08-17, 50 min

TestGuild News Show
From Manual to Automated Testing, Coding Bliss and More TGNS92
Do you want to know the roadmap from manual to automation testing? What are the four main destroyers of any test automation effort? And what are some of the new features in the latest Playwright release? Find out in this episode of the TestGuild News Show for the week of August 13th. Grab a cup of coffee or tea, and let's do this!
Time  News Title  Rocket Link
0:16  Applitools FREE Account Offer  https://applitools.info/joe
0:34  The Roadmap to Zero Manual Testing Webinar  https://online-events.keysight.com/keysight-technologies7/The-Roadmap-to-Zero-Manual-Testing-Leveraging-Automation?cmpid=ASC-2111146&utm_source=ADSC&utm_medium=AS...
2023-08-15, 09 min

Softwarible Bites
#15 What is an SRE? by JJ Tang of rootly.com
The responsibilities of an SRE role are loosely defined, but in general they focus on reliability (it’s in the name) and incident response.
The article featured in this episode walks us through the responsibilities of an SRE engineer. If you are looking for software and cloud development services and training, check out softwarible.com. We help customers in their software and cloud journey. Whether it’s taking software from ideas to complete solutions, migrating existing systems to the cloud, or mentoring and training individuals and teams, we make it work for your needs.
Article: https://root...
2021-12-10, 12 min

Softwarible Bites
#8 Google’s State of DevOps 2021 Report: What SREs Need to Know by Quentin Rousseau of rootly (rootly.com)
The Accelerate State of DevOps 2021 Report by Google was released. This article discusses four key takeaways for SREs. If you are looking for software and cloud development services and training, check out softwarible.com. We help customers in their software and cloud journey. Whether it’s taking software from ideas to complete solutions, migrating existing systems to the cloud, or mentoring and training individuals and teams, we make it work for your needs.
Article: https://rootly.com/blog/google-s-state-of-devops-2021-report-what-sres-need-to-know
Tool of the Day: https://github.com/itaysk/kubectl-neat
2021-11-24, 13 min

Augment
Smile Identity, Rootly, Clearco, MarketForce, Botman and CRMNEXT raise funds | IBM plans to acquire BoxBoat Technologies | Peter Boyce II forms his own firm called Stellation Capital | Switzerland’s national postal service has acquired Tresorit
Smile Identity received a $7 million Series A investment from Costanoa Ventures and CRE Venture Capital. Smile Identity intends to utilize the extra funds to strengthen its services, expand into new locations, and serve a larger range of ID kinds. John Cowgill, a Costanoa partner, will join Smile Identity’s board of directors.
Smile Identity provides ID solutions in Africa and throughout the world for banking, telecoms, financial services, and shared economy applications. IBM plans to acquire BoxBoat Technologies, a well-known DevOps consulting firm and Kubernetes-certified enterprise service provider. BoxBoat will join IBM Global Business Services’ Hybrid Cloud Serv...
2021-07-09, 03 min