In this episode of Inference Time Tactics, Cooper and Byron sit down with Charlie and Anil from Rapt AI to tackle one of the industry's most expensive problems: GPU underutilization. With half a trillion dollars invested in GPU infrastructure running at just 20-30% utilization, Rapt AI is building AI-powered orchestration that automatically analyzes workloads and matches them to the right compute resources—no guesswork required.
We talked about:
- Why half a trillion dollars in GPU infrastructure runs at only 20-30% utilization—and how a 5% drop in utilization can cost $200,000 on a $2M investment.
- How Rapt AI's platform continuously analyzes workloads and auto-optimizes GPU allocation, letting customers run 4-14 models per GPU.
- Real results: moving workloads from H100s to A100s at 40% of the cost, and shrinking a GPU footprint from 184 GPUs to under 50 while improving performance.
- Why 2026 is shaping up to be the year of inference, as agentic workloads create unprecedented infrastructure chaos.
- The shift from supply problems to optimization problems—and why abstraction layers matter across multi-vendor environments.
- Power as the next crisis: why tokens-per-watt is emerging as a critical metric alongside tokens-per-dollar.
- How intelligent orchestration frees data scientists and MLOps teams from infrastructure tuning so they can focus on AI innovation.
Connect with Rapt AI:
Website: https://www.rapt.ai/
LinkedIn (Anil Ravindranath): https://www.linkedin.com/in/anilravindranath
LinkedIn (Charlie Leeming): https://www.linkedin.com/in/charlieleeming/
Connect with Neurometric:
Website: https://www.neurometric.ai/
Substack: https://neurometric.substack.com/
X: https://x.com/neurometric/
Bluesky: https://bsky.app/profile/neurometric.bsky.social
Hosts:
Calvin Cooper
https://x.com/cooper_nyc_
https://www.linkedin.com/in/coopernyc
Byron Galbraith
https://x.com/bgalbraith
https://www.linkedin.com/in/byrongalbraith