In this episode of Inference Time Tactics, Cooper and Byron sit down with Charlie and Anil from Rapt AI to tackle one of the industry's most expensive problems: GPU underutilization. With half a trillion dollars invested in GPU infrastructure running at just 20-30% utilization, Rapt AI is building AI-powered orchestration that automatically analyzes workloads and matches them to the right compute resources—no guesswork required.
We talked about:
- Why half a trillion dollars in GPU infrastructure runs at only 20-30% utilization—and how a 5% drop in utilization can cost $200,000 on a $2M investment.
- How Rapt AI's platform continuously analyzes workloads and auto-optimizes GPU allocation, letting customers run 4-14 models per GPU.
- Real results: moving workloads from H100s to A100s at 40% of the cost, and shrinking a GPU footprint from 184 GPUs to under 50 while improving performance.
- Why 2026 is shaping up to be the year of inference, as agentic workloads create unprecedented infrastructure chaos.
- The shift from supply problems to optimization problems—and why abstraction layers matter across multi-vendor environments.
- Power as the next crisis: why tokens-per-watt is emerging as a critical metric alongside tokens-per-dollar.
- How intelligent orchestration frees data scientists and MLOps teams from infrastructure tuning so they can focus on AI innovation.
Connect with Rapt AI:
Website: https://www.rapt.ai/
LinkedIn (Anil Ravindranath): https://www.linkedin.com/in/anilravindranath
LinkedIn (Charlie Leeming): https://www.linkedin.com/in/charlieleeming/
Connect with Neurometric:
Website: https://www.neurometric.ai/
Substack: https://neurometric.substack.com/
X: https://x.com/neurometric/
Bluesky: https://bsky.app/profile/neurometric.bsky.social
Hosts:
Calvin Cooper
https://x.com/cooper_nyc_
https://www.linkedin.com/in/coopernyc
Byron Galbraith
https://x.com/bgalbraith
https://www.linkedin.com/in/byrongalbraith