Description

In this episode of The Private AI Lab, Johan van Amersfoort speaks with Adam Grzywaczewski, a Senior Deep Learning Data Scientist at NVIDIA, about the rapidly evolving world of AI inference.

They explore how inference has shifted from simple, single-GPU execution to highly distributed, latency-sensitive systems powering today’s large language models. Adam explains the real bottlenecks teams face, why software optimization and hardware innovation must move together, and how NVIDIA’s inference stack—from TensorRT-LLM to Dynamo—enables scalable, cost-efficient deployments.

The conversation also covers quantization, pruning, mixture-of-experts models, AI factories, and why inference optimization is becoming one of the most critical skills in modern AI engineering.

Topics covered