We talk a lot about training in AI. The data. The GPUs. The size of the model.
But once it’s trained, the real work begins.
Every time you chat with an LLM, get a Photoshop suggestion, or hear an AI-generated voice, you’re tapping into a process called inference—when the model is actually put to use.
And this part? It’s increasingly becoming the bottleneck in the AI pipeline.
At this year’s NVIDIA GTC, I sat down with Cirrascale, a boutique cloud company building infrastructure specifically for inference. The interview itself was straightforward—what stood out to me was how emblematic their platform is of something much bigger happening across the AI industry.
This isn’t just about one company. This is about the future of AI scale, deployment, and real-world use.
The Industry Shift: From Research Demos to Real Products
For years, the AI conversation has been dominated by benchmarks and training runs—how fast can you train GPT-style models? How big is your transformer?
But for AI to actually matter in day-to-day life—whether in enterprise, edge robotics, or consumer tools—it needs to run well. In real time. At scale. Across devices. With power constraints. And cost constraints. And privacy constraints.
That’s inference.
And suddenly, everyone is in the infrastructure game:
* OpenAI is building its own data centers.
* Google is optimizing Gemini for mobile inference.
* Meta is pushing efficient multimodal models for on-device use.
* And companies like Cirrascale are building purpose-built inference platforms optimized for enterprise needs.
This shift isn’t a side note—it’s the new battleground.
“We’re seeing enterprise customers with 200–300 potential AI use cases. But they need to figure out which 10 they can actually deploy this year.”
— Alex Nataros, Cirrascale
Cirrascale at GTC: One Window Into the Future
At their booth, Cirrascale showed off a Boston Dynamics Spot robot connected to their cloud inference platform. A multimodal model, LLaVA, analyzed its camera feed in real time.
The demo wasn’t there to dazzle—it was there to prove something important:
* Robots can’t carry massive models locally.
* Battery-powered devices need fast uplinks to smarter infrastructure.
* And inference platforms need to be optimized, not generic.
“Onboard, there’s only so much compute you can do. But send that data to our inference platform, and you unlock more intelligence per watt.”
— Alex Nataros
That’s the kind of detail that sticks with you—not because it’s flashy, but because it’s plausible. And that’s what AI needs more of.
Blackwell, FP4, and the Need for Speed
Inference isn’t just about getting an answer. It’s about getting the right answer fast.
Cirrascale’s team lit up when they talked about FP4, NVIDIA’s new lower-precision compute format on Blackwell chips.
“FP4 gives you just enough accuracy for real-time reaction—perfect for inference. It’s twice as fast as FP8 for many use cases.”
— Alex Nataros
That kind of trade-off—speed over archival accuracy—isn’t a bug. It’s the future of how models actually get used.
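To make that trade-off concrete, here is a minimal, hypothetical sketch of symmetric 4-bit quantization in NumPy. Note that this is an integer stand-in for illustration only: NVIDIA's actual FP4 is a floating-point format handled in Blackwell hardware, not user-level Python.

```python
import numpy as np

def quantize_4bit(weights):
    """Symmetric 4-bit integer quantization: map floats to 16 levels (-8..7).
    An illustrative stand-in for low-precision formats like FP4; real FP4
    is a floating-point type executed on the GPU, not integer rounding."""
    scale = np.max(np.abs(weights)) / 7.0  # largest weight maps to level 7
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from 4-bit levels."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)

q, scale = quantize_4bit(w)
w_hat = dequantize(q, scale)

# 4-bit storage is 8x smaller than FP32, and smaller numbers move
# through memory and tensor cores faster. The cost: a small,
# bounded quantization error on every weight.
err = np.mean(np.abs(w - w_hat))
print(f"mean abs error: {err:.4f}")
```

The point of the sketch is the shape of the bargain: each weight loses a little precision, but the model shrinks and inference speeds up, which is often exactly the right trade for real-time serving.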
This Isn’t Just a Hardware Story. It’s a Strategic One.
We are entering a new phase of AI—one where performance-per-dollar, inference latency, energy usage, and physical deployment matter more than leaderboard scores.
And Cirrascale’s work is just one window into that.
Others are pushing in different directions:
* MosaicML (acquired by Databricks) is streamlining training and inference pipelines for LLMs.
* Groq is betting on ultra-low-latency AI inference chips.
* Apple is baking on-device inference into iOS.
In this environment, the quiet players—the ones focused on deployment, efficiency, and serving real workloads—are suddenly the ones to watch.
Why This Matters to You
If you’re building anything in AI, this is your reminder:
It’s not enough to have a model. You need to make it work.
That means thinking about:
* Where it runs
* How fast it responds
* How much it costs
* And what happens when it fails
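The checklist above can be sketched in a few lines. This is a toy illustration, not anyone's real client: `call_model` is a hypothetical stand-in for an inference API, and the timeout and fallback values are placeholders.

```python
import time

def call_model(prompt):
    """Hypothetical inference call; stand-in for a real API client."""
    time.sleep(0.01)  # simulate network + compute latency
    return f"answer to: {prompt}"

def infer_with_fallback(prompt, budget_s=0.5, fallback="Sorry, try again later."):
    """Wrap an inference call with latency tracking and failure handling."""
    start = time.perf_counter()
    try:
        answer = call_model(prompt)
    except Exception:
        return fallback, None  # what happens when it fails
    latency = time.perf_counter() - start
    if latency > budget_s:  # how fast it responds, against a budget
        return fallback, latency
    return answer, latency

answer, latency = infer_with_fallback("What is inference?")
print(answer, f"({latency * 1000:.1f} ms)")
```

Even this toy version forces the questions that matter in production: what is the latency budget, what does the user see on failure, and who pays for every call.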
Cirrascale happens to be one company tackling that puzzle. But the puzzle itself? That’s all of ours.
“We built a playground for devs to experiment with real models. Tell us what you’re building—so we can make the infrastructure better where it counts.”
— Alex Nataros
Final Thoughts
It’s easy to focus on what AI produces—the images, the conversations, the predictions. But none of that happens without infrastructure that can serve those models efficiently, reliably, and at scale.
What Cirrascale is doing isn’t loud or flashy—but it’s critical. This is the part of AI most users never see: the infrastructure that turns potential into performance. It’s the part that has to work before anything else can.
“Inference is how AI shows up in the real world. Not in research labs, but in robot arms, call centers, medical devices, and your pocket. It’s not as flashy as model training—but it’s where the magic becomes useful.”
— Alex Nataros
The next generation of AI isn’t just smarter. It’s faster, cheaper, and everywhere.
Every AI demo, every chatbot interaction, every edge device that claims to be intelligent—it all depends on inference. And inference depends on engineering like this.
Cirrascale’s platform is just one example of a much bigger movement across the industry: building not just bigger brains, but better bodies to carry them.
Because the future of AI isn’t just about what we can train. It’s about what we can run.
And that means the real AI story in 2025 isn’t just happening in the lab—it’s unfolding in the infrastructure.
Podcast Note: The podcast is AI-generated using Google’s NotebookLM.
Vocabulary Key
* Inference: The process of running a trained AI model to generate predictions, decisions, or outputs in real time.
* Training: The phase where a model learns patterns from large datasets; expensive and time-consuming, but done far less often than inference.
* FP4: A lower-precision number format used in GPUs to speed up inference with acceptable accuracy. Faster and more efficient than FP8 or FP16.
* Blackwell: NVIDIA’s next-generation GPU architecture designed for faster, more efficient AI workloads, including support for FP4.
* Multimodal Models: AI models that can understand and process multiple types of input—like text, images, and video—at once.
* LLaVA: A vision-language model (Large Language and Vision Assistant) used to provide image-aware context to LLMs.
* Edge Device: A computing device (like a robot or smartphone) that performs inference outside of traditional data centers.
FAQs
Q: What is inference, really?
A: Inference is when the AI model actually does something—like answering a question, generating an image, or helping a robot interpret a video feed. It happens after the model has been trained and is the key to real-world AI applications.
Q: Why is inference such a big deal now?
A: As models grow more complex and demand increases, inference is becoming the bottleneck. It’s where performance, cost, and latency constraints all collide—and where innovation is now focused.
Q: What’s special about FP4 and Blackwell?
A: FP4 is a new low-precision compute format supported by NVIDIA’s Blackwell GPUs. It allows faster inference at lower energy and hardware cost, making large-scale deployments more practical.
Q: Why did you talk to Cirrascale?
A: Because they’re a real-world example of a company focused on the future of inference infrastructure—building tools to help deploy models, not just train them.
Q: Is this just about Cirrascale?
A: Not at all. This piece uses Cirrascale as a case study to explore a broader shift happening across AI: from massive model training to efficient, scalable deployment.
Editor’s Note: Many thanks to our two interviewees from Cirrascale. Correction on the spelling: it should be Alex Nataros, not “Nadaros” as written in the original version of this article.
#AIInference #EdgeAI #NVIDIA #Cirrascale #DeepLearning #AIInfrastructure #Blackwell #FP4 #AIEngineering #AITools #ModelDeployment #TheWolfReadsAI #DeepLearningWithTheWolf