Description

Cody Blakeney from Datology AI joins us to talk about data curation - the unglamorous but critical work of figuring out what to actually train models on.

Cody's path from writing CUDA kernels to spending his days staring at weird internet text says a lot about where the leverage actually is: by his estimate, data quality can account for half or more of a model's final performance, which puts it on par with major architectural breakthroughs.

We get into the differences between pre-training, mid-training, and post-training data. Mid-training in particular has become a key technique for squeezing value out of rare, high-quality datasets. Cody's team stumbled onto it while solving a practical problem: how do you figure out if a 5-billion-token dataset is actually useful when you can't afford hundreds of experimental runs?
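
A common shape for this kind of experiment (not necessarily the exact recipe described in the episode) is an annealing-style ablation: instead of pretraining from scratch for every candidate dataset, you blend the candidate into the final, learning-rate-decay portion of a run and compare evals against a control that anneals on baseline data only. The sketch below only illustrates the token-budget arithmetic; the numbers and the anneal_mix helper are hypothetical, not DatologyAI's pipeline.

```python
# Hypothetical sketch of slotting a small candidate dataset into the tail of a
# training run for an annealing-style mid-training ablation. Illustrative only.

def anneal_mix(total_tokens: int, decay_fraction: float, candidate_weight: float):
    """Return (phase, tokens, data_weights) tuples describing the token budget."""
    decay_tokens = int(total_tokens * decay_fraction)
    main_tokens = total_tokens - decay_tokens
    return [
        ("main",   main_tokens,  {"baseline": 1.0}),
        ("anneal", decay_tokens, {"baseline": 1.0 - candidate_weight,
                                  "candidate": candidate_weight}),
    ]

# e.g. a 100B-token run whose last 10% mixes in 30% candidate data (~3B candidate tokens)
for phase, tokens, weights in anneal_mix(100_000_000_000, 0.10, 0.30):
    print(phase, tokens, weights)
```

The point of the setup is cost: only the short anneal phase has to be rerun per candidate dataset, so a 5-billion-token corpus can be assessed without hundreds of full pretraining runs.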

We also talk about data filtering and some genuinely surprising findings: the documents that make the best training data are often short and dense with information. Those nicely written blog posts with personal anecdotes? Turns out models don't learn as well from them.
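
To make "short and dense with information" concrete, here is a toy heuristic filter in Python. The word-count threshold and the unique-word-ratio proxy are invented for this example; they are not the filters or classifiers discussed in the episode.

```python
# Toy document filter illustrating "short and information-dense" as a heuristic.
# Thresholds and the density proxy are made up for illustration.

def lexical_density(text: str) -> float:
    """Fraction of distinct words among all words -- a crude proxy for information density."""
    words = text.lower().split()
    return len(set(words)) / len(words) if words else 0.0

def keep_for_pretraining(text: str, max_words: int = 2_000, min_density: float = 0.85) -> bool:
    """Keep documents that are short and have a high ratio of distinct words."""
    return len(text.split()) <= max_words and lexical_density(text) >= min_density

docs = [
    "The gradient of the loss with respect to each weight is computed by backpropagation.",
    "So anyway, like I was saying, last weekend I went hiking and it was really really nice, really nice.",
]
print([keep_for_pretraining(d) for d in docs])  # -> [True, False] with these toy thresholds
```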

On synthetic data, Cody thinks its use in pre-training is still in its early days: most techniques are variations on a few core ideas, but there's huge potential. He's especially excited about connecting RL failures back to mid-training: when models fail at tasks, use that signal to generate targeted training data.


Takeaways:

Changes: trimmed


About

The Information Bottleneck is hosted by Ravid Shwartz-Ziv and Allen Roush, featuring in-depth conversations with leading AI researchers about the ideas shaping the future of machine learning.