In this post, 0xkato explains how modern transformer-based LLMs work, walking through the core machinery that turns text into token IDs, embeds them as vectors, tracks position, uses attention and feed-forward networks to process meaning, and then predicts the next token in a loop. The piece is pitched as an accessible, low-math introduction, showing how shared architecture, trained weights, model configuration, and post-training together shape systems like GPT, Claude, Gemini, and LLaMA.
* 00:00 - Introduction
* 02:40 - Tokenization
* 05:45 - Embeddings
* 08:36 - Positional encoding
* 13:02 - Attention
* 19:10 - Multi-head attention
* 23:39 - Feed-forward network
* 29:03 - Residual stream and layer normalization
* 33:41 - Next-token prediction
* 37:26 - Architecture vs trained weights
* 39:50 - Where this is going
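
As a rough companion to the chapters above, here is a minimal pure-Python sketch of the same pipeline: tokenize, embed, add position, apply one attention step and one feed-forward step on a residual stream, then predict the next token in a loop. It is not code from the post. The toy vocabulary, the sizes, the random untrained weights, and every function name are assumptions made for illustration, and it uses a single layer with a single attention head, learned-style position vectors, ReLU, and greedy decoding, where production models use many layers and heads and often different choices.

```python
import math
import random

random.seed(0)  # fixed seed so the untrained toy weights are reproducible

VOCAB = ["<bos>", "the", "cat", "sat", "on", "mat"]  # hypothetical toy vocabulary
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
D_MODEL, D_FF, MAX_LEN = 8, 32, 16  # assumed toy sizes


def rand_matrix(rows, cols):
    """Random (untrained) weights; a real model learns these during training."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


TOK_EMB = rand_matrix(len(VOCAB), D_MODEL)   # one vector per token ID
POS_EMB = rand_matrix(MAX_LEN, D_MODEL)      # one vector per position
W_Q, W_K, W_V, W_O = (rand_matrix(D_MODEL, D_MODEL) for _ in range(4))
W_1, W_2 = rand_matrix(D_MODEL, D_FF), rand_matrix(D_FF, D_MODEL)
W_OUT = rand_matrix(D_MODEL, len(VOCAB))     # projects back to vocabulary scores


def vec_mat(v, m):
    """Multiply a row vector by a matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


def softmax(xs):
    peak = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(v, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def tokenize(text):
    """Whitespace tokenizer over the toy vocabulary (real models use subwords)."""
    return [TOKEN_TO_ID[word] for word in text.split()]


def next_token_probs(ids):
    # Embeddings plus position: each token ID becomes a vector that knows its slot.
    x = [add(TOK_EMB[t], POS_EMB[i]) for i, t in enumerate(ids)]

    # Attention: each position looks back at itself and earlier positions only.
    normed = [layer_norm(v) for v in x]
    queries = [vec_mat(v, W_Q) for v in normed]
    keys = [vec_mat(v, W_K) for v in normed]
    values = [vec_mat(v, W_V) for v in normed]
    attended = []
    for i in range(len(x)):
        scores = [
            sum(a * b for a, b in zip(queries[i], keys[j])) / math.sqrt(D_MODEL)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        context = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(D_MODEL)
        ]
        attended.append(add(x[i], vec_mat(context, W_O)))  # residual add
    x = attended

    # Feed-forward network, applied to each position independently.
    processed = []
    for v in x:
        hidden = [max(0.0, h) for h in vec_mat(layer_norm(v), W_1)]  # ReLU
        processed.append(add(v, vec_mat(hidden, W_2)))  # residual add
    x = processed

    # Next-token prediction: the last position's vector becomes probabilities.
    return softmax(vec_mat(layer_norm(x[-1]), W_OUT))


def generate(ids, steps):
    """Greedy decoding: append the most likely token and run the model again."""
    ids = list(ids)
    for _ in range(steps):
        probs = next_token_probs(ids)
        ids.append(max(range(len(probs)), key=probs.__getitem__))
    return ids
```

Calling `generate(tokenize("the cat sat"), 4)` returns the three prompt IDs followed by four predicted IDs. Because the weights are random, the continuation is meaningless. Swapping in trained weights, with the code unchanged, is what would make the output useful, which is the architecture-versus-weights distinction covered at 37:26.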
https://www.0xkato.xyz/how-llms-actually-work/