**Are today's state-of-the-art autoregressive, decoder-style transformers the only future for large language models?** **We dive into the most fascinating alternatives, including linear attention hybrids that promise big efficiency gains for long contexts, and text diffusion models that generate tokens in parallel instead of sequentially.** Plus, discover how Code World Models teach models to simulate code execution, aiming to build more capable coding systems.