Showing episodes and shows of Mcgrof
Shows
AI: post transformers
ShadowKV: High-Throughput Long-Context LLM Inference
This April 2025 paper introduces ShadowKV, an innovative inference system for long-context Large Language Models (LLMs) designed to significantly enhance throughput and support larger batch sizes without compromising accuracy. It achieves this by strategically managing the Key-Value (KV) cache: specifically, it compresses the low-rank pre-Rotary Position Embedding (RoPE) key cache on the GPU and offloads the value cache to the CPU. ShadowKV further optimizes performance through an accurate KV selection strategy that reconstructs minimal sparse KV pairs on-the-fly, thus minimizing decoding latency. Em...
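The mechanism above lends itself to a small illustration. The sketch below assumes a toy single-head cache and an SVD-based low-rank factorization; the function names, the rank, and the top-k selection rule are illustrative choices, not ShadowKV's actual implementation.

```python
import torch

def compress_prerope_keys(K: torch.Tensor, rank: int = 32):
    """Factor the pre-RoPE key cache (seq_len, d_head) into low-rank pieces kept on GPU."""
    U, S, Vh = torch.linalg.svd(K, full_matrices=False)
    return U[:, :rank] * S[:rank], Vh[:rank, :]            # (seq_len, rank), (rank, d_head)

def decode_step(q, A, B, V_cpu, top_k: int = 64):
    """Score against the low-rank keys, then rebuild only the top-k KV pairs."""
    K_approx = A @ B                                        # cheap on-the-fly reconstruction
    scores = (K_approx @ q) / K_approx.shape[-1] ** 0.5
    idx = scores.topk(min(top_k, scores.numel())).indices
    V_sel = V_cpu[idx.to("cpu")].to(q.device)               # fetch only the values that are needed
    return torch.softmax(scores[idx], dim=-1) @ V_sel

seq_len, d = 4096, 128
K_prerope = torch.randn(seq_len, d)                         # would live on the GPU
V_cpu = torch.randn(seq_len, d)                             # value cache offloaded to CPU
A, B = compress_prerope_keys(K_prerope)
out = decode_step(torch.randn(d), A, B, V_cpu)              # (d,) attention output
```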
2025-09-17
18 min
AI: post transformers
TailorKV: Hybrid KV Cache Compression for LLMs
This May 2025 paper introduces TailorKV, a novel hybrid framework designed to optimize Key-Value (KV) cache management in large language models (LLMs) for long-context inference. It addresses challenges like high GPU memory consumption and inference latency that arise from the linear growth of KV cache size with sequence length. TailorKV categorizes Transformer layers into quantization-friendly and sparsity-friendly based on their attention patterns, applying 1-bit quantization to the former and dynamic retrieval of Top-K tokens from CPU memory for the latter. This tailored approach significantly reduces memory usage and decoding latency while maintaining model accuracy, enabling LLMs to operate efficiently on...
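As a rough illustration of the two paths described above, the sketch below applies 1-bit (sign plus scale) quantization on a "quantization-friendly" layer and Top-K token retrieval from CPU memory on a "sparsity-friendly" layer; the routing table and shapes are assumptions for illustration, not TailorKV's configuration.

```python
import torch

def quantize_1bit(x: torch.Tensor):
    """1-bit quantization per row: sign bits plus the mean absolute value as the scale."""
    scale = x.abs().mean(dim=-1, keepdim=True)
    return torch.sign(x), scale

def topk_retrieve(q, K_cpu, V_cpu, k: int = 32):
    """Sparsity-friendly path: score against CPU-resident keys, fetch only k tokens."""
    scores = K_cpu @ q.to("cpu")
    idx = scores.topk(min(k, scores.numel())).indices
    return K_cpu[idx].to(q.device), V_cpu[idx].to(q.device)

layer_kind = {0: "quant", 1: "sparse"}                      # assumed offline layer classification
K, V, q = torch.randn(1024, 128), torch.randn(1024, 128), torch.randn(128)
if layer_kind[0] == "quant":
    signs, scale = quantize_1bit(K)
    K_used = signs * scale                                  # dequantized on the fly
else:
    K_used, V = topk_retrieve(q, K, V)
```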
2025-09-17
18 min
AI: post transformers
MIRAGE: Optimizing LLM KV Cache with Parameter Remapping
This July 2025 paper discusses advanced memory optimization techniques for Large Language Models (LLMs), particularly focusing on KV cache management in multi-tenant serving environments. The primary subject, MIRAGE, introduces parameter remapping, a novel method that dynamically repurposes GPU memory allocated for model parameters to expand KV cache capacity, outperforming traditional CPU-offloading and KV cache swapping by reducing latency and increasing throughput. Complementary research highlights challenges in on-device LLM deployment and proposes solutions like quantization (AWQ) for model compression and two-level scheduling (FineServe, Nexu...
2025-09-17
20 min
AI: post transformers
WebSailor-V2: Bridging Proprietary Agents with Synthetic Data and RL
This September 2025 paper introduces WebSailor-V2, an open-source deep research agent developed by Alibaba Group's Tongyi Lab. The paper details a post-training pipeline that uses a novel synthetic data construction scheme, SailorFog-QA-V2, and a dual-environment reinforcement learning framework. WebSailor-V2, built on the Qwen3-30B-A3B model, demonstrates state-of-the-art performance among open-source agents and is competitive with leading proprietary systems on various web-agent benchmarks, including BrowseComp and Humanity's Last Exam. The authors emphasize that high-quality data and a stable training environment are more crucial than the specifi...
2025-09-17
19 min
AI: post transformers
Dynamic Chunking for Hierarchical Sequence Modeling
This July 2025 paper introduces Hierarchical Networks (H-Nets), a novel architecture designed to move beyond traditional tokenization in large language models by implementing dynamic chunking. This mechanism allows the model to automatically learn content- and context-dependent segmentation strategies directly from raw data, eliminating the need for predefined pre-processing steps like byte-pair encoding (BPE). H-Nets utilize a recursive, multi-stage structure that processes data at varying levels of abstraction, from bytes to more complex semantic units. Experiments demonstrate that H-Nets, particularly multi-stage configurations, outperform tokenized Transformers in perplexity, downstream tasks, and robustness to textual pertur...
2025-09-17
25 min
AI: post transformers
LoFT: Parameter-Efficient Fine-Tuning for Long-tailed Semi-Supervised Learning
This September 2025 paper introduces LoFT, a novel framework designed to improve Long-Tailed Semi-Supervised Learning (LTSSL) by leveraging parameter-efficient fine-tuning of pre-trained foundation models. The core idea is to enhance confidence calibration and generate more reliable pseudo-labels, which are crucial for addressing the imbalance inherent in long-tailed datasets. Furthermore, the paper extends this approach to open-world scenarios with LoFT-OW, specifically incorporating mechanisms to detect and filter out-of-distribution (OOD) samples from unlabeled data. The authors demonstrate that these fine-tuned models achieve superior performance on various benchmarks, even when utilizing significan...
2025-09-17
17 min
AI: post transformers
QuantAgent: Multi-Agent LLM for High-Frequency Trading
This September 2025 paper describes QuantAgent, a novel multi-agent large language model (LLM) framework designed for high-frequency quantitative trading based solely on price-derived market signals. The system decomposes trading decisions into four specialized agents—IndicatorAgent, PatternAgent, TrendAgent, and DecisionAgent—which analyze market dynamics from complementary perspectives and communicate through structured prompts. QuantAgent consistently outperforms baseline models across diverse assets, including commodities, equities, and cryptocurrencies, demonstrating robust generalization and achieving high directional accuracy in predicting price movements. A key feature is its ability to produce traceable, language-native explanations for its trading decisions...
2025-09-17
17 min
AI: post transformers
Infini-gram: Scaling Unbounded N-gram Language Models
This April 2025 paper introduces Infini-gram, a novel engine designed to scale n-gram language models to an unprecedented 5 trillion tokens and support unbounded n (∞-gram LMs). Unlike traditional methods that rely on pre-computed count tables, Infini-gram leverages suffix arrays for efficient, low-latency calculation of n-gram and ∞-gram probabilities, even for extremely long contexts. The authors demonstrate that this modernized approach significantly improves the perplexity of neural Large Language Models (LLMs), by up to 73%, by offering complementary insights into human-written and machine-generated text. Beyond enhancing LLMs, the Infini-gram engine also enables various appl...
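A toy version of the suffix-array idea is easy to sketch: counts for any n-gram are answered by binary search over sorted suffixes, so no count table for a fixed n is ever built. Everything below, including the backoff rule, is a simplified illustration rather than the Infini-gram engine itself.

```python
from bisect import bisect_left, bisect_right

def build_suffix_array(tokens):
    """Naive O(n^2 log n) suffix array over a token list; fine for a toy corpus."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count(tokens, sa, pattern):
    """Occurrences of `pattern` (a token tuple) via binary search on the suffix array."""
    key = lambda i: tuple(tokens[i:i + len(pattern)])
    return bisect_right(sa, pattern, key=key) - bisect_left(sa, pattern, key=key)

def infgram_prob(tokens, sa, context, next_token):
    """Back off to the longest suffix of `context` that occurs in the corpus."""
    for start in range(len(context)):
        ctx = tuple(context[start:])
        c_ctx = count(tokens, sa, ctx)
        if c_ctx > 0:
            return count(tokens, sa, ctx + (next_token,)) / c_ctx
    return None  # empty-context (unigram) fallback omitted in this toy

corpus = "a b a b c a b".split()
sa = build_suffix_array(corpus)
print(infgram_prob(corpus, sa, ("a",), "b"))   # 1.0 in this toy corpus
```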
2025-09-17
19 min
AI: post transformers
Generalist Reward Modeling with Inference-Time Scaling
This April 2025 paper introduces Self-Principled Critique Tuning (SPCT), a novel method designed to enhance the inference-time scalability of Generative Reward Models (GRMs) for various domains. It details how SPCT, through a combination of rejective fine-tuning and rule-based online reinforcement learning, facilitates the adaptive generation of principles and critiques, thereby improving the quality and inference-time scalability of GRMs. The paper compares different reward generation paradigms (scalar, semi-scalar, generative) and scoring patterns (pointwise, pairwise), demonstrating that the proposed DeepSeek-GRM models, particularly when guided by a meta Reward...
2025-09-16
17 min
AI: post transformers
Hierarchical Reasoning Model: Brain-Inspired AI for Complex Tasks
This August 2025 paper introduces the Hierarchical Reasoning Model (HRM), a novel AI architecture inspired by the human brain's hierarchical and multi-timescale processing. This model aims to overcome the limitations of current large language models (LLMs) and Chain-of-Thought (CoT) techniques in complex reasoning tasks, which often suffer from computational inefficiencies and extensive data requirements. HRM utilizes two interdependent recurrent modules: a high-level module for abstract planning and a low-level module for detailed computations, enabling it to achieve significant computational depth. Notably, HRM demonstrates exceptional performance on challenging reasoning benchmarks like Sudoku a...
2025-09-16
17 min
AI: post transformers
Native Sparse Attention: Efficient Long-Context LLMs
This February 2025 paper introduces Native Sparse Attention (NSA), a novel approach to address the computational demands of long-context modeling in large language models. NSA combines algorithmic innovations like a dynamic hierarchical sparse strategy with hardware-aligned optimizations to significantly improve efficiency. The paper highlights NSA's ability to maintain or even surpass the performance of traditional "Full Attention" models across various benchmarks, including general language, long-context tasks, and instruction-based reasoning, while achieving substantial speedups in decoding, forward, and backward propagation. It critically analyzes the shortcomings of existing sparse attention methods, particularly their failure to achi...
2025-09-16
16 min
AI: post transformers
CodeI/O: Reasoning Patterns Through Code Input-Output Prediction
This February 2025 paper introduces CodeI/O, a novel training method for Large Language Models (LLMs) that enhances general reasoning abilities by transforming code into an input-output prediction task. Instead of focusing on generating code, CodeI/O trains models to predict the inputs or outputs of given code, expressed as natural-language Chain-of-Thought (CoT) rationales. This approach allows LLMs to learn universal reasoning primitives embedded in code, such as logic flow and decision-making, while decoupling them from specific programming syntax. An improved version, CodeI/O++, further refines training data through
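To make the data construction concrete, here is a hedged sketch of how an output-prediction and an input-prediction example could be derived from one function and one sampled input; the prompt wording is invented for illustration and is not the paper's template.

```python
import inspect

def make_examples(fn, sample_input):
    """Build one output-prediction and one input-prediction example from a function."""
    src = inspect.getsource(fn)
    output = fn(*sample_input)
    output_pred = {
        "prompt": f"Given the code:\n{src}\nand input {sample_input}, "
                  f"reason step by step and predict the output.",
        "target": repr(output),
    }
    input_pred = {
        "prompt": f"Given the code:\n{src}\nand observed output {output!r}, "
                  f"reason step by step and predict a feasible input.",
        "target": repr(sample_input),
    }
    return output_pred, input_pred

def second_largest(xs):
    return sorted(set(xs))[-2]

out_ex, in_ex = make_examples(second_largest, ([3, 1, 4, 1, 5],))
print(out_ex["target"])   # '4'
```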
2025-09-16
16 min
AI: post transformers
Janus-Pro: Unified Multimodal AI with Scaled Improvements
This January 2025 paper introduces Janus-Pro, an enhanced artificial intelligence model for multimodal understanding and generation. It builds upon its predecessor, Janus, through optimized training strategies, expanded data, and increased model size. The authors demonstrate that Janus-Pro achieves significant improvements in both multimodal understanding benchmarks and text-to-image generation capabilities, producing more stable and aesthetically pleasing outputs. This work highlights the benefits of decoupling visual encoding for understanding and generation tasks within a unified autoregressive transformer architecture. Source: https://arxiv.org/pdf/2501.17811
2025-09-16
15 min
AI: post transformers
Federated Post-Training LLMs: An Accessibility and Efficiency Survey
This August 2025 paper examines the evolving landscape of Federated Large Language Models (FedLLM), focusing on how large language models are post-trained while preserving user data privacy. The authors introduce a novel taxonomy that categorizes FedLLM approaches based on model accessibility (white-box, gray-box, and black-box) and parameter efficiency. It highlights various techniques within these categories, such as adapter-based tuning and prompt tuning, which reduce computational and communication overhead. The paper also discusses the growing importance of inference-only black-box settings for future FedLLM development and identifies
2025-09-16
20 min
AI: post transformers
Non-Penetrative Tensor Partitioning for Collaborative AIoT Inference
This June 2025 paper introduces Non-Penetrative Tensor Partitioning (NPTP), a novel method designed to improve the speed of collaborative inference for Deep Neural Networks (DNNs) on Internet of Things (IoT) devices. It addresses the common challenge of limited resources and strict latency requirements by minimizing the communication overhead that typically arises when large images are divided and processed across multiple devices. Unlike existing methods that utilize penetrative partitioning, which leads to substantial data sharing between devices, NPTP employs a non-penetrative approach and a Multilevel Partitioning Algorithm (MPA) to reduce this inter-device communication. Experimental results demonstrate that NPTP significantly outperforms state-of-the-art...
2025-09-16
15 min
AI: post transformers
Collaborative Edge Inference with Dynamic Task Offloading and Early Exiting
This December 2024 paper introduces a collaborative inference framework designed for large-scale models in 5G smart city edge computing environments, addressing the challenge of limited memory and computing capacity on individual edge nodes. The framework partitions large models into sub-models deployed across multiple edge nodes and incorporates an early exit mechanism to accelerate inference. To manage the complexities of heterogeneous systems and dynamic environments, the authors propose a distributed algorithm called DTO-EE, which jointly optimizes task offloading strategies and confidence thresholds for early exits. Experimental r...
2025-09-16
13 min
AI: post transformers
Adaptive LLM Partitioning for Edge Inference
This May 2025 paper introduces a resource-aware algorithm designed to optimize the performance of Large Language Models (LLMs) for low-latency inference on edge computing devices. The core innovation lies in its fine-grained partitioning of the Transformer architecture, specifically at the attention head-level, rather than coarser layer-level divisions. This approach allows for dynamic reassignment and migration of these individual attention heads and their associated Key/Value (K/V) caches across heterogeneous edge devices. By managing the expanding memory footprint of K/V caches and exploiting parallel execution of attention heads, the proposed method significantly reduces inference latency and memory usage compared...
2025-09-16
15 min
AI: post transformers
UQ: Unsolved Questions for Language Models
This August 2025 paper introduces UQ, a novel evaluation framework designed to challenge large language models (LLMs) with complex, unsolved questions sourced from platforms like Stack Exchange, where no definitive ground truth answers currently exist. The framework consists of three main components: UQ-Dataset, a collection of 500 hand-filtered, difficult, and unsolved questions; UQ-Validators, a set of LLM-based validation strategies that assess candidate solutions by leveraging the observation that models are often better at verifying answers than generating them; and UQ-Platform, which facilitates community engagement and human verification. The paper highlights the generator-validator gap, demonstrating that LLMs show improved performance in validating...
2025-09-16
17 min
AI: post transformers
PETALS: Collaborative Large Language Model Inference and Fine-tuning
This March 2023 paper introduces PETALS, a novel system designed to facilitate the collaborative inference and fine-tuning of large language models (LLMs) by pooling resources from multiple participants. It addresses the significant computational and memory demands of LLMs, which typically restrict access for many researchers. PETALS proposes an alternative to traditional methods like slow RAM offloading or inflexible inference APIs by allowing distributed processing across a network of consumer GPUs, enhancing speed and flexibility. The system incorporates optimizations like 8-bit quantization and dynamic load balancing to improve performance and reliability. Ultimat...
2025-09-16
15 min
AI: post transformers
AWQ: On-Device LLM Compression and Acceleration
This July 2024 paper introduces Activation-aware Weight Quantization (AWQ), a novel method for compressing Large Language Models (LLMs) by quantizing weights to low-bit integers for efficient deployment on edge devices. It highlights that AWQ identifies and protects crucial "salient" weights by observing activation distributions, which significantly reduces quantization error without requiring computationally intensive training or overfitting to specific datasets. Complementing AWQ, the paper also presents TinyChat, an inference framework specifically designed to optimize and accelerate these 4-bit quantized LLMs on various hardware, including mobile GPUs and even resource-constrained devices like the Raspberry Pi, achieving substantial speedups compared to traditional implementations...
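The salient-channel idea can be illustrated in a few lines: channels with large average activation magnitude are scaled up before low-bit quantization, and the scale is folded back into the layer input so the product is unchanged. This is a simplified stand-in (symmetric 4-bit quantization, a fixed exponent alpha), not AWQ's grid-searched scales or the TinyChat kernels.

```python
import torch

def quantize_sym(w: torch.Tensor, n_bits: int = 4):
    """Symmetric per-output-channel quantization to n_bits, returned dequantized."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=0, keepdim=True) / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

def awq_like_quant(W, act_mean, alpha: float = 0.5):
    """W: (in_features, out_features); act_mean: mean |activation| per input channel."""
    s = act_mean.clamp(min=1e-5) ** alpha        # larger scale for salient channels
    return quantize_sym(W * s[:, None]), s       # protect salient rows, then quantize

W = torch.randn(512, 512)
x = torch.randn(64, 512)
W_q, s = awq_like_quant(W, x.abs().mean(dim=0))
y_ref = x @ W
y_awq = (x / s) @ W_q                            # fold the scale into the input side
print((y_ref - y_awq).abs().mean())              # smaller than naive quantization error
```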
2025-09-15
19 min
AI: post transformers
HybridServe: Efficient LLM Inference with Hybrid Caching
This January 2025 paper introduces HybridServe, an LLM inference system designed to enhance throughput and cost-effectiveness for large language models by optimizing memory usage and host-GPU communication. It tackles the challenges of host memory offloading, where model parameters and KV cache are stored on slower host memory to reduce costs but can lead to GPU underutilization due to limited transfer bandwidth. HybridServe proposes a novel activation checkpointing technique with a KV-Activation hybrid caching scheme that stores intermediate activations, allowing for faster recomput...
2025-09-15
22 min
AI: post transformers
FlexGen: High-Throughput LLM Inference on a Single GPU
This June 2023 paper introduces FlexGen, a novel high-throughput generation engine designed to overcome the substantial computational and memory demands of large language model (LLM) inference on limited hardware, specifically a single commodity GPU. It details FlexGen's ability to aggregate memory and computation across the GPU, CPU, and disk, employing an optimized scheduling approach and a linear programming-based policy search to store and access tensors efficiently. Furthermore, FlexGen incorporates 4-bit compression for model weights and attention caches, which significantly reduces memory footprint with minimal accuracy loss. The research demonstrates FlexGen's superior performance, achieving subst...
2025-09-15
20 min
AI: post transformers
GraphSAGE: Inductive Representation Learning on Large Graphs
This September 2018 paper introduces GraphSAGE, a novel inductive framework designed to generate node embeddings for large, evolving graphs, addressing limitations of prior transductive methods that struggle with unseen data. Instead of learning a specific embedding for each node, GraphSAGE learns a function that generates these embeddings by sampling and aggregating features from a node's local neighborhood. The authors evaluate various aggregator architectures, including mean, LSTM, and pooling functions, demonstrating that GraphSAGE significantly outperforms strong baselines on node classification tasks across diverse datasets, such as citation networks, Reddit posts, and...
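The aggregate-then-transform step is compact enough to sketch. Below is a mean-aggregator layer in the spirit of GraphSAGE, with fixed-size neighbor sampling and L2 normalization; the adjacency format and sizes are illustrative assumptions rather than the reference implementation.

```python
import random
import torch
import torch.nn as nn

class SageMeanLayer(nn.Module):
    def __init__(self, in_dim, out_dim, num_samples=5):
        super().__init__()
        self.lin = nn.Linear(2 * in_dim, out_dim)
        self.num_samples = num_samples

    def forward(self, x, adj):
        """x: (num_nodes, in_dim); adj: dict node -> list of neighbor ids."""
        agg = torch.zeros_like(x)
        for v, nbrs in adj.items():
            if nbrs:
                sampled = random.sample(nbrs, min(self.num_samples, len(nbrs)))
                agg[v] = x[sampled].mean(dim=0)          # mean of sampled neighbor features
        h = torch.relu(self.lin(torch.cat([x, agg], dim=-1)))
        return h / (h.norm(dim=-1, keepdim=True) + 1e-8) # L2-normalize embeddings

x = torch.randn(4, 8)
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(SageMeanLayer(8, 16)(x, adj).shape)   # torch.Size([4, 16])
```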
2025-09-15
26 min
AI: post transformers
MetaGraph: knowledge graphs from financial NLP
This September 2025 paper presents MetaGraph, a novel methodology for constructing knowledge graphs from scientific literature, specifically applied to Financial Natural Language Processing (NLP) research between 2022 and 2025. The authors utilized Large Language Models (LLMs) to extract key information from 681 papers, including tasks, datasets, models, motivations, and limitations, and organized it into a structured, queryable format. The analysis highlights three phases in Financial NLP's evolution: initial LLM adoption and task/dataset innovation, subsequent critical reflection on LLM limitations, and a current trend toward integrating peripheral techniques into mod...
2025-09-15
17 min
AI: post transformers
Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Model
This August 2025 paper explores the critical area of fact-checking and factuality evaluation in Large Language Models (LLMs). It systematically analyzes the challenges of misinformation generation, particularly hallucinations, which are factually incorrect but fluent outputs from LLMs. The paper investigates various mitigation strategies, including fine-tuning, instruction tuning, and Retrieval-Augmented Generation (RAG), which grounds LLM outputs in external knowledge. It further examines evaluation metrics, datasets, and prompting strategies used to assess and enhance the factual accuracy of these models, highlighting the need for more robust, explainable, and domain-specific fact-che...
2025-09-15
19 min
AI: post transformers
The Illusion of Diminishing Returns in LLM Execution
This September 2025 paper explores the concept of long-horizon execution in Large Language Models (LLMs), arguing that marginal gains in single-step accuracy can lead to exponential improvements in the length of tasks LLMs can complete. The authors introduce a novel framework to isolate execution capabilities by providing models with necessary knowledge and plans, revealing that larger models can execute significantly more steps, even when smaller models achieve perfect single-turn accuracy. A key finding is the "self-conditioning effect," where LLMs become more prone to errors when their past mistakes are present in the context...
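The compounding-error intuition can be written out in one line. Under the simplifying assumption of independent per-step errors (used here only for illustration), single-step accuracy p bounds the task horizon as follows:

```latex
\[
  \Pr[\text{$n$-step task completed}] = p^{\,n},
  \qquad
  n(p, s) = \frac{\ln s}{\ln p}.
\]
% At a target success rate s = 0.5: p = 0.99 gives n ~ 69 steps, while
% p = 0.999 gives n ~ 693 steps, roughly a tenfold longer horizon from
% under one point of single-step accuracy.
```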
2025-09-15
15 min
AI: post transformers
PyTorch FSDP: Scaling Fully Sharded Data Parallel
This September 2023 paper introduces PyTorch Fully Sharded Data Parallel (FSDP), an advanced solution designed to scale the training of exceptionally large machine learning models. It addresses limitations of previous methods like Distributed Data Parallel (DDP) by sharding model parameters, gradients, and optimizer states across multiple GPUs, thereby drastically reducing individual GPU memory consumption. FSDP employs various techniques, including deferred initialization, flexible sharding strategies, and optimizations for communication overlap and prefetching, to ensure high efficiency and a user-friendly experience. The research demonstrates FSDP's effectiveness in training models with billions of parameters, achieving nea...
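A minimal usage sketch of the sharding wrapper follows, assuming a multi-GPU host launched with torchrun; the tiny model, the optimizer choice, and the absence of an auto-wrap policy are simplifications for illustration, not recommendations from the paper.

```python
# Launch with e.g. `torchrun --nproc_per_node=2 fsdp_sketch.py`.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
    model = FSDP(model, device_id=local_rank)   # shards params, grads, optimizer state
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device=local_rank)
    loss = model(x).pow(2).mean()
    loss.backward()                             # gradients are reduce-scattered across ranks
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```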
2025-09-15
19 min
AI: post transformers
Llama 3: Architecture, Capabilities, and Safety
This November 2024 paper from the Meta Llama Team introduces Llama 3, a new family of large language models featuring 8B, 70B, and 405B parameters, designed with native multilingual support, coding, reasoning, and tool usage capabilities. The development emphasizes data quality and diversity, employing extensive filtering, de-duplication, and heuristic cleaning processes for both English and multilingual data, alongside scaling laws to optimize model size and training budgets. The models utilize a standard dense Transformer architecture with minor adaptations like grouped query attention and an attention mask for multi-document sequences, demonstrating comparable performance to leading models such as GPT-4 across various...
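One of the adaptations mentioned, grouped query attention, is easy to sketch: several query heads share each key/value head, shrinking the KV cache. The head counts and shapes below are illustrative, not Llama 3's actual configuration.

```python
import torch
import torch.nn.functional as F

def gqa(q, k, v, n_q_heads=8, n_kv_heads=2):
    """q: (seq, n_q_heads, d); k, v: (seq, n_kv_heads, d)."""
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)            # share each KV head across its group
    v = v.repeat_interleave(group, dim=1)
    q, k, v = (t.transpose(0, 1) for t in (q, k, v)) # (heads, seq, d)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return (F.softmax(scores, dim=-1) @ v).transpose(0, 1)

seq, d = 16, 64
out = gqa(torch.randn(seq, 8, d), torch.randn(seq, 2, d), torch.randn(seq, 2, d))
print(out.shape)   # torch.Size([16, 8, 64])
```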
2025-09-15
22 min
AI: post transformers
Graph Patterns of Knowledge in Large Language Models
This May 2025 paper explores the structural patterns of knowledge within Large Language Models (LLMs) by adopting a graph-based perspective. The authors quantify LLM knowledge at both the triplet and entity levels, analyzing its relationship with graph properties like node degree. Key findings include the discovery of knowledge homophily, where closely connected entities exhibit similar knowledgeability, and a positive correlation between an entity's degree and its knowledge. These insights further motivate the development of graph machine learning models to predict entity knowledge, which can then be used to strategically select less-known triplets for fine-tuning LLMs, leading to improved performance. The...
2025-09-14
15 min
AI: post transformers
All for One: LLMs Solve Mental Math at the Last Token
This September 2025 paper investigates how large language models (LLMs) perform mental math, particularly focusing on the flow of information and computational processes within their transformer architecture. The authors introduce two novel techniques, Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP), to identify a minimal computational subgraph called All-for-One (AF1). This subgraph reveals that for mental math tasks, input-specific computation is largely deferred to later layers and primarily handled by the final token, which receives necessary information from other tokens during a few specific intermediate layers. The
2025-09-13
17 min
AI: post transformers
Survey of Reinforcement Learning for Large Reasoning Models
This September 2025 paper provides a comprehensive overview of Reinforcement Learning (RL) as applied to Large Reasoning Models (LRMs). It breaks down the field into foundational components such as reward design and policy optimization, explaining various algorithms like PPO and GRPO. The document also discusses training resources, distinguishing between static corpora and dynamic environments, and highlights diverse applications of RL in LRMs, including coding, agentic tasks, and multimodal understanding, with a focus on models from 2025. Ultimately, the paper aims to identify future directions for scaling RL in LRMs to...
2025-09-13
25 min
AI: post transformers
SpikingBrain: Brain-Inspired LLMs for Efficient Long-Context Processing
This September 2025 technical report presents SpikingBrain, a novel family of large language models (LLMs) that draws inspiration from brain mechanisms to address the efficiency challenges of traditional Transformer architectures. The research focuses on efficient long-context training and inference by developing hybrid linear attention architectures and an adaptive threshold spiking neuron scheme. A significant aspect of this work is the successful training and deployment of these models on non-NVIDIA GPU clusters, specifically the MetaX platform, demonstrating the feasibility of large-scale LLM development on alternative hardware. The authors highlight substantial speedups in inference for long sequences and significant...
2025-09-13
16 min
AI: post transformers
Statistical Methods for Generative AI Reliability
This September 2025 paper explores the critical role of statistical methods in enhancing the reliability and functionality of Generative AI (GenAI), which inherently lacks guarantees regarding correctness or safety. It discusses various statistical applications, including improving and altering model behavior through techniques like output trimming and abstention based on risk scores, often utilizing conformal prediction for provable guarantees. The text also covers diagnostics and uncertainty quantification (UQ), differentiating between epistemic and aleatoric uncertainty and addressing challenges like semantic multiplicity and the need for calibration in GenAI outputs. Furthermore, it highlights the importance of statistical inference in evaluating GenAI models, particularly...
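One of the recurring tools here, conformal prediction for abstention, can be sketched in a few lines: a threshold on a risk score is calibrated on held-out data so that accepting only low-score outputs carries a finite-sample coverage guarantee under exchangeability. The score distribution and the alpha below are illustrative assumptions, not the paper's specific procedure.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha: float = 0.1) -> float:
    """Return the ceil((n+1)(1-alpha))-th smallest calibration score (higher = riskier)."""
    s = np.sort(np.asarray(cal_scores))
    n = len(s)
    k = int(np.ceil((n + 1) * (1 - alpha))) - 1   # 0-based index with finite-sample correction
    return float(s[min(k, n - 1)])

def accept(score: float, threshold: float) -> bool:
    return score <= threshold                     # otherwise abstain

rng = np.random.default_rng(0)
cal = rng.exponential(size=500)                   # stand-in risk scores on a calibration set
thr = conformal_threshold(cal, alpha=0.1)
print(thr, accept(0.3, thr), accept(5.0, thr))
```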
2025-09-13
18 min
AI: post transformers
EntiGraph: Scaling Language Models with Synthetic Pretraining
This October 2024 paper introduces synthetic continued pretraining (synthetic CPT), a novel method designed to enhance language model knowledge acquisition from small, specialized text collections. Current large language models often struggle with data efficiency and learning niche facts from limited sources. The core of this approach is EntiGraph, a synthetic data augmentation algorithm that extracts entities and their relationships from a small corpus to generate a much larger, more diverse synthetic dataset. Experiments using the QuALITY dataset demonstrate that EntiGraph CPT significantly improves a model's ability to answer ques...
2025-09-13
23 min
AI: post transformers
NOVELTYBENCH: Evaluating Language Model Diversity
This August 2025 paper introduces NOVELTYBENCH, a new benchmark designed to evaluate how well large language models (LLMs) generate diverse and high-quality outputs, addressing the problem of "mode collapse" where models produce repetitive responses. The research found that current state-of-the-art LLMs consistently generate less diversity than human writers, with larger models often exhibiting even lower diversity than their smaller counterparts. The benchmark uses a unique approach to measure functional equivalence between generations, ensuring that diversity is meaningful to users. While certain prompting strategies, like in-context regeneration, can enhance diversity, the study...
2025-09-12
18 min
AI: post transformers
HyperController: Fast, Stable Reinforcement Learning Hyperparameter Optimization
This April 2025 paper introduces HyperController, a novel and computationally efficient algorithm designed to optimize hyperparameters during the training of reinforcement learning neural networks. Hyperparameter optimization is crucial for improving machine learning models, but traditional methods can be slow and computationally intensive. HyperController addresses these challenges by modeling the hyperparameter optimization problem as an unknown Linear Gaussian Dynamical System and leveraging the Kalman filter for efficient prediction. The algorithm is validated through experiments on various OpenAI Gymnasium environments, where it demonstrates faster training times and superior or comparable performance compared to existing methods, achieving the highest median reward i...
2025-09-12
18 min
AI: post transformers
Parallel-R1: Reinforcement Learning for Parallel Thinking in LLMs
This September 10, 2025 technical report from Tencent AI Lab introduces Parallel-R1, a novel reinforcement learning (RL) framework designed to enhance large language models (LLMs) with parallel thinking capabilities for complex mathematical reasoning tasks. Unlike previous methods relying on supervised fine-tuning (SFT) over synthetic data, Parallel-R1 utilizes a progressive curriculum to address the cold-start problem in RL, initially using SFT on simpler tasks to instill the basic format of parallel thinking before transitioning to RL for exploration and generalization on more challenging problems. The research highlights that parallel thinking evo...
2025-09-12
15 min
AI: post transformers
Explaining AI for Digital Advertising with LLMs
This April 2025 paper introduces SODA, a novel framework designed to enhance digital advertising strategies by making opaque AI systems more understandable for marketers. The authors highlight the current challenges faced by advertisers due to the lack of transparency in major ad platforms like Meta, which often results in wasted ad spend and reliance on intuition. To address this, SODA integrates Large Language Models (LLMs) with explainable AI techniques to provide clear, actionable insights into ad performance. The framework initially employs an improved Click-Through Rate (CTR) prediction model, SoWide-v2, which also offers visua...
2025-09-11
16 min
AI: post transformers
AdLlama: Boosting Ad Performance with Reinforcement Learning
This July 2025 paper introduces AdLlama, a new large language model (LLM) for generating Facebook ad text, trained using Reinforcement Learning with Performance Feedback (RLPF). Unlike previous models that relied on supervised fine-tuning to imitate curated ads, AdLlama utilizes historical ad performance data, specifically click-through rates (CTR), as a reward signal to optimize its text generation. A large-scale A/B test on Facebook, involving nearly 35,000 advertisers, demonstrated that AdLlama significantly improved advertiser-level CTR by 6.7% and increased the number of ad variations advertisers created by 18.5%. The findings highlight RLPF as a promising, generalizable approach for m...
2025-09-11
17 min
AI: post transformers
ByteCheckpoint: A Unified LLM Checkpointing System
This July 2024 paper introduces ByteCheckpoint, a novel PyTorch-native system designed for Large Language Model (LLM) development. This system addresses critical challenges in LLM training, particularly the high I/O costs associated with saving and loading checkpoints, and the complexities of checkpoint resharding across different parallel configurations and training frameworks. ByteCheckpoint achieves this through a data/metadata disaggregated storage architecture and asynchronous tensor merging, enabling automatic online resharding and multi-framework support. The paper highlights ByteCheckpoint's significant performance improvements in reducing checkpoint savi...
2025-09-11
17 min
AI: post transformers
Darling: Reinforcing Diversity and Quality in Language Models
This September 2025 paper introduces Diversity-Aware Reinforcement Learning (Darling), a novel framework designed to enhance both the quality and semantic diversity of large language model (LLM) generations. Recognizing that traditional post-training methods often sacrifice diversity for accuracy, Darling integrates a learned partition function to measure semantic diversity beyond simple lexical variations. This diversity signal is then multiplied with a quality reward during online reinforcement learning, which encourages LLMs to produce responses that are not only high-quality but also distinct and novel. Experiments on both non-verifiable tasks, such as creative writing, and verifiable tasks, like competition math, demonstrate that Darling consistently...
2025-09-10
20 min
AI: post transformers
INF2: Near-Storage LLM Inference for High Throughput
This February 2025 paper introduces INF2, a novel framework designed to enhance the generative inference throughput of large language models (LLMs) by utilizing computational storage devices (CSDs). The core innovation, attention-near storage (ANS), offloads memory-intensive self-attention operations directly to accelerators within these storage devices, significantly reducing data transfer bottlenecks over the system interconnect. To further boost performance, INF2 incorporates delayed KV cache writeback which minimizes storage write latency by batching updates to the KV cache, and cooperative X-cache, which optimizes host memory usage by storing input activations instead of key-value caches for cooperative processing between the GPU and CSDs. Through...
2025-09-10
21 min
AI: post transformers
K2-Think: A Parameter-Efficient Reasoning System
The September 9, 2025 press release and paper announce and detail K2 Think, an advanced open-source AI reasoning system developed by the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) and G42 in the UAE. K2 Think stands out for its parameter efficiency, achieving performance comparable to much larger models, particularly in mathematical reasoning, with only 32 billion parameters. This breakthrough is attributed to a six-pillar approach, including supervised fine-tuning, reinforcement learning with verifiable rewards, agentic planning, test-time scaling, and optimization for Cerebras Waf...
2025-09-10
16 min
AI: post transformers
AlphaEvolve: AI for Scientific and Algorithmic Discovery
These May–June 2025 sources introduce AlphaEvolve, a novel AI coding agent developed by Google DeepMind in collaboration with mathematicians like Javier Gómez Serrano and Terence Tao. This Gemini-powered tool utilizes an evolutionary process, similar to natural selection, to generate and iteratively refine code solutions for complex problems. AlphaEvolve has demonstrated its capability in scientific and algorithmic discovery, successfully tackling open mathematical challenges such as improving bounds for matrix multiplication and the kissing number problem in 11 dimensions. Beyond theoretical advancements, it has also been applied to optimize critical components within G...
2025-09-10
14 min
AI: post transformers
BLEU: Automatic Machine Translation Evaluation
This July 2002 paper introduced BLEU (Bilingual Evaluation Understudy), an automatic and inexpensive method for evaluating machine translation (MT) quality. It highlights the limitations of human evaluation, such as its high cost and time consumption, and proposes BLEU as a quick, language-independent alternative that correlates strongly with human judgment. The core concept of BLEU involves measuring the "closeness" of a machine translation to one or more human reference translations through a modified n-gram precision metric and a brevity penalty. The paper details the mathematical formulation of the BLEU sco...
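For reference, the standard BLEU formulation combines the modified n-gram precisions p_n (uniform weights w_n = 1/N, typically N = 4) with a brevity penalty that compares candidate length c to the effective reference length r:

```latex
\[
  \mathrm{BP} =
  \begin{cases}
    1 & \text{if } c > r \\
    e^{\,1 - r/c} & \text{if } c \le r
  \end{cases}
  \qquad
  \mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big).
\]
```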
2025-09-10
20 min
AI: post transformers
Mini-o3: Scaling Reasoning for Visual Search
This September 2025 paper introduces Mini-o3, a Vision-Language Model (VLM) designed to overcome the limitations of existing VLMs in handling complex visual search tasks that require multi-turn reasoning and trial-and-error exploration. The researchers developed a three-component training recipe, including the creation of the Visual Probe Dataset with challenging, high-resolution images, a pipeline for synthesizing diverse multi-turn trajectories for supervised finetuning, and an over-turn masking technique in reinforcement learning. This masking prevents penalization of long, incomplete reasoning paths, encouraging deeper exploration without increasing training time. Mini-o3 demonstrates state-of-the-art performanc...
2025-09-10
12 min
AI: post transformers
Masked Diffusion Models: Performance and Theory
This September 2025 paper analyzes the theoretical benefits and limitations of Masked Diffusion Models (MDMs) for text generation, contrasting them with auto-regressive models. While MDMs can sample multiple tokens in parallel, offering a potential for efficiency, the research demonstrates that their actual performance depends heavily on the evaluation metric. Specifically, MDMs can achieve near-optimal fluency (low Token Error Rate) with a constant number of sampling steps, regardless of sequence length. However, when assessed for correctness (low Sequence Error Rate), particularly for tasks requiring logical reasoning, MDMs necessitate a number of sampling...
2025-09-10
16 min
AI: post transformers
TraceRL: Reinforcement Learning for Diffusion Language Models
This September 2025 paper introduces TraceRL, a novel reinforcement learning framework designed to enhance diffusion language models (DLMs) across various architectural types. The core idea behind TraceRL is to align the training process with the preferred inference trajectories of the model, which demonstrably improves performance on complex reasoning tasks like mathematics and coding. The authors also propose a diffusion-based value model to boost training stability. Through experiments, the paper showcases the effectiveness of TraceRL, yielding state-of-the-art DLMs called TraDo that outperform larger autoregressive models. Furthermore, the source provi...
2025-09-09
13 min
AI: post transformers
LLM Benchmark Robustness to Linguistic Variation
This September 2025 paper investigates the reliability and robustness of Large Language Models (LLMs) when evaluated using traditional benchmarks. The authors systematically paraphrased questions across six common benchmarks and observed how 34 different LLMs performed. Their findings indicate that while LLM rankings remain relatively consistent, their absolute effectiveness scores significantly decline when faced with reworded questions, suggesting a lack of robustness to linguistic variability. The study highlights that current benchmark evaluations may overstate LLM generalization abilities and advocates for more robustness-aware evaluation methodologies that better reflect real-world language us...
2025-09-09
17 min
AI: post transformers
Behavioral Fingerprinting of Large Language Models
This September 2025 paper introduces "Behavioral Fingerprinting," a novel framework designed to evaluate Large Language Models (LLMs) beyond traditional performance scores like MMLU. It aims to understand how models "think," creating a multi-faceted profile of their intrinsic cognitive and interactive styles. The methodology employs a diagnostic prompt suite and an automated evaluation pipeline where a powerful LLM acts as a judge, analyzing eighteen different models across four key dimensions: internal world model, reasoning abilities, biases and personality (including sycophancy), and semantic robustness. Findings indicate a convergence in core reasoning abilities
2025-09-09
15 min
AI: post transformers
Offloading LLM Models and KV Caches to NVMe SSDs
This March 2025 paper examines the input/output (I/O) characteristics of offloading large language model (LLM) components to NVMe SSDs during inference, a critical solution for overcoming GPU memory limitations with ever-growing LLMs. Researchers analyzed block-layer I/O traces from two prominent LLM frameworks, DeepSpeed and FlexGen, to understand how model weights and key-value (KV) caches are handled. The findings indicate that asynchronous I/O using libaio significantly outperforms POSIX for tensor transfers, although neither method fully saturates the NVMe SSD's theoretical bandwidth. For model offloading, I/O is predominantly characterized by 128KiB...
2025-09-08
17 min
AI: post transformers
GPT-NeoX: Large-Scale Autoregressive Language Modeling in PyTorch
This source describes EleutherAI's GPT-NeoX library, a robust open-source framework for training large-scale autoregressive language models on GPUs, building upon the Megatron and DeepSpeed libraries. It highlights the library's advanced features like distributed training, support for various hardware and systems, and cutting-edge architectural innovations. The text also provides practical guidance on setup, configuration, data preparation, training, inference, and evaluation, alongside details on pretrained models like GPT-NeoX-20B and Pythia. Furthermore, it details how to export models to Hugging Face and monitor experiments, underscoring its widespread adoption in research and indus...
2025-09-07
12 min
AI: post transformers
SGLang: Efficient Language Model Program Execution
This June 2024 paper introduces SGLang, a framework designed to enhance the efficiency of Large Language Model (LLM) and Vision Language Model (VLM) serving. It achieves this through a co-design of a flexible frontend language and a fast backend runtime. The frontend simplifies programming with primitives for generation and parallelism, while the backend utilizes novel optimizations like RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding. These innovations allow SGLang to significantly improve throughput and reduce latency compared to existing systems across various LLM applications and hardw...
2025-09-07
17 min
AI: post transformers
Eleuther: Evaluating LLMs
These sources collectively explore various approaches to evaluating and improving Large Language Models (LLMs). Several papers introduce new benchmark datasets designed to test LLMs on complex reasoning tasks, such as the "BIG-Bench Hard (BBH)" suite, the graduate-level "GPQA" questions in science, and "MuSR" for multistep soft reasoning in natural language narratives. A key technique discussed across these sources is Chain-of-Thought (CoT) prompting, which encourages LLMs to show their step-by-step reasoning, leading to improved performance, often surpassing human-rater averages on challenging tasks. Additionally, the "Instruction-Following Eval (IFEval)" introduces a reproducible benchmark for verifiable instructions, allowing for objective assessment of an...
2025-09-07
26 min
AI: post transformers
OpenELM: Apple's Open Language Model Family
These May 2024 sources center on CoreNet, an Apple-developed library for training deep neural networks, and OpenELM, an efficient language model family built using CoreNet. CoreNet is a versatile toolkit supporting various tasks, including foundation models like large language models (LLMs), object classification, and semantic segmentation, with its development evolving from the earlier CVNets. A key innovation highlighted is OpenELM's layer-wise scaling strategy, which optimizes parameter allocation within transformer models to achieve superior accuracy with fewer pre-training tokens compared to other open LLMs. The resources emphasize reproducibility and transparency by providing comprehensive frameworks for OpenELM's training and evaluation, including...
2025-09-07
15 min
AI: post transformers
FineVision: Open Data for Computer Vision
These September 2025 posts describe HuggingFaceM4/FineVision, a large dataset designed for image and text modalities. It is substantial in size, falling in the 10M to 100M samples range, and is available in the Parquet format. This dataset includes various ratings, such as relevance, visual dependency, image correspondence, and formatting, indicating its use in evaluating the quality and relationship between visual and textual content. The examples provided demonstrate that FineVision contains question-and-answer pairs related to diverse charts and diagrams, covering topics like population trends, genetic diseases, software update frequencies, and demographic distributions, suggesting its a...
2025-09-07
15 min
AI: post transformers
Evaluating Large Language Models Trained on Code
This July 2021 paper documents the development and evaluation of OpenAI's Codex models, which are large language models specialized in code generation, particularly Python functions from docstrings. They introduce HumanEval, a hand-written dataset designed to assess the functional correctness of generated code through unit tests, a more robust metric than traditional match-based scores like BLEU. The papers compare the performance of various Codex iterations, including supervised fine-tuned versions (Codex-S), against other models like GPT-3, demonstrating significant improvements in pass rates with increased model size and sample generation. Furthermore, the texts explore the limitations, broa...
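Functional-correctness evaluation of this kind is usually reported as pass@k, estimated without bias from n samples per problem of which c pass the tests. The numerically stable product form below follows the published estimator; the example numbers are arbitrary.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of 1 - C(n - c, k) / C(n, k), computed as a running product."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(n=200, c=10, k=1))    # 0.05 when 10 of 200 samples pass
print(pass_at_k(n=200, c=10, k=10))   # noticeably higher (~0.41)
```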
2025-09-07
16 min
AI: post transformers
Democratizing AI Compute: The Modular Vision
This blog post series from Chris Lattner extensively examines CUDA's pervasive dominance in AI compute, detailing its evolution from a graphics processor to a layered software platform integral to NVIDIA's success, while also highlighting the challenges and complexities it presents to developers and alternative hardware vendors. The articles critically assess various attempts to democratize AI compute, including OpenCL, TVM, XLA, and MLIR, explaining why these alternatives largely failed to dislodge CUDA due to fragmentation, misaligned incentives, and a lack of unified vision. Ultimately, the texts introduce Modular's approach to addressing these issues through its Mojo language, MAX framework, and...
2025-09-07
1h 11
AI: post transformers
Limitations of Embedding-Based Retrieval
This August 2025 paper from Google DeepMind, titled "On the Theoretical Limitations of Embedding-Based Retrieval," explores the fundamental constraints of vector embedding models in information retrieval. The authors demonstrate that the number of relevant document combinations an embedding can represent is inherently limited by its dimension. Through empirical "free embedding" experiments and the introduction of a new dataset called LIMIT, they show that even state-of-the-art models struggle with simple queries designed to stress these theoretical boundaries. The research concludes that for complex, instruction-following queries, alternative...
2025-09-06
15 min
AI: post transformers
SAIR: Accelerating Pharma R&D with AI-Powered Structural Intelligence
This September 2025 paper describes SAIR, the Structurally Augmented IC50 Repository, a groundbreaking open-source dataset developed by SandboxAQ in collaboration with NVIDIA. SAIR is the largest publicly available collection of over 5 million AI-generated 3D protein-ligand structures, each linked with experimentally measured drug potency data (IC₅₀ values). This dataset aims to bridge a critical data gap in AI-powered drug discovery by providing comprehensive structural intelligence, thereby enabling researchers to accelerate R&D, explore novel drug targets, and improve the accuracy of AI models for predicting drug properties. The creation of SAIR involved extensive high-performance computing, taking o...
2025-09-06
18 min
AI: post transformers
EmbeddingGemma: On-Device AI for High-Quality Embeddings
This document announces EmbeddingGemma, a new open embedding model from Google, specifically designed for on-device artificial intelligence (AI). It highlights the model's efficiency, compact size, and best-in-class performance for its category, particularly in multilingual text embedding. The source explains how EmbeddingGemma enables mobile-first Retrieval Augmented Generation (RAG) pipelines and semantic search by generating high-quality text embeddings directly on user hardware, ensuring privacy and offline functionality. It also details the model's compatibility with popular development tools and its ability to offer flexib...
2025-09-05
14 min
AI: post transformers
MTEB & MMTEB: The Massive Text Embedding Benchmark
These academic papers introduce and detail the Massive Multilingual Text Embedding Benchmark (MMTEB), a comprehensive evaluation framework for text embedding models. The MMTEB expands upon existing benchmarks by offering over 500 tasks across 250+ languages and various domains, significantly increasing the diversity and scale of evaluation. It incorporates optimizations like downsampling and caching to reduce computational costs, making the benchmark more accessible, especially for low-resource languages. The papers also evaluate various models, including large language models (LLMs) and smaller, multilingual models, revealing that instruction-tuned models often perform better, and smaller models can surprisingly outperform larger LLMs in highly multilingual or low-resource...
2025-09-05
16 min
AI: post transformers
DeepResearch Arena: Benchmarking LLMs' Research Abilities
This September 2025 paper introduces DeepResearch Arena, a novel benchmark designed to evaluate the research capabilities of large language models (LLMs) by mirroring real-world academic inquiry. This benchmark addresses limitations of existing evaluation methods, which often suffer from data leakage or lack authenticity, by grounding its tasks in academic seminars and expert discourse. A Multi-Agent Hierarchical Task Generation (MAHTG) system is utilized to automatically generate over 10,000 diverse research tasks across multiple disciplines, covering phases from synthesis to evaluation. The paper also proposes a hybrid evaluation framework that combines Keypoint-Aligned Evaluation (KAE) for factual correctness and Adaptively-generated Checklist Evaluation (ACE) for...
2025-09-05
16 min
AI: post transformers
Inverse IFEval: Unlearning LLM Cognitive Inertia
This September 2025 paper introduces Inverse IFEval, a novel benchmark designed to evaluate Large Language Models (LLMs) for their Counter-intuitive Ability. This refers to an LLM's capacity to override its ingrained training patterns and comply with instructions that conflict with conventional norms or standardized formats. The benchmark includes eight distinct categories of such challenging instructions, like "Code without Comments" or "Deliberately Incorrect Answers," to expose the cognitive inertia and overfitting that current LLMs exhibit. The study underscores the need for future LLM development to prioritize adaptability...
2025-09-05
19 min
AI: post transformers
The Rise of Physical Neural Networks
This June 2024 paper examines the current state and future potential of Physical Neural Networks (PNNs), which are AI systems implemented directly in physical hardware rather than purely digital software. It explores various training methodologies for PNNs, including in-silico (digital simulation), in-situ (real-world hardware training), and hybrid approaches like physics-aware training, each with its own advantages and limitations regarding accuracy, speed, cost, and complexity. The text also discusses alternative training paradigms such as Feedback Alignment, Local Learning, and gradient-free methods that aim to overcome challenges associated with traditional backpropagation in physical systems. Furthermore, it hi...
2025-09-04
20 min
AI: post transformers
FastVLM: Efficient Vision Encoding for Language Models
This May 2025 paper introduces FastVLM, an innovative approach designed to enhance the efficiency of Vision Language Models (VLMs). The authors explain that while increasing image resolution is crucial for VLM performance, traditional visual encoders become inefficient. FastVLM addresses this by incorporating FastViTHD, a novel hybrid vision encoder that reduces both the number of visual tokens and encoding time for high-resolution images. This optimization, achieved solely through input image scaling, leads to a significant 3.2x improvement in time-to-first-token (TTFT) while maintaining strong performance on VLM benchmarks, making it a more efficient solution compared to prior methods. The paper, submitted to...
2025-09-04
12 min
AI: post transformers
Apertus Tech Report Overview
This paper introduces Apertus, a large language model developed by the Swiss AI Initiative, a partnership between ETH Zurich and EPFL. The GitHub repository appears to host technical documentation or code related to Apertus, while the Hugging Face page provides a comprehensive overview of the Apertus-8B-Instruct-2509 model. This model is highlighted for being fully open, massively multilingual (supporting over 1800 languages), and compliant with data privacy regulations, even incorporating mechanisms for data protection and copyright requests. The Hugging Face page also outlines the model's technical...
2025-09-04
13 min
AI: post transformers
Supervised Learning in DNA Neural Networks
This September 2025 Nature article, authored by Kevin M. Cherry and Lulu Qian, introduces a novel DNA-based neural network capable of supervised learning in vitro. The authors demonstrate how DNA molecules can be programmed to autonomously classify patterns from molecular examples. This system integrates training data directly into molecular memories and uses these memories for subsequent classification, moving beyond previous systems that relied on in silico learning. The work highlights the potential of molecular circuits to perform complex information processing, opening doors for adaptive decision-making in various physical systems, fr...
2025-09-04
18 min
AI: post transformers
FusionANNS: Billion-Scale ANNS with SSD and GPU
This September 2024 paper introduces FusionANNS, a novel system designed to improve Approximate Nearest Neighbor Search (ANNS) for extremely large datasets. It addresses challenges in existing ANNS systems, such as performance bottlenecks, high operational costs, and accuracy limitations, particularly when dealing with billion-scale vector data in modern AI infrastructure like Large Language Models (LLMs). FusionANNS achieves this through a cooperative CPU/GPU architecture that employs multi-tiered indexing, heuristic re-ranking, and redundancy-aware I/O deduplication. The system is shown to significantly outperform state-of-the-art SSD-based and GPU-accelerated in-memory ANNS solutions in terms of throughput (QPS), cost efficiency, and memory efficiency, while maintaining low latency and high...
2025-09-04
26 min
AI: post transformers
rStar2-Agent: Smarter Math Reasoning Through Agentic RL
This August 2025 paper introduces rStar2-Agent, a 14B math reasoning model developed by Microsoft Research that achieves state-of-the-art performance comparable to much larger models by employing agentic reinforcement learning. The model is trained to "think smarter" through three key innovations: an efficient RL infrastructure that manages high-throughput code execution, a novel GRPO-RoC algorithm for effective reasoning in a noisy code environment by filtering high-quality trajectories, and an efficient training recipe that minimizes computational cost. Demonstrating superior accuracy on challenging math benchmarks like AIME24/25, rStar2-Agent-14B also exhib...
2025-09-03
19 min
AI: post transformers
Scientific LLMs: A Data-Centric Survey and Roadmap
This August 2025 paper offers an extensive overview of the evolution and application of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) within scientific research, primarily focusing on the period from 2018 to 2025. It details how these AI models have progressed through various paradigm shifts, from initial transfer learning to sophisticated scientific agents capable of autonomous research. The document thoroughly examines the diverse data modalities—including visual spectra, microscopy images, molecular encodings, and time-series data—across six key scientific domains: Chemistry, Materials Science, Physics, Life Sciences, Astronomy, and Earth Science. Furthe...
2025-09-03
19 min
AI: post transformers
Pimba: Processing-in-Memory for LLM Serving
This August 2025 paper introduces Pimba, a novel Processing-in-Memory (PIM) accelerator designed to enhance the efficiency of Large Language Model (LLM) serving for both traditional transformer-based models and emerging post-transformer architectures. The authors highlight that memory bandwidth is a critical bottleneck for both types of LLMs, specifically during attention operations in transformers and state updates in post-transformers. Pimba addresses this by integrating PIM technology with LLM quantization, using a State-update Processing Unit (SPU) shared between memory banks to maximize hardware resource sharing and area efficiency. The system employs MX-based quantized arithmetic within its State-update Processing Engine (SPE), which is identified as a Pareto-optimal choic...
2025-08-27
27 min
AI: post transformers
Oaken: Fast, Efficient LLM Serving with Hybrid KV Cache Quantization
This August 2025 paper introduces Oaken, a novel acceleration solution for serving Large Language Models (LLMs) that addresses the significant challenges of memory bandwidth and capacity bottlenecks inherent in batched LLM inference. Oaken achieves this through a co-designed algorithm and hardware architecture, featuring an online-offline hybrid KV cache quantization technique. This technique efficiently reduces the memory footprint and access requirements of the Key-Value (KV) cache by categorizing data into "inliers" and "outliers" using offline threshold profiling and applying group-shift quantization. Furthermore, Oaken integrates custom quantization/dequantization engines and memory management units into LLM accelerators to translate algorithmic gains into tangible performance...
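The sketch below shows the general inlier/outlier split in its simplest form: values outside offline-profiled thresholds are kept in full precision as a sparse side channel, while the rest are uniformly quantized to 4 bits. It is an assumption-laden stand-in, not Oaken's group-shift scheme or its hardware quantization engines, and the thresholds here are just sample quantiles.

```python
# Illustrative threshold-based inlier/outlier quantization (not Oaken's exact scheme).
import numpy as np

def quantize_with_outliers(x, lo, hi, bits=4):
    outlier_mask = (x < lo) | (x > hi)
    scale = (hi - lo) / (2 ** bits - 1)
    q = np.round((np.clip(x, lo, hi) - lo) / scale).astype(np.uint8)   # low-bit inliers
    return q, scale, lo, outlier_mask, x[outlier_mask]                 # outliers kept in full precision

def dequantize(q, scale, zero, outlier_mask, outlier_vals):
    x = q.astype(np.float32) * scale + zero
    x[outlier_mask] = outlier_vals
    return x

x = np.random.randn(4096).astype(np.float32)
lo, hi = np.quantile(x, 0.01), np.quantile(x, 0.99)    # stand-in for offline-profiled thresholds
packed = quantize_with_outliers(x, lo, hi)
print(float(np.abs(dequantize(*packed) - x).max()))    # reconstruction error on the inlier range
```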
2025-08-27
19 min
AI: post transformers
AdamW: Decoupled Weight Decay Regularization for Adaptive Gradient Algorithms
This January 2019 academic paper addresses the common issue of poor generalization in adaptive gradient optimization methods like Adam, compared to traditional Stochastic Gradient Descent (SGD) with momentum. The authors demonstrate that L2 regularization and weight decay are not equivalent for adaptive optimizers, unlike for standard SGD, leading to suboptimal performance in Adam. They propose a simple modification called "decoupled weight decay" (AdamW), which separates the weight decay step from the gradient-based updates. Empirical evidence shows that AdamW significantly improves Adam's generalization performance on image classification tasks and simplifies hyperparameter tuning by decoupling the learning rate and weight decay factors. Furth...
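The difference is easiest to see in the update rule itself. Below is a minimal re-implementation sketch (not the paper's reference code): with plain L2 regularization the decay term is added to the gradient and therefore rescaled by Adam's adaptive statistics, whereas AdamW subtracts the decay directly from the weights. Hyperparameters are illustrative.

```python
# Minimal sketch of Adam with L2 regularization vs. AdamW's decoupled weight decay.
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              weight_decay=0.01, decoupled=True):
    if not decoupled:
        g = g + weight_decay * w          # L2: decay enters the adaptive statistics
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        w = w - lr * weight_decay * w     # AdamW: decay applied directly to the weights
    return w, m, v

w, m, v = np.ones(4), np.zeros(4), np.zeros(4)
for t in range(1, 6):
    g = 0.1 * w                            # toy gradient
    w, m, v = adam_step(w, g, m, v, t)
print(w)
```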
2025-08-27
46 min
AI: post transformers
Training Recurrent Neural Networks: Vanishing and Exploding Gradients
This November 2012 paper addresses the inherent challenges in training Recurrent Neural Networks (RNNs), specifically the vanishing and exploding gradient problems. The authors explore these issues from analytical, geometrical, and dynamical systems perspectives, building upon previous work. They propose and empirically validate a gradient norm clipping strategy to combat exploding gradients and a soft regularization constraint to mitigate vanishing gradients. The research demonstrates that these solutions significantly improve RNN performance on both synthetic pathological tasks requiring long-term memory and natural language processing and music prediction problems. Source: https://arxiv.org/pdf/1211.5063
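A minimal sketch of the clipping strategy, with an illustrative threshold: if the global gradient norm exceeds the threshold, all gradients are rescaled so that the norm equals it.

```python
# Global gradient-norm clipping: rescale all gradients when their combined norm explodes.
import numpy as np

def clip_by_global_norm(grads, threshold=1.0):
    total_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

grads = [np.random.randn(3, 3) * 50, np.random.randn(3)]    # artificially large gradients
clipped, norm = clip_by_global_norm(grads)
print(norm, np.sqrt(sum(np.sum(g * g) for g in clipped)))   # second value is at most the threshold
```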
2025-08-27
20 min
AI: post transformers
Adafactor: Memory-Efficient Adaptive Learning Rates
This April 2018 paper introduces Adafactor, a novel optimization method designed to reduce the memory footprint of adaptive learning rate algorithms like Adam, particularly for large neural networks. Adafactor achieves this by estimating per-parameter second moments using factored representations, specifically maintaining only row and column sums for weight matrices, thereby reducing memory requirements from O(nm) to O(n+m). The paper also addresses training instability in adaptive methods, proposing update clipping and a gradually increasing decay rate scheme for the second-moment accumulator as solutions. Furthermore, Adafactor suggests scaling parameter updates based on the parameters' own magnitudes rather than absolute step sizes...
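The memory saving follows from storing only row and column statistics of the squared gradients. The sketch below shows that factored second-moment estimate in isolation; the paper's update clipping and increasing-decay-rate schedule are omitted, and all hyperparameters are illustrative.

```python
# Factored second-moment estimate for a weight matrix: keep EMAs of the row sums (R)
# and column sums (C) of grad**2, and reconstruct a rank-1 approximation V ~ R C / sum(R).
import numpy as np

rng = np.random.default_rng(0)
n, m = 256, 128
W = np.zeros((n, m))
R = np.zeros(n)                                    # O(n) row statistics
C = np.zeros(m)                                    # O(m) column statistics
lr, beta2, eps = 1e-2, 0.999, 1e-30

for step in range(1, 101):
    G = rng.standard_normal((n, m)) * 0.1          # toy gradient
    sq = G * G + eps
    R = beta2 * R + (1 - beta2) * sq.sum(axis=1)
    C = beta2 * C + (1 - beta2) * sq.sum(axis=0)
    # Rank-1 reconstruction of the second moment; only R and C are ever stored.
    rms = np.sqrt(np.outer(R, C) / R.sum())
    W -= lr * G / rms

print(W.shape, R.shape, C.shape)                   # optimizer state is O(n + m), not O(n * m)
```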
2025-08-27
17 min
AI: post transformers
SPAM: Stabilizing LLM Training with Spike-Aware Optimization
This February 2025 research addresses the critical issue of training instability in Large Language Models (LLMs), which often stems from sudden, massive "gradient spikes" that can be thousands of times larger than typical gradients. The authors introduce Spike-Aware Adam with Momentum Reset (SPAM), a novel optimizer designed to counteract these spikes through periodic momentum resets and spike-aware gradient clipping, which scales down rather than zeroes out large gradients. Experiments demonstrate that SPAM consistently outperforms existing optimizers like Adam and Adafactor across various LLM sizes during both pre-training and fine-tuning. Furthermore, SPAM offers a memory-efficient version leveraging sparse momentum, enabling better performance...
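A rough sketch of the two mechanisms named above, with illustrative thresholds and intervals (not the authors' code): gradient entries that spike far above their running second-moment estimate are shrunk rather than zeroed, and the moment buffers are reset periodically.

```python
# Toy Adam loop with spike-aware clipping and periodic momentum reset.
import numpy as np

def spike_aware_clip(g, v_hat, theta=50.0):
    # Entries whose magnitude far exceeds their running RMS are shrunk, not zeroed.
    limit = np.sqrt(theta * v_hat)
    spikes = (v_hat > 0) & (np.abs(g) > limit)
    return np.where(spikes, np.sign(g) * limit, g)

rng = np.random.default_rng(0)
w, m, v = np.zeros(8), np.zeros(8), np.zeros(8)
b1, b2, lr = 0.9, 0.999, 1e-3
reset_every, steps = 500, 0                          # steps since the last moment reset

for t in range(1, 2001):
    g = rng.standard_normal(8) * 0.01
    if t == 1234:
        g[0] = 100.0                                 # simulate a rare, massive gradient spike
    if steps > 0:
        g = spike_aware_clip(g, v / (1 - b2 ** steps))
    if t % reset_every == 0:
        m[:] = 0.0
        v[:] = 0.0                                   # periodic momentum reset
        steps = 0
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    steps += 1
    w -= lr * m / (np.sqrt(v) + 1e-8)

print(w)
```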
2025-08-27
17 min
AI: post transformers
Google: Measuring AI's Environmental Impact at Scale
This August 2025 paper presents Google's comprehensive methodology for measuring the environmental impact of AI inference workloads in a large-scale production environment. It addresses a critical gap in existing research by accounting for the full stack of AI serving infrastructure, including active AI accelerator power, host system energy, idle machine capacity, and data center overhead. The paper reveals that a median Gemini Apps text prompt consumes significantly less energy, carbon emissions, and water than many prior public estimates. Furthermore, it highlights Google's efforts in software efficiency and clean energy procurement, which ha...
2025-08-26
17 min
AI: post transformers
ComoRAG: Cognitively Inspired Narrative Reasoning
This August 2025 paper introduces ComoRAG, a novel framework designed to enhance long-context narrative comprehension in Large Language Models (LLMs) by simulating human metacognitive regulation. It addresses the limitations of existing Retrieval-Augmented Generation (RAG) methods, which struggle with stateful reasoning and integrating contradictory evidence over extended narratives. ComoRAG employs a dynamic cognitive loop that includes a hierarchical knowledge source (veridical, semantic, and episodic layers) and a dynamic memory workspace to continuously acquire new evidence and consolidate knowledge. Experimental results demonstrate ComoRAG's superior performance, particularly in solving com...
2025-08-26
13 min
AI: post transformers
Quantizing Diffusion LLMs: A Systematic Study
This August 2025 academic paper explores the application of post-training quantization (PTQ) to diffusion large language models (dLLMs), a promising alternative to traditional autoregressive LLMs for natural language generation. The authors conduct a systematic study to understand how existing PTQ techniques, commonly used for compressing AR LLMs, perform with dLLMs. A key finding is the prevalence of activation outliers in dLLMs, which pose a significant challenge for low-bit quantization. The research also evaluates the effectiveness of various quantization methods, bit-widths, task types, and model variants, concluding that 4-bit quantization is optimal for weight-only methods like GPTQ, while 8-bit is tolerable...
2025-08-26
24 min
AI: post transformers
ODYSSEY: Unified Mobile Manipulation for Agile Quadruped Robots
This August 2025 paper introduces ODYSSEY, a comprehensive framework for open-world mobile manipulation that integrates robotic mobility, manipulation, and real-time perception. It highlights a novel approach that uses large language models for high-level task planning and vision-language models for fine-grained action guidance, enabling robots to adaptively interact in complex environments. A significant contribution is the first comprehensive benchmark for long-horizon mobile manipulation, featuring diverse daily tasks in both indoor and outdoor settings to thoroughly evaluate embodied reasoning, planning, navigation, and manipulation capabilities. The system demonstrates strong sim-to-real transfer performan...
2025-08-26
21 min
AI: post transformers
GPT-5 Spatial Intelligence: An Empirical Study
This August 2025 academic paper, titled "Has GPT-5 Achieved Spatial Intelligence? An Empirical Study," examines the spatial understanding and reasoning capabilities of advanced multi-modal AI models, including the recently released GPT-5. The authors propose a new taxonomy for spatial tasks and evaluate both proprietary and open-source models against eight key benchmarks, utilizing over a billion tokens for their study. Their findings indicate that while GPT-5 shows unprecedented strength in spatial intelligence, it still falls short of human performance across a broad range of tasks. The research also identifies specific challeng...
2025-08-24
18 min
AI: post transformers
DeepSeek-V3.1: A Hybrid AI Model with Enhanced Reasoning
This is a review of DeepSeek's latest release announced on Hugging Face on August 21, 2025. The source introduces DeepSeek-V3.1, a hybrid large language model that supports both "thinking" and "non-thinking" operational modes, distinguishable through different chat templates. This updated model offers smarter tool calling capabilities and improved thinking efficiency, providing faster responses with comparable answer quality to previous versions. Built upon a two-phase long context extension, DeepSeek-V3.1 has expanded its training dataset significantly to enhance its understanding and generation of longer documents. The document also provides detailed chat templates for various interaction types, including multi-turn conversations and tool-calling scenarios for...
2025-08-23
13 min
AI: post transformers
Compressed Experts: Efficient MoE Model Editing
This March 2025 paper introduces compressed experts, an innovative method to enhance the efficiency of Mixture-of-Experts (MoE) models by reducing computational overhead while preserving performance. The core idea involves replacing less critical "auxiliary experts" with lightweight, compact representations, called compressed experts, during fine-tuning. This strategy allows for a significant reduction in activated parameters and inference costs—over 30% and 20% respectively, as demonstrated on models like Phi-MoE and OLMoE—while retaining more than 90% of the full model's performance. The paper details the method of identifying and aggregating these compressed experts and highlights their part...
2025-08-23
21 min
AI: post transformers
Genie 3: A New Frontier for World Models
The source provides an overview of Google DeepMind's AI research and models, highlighting various applications across different scientific disciplines and creative fields. It introduces Genie 3, a general-purpose world model capable of generating diverse, interactive, real-time environments from text prompts. The document details Genie 3's capabilities, such as simulating physical properties, natural worlds, and fictional scenarios, while also addressing its limitations and the company's commitment to responsible AI development. Ultimately, the text positions Genie 3 as a significant advancement for AI research and generative media, with potential for education, training, and embodied agent development. Source:
2025-08-23
16 min
AI: post transformers
Los Alamos: overcoming the memory wall fighting sparse memory access
We review Los Alamos National Laboratory's advancements in managing indirect memory accesses in high-performance computing and their relationship to overcoming the memory wall. The first goal of DoE’s next-generation supercomputer, ATS-5, is “Overcoming the memory wall: continued memory bandwidth performance improvements for tri-lab applications.” "DX100" introduces a programmable data access accelerator designed to improve memory bandwidth utilization for irregular applications by reordering, coalescing, and interleaving memory requests. This accelerator aims to offload bulk indirect memory operations from CPU cores, thus reducing instruction count and cache misses. Complementing this, "A Workflow for the Sy...
2025-08-21
29 min
AI: post transformers
Switch Transformers: Trillion Parameter Models with Sparsity
This June 2022 paper introduces Switch Transformers, a novel architecture designed to enhance the efficiency and scalability of large-scale language models. Unlike traditional models that reuse the same parameters, Switch Transformers employ a Mixture-of-Experts (MoE) approach, activating different parameters for each input to achieve a sparsely-activated model with significantly more parameters at a constant computational cost. The authors simplify the MoE routing algorithm and implement improved training techniques to overcome prior limitations such as complexity, communication overhead, and instability. The paper demonstrates that Switch Transformers achieve substantial pre-training speedups and performance gains across various natural language tasks, including multilingual settings...
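The simplified routing can be sketched in a few lines: a softmax router picks a single expert per token (top-1), and that expert's output is scaled by the router probability. The load-balancing loss, capacity factor, and distributed execution from the paper are omitted, and all shapes are illustrative.

```python
# Minimal sketch of Switch-style top-1 expert routing.
import numpy as np

def switch_layer(x, router_w, experts):
    # x: [tokens, d_model]; router_w: [d_model, n_experts]
    logits = x @ router_w
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    expert_idx = probs.argmax(axis=-1)                 # top-1: a single expert per token
    gate = probs[np.arange(len(x)), expert_idx]

    out = np.zeros_like(x)
    for e, (w_in, w_out) in enumerate(experts):
        mask = expert_idx == e
        if mask.any():                                 # only the chosen expert runs for these tokens
            h = np.maximum(x[mask] @ w_in, 0.0)
            out[mask] = gate[mask][:, None] * (h @ w_out)
    return out

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, tokens = 16, 64, 4, 32
experts = [(rng.standard_normal((d_model, d_ff)) * 0.1,
            rng.standard_normal((d_ff, d_model)) * 0.1) for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.1
print(switch_layer(rng.standard_normal((tokens, d_model)), router_w, experts).shape)
```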
2025-08-20
18 min
AI: post transformers
Linear Transformers: Faster Than RNNs
This August 2020 paper introduces linear transformers, a novel approach to addressing the computational and memory inefficiencies of traditional transformer models, particularly for long sequences. By reframing the self-attention mechanism using a linear dot-product of kernel feature maps, the authors reduce the computational complexity from quadratic to linear, enabling significantly faster autoregressive inference. The research highlights the relationship between transformers and recurrent neural networks (RNNs), demonstrating that a causally masked transformer can be expressed as an RNN, thus allowing for constant time and memory per predictio...
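In the causal case the reformulation reduces to two running sums, which is exactly the RNN view: constant state, one update per token. Below is a minimal sketch using the elu(x)+1 feature map from the paper; shapes and sizes are illustrative.

```python
# Causal linear attention: keep running sums S and z instead of the full T x T attention matrix.
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))      # elu(x) + 1 feature map

def causal_linear_attention(Q, K, V, eps=1e-6):
    # Q, K: [T, d_k]; V: [T, d_v]
    d_k, d_v = Q.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))                        # running sum of phi(k) v^T
    z = np.zeros(d_k)                               # running sum of phi(k)
    out = np.zeros_like(V)
    for t in range(len(Q)):
        q, k, v = phi(Q[t]), phi(K[t]), V[t]
        S += np.outer(k, v)
        z += k
        out[t] = (q @ S) / (q @ z + eps)            # numerator / normalizer of softmax-free attention
    return out

rng = np.random.default_rng(0)
T, d = 10, 8
print(causal_linear_attention(rng.standard_normal((T, d)),
                              rng.standard_normal((T, d)),
                              rng.standard_normal((T, d))).shape)
```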
2025-08-20
14 min
AI: post transformers
Speed Always Wins: Efficient Large Language Model Architectures
This August 2025 survey paper explores efficient architectures for large language models (LLMs), addressing the computational challenges of models like Transformers. It categorizes advancements into linear sequence modeling, including linear attention and state-space models, which offer linear computational complexity. The document also examines sparse sequence modeling, such as static and dynamic sparse attention, designed to reduce computational demands by limiting interactions between elements. Furthermore, it discusses methods for efficient full attention, including IO-aware and grouped attention, and introduces sparse Mixture-of-Experts (MoE) models, which enhance efficiency through conditional compu...
2025-08-20
26 min
AI: post transformers
Atom: Low-Bit Quantization for LLM Serving
This April 2024 paper introduces Atom, a novel low-bit quantization method designed to enhance the efficiency and accuracy of Large Language Model (LLM) serving. The core challenge addressed is the high computational and memory costs associated with LLMs, especially when accommodating numerous user requests. Atom tackles this by quantizing both weights and activations to low-bit representations, like 4-bit, which significantly reduces memory consumption and boosts throughput by leveraging modern GPU capabilities. It maintains accuracy through mixed-precision quantization, fine-grained group quantization, and dynamic quantization, demonstrating substantial improvements in tokens per second with n...
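One of the listed ingredients, fine-grained group quantization, can be sketched simply: each small group of values gets its own scale, which keeps per-group quantization error low. Atom's mixed-precision outlier handling and fused GPU kernels are not shown, and the group size below is illustrative.

```python
# Symmetric 4-bit group quantization: one scale per group of 128 values.
import numpy as np

def quantize_groups(x, group_size=128, bits=4):
    qmax = 2 ** (bits - 1) - 1                        # symmetric range, e.g. -8..7 for 4 bits
    groups = x.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax + 1e-12
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize_groups(q, scales, shape):
    return (q.astype(np.float32) * scales).reshape(shape)

x = np.random.randn(4096).astype(np.float32)
q, scales = quantize_groups(x)
error = np.abs(dequantize_groups(q, scales, x.shape) - x).mean()
print(q.dtype, scales.size, f"mean abs error ~ {error:.4f}")
```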
2025-08-19
17 min
AI: post transformers
Continuous Batching for LLM Inference: Throughput and Latency Gains
The source analyzes Large Language Model (LLM) inference, specifically focusing on how continuous batching significantly improves efficiency compared to traditional static batching. It explains the inefficiencies of static batching where GPUs are underutilized due to varying output lengths in a batch, and introduces continuous batching (also known as dynamic batching or iteration-level scheduling) as a solution that dynamically adds new requests as others complete. The document further highlights PagedAttention and vLLM as advanced memory optimization techniques built upon continuous batching, leading to even greater throu...
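A toy scheduler makes the difference concrete: after every decode iteration, finished sequences leave the batch and waiting requests are admitted immediately, rather than idling until the longest sequence in a static batch completes. Queue sizes and output lengths below are made up for illustration.

```python
# Toy iteration-level (continuous) batching loop.
from collections import deque
import random

random.seed(0)
waiting = deque({"id": i, "remaining": random.randint(1, 20)} for i in range(12))
running, max_batch, step = [], 4, 0

while waiting or running:
    # Admit new requests into free slots (this is the "continuous" part).
    while waiting and len(running) < max_batch:
        running.append(waiting.popleft())

    # One decode iteration: every running sequence emits one token.
    for req in running:
        req["remaining"] -= 1
    finished = [r["id"] for r in running if r["remaining"] == 0]
    running = [r for r in running if r["remaining"] > 0]
    step += 1
    if finished:
        print(f"step {step}: finished {finished}, batch now {len(running)}")

print(f"total decode steps: {step}")
```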
2025-08-19
39 min
AI: post transformers
Self-Search Reinforcement Learning for LLMs
This August 2025 paper introduces Self-Search Reinforcement Learning (SSRL), a novel method that enables Large Language Models (LLMs) to access and utilize their internal knowledge for search-driven tasks, bypassing the need for external search engines like Google or Bing. The research explores how repeated sampling can enhance an LLM's intrinsic search capabilities and investigates the impact of various prompting strategies and training methodologies, including the benefits of information masking and format-based rewards. The paper demonstrates that SSRL-trained models can effectively gen...
2025-08-18
13 min
AI: AX - introspection
GoldenMagikCarp
These two sources from LessWrong explore the phenomenon of "glitch tokens" within Large Language Models (LLMs) like GPT-2, GPT-3, and GPT-J. The authors, Jessica Rumbelow and mwatkins, detail how these unusual strings, often derived from web scraping of sources like Reddit or game logs, cause anomalous behaviors in the models, such as evasion, bizarre responses, or refusal to repeat the token. They hypothesize that these issues stem from the tokens being rarely or poorly represented in the models' training data, leading to unpredictable outcomes and non-deterministic responses, even at zero temperature. The second source provides further technical details and...
2025-08-09
16 min
AI: AX - introspection
Route Sparse Autoencoder to Interpret Large Language Models
This paper introduces Route Sparse Autoencoder (RouteSAE), a novel framework designed to improve the interpretability of large language models (LLMs) by effectively extracting features across multiple layers. Traditional sparse autoencoders (SAEs) primarily focus on single-layer activations, failing to capture how features evolve through different depths of an LLM. RouteSAE addresses this by incorporating a routing mechanism that dynamically assigns weights to activations from various layers, creating a unified feature space. This approach leads to a higher number of interpretable features and improved interpretability scores compared to previous methods li...
2025-08-09
12 min
AI: AX - introspection
HarmBench: Automated Red Teaming for LLM Safety
This paper introduces HarmBench, a new framework for evaluating the safety and robustness of large language models (LLMs) against malicious use. It highlights the growing concern over LLMs' potential for harm, such as generating malware or designing biological weapons, and emphasizes the need for automated red teaming—a process of identifying vulnerabilities—due to the scalability limitations of manual methods. HarmBench addresses the previous lack of standardized evaluation by offering a comprehensive benchmark with diverse harmful behaviors, including contextual and multimodal scenarios, and robust, comparable metrics for assessing attack success rates. The document also prese...
2025-08-09
22 min
AI: AX - introspection
Jailbreaking LLMs
A long list of papers and articles on jailbreaking LLMs is reviewed. These sources primarily explore methods for bypassing safety measures in Large Language Models (LLMs), often referred to as "jailbreaking," and proposed defense mechanisms. One key area of research involves "abliteration," a technique that directly modifies an LLM's internal activations to remove censorship without traditional fine-tuning. Another significant approach, "Speak Easy," enhances jailbreaking by decomposing harmful requests into smaller, multilingual sub-queries, significantly increasing the LLMs' susceptibility to generating undesirable content. Additionally, "Sugar-Coated Poison" investigates integrating benign content with adversarial reasoning to create effective...
2025-08-09
10 min
AI: AX - introspection
PA-LRP & absLRP
We focus on two evolutions of AX, both of which advance the explainability of deep neural networks, particularly Transformers, by improving Layer-Wise Relevance Propagation (LRP) methods. One source introduces Positional Attribution LRP (PA-LRP), a novel approach that addresses the oversight of positional encoding in prior LRP techniques, showing it significantly enhances the faithfulness of explanations in areas like natural language processing and computer vision. The other source proposes Relative Absolute Magnitude Layer-Wise Relevance Propagation (absLRP) to overcome issues with conflicting relevance values and varying activation magnitudes in existing LRP rules, demonstrating its superior performance in generating clear, contrastive, and...
2025-08-09
19 min
AI: AX - introspection
AttnLRP: Explainable AI for Transformers
This 2024 paper introduces AttnLRP, a novel method for explaining the internal reasoning of transformer models, including Large Language Models (LLMs) and Vision Transformers (ViTs). It extends Layer-wise Relevance Propagation (LRP) by introducing new rules for non-linear operations like softmax and matrix multiplication within attention layers, improving faithfulness and computational efficiency compared to existing methods. The paper highlights AttnLRP's ability to provide attributions for latent representations, enabling the identification and manipulation of "knowledge neurons" within these complex models. Experimental resul...
2025-08-09
16 min
AI: AX - introspection
Pixel-Wise Explanations for Non-Linear Classifier Decisions
This open-access research article from PLOS One introduces Layer-wise Relevance Propagation (LRP), a novel method for interpreting decisions made by complex, non-linear image classifiers. The authors, an international team of researchers, explain how LRP can decompose a classification decision down to the individual pixels of an input image, generating a heatmap that visualizes their contribution. This technique aims to make "black box" machine learning models, like neural networks and Bag of Words (BoW) models, more transparent by showing why a system arrives at a particular classification. The paper evaluates LRP on various datasets, including PASCAL VOC images and MNIST...
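For a single dense layer, the commonly used LRP-epsilon rule can be written in a few lines: relevance at each output neuron is redistributed to its inputs in proportion to their contributions z_ij = a_i * w_ij, so the total relevance is approximately conserved layer by layer. This is a generic sketch, not the exact rule set evaluated in the paper, and the layer sizes are illustrative.

```python
# LRP-epsilon relevance redistribution through one dense layer.
import numpy as np

def lrp_epsilon_dense(a, W, b, R_out, eps=1e-6):
    # a: [d_in], W: [d_in, d_out], b: [d_out], R_out: [d_out]
    z = a[:, None] * W                                 # per-connection contributions z_ij = a_i * w_ij
    denom = z.sum(axis=0) + b
    denom = denom + eps * np.sign(denom)               # epsilon stabiliser of the LRP-eps rule
    return (z * (R_out / denom)[None, :]).sum(axis=1)  # relevance pulled back onto the inputs

rng = np.random.default_rng(0)
a = rng.random(5)                                      # input activations
W = rng.standard_normal((5, 3))
b = np.zeros(3)                                        # zero bias so conservation is exact here;
                                                       # in general the bias absorbs some relevance
out = a @ W + b
R_in = lrp_epsilon_dense(a, W, b, R_out=out)           # seed relevance with the output activations
print(R_in, float(R_in.sum()), float(out.sum()))       # the two sums match: relevance is conserved
```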
2025-08-09
19 min
AI: AX - introspection
Multi-Layer Sparse Autoencoders for Transformer Interpretation
This paper introduces the Multi-Layer Sparse Autoencoder (MLSAE), a novel approach for interpreting the internal representations of transformer language models. Unlike traditional Sparse Autoencoders (SAEs) that analyze individual layers, MLSAEs are trained across all layers of a transformer's residual stream, enabling the study of information flow across layers. The research found that while individual "latents" (features learned by the SAE) tend to be active at a single layer for a given input, they are active at multiple layers when aggregated over many inputs, with this multi-layer activity increasing in larger models. The authors also explored the effect of "tuned-lens"...
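A hedged sketch of the core training idea follows: a single sparse autoencoder whose batches mix residual-stream activations drawn from every layer, so one latent dictionary is shared across depths. A plain ReLU-plus-L1 autoencoder trained on synthetic activations stands in here; the paper's exact architecture, sparsity penalty, and training setup may differ.

```python
# One sparse autoencoder trained on activations pooled from all layers.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_layers, lr, l1 = 32, 128, 6, 1e-2, 1e-3
W_enc = rng.standard_normal((d_model, d_latent)) * 0.1
W_dec = rng.standard_normal((d_latent, d_model)) * 0.1
b_enc = np.zeros(d_latent)

def fake_residual_stream(batch=64):
    # Stand-in for activations collected from every layer of a transformer.
    layer = rng.integers(0, n_layers, size=batch)
    return rng.standard_normal((batch, d_model)) + layer[:, None] * 0.1

for step in range(200):
    x = fake_residual_stream()
    h = np.maximum(x @ W_enc + b_enc, 0.0)             # sparse latents shared by all layers
    x_hat = h @ W_dec
    err = x_hat - x
    # Manual gradients for the reconstruction + L1 objective.
    dW_dec = h.T @ err / len(x)
    dh = (err @ W_dec.T + l1 * np.sign(h)) * (h > 0)
    dW_enc = x.T @ dh / len(x)
    b_enc -= lr * dh.mean(axis=0)
    W_dec -= lr * dW_dec
    W_enc -= lr * dW_enc

print(float((h > 0).mean()))                           # fraction of active latents (sparsity)
```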
2025-08-09
14 min