Showing episodes and shows of Mcgrof
Shows
AI: post transformers
ShadowKV: High-Throughput Long-Context LLM Inference
This April 2025 paper introduces ShadowKV, an innovative inference system for long-context Large Language Models (LLMs) designed to significantly enhance throughput and support larger batch sizes without compromising accuracy. It achieves this by strategically managing the Key-Value (KV) cache: specifically, it compresses the low-rank pre-Rotary Position Embedding (RoPE) key cache on the GPU and offloads the value cache to the CPU. ShadowKV further optimizes performance through an accurate KV selection strategy that reconstructs minimal sparse KV pairs on-the-fly, thus minimizing decoding latency. Em...
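The mechanism above lends itself to a small illustration. The sketch below assumes a toy single-head cache and an SVD-based low-rank factorization; the function names, the rank, and the top-k selection rule are illustrative choices, not ShadowKV's actual implementation.

```python
import torch

def compress_prerope_keys(K: torch.Tensor, rank: int = 32):
    """Factor the pre-RoPE key cache (seq_len, d_head) into low-rank pieces kept on GPU."""
    U, S, Vh = torch.linalg.svd(K, full_matrices=False)
    return U[:, :rank] * S[:rank], Vh[:rank, :]            # (seq_len, rank), (rank, d_head)

def decode_step(q, A, B, V_cpu, top_k: int = 64):
    """Score against the low-rank keys, then rebuild only the top-k KV pairs."""
    K_approx = A @ B                                        # cheap on-the-fly reconstruction
    scores = (K_approx @ q) / K_approx.shape[-1] ** 0.5
    idx = scores.topk(min(top_k, scores.numel())).indices
    V_sel = V_cpu[idx.to("cpu")].to(q.device)               # fetch only the values that are needed
    return torch.softmax(scores[idx], dim=-1) @ V_sel

seq_len, d = 4096, 128
K_prerope = torch.randn(seq_len, d)                         # would live on the GPU
V_cpu = torch.randn(seq_len, d)                             # value cache offloaded to CPU
A, B = compress_prerope_keys(K_prerope)
out = decode_step(torch.randn(d), A, B, V_cpu)              # (d,) attention output
```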
2025-09-17
18 min
AI: post transformers
TailorKV: Hybrid KV Cache Compression for LLMs
This May 2025 paper introduces TailorKV, a novel hybrid framework designed to optimize Key-Value (KV) cache management in large language models (LLMs) for long-context inference. It addresses challenges like high GPU memory consumption and inference latency that arise from the linear growth of KV cache size with sequence length. TailorKV categorizes Transformer layers into quantization-friendly and sparsity-friendly based on their attention patterns, applying 1-bit quantization to the former and dynamic retrieval of Top-K tokens from CPU memory for the latter. This tailored approach significantly reduces memory usage and decoding latency while maintaining model accuracy, enabling LLMs to operate efficiently on...
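As a rough illustration of the two paths described above, the sketch below applies 1-bit (sign plus scale) quantization on a "quantization-friendly" layer and Top-K token retrieval from CPU memory on a "sparsity-friendly" layer; the routing table and shapes are assumptions for illustration, not TailorKV's configuration.

```python
import torch

def quantize_1bit(x: torch.Tensor):
    """1-bit quantization per row: sign bits plus the mean absolute value as the scale."""
    scale = x.abs().mean(dim=-1, keepdim=True)
    return torch.sign(x), scale

def topk_retrieve(q, K_cpu, V_cpu, k: int = 32):
    """Sparsity-friendly path: score against CPU-resident keys, fetch only k tokens."""
    scores = K_cpu @ q.to("cpu")
    idx = scores.topk(min(k, scores.numel())).indices
    return K_cpu[idx].to(q.device), V_cpu[idx].to(q.device)

layer_kind = {0: "quant", 1: "sparse"}                      # assumed offline layer classification
K, V, q = torch.randn(1024, 128), torch.randn(1024, 128), torch.randn(128)
if layer_kind[0] == "quant":
    signs, scale = quantize_1bit(K)
    K_used = signs * scale                                  # dequantized on the fly
else:
    K_used, V = topk_retrieve(q, K, V)
```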
2025-09-17
18 min
AI: post transformers
MIRAGE: Optimizing LLM KV Cache with Parameter Remapping
This July 2025 paper discusses advanced memory optimization techniques for Large Language Models (LLMs), particularly focusing on KV cache management in multi-tenant serving environments. The primary subject, MIRAGE, introduces parameter remapping, a novel method that dynamically repurposes GPU memory allocated for model parameters to expand KV cache capacity, outperforming traditional CPU-offloading and KV cache swapping by reducing latency and increasing throughput. Complementary research highlights challenges in on-device LLM deployment and proposes solutions like quantization (AWQ) for model compression and two-level scheduling (FineServe, Nexu...
2025-09-17
20 min
AI: post transformers
WebSailor-V2: Bridging Proprietary Agents with Synthetic Data and RL
This September 2025 paper introduces WebSailor-V2, an open-source deep research agent developed by Alibaba Group's Tongyi Lab. The paper details a post-training pipeline that uses a novel synthetic data construction scheme, SailorFog-QA-V2, and a dual-environment reinforcement learning framework. WebSailor-V2, built on the Qwen3-30B-A3B model, demonstrates state-of-the-art performance among open-source agents and is competitive with leading proprietary systems on various web-agent benchmarks, including BrowseComp and Humanity's Last Exam. The authors emphasize that high-quality data and a stable training environment are more crucial than the specifi...
2025-09-17
19 min
AI: post transformers
Dynamic Chunking for Hierarchical Sequence Modeling
This July 2025 paper introduces Hierarchical Networks (H-Nets), a novel architecture designed to move beyond traditional tokenization in large language models by implementing dynamic chunking. This mechanism allows the model to automatically learn content- and context-dependent segmentation strategies directly from raw data, eliminating the need for predefined pre-processing steps like byte-pair encoding (BPE). H-Nets utilize a recursive, multi-stage structure that processes data at varying levels of abstraction, from bytes to more complex semantic units. Experiments demonstrate that H-Nets, particularly multi-stage configurations, outperform tokenized Transformers in perplexity, downstream tasks, and robustness to textual pertur...
2025-09-17
25 min
AI: post transformers
LoFT: Parameter-Efficient Fine-Tuning for Long-tailed Semi-Supervised Learning
This September 2025 paper introduces LoFT, a novel framework designed to improve Long-Tailed Semi-Supervised Learning (LTSSL) by leveraging parameter-efficient fine-tuning of pre-trained foundation models. The core idea is to enhance confidence calibration and generate more reliable pseudo-labels, which are crucial for addressing the imbalance inherent in long-tailed datasets. Furthermore, the paper extends this approach to open-world scenarios with LoFT-OW, specifically incorporating mechanisms to detect and filter out-of-distribution (OOD) samples from unlabeled data. The authors demonstrate that these fine-tuned models achieve superior performance on various benchmarks, even when utilizing significan...
2025-09-17
17 min
AI: post transformers
QuantAgent: Multi-Agent LLM for High-Frequency Trading
This September 2025 paper describes QuantAgent, a novel multi-agent large language model (LLM) framework designed for high-frequency quantitative trading based solely on price-derived market signals. The system decomposes trading decisions into four specialized agents—IndicatorAgent, PatternAgent, TrendAgent, and DecisionAgent—which analyze market dynamics from complementary perspectives and communicate through structured prompts. QuantAgent consistently outperforms baseline models across diverse assets, including commodities, equities, and cryptocurrencies, demonstrating robust generalization and achieving high directional accuracy in predicting price movements. A key feature is its ability to produce traceable, language-native explanations for its trading decisions...
2025-09-17
17 min
AI: post transformers
Infini-gram: Scaling Unbounded N-gram Language Models
This April 2025 paper introduces Infini-gram, a novel engine designed to scale n-gram language models to an unprecedented 5 trillion tokens and support unbounded n (∞-gram LMs). Unlike traditional methods that rely on pre-computed count tables, Infini-gram leverages suffix arrays for efficient, low-latency calculation of n-gram and ∞-gram probabilities, even for extremely long contexts. The authors demonstrate that this modernized approach significantly improves the perplexity of neural Large Language Models (LLMs), by up to 73%, by offering complementary insights into human-written and machine-generated text. Beyond enhancing LLMs, the Infini-gram engine also enables various appl...
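A toy version of the suffix-array idea is easy to sketch: counts for any n-gram are answered by binary search over sorted suffixes, so no count table for a fixed n is ever built. Everything below, including the backoff rule, is a simplified illustration rather than the Infini-gram engine itself.

```python
from bisect import bisect_left, bisect_right

def build_suffix_array(tokens):
    """Naive O(n^2 log n) suffix array over a token list; fine for a toy corpus."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count(tokens, sa, pattern):
    """Occurrences of `pattern` (a token tuple) via binary search on the suffix array."""
    key = lambda i: tuple(tokens[i:i + len(pattern)])
    return bisect_right(sa, pattern, key=key) - bisect_left(sa, pattern, key=key)

def infgram_prob(tokens, sa, context, next_token):
    """Back off to the longest suffix of `context` that occurs in the corpus."""
    for start in range(len(context)):
        ctx = tuple(context[start:])
        c_ctx = count(tokens, sa, ctx)
        if c_ctx > 0:
            return count(tokens, sa, ctx + (next_token,)) / c_ctx
    return None  # empty-context (unigram) fallback omitted in this toy

corpus = "a b a b c a b".split()
sa = build_suffix_array(corpus)
print(infgram_prob(corpus, sa, ("a",), "b"))   # 1.0 in this toy corpus
```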
2025-09-17
19 min
AI: post transformers
Generalist Reward Modeling with Inference-Time Scaling
This April 2025 paper introduces Self-Principled Critique Tuning (SPCT), a novel method designed to enhance the inference-time scalability of Generative Reward Models (GRMs) for various domains. It details how SPCT, through a combination of rejective fine-tuning and rule-based online reinforcement learning, facilitates the adaptive generation of principles and critiques, thereby improving the quality and inference-time scalability of GRMs. The paper compares different reward generation paradigms (scalar, semi-scalar, generative) and scoring patterns (pointwise, pairwise), demonstrating that the proposed DeepSeek-GRM models, particularly when guided by a meta Reward...
2025-09-16
17 min
AI: post transformers
Hierarchical Reasoning Model: Brain-Inspired AI for Complex Tasks
This August 2025 paper introduces the Hierarchical Reasoning Model (HRM), a novel AI architecture inspired by the human brain's hierarchical and multi-timescale processing. This model aims to overcome the limitations of current large language models (LLMs) and Chain-of-Thought (CoT) techniques in complex reasoning tasks, which often suffer from computational inefficiencies and extensive data requirements. HRM utilizes two interdependent recurrent modules: a high-level module for abstract planning and a low-level module for detailed computations, enabling it to achieve significant computational depth. Notably, HRM demonstrates exceptional performance on challenging reasoning benchmarks like Sudoku a...
2025-09-16
17 min
AI: post transformers
Native Sparse Attention: Efficient Long-Context LLMs
This February 2025 paper introduces Native Sparse Attention (NSA), a novel approach to address the computational demands of long-context modeling in large language models. NSA combines algorithmic innovations like a dynamic hierarchical sparse strategy with hardware-aligned optimizations to significantly improve efficiency. The paper highlights NSA's ability to maintain or even surpass the performance of traditional "Full Attention" models across various benchmarks, including general language, long-context tasks, and instruction-based reasoning, while achieving substantial speedups in decoding, forward, and backward propagation. It critically analyzes the shortcomings of existing sparse attention methods, particularly their failure to achi...
2025-09-16
16 min
AI: post transformers
CodeI/O: Reasoning Patterns Through Code Input-Output Prediction
This February 2025 paper introduces CodeI/O, a novel training method for Large Language Models (LLMs) that enhances general reasoning abilities by transforming code into an input-output prediction task. Instead of focusing on generating code, CodeI/O trains models to predict the inputs or outputs of given code, expressed as natural-language Chain-of-Thought (CoT) rationales. This approach allows LLMs to learn universal reasoning primitives embedded in code, such as logic flow and decision-making, while decoupling them from specific programming syntax. An improved version, CodeI/O++, further refines training data through
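To make the data construction concrete, here is a hedged sketch of how an output-prediction and an input-prediction example could be derived from one function and one sampled input; the prompt wording is invented for illustration and is not the paper's template.

```python
import inspect

def make_examples(fn, sample_input):
    """Build one output-prediction and one input-prediction example from a function."""
    src = inspect.getsource(fn)
    output = fn(*sample_input)
    output_pred = {
        "prompt": f"Given the code:\n{src}\nand input {sample_input}, "
                  f"reason step by step and predict the output.",
        "target": repr(output),
    }
    input_pred = {
        "prompt": f"Given the code:\n{src}\nand observed output {output!r}, "
                  f"reason step by step and predict a feasible input.",
        "target": repr(sample_input),
    }
    return output_pred, input_pred

def second_largest(xs):
    return sorted(set(xs))[-2]

out_ex, in_ex = make_examples(second_largest, ([3, 1, 4, 1, 5],))
print(out_ex["target"])   # '4'
```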
2025-09-16
16 min
AI: post transformers
Janus-Pro: Unified Multimodal AI with Scaled Improvements
This January 2025 paper introduces Janus-Pro, an enhanced artificial intelligence model for multimodal understanding and generation. It builds upon its predecessor, Janus, through optimized training strategies, expanded data, and increased model size. The authors demonstrate that Janus-Pro achieves significant improvements in both multimodal understanding benchmarks and text-to-image generation capabilities, producing more stable and aesthetically pleasing outputs. This work highlights the benefits of decoupling visual encoding for understanding and generation tasks within a unified autoregressive transformer architecture. Source: https://arxiv.org/pdf/2501.17811
2025-09-16
15 min
AI: post transformers
Federated Post-Training LLMs: An Accessibility and Efficiency Survey
This August 2025 paper examines the evolving landscape of Federated Large Language Models (FedLLM), focusing on how large language models are post-trained while preserving user data privacy. The authors introduce a novel taxonomy that categorizes FedLLM approaches based on model accessibility (white-box, gray-box, and black-box) and parameter efficiency. It highlights various techniques within these categories, such as adapter-based tuning and prompt tuning, which reduce computational and communication overhead. The paper also discusses the growing importance of inference-only black-box settings for future FedLLM development and identifies
2025-09-16
20 min
AI: post transformers
Non-Penetrative Tensor Partitioning for Collaborative AIoT Inference
This June 2025 paper introduces Non-Penetrative Tensor Partitioning (NPTP), a novel method designed to improve the speed of collaborative inference for Deep Neural Networks (DNNs) on Internet of Things (IoT) devices. It addresses the common challenge of limited resources and strict latency requirements by minimizing the communication overhead that typically arises when large images are divided and processed across multiple devices. Unlike existing methods that utilize penetrative partitioning, which leads to substantial data sharing between devices, NPTP employs a non-penetrative approach and a Multilevel Partitioning Algorithm (MPA) to reduce this inter-device communication. Experimental results demonstrate that NPTP significantly outperforms state-of-the-art...
2025-09-16
15 min
AI: post transformers
Collaborative Edge Inference with Dynamic Task Offloading and Early Exiting
This December 2024 paper introduces a collaborative inference framework designed for large-scale models in 5G smart city edge computing environments, addressing the challenge of limited memory and computing capacity on individual edge nodes. The framework partitions large models into sub-models deployed across multiple edge nodes and incorporates an early exit mechanism to accelerate inference. To manage the complexities of heterogeneous systems and dynamic environments, the authors propose a distributed algorithm called DTO-EE, which jointly optimizes task offloading strategies and confidence thresholds for early exits. Experimental r...
2025-09-16
13 min
AI: post transformers
Adaptive LLM Partitioning for Edge Inference
This May 2025 paper introduces a resource-aware algorithm designed to optimize the performance of Large Language Models (LLMs) for low-latency inference on edge computing devices. The core innovation lies in its fine-grained partitioning of the Transformer architecture, specifically at the attention head-level, rather than coarser layer-level divisions. This approach allows for dynamic reassignment and migration of these individual attention heads and their associated Key/Value (K/V) caches across heterogeneous edge devices. By managing the expanding memory footprint of K/V caches and exploiting parallel execution of attention heads, the proposed method significantly reduces inference latency and memory usage compared...
2025-09-16
15 min
AI: post transformers
UQ: Unsolved Questions for Language Models
This August 2025 paper introduces UQ, a novel evaluation framework designed to challenge large language models (LLMs) with complex, unsolved questions sourced from platforms like Stack Exchange, where no definitive ground truth answers currently exist. The framework consists of three main components: UQ-Dataset, a collection of 500 hand-filtered, difficult, and unsolved questions; UQ-Validators, a set of LLM-based validation strategies that assess candidate solutions by leveraging the observation that models are often better at verifying answers than generating them; and UQ-Platform, which facilitates community engagement and human verification. The paper highlights the generator-validator gap, demonstrating that LLMs show improved performance in validating...
2025-09-16
17 min
AI: post transformers
PETALS: Collaborative Large Language Model Inference and Fine-tuning
This March 2023 paper introduces PETALS, a novel system designed to facilitate the collaborative inference and fine-tuning of large language models (LLMs) by pooling resources from multiple participants. It addresses the significant computational and memory demands of LLMs, which typically restrict access for many researchers. PETALS proposes an alternative to traditional methods like slow RAM offloading or inflexible inference APIs by allowing distributed processing across a network of consumer GPUs, enhancing speed and flexibility. The system incorporates optimizations like 8-bit quantization and dynamic load balancing to improve performance and reliability. Ultimat...
2025-09-16
15 min
AI: post transformers
AWQ: On-Device LLM Compression and Acceleration
This July 2024 paper introduces Activation-aware Weight Quantization (AWQ), a novel method for compressing Large Language Models (LLMs) by quantizing weights to low-bit integers for efficient deployment on edge devices. It highlights that AWQ identifies and protects crucial "salient" weights by observing activation distributions, which significantly reduces quantization error without requiring computationally intensive training or overfitting to specific datasets. Complementing AWQ, the paper also presents TinyChat, an inference framework specifically designed to optimize and accelerate these 4-bit quantized LLMs on various hardware, including mobile GPUs and even resource-constrained devices like the Raspberry Pi, achieving substantial speedups compared to traditional implementations...
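The salient-channel idea can be illustrated in a few lines: channels with large average activation magnitude are scaled up before low-bit quantization, and the scale is folded back into the layer input so the product is unchanged. This is a simplified stand-in (symmetric 4-bit quantization, a fixed exponent alpha), not AWQ's grid-searched scales or the TinyChat kernels.

```python
import torch

def quantize_sym(w: torch.Tensor, n_bits: int = 4):
    """Symmetric per-output-channel quantization to n_bits, returned dequantized."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=0, keepdim=True) / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

def awq_like_quant(W, act_mean, alpha: float = 0.5):
    """W: (in_features, out_features); act_mean: mean |activation| per input channel."""
    s = act_mean.clamp(min=1e-5) ** alpha        # larger scale for salient channels
    return quantize_sym(W * s[:, None]), s       # protect salient rows, then quantize

W = torch.randn(512, 512)
x = torch.randn(64, 512)
W_q, s = awq_like_quant(W, x.abs().mean(dim=0))
y_ref = x @ W
y_awq = (x / s) @ W_q                            # fold the scale into the input side
print((y_ref - y_awq).abs().mean())              # smaller than naive quantization error
```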
2025-09-15
19 min
AI: post transformers
HybridServe: Efficient LLM Inference with Hybrid Caching
This January 2025 paper introduces HybridServe, an LLM inference system designed to enhance throughput and cost-effectiveness for large language models by optimizing memory usage and host-GPU communication. It tackles the challenges of host memory offloading, where model parameters and KV cache are stored on slower host memory to reduce costs but can lead to GPU underutilization due to limited transfer bandwidth. HybridServe proposes a novel activation checkpointing technique with a KV-Activation hybrid caching scheme that stores intermediate activations, allowing for faster recomput...
2025-09-15
22 min
AI: post transformers
FlexGen: High-Throughput LLM Inference on a Single GPU
This June 2023 paper introduces FlexGen, a novel high-throughput generation engine designed to overcome the substantial computational and memory demands of large language model (LLM) inference on limited hardware, specifically a single commodity GPU. It details FlexGen's ability to aggregate memory and computation across the GPU, CPU, and disk, employing an optimized scheduling approach and a linear programming-based policy search to store and access tensors efficiently. Furthermore, FlexGen incorporates 4-bit compression for model weights and attention caches, which significantly reduces memory footprint with minimal accuracy loss. The research demonstrates FlexGen's superior performance, achieving subst...
2025-09-15
20 min
AI: post transformers
GraphSAGE: Inductive Representation Learning on Large Graphs
This September 2018 paper introduces GraphSAGE, a novel inductive framework designed to generate node embeddings for large, evolving graphs, addressing limitations of prior transductive methods that struggle with unseen data. Instead of learning a specific embedding for each node, GraphSAGE learns a function that generates these embeddings by sampling and aggregating features from a node's local neighborhood. The authors evaluate various aggregator architectures, including mean, LSTM, and pooling functions, demonstrating that GraphSAGE significantly outperforms strong baselines on node classification tasks across diverse datasets, such as citation networks, Reddit posts, and...
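The aggregate-then-transform step is compact enough to sketch. Below is a mean-aggregator layer in the spirit of GraphSAGE, with fixed-size neighbor sampling and L2 normalization; the adjacency format and sizes are illustrative assumptions rather than the reference implementation.

```python
import random
import torch
import torch.nn as nn

class SageMeanLayer(nn.Module):
    def __init__(self, in_dim, out_dim, num_samples=5):
        super().__init__()
        self.lin = nn.Linear(2 * in_dim, out_dim)
        self.num_samples = num_samples

    def forward(self, x, adj):
        """x: (num_nodes, in_dim); adj: dict node -> list of neighbor ids."""
        agg = torch.zeros_like(x)
        for v, nbrs in adj.items():
            if nbrs:
                sampled = random.sample(nbrs, min(self.num_samples, len(nbrs)))
                agg[v] = x[sampled].mean(dim=0)          # mean of sampled neighbor features
        h = torch.relu(self.lin(torch.cat([x, agg], dim=-1)))
        return h / (h.norm(dim=-1, keepdim=True) + 1e-8) # L2-normalize embeddings

x = torch.randn(4, 8)
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
print(SageMeanLayer(8, 16)(x, adj).shape)   # torch.Size([4, 16])
```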
2025-09-15
26 min
AI: post transformers
MetaGraph: knowledge graphs from financial NLP
This September 2025 paper presents MetaGraph, a novel methodology for constructing knowledge graphs from scientific literature, specifically applied to Financial Natural Language Processing (NLP) research between 2022 and 2025. The authors utilized Large Language Models (LLMs) to extract key information from 681 papers, including tasks, datasets, models, motivations, and limitations, and organized it into a structured, queryable format. The analysis highlights three phases in Financial NLP's evolution: initial LLM adoption and task/dataset innovation, subsequent critical reflection on LLM limitations, and a current trend toward integrating peripheral techniques into mod...
2025-09-15
17 min
AI: post transformers
Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Model
This August 2025 paper explores the critical area of fact-checking and factuality evaluation in Large Language Models (LLMs). It systematically analyzes the challenges of misinformation generation, particularly hallucinations, which are factually incorrect but fluent outputs from LLMs. The paper investigates various mitigation strategies, including fine-tuning, instruction tuning, and Retrieval-Augmented Generation (RAG), which grounds LLM outputs in external knowledge. It further examines evaluation metrics, datasets, and prompting strategies used to assess and enhance the factual accuracy of these models, highlighting the need for more robust, explainable, and domain-specific fact-che...
2025-09-15
19 min
AI: post transformers
The Illusion of Diminishing Returns in LLM Execution
This September 2025 paper explores the concept of long-horizon execution in Large Language Models (LLMs), arguing that marginal gains in single-step accuracy can lead to exponential improvements in the length of tasks LLMs can complete. The authors introduce a novel framework to isolate execution capabilities by providing models with necessary knowledge and plans, revealing that larger models can execute significantly more steps, even when smaller models achieve perfect single-turn accuracy. A key finding is the "self-conditioning effect," where LLMs become more prone to errors when their past mistakes are present in the context...
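The compounding-error intuition can be written out in one line. Under the simplifying assumption of independent per-step errors (used here only for illustration), single-step accuracy p bounds the task horizon as follows:

```latex
\[
  \Pr[\text{$n$-step task completed}] = p^{\,n},
  \qquad
  n(p, s) = \frac{\ln s}{\ln p}.
\]
% At a target success rate s = 0.5: p = 0.99 gives n ~ 69 steps, while
% p = 0.999 gives n ~ 693 steps, roughly a tenfold longer horizon from
% under one point of single-step accuracy.
```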
2025-09-15
15 min
AI: post transformers
PyTorch FSDP: Scaling Fully Sharded Data Parallel
This September 2023 paper introduces PyTorch Fully Sharded Data Parallel (FSDP), an advanced solution designed to scale the training of exceptionally large machine learning models. It addresses limitations of previous methods like Distributed Data Parallel (DDP) by sharding model parameters, gradients, and optimizer states across multiple GPUs, thereby drastically reducing individual GPU memory consumption. FSDP employs various techniques, including deferred initialization, flexible sharding strategies, and optimizations for communication overlap and prefetching, to ensure high efficiency and a user-friendly experience. The research demonstrates FSDP's effectiveness in training models with billions of parameters, achieving nea...
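A minimal usage sketch of the sharding wrapper follows, assuming a multi-GPU host launched with torchrun; the tiny model, the optimizer choice, and the absence of an auto-wrap policy are simplifications for illustration, not recommendations from the paper.

```python
# Launch with e.g. `torchrun --nproc_per_node=2 fsdp_sketch.py`.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
    model = FSDP(model, device_id=local_rank)   # shards params, grads, optimizer state
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 1024, device=local_rank)
    loss = model(x).pow(2).mean()
    loss.backward()                             # gradients are reduce-scattered across ranks
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```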
2025-09-15
19 min
AI: post transformers
Llama 3: Architecture, Capabilities, and Safety
This November 2024 paper from the Meta Llama Team introduces Llama 3, a new family of large language models featuring 8B, 70B, and 405B parameters, designed with native multilingual support, coding, reasoning, and tool usage capabilities. The development emphasizes data quality and diversity, employing extensive filtering, de-duplication, and heuristic cleaning processes for both English and multilingual data, alongside scaling laws to optimize model size and training budgets. The models utilize a standard dense Transformer architecture with minor adaptations like grouped query attention and an attention mask for multi-document sequences, demonstrating comparable performance to leading models such as GPT-4 across various...
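One of the adaptations mentioned, grouped query attention, is easy to sketch: several query heads share each key/value head, shrinking the KV cache. The head counts and shapes below are illustrative, not Llama 3's actual configuration.

```python
import torch
import torch.nn.functional as F

def gqa(q, k, v, n_q_heads=8, n_kv_heads=2):
    """q: (seq, n_q_heads, d); k, v: (seq, n_kv_heads, d)."""
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)            # share each KV head across its group
    v = v.repeat_interleave(group, dim=1)
    q, k, v = (t.transpose(0, 1) for t in (q, k, v)) # (heads, seq, d)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return (F.softmax(scores, dim=-1) @ v).transpose(0, 1)

seq, d = 16, 64
out = gqa(torch.randn(seq, 8, d), torch.randn(seq, 2, d), torch.randn(seq, 2, d))
print(out.shape)   # torch.Size([16, 8, 64])
```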
2025-09-15
22 min
AI: post transformers
Graph Patterns of Knowledge in Large Language Models
This May 2025 paper explores the structural patterns of knowledge within Large Language Models (LLMs) by adopting a graph-based perspective. The authors quantify LLM knowledge at both the triplet and entity levels, analyzing its relationship with graph properties like node degree. Key findings include the discovery of knowledge homophily, where closely connected entities exhibit similar knowledgeability, and a positive correlation between an entity's degree and its knowledge. These insights further motivate the development of graph machine learning models to predict entity knowledge, which can then be used to strategically select less-known triplets for fine-tuning LLMs, leading to improved performance. The...
2025-09-14
15 min
AI: post transformers
All for One: LLMs Solve Mental Math at the Last Token
This September 2025 paper investigates how large language models (LLMs) perform mental math, particularly focusing on the flow of information and computational processes within their transformer architecture. The authors introduce two novel techniques, Context-Aware Mean Ablation (CAMA) and Attention-Based Peeking (ABP), to identify a minimal computational subgraph called All-for-One (AF1). This subgraph reveals that for mental math tasks, input-specific computation is largely deferred to later layers and primarily handled by the final token, which receives necessary information from other tokens during a few specific intermediate layers. The
2025-09-13
17 min
AI: post transformers
Survey of Reinforcement Learning for Large Reasoning Models
This September 2025 paper provides a comprehensive overview of Reinforcement Learning (RL) as applied to Large Reasoning Models (LRMs). It breaks down the field into foundational components such as reward design and policy optimization, explaining various algorithms like PPO and GRPO. The document also discusses training resources, distinguishing between static corpora and dynamic environments, and highlights diverse applications of RL in LRMs, including coding, agentic tasks, and multimodal understanding, with a focus on models from 2025. Ultimately, the paper aims to identify future directions for scaling RL in LRMs to...
2025-09-13
25 min
AI: post transformers
SpikingBrain: Brain-Inspired LLMs for Efficient Long-Context Processing
This September 2025 technical report presents SpikingBrain, a novel family of large language models (LLMs) that draws inspiration from brain mechanisms to address the efficiency challenges of traditional Transformer architectures. The research focuses on efficient long-context training and inference by developing hybrid linear attention architectures and an adaptive threshold spiking neuron scheme. A significant aspect of this work is the successful training and deployment of these models on non-NVIDIA GPU clusters, specifically the MetaX platform, demonstrating the feasibility of large-scale LLM development on alternative hardware. The authors highlight substantial speedups in inference for long sequences and significant...
2025-09-13
16 min
AI: post transformers
Statistical Methods for Generative AI Reliability
This September 2025 paper explores the critical role of statistical methods in enhancing the reliability and functionality of Generative AI (GenAI), which inherently lacks guarantees regarding correctness or safety. It discusses various statistical applications, including improving and altering model behavior through techniques like output trimming and abstention based on risk scores, often utilizing conformal prediction for provable guarantees. The text also covers diagnostics and uncertainty quantification (UQ), differentiating between epistemic and aleatoric uncertainty and addressing challenges like semantic multiplicity and the need for calibration in GenAI outputs. Furthermore, it highlights the importance of statistical inference in evaluating GenAI models, particularly...
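One of the recurring tools here, conformal prediction for abstention, can be sketched in a few lines: a threshold on a risk score is calibrated on held-out data so that accepting only low-score outputs carries a finite-sample coverage guarantee under exchangeability. The score distribution and the alpha below are illustrative assumptions, not the paper's specific procedure.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha: float = 0.1) -> float:
    """Return the ceil((n+1)(1-alpha))-th smallest calibration score (higher = riskier)."""
    s = np.sort(np.asarray(cal_scores))
    n = len(s)
    k = int(np.ceil((n + 1) * (1 - alpha))) - 1   # 0-based index with finite-sample correction
    return float(s[min(k, n - 1)])

def accept(score: float, threshold: float) -> bool:
    return score <= threshold                     # otherwise abstain

rng = np.random.default_rng(0)
cal = rng.exponential(size=500)                   # stand-in risk scores on a calibration set
thr = conformal_threshold(cal, alpha=0.1)
print(thr, accept(0.3, thr), accept(5.0, thr))
```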
2025-09-13
18 min
AI: post transformers
EntiGraph: Scaling Language Models with Synthetic Pretraining
This October 2024 paper introduces synthetic continued pretraining (synthetic CPT), a novel method designed to enhance language model knowledge acquisition from small, specialized text collections. Current large language models often struggle with data efficiency and learning niche facts from limited sources. The core of this approach is EntiGraph, a synthetic data augmentation algorithm that extracts entities and their relationships from a small corpus to generate a much larger, more diverse synthetic dataset. Experiments using the QuALITY dataset demonstrate that EntiGraph CPT significantly improves a model's ability to answer ques...
2025-09-13
23 min
AI: post transformers
NOVELTYBENCH: Evaluating Language Model Diversity
This August 2025 paper introduces NOVELTYBENCH, a new benchmark designed to evaluate how well large language models (LLMs) generate diverse and high-quality outputs, addressing the problem of "mode collapse" where models produce repetitive responses. The research found that current state-of-the-art LLMs consistently generate less diversity than human writers, with larger models often exhibiting even lower diversity than their smaller counterparts. The benchmark uses a unique approach to measure functional equivalence between generations, ensuring that diversity is meaningful to users. While certain prompting strategies, like in-context regeneration, can enhance diversity, the study...
2025-09-12
18 min
AI: post transformers
HyperController: Fast, Stable Reinforcement Learning Hyperparameter Optimization
This April 2025 paper introduces HyperController, a novel and computationally efficient algorithm designed to optimize hyperparameters during the training of reinforcement learning neural networks. Hyperparameter optimization is crucial for improving machine learning models, but traditional methods can be slow and computationally intensive. HyperController addresses these challenges by modeling the hyperparameter optimization problem as an unknown Linear Gaussian Dynamical System and leveraging the Kalman filter for efficient prediction. The algorithm is validated through experiments on various OpenAI Gymnasium environments, where it demonstrates faster training times and superior or comparable performance compared to existing methods, achieving the highest median reward i...
2025-09-12
18 min
AI: post transformers
Parallel-R1: Reinforcement Learning for Parallel Thinking in LLMs
This September 10, 2025 technical report from Tencent AI Lab introduces Parallel-R1, a novel reinforcement learning (RL) framework designed to enhance large language models (LLMs) with parallel thinking capabilities for complex mathematical reasoning tasks. Unlike previous methods relying on supervised fine-tuning (SFT) over synthetic data, Parallel-R1 utilizes a progressive curriculum to address the cold-start problem in RL, initially using SFT on simpler tasks to instill the basic format of parallel thinking before transitioning to RL for exploration and generalization on more challenging problems. The research highlights that parallel thinking evo...
2025-09-12
15 min
AI: post transformers
Explaining AI for Digital Advertising with LLMs
This April 2025 paper introduces SODA, a novel framework designed to enhance digital advertising strategies by making opaque AI systems more understandable for marketers. The authors highlight the current challenges faced by advertisers due to the lack of transparency in major ad platforms like Meta, which often results in wasted ad spend and reliance on intuition. To address this, SODA integrates Large Language Models (LLMs) with explainable AI techniques to provide clear, actionable insights into ad performance. The framework initially employs an improved Click-Through Rate (CTR) prediction model, SoWide-v2, which also offers visua...
2025-09-11
16 min
AI: post transformers
AdLlama: Boosting Ad Performance with Reinforcement Learning
This July 2025 paper introduces AdLlama, a new large language model (LLM) for generating Facebook ad text, trained using Reinforcement Learning with Performance Feedback (RLPF). Unlike previous models that relied on supervised fine-tuning to imitate curated ads, AdLlama utilizes historical ad performance data, specifically click-through rates (CTR), as a reward signal to optimize its text generation. A large-scale A/B test on Facebook, involving nearly 35,000 advertisers, demonstrated that AdLlama significantly improved advertiser-level CTR by 6.7% and increased the number of ad variations advertisers created by 18.5%. The findings highlight RLPF as a promising, generalizable approach for m...
2025-09-11
17 min
AI: post transformers
ByteCheckpoint: A Unified LLM Checkpointing System
This July 2024 paper introduces ByteCheckpoint, a novel PyTorch-native system designed for Large Language Model (LLM) development. This system addresses critical challenges in LLM training, particularly the high I/O costs associated with saving and loading checkpoints, and the complexities of checkpoint resharding across different parallel configurations and training frameworks. ByteCheckpoint achieves this through a data/metadata disaggregated storage architecture and asynchronous tensor merging, enabling automatic online resharding and multi-framework support. The paper highlights ByteCheckpoint's significant performance improvements in reducing checkpoint savi...
2025-09-11
17 min
AI: post transformers
Darling: Reinforcing Diversity and Quality in Language Models
This September 2025 paper introduces Diversity-Aware Reinforcement Learning (Darling), a novel framework designed to enhance both the quality and semantic diversity of large language model (LLM) generations. Recognizing that traditional post-training methods often sacrifice diversity for accuracy, Darling integrates a learned partition function to measure semantic diversity beyond simple lexical variations. This diversity signal is then multiplied with a quality reward during online reinforcement learning, which encourages LLMs to produce responses that are not only high-quality but also distinct and novel. Experiments on both non-verifiable tasks, such as creative writing, and verifiable tasks, like competition math, demonstrate that Darling consistently...
2025-09-10
20 min
AI: post transformers
INF2: Near-Storage LLM Inference for High Throughput
This February 2025 paper introduces INF2, a novel framework designed to enhance the generative inference throughput of large language models (LLMs) by utilizing computational storage devices (CSDs). The core innovation, attention-near storage (ANS), offloads memory-intensive self-attention operations directly to accelerators within these storage devices, significantly reducing data transfer bottlenecks over the system interconnect. To further boost performance, INF2 incorporates delayed KV cache writeback which minimizes storage write latency by batching updates to the KV cache, and cooperative X-cache, which optimizes host memory usage by storing input activations instead of key-value caches for cooperative processing between the GPU and CSDs. Through...
2025-09-10
21 min
AI: post transformers
K2-Think: A Parameter-Efficient Reasoning System
The September 9, 2025 press release and paper announce and detail K2 Think, an advanced open-source AI reasoning system developed by the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) and G42 in the UAE. K2 Think stands out for its parameter efficiency, achieving performance comparable to much larger models, particularly in mathematical reasoning, with only 32 billion parameters. This breakthrough is attributed to a six-pillar approach, including supervised fine-tuning, reinforcement learning with verifiable rewards, agentic planning, test-time scaling, and optimization for Cerebras Waf...
2025-09-10
16 min
AI: post transformers
AlphaEvolve: AI for Scientific and Algorithmic Discovery
These May–June 2025 sources introduce AlphaEvolve, a novel AI coding agent developed by Google DeepMind in collaboration with mathematicians like Javier Gómez Serrano and Terence Tao. This Gemini-powered tool utilizes an evolutionary process, similar to natural selection, to generate and iteratively refine code solutions for complex problems. AlphaEvolve has demonstrated its capability in scientific and algorithmic discovery, successfully tackling open mathematical challenges such as improving bounds for matrix multiplication and the kissing number problem in 11 dimensions. Beyond theoretical advancements, it has also been applied to optimize critical components within G...
2025-09-10
14 min
AI: post transformers
BLEU: Automatic Machine Translation Evaluation
This July 2002 paper introduced BLEU (Bilingual Evaluation Understudy), an automatic and inexpensive method for evaluating machine translation (MT) quality. It highlights the limitations of human evaluation, such as its high cost and time consumption, and proposes BLEU as a quick, language-independent alternative that correlates strongly with human judgment. The core concept of BLEU involves measuring the "closeness" of a machine translation to one or more human reference translations through a modified n-gram precision metric and a brevity penalty. The paper details the mathematical formulation of the BLEU sco...
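For reference, the standard BLEU formulation combines the modified n-gram precisions p_n (uniform weights w_n = 1/N, typically N = 4) with a brevity penalty that compares candidate length c to the effective reference length r:

```latex
\[
  \mathrm{BP} =
  \begin{cases}
    1 & \text{if } c > r \\
    e^{\,1 - r/c} & \text{if } c \le r
  \end{cases}
  \qquad
  \mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\Big(\sum_{n=1}^{N} w_n \log p_n\Big).
\]
```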
2025-09-10
20 min
AI: post transformers
Mini-o3: Scaling Reasoning for Visual Search
This September 2025 paper introduces Mini-o3, a Vision-Language Model (VLM) designed to overcome the limitations of existing VLMs in handling complex visual search tasks that require multi-turn reasoning and trial-and-error exploration. The researchers developed a three-component training recipe, including the creation of the Visual Probe Dataset with challenging, high-resolution images, a pipeline for synthesizing diverse multi-turn trajectories for supervised finetuning, and an over-turn masking technique in reinforcement learning. This masking prevents penalization of long, incomplete reasoning paths, encouraging deeper exploration without increasing training time. Mini-o3 demonstrates state-of-the-art performanc...
2025-09-10
12 min
AI: post transformers
Masked Diffusion Models: Performance and Theory
This September 2025 paper analyzes the theoretical benefits and limitations of Masked Diffusion Models (MDMs) for text generation, contrasting them with auto-regressive models. While MDMs can sample multiple tokens in parallel, offering a potential for efficiency, the research demonstrates that their actual performance depends heavily on the evaluation metric. Specifically, MDMs can achieve near-optimal fluency (low Token Error Rate) with a constant number of sampling steps, regardless of sequence length. However, when assessed for correctness (low Sequence Error Rate), particularly for tasks requiring logical reasoning, MDMs necessitate a number of sampling...
2025-09-10
16 min
AI: post transformers
TraceRL: Reinforcement Learning for Diffusion Language Models
This September 2025 paper introduces TraceRL, a novel reinforcement learning framework designed to enhance diffusion language models (DLMs) across various architectural types. The core idea behind TraceRL is to align the training process with the preferred inference trajectories of the model, which demonstrably improves performance on complex reasoning tasks like mathematics and coding. The authors also propose a diffusion-based value model to boost training stability. Through experiments, the paper showcases the effectiveness of TraceRL, yielding state-of-the-art DLMs called TraDo that outperform larger autoregressive models. Furthermore, the source provi...
2025-09-09
13 min
AI: post transformers
LLM Benchmark Robustness to Linguistic Variation
This September 2025 paper investigates the reliability and robustness of Large Language Models (LLMs) when evaluated using traditional benchmarks. The authors systematically paraphrased questions across six common benchmarks and observed how 34 different LLMs performed. Their findings indicate that while LLM rankings remain relatively consistent, their absolute effectiveness scores significantly decline when faced with reworded questions, suggesting a lack of robustness to linguistic variability. The study highlights that current benchmark evaluations may overstate LLM generalization abilities and advocates for more robustness-aware evaluation methodologies that better reflect real-world language us...
2025-09-09
17 min
AI: post transformers
Behavioral Fingerprinting of Large Language Models
This September 2025 paper introduces "Behavioral Fingerprinting," a novel framework designed to evaluate Large Language Models (LLMs) beyond traditional performance scores like MMLU. It aims to understand how models "think," creating a multi-faceted profile of their intrinsic cognitive and interactive styles. The methodology employs a diagnostic prompt suite and an automated evaluation pipeline where a powerful LLM acts as a judge, analyzing eighteen different models across four key dimensions: internal world model, reasoning abilities, biases and personality (including sycophancy), and semantic robustness. Findings indicate a convergence in core reasoning abilities
2025-09-09
15 min
AI: post transformers
Offloading LLM Models and KV Caches to NVMe SSDs
This March 2025 paper examines the input/output (I/O) characteristics of offloading large language model (LLM) components to NVMe SSDs during inference, a critical solution for overcoming GPU memory limitations with ever-growing LLMs. Researchers analyzed block-layer I/O traces from two prominent LLM frameworks, DeepSpeed and FlexGen, to understand how model weights and key-value (KV) caches are handled. The findings indicate that asynchronous I/O using libaio significantly outperforms POSIX for tensor transfers, although neither method fully saturates the NVMe SSD's theoretical bandwidth. For model offloading, I/O is predominantly characterized by 128KiB...
2025-09-08
17 min
AI: post transformers
GPT-NeoX: Large-Scale Autoregressive Language Modeling in PyTorch
This source describes EleutherAI's GPT-NeoX library, a robust open-source framework for training large-scale autoregressive language models on GPUs, building upon the Megatron and DeepSpeed libraries. It highlights the library's advanced features like distributed training, support for various hardware and systems, and cutting-edge architectural innovations. The text also provides practical guidance on setup, configuration, data preparation, training, inference, and evaluation, alongside details on pretrained models like GPT-NeoX-20B and Pythia. Furthermore, it details how to export models to Hugging Face and monitor experiments, underscoring its widespread adoption in research and indus...
2025-09-07
12 min
AI: post transformers
SGLang: Efficient Language Model Program Execution
This June 2024 paper introduces SGLang, a framework designed to enhance the efficiency of Large Language Model (LLM) and Vision Language Model (VLM) serving. It achieves this through a co-design of a flexible frontend language and a fast backend runtime. The frontend simplifies programming with primitives for generation and parallelism, while the backend utilizes novel optimizations like RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding. These innovations allow SGLang to significantly improve throughput and reduce latency compared to existing systems across various LLM applications and hardw...
2025-09-07
17 min
AI: post transformers
Eleuther: Evaluating LLMs
These sources collectively explore various approaches to evaluating and improving Large Language Models (LLMs). Several papers introduce new benchmark datasets designed to test LLMs on complex reasoning tasks, such as the "BIG-Bench Hard (BBH)" suite, the graduate-level "GPQA" questions in science, and "MuSR" for multistep soft reasoning in natural language narratives. A key technique discussed across these sources is Chain-of-Thought (CoT) prompting, which encourages LLMs to show their step-by-step reasoning, leading to improved performance, often surpassing human-rater averages on challenging tasks. Additionally, the "Instruction-Following Eval (IFEval)" introduces a reproducible benchmark for verifiable instructions, allowing for objective assessment of an...
2025-09-07
26 min
AI: post transformers
OpenELM: Apple's Open Language Model Family
These May 2024 sources center on CoreNet, an Apple-developed library for training deep neural networks, and OpenELM, an efficient language model family built using CoreNet. CoreNet is a versatile toolkit supporting various tasks, including foundation models like large language models (LLMs), object classification, and semantic segmentation, with its development evolving from the earlier CVNets. A key innovation highlighted is OpenELM's layer-wise scaling strategy, which optimizes parameter allocation within transformer models to achieve superior accuracy with fewer pre-training tokens compared to other open LLMs. The resources emphasize reproducibility and transparency by providing comprehensive frameworks for OpenELM's training and evaluation, including...
2025-09-07
15 min
AI: post transformers
FineVision: Open Data for Computer Vision
These September 2025 posts describe HuggingFaceM4/FineVision, a large dataset designed for image and text modalities. It is substantial in size, falling in the 10M to 100M samples range, and is available in the Parquet format. This dataset includes various ratings, such as relevance, visual dependency, image correspondence, and formatting, indicating its use in evaluating the quality and relationship between visual and textual content. The examples provided demonstrate that FineVision contains question-and-answer pairs related to diverse charts and diagrams, covering topics like population trends, genetic diseases, software update frequencies, and demographic distributions, suggesting its a...
2025-09-07
15 min
AI: post transformers
Evaluating Large Language Models Trained on Code
This July 2021 paper documents the development and evaluation of OpenAI's Codex models, which are large language models specialized in code generation, particularly Python functions from docstrings. They introduce HumanEval, a hand-written dataset designed to assess the functional correctness of generated code through unit tests, a more robust metric than traditional match-based scores like BLEU. The papers compare the performance of various Codex iterations, including supervised fine-tuned versions (Codex-S), against other models like GPT-3, demonstrating significant improvements in pass rates with increased model size and sample generation. Furthermore, the texts explore the limitations, broa...
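Functional-correctness evaluation of this kind is usually reported as pass@k, estimated without bias from n samples per problem of which c pass the tests. The numerically stable product form below follows the published estimator; the example numbers are arbitrary.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of 1 - C(n - c, k) / C(n, k), computed as a running product."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(n=200, c=10, k=1))    # 0.05 when 10 of 200 samples pass
print(pass_at_k(n=200, c=10, k=10))   # noticeably higher (~0.41)
```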
2025-09-07
16 min
AI: post transformers
Democratizing AI Compute: The Modular Vision
This blog post series from Chris Lattner extensively examines CUDA's pervasive dominance in AI compute, detailing its evolution from a graphics processor to a layered software platform integral to NVIDIA's success, while also highlighting the challenges and complexities it presents to developers and alternative hardware vendors. The articles critically assess various attempts to democratize AI compute, including OpenCL, TVM, XLA, and MLIR, explaining why these alternatives largely failed to dislodge CUDA due to fragmentation, misaligned incentives, and a lack of unified vision. Ultimately, the texts introduce Modular's approach to addressing these issues through its Mojo language, MAX framework, and...
2025-09-07
1h 11
AI: post transformers
Limitations of Embedding-Based Retrieval
This August 2025 paper from Google DeepMind, titled "On the Theoretical Limitations of Embedding-Based Retrieval," explores the fundamental constraints of vector embedding models in information retrieval. The authors demonstrate that the number of relevant document combinations an embedding can represent is inherently limited by its dimension. Through empirical "free embedding" experiments and the introduction of a new dataset called LIMIT, they show that even state-of-the-art models struggle with simple queries designed to stress these theoretical boundaries. The research concludes that for complex, instruction-following queries, alternative...
2025-09-06
15 min
AI: post transformers
SAIR: Accelerating Pharma R&D with AI-Powered Structural Intelligence
This September 2025 paper describes SAIR, the Structurally Augmented IC50 Repository, a groundbreaking open-source dataset developed by SandboxAQ in collaboration with NVIDIA. SAIR is the largest publicly available collection of over 5 million AI-generated 3D protein-ligand structures, each linked with experimentally measured drug potency data (IC₅₀ values). This dataset aims to bridge a critical data gap in AI-powered drug discovery by providing comprehensive structural intelligence, thereby enabling researchers to accelerate R&D, explore novel drug targets, and improve the accuracy of AI models for predicting drug properties. The creation of SAIR involved extensive high-performance computing, taking o...
2025-09-06
18 min
AI: post transformers
EmbeddingGemma: On-Device AI for High-Quality Embeddings
This document announces EmbeddingGemma, a new open embedding model from Google, specifically designed for on-device artificial intelligence (AI). It highlights the model's efficiency, compact size, and best-in-class performance for its category, particularly in multilingual text embedding. The source explains how EmbeddingGemma enables mobile-first Retrieval Augmented Generation (RAG) pipelines and semantic search by generating high-quality text embeddings directly on user hardware, ensuring privacy and offline functionality. It also details the model's compatibility with popular development tools and its ability to offer flexib...
2025-09-05
14 min
AI: post transformers
MTEB & MMTEB: The Massive Text Embedding Benchmark
These academic papers introduce and detail the Massive Multilingual Text Embedding Benchmark (MMTEB), a comprehensive evaluation framework for text embedding models. The MMTEB expands upon existing benchmarks by offering over 500 tasks across 250+ languages and various domains, significantly increasing the diversity and scale of evaluation. It incorporates optimizations like downsampling and caching to reduce computational costs, making the benchmark more accessible, especially for low-resource languages. The papers also evaluate various models, including large language models (LLMs) and smaller, multilingual models, revealing that instruction-tuned models often perform better, and smaller models can surprisingly outperform larger LLMs in highly multilingual or low-resource...
2025-09-05
16 min
AI: post transformers
DeepResearch Arena: Benchmarking LLMs' Research Abilities
This September 2025 paper introduces DeepResearch Arena, a novel benchmark designed to evaluate the research capabilities of large language models (LLMs) by mirroring real-world academic inquiry. This benchmark addresses limitations of existing evaluation methods, which often suffer from data leakage or lack authenticity, by grounding its tasks in academic seminars and expert discourse. A Multi-Agent Hierarchical Task Generation (MAHTG) system is utilized to automatically generate over 10,000 diverse research tasks across multiple disciplines, covering phases from synthesis to evaluation. The paper also proposes a hybrid evaluation framework that combines Keypoint-Aligned Evaluation (KAE) for factual correctness and Adaptively-generated Checklist Evaluation (ACE) for...
2025-09-05
16 min
AI: post transformers
Inverse IFEval: Unlearning LLM Cognitive Inertia
This September 2025 paper introduces Inverse IFEval, a novel benchmark designed to evaluate Large Language Models (LLMs) for their Counter-intuitive Ability. This refers to an LLM's capacity to override its ingrained training patterns and comply with instructions that conflict with conventional norms or standardized formats. The benchmark includes eight distinct categories of such challenging instructions, like "Code without Comments" or "Deliberately Incorrect Answers," to expose the cognitive inertia and overfitting that current LLMs exhibit. The study underscores the need for future LLM development to prioritize adaptability...
2025-09-05
19 min
AI: post transformers
The Rise of Physical Neural Networks
This June 2024 paper examines the current state and future potential of Physical Neural Networks (PNNs), which are AI systems implemented directly in physical hardware rather than purely digital software. It explores various training methodologies for PNNs, including in-silico (digital simulation), in-situ (real-world hardware training), and hybrid approaches like physics-aware training, each with its own advantages and limitations regarding accuracy, speed, cost, and complexity. The text also discusses alternative training paradigms such as Feedback Alignment, Local Learning, and gradient-free methods that aim to overcome challenges associated with traditional backpropagation in physical systems. Furthermore, it hi...
2025-09-04
20 min
AI: post transformers
FastVLM: Efficient Vision Encoding for Language Models
This May 2025 paper introduces FastVLM, an innovative approach designed to enhance the efficiency of Vision Language Models (VLMs). The authors explain that while increasing image resolution is crucial for VLM performance, traditional visual encoders become inefficient. FastVLM addresses this by incorporating FastViTHD, a novel hybrid vision encoder that reduces both the number of visual tokens and encoding time for high-resolution images. This optimization, achieved solely through input image scaling, leads to a significant 3.2x improvement in time-to-first-token (TTFT) while maintaining strong performance on VLM benchmarks, making it a more efficient solution compared to prior methods. The paper, submitted to...
2025-09-04
12 min
AI: post transformers
Apertus Tech Report Overview
This paper introduces Apertus, a large language model developed by the Swiss AI Initiative, a partnership between ETH Zurich and EPFL. The GitHub repository appears to host technical documentation or code related to Apertus, while the Hugging Face page provides a comprehensive overview of the Apertus-8B-Instruct-2509 model. This model is highlighted for being fully open, massively multilingual (supporting over 1800 languages), and compliant with data privacy regulations, even incorporating mechanisms for data protection and copyright requests. The Hugging Face page also outlines the model's technical...
2025-09-04
13 min
AI: post transformers
Supervised Learning in DNA Neural Networks
This September 2025 Nature article, authored by Kevin M. Cherry and Lulu Qian, introduces a novel DNA-based neural network capable of supervised learning in vitro. The authors demonstrate how DNA molecules can be programmed to autonomously classify patterns from molecular examples. This system integrates training data directly into molecular memories and uses these memories for subsequent classification, moving beyond previous systems that relied on in silico learning. The work highlights the potential of molecular circuits to perform complex information processing, opening doors for adaptive decision-making in various physical systems, fr...
2025-09-04
18 min
AI: post transformers
FusionANNS: Billion-Scale ANNS with SSD and GPU
This September 2024 paper introduces FusionANNS, a novel system designed to improve Approximate Nearest Neighbor Search (ANNS) for extremely large datasets. It addresses challenges in existing ANNS systems, such as performance bottlenecks, high operational costs, and accuracy limitations, particularly when dealing with billion-scale vector data in modern AI infrastructure like Large Language Models (LLMs). FusionANNS achieves this through a cooperative CPU/GPU architecture that employs multi-tiered indexing, heuristic re-ranking, and redundancy-aware I/O deduplication. The system is shown to significantly outperform state-of-the-art SSD-based and GPU-accelerated in-memory ANNS solutions in terms of throughput (QPS), cost efficiency, and memory efficiency, while maintaining low latency and high...
2025-09-04
26 min
AI: post transformers
rStar2-Agent: Smarter Math Reasoning Through Agentic RL
This August 2025 paper introduces rStar2-Agent, a 14B math reasoning model developed by Microsoft Research that achieves state-of-the-art performance comparable to much larger models by employing agentic reinforcement learning. The model is trained to "think smarter" through three key innovations: an efficient RL infrastructure that manages high-throughput code execution, a novel GRPO-RoC algorithm for effective reasoning in a noisy code environment by filtering high-quality trajectories, and an efficient training recipe that minimizes computational cost. Demonstrating superior accuracy on challenging math benchmarks like AIME24/25, rStar2-Agent-14B also exhib...
2025-09-03
19 min
AI: post transformers
Scientific LLMs: A Data-Centric Survey and Roadmap
This August 2025 paper offers an extensive overview of the evolution and application of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) within scientific research, primarily focusing on the period from 2018 to 2025. It details how these AI models have progressed through various paradigm shifts, from initial transfer learning to sophisticated scientific agents capable of autonomous research. The document thoroughly examines the diverse data modalities—including visual spectra, microscopy images, molecular encodings, and time-series data—across six key scientific domains: Chemistry, Materials Science, Physics, Life Sciences, Astronomy, and Earth Science. Furthe...
2025-09-03
19 min
AI: post transformers
Pimba: Processing-in-Memory for LLM Serving
This August 2025 paper introduces Pimba, a novel Processing-in-Memory (PIM) accelerator designed to enhance the efficiency of Large Language Model (LLM) serving for both traditional transformer-based models and emerging post-transformer architectures. The authors highlight that memory bandwidth is a critical bottleneck for both types of LLMs, specifically during attention operations in transformers and state updates in post-transformers. Pimba addresses this by integrating PIM technology with LLM quantization, using a State-update Processing Unit (SPU) shared between memory banks to maximize hardware resource sharing and area efficiency. The system employs MX-based quantized arithmetic within its State-update Processing Engine (SPE), which is identified as a Pareto-optimal choic...
2025-08-27
27 min
AI: post transformers
Oaken: Fast, Efficient LLM Serving with Hybrid KV Cache Quantization
This August 2025 paper introduces Oaken, a novel acceleration solution for serving Large Language Models (LLMs) that addresses the significant challenges of memory bandwidth and capacity bottlenecks inherent in batched LLM inference. Oaken achieves this through a co-designed algorithm and hardware architecture, featuring an online-offline hybrid KV cache quantization technique. This technique efficiently reduces the memory footprint and access requirements of the Key-Value (KV) cache by categorizing data into "inliers" and "outliers" using offline threshold profiling and applying group-shift quantization. Furthermore, Oaken integrates custom quantization/dequantization engines and memory management units into LLM accelerators to translate algorithmic gains into tangible performance...
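The sketch below shows the general inlier/outlier split in its simplest form: values outside offline-profiled thresholds are kept in full precision as a sparse side channel, while the rest are uniformly quantized to 4 bits. It is an assumption-laden stand-in, not Oaken's group-shift scheme or its hardware quantization engines, and the thresholds here are just sample quantiles.

```python
# Illustrative threshold-based inlier/outlier quantization (not Oaken's exact scheme).
import numpy as np

def quantize_with_outliers(x, lo, hi, bits=4):
    outlier_mask = (x < lo) | (x > hi)
    scale = (hi - lo) / (2 ** bits - 1)
    q = np.round((np.clip(x, lo, hi) - lo) / scale).astype(np.uint8)   # low-bit inliers
    return q, scale, lo, outlier_mask, x[outlier_mask]                 # outliers kept in full precision

def dequantize(q, scale, zero, outlier_mask, outlier_vals):
    x = q.astype(np.float32) * scale + zero
    x[outlier_mask] = outlier_vals
    return x

x = np.random.randn(4096).astype(np.float32)
lo, hi = np.quantile(x, 0.01), np.quantile(x, 0.99)    # stand-in for offline-profiled thresholds
packed = quantize_with_outliers(x, lo, hi)
print(float(np.abs(dequantize(*packed) - x).max()))    # reconstruction error on the inlier range
```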
2025-08-27
19 min
AI: post transformers
AdamW: Decoupled Weight Decay Regularization for Adaptive Gradient Algorithms
This January 2019 academic paper addresses the common issue of poor generalization in adaptive gradient optimization methods like Adam, compared to traditional Stochastic Gradient Descent (SGD) with momentum. The authors demonstrate that L2 regularization and weight decay are not equivalent for adaptive optimizers, unlike for standard SGD, leading to suboptimal performance in Adam. They propose a simple modification called "decoupled weight decay" (AdamW), which separates the weight decay step from the gradient-based updates. Empirical evidence shows that AdamW significantly improves Adam's generalization performance on image classification tasks and simplifies hyperparameter tuning by decoupling the learning rate and weight decay factors. Furth...
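The difference is easiest to see in the update rule itself. Below is a minimal re-implementation sketch (not the paper's reference code): with plain L2 regularization the decay term is added to the gradient and therefore rescaled by Adam's adaptive statistics, whereas AdamW subtracts the decay directly from the weights. Hyperparameters are illustrative.

```python
# Minimal sketch of Adam with L2 regularization vs. AdamW's decoupled weight decay.
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8,
              weight_decay=0.01, decoupled=True):
    if not decoupled:
        g = g + weight_decay * w          # L2: decay enters the adaptive statistics
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        w = w - lr * weight_decay * w     # AdamW: decay applied directly to the weights
    return w, m, v

w, m, v = np.ones(4), np.zeros(4), np.zeros(4)
for t in range(1, 6):
    g = 0.1 * w                            # toy gradient
    w, m, v = adam_step(w, g, m, v, t)
print(w)
```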
2025-08-27
46 min
AI: post transformers
Training Recurrent Neural Networks: Vanishing and Exploding Gradients
This November 2012 paper addresses the inherent challenges in training Recurrent Neural Networks (RNNs), specifically the vanishing and exploding gradient problems. The authors explore these issues from analytical, geometrical, and dynamical systems perspectives, building upon previous work. They propose and empirically validate a gradient norm clipping strategy to combat exploding gradients and a soft regularization constraint to mitigate vanishing gradients. The research demonstrates that these solutions significantly improve RNN performance on both synthetic pathological tasks requiring long-term memory and natural language processing and music prediction problems. Source: https://arxiv.org/pdf/1211.5063
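A minimal sketch of the clipping strategy, with an illustrative threshold: if the global gradient norm exceeds the threshold, all gradients are rescaled so that the norm equals it.

```python
# Global gradient-norm clipping: rescale all gradients when their combined norm explodes.
import numpy as np

def clip_by_global_norm(grads, threshold=1.0):
    total_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total_norm > threshold:
        scale = threshold / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

grads = [np.random.randn(3, 3) * 50, np.random.randn(3)]    # artificially large gradients
clipped, norm = clip_by_global_norm(grads)
print(norm, np.sqrt(sum(np.sum(g * g) for g in clipped)))   # second value is at most the threshold
```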
2025-08-27
20 min
AI: post transformers
Adafactor: Memory-Efficient Adaptive Learning Rates
This April 2018 paper introduces Adafactor, a novel optimization method designed to reduce the memory footprint of adaptive learning rate algorithms like Adam, particularly for large neural networks. Adafactor achieves this by estimating per-parameter second moments using factored representations, specifically maintaining only row and column sums for weight matrices, thereby reducing memory requirements from O(nm) to O(n+m). The paper also addresses training instability in adaptive methods, proposing update clipping and a gradually increasing decay rate scheme for the second-moment accumulator as solutions. Furthermore, Adafactor suggests scaling parameter updates based on the parameters' own magnitudes rather than absolute step sizes...
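The memory saving follows from storing only row and column statistics of the squared gradients. The sketch below shows that factored second-moment estimate in isolation; the paper's update clipping and increasing-decay-rate schedule are omitted, and all hyperparameters are illustrative.

```python
# Factored second-moment estimate for a weight matrix: keep EMAs of the row sums (R)
# and column sums (C) of grad**2, and reconstruct a rank-1 approximation V ~ R C / sum(R).
import numpy as np

rng = np.random.default_rng(0)
n, m = 256, 128
W = np.zeros((n, m))
R = np.zeros(n)                                    # O(n) row statistics
C = np.zeros(m)                                    # O(m) column statistics
lr, beta2, eps = 1e-2, 0.999, 1e-30

for step in range(1, 101):
    G = rng.standard_normal((n, m)) * 0.1          # toy gradient
    sq = G * G + eps
    R = beta2 * R + (1 - beta2) * sq.sum(axis=1)
    C = beta2 * C + (1 - beta2) * sq.sum(axis=0)
    # Rank-1 reconstruction of the second moment; only R and C are ever stored.
    rms = np.sqrt(np.outer(R, C) / R.sum())
    W -= lr * G / rms

print(W.shape, R.shape, C.shape)                   # optimizer state is O(n + m), not O(n * m)
```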
2025-08-27
17 min
AI: post transformers
SPAM: Stabilizing LLM Training with Spike-Aware Optimization
This February 2025 research addresses the critical issue of training instability in Large Language Models (LLMs), which often stems from sudden, massive "gradient spikes" that can be thousands of times larger than typical gradients. The authors introduce Spike-Aware Adam with Momentum Reset (SPAM), a novel optimizer designed to counteract these spikes through periodic momentum resets and spike-aware gradient clipping, which scales down rather than zeroes out large gradients. Experiments demonstrate that SPAM consistently outperforms existing optimizers like Adam and Adafactor across various LLM sizes during both pre-training and fine-tuning. Furthermore, SPAM offers a memory-efficient version leveraging sparse momentum, enabling better performance...
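A rough sketch of the two mechanisms named above, with illustrative thresholds and intervals (not the authors' code): gradient entries that spike far above their running second-moment estimate are shrunk rather than zeroed, and the moment buffers are reset periodically.

```python
# Toy Adam loop with spike-aware clipping and periodic momentum reset.
import numpy as np

def spike_aware_clip(g, v_hat, theta=50.0):
    # Entries whose magnitude far exceeds their running RMS are shrunk, not zeroed.
    limit = np.sqrt(theta * v_hat)
    spikes = (v_hat > 0) & (np.abs(g) > limit)
    return np.where(spikes, np.sign(g) * limit, g)

rng = np.random.default_rng(0)
w, m, v = np.zeros(8), np.zeros(8), np.zeros(8)
b1, b2, lr = 0.9, 0.999, 1e-3
reset_every, steps = 500, 0                          # steps since the last moment reset

for t in range(1, 2001):
    g = rng.standard_normal(8) * 0.01
    if t == 1234:
        g[0] = 100.0                                 # simulate a rare, massive gradient spike
    if steps > 0:
        g = spike_aware_clip(g, v / (1 - b2 ** steps))
    if t % reset_every == 0:
        m[:] = 0.0
        v[:] = 0.0                                   # periodic momentum reset
        steps = 0
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    steps += 1
    w -= lr * m / (np.sqrt(v) + 1e-8)

print(w)
```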
2025-08-27
17 min
AI: post transformers
Google: Measuring AI's Environmental Impact at Scale
This August 2025 paper presents Google's comprehensive methodology for measuring the environmental impact of AI inference workloads in a large-scale production environment. It addresses a critical gap in existing research by accounting for the full stack of AI serving infrastructure, including active AI accelerator power, host system energy, idle machine capacity, and data center overhead. The paper reveals that a median Gemini Apps text prompt consumes significantly less energy, carbon emissions, and water than many prior public estimates. Furthermore, it highlights Google's efforts in software efficiency and clean energy procurement, which ha...
2025-08-26
17 min
AI: post transformers
ComoRAG: Cognitively Inspired Narrative Reasoning
This August 2025 paper introduces ComoRAG, a novel framework designed to enhance long-context narrative comprehension in Large Language Models (LLMs) by simulating human metacognitive regulation. It addresses the limitations of existing Retrieval-Augmented Generation (RAG) methods, which struggle with stateful reasoning and integrating contradictory evidence over extended narratives. ComoRAG employs a dynamic cognitive loop that includes a hierarchical knowledge source (veridical, semantic, and episodic layers) and a dynamic memory workspace to continuously acquire new evidence and consolidate knowledge. Experimental results demonstrate ComoRAG's superior performance, particularly in solving com...
2025-08-26
13 min
AI: post transformers
Quantizing Diffusion LLMs: A Systematic Study
This August 2025 academic paper explores the application of post-training quantization (PTQ) to diffusion large language models (dLLMs), a promising alternative to traditional autoregressive LLMs for natural language generation. The authors conduct a systematic study to understand how existing PTQ techniques, commonly used for compressing AR LLMs, perform with dLLMs. A key finding is the prevalence of activation outliers in dLLMs, which pose a significant challenge for low-bit quantization. The research also evaluates the effectiveness of various quantization methods, bit-widths, task types, and model variants, concluding that 4-bit quantization is optimal for weight-only methods like GPTQ, while 8-bit is tolerable...
2025-08-26
24 min
AI: post transformers
ODYSSEY: Unified Mobile Manipulation for Agile Quadruped Robots
This August 2025 paper introduces ODYSSEY, a comprehensive framework for open-world mobile manipulation that integrates robotic mobility, manipulation, and real-time perception. It highlights a novel approach that uses large language models for high-level task planning and vision-language models for fine-grained action guidance, enabling robots to adaptively interact in complex environments. A significant contribution is the first comprehensive benchmark for long-horizon mobile manipulation, featuring diverse daily tasks in both indoor and outdoor settings to thoroughly evaluate embodied reasoning, planning, navigation, and manipulation capabilities. The system demonstrates strong sim-to-real transfer performan...
2025-08-26
21 min
AI: post transformers
GPT-5 Spatial Intelligence: An Empirical Study
This August 2025 academic paper, titled "Has GPT-5 Achieved Spatial Intelligence? An Empirical Study," examines the spatial understanding and reasoning capabilities of advanced multi-modal AI models, including the recently released GPT-5. The authors propose a new taxonomy for spatial tasks and evaluate both proprietary and open-source models against eight key benchmarks, utilizing over a billion tokens for their study. Their findings indicate that while GPT-5 shows unprecedented strength in spatial intelligence, it still falls short of human performance across a broad range of tasks. The research also identifies specific challeng...
2025-08-24
18 min
AI: post transformers
DeepSeek-V3.1: A Hybrid AI Model with Enhanced Reasoning
This is a review of DeepSeek's latest release announced on Hugging Face on August 21, 2025. The source introduces DeepSeek-V3.1, a hybrid large language model that supports both "thinking" and "non-thinking" operational modes, distinguishable through different chat templates. This updated model offers smarter tool calling capabilities and improved thinking efficiency, providing faster responses with comparable answer quality to previous versions. Built upon a two-phase long context extension, DeepSeek-V3.1 has expanded its training dataset significantly to enhance its understanding and generation of longer documents. The document also provides detailed chat templates for various interaction types, including multi-turn conversations and tool-calling scenarios for...
2025-08-23
13 min
AI: post transformers
Compressed Experts: Efficient MoE Model Editing
This March 2025 paper introduces compressed experts, an innovative method to enhance the efficiency of Mixture-of-Experts (MoE) models by reducing computational overhead while preserving performance. The core idea involves replacing less critical "auxiliary experts" with lightweight, compact representations, called compressed experts, during fine-tuning. This strategy allows for a significant reduction in activated parameters and inference costs—over 30% and 20% respectively, as demonstrated on models like Phi-MoE and OLMoE—while retaining more than 90% of the full model's performance. The paper details the method of identifying and aggregating these compressed experts and highlights their part...
2025-08-23
21 min
AI: post transformers
Genie 3: A New Frontier for World Models
The source provides an overview of Google DeepMind's AI research and models, highlighting various applications across different scientific disciplines and creative fields. It introduces Genie 3, a general-purpose world model capable of generating diverse, interactive, real-time environments from text prompts. The document details Genie 3's capabilities, such as simulating physical properties, natural worlds, and fictional scenarios, while also addressing its limitations and the company's commitment to responsible AI development. Ultimately, the text positions Genie 3 as a significant advancement for AI research and generative media, with potential for education, training, and embodied agent development. Source:
2025-08-23
16 min
AI: post transformers
Los Alamos: overcoming the memory wall fighting sparse memory access
We review Los Alamos National Laboratory's advancements in managing indirect memory accesses in high-performance computing and their relationship to overcoming the memory wall. The first goal of DoE’s next-generation supercomputer, ATS-5, is “Overcoming the memory wall: continued memory bandwidth performance improvements for tri-lab applications.” "DX100" introduces a programmable data access accelerator designed to improve memory bandwidth utilization for irregular applications by reordering, coalescing, and interleaving memory requests. This accelerator aims to offload bulk indirect memory operations from CPU cores, thus reducing instruction count and cache misses. Complementing this, "A Workflow for the Sy...
2025-08-21
29 min
AI: post transformers
Switch Transformers: Trillion Parameter Models with Sparsity
This June 2022 paper introduces Switch Transformers, a novel architecture designed to enhance the efficiency and scalability of large-scale language models. Unlike traditional models that reuse the same parameters, Switch Transformers employ a Mixture-of-Experts (MoE) approach, activating different parameters for each input to achieve a sparsely-activated model with significantly more parameters at a constant computational cost. The authors simplify the MoE routing algorithm and implement improved training techniques to overcome prior limitations such as complexity, communication overhead, and instability. The paper demonstrates that Switch Transformers achieve substantial pre-training speedups and performance gains across various natural language tasks, including multilingual settings...
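The simplified routing can be sketched in a few lines: a softmax router picks a single expert per token (top-1), and that expert's output is scaled by the router probability. The load-balancing loss, capacity factor, and distributed execution from the paper are omitted, and all shapes are illustrative.

```python
# Minimal sketch of Switch-style top-1 expert routing.
import numpy as np

def switch_layer(x, router_w, experts):
    # x: [tokens, d_model]; router_w: [d_model, n_experts]
    logits = x @ router_w
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    expert_idx = probs.argmax(axis=-1)                 # top-1: a single expert per token
    gate = probs[np.arange(len(x)), expert_idx]

    out = np.zeros_like(x)
    for e, (w_in, w_out) in enumerate(experts):
        mask = expert_idx == e
        if mask.any():                                 # only the chosen expert runs for these tokens
            h = np.maximum(x[mask] @ w_in, 0.0)
            out[mask] = gate[mask][:, None] * (h @ w_out)
    return out

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, tokens = 16, 64, 4, 32
experts = [(rng.standard_normal((d_model, d_ff)) * 0.1,
            rng.standard_normal((d_ff, d_model)) * 0.1) for _ in range(n_experts)]
router_w = rng.standard_normal((d_model, n_experts)) * 0.1
print(switch_layer(rng.standard_normal((tokens, d_model)), router_w, experts).shape)
```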
2025-08-20
18 min
AI: post transformers
Linear Transformers: Faster Than RNNs
This August 2020 paper introduces linear transformers, a novel approach to addressing the computational and memory inefficiencies of traditional transformer models, particularly for long sequences. By reframing the self-attention mechanism using a linear dot-product of kernel feature maps, the authors reduce the computational complexity from quadratic to linear, enabling significantly faster autoregressive inference. The research highlights the relationship between transformers and recurrent neural networks (RNNs), demonstrating that a causally masked transformer can be expressed as an RNN, thus allowing for constant time and memory per predictio...
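In the causal case the reformulation reduces to two running sums, which is exactly the RNN view: constant state, one update per token. Below is a minimal sketch using the elu(x)+1 feature map from the paper; shapes and sizes are illustrative.

```python
# Causal linear attention: keep running sums S and z instead of the full T x T attention matrix.
import numpy as np

def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))      # elu(x) + 1 feature map

def causal_linear_attention(Q, K, V, eps=1e-6):
    # Q, K: [T, d_k]; V: [T, d_v]
    d_k, d_v = Q.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))                        # running sum of phi(k) v^T
    z = np.zeros(d_k)                               # running sum of phi(k)
    out = np.zeros_like(V)
    for t in range(len(Q)):
        q, k, v = phi(Q[t]), phi(K[t]), V[t]
        S += np.outer(k, v)
        z += k
        out[t] = (q @ S) / (q @ z + eps)            # numerator / normalizer of softmax-free attention
    return out

rng = np.random.default_rng(0)
T, d = 10, 8
print(causal_linear_attention(rng.standard_normal((T, d)),
                              rng.standard_normal((T, d)),
                              rng.standard_normal((T, d))).shape)
```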
2025-08-20
14 min
AI: post transformers
Speed Always Wins: Efficient Large Language Model Architectures
This August 2025 survey paper explores efficient architectures for large language models (LLMs), addressing the computational challenges of models like Transformers. It categorizes advancements into linear sequence modeling, including linear attention and state-space models, which offer linear computational complexity. The document also examines sparse sequence modeling, such as static and dynamic sparse attention, designed to reduce computational demands by limiting interactions between elements. Furthermore, it discusses methods for efficient full attention, including IO-aware and grouped attention, and introduces sparse Mixture-of-Experts (MoE) models, which enhance efficiency through conditional compu...
2025-08-20
26 min
AI: post transformers
Atom: Low-Bit Quantization for LLM Serving
This April 2024 paper introduces Atom, a novel low-bit quantization method designed to enhance the efficiency and accuracy of Large Language Model (LLM) serving. The core challenge addressed is the high computational and memory costs associated with LLMs, especially when accommodating numerous user requests. Atom tackles this by quantizing both weights and activations to low-bit representations, like 4-bit, which significantly reduces memory consumption and boosts throughput by leveraging modern GPU capabilities. It maintains accuracy through mixed-precision quantization, fine-grained group quantization, and dynamic quantization, demonstrating substantial improvements in tokens per second with n...
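One of the listed ingredients, fine-grained group quantization, can be sketched simply: each small group of values gets its own scale, which keeps per-group quantization error low. Atom's mixed-precision outlier handling and fused GPU kernels are not shown, and the group size below is illustrative.

```python
# Symmetric 4-bit group quantization: one scale per group of 128 values.
import numpy as np

def quantize_groups(x, group_size=128, bits=4):
    qmax = 2 ** (bits - 1) - 1                        # symmetric range, e.g. -8..7 for 4 bits
    groups = x.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax + 1e-12
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize_groups(q, scales, shape):
    return (q.astype(np.float32) * scales).reshape(shape)

x = np.random.randn(4096).astype(np.float32)
q, scales = quantize_groups(x)
error = np.abs(dequantize_groups(q, scales, x.shape) - x).mean()
print(q.dtype, scales.size, f"mean abs error ~ {error:.4f}")
```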
2025-08-19
17 min
AI: post transformers
Continuous Batching for LLM Inference: Throughput and Latency Gains
The source analyzes Large Language Model (LLM) inference, specifically focusing on how continuous batching significantly improves efficiency compared to traditional static batching. It explains the inefficiencies of static batching where GPUs are underutilized due to varying output lengths in a batch, and introduces continuous batching (also known as dynamic batching or iteration-level scheduling) as a solution that dynamically adds new requests as others complete. The document further highlights PagedAttention and vLLM as advanced memory optimization techniques built upon continuous batching, leading to even greater throu...
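A toy scheduler makes the difference concrete: after every decode iteration, finished sequences leave the batch and waiting requests are admitted immediately, rather than idling until the longest sequence in a static batch completes. Queue sizes and output lengths below are made up for illustration.

```python
# Toy iteration-level (continuous) batching loop.
from collections import deque
import random

random.seed(0)
waiting = deque({"id": i, "remaining": random.randint(1, 20)} for i in range(12))
running, max_batch, step = [], 4, 0

while waiting or running:
    # Admit new requests into free slots (this is the "continuous" part).
    while waiting and len(running) < max_batch:
        running.append(waiting.popleft())

    # One decode iteration: every running sequence emits one token.
    for req in running:
        req["remaining"] -= 1
    finished = [r["id"] for r in running if r["remaining"] == 0]
    running = [r for r in running if r["remaining"] > 0]
    step += 1
    if finished:
        print(f"step {step}: finished {finished}, batch now {len(running)}")

print(f"total decode steps: {step}")
```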
2025-08-19
39 min
AI: post transformers
Self-Search Reinforcement Learning for LLMs
This August 2025 paper introduces Self-Search Reinforcement Learning (SSRL), a novel method that enables Large Language Models (LLMs) to access and utilize their internal knowledge for search-driven tasks, bypassing the need for external search engines like Google or Bing. The research explores how repeated sampling can enhance an LLM's intrinsic search capabilities and investigates the impact of various prompting strategies and training methodologies, including the benefits of information masking and format-based rewards. The paper demonstrates that SSRL-trained models can effectively gen...
2025-08-18
13 min
AI: AX - introspection
GoldenMagikCarp
These two sources from LessWrong explore the phenomenon of "glitch tokens" within Large Language Models (LLMs) like GPT-2, GPT-3, and GPT-J. The authors, Jessica Rumbelow and mwatkins, detail how these unusual strings, often derived from web scraping of sources like Reddit or game logs, cause anomalous behaviors in the models, such as evasion, bizarre responses, or refusal to repeat the token. They hypothesize that these issues stem from the tokens being rarely or poorly represented in the models' training data, leading to unpredictable outcomes and non-deterministic responses, even at zero temperature. The second source provides further technical details and...
2025-08-09
16 min
AI: AX - introspection
Route Sparse Autoencoder to Interpret Large Language Models
This paper introduces Route Sparse Autoencoder (RouteSAE), a novel framework designed to improve the interpretability of large language models (LLMs) by effectively extracting features across multiple layers. Traditional sparse autoencoders (SAEs) primarily focus on single-layer activations, failing to capture how features evolve through different depths of an LLM. RouteSAE addresses this by incorporating a routing mechanism that dynamically assigns weights to activations from various layers, creating a unified feature space. This approach leads to a higher number of interpretable features and improved interpretability scores compared to previous methods li...
2025-08-09
12 min
AI: AX - introspection
HarmBench: Automated Red Teaming for LLM Safety
This paper introduces HarmBench, a new framework for evaluating the safety and robustness of large language models (LLMs) against malicious use. It highlights the growing concern over LLMs' potential for harm, such as generating malware or designing biological weapons, and emphasizes the need for automated red teaming—a process of identifying vulnerabilities—due to the scalability limitations of manual methods. HarmBench addresses the previous lack of standardized evaluation by offering a comprehensive benchmark with diverse harmful behaviors, including contextual and multimodal scenarios, and robust, comparable metrics for assessing attack success rates. The document also prese...
2025-08-09
22 min
AI: AX - introspection
Jailbreaking LLMs
A long list of papers and articles on jailbreaking LLMs is reviewed. These sources primarily explore methods for bypassing safety measures in Large Language Models (LLMs), often referred to as "jailbreaking," and proposed defense mechanisms. One key area of research involves "abliteration," a technique that directly modifies an LLM's internal activations to remove censorship without traditional fine-tuning. Another significant approach, "Speak Easy," enhances jailbreaking by decomposing harmful requests into smaller, multilingual sub-queries, significantly increasing the LLMs' susceptibility to generating undesirable content. Additionally, "Sugar-Coated Poison" investigates integrating benign content with adversarial reasoning to create effective...
2025-08-09
10 min
AI: AX - introspection
PA-LRP & absLRP
We focus on two evolutions of AX, both of which advance the explainability of deep neural networks, particularly Transformers, by improving Layer-Wise Relevance Propagation (LRP) methods. One source introduces Positional Attribution LRP (PA-LRP), a novel approach that addresses the oversight of positional encoding in prior LRP techniques, showing it significantly enhances the faithfulness of explanations in areas like natural language processing and computer vision. The other source proposes Relative Absolute Magnitude Layer-Wise Relevance Propagation (absLRP) to overcome issues with conflicting relevance values and varying activation magnitudes in existing LRP rules, demonstrating its superior performance in generating clear, contrastive, and...
2025-08-09
19 min
AI: AX - introspection
AttnLRP: Explainable AI for Transformers
This 2024 paper introduces AttnLRP, a novel method for explaining the internal reasoning of transformer models, including Large Language Models (LLMs) and Vision Transformers (ViTs). It extends Layer-wise Relevance Propagation (LRP) by introducing new rules for non-linear operations like softmax and matrix multiplication within attention layers, improving faithfulness and computational efficiency compared to existing methods. The paper highlights AttnLRP's ability to provide attributions for latent representations, enabling the identification and manipulation of "knowledge neurons" within these complex models. Experimental resul...
2025-08-09
16 min
AI: AX - introspection
Pixel-Wise Explanations for Non-Linear Classifier Decisions
This open-access research article from PLOS One introduces Layer-wise Relevance Propagation (LRP), a novel method for interpreting decisions made by complex, non-linear image classifiers. The authors, an international team of researchers, explain how LRP can decompose a classification decision down to the individual pixels of an input image, generating a heatmap that visualizes their contribution. This technique aims to make "black box" machine learning models, like neural networks and Bag of Words (BoW) models, more transparent by showing why a system arrives at a particular classification. The paper evaluates LRP on various datasets, including PASCAL VOC images and MNIST...
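For a single dense layer, the commonly used LRP-epsilon rule can be written in a few lines: relevance at each output neuron is redistributed to its inputs in proportion to their contributions z_ij = a_i * w_ij, so the total relevance is approximately conserved layer by layer. This is a generic sketch, not the exact rule set evaluated in the paper, and the layer sizes are illustrative.

```python
# LRP-epsilon relevance redistribution through one dense layer.
import numpy as np

def lrp_epsilon_dense(a, W, b, R_out, eps=1e-6):
    # a: [d_in], W: [d_in, d_out], b: [d_out], R_out: [d_out]
    z = a[:, None] * W                                 # per-connection contributions z_ij = a_i * w_ij
    denom = z.sum(axis=0) + b
    denom = denom + eps * np.sign(denom)               # epsilon stabiliser of the LRP-eps rule
    return (z * (R_out / denom)[None, :]).sum(axis=1)  # relevance pulled back onto the inputs

rng = np.random.default_rng(0)
a = rng.random(5)                                      # input activations
W = rng.standard_normal((5, 3))
b = np.zeros(3)                                        # zero bias so conservation is exact here;
                                                       # in general the bias absorbs some relevance
out = a @ W + b
R_in = lrp_epsilon_dense(a, W, b, R_out=out)           # seed relevance with the output activations
print(R_in, float(R_in.sum()), float(out.sum()))       # the two sums match: relevance is conserved
```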
2025-08-09
19 min
AI: AX - introspection
Multi-Layer Sparse Autoencoders for Transformer Interpretation
This paper introduces the Multi-Layer Sparse Autoencoder (MLSAE), a novel approach for interpreting the internal representations of transformer language models. Unlike traditional Sparse Autoencoders (SAEs) that analyze individual layers, MLSAEs are trained across all layers of a transformer's residual stream, enabling the study of information flow across layers. The research found that while individual "latents" (features learned by the SAE) tend to be active at a single layer for a given input, they are active at multiple layers when aggregated over many inputs, with this multi-layer activity increasing in larger models. The authors also explored the effect of "tuned-lens"...
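A hedged sketch of the core training idea follows: a single sparse autoencoder whose batches mix residual-stream activations drawn from every layer, so one latent dictionary is shared across depths. A plain ReLU-plus-L1 autoencoder trained on synthetic activations stands in here; the paper's exact architecture, sparsity penalty, and training setup may differ.

```python
# One sparse autoencoder trained on activations pooled from all layers.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_layers, lr, l1 = 32, 128, 6, 1e-2, 1e-3
W_enc = rng.standard_normal((d_model, d_latent)) * 0.1
W_dec = rng.standard_normal((d_latent, d_model)) * 0.1
b_enc = np.zeros(d_latent)

def fake_residual_stream(batch=64):
    # Stand-in for activations collected from every layer of a transformer.
    layer = rng.integers(0, n_layers, size=batch)
    return rng.standard_normal((batch, d_model)) + layer[:, None] * 0.1

for step in range(200):
    x = fake_residual_stream()
    h = np.maximum(x @ W_enc + b_enc, 0.0)             # sparse latents shared by all layers
    x_hat = h @ W_dec
    err = x_hat - x
    # Manual gradients for the reconstruction + L1 objective.
    dW_dec = h.T @ err / len(x)
    dh = (err @ W_dec.T + l1 * np.sign(h)) * (h > 0)
    dW_enc = x.T @ dh / len(x)
    b_enc -= lr * dh.mean(axis=0)
    W_dec -= lr * dW_dec
    W_enc -= lr * dW_enc

print(float((h > 0).mean()))                           # fraction of active latents (sparsity)
```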
2025-08-09
14 min