The sources describe a variety of language models, their architectures, and training methods.

Qwen 2 is presented as a strong LLM family that is competitive with other major LLMs. The Qwen2 technical report covers models ranging from 0.5 to 72 billion parameters, including both dense and Mixture-of-Experts architectures.

Apple's Apple Intelligence Foundation Models (AFM) include a 3-billion-parameter on-device model for phones, tablets, and laptops, and a more capable server model of unspecified size. The on-device AFM is pruned from a larger 6.4-billion-parameter model and trained with a distillation loss.

Llama 3 comes in several sizes, including 8B, 70B, and 405B parameter models. The Llama 3 architecture closely resembles Llama 2, the key differences being a larger vocabulary and the introduction of grouped-query attention for the smaller models.

Jamba-1.5 comes in Mini and Large versions that use a hybrid architecture combining Transformer and Mamba layers with a Mixture-of-Experts module. Jamba-1.5-Large has 94B active parameters out of 398B total and can fit on a single machine with eight 80 GB GPUs for contexts up to 256K tokens.

NVLM-1.0 is a multimodal LLM family that comes in three architectural variants: a decoder-only architecture (NVLM-D), a cross-attention-based architecture (NVLM-X), and a hybrid approach (NVLM-H).

TÜLU 3 models were trained with Direct Preference Optimization (DPO) using length normalization, and the preference data was generated with a pipeline that covers prompt selection, response generation, and preference annotation.
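Since both Qwen2 and Jamba-1.5 rely on Mixture-of-Experts modules, a minimal sketch of a sparse MoE layer may help make the idea concrete. This is a generic PyTorch illustration, not the routing scheme from either report; the dimensions and the top-k value are made-up example values.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Toy Mixture-of-Experts layer: a router sends each token to its
    top-k experts and mixes their outputs by the routing weights."""

    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                # x: (batch, seq, d_model)
        logits = self.router(x)          # (batch, seq, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)

        out = torch.zeros_like(x)
        # Dense loop over experts for clarity; real implementations dispatch sparsely.
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                              # (batch, seq, top_k)
            if mask.any():
                gate = (weights * mask).sum(-1, keepdim=True)  # routing weight, or 0
                out = out + gate * expert(x)
        return out
```

Only the selected experts contribute to each token's output, which is why MoE models such as Jamba-1.5-Large can have far more total parameters (398B) than active parameters (94B).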
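The distillation loss used to train the on-device AFM can be illustrated with a generic formulation: blend a temperature-softened KL term against the teacher with the usual cross-entropy on the ground-truth tokens. This is a standard sketch, not the exact loss from the AFM report; the temperature and mixing weight are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.5):
    """Generic knowledge-distillation loss.

    student_logits, teacher_logits: (batch, vocab)
    targets: (batch,) ground-truth token ids
    temperature, alpha: illustrative hyperparameters, not values from the AFM report.
    """
    # Soft targets from the larger teacher model, softened by the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(log_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # Standard next-token cross-entropy on the hard labels.
    ce = F.cross_entropy(student_logits, targets)
    return alpha * kl + (1 - alpha) * ce
```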
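Grouped-query attention, mentioned above for Llama 3, shrinks the key/value cache by letting several query heads share one key/value head. Below is a minimal sketch assuming a simple repeat-and-attend implementation; the causal mask and rotary embeddings used in practice are omitted.

```python
import torch

def grouped_query_attention(q, k, v, n_query_heads=8, n_kv_heads=2):
    """Minimal grouped-query attention.

    q: (batch, seq, n_query_heads, head_dim)
    k, v: (batch, seq, n_kv_heads, head_dim)
    """
    group_size = n_query_heads // n_kv_heads
    # Repeat each KV head so it lines up with its group of query heads.
    k = k.repeat_interleave(group_size, dim=2)
    v = v.repeat_interleave(group_size, dim=2)

    q = q.transpose(1, 2)  # (batch, heads, seq, head_dim)
    k = k.transpose(1, 2)
    v = v.transpose(1, 2)

    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    attn = torch.softmax(scores, dim=-1)
    out = attn @ v
    return out.transpose(1, 2)  # back to (batch, seq, heads, head_dim)
```

With 8 query heads and 2 key/value heads, the KV cache is a quarter of the size it would be under standard multi-head attention, which is the main practical reason for applying GQA to the smaller model sizes as well.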
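Finally, the length-normalized DPO objective mentioned for TÜLU 3 can be sketched as follows, assuming length normalization is applied by averaging each response's log-probabilities over its tokens; the beta value is illustrative, not taken from the report.

```python
import torch.nn.functional as F

def dpo_loss_length_normalized(policy_chosen_logps, policy_rejected_logps,
                               ref_chosen_logps, ref_rejected_logps,
                               chosen_lens, rejected_lens, beta=0.1):
    """Length-normalized DPO loss sketch.

    *_logps: summed log-probabilities of the chosen / rejected responses
             under the policy and the frozen reference model.
    *_lens:  number of tokens in each response, used for normalization.
    beta:    illustrative value, not taken from the TULU 3 report.
    """
    # Normalize the summed log-ratios by response length before taking the margin.
    chosen_logratio = (policy_chosen_logps - ref_chosen_logps) / chosen_lens
    rejected_logratio = (policy_rejected_logps - ref_rejected_logps) / rejected_lens

    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()
```

The only difference from standard DPO here is the division by response length, which is intended to counteract the tendency of preference optimization to favor longer responses.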