Showing episodes and shows of Arize AI
Shows
Deep Papers
Self-Adapting Language Models: Paper Authors Discuss Implications
The authors of the new paper *Self-Adapting Language Models (SEAL)* shared a behind-the-scenes look at their work, motivations, results, and future directions. The paper introduces a novel method for enabling large language models (LLMs) to adapt their own weights using self-generated data and training directives — “self-edits.” Learn more about the Self-Adapting Language Models paper. Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
2025-07-08
31 min
Supra Insider
#63: How PMs can bring predictability to AI products | Aman Khan (Head of Product @ Arize AI, ex-Spotify, ex-Apple)
If you’ve ever launched an AI feature and later realized it wasn’t quite ready—or struggled to define what “quality” even looks like in an AI product—this episode is for you. In this episode of Supra Insider, Marc and Ben sit down with Aman Khan, Head of Product at Arize AI and a leading voice in AI product evaluation. Aman has helped dozens of teams—from scrappy startups to massive consumer platforms—build systematic approaches for evaluating LLM-powered features. Together, they explore the dangers of “vibe coding” your way to production, how to define and operationalize e...
2025-06-23
56 min
Deep Papers
The Illusion of Thinking: What the Apple AI Paper Says About LLM Reasoning
This week we discuss The Illusion of Thinking, a new paper from researchers at Apple that challenges today’s evaluation methods and introduces a new benchmark: synthetic puzzles with controllable complexity and clean logic. Their findings? Large Reasoning Models (LRMs) show surprising failure modes, including a complete collapse on high-complexity tasks and a decline in reasoning effort as problems get harder. Dylan and Parth dive into the paper's findings as well as the debate around it, including a response paper aptly titled "The Illusion of the Illusion of Thinking." Read the paper: The Ill...
2025-06-20
30 min
Deep Papers
Accurate KV Cache Quantization with Outlier Tokens Tracing
We discuss Accurate KV Cache Quantization with Outlier Tokens Tracing, a deep dive into improving the efficiency of LLM inference. The authors enhance KV Cache quantization, a technique for reducing memory and compute costs during inference, by introducing a method to identify and exclude outlier tokens that hurt quantization accuracy, striking a better balance between efficiency and performance. Read the paper. Access the slides. Read the blog. Join us for Arize Observe. Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
2025-06-04
25 min
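To make the episode's core idea concrete, here is a toy numpy sketch of per-token KV cache quantization that traces outlier tokens and keeps them in full precision while quantizing the rest to int8. The function names and the outlier fraction are our own illustration, not the paper's code.

```python
import numpy as np

def quantize_kv(kv: np.ndarray, outlier_frac: float = 0.01):
    """Quantize a KV cache slice of shape (seq_len, head_dim) to int8.

    Tokens with unusually large activations blow up the quantization
    scale, so the top `outlier_frac` of tokens (by peak magnitude) are
    traced and kept in full precision.
    """
    seq_len = kv.shape[0]
    peak = np.abs(kv).max(axis=-1)                   # per-token peak magnitude
    n_out = max(1, int(seq_len * outlier_frac))
    mask = np.zeros(seq_len, dtype=bool)
    mask[np.argsort(peak)[-n_out:]] = True           # outlier tokens

    scale = np.abs(kv[~mask]).max(axis=-1, keepdims=True) / 127.0 + 1e-8
    q = np.clip(np.round(kv[~mask] / scale), -127, 127).astype(np.int8)
    return q, scale, kv[mask], mask

def dequantize_kv(q, scale, outliers, mask):
    out = np.empty((mask.size, q.shape[1] if q.size else outliers.shape[1]),
                   dtype=np.float32)
    out[~mask] = q.astype(np.float32) * scale        # cheap int8 path
    out[mask] = outliers                             # full-precision outliers
    return out
```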
Deep Papers
Scalable Chain of Thoughts via Elastic Reasoning
In this week's episode, we talk about Elastic Reasoning, a novel framework designed to enhance the efficiency and scalability of large reasoning models by explicitly separating the reasoning process into two distinct phases: thinking and solution. This separation allows for independent allocation of computational budgets, addressing challenges related to uncontrolled output lengths in real-world deployments with strict resource constraints. Our discussion explores how Elastic Reasoning contributes to more concise and efficient reasoning, even in unconstrained settings, and its implications for deploying LRMs in resource-limited environments. Read the paper. Join us li...
2025-05-16
28 min
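A minimal sketch of the two-phase budget split described above, assuming a generic `generate(prompt, max_tokens=...)` completion function; the `<think>` delimiter convention and the budget numbers are illustrative, not the paper's implementation.

```python
def elastic_reason(generate, prompt, think_budget=512, answer_budget=256):
    # Phase 1: thinking, capped independently of the answer budget.
    thoughts = generate(prompt + "\n<think>\n", max_tokens=think_budget)
    # Even if the budget ran out mid-thought, close the block anyway: the
    # solution phase always gets its full, separately reserved budget.
    scaffold = prompt + "\n<think>\n" + thoughts + "\n</think>\nAnswer:"
    # Phase 2: solution, with its own guaranteed budget.
    return generate(scaffold, max_tokens=answer_budget)
```

The point of the split is that a tight total budget can no longer starve the answer: truncating thoughts degrades gracefully, while truncating the solution does not.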
AWS re:Think Podcast
Episode 40: AI Observability and Evaluation with Arize AI
AI can still sometimes hallucinate and give less than optimal answers. To address this, we are joined by Arize AI’s Co-Founder Aparna Dhinakaran for a discussion on Observability and Evaluation for AI. We begin by discussing the challenges of AI Observability and Evaluation. For example, how does “LLM as a Judge” work? We conclude with some valuable advice from Aparna for first-time entrepreneurs. Begin Observing and Evaluating your AI Applications with Open Source Phoenix: https://phoenix.arize.com/ AWS Hosts: Nolan Chen & Malini Chatterjee. Email Your Feedback: rethinkpodcast@amazon.com
2025-05-07
39 min
Deep Papers
Sleep-time Compute: Beyond Inference Scaling at Test-time
What if your LLM could think ahead—preparing answers before questions are even asked? In this week's paper read, we dive into a groundbreaking new paper from researchers at Letta, introducing sleep-time compute: a novel technique that lets models do their heavy lifting offline, well before the user query arrives. By predicting likely questions and precomputing key reasoning steps, sleep-time compute dramatically reduces test-time latency and cost—without sacrificing performance. We explore new benchmarks—Stateful GSM-Symbolic, Stateful AIME, and the multi-query extension of GSM—that show up to 5x lower compute at inference, 2.5x lower cost per q...
2025-05-02
30 min
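A toy sketch of the mechanism the blurb describes: before any user shows up, anticipate likely questions about a context and cache worked-out reasoning; at query time, answer from the cache. `llm` stands in for any completion function; all names are ours, not Letta's.

```python
def sleep_time_pass(llm, context, n_questions=5):
    """Offline phase: predict likely questions and precompute answers."""
    qs = llm(f"Context:\n{context}\nList {n_questions} questions a user is "
             "likely to ask about this context, one per line.")
    cache = {}
    for q in (line.strip() for line in qs.splitlines() if line.strip()):
        cache[q] = llm(f"Context:\n{context}\nQuestion: {q}\n"
                       "Think step by step, then answer.")
    return cache

def answer(llm, context, cache, user_q):
    """Online phase: cheap cache hit, or fall back to normal inference."""
    if user_q in cache:
        return cache[user_q]
    return llm(f"Context:\n{context}\nQuestion: {user_q}\nAnswer:")
```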
Learning from Machine Learning
Aman Khan: Arize, Evaluating AI, Designing for Non-Determinism | Learning from Machine Learning #11
On this episode of Learning from Machine Learning, I had the privilege of speaking with Aman Khan, Head of Product at Arize AI. Aman shared how evaluating AI systems isn't just a step in the process—it's a machine learning challenge in and of itself. Drawing powerful analogies between mechanical engineering and AI, he explained, "Instead of tolerances in manufacturing, you're designing for non-determinism," reminding us that complexity often breeds opportunity. Aman's journey from self-driving cars to ML evaluation tools highlights the critical importance of robust systems that can handle failure. He encourages teams to clearly define outcomes, br...
2025-04-29
1h 07
Deep Papers
LibreEval: The Largest Open Source Benchmark for RAG Hallucination Detection
For this week's paper read, we dive into our own research. We wanted to create a replicable, evolving dataset that can keep pace with model training so that you always know you're testing with data your model has never seen before. We also saw the prohibitively high cost of running LLM evals at scale, and have used our data to fine-tune a series of SLMs that perform just as well as their base LLM counterparts, but at 1/10 the cost. So, over the past few weeks, the Arize team generated the largest public dataset of hallucinations...
2025-04-18
27 min
Uncommon Love
Uncommon Love Session 7: It's Time to Arize in Faith, Love, and Obedience.
Join us for Session 7 of Uncommon Love featuring Pastor Keith as we talk through Love, Repentance, Obedience, & unpack key truths from his book Destine to ARIZE, explore the real meaning of love from a biblical lens, and discuss how God’s vision for relationship transforms our lives. Whether you're rebuilding your heart or growing in faith, this session will inspire you to rise with grace and boldness.#Beaplus #belovedplus #CapCut #shorts #LoveyouMore
2025-04-13
31 min
Deep Papers
AI Benchmark Deep Dive: Gemini 2.5 and Humanity's Last Exam
This week we talk about modern AI benchmarks, taking a close look at Google's recent Gemini 2.5 release and its performance on key evaluations, notably Humanity's Last Exam (HLE). In the session we covered Gemini 2.5's architecture, its advancements in reasoning and multimodality, and its impressive context window. We also talked about how benchmarks like HLE and ARC AGI 2 help us understand the current state and future direction of AI. Join us for the next live recording, or check out the latest AI research. Learn more about AI observability and evaluation, join the Arize AI Sl...
2025-04-05
26 min
A FASHION MOMENT WITH TAI CHUNN
A FASHION MOMENT WITH TAI CHUNN - Guest: Arize Igboemeka
2025-03-29
1h 12
Deep Papers
Model Context Protocol (MCP)
We cover Anthropic’s groundbreaking Model Context Protocol (MCP). Though it was released in November 2024, we've been seeing a lot of hype around it lately, and thought it was well worth digging into. Learn how this open standard is revolutionizing AI by enabling seamless integration between LLMs and external data sources, fundamentally transforming them into capable, context-aware agents. We explore the key benefits of MCP, including enhanced context retention across interactions, improved interoperability for agentic workflows, and the development of more capable AI agents that can execute complex tasks in real-world environments. Read our analysis of...
2025-03-25
15 min
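For readers who want to see what MCP looks like on the wire: the protocol is JSON-RPC 2.0, and the method names below (`tools/list`, `tools/call`) come from the published spec. Payload details are trimmed and the tool itself is hypothetical, so treat this as a shape sketch, not a complete client.

```python
import json

# A client asking an MCP server what tools it exposes.
list_tools = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# A client invoking one of those tools on the user's behalf.
call_tool = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "search_docs",                  # hypothetical tool name
        "arguments": {"query": "quarterly revenue"},
    },
}

print(json.dumps(call_tool, indent=2))
```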
Adopted
Embracing the AI Agent Revolution | Aman Khan: Director of PM - Arize AI
In this insightful conversation, Deepak Anchala and Aman Khan (Director of Product Management, Arize AI) explore the transformative impact of Agentic AI on product management and development. They discuss the rise of AI agents, the necessity for companies to adapt their products for AI integration, and the evolving role of product managers in this new landscape. The dialogue emphasizes the importance of understanding customer needs, building effective AI strategies, and the significance of observability in AI products. Takeaways: 2025 is projected to be the year of AI agents. Companies need to design products for AI agents as users. AI...
2025-03-20
32 min
Generative AI Group Podcast
Week of 2025-03-09
Alex: Hello and welcome to The Generative AI Group Digest for the week of 09 Mar 2025! Maya: We're Alex and Maya. Alex: First up, we’re talking about the buzz around ManusAI. Maya: ManusAI is being hyped quite a bit. Alex, do you get what’s unique about it compared to Anthropic or OpenAI’s Operator? Alex: Yeah, good question! Nirant joked it’s “Made in China” and challenges American dominance. Pratik did say it’s slow but great for engineering problems—not research. Maya: So is ManusAI more like a specialized tool rather than a general research assistant? Alex: It seems so. Rajesh...
2025-03-09
00 min
Deep Papers
AI Roundup: DeepSeek’s Big Moves, Claude 3.7, and the Latest Breakthroughs
This week, we're mixing things up a little bit. Instead of diving deep into a single research paper, we cover the biggest AI developments from the past few weeks. We break down key announcements, including: DeepSeek’s Big Launch Week: A look at FlashMLA (DeepSeek’s new approach to efficient inference) and DeepEP (their enhanced pretraining method). Claude 3.7 & Claude Code: What’s new with Anthropic’s latest model, and what Claude Code brings to the AI coding assistant space. Stay ahead of the curve with this fast-paced recap of the most important AI updates. We'll be back...
2025-03-01
30 min
Deep Papers
How DeepSeek is Pushing the Boundaries of AI Development
This week, we dive into DeepSeek. SallyAnn DeLucia, Product Manager at Arize, and Nick Luzio, a Solutions Engineer, break down key insights on a model that has been dominating headlines for its significant breakthrough in inference speed over other models. What’s next for AI (and open source)? From training strategies to real-world performance, here’s what you need to know. Read our analysis of DeepSeek, or dive into the latest AI research. Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
2025-02-21
29 min
Deep Papers
Multiagent Finetuning: A Conversation with Researcher Yilun Du
We talk to Google DeepMind Senior Research Scientist (and incoming Assistant Professor at Harvard), Yilun Du, about his latest paper, "Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains." This paper introduces a multiagent finetuning framework that enhances the performance and diversity of language models by employing a society of agents with distinct roles, improving feedback mechanisms and overall output quality. The method enables autonomous self-improvement through iterative finetuning, achieving significant performance gains across various reasoning tasks. It's versatile, applicable to both open-source and proprietary LLMs, and can integrate with human-feedback-based methods like RLHF or DPO, paving the...
2025-02-04
30 min
Deep Papers
Training Large Language Models to Reason in Continuous Latent Space
LLMs have typically been restricted to reason in the "language space," where chain-of-thought (CoT) is used to solve complex reasoning problems. But a new paper argues that language space may not always be the best for reasoning. In this paper read, we cover an exciting new technique from a team at Meta called Chain of Continuous Thought—also known as "Coconut." The paper, "Training Large Language Models to Reason in a Continuous Latent Space," explores the potential of allowing LLMs to reason in an unrestricted latent space instead of being constrained by natural language tokens. Read a...
2025-01-14
24 min
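A compressed sketch of the core Coconut move described above: during the thinking phase, the model's last hidden state is fed back directly as the next input embedding instead of being decoded into a token. The `model(inputs_embeds=...)` interface is a HuggingFace-style assumption, not the paper's code.

```python
import torch

def continuous_thought(model, input_embeds, n_latent_steps=4):
    """Run a few reasoning steps entirely in latent space."""
    embeds = input_embeds                       # (batch, seq, hidden)
    for _ in range(n_latent_steps):
        hidden = model(inputs_embeds=embeds).last_hidden_state
        last = hidden[:, -1:, :]                # continuous "thought": no token sampled
        embeds = torch.cat([embeds, last], dim=1)
    return embeds                               # switch back to token decoding from here
```

The key design choice is that nothing is projected through the vocabulary during these steps, so the "thought" is not forced into a discrete natural-language token.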
Behind the Craft
The AI Skill That Will Define Your PM Career in 2025 | Aman Khan (Arize)
My guest today is Aman Khan. Aman is the Director of Product at Arize AI. Last year, we both heard the Chief Product Officers of OpenAI and Anthropic share that AI evaluations will be the most important skill for PMs in 2025. Aman gave me a crash course on how to build this critical skill in our interview. Timestamps: (00:00) Evals force you to get into your user's shoes (03:42) 5 skills to build right now to become a great AI PM (07:43) How curiosity leads to better AI products (10:20) Y...
2025-01-12
46 min
Deep Papers
LLMs as Judges: A Comprehensive Survey on LLM-Based Evaluation Methods
We discuss a major survey of work and research on LLM-as-Judge from the last few years. "LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods" systematically examines the LLMs-as-Judge framework across five dimensions: functionality, methodology, applications, meta-evaluation, and limitations. This survey gives us a bird's-eye view of the advantages, limitations, and methods for evaluating the framework's effectiveness. Read a breakdown on our blog: https://arize.com/blog/llm-as-judge-survey-paper/ Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
2024-12-23
28 min
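The LLM-as-Judge pattern the survey covers is, in its simplest form, a grading prompt plus a strict output format. A minimal sketch, assuming a generic `llm` completion function; the rubric wording is ours.

```python
JUDGE_PROMPT = """You are grading an answer.
Question: {question}
Answer: {answer}
Is the answer correct, relevant, and free of hallucination?
Reply with exactly one word: PASS or FAIL."""

def judge(llm, question, answer):
    # Constraining the judge to a single token-like verdict makes the
    # output easy to parse and aggregate across thousands of examples.
    verdict = llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return verdict.strip().upper().startswith("PASS")
```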
Deep Papers
Merge, Ensemble, and Cooperate! A Survey on Collaborative LLM Strategies
LLMs have revolutionized natural language processing, showcasing remarkable versatility and capabilities. But individual LLMs often exhibit distinct strengths and weaknesses, influenced by differences in their training corpora. This diversity poses a challenge: how can we maximize the efficiency and utility of LLMs? A new paper, "Merge, Ensemble, and Cooperate: A Survey on Collaborative Strategies in the Era of Large Language Models," highlights collaborative strategies to address this challenge. In this week's episode, we summarize key insights from this paper and discuss practical implications of LLM collaboration strategies across three main approaches: merging, ensemble, and cooperation. We also...
2024-12-10
28 min
Fireside with Voxgig
Episode 226 John Gilhuly, Developer Advocate at Arize.com
John Gilhuly is the chief developer advocate at Arize.com, an AI observability monitoring platform, and he joins us for a great discussion on the rapidly evolving AI space, and what that means for a company built to work with LLM applications. He talks to us about Arize’s new open source product, “Phoenix” - the GarageBand to their Logic, and how the inspiration for building it came from how much the team feels they owe to the wider open source community. As much as it is a chance for them to build their own community, that element comes second...
2024-11-26
34 min
Deep Papers
Agent-as-a-Judge: Evaluate Agents with Agents
This week, we break down the “Agent-as-a-Judge” framework—a new agent evaluation paradigm that’s kind of like getting robots to grade each other’s homework. Where typical evaluation methods focus solely on outcomes or demand extensive manual work, this approach uses agent systems to evaluate agent systems, offering intermediate feedback throughout the task-solving process. With the power to unlock scalable self-improvement, Agent-as-a-Judge could redefine how we measure and enhance agent performance. Let's get into it! Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn an
2024-11-23
24 min
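To make the framework above concrete: instead of grading only the final answer, a judge loops over the worker agent's intermediate steps and scores each one. This is our paraphrase in a few lines; `llm` and the prompt wording are stand-ins, not the paper's implementation.

```python
def judge_trajectory(llm, task, steps):
    """Score each intermediate step of an agent's run, then the whole run."""
    feedback = []
    for i, step in enumerate(steps):
        verdict = llm(
            f"Task: {task}\nStep {i}: {step}\n"
            "Does this step make progress toward the task? "
            "Answer PASS or FAIL with one sentence of feedback.")
        feedback.append((i, verdict))
    overall = llm(f"Task: {task}\nStep-by-step feedback: {feedback}\n"
                  "Overall, did the agent solve the task? PASS or FAIL.")
    return overall, feedback
```

The intermediate feedback is what distinguishes this from plain LLM-as-Judge: it can be fed back to the worker agent for self-improvement rather than just producing a final grade.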
🧠 _
Becoming an AI PM | Aman Khan (Arize AI, ex-Spotify, Apple, Cruise)
Podcast: Lenny's Podcast: Product | Growth | Career (LS 61 · TOP 0.1%). Episode: Becoming an AI PM | Aman Khan (Arize AI, ex-Spotify, Apple, Cruise). Pub date: 2024-11-14. Get Podcast Transcript → powered by Listen411 - fast audio-to-text and summarization. Aman Khan is Director of Product at Arize AI, an observability company for AI engineers at companies like Uber, Instacart, and Discord. Previously he was an AI Product Manager at Spotify on the ML Platform team, enabling hundreds of engineers to build and ship products across the company. He has also led and...
2024-11-17
1h 17
Lenny's Podcast: Product | Growth | Career
Becoming an AI PM | Aman Khan (Arize AI, ex-Spotify, Apple, Cruise)
Aman Khan is Director of Product at Arize AI, an observability company for AI engineers at companies like Uber, Instacart, and Discord. Previously he was an AI Product Manager at Spotify on the ML Platform team, enabling hundreds of engineers to build and ship products across the company. He has also led and worked on products at Cruise, Zipline, and Apple. In our conversation, we discuss: • What is an “AI product manager”? • How to break into AI PM • What separates top 5% AI PMs • How to thrive as an individual-contributor PM • Common p...
2024-11-14
1h 17
Deep Papers
Introduction to OpenAI's Realtime API
We break down OpenAI’s realtime API. Learn how to seamlessly integrate powerful language models into your applications for instant, context-aware responses that drive user engagement. Whether you’re building chatbots, dynamic content tools, or enhancing real-time collaboration, we walk through the API’s capabilities, potential use cases, and best practices for implementation. Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
2024-11-12
29 min
Deep Papers
Swarm: OpenAI's Experimental Approach to Multi-Agent Systems
As multi-agent systems grow in importance for fields ranging from customer support to autonomous decision-making, OpenAI has introduced Swarm, an experimental framework that simplifies the process of building and managing these systems. Swarm, a lightweight Python library, is designed for educational purposes, stripping away complex abstractions to reveal the foundational concepts of multi-agent architectures. In this podcast, we explore Swarm’s design, its practical applications, and how it stacks up against other frameworks. Whether you’re new to multi-agent systems or looking to deepen your understanding, Swarm offers a straightforward, hands-on way to get started. Read a Su...
2024-10-29
46 min
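A sketch following the pattern in Swarm's README (it assumes the library is installed from GitHub and an OpenAI API key is configured): the central idea is the handoff, where a tool function returns another Agent and the conversation transfers to it.

```python
from swarm import Swarm, Agent

client = Swarm()

refunds_agent = Agent(name="Refunds", instructions="Handle refund requests.")

def transfer_to_refunds():
    # Returning an Agent from a tool call is how Swarm expresses a handoff.
    return refunds_agent

triage_agent = Agent(
    name="Triage",
    instructions="Route the user to the right agent.",
    functions=[transfer_to_refunds],
)

response = client.run(
    agent=triage_agent,
    messages=[{"role": "user", "content": "I want my money back."}],
)
print(response.messages[-1]["content"])
```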
Deep Papers
KV Cache Explained
In this episode, we dive into the intriguing mechanics behind why chat experiences with models like GPT often start slow but then rapidly pick up speed. The key? The KV cache. This essential but under-discussed component enables the seamless and snappy interactions we expect from modern AI systems. Harrison Chu breaks down how the KV cache works, how it relates to the transformer architecture, and why it's crucial for efficient AI responses. By the end of the episode, you'll have a clearer understanding of how top AI products leverage this technology to deliver fast, high-quality user experiences...
2024-10-24
04 min
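A toy single-head attention step with a KV cache, to make the speedup Harrison describes concrete: the history's keys and values are computed once and reused, so each new token does O(1) new projection work instead of reprocessing the whole sequence. Weights and inputs are random stand-ins.

```python
import numpy as np

d = 64
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
K_cache, V_cache = [], []

def decode_step(x):
    """x: (d,) embedding of the newest token."""
    q = x @ Wq
    K_cache.append(x @ Wk)          # only the new token's K/V are computed...
    V_cache.append(x @ Wv)
    K, V = np.stack(K_cache), np.stack(V_cache)
    att = np.exp(q @ K.T / np.sqrt(d))
    att /= att.sum()
    return att @ V                  # ...the rest is read from the cache

for token_embedding in np.random.randn(10, d):   # ten decode steps
    out = decode_step(token_embedding)
```

This is also why the first response token is slow: the prompt's K/V entries all have to be computed before the cache starts paying off.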
Deep Papers
The Shrek Sampler: How Entropy-Based Sampling is Revolutionizing LLMs
In this byte-sized podcast, Harrison Chu, Director of Engineering at Arize, breaks down the Shrek Sampler. This innovative Entropy-Based Sampling technique--nicknamed the 'Shrek Sampler'--is transforming LLMs. Harrison talks about how this method improves upon traditional sampling strategies by leveraging entropy and varentropy to produce more dynamic and intelligent responses. Explore its potential to enhance open-source AI models and enable human-like reasoning in smaller language models. Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
2024-10-16
03 min
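Our paraphrase of the entropy/varentropy gate in a few lines of numpy: confident next-token distributions get greedy decoding, uncertain ones get more exploration. The thresholds are invented for illustration; the real open-source sampler branches into more regimes.

```python
import numpy as np

def pick_token(logits, temp=1.0):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    logp = np.log(p + 1e-12)
    entropy = -(p * logp).sum()                    # average surprise
    varentropy = (p * (logp + entropy) ** 2).sum() # spread of surprise

    if entropy < 0.5 and varentropy < 0.5:         # confident: act greedily
        return int(np.argmax(p))
    if entropy > 3.0:                              # very uncertain: explore more
        temp *= 1.5
    q = p ** (1.0 / temp)
    q /= q.sum()
    return int(np.random.choice(len(q), p=q))
```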
Deep Papers
Google's NotebookLM and the Future of AI-Generated Audio
This week, Aman Khan and Harrison Chu explore NotebookLM’s unique features, including its ability to generate realistic-sounding podcast episodes from text (but this podcast is very real!). They dive into some technical underpinnings of the product, specifically the SoundStorm model used for generating high-quality audio, and how it leverages a hierarchical vector quantization approach (RVQ) to maintain consistency in speaker voice and tone throughout long audio durations. The discussion also touches on ethical implications of such technology, particularly the potential for hallucinations and the need to balance creative freedom with factual accuracy. We close out with a f...
2024-10-15
43 min
Deep Papers
Exploring OpenAI's o1-preview and o1-mini
OpenAI recently released its o1-preview, which they claim outperforms GPT-4o on a number of benchmarks. These models are designed to think more before answering and handle complex tasks better than their other models, especially science and math questions. We take a closer look at their latest crop of o1 models, and we also highlight some research our team did to see how they stack up against Claude 3.5 Sonnet--using a real-world use case. Read it on our blog: https://arize.com/blog/exploring-openai-o1-preview-and-o1-mini Learn more about AI observability a...
2024-09-27
42 min
Deep Papers
Breaking Down Reflection Tuning: Enhancing LLM Performance with Self-Learning
A recent announcement on X boasted a tuned model with pretty outstanding performance, and claimed these results were achieved through Reflection Tuning. However, people were unable to reproduce the results. We dive into some recent drama in the AI community as a jumping off point for a discussion about Reflection 70B. In 2023, there was a paper written about Reflection Tuning that this new model (Reflection 70B) draws concepts from. Reflection tuning is an optimization technique where models learn to improve their decision-making processes by “reflecting” on past actions or predictions. This method enables models to iteratively refine thei...
2024-09-19
26 min
Deep Papers
Composable Interventions for Language Models
This week, we're excited to be joined by Kyle O'Brien, Applied Scientist at Microsoft, to discuss his most recent paper, Composable Interventions for Language Models. Kyle and his team present a new framework, composable interventions, that allows for the study of multiple interventions applied sequentially to the same language model. The discussion will cover their key findings from extensive experiments, revealing how different interventions—such as knowledge editing, model compression, and machine unlearning—interact with each other. Read it on the blog: https://arize.com/blog/composable-interventions-for-language-models/ Learn more about AI observability and evaluation, join the...
2024-09-12
42 min
Let's Act: Sustainably Developing Africa and Beyond
EP. 39: A Real Game Changer: Arize! Showcasing Africa in a Positive Light
Hello Sustainable Friends! Our guest is HRH Princess Moradeun Adedoyin-Solarin. She is a Yoruba Princess, Pan-African Royal, broadcast journalist and the granddaughter of HRM Oba William Christopher Adedoyin II, Akarigbo Anoko of Sagamu, Remo Kingdom, Nigeria. She took me down a vivid, insightful and colourful journey of her Kingdom, Nigeria, Africa and the Global world as we visited the corridors of community development, people empowerment, economic development, music and how her lineage has continued to pay forward great deeds as part o...
2024-09-01
1h 07
Deep Papers
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
This week’s paper presents a comprehensive study of the performance of various LLMs acting as judges. The researchers leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs and evaluate them alongside human annotations which they find to have a high inter-annotator agreement. The study includes nine judge models and nine exam-taker models – both base and instruction-tuned. They assess the judge models’ alignment across different model sizes, families, and judge prompts to answer questions about the strengths and weaknesses of this paradigm, and what potential biases it may hold. Read it on the blog: https...
2024-08-16
39 min
Deep Papers
Breaking Down Meta's Llama 3 Herd of Models
Meta just released Llama 3.1 405B. According to them, it’s “the first openly available model that rivals the top AI models when it comes to state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation.” Will the latest Llama herd ignite new applications and modeling paradigms like synthetic data generation? Will it enable the improvement and training of smaller models, as well as model distillation? Meta thinks so. We’ll take a look at what they did here, talk about open source, and decide if we want to believe the hype. Read it on the blog: http...
2024-08-06
44 min
Behind The Scenes with Sonda
EMR Talk with Alan Tillinghast of Arize
In this episode of Behind The Scenes, Sonda chats with Alan Tillinghast, CEO of Cantata Health Solutions. Alan leads the company's growth in providing behavioral health and human services software. His extensive experience in senior leadership roles gives him a unique perspective on the challenges of implementing and adopting new technology to enhance patient care and efficiency. Together, Sonda and Alan discuss Electronic Medical Records (EMR). Drawing from personal experience, they offer insights into the current challenges and new opportunities for EMR solutions, with a focus on Arize. Listen in as they discuss: Exploring how Arize...
2024-07-31
36 min
Deep Papers
DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines
Chaining language model (LM) calls as composable modules is fueling a new way of programming, but ensuring LMs adhere to important constraints requires heuristic “prompt engineering.” The paper this week introduces LM Assertions, a programming construct for expressing computational constraints that LMs should satisfy. The researchers integrated their constructs into the recent DSPy programming model for LMs and present new strategies that allow DSPy to compile programs with LM Assertions into more reliable and accurate systems. They also propose strategies to use assertions at inference time for automatic self-refinement with LMs. They reported on four diverse case studi...
2024-07-24
33 min
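The LM Assertions construct, paraphrased outside DSPy: a constraint is checked after each LM call, and on failure the model is re-queried with the violation message appended, which is the simple self-refinement loop the episode describes. `llm` and the example constraint are ours for illustration, not DSPy's API.

```python
def assert_and_refine(llm, prompt, check, max_retries=2):
    """Re-query the model until `check` passes or retries run out."""
    out = llm(prompt)
    for _ in range(max_retries):
        ok, msg = check(out)
        if ok:
            return out
        # Feed the violation back so the model can self-correct.
        out = llm(f"{prompt}\n\nPrevious attempt: {out}\n"
                  f"It violated this constraint: {msg}\nTry again.")
    return out

# Example constraint: answers must include a bracketed citation.
check = lambda s: ("[" in s, "answer must include a bracketed citation")
```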
Deep Papers
RAFT: Adapting Language Model to Domain Specific RAG
Adapting LLMs to specialized domains (e.g., recent news, enterprise private documents) is essential, so we discuss a paper that asks how to adapt pre-trained LLMs for RAG in those domains. SallyAnn DeLucia is joined by Sai Kolasani, researcher at UC Berkeley’s RISE Lab (and Arize AI Intern), to talk about his work on RAFT: Adapting Language Model to Domain Specific RAG. RAFT (Retrieval-Augmented FineTuning) is a training recipe that improves an LLM’s ability to answer questions in an “open-book,” in-domain setting. Given a question and a set of retrieved documents, the model is trained to ignore...
2024-06-28
44 min
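A rough sketch of the data recipe as we read it from the blurb: each training example pairs a question with a mix of golden and distractor documents, and the target answer is grounded only in the golden one. The 80% mix-in rate and all names are assumptions for illustration, not the paper's settings.

```python
import random

def raft_example(question, golden_doc, distractors, answer, p_golden=0.8):
    docs = random.sample(distractors, k=min(3, len(distractors)))
    if random.random() < p_golden:     # sometimes omit the golden doc so the
        docs.append(golden_doc)        # model also learns to ignore pure noise
    random.shuffle(docs)
    prompt = "\n\n".join(f"Document: {d}" for d in docs)
    prompt += f"\n\nQuestion: {question}"
    return {"prompt": prompt, "target": answer}
```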
Deep Papers
LLM Interpretability and Sparse Autoencoders: Research from OpenAI and Anthropic
It’s been an exciting couple of weeks for GenAI! Join us as we discuss the latest research from OpenAI and Anthropic. We’re excited to chat about this significant step forward in understanding how LLMs work and the implications it has for deeper understanding of the neural activity of language models. We take a closer look at some recent research from both OpenAI and Anthropic. These two recent papers both focus on the sparse autoencoder--an unsupervised approach for extracting interpretable features from an LLM. In "Extracting Concepts from GPT-4," OpenAI researchers propose using k-sparse autoencoders to directly control sparsity, simpli...
2024-06-14
44 min
Deep Papers
Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment
We break down the paper--Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment. Ensuring alignment (aka: making models behave in accordance with human intentions) has become a critical task before deploying LLMs in real-world applications. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. To address this issue, this paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing LLM trustworthiness. The survey covers seven major categories of LLM trustworthiness: reliability, safety, fairness...
2024-05-30
48 min
Deep Papers
Breaking Down EvalGen: Who Validates the Validators?
Due to the cumbersome nature of human evaluation and limitations of code-based evaluation, Large Language Models (LLMs) are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators often inherit the problems of the LLMs they evaluate, requiring further human validation. This week’s paper explores EvalGen, a mixed-initiative approach to aligning LLM-generated evaluation functions with human preferences. EvalGen assists users in developing both criteria for acceptable LLM outputs and functions to check these standards, ensuring evaluations reflect the users’ own grading standards. Read it on the blog: https://arize.com/blog/brea...
2024-05-13
44 min
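A toy version of the alignment loop the blurb describes: candidate grader functions are scored against a handful of human thumbs-up/down labels, and only graders that agree with the humans are kept. All names and thresholds are illustrative, not EvalGen's implementation.

```python
def select_graders(candidates, labeled_outputs, min_agreement=0.8):
    """candidates: {name: fn(output) -> bool}
    labeled_outputs: [(output, human_label_bool)]"""
    kept = {}
    for name, fn in candidates.items():
        agree = sum(fn(o) == label for o, label in labeled_outputs)
        if agree / len(labeled_outputs) >= min_agreement:
            kept[name] = fn            # this grader matches human judgment
    return kept

candidates = {
    "not_too_long": lambda o: len(o.split()) < 120,
    "no_apology": lambda o: "sorry" not in o.lower(),
}
graders = select_graders(
    candidates,
    [("a short, direct reply", True), ("sorry " * 200, False)],
)
```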
Generation AI
Generation AI Podcast Episode #8 Amber Roberts
This month, we chat with Amber Roberts, Senior Technical PMM from Arize AI. We met Amber this year at the Data Council conference, where we learned a lot about the concepts around LLM Observability during her great presentation. We wanted to dive a bit deeper on the podcast. Amber chats with us about the foundational pillars of LLM observability including traces and spans and RAG/fine-tuning, the role of drift in LLMs, and how you can evaluate attributes like fairness and rudeness. Learn more about all the great projects and classes that Amber is...
2024-05-03
36 min
Deep Papers
Keys To Understanding ReAct: Synergizing Reasoning and Acting in Language Models
This week we explore ReAct, an approach that enhances the reasoning and decision-making capabilities of LLMs by combining step-by-step reasoning with the ability to take actions and gather information from external sources in a unified framework. Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest on LinkedIn and X.
2024-04-26
45 min
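A minimal ReAct-style loop, our paraphrase of the approach above: the model alternates Thought, Action, and Observation lines in a scratchpad until it emits a final answer. `llm` and the tools dict are stand-ins, and the string protocol is simplified for illustration.

```python
def react(llm, question, tools, max_steps=5):
    transcript = f"Question: {question}\nThought:"
    for _ in range(max_steps):
        step = llm(transcript)                 # model continues the scratchpad
        transcript += step
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Action:" in step:                  # e.g. "Action: search capital of France"
            name, arg = step.split("Action:")[-1].strip().split(" ", 1)
            obs = tools[name](arg)             # gather outside information
            transcript += f"\nObservation: {obs}\nThought:"
    return transcript
```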
Deep Papers
Demystifying Chronos: Learning the Language of Time Series
This week, we're covering Amazon’s time series model: Chronos. Developing accurate machine-learning-based forecasting models has traditionally required substantial dataset-specific tuning and model customization. Chronos, however, is built on a language model architecture and trained with billions of tokenized time series observations, enabling it to provide accurate zero-shot forecasts matching or exceeding purpose-built models. We dive into time series forecasting, some recent research our team has done, and take a community pulse on what people think of Chronos. Learn more about AI observability and evaluation, join the Arize AI Slack community or get the latest...
2024-04-04
44 min
Deep Papers
Anthropic Claude 3
This week we dive into the latest buzz in the AI world – the arrival of Claude 3. Claude 3 is the newest family of models in the LLM space, and Claude 3 Opus (Anthropic's "most intelligent" Claude model) challenges the likes of GPT-4. The Claude 3 family of models, according to Anthropic, "sets new industry benchmarks," and includes "three state-of-the-art models in ascending order of capability: Claude 3 Haiku, Claude 3 Sonnet, and Claude 3 Opus." Each of these models "allows users to select the optimal balance of intelligence, speed, and cost." We explore Anthropic’s recent paper, and walk through Arize’s latest resear...
2024-03-25
43 min
Deep Papers
Reinforcement Learning in the Era of LLMs
We’re exploring Reinforcement Learning in the Era of LLMs this week with Claire Longo, Arize’s Head of Customer Success. Recent advancements in Large Language Models (LLMs) have garnered wide attention and led to successful products such as ChatGPT and GPT-4. Their proficiency in adhering to instructions and delivering harmless, helpful, and honest (3H) responses can largely be attributed to the technique of Reinforcement Learning from Human Feedback (RLHF). This week’s paper aims to link the research in conventional RL to RL techniques used in LLM research and demystify this technique by discussing why, when, and how RL exc...
2024-03-15
44 min
Deep Papers
Sora: OpenAI’s Text-to-Video Generation Model
This week, we discuss the implications of Text-to-Video Generation and speculate as to the possibilities (and limitations) of this incredible technology with some hot takes. Dat Ngo, ML Solutions Engineer at Arize, is joined by community member and AI Engineer Vibhu Sapra to review OpenAI’s technical report on their Text-To-Video Generation Model: Sora. According to OpenAI, “Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user’s prompt.” At the time of this recording, the model had not been widely released yet, but was becoming available to red teamers...
2024-03-01
45 min
Dr. Howard Smith Reports
Sustain and Schwinnng Branded Male Enhancement Capsules Have Hidden Prescription Medications
Vidcast: https://www.instagram.com/p/C3TYqiBrJdu/ The FDA and the Today The World company are recalling Sustain, Schwinnng, and Arize Male Enhancement Capsules since they contain the prescription drugs tadalafil and nortadalafil, both phosphodiesterase inhibitors prescribed for erectile dysfunction. The problem is that these hidden drugs can have disastrous cardiovascular interactions with nitrate-containing drugs including nitroglycerine and isosorbide dinitrate (Isordil). Affected are: Sustain batch #s 230551 and 230571; Schwinnng lot #2108; Arize lot #2107. These products were sold on amazon.com, sustain formula.com, and Sch...
2024-02-13
01 min
MLOps.community
LLM Evaluation with Arize AI's Aparna Dhinakaran // #210
Large Language Models have taken the world by storm. But what are the real use cases? What are the challenges in productionizing them? In this event, you will hear from practitioners about how they are dealing with things such as cost optimization, latency requirements, trust of output, and debugging. You will also get the opportunity to join workshops that will teach you how to set up your use cases and skip over all the headaches. Join the AI in Production Conference on February 15 and 22 here: https://home.mlops.community/home/events/ai-in-production-2024-02-15 Aparna Dhinakaran is the C...
2024-02-09
55 min
Deep Papers
RAG vs Fine-Tuning
This week, we’re discussing "RAG vs Fine-Tuning: Pipelines, Tradeoff, and a Case Study on Agriculture." This paper explores a pipeline for fine-tuning and RAG, and presents the tradeoffs of both for multiple popular LLMs, including Llama2-13B, GPT-3.5, and GPT-4. The authors propose a pipeline that consists of multiple stages, including extracting information from PDFs, generating questions and answers, using them for fine-tuning, and leveraging GPT-4 for evaluating the results. Overall, the results point to how systems built using LLMs can be adapted to respond and incorporate knowledge across a dimension that is critical for a s...
2024-02-08
39 min
Deep Papers
HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels
We discuss HyDE: a thrilling zero-shot learning technique that combines GPT-3’s language understanding with contrastive text encoders. HyDE revolutionizes information retrieval and grounding in real-world data by generating hypothetical documents from queries and retrieving similar real-world documents. It outperforms traditional unsupervised retrievers, rivaling fine-tuned retrievers across diverse tasks and languages. This leap in zero-shot learning efficiently retrieves relevant real-world information without task-specific fine-tuning, broadening AI model applicability and effectiveness. Link to transcript and live recording: https://arize.com/blog/hyde-paper-reading-and-discussion/ Learn more about AI observability and evaluation, jo...
2024-02-02
36 min
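HyDE's retrieval trick in a few lines: generate a hypothetical answer document for the query, embed it, and retrieve the real documents nearest to it. `llm` and `embed` stand in for any generation and embedding functions; this is a sketch of the idea, not the paper's code.

```python
import numpy as np

def hyde_retrieve(llm, embed, query, corpus, corpus_vecs, k=3):
    # The fake document is closer to real answers than the raw query is,
    # which is the whole point of the technique.
    fake_doc = llm(f"Write a short passage that answers: {query}")
    v = embed(fake_doc)
    sims = corpus_vecs @ v / (
        np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(v) + 1e-8)
    return [corpus[i] for i in np.argsort(sims)[::-1][:k]]
```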
Deep Papers
Phi-2 Model
We dive into Phi-2 and some of the major differences and use cases for a small language model (SLM) versus an LLM. With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance compared to 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently-announced Google Gemini Nano 2, despite being smaller in size. Find the transcript and live recording: https://arize.com/blog/phi-2-model L...
2024-02-02
44 min
Deep Papers
A Deep Dive Into Generative's Newest Models: Gemini vs Mistral (Mixtral-8x7B)–Part I
For the last paper read of the year, Arize CPO & Co-Founder, Aparna Dhinakaran, is joined by Dat Ngo (ML Solutions Architect) and Aman Khan (Product Manager) for an exploration of the new kids on the block: Gemini and Mixtral-8x7B. There's a lot to cover, so this week's paper read is Part I in a series about Mixtral and Gemini. In Part I, we provide some background and context for Mixtral 8x7B from Mistral AI, a high-quality sparse mixture of experts model (SMoE) that outperforms Llama 2 70B on most benchmarks with 6x faster inference Mi...
2023-12-27
47 min
Deep Papers
How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings
We’re thrilled to be joined by Shuaichen Chang, LLM researcher and the author of this week’s paper to discuss his findings. Shuaichen’s research investigates the impact of prompt constructions on the performance of large language models (LLMs) in the text-to-SQL task, particularly focusing on zero-shot, single-domain, and cross-domain settings. Shuaichen and his team explore various strategies for prompt construction, evaluating the influence of database schema, content representation, and prompt length on LLMs’ effectiveness. The findings emphasize the importance of careful consideration in constructing prompts, highlighting the crucial role of table relationships and content, the effectiveness of in-domai...
2023-12-18
44 min
Deep Papers
The Geometry of Truth: Emergent Linear Structure in LLM Representation of True/False Datasets
For this paper read, we’re joined by Samuel Marks, Postdoctoral Research Associate at Northeastern University, to discuss his paper, “The Geometry of Truth: Emergent Linear Structure in LLM Representation of True/False Datasets.” Samuel and his team curated high-quality datasets of true/false statements and used them to study in detail the structure of LLM representations of truth. Overall, they present evidence that language models linearly represent the truth or falsehood of factual statements and also introduce a novel technique, mass-mean probing, which generalizes better and is more causally implicated in model outputs than other probing techniques. Fi...
2023-11-30
41 min
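Mass-mean probing, as the blurb describes it, needs no trained classifier: the probe direction is just the difference between the mean activation of true statements and the mean activation of false ones. A minimal sketch with numpy; array names are ours.

```python
import numpy as np

def mass_mean_probe(acts_true, acts_false):
    """Each input: (n_examples, hidden_dim) activations at a chosen layer."""
    return acts_true.mean(axis=0) - acts_false.mean(axis=0)

def predict_truth(theta, acts, threshold=0.0):
    # Project each activation onto the truth direction; sign gives the label.
    return (acts @ theta) > threshold
```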
Deep Papers
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
In this paper read, we discuss “Towards Monosemanticity: Decomposing Language Models Into Understandable Components,” a paper from Anthropic that addresses the challenge of understanding the inner workings of neural networks, drawing parallels with the complexity of human brain function. It explores the concept of “features” (patterns of neuron activations), providing a more interpretable way to dissect neural networks. By decomposing a layer of neurons into thousands of features, this approach uncovers hidden model properties that are not evident when examining individual neurons. These features are demonstrated to be more interpretable and consistent, offering the potential to steer model behavior and impr...
2023-11-20
44 min
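The dictionary-learning setup described above, sketched in PyTorch: a one-layer sparse autoencoder trained to reconstruct model activations, with an L1 penalty that pushes most feature activations to zero. Sizes and the penalty weight are arbitrary stand-ins, not Anthropic's configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=512, n_features=4096):
        super().__init__()
        self.enc = nn.Linear(d_model, n_features)   # overcomplete dictionary
        self.dec = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))   # feature activations, mostly zero
        return self.dec(f), f

sae = SparseAutoencoder()
x = torch.randn(64, 512)              # a batch of model activations
recon, feats = sae(x)
# Reconstruction loss plus L1 sparsity penalty on the features.
loss = ((recon - x) ** 2).mean() + 1e-3 * feats.abs().mean()
loss.backward()
```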
Deep Papers
RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models
We discuss RankVicuna, the first fully open-source LLM capable of performing high-quality listwise reranking in a zero-shot setting. While researchers have successfully applied LLMs such as ChatGPT to reranking in an information retrieval context, such work has mostly been built on proprietary models hidden behind opaque API endpoints. This approach yields experimental results that are not reproducible and non-deterministic, threatening the veracity of outcomes that build on such shaky foundations. RankVicuna provides access to a fully open-source LLM and associated code infrastructure capable of performing high-quality reranking. Find the transcript and more here: https://arize.com/blog...
2023-10-18
43 min
Deep Papers
Explaining Grokking Through Circuit Efficiency
Join Arize Co-Founder & CEO Jason Lopatecki, and ML Solutions Engineer, Sally-Ann DeLucia, as they discuss “Explaining Grokking Through Circuit Efficiency." This paper explores novel predictions about grokking, providing significant evidence in favor of its explanation. Most strikingly, the research conducted in this paper demonstrates two novel and surprising behaviors: ungrokking, in which a network regresses from perfect to low test accuracy, and semi-grokking, in which a network shows delayed generalization to partial rather than perfect test accuracy. Find the transcript and more here: https://arize.com/blog/explaining-grokking-through-circuit-efficiency-paper-reading/ Learn more about AI observability and evaluation, jo...
2023-10-17
36 min
Deep Papers
Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior
Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. In this episode, we discuss the paper, “Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior.” This episode is led by SallyAnn Delucia (ML Solutions Engineer, Arize AI), and Amber Roberts (ML Solutions Engineer, Arize AI). The research they discuss highlights that while LLMs have great generalization capabilities, they struggle to effectively predict and optimize communication to get the...
2023-09-29
42 min
Deep Papers
Skeleton of Thought: LLMs Can Do Parallel Decoding
Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. In this paper reading, we explore the ‘Skeleton-of-Thought’ (SoT) approach, aimed at reducing large language model latency while enhancing answer quality. This episode is led by Aparna Dhinakaran (Chief Product Officer, Arize AI) and Sally-Ann Delucia (ML Solutions Engineer, Arize AI), with two of the paper authors: Xuefei Ning, Postdoctoral Researcher at Tsinghua University and Zinan Lin, Senior Resear...
2023-08-30
43 min
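Skeleton-of-Thought in miniature: first ask for a skeleton of the answer, then expand every point in parallel, which is where the latency win comes from. A sketch assuming a thread-safe `llm` completion function; prompts and names are ours, not the paper's.

```python
from concurrent.futures import ThreadPoolExecutor

def skeleton_of_thought(llm, question, max_points=5):
    # Stage 1: a cheap, short skeleton of the eventual answer.
    skeleton = llm(f"{question}\nAnswer with a numbered skeleton of at most "
                   f"{max_points} short points, no elaboration.")
    points = [p for p in skeleton.splitlines() if p.strip()]
    # Stage 2: expand all points concurrently instead of token by token.
    with ThreadPoolExecutor() as pool:
        expanded = list(pool.map(
            lambda p: llm(f"{question}\nExpand this point in 2-3 sentences: {p}"),
            points))
    return "\n\n".join(expanded)
```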
Deep Papers
Llama 2: Open Foundation and Fine-Tuned Chat Models
Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. This episode is led by Aparna Dhinakaran (Chief Product Officer, Arize AI) and Michael Schiff (Chief Technology Officer, Arize AI), as they discuss the paper "Llama 2: Open Foundation and Fine-Tuned Chat Models." In this paper reading, we explore the paper “Developing Llama 2: Pretrained Large Language Models Optimized for Dialogue.” The paper introduces Llama 2, a collection of pretrained and fine-tuned large language models ranging from 7 billion to 70 billion parame...
2023-07-31
30 min
Deep Papers
Lost in the Middle: How Language Models Use Long Contexts
Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. This episode is led by Sally-Ann DeLucia and Amber Roberts, as they discuss the paper "Lost in the Middle: How Language Models Use Long Contexts." This paper examines how well language models utilize longer input contexts. The study focuses on multi-document question answering and key-value retrieval tasks. The researchers find that performance is highest when relevant information is at the beginning or end of the context. Accessing in...
2023-07-26
42 min
Deep Papers
Orca: Progressive Learning from Complex Explanation Traces of GPT-4
Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Hosted by AI Pub creator Brian Burns and Arize AI founders Jason Lopatecki and Aparna Dhinakaran, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. In this episode, we talk about Orca. Recent research focuses on improving smaller models through imitation learning using outputs from large foundation models (LFMs). Challenges include limited imitation signals, homogeneous training data, and a lack of rigorous evaluation, leading to overestimation of small model capabilities. To...
2023-07-21
42 min
Super Data Science: ML & AI Podcast with Jon Krohn
689: Observing LLMs in Production to Automatically Catch Issues
Arize's Amber Roberts and Xander Song join Jon Krohn this week, sharing invaluable insights into ML Observability, drift detection, retraining strategies, and the crucial task of ensuring fairness and ethical considerations in AI development. This episode is brought to you by Posit, the open-source data science company, by AWS Inferentia, and by Anaconda, the world's most popular Python distribution. Interested in sponsoring a SuperDataScience Podcast episode? Visit JonKrohn.com/podcast for sponsorship information. In this episode you will learn: • What is ML Observability [05:07] • What is Drift [08:18] • The different kinds of model drift [15:31] • H...
2023-06-20
1h 18
Deep Papers
Toolformer: Training LLMs To Use Tools
Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Hosted by AI Pub creator Brian Burns and Arize AI founders Jason Lopatecki and Aparna Dhinakaran, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. In this episode, we interview Timo Schick and Thomas Scialom, the Research Scientists at Meta AI behind Toolformer. "Vanilla" language models cannot access information about the external world. But what if we gave language models access to calculators, question-answer search, and other APIs to ge...
2023-03-20
34 min
Deep Papers
Hungry Hungry Hippos - H3
Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Hosted by AI Pub creator Brian Burns and Arize AI founders Jason Lopatecki and Aparna Dhinakaran, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. In this episode, we interview Dan Fu and Tri Dao, inventors of "Hungry Hungry Hippos" (aka "H3"). This language modeling architecture performs comparably to transformers, while admitting much longer context length: n log(n) rather than n^2 context scaling, for those technically inclined. Listen to le...
2023-02-13
41 min
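The "n log(n) rather than n^2" claim made concrete: H3-style state space layers replace attention with a length-n convolution, which FFTs compute in O(n log n). A toy numpy demonstration with a made-up decaying kernel; this illustrates the scaling argument, not the H3 architecture itself.

```python
import numpy as np

def long_conv(u, k):
    """Causal convolution of input u with filter k, both length n."""
    n = len(u)
    size = 2 * n                       # zero-pad to avoid circular wraparound
    y = np.fft.irfft(np.fft.rfft(u, size) * np.fft.rfft(k, size), size)
    return y[:n]                       # keep the causal part

u = np.random.randn(1024)              # input sequence
k = np.exp(-np.arange(1024) / 64.0)    # decaying, SSM-like kernel
y = long_conv(u, k)
```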
Deep Papers
ChatGPT and InstructGPT: Aligning Language Models to Human Intention
Deep Papers is a podcast series featuring deep dives on today’s seminal AI papers and research. Hosted by AI Pub creator Brian Burns and Arize AI founders Jason Lopatecki and Aparna Dhinakaran, each episode profiles the people and techniques behind cutting-edge breakthroughs in machine learning. In this first episode, we’re joined by Long Ouyang and Ryan Lowe, research scientists at OpenAI and creators of InstructGPT. InstructGPT was one of the first major applications of Reinforcement Learning with Human Feedback to train large language models, and is t...
2023-01-18
47 min
How AI Happens
Arize Founding Engineer Tsion Behailu
Arize and its founding engineer, Tsion Behailu, are leaders in the machine learning observability space. After spending a few years working as a computer scientist at Google, Tsion’s curiosity drew her to the startup world where, since the beginning of the pandemic, she has been building cutting-edge technology. Rather than doing it all manually (as many companies still do to this day), Arize AI technology helps machine learning teams detect issues, understand why they happen, and improve overall model performance. During this episode, Tsion explains why this method is so advantageous, what she loves about working in the ma...
2022-11-10
25 min
MLOps.community
Monitoring Unstructured Data // Aparna Dhinakaran & Jason Lopatecki // Lightning Sessions #2
Lightning Sessions #2 with Aparna Dhinakaran, Co-Founder and Chief Product Officer, and Jason Lopatecki, CEO and Co-Founder of Arize. Lightning Sessions is sponsored by Arize // Abstract Monitoring embeddings on unstructured data is not an easy feat, let's be honest. Most of us know what it is but don't understand it one hundred percent. Thanks to Aparna and Jason of Arize for breaking down embeddings so clearly. At the end of this Lightning talk, we get to see a demo of how Arize deals with unstructured data and how you can use Arize to combat that.
2022-09-27
10 min
The LaunchNotes Podcast | Product Management, Product Marketing, Product Ops
Press releases are not dead with Tammy Le (Arize)
We're joined by Tammy Le, VP of Marketing & Strategy at Arize. Tammy talks about launching a major new version of a product (including a free tier and new PLG motion) at their Machine Learning Observability Summit. ---- 🔗 Links: Arize: https://arize.com 🚀: Check out LaunchNotes: https://launchnotes.com 🙏: Join Launch Awesome: https://www.launchnotes.com/launch-awesome
2022-09-21
41 min
Multifamily Women® Podcast
Beyond Convenience: Arize’s Role in Enhancing Security and Efficiency in Apartments
Multifamily Leadership is an Events, News, and Media Platform. It’s where Multifamily Innovation, Technology, Investments, and Leaders converge. As part of our Multifamily Women® Summit, we share time with innovative brands as part of our “Meet the Sponsors” series that showcases those brands that are supporting the platform to advance not only technology and innovation, but the women who are making all of that possible. If you haven’t registered for the Multifamily Women® Summit yet, now is the time. Go here to grab your ticket. Arize is a smart apartment...
2022-08-27
09 min
The Multifamily Innovation® Podcast
A Multifamily Smart Home Advisor can Support you Today and into the Future
Kyle Finney is a Business Development Manager at Arize. He works with real estate professionals to attract residents. Kyle is responsible for spearheading high-return marketing and sales campaigns. Arize is a smart home technology company that focuses on sensors and alarms. Their systems find minor issues and notify maintenance before they develop into serious problems (like water leaks). They also help with security, utility optimization, and enable self-guided tours after hours. In this episode, we covered: Why partnering can lead to more benefits than just buying a product. How smart home tech as an industry i...
2022-06-09
20 min