Showing episodes and shows of Matei Zaharia
Shows
Daily Paper Cast
LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters!
🤗 Upvotes: 20 | cs.AI Authors: Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Shishir G. Patil, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica Title: LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters! Arxiv: http://arxiv.org/abs/2502.07374v1 Abstract: Large reasoning models (LRMs) tackle complex reasoning problems by following long chain-of-thoughts (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training techniques and data requirements to elicit Long CoT remain poorly understood. In this work, we find that a Lar...
2025-02-13
25 min
Daily Paper Cast
Drowning in Documents: Consequences of Scaling Reranker Inference
🤗 Paper Upvotes: 10 | cs.IR, cs.CL, cs.LG Authors: Mathew Jacob, Erik Lindgren, Matei Zaharia, Michael Carbin, Omar Khattab, Andrew Drozdov Title: Drowning in Documents: Consequences of Scaling Reranker Inference Arxiv: http://arxiv.org/abs/2411.11767v1 Abstract: Rerankers, typically cross-encoders, are often used to re-score the documents retrieved by cheaper initial IR systems. This is because, though expensive, rerankers are assumed to be more effective. We challenge this assumption by measuring reranker performance for full retrieval, not just re-scoring first-stage retrieval. Our experiments reveal a surprising trend: the...
2024-11-20
21 min
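The pipeline this paper stress-tests is the standard retrieve-then-rerank stack: a cheap first-stage retriever narrows the corpus, and an expensive cross-encoder re-scores the candidates. A minimal sketch, with random vectors and a dummy scorer standing in for a real embedding model and cross-encoder:

```python
import numpy as np

rng = np.random.default_rng(0)
docs = [f"document {i}" for i in range(1000)]
doc_vecs = rng.normal(size=(1000, 64))        # stand-in first-stage embeddings

def first_stage(query_vec, k):
    # Cheap retriever: rank all docs by dot product, keep top-k.
    scores = doc_vecs @ query_vec
    return np.argsort(-scores)[:k]

def cross_encoder_score(query, doc):
    # Stand-in for an expensive cross-encoder; a real one jointly
    # encodes (query, doc) and returns a relevance score.
    return rng.normal()

def rerank(query, query_vec, k_retrieve):
    candidates = first_stage(query_vec, k_retrieve)
    scored = [(cross_encoder_score(query, docs[i]), i) for i in candidates]
    return [i for _, i in sorted(scored, reverse=True)]

# The paper's experiment in miniature: grow k_retrieve and ask whether
# letting the reranker score more documents helps or hurts the top results.
for k in (10, 100, 1000):
    print(k, rerank("what is a lakehouse?", rng.normal(size=64), k)[:5])
```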
DataGen
#167 - Masterclass | Understanding Everything About Databricks with Frederic Forest
Frédéric Forest is Managing Director and co-founder of Datatorii, a consulting firm specializing in Databricks and Microsoft. 🎬 CHAPTERS 00:00 Opening credits 00:34 Intro 02:22 In what context(s) do companies adopt Databricks? 04:53 What kind of stack can you build with Databricks? 09:13 The drawbacks and advantages of Databricks 15:56 Frédéric's advice for rolling out Databricks 17:13 A key market trend: coopetition between Databricks and Microsoft 20:54 The quest...
2024-11-18
25 min
Papers Read on AI
Text2SQL is Not Enough: Unifying AI and Databases with TAG
AI systems that serve natural language questions over databases promise to unlock tremendous value. Such systems would allow users to leverage the powerful reasoning and knowledge capabilities of language models (LMs) alongside the scalable computational power of data management systems. These combined capabilities would empower users to ask arbitrary natural language questions over custom data sources. However, existing methods and benchmarks insufficiently explore this setting. Text2SQL methods focus solely on natural language questions that can be expressed in relational algebra, representing a small subset of the questions real users wish to ask. Likewise, Retrieval-Augmented Generation (RAG) considers the limited...
2024-09-09
42 min
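The TAG pipeline described above decomposes into three steps: query synthesis, query execution, and answer generation. A minimal sketch of that loop, with a hardcoded stub standing in for the LM and SQLite as the database:

```python
import sqlite3

def fake_lm(prompt: str) -> str:
    # Stand-in for a language model call; a real TAG system would prompt
    # an LLM here. Hardcoded for the demo question below.
    if "Write a SQL query" in prompt:
        return "SELECT title, rating FROM movies ORDER BY rating DESC LIMIT 3"
    return "The highest-rated movies are: " + prompt.split("Rows:")[-1].strip()

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE movies (title TEXT, rating REAL)")
con.executemany("INSERT INTO movies VALUES (?, ?)",
                [("Heat", 8.3), ("Alien", 8.5), ("Clue", 7.2), ("Tron", 6.8)])

question = "Which are the best movies?"
# Step 1 (synthesis): the LM turns the question into a query over the table.
sql = fake_lm(f"Write a SQL query for: {question}")
# Step 2 (execution): the database does the scalable computation.
rows = con.execute(sql).fetchall()
# Step 3 (generation): the LM reasons over the results in natural language.
answer = fake_lm(f"Answer '{question}'. Rows: {rows}")
print(answer)
```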
Papers Read on AI
Ring Attention with Blockwise Transformers for Near-Infinite Context
Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands imposed by Transformers limit their ability to handle long sequences, thereby posing challenges in utilizing videos, actions, and other long-form sequences and modalities in complex environments. We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices while fully overlapping the communication of key-value blocks with the computation of blockwise attention. Our approach enables training and inference...
2024-02-26
26 min
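The key mechanism is blockwise attention with online-softmax rescaling, which is what lets key-value blocks circulate around a ring of devices without any device materializing the full attention matrix. A minimal single-host sketch, where iterating over blocks stands in for ring hops:

```python
import numpy as np

def blockwise_attention(q, k_blocks, v_blocks):
    """One query block attending over K/V blocks one at a time, using the
    online-softmax accumulation that lets Ring Attention pass K/V blocks
    around a ring of devices."""
    d = q.shape[-1]
    m = np.full(q.shape[0], -np.inf)       # running row-wise max of scores
    l = np.zeros(q.shape[0])               # running softmax denominator
    o = np.zeros_like(q)                   # running weighted sum of values
    for k, v in zip(k_blocks, v_blocks):   # one iteration per ring hop
        s = (q @ k.T) / np.sqrt(d)
        m_new = np.maximum(m, s.max(axis=-1))
        p = np.exp(s - m_new[:, None])
        scale = np.exp(m - m_new)          # rescale the old accumulators
        l = l * scale + p.sum(axis=-1)
        o = o * scale[:, None] + p @ v
        m = m_new
    return o / l[:, None]

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 8, 4))
out = blockwise_attention(q, np.split(k, 4), np.split(v, 4))
# Matches ordinary softmax attention computed in one shot:
s = (q @ k.T) / np.sqrt(4)
ref = np.exp(s - s.max(-1, keepdims=True))
ref = (ref / ref.sum(-1, keepdims=True)) @ v
assert np.allclose(out, ref)
```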
Papers Read on AI
World Model on Million-Length Video And Language With RingAttention
Current language models fall short in understanding aspects of the world not easily described in words, and struggle with complex, long-form tasks. Video sequences offer valuable temporal information absent in language and static images, making them attractive for joint modeling with language. Such models could develop an understanding of both human textual knowledge and the physical world, enabling broader AI capabilities for assisting humans. However, learning from millions of tokens of video and language sequences poses challenges due to memory constraints, computational complexity, and limited datasets. To address these challenges, we curate a large dataset of diverse videos and books...
2024-02-17
30 min
Papers Read on AI
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
The ML community is rapidly exploring techniques for prompting language models (LMs) and for stacking them into pipelines that solve complex tasks. Unfortunately, existing LM pipelines are typically implemented using hard-coded "prompt templates", i.e. lengthy strings discovered via trial and error. Toward a more systematic approach for developing and optimizing LM pipelines, we introduce DSPy, a programming model that abstracts LM pipelines as text transformation graphs, i.e. imperative computational graphs where LMs are invoked through declarative modules. DSPy modules are parameterized, meaning they can learn (by creating and collecting demonstrations) how to apply compositions of prompting, finetuning, augmentation...
2024-01-31
47 min
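The core idea is that pipeline stages are declared as parameterized modules whose demonstrations are learned, rather than hand-written into prompt strings. An illustrative sketch of that idea in plain Python (not DSPy's actual API), with a stub LM:

```python
def lm(prompt: str) -> str:
    # Stand-in for a language model call.
    return "42"

class Predict:
    """A declarative module in the DSPy spirit: it declares a signature
    (inputs -> outputs) and carries learnable demonstrations instead of a
    hand-tuned prompt string. Illustrative sketch, not DSPy's real API."""
    def __init__(self, signature: str):
        self.signature = signature
        self.demos = []                     # the module's learned "parameters"

    def __call__(self, **inputs):
        prompt = "\n".join([self.signature, *self.demos, str(inputs)])
        return lm(prompt)

def bootstrap(module, trainset, metric):
    # Toy "compiler": run the module on training examples and keep the
    # traces that pass the metric as few-shot demonstrations.
    for question, gold in trainset:
        answer = module(question=question)
        if metric(answer, gold):
            module.demos.append(f"Q: {question}\nA: {answer}")
    return module

qa = Predict("question -> answer")
qa = bootstrap(qa, [("six times seven?", "42")], metric=lambda a, g: a == g)
print(qa(question="six times seven?"))
```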
Vocea harului
Matthew 26:31; Zechariah 13:6-7; Genesis 22:1-2, 6-11; John 16:32; 19:25-27
Sunday evening, 21.01.2024
2024-01-22
22 min
MLOps.community
FrugalGPT: Better Quality and Lower Cost for LLM Applications // Lingjiao Chen // MLOps Podcast #172
MLOps Coffee Sessions #172 with Lingjiao Chen, FrugalGPT: Better Quality and Lower Cost for LLM Applications. This episode is sponsored by QuantumBlack. We are now accepting talk proposals for our next LLM in Production virtual conference on October 3rd. Apply to speak here: https://go.mlops.community/NSAX1O // Abstract There is a rapidly growing number of large language models (LLMs) that users can query for a fee. We review the cost associated with querying popular LLM APIs, e.g. GPT-4, ChatGPT, J1-Jumbo, and find that these models have heterogeneous pricing structures, with fees that can differ...
2023-08-22
1h 02
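One of the paper's cost-cutting strategies is the LLM cascade: query a cheap model first, and pay for an expensive one only when a scorer distrusts the cheap answer. A minimal sketch with stub models and a hypothetical confidence threshold (the paper trains a small scorer for this judgment):

```python
def cheap_model(question):          # stand-ins for priced LLM APIs
    return "maybe", 0.62            # (answer, confidence score)

def expensive_model(question):
    return "definitely", 0.98

def cascade(question, threshold=0.8):
    """FrugalGPT-style LLM cascade: escalate to the pricier model only
    when the cheap answer's score falls below the threshold."""
    answer, confidence = cheap_model(question)
    if confidence >= threshold:
        return answer, "cheap"
    return expensive_model(question)[0], "expensive"

print(cascade("Is P != NP?"))       # -> ('definitely', 'expensive')
```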
The Most Awesome Founder Podcast
EP 74 – Inspiration session #9: Golf, Gender and Generative AI
We are thrilled to announce the launch of the 9th edition of the inspiration session with Gerrit and Dries! 🎉 They will once more discuss diverse and thought-provoking topics, combining insights from an academic and a practitioner perspective. What are the legal implications of AI-generated inventions, which entrepreneurs should consider entering a venture studio and why women should be careful if their boss plays golf? These and other questions will be answered, so be prepared to learn, think, and laugh. Don't miss this captivating episode! 🔍 Discussed sources: Dahlander, Linus, et al. "Bl...
2023-08-02
1h 23
B2BaCEO (with Ashu Garg)
How to Deploy AI in Your Company (Matei Zaharia, CTO & Co-Founder of Databricks)
If you’re not hardcore about AI, then stop the podcast now — because this episode is for real technology nerds. Ashu’s guest is Matei Zaharia, CTO and cofounder of Databricks, and a professor of computer science at Stanford University. Ashu and Matei cover a lot of ground — most of it technical, all of it very relevant for anyone who’s serious about building with artificial intelligence. They start with a discussion of Databricks’ early days: how the startup established a foothold in a market dominated by entrenched incumbents and some of the challenges Matei and his cofounders faced. The rest of the conve...
2023-06-02
39 min
Data Radicals
The Bazaar in the Cathedral with Matei Zaharia, CTO & Co-founder at Databricks and Creator of Apache Spark
When building a data platform, it’s important to stay true to your vision. Whether that's through creating a definitive user experience or an open platform that allows people to build upon it, you’re constructing a cathedral. This cathedral is sophisticated and dependable, and allows for a bazaar of business intelligence, machine learning, and AI use cases. In this episode, Satyen interviews Matei Zaharia, Chief Technologist and Co-founder of Databricks. Matei is an open source trailblazer and the creator of Apache Spark, a widely used framework for distributed data processing. He is also an Associate Professor of C...
2023-05-24
49 min
MLOps.community
The Birth and Growth of Spark: An Open Source Success Story // Matei Zaharia // MLOps Podcast #155
MLOps Coffee Sessions #155 with Matei Zaharia, The Birth and Growth of Spark: An Open Source Success Story, co-hosted by Vishnu Rachakonda. // Abstract We dive deep into the creation of Spark with the creator himself, Matei Zaharia, Chief Technologist at Databricks. This episode also explores the development of Databricks' other open source home run, MLflow, and the concept of "Lakehouse ML". As a special treat, Matei talked to us about the details of the "DSP" (Demonstrate Search Predict) project, which aims to enable building applications by combining LLMs and other text-returning systems. // About the guest: Matei has...
2023-04-25
58 min
Biserica Baptistă Providența Timișoara
What does Jesus' entry into Jerusalem mean to you?
2023 - Samy Tuțac - What does Jesus' entry into Jerusalem mean to you? |The Coming of the King| - (Matthew 21:1-9, Zechariah 9:9)
2023-04-12
56 min
No Priors: Artificial Intelligence | Technology | Startups
The Future is Small Models, with Matei Zaharia, CTO of Databricks
If you have 30 dollars, a few hours, and one server, then you are ready to create a ChatGPT-like model that can do what’s known as instruction-following. Databricks’ latest launch, Dolly, foreshadows a potential move in the industry toward smaller and more accessible but extremely capable AIs. Plus, Dolly is open source, requires less computing power, and has fewer parameters than its counterparts. Matei Zaharia, Cofounder & Chief Technologist at Databricks, joins Sarah and Elad to talk about how big datasets actually need to be, why manual annotation is becoming less necessary to train some models, and how...
2023-04-06
39 min
Unsupervised Learning
Ep 2: Databricks CTO Matei Zaharia on scaling and orchestrating large language models
Patrick and Jacob sit down with Matei Zaharia, Co-Founder and CTO at Databricks and Professor at Stanford. They discuss how companies are training and serving models in production with Databricks, where LLMs fall short for search and how to improve them, state-of-the-art AI research at Stanford, and how the size and cost of models are likely to change with technological advances in the coming years. (0:00) - Introduction (2:04) - Founding story of Databricks (6:03) - PhD classmates using an early version of Spark for the Netflix competition (6:55) - Building a...
2023-03-07
46 min
ACM ByteCast
Matei Zaharia - Episode 32
In this episode of ACM ByteCast, Bruke Kifle hosts Matei Zaharia, computer scientist, educator, and creator of Apache Spark. Matei is the Chief Technologist and Co-Founder of Databricks and an Assistant Professor of Computer Science at Stanford. He started the Apache Spark project during his PhD at UC Berkeley in 2009 and has worked broadly on other widely used data and machine learning software, including MLflow, Delta Lake, and Apache Mesos. Matei's research was recognized through the 2014 ACM Doctoral Dissertation Award, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers. Matei, who...
2022-12-13
54 min
FLORIN V. FĂT
13. DO NOT BE AFRAID
Hi, I'm Florin V. Făt, and welcome to the podcast: AN EXCELLENT LIFE! I really appreciate that you've taken a few minutes to listen to this episode, essentially to invest in your personal and spiritual growth. Since there are only a few days left until the Christian celebration of the birth of the Son of God, I thought the next episodes, starting with today's, should be tied to the Christmas atmosphere. So today's episode, as well as the following ones until the end of...
2022-12-10
14 min
Neural Search Talks — Zeta Alpha
ColBERT + ColBERTv2: late interaction at a reasonable inference cost
Andrew Yates (Assistant Professor at the University of Amsterdam) and Sergi Castella (Analyst at Zeta Alpha) discuss the two influential papers introducing ColBERT (from 2020) and ColBERTv2 (from 2022), which propose a fast late-interaction operation that achieves performance close to full cross-encoders at a much more manageable inference cost, along with many other optimizations. 📄 ColBERT: "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT" by Omar Khattab and Matei Zaharia. https://arxiv.org/abs/2004.12832 📄 ColBERTv2: "ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction" by Keshav...
2022-08-16
57 min
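Late interaction itself is a small amount of code: documents and queries are encoded into per-token embeddings (documents offline), and scoring is a MaxSim sum. A sketch with random vectors standing in for BERT token embeddings:

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """ColBERT's late interaction: each query token embedding takes its
    maximum similarity over all document token embeddings, and the
    per-token maxima are summed. Document embeddings are precomputed,
    so only this cheap operation runs per query-document pair."""
    sim = query_vecs @ doc_vecs.T      # token-level similarity matrix
    return sim.max(axis=1).sum()       # max over doc tokens, sum over query tokens

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 128))                       # 5 query tokens
docs = [rng.normal(size=(n, 128)) for n in (40, 80, 120)]
ranked = sorted(range(len(docs)), key=lambda i: -maxsim_score(q, docs[i]))
print(ranked)
```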
Data Science Deep Dive
#5: Data Warehouse vs. Data Lake vs. Data Mesh
There are many exciting technologies for storing and moving data. If you don't yet have a data platform or a data warehouse, which approach should you take? We talk about: SQL databases, BI cubes, data warehouses, data lakes, data mesh. Links: Amazon Web Services - What is a Data Lake? Data Mesh and Lakehouse - Matei Zaharia, Databricks
2022-07-20
1h 00
cercetare
When will Romanian professors become billionaires, too?
Ion Stoica and Matei Zaharia are the newest Romanian billionaires. Virtually unknown in their home country, the two were included by Forbes magazine on its billionaires list. The story of the two newly minted Romanian billionaires was presented by...
2022-05-31
00 min
Stanford MLSys Seminar
#62 Dan Fu - Improving Transfer and Robustness of Supervised Contrastive Learning
Dan Fu - An ideal learned representation should display transferability and robustness. Supervised contrastive learning is a promising method for training accurate models, but produces representations that do not capture these properties due to class collapse -- when all points in a class map to the same representation. In this talk, we discuss how to alleviate these problems to improve the geometry of supervised contrastive learning. We identify two key principles: balancing the right amount of geometric "spread" in the embedding space, and inducing an inductive bias towards subclass clustering. We introduce two mechanisms for achieving these aims in...
2022-04-27
56 min
CS224U
Omar Khattab on neural information retrieval
Pronouncing "ColBERT", the origins of ColBERT, doing NLP from an IR perspective, how getting "scooped" can be productive, OpenQA and related tasks, PhD journeys, why even retrieval plus attention is not all you need, multilingual knowledge-intensive NLP, and aiming high in research projects. Transcript: https://web.stanford.edu/class/cs224u/podcast/khattab/ Omar's website Matei Zaharia Keshav Santhanam Steven Colbert thowing paper with Obama The ColBERT paper and the ColBERTv2 paper DeepImpact: Learning passage impacts for inverted indexes DPR: Dense passage retrieval for open-domain question answering Incorporating query term independence assumption for efficient retrieval and...
2022-04-26
1h 25
Stanford MLSys Seminar
#61 Kexin Rong - Big Data Analytics
Kexin Rong - Learned Indexing and Sampling for Improving Query Performance in Big-Data Analytics. Traditional data analytics systems improve query efficiency via fine-grained, row-level indexing and sampling techniques. However, to keep up with the data volumes, increasingly many systems store and process datasets in large partitions containing hundreds of thousands of rows. Therefore, these analytics systems must adapt traditional techniques to work with coarse-grained data partitions as a basic unit to process queries efficiently. In this talk, I will discuss two related ideas that combine learning techniques with partitioning designs to improve the query efficiency in the...
2022-04-22
59 min
Stanford MLSys Seminar
#60 Igor Markov - Looper: An End-to-End ML Platform for Product Decisions
Igor Markov - Looper: an end-to-end ML platform for product decisions. Episode 60 of the Stanford MLSys Seminar Series! Speaker: Igor Markov. Abstract: Modern software systems and products increasingly rely on machine learning models to make data-driven decisions based on interactions with users, infrastructure and other systems. For broader adoption, this practice must (i) accommodate product engineers without ML backgrounds, (ii) support fine-grain product-metric evaluation and (iii) optimize for product goals. To address shortcomings of prior platforms, we introduce general principles for and the architecture of an ML platform, Looper, wi...
2022-04-11
1h 00
Stanford MLSys Seminar
#59 Zhuohan Li - Alpa: Automated Model-Parallel Deep Learning
Zhuohan Li - Alpa: Automated Model-Parallel Deep Learning. Alpa (https://github.com/alpa-projects/alpa) automates model-parallel training of large deep learning models by generating execution plans that unify data, operator, and pipeline parallelism. Alpa distributes the training of large deep learning models by viewing parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms. Based on it, Alpa constructs a new hierarchical space for massive model-parallel execution plans. Alpa designs a number of compilation passes to automatically derive the optimal parallel execution plan in each independent parallelism level and implements an efficient runtime to orchestrate the two-level parallel...
2022-04-05
55 min
Stanford MLSys Seminar
3/10/22 #58 Shruti Bhosale - Multilingual Machine Translation
Shruti Bhosale - Scaling Multilingual Machine Translation to Thousands of Language Directions. Existing work in translation has demonstrated the potential of massively multilingual machine translation by training a single model able to translate between any pair of languages. However, much of this work is English-centric, trained only on data translated from or to English. While this is supported by large sources of training data, it does not reflect translation needs worldwide. In this talk, I will describe how we create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100...
2022-03-18
57 min
Stanford MLSys Seminar
3/3/22 #57 Vijay Janapa Reddi - TinyML, Harvard Style
Vijay Janapa Reddi - Tiny Machine Learning. Tiny machine learning (TinyML) is a fast-growing field at the intersection of ML algorithms and low-cost embedded systems. TinyML enables on-device analysis of sensor data (vision, audio, IMU, etc.) at ultra-low-power consumption (less than 1mW). Processing data close to the sensor allows for an expansive new variety of always-on ML use-cases that preserve bandwidth, latency, and energy while improving responsiveness and maintaining privacy. This talk introduces the vision behind TinyML and showcases some of the interesting applications that TinyML is enabling in the field, from wildlife conservation to supporting public...
2022-03-04
57 min
Stanford MLSys Seminar
2/24/22 #56 Fait Poms - Interactive Model Development
Fait Poms - A vision for interactive model development: efficient machine learning by bringing domain experts in the loop. Building computer vision models today is an exercise in patience: days to weeks for human annotators to label data, hours to days to train and evaluate models, weeks to months of iteration to reach a production model. Without tolerance for this timeline or access to the massive compute and human resources required, building an accurate model can be challenging if not impossible. In this talk, we discuss a vision for interactive model development with iteration cycles of minutes, not...
2022-02-28
55 min
Stanford MLSys Seminar
1/28/21 #10 Travis Addair - Deep Learning at Scale with Horovod
Travis Addair - Horovod and the Evolution of Deep Learning at Scale. Deep neural networks are pushing the state of the art in numerous machine learning research domains, from computer vision to natural language processing and even tabular business data. However, scaling such models to train efficiently on large datasets imposes a unique set of challenges that traditional batch data processing systems were not designed to solve. Horovod is an open source framework that scales models written in TensorFlow, PyTorch, and MXNet to train seamlessly on hundreds of GPUs in parallel. In this talk, we'll explain the con...
2022-02-23
59 min
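Horovod's scaling trick is bandwidth-optimal ring all-reduce for gradient averaging. A pure-NumPy simulation of its two phases (reduce-scatter, then all-gather), with list indices standing in for workers:

```python
import numpy as np

def ring_allreduce(grads):
    """Simulates the ring all-reduce at the heart of Horovod: each of n
    workers splits its gradient into n chunks; chunks flow around the ring
    for n-1 reduce steps and n-1 broadcast steps, so every worker ends up
    with the summed gradient while only ever talking to its neighbor."""
    n = len(grads)
    chunks = [np.array_split(g.astype(float), n) for g in grads]
    # Reduce-scatter: afterwards, worker i holds the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] += data
    # All-gather: circulate the completed chunks so every worker has all of them.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy())
                 for i in range(n)]
        for i, c, data in sends:
            chunks[(i + 1) % n][c] = data
    return [np.concatenate(c) for c in chunks]

grads = [np.arange(8.0) * (i + 1) for i in range(4)]   # one gradient per worker
reduced = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in reduced)
```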
Stanford MLSys Seminar
2/17/22 #55 Doris Lee - Visualization for Data Science
Doris Lee - Always-on Dataframe Visualizations with Lux. Visualizations help data scientists discover trends and patterns, identify outliers, and derive insights from their data. However, existing visualization libraries in Python require users to write a substantial amount of code for plotting even a single visualization, often hindering the flow of data exploration. In this talk, you will learn about Lux, a lightweight visualization tool on top of pandas dataframes. Lux recommends visualizations for free to users as they explore their data within a Jupyter notebook without the need to write additional code. Lux is used by data scientists...
2022-02-19
58 min
Stanford MLSys Seminar
1/21/21 #9 Song Han - Reducing AI's Carbon Footprint
Song Han - TinyML: Reducing the Carbon Footprint of Artificial Intelligence in the Internet of Things (IoT). Deep learning is computation-hungry and data-hungry. We aim to improve the computation efficiency and data efficiency of deep learning. I will first talk about MCUNet[1] that brings deep learning to IoT devices. The technique is tiny neural architecture search (TinyNAS) co-designed with a tiny inference engine (TinyEngine), enabling ImageNet-scale inference on an IoT device with only 1MB of FLASH. Next I will talk about TinyTL[2] that enables on-device training, reducing the memory footprint by 7-13x. Finally, I will describe D...
2022-02-15
56 min
Stanford MLSys Seminar
2/10/22 #54 Ellie Pavlick - Do Deep Models Learn Symbolic Reasoning?
Ellie Pavlick - Implementing Symbols and Rules with Neural Networks. Many aspects of human language and reasoning are well explained in terms of symbols and rules. However, state-of-the-art computational models are based on large neural networks which lack explicit symbolic representations of the type frequently used in cognitive theories. One response has been the development of neuro-symbolic models which introduce explicit representations of symbols into neural network architectures or loss functions. In terms of Marr's levels of analysis, such approaches achieve symbolic reasoning at the computational level ("what the system does and why") by introducing symbols and...
2022-02-12
1h 01
Stanford MLSys Seminar
12/10/20 #8 Kayvon Fatahalian - Video Analysis in Hours, Not Weeks
Kayvon Fatahalian - From Ideas to Video Analysis Models in Hours, Not Weeks. My students and I often find ourselves as "subject matter experts" needing to create video understanding models that serve computer graphics and video analysis applications. Unfortunately, like many, we are frustrated by how a smart grad student, armed with a large *unlabeled* video collection, a palette of pre-trained models, and an idea of what novel object or activity they want to detect/segment/classify, requires days-to-weeks to create and validate a model for their task. In this talk I will discuss challenges we've faced in...
2022-02-07
1h 03
Stanford MLSys Seminar
2/3/22 #53 Cody Coleman - Data Selection for Data-Centric AI
Cody Coleman - Data selection for Data-Centric AI: Data Quality Over Quantity. Data selection methods, such as active learning and core-set selection, improve the data efficiency of machine learning by identifying the most informative data points to label or train on. Across the data selection literature, there are many ways to identify these training examples. However, classical data selection methods are prohibitively expensive to apply in deep learning because of the larger datasets and models. This talk will describe two techniques to make data selection methods more tractable. First, "selection via proxy" (SVP) avoids expensive training and...
2022-02-05
55 min
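"Selection via proxy" replaces the expensive target model with a small, cheap proxy whose uncertainty ranks the unlabeled pool. A minimal sketch using prediction entropy as the uncertainty signal and random probabilities as stand-in proxy outputs:

```python
import numpy as np

def select_via_proxy(unlabeled_probs, budget):
    """Rank unlabeled examples by the uncertainty of a small proxy model
    and keep only the most uncertain ones for labeling or training.
    `unlabeled_probs` holds the proxy's predicted class probabilities."""
    entropy = -(unlabeled_probs * np.log(unlabeled_probs + 1e-12)).sum(axis=1)
    return np.argsort(-entropy)[:budget]      # most uncertain first

rng = np.random.default_rng(0)
logits = rng.normal(size=(10_000, 10))        # pretend proxy outputs
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
chosen = select_via_proxy(probs, budget=100)
print(chosen[:10])
```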
Stanford MLSys Seminar
1/27/22 #52 Bilge Acun - Sustainability for AI
Bilge Acun - Designing Sustainable Datacenters with and for AI. Machine learning has witnessed exponential growth over recent years. In this talk, we will first explore the environmental implications of the super-linear growth trend of AI from a holistic perspective, spanning data, algorithms, and system hardware. System efficiency optimizations can significantly help reduce the carbon footprint of AI systems. However, predictions show that the efficiency improvements will not be enough to reduce the overall resource needs of AI, as Jevons' Paradox suggests that "efficiency increases consumption". Therefore, we need to design our datacenters with sustainability in mind...
2022-01-31
58 min
Stanford MLSys Seminar
12/3/20 #7 Matthias Poloczek - Bayesian Optimization
Matthias Poloczek - Scalable Bayesian Optimization for Industrial Applications. Bayesian optimization has become a powerful method for the sample-efficient optimization of expensive black-box functions. These functions do not have a closed form and are evaluated for example by running a complex economic simulation, by an experiment in the lab or in a market, or by a CFD simulation. Use cases arise in machine learning, e.g., when tuning the configuration of an ML model or when optimizing a reinforcement learning policy. Examples in engineering include the design of aerodynamic structures or materials discovery. In this talk I...
2022-01-24
59 min
Stanford MLSys Seminar
01/20/22 #51 Fred Sala - Weak Supervision for Diverse Datatypes
Fred Sala - Efficiently Constructing Datasets for Diverse Datatypes. Building large datasets for data-hungry models is a key challenge in modern machine learning. Weak supervision frameworks have become a popular way to bypass this bottleneck. These approaches synthesize multiple noisy but cheaply-acquired estimates of labels into a set of high-quality pseudolabels for downstream training. In this talk, I introduce a technique that fuses weak supervision with structured prediction, enabling WS techniques to be applied to extremely diverse types of data. This approach allows for labels that can be continuous, manifold-valued (including, for example, points in hyperbolic space...
2022-01-21
53 min
Stanford MLSys Seminar
11/19/20 #6 Roy Frostig - The Story Behind JAX
Roy Frostig - JAX: accelerating machine learning research by composing function transformations in Python. JAX is a system for high-performance machine learning research and numerical computing. It offers the familiarity of Python+NumPy together with hardware acceleration, plus a set of composable function transformations: automatic differentiation, automatic batching, end-to-end compilation (via XLA), parallelizing over multiple accelerators, and more. JAX's core strength is its guarantee that these user-wielded transformations can be composed arbitrarily, so that programmers can write math (e.g. a loss function) and transform it into pieces of an ML program (e.g. a vectorized, compiled...
2022-01-17
1h 06
Stanford MLSys Seminar
01/13/22 #50 Deepak Narayanan - Resource-Efficient Deep Learning Execution
Deepak Narayanan - Resource-Efficient Deep Learning Execution. Deep Learning models have enabled state-of-the-art results across a broad range of applications; however, training these models is extremely time- and resource-intensive, taking weeks on clusters with thousands of expensive accelerators in the extreme case. In this talk, I will describe two ideas that help improve the resource efficiency of model training. In the first half of the talk, I will discuss how pipelining can be used to accelerate distributed training. Pipeline parallelism facilitates model training with lower communication overhead than previous methods while still ensuring high compute...
2022-01-14
57 min
Stanford MLSys Seminar
11/12/20 #5 Chip Huyen - Principles of Good Machine Learning Systems Design
Chip Huyen - Principles of Good Machine Learning Systems Design. This talk covers what it means to operationalize ML models. It starts by analyzing the difference between ML in research vs. in production, ML systems vs. traditional software, as well as myths about ML production. It then goes over the principles of good ML systems design and introduces an iterative framework for ML systems design, from scoping the project, data management, model development, deployment, maintenance, to business analysis. It covers the differences between DataOps, ML Engineering, MLOps, and data science, and where each fits into...
2022-01-10
1h 06
Stanford MLSys Seminar
11/5/20 #4 Alex Ratner - Programmatically Building & Managing Training Data with Snorkel
Alex Ratner - Programmatically Building & Managing Training Data with Snorkel. One of the key bottlenecks in building machine learning systems is creating and managing the massive training datasets that today's models require. In this talk, I will describe our work on Snorkel (snorkel.org), an open-source framework for building and managing training datasets, and describe three key operators for letting users build and manipulate training datasets: labeling functions, for labeling unlabeled data; transformation functions, for expressing data augmentation strategies; and slicing functions, for partitioning and structuring training datasets. These operators allow domain expert users to specify machine l...
2022-01-08
1h 13
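The labeling-function operator is easy to picture: each function votes or abstains on an unlabeled example, and a label model aggregates the votes into a pseudolabel. A sketch using simple majority vote (Snorkel's real label model additionally learns per-function accuracies and correlations):

```python
import numpy as np

ABSTAIN = -1

# Labeling functions: cheap heuristics that vote or abstain on examples
# (stand-ins for the domain-expert rules Snorkel users write).
def lf_keyword(x):  return 1 if "refund" in x else ABSTAIN
def lf_length(x):   return 0 if len(x) > 80 else ABSTAIN
def lf_exclaim(x):  return 1 if "!" in x else ABSTAIN

def majority_vote(example, lfs, n_classes=2):
    """The simplest label model: count non-abstaining votes per class and
    pick the winner; abstain if every labeling function abstained."""
    votes = [lf(example) for lf in lfs if lf(example) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return int(np.bincount(votes, minlength=n_classes).argmax())

print(majority_vote("I want a refund!", [lf_keyword, lf_length, lf_exclaim]))
```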
Stanford MLSys Seminar
11/5/20 #3 Virginia Smith - On Heterogeneity in Federated Settings
Virginia Smith - On Heterogeneity in Federated Settings. A defining characteristic of federated learning is the presence of heterogeneity, i.e., that data and compute may differ significantly across the network. In this talk I show that the challenge of heterogeneity pervades the machine learning process in federated settings, affecting issues such as optimization, modeling, and fairness. In terms of optimization, I discuss FedProx, a distributed optimization method that offers robustness to systems and statistical heterogeneity. I then explore the role that heterogeneity plays in delivering models that are accurate and fair to all users/devices in...
2022-01-08
1h 00
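FedProx's robustness to heterogeneity comes from one change to each client's local objective: a proximal term penalizing drift from the global model. A toy sketch with two quadratic client losses standing in for heterogeneous data:

```python
import numpy as np

def fedprox_local_update(w_global, grad_fn, mu=0.1, lr=0.05, steps=100):
    """FedProx's local step: each client minimizes its own loss plus a
    proximal term (mu/2)*||w - w_global||^2 that keeps heterogeneous
    clients from drifting too far from the shared model."""
    w = w_global.copy()
    for _ in range(steps):
        w -= lr * (grad_fn(w) + mu * (w - w_global))
    return w

# Toy federated rounds: two clients whose quadratic losses have different
# optima (statistical heterogeneity); the server averages local updates.
targets = [np.array([1.0, 0.0]), np.array([0.0, 3.0])]
w = np.zeros(2)
for _ in range(20):
    locals_ = [fedprox_local_update(w, lambda v, t=t: v - t) for t in targets]
    w = np.mean(locals_, axis=0)
print(w)   # settles near the average of the clients' optima
```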
Stanford MLSys Seminar
10/22/20 #2 Matei Zaharia - Machine Learning at Industrial Scale: Lessons from the MLflow Project
Matei Zaharia - Machine Learning at Industrial Scale: Lessons from the MLflow Project. Although enterprise adoption of machine learning is still early on, many enterprises in all industries already have hundreds of internal ML applications. ML powers business processes with an impact of hundreds of millions of dollars in industrial IoT, finance, healthcare and retail. Building and operating these applications reliably requires infrastructure that is different from traditional software development, which has led to significant investment in the construction of “ML platforms” specifically designed to run ML applications. In this talk, I’ll discuss some of the common...
2022-01-08
59 min
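The workflow pattern MLflow's tracking component standardizes looks like this. A minimal sketch using the real mlflow tracking API, assuming a default local ./mlruns store:

```python
import mlflow

# Every training run records its parameters, metrics, and artifacts so
# experiments stay reproducible and comparable across a team.
with mlflow.start_run(run_name="demo"):
    mlflow.log_param("lr", 0.01)
    mlflow.log_param("model", "logreg")
    for epoch, acc in enumerate([0.71, 0.83, 0.88]):
        mlflow.log_metric("val_accuracy", acc, step=epoch)
# Runs can then be browsed and compared with `mlflow ui`.
```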
Stanford MLSys Seminar
10/15/20 #1 Marco Tulio Ribeiro - Beyond Accuracy: Behavioral Testing of NLP Models with CheckList
Marco Tulio Ribeiro on "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList". We will present CheckList, a task-agnostic methodology and tool for testing NLP models inspired by principles of behavioral testing in software engineering. We will show a lot of fun bugs we discovered with CheckList, both in commercial models (Microsoft, Amazon, Google) and research models (BERT, RoBERTa for sentiment analysis, QQP, SQuAD). We'll also present comparisons between CheckList and the status quo, in a case study at Microsoft and a user study with researchers and engineers. We show that CheckList is a really helpful process...
2022-01-08
1h 00
Stanford MLSys Seminar
01/06/22 #49 Beidi Chen - Pixelated Butterfly: Fast Machine Learning with Sparsity
Beidi Chen talks about "Pixelated Butterfly: Simple and Efficient Sparse Training for Neural Network Models." Overparameterized neural networks generalize well but are expensive to train. Ideally, one would like to reduce their computational cost while retaining their generalization benefits. Sparse model training is a simple and promising approach to achieve this, but there remain challenges as existing methods struggle with accuracy loss, slow training runtime, or difficulty in sparsifying all model components. The core problem is that searching for a sparsity mask over a discrete set of sparse matrices is difficult and expensive. To address this, our main insight...
2022-01-08
53 min
Zaharia Podcast
#28 - Mihnea Todosie: The Matthew Effect, Statistics, Crypto
Mihnea Todosie is a quantitative analyst in the gambling industry. We discussed how statistics can be manipulated to serve a particular interest, the Matthew Effect, and the direction in which the cryptocurrency industry is heading.
2022-01-04
1h 31
MLOps.community
Data Selection for Data-Centric AI: Data Quality Over Quantity // Cody Coleman // Coffee Sessions #59
Coffee Sessions #59 with Cody Coleman, Data Quality Over Quantity or Data Selection for Data-Centric AI. // Abstract Big data has been critical to many of the successes in ML, but it brings its own problems. Working with massive datasets is cumbersome and expensive, especially with unstructured data like images, videos, and speech. Careful data selection can mitigate the pains of big data by focusing computational and labeling resources on the most valuable examples. Cody Coleman, a recent Ph.D. from Stanford University and founding member of MLCommons, joins us to describe how a more data-centric ap...
2021-10-11
1h 11
Vești bune
We trust in the Lord and are not shaken
Vești bune - 71 - We trust in the Lord and are not shaken. Psalm 125:1 Those who trust in the Lord are like Mount Zion, which cannot be shaken but stands firm forever. 2 As the mountains surround Jerusalem, so the Lord surrounds His people, from now on and forevermore. 3 For the scepter of wickedness shall not rest on the inheritance of the righteous, lest the righteous stretch out their hands to iniquity. 4 Lord, pour...
2021-07-19
06 min
Data Brew by Databricks
Data Brew Season 2 Episode 1: ML in Production
For our second season, we will be focusing on machine learning, from research to production. We will interview folks in academia and industry to discuss topics such as data ethics, production-grade infrastructure for ML, hyperparameter tuning, AutoML, and many more. In the season opener, Matei Zaharia discusses how he entered the field of ML, best practices for productionizing ML pipelines, leveraging MLflow & the Lakehouse architecture for reproducible ML, and his current research in this field. See more at databricks.com/data-brew
2021-04-22
30 min
Software Engineering Daily
Multicloud with Ben Hindman
Most applications today are either deployed to on-premise environments or deployed to a single cloud provider. Developers who are deploying on-prem struggle to set up complicated open source tools like Kafka and Hadoop. Developers who are deploying to a cloud provider tend to stay within that specific cloud provider, because moving between different clouds and integrating services across clouds adds complexity. Ben Hindman started the Apache Mesos project when he was working in the Berkeley AMPLab. Mesos is a scheduler for resources in a distributed system, allowing compute and storage to be scheduled onto jobs...
2019-01-08
1h 08
The InfoQ Podcast
Building a Data Science Capability with Stephanie Yee, Matei Zaharia, Sid Anand and Soups Ranjan
In this podcast, recorded live at QCon.ai, Principal Technical Advisor & QCon Chair Wes Reisz and InfoQ Editor-in-chief Charles Humble chair a panel discussion with Stephanie Yee, data scientist at StitchFix, Matei Zaharia, professor of computer science at Stanford and chief scientist at Databricks, Sid Anand, chief data engineer at PayPal, and Soups Ranjan, director of data science at Coinbase. Why listen to this podcast: - Before you start putting a data science team together, make sure you have a business goal or question that you want to answer; If you have a specific question, like increasing lift on a...
2018-04-27
43 min
Software Engineering Daily
Spark and Streaming with Matei Zaharia
Apache Spark is a system for processing large data sets in parallel. The core abstraction of Spark is the resilient distributed dataset (RDD), a working set of data that sits in memory for fast, iterative processing. Matei Zaharia created Spark with two goals: to provide a composable, high-level set of APIs for performing distributed processing; and to provide a unified engine for running complete apps. High-level APIs like SparkSQL and MLlib enable developers to build ambitious applications quickly. A developer using SparkSQL can work interactively with a huge dataset, which is a significant improvement on...
2018-02-26
56 min
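The RDD abstraction is easiest to see in a few lines of PySpark: a dataset is parallelized across the cluster, transformed functionally, and cached in memory for reuse. A minimal word-count sketch (assumes a local pyspark installation):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# An RDD is a distributed working set that can be cached in memory,
# which is what makes iterative and interactive workloads fast in Spark.
lines = sc.parallelize(["spark makes big data simple",
                        "spark keeps data in memory"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b)   # runs in parallel per key
               .cache())                          # keep the RDD in memory
print(counts.collect())
sc.stop()
```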
Software Engineering Daily
Streaming Architecture with Tugdual Grall
Tugdual Grall is an engineer with MapR. In today’s episode, we explore use cases and architectural patterns for streaming analytics. Full disclosure: MapR is a sponsor of Software Engineering Daily. In past shows, we have covered data engineering in detail: we’ve looked at Uber’s streaming architecture, talked to Matei Zaharia about the basics of Apache Spark, and explored the history of Hadoop. To find all of our episodes about data engineering, download the Software Engineering Daily app for iOS or Android. These apps have all 650 of our episodes in a searchable format: we have recommenda...
2018-02-15
58 min
HRISTOS LUMINA LUMII | Ellen G. White
64. A Doomed People - HRISTOS LUMINA LUMII | Ellen G. White
HRISTOS LUMINA LUMII - by Ellen G. White. 64. A Doomed People. The triumphal entry of Jesus into Jerusalem was a faint foreshadowing of His coming on the clouds of heaven with power and glory, amid the triumph of the angels and the joy of the saints. Then the words of Jesus to the priests and Pharisees will be fulfilled: "You shall not see Me again until you say, 'Blessed is He who comes in the name of the Lord!'" (Matthew 23:39.) Zechariah had been shown that day of final triumph in prophetic vision; he had also seen the doom of those who...
2018-01-29
20 min
O'Reilly Radar
The State Of Machine Learning In Apache Spark
In this episode of the Data Show, we look back to a recent conversation I had at the Spark Summit in San Francisco with Ion Stoica (UC Berkeley professor and executive chairman of Databricks) and Matei Zaharia (assistant professor at Stanford and chief technologist of Databricks). Stoica and Zaharia were core members of UC Berkeley’s AMPLab, which originated Apache Spark, Apache Mesos, and Alluxio.
2017-09-14
21 min
Software Engineering Daily
Apache Spark Creator Matei Zaharia Interview
Matei Zaharia created Spark, and is the co-founder of Databricks, a company using Spark to power data science.
2015-08-03
56 min
a16z Podcast
a16z Podcast: A Conversation With the Inventor of Spark
One of the most active and fastest growing open source big data cluster computing projects is Apache Spark, which was originally developed at U.C. Berkeley's AMPLab and is now used by internet giants and other companies around the world, including, as announced most recently, IBM. In this Q&A with Spark inventor Matei Zaharia -- also the CTO and co-founder of Databricks (and a professor at MIT) -- on the heels of the recent Spark Summit, we cover the difference between Hadoop MapReduce and Spark; what are the ingredients of a successful open source project; and...
2015-06-24
19 min
PREDICI AUDIO
Palm Sunday Sermon 2013 VG
28 April 2013. "Isus Salvatorul" Church. Matthew 21:1-10, Zechariah 9:9. "Rejoice greatly, daughter of Zion! Shout for joy, daughter of Jerusalem! Behold, your King comes to you; He is righteous and victorious, humble and riding on a donkey, on a colt, the foal of a donkey." Today we celebrate the triumphal entry of Jesus into Jerusalem, known in popular tradition as Palm Sunday. On this occasion I would like to extend warm congratulations, and to wish you much peace and health, joy and happiness. This is the Sunday that...
2013-04-28
25 min