Showing episodes and shows of Jack Waudby
Shows
Disseminate: The Computer Science Research Podcast
Haralampos Gavriilidis | Fast and Scalable Data Transfer across Data Systems | #62
In this episode of Disseminate, we welcome Harry Gavriilidis back to the podcast to explore his latest research on fast and scalable data transfer across systems, soon to be presented at SIGMOD 2025. Building on his work with XDB, Harry introduces XDBC, a novel data transfer framework designed to balance performance and generalizability. They dive into the challenges of moving data across heterogeneous environments, ranging from cloud systems to IoT devices, and critique the limitations of current generic methods like JDBC and specialized point-to-point connectors. Harry walks us through the architecture of XDBC, which modularizes the data tran...
2025-06-16
56 min
Disseminate: The Computer Science Research Podcast
Haralampos Gavriilidis | SheetReader: Efficient spreadsheet parsing
In this episode of the DuckDB in Research series, Harry Gavriilidis (PhD student at TU Berlin) joins us to discuss SheetReader, a high-performance spreadsheet parser that dramatically outpaces traditional tools in both speed and memory efficiency. By taking advantage of the standardized structure of spreadsheet files and bypassing generic XML parsers, SheetReader delivers fast and lightweight parsing, even on large files. Now available as a DuckDB extension, it enables users to query spreadsheets directly with SQL and integrate them seamlessly into broader analytical workflows. Harry shares insights into the development process, performance benchmarks, and th...
2025-04-17
40 min
Disseminate: The Computer Science Research Podcast
Arjen P. de Vries | faiss: An extension for vector data & search
In this episode of the DuckDB in Research series, we’re joined by Arjen de Vries, Professor of Data Science at Radboud University. Arjen dives into his team’s development of a DuckDB extension for FAISS, a library originally developed at Facebook for efficient similarity search and vector operations. We explore the growing importance of embeddings and dense retrieval in modern information retrieval systems, and how DuckDB’s zero-copy architecture and tight integration with the Python ecosystem make it a compelling choice for managing large-scale vector data. Arjen shares insights into the technical challenges and architectural decisi...
2025-04-10
46 min
Disseminate: The Computer Science Research Podcast
David Justen | POLAR: Adaptive and non-invasive join order selection via plans of least resistance
In this episode, we sit down with David Justen to discuss his work on POLAR: Adaptive and Non-invasive Join Order Selection via Plans of Least Resistance, which was implemented in DuckDB. David shares his journey in the database space, insights into performance optimization, and the challenges of working with modern analytical workloads. We dive into the intricacies of query compilation, vectorized execution, and how DuckDB is shaping the future of in-memory databases. Tune in for a deep dive into database internals, industry trends, and what’s next for high-performance data processing! Links: VLDB 2024 Paper, David's Homepage
2025-04-03
51 min
Disseminate: The Computer Science Research Podcast
Daniël ten Wolde | DuckPGQ: A graph extension supporting SQL/PGQ
In this episode, we sit down with Daniël ten Wolde, a PhD researcher at CWI’s Database Architectures Group, to explore DuckPGQ, an extension to DuckDB that brings powerful graph querying capabilities to relational databases. Daniël shares his journey into database research, the motivations behind DuckPGQ, and how it simplifies working with graph data. We also dive into the technical challenges of implementing SQL Property Graph Queries (SQL/PGQ) in DuckDB, discuss performance benchmarks, and explore the future of DuckPGQ in graph analytics and machine learning. Tune in to learn how this cutting-edge extension is bridging the gap betwe...
2025-03-20
48 min
Disseminate: The Computer Science Research Podcast
Till Döhmen | DuckDQ: A Python library for data quality checks in ML pipelines
In this episode we kick off our DuckDB in Research series with Till Döhmen, a software engineer at MotherDuck, where he leads AI efforts. Till shares insights into DuckDQ, a Python library designed for efficient data quality validation in machine learning pipelines, leveraging DuckDB’s high-performance querying capabilities. We discuss the challenges of ensuring data integrity in ML workflows, the inefficiencies of existing solutions, and how DuckDQ provides a lightweight, drop-in replacement that seamlessly integrates with scikit-learn. Till also reflects on his research journey, the impact of DuckDB’s optimizations, and the future potential of data...
2025-03-13
58 min
Disseminate: The Computer Science Research Podcast
Disseminate x DuckDB Coming Soon...
Hey folks! We have been collaborating with everyone's favourite in-process SQL OLAP database management system, DuckDB, to bring you a new podcast series - the DuckDB in Research series! At Disseminate our mission is to bridge the gap between research and industry by exploring research that has a real-world impact. DuckDB embodies this synergy: decades of research underpin its design, and now it’s making waves in the research community as a platform for others to build on, which is what the series will focus on! Join us as we k...
2025-03-06
02 min
Disseminate: The Computer Science Research Podcast
High Impact in Databases with... Anastasia Ailamaki
In this High Impact in Databases episode we talk to Anastasia Ailamaki. Anastasia is a Professor of Computer and Communication Sciences at the École Polytechnique Fédérale de Lausanne (EPFL). Tune in to hear Anastasia's story! The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust. You can find Anastasia on: Homepage, Google Scholar, LinkedIn. Hosted on Acast. See acast.com/privacy for more information.
2025-03-03
46 min
Disseminate: The Computer Science Research Podcast
Anastasiia Kozar | Fault Tolerance Placement in the Internet of Things | #61
In this episode, we chat with Anastasiia Kozar about her research on fault tolerance in resource-constrained environments. As IoT applications leverage sensors, edge devices, and cloud infrastructure, ensuring system reliability at the edge poses unique challenges. Unlike the cloud, edge devices operate without persistent backups or high availability standards, leading to increased vulnerability to failures. Anastasiia explains how traditional methods fall short, as they fail to align resource allocation with fault tolerance needs, often resulting in system underperformance. To address this, Anastasiia introduces a novel resource-aware approach that combines operator placement and fault tolerance into a...
2024-12-16
49 min
Disseminate: The Computer Science Research Podcast
Liana Patel | ACORN: Performant and Predicate-Agnostic Hybrid Search | #60
In this episode, we chat with Liana Patel about ACORN, a groundbreaking method for hybrid search in applications using mixed-modality data. As more systems require simultaneous access to embedded images, text, video, and structured data, traditional search methods struggle to maintain efficiency and flexibility. Liana explains how ACORN, leveraging Hierarchical Navigable Small Worlds (HNSW), enables efficient, predicate-agnostic searches by introducing innovative predicate subgraph traversal. This allows ACORN to outperform existing methods significantly, supporting complex query semantics and achieving 2–1,000 times higher throughput on diverse datasets. Tune in to learn more! Links: ACORN: Performant and Pr...
2024-11-11
52 min
Disseminate: The Computer Science Research Podcast
High Impact in Databases with... David Maier
In this High Impact episode we talk to David Maier. David is the Maseeh Professor Emeritus of Emerging Technologies at Portland State University. Tune in to hear David's story and learn about some of his most impactful work. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust. You can find David on: Homepage, Google Scholar.
2024-11-04
1h 02
Disseminate: The Computer Science Research Podcast
Raunak Shah | R2D2: Reducing Redundancy and Duplication in Data Lakes | #59
In this episode, Raunak Shah joins us to discuss the critical issue of data redundancy in enterprise data lakes, which can lead to soaring storage and maintenance costs. Raunak highlights how large-scale data environments, ranging from terabytes to petabytes, often contain duplicate and redundant datasets that are difficult to manage. He introduces the concept of "dataset containment" and explains its significance in identifying and reducing redundancy at the table level in these massive data lakes, an area where there has been little prior work. Raunak then dives into the details of R2D2, a novel three-step hi...
2024-10-28
31 min
Disseminate: The Computer Science Research Podcast
High Impact in Databases with... Aditya Parameswaran
In this High Impact episode we talk to Aditya Parameswaran about some of his most impactful work. Aditya is an Associate Professor at the University of California, Berkeley. Tune in to hear Aditya's story! The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust. Links: EPIC Data Lab; Answering Queries using Humans, Algorithms and Databases (CIDR'11); Potter’s Wheel: An Interactive Data Cleaning System (VLDB'01); Online Aggregation (SIGMOD'97); Polaris: A System for Query, Analysis and Visualization of Mu...
2024-10-21
58 min
Disseminate: The Computer Science Research Podcast
Marco Costa | Taming Adversarial Queries with Optimal Range Filters | #58
In this episode, we sit down with Marco Costa to discuss the fascinating world of range filters, focusing on how they help optimize queries in databases by determining whether a range intersects with a given set of keys. Marco explains how traditional range filters, like Bloom filters, often result in high false positives and slow query times, especially when dealing with adversarial inputs where queries are correlated with the keys. He walks us through the limitations of existing heuristic-based solutions and the common challenges they face in maintaining accuracy and speed under such conditions. The highlight...
2024-10-14
37 min
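The question a range filter answers in the episode above, "does the range [lo, hi] contain any stored key?", can be stated exactly with a sorted list and a binary search. The sketch below is only that exact baseline, with hypothetical names and made-up keys; it is not the filter from the paper. A range filter aims to answer the same question approximately, using far less space than the full key set.

```python
from bisect import bisect_left

def range_intersects(sorted_keys, lo, hi):
    """Return True if some key k satisfies lo <= k <= hi (exact answer)."""
    # Find the first key that is >= lo; it is the only candidate to check.
    i = bisect_left(sorted_keys, lo)
    return i < len(sorted_keys) and sorted_keys[i] <= hi

keys = [3, 10, 25, 40]                 # illustrative keys, must be sorted
empty = range_intersects(keys, 11, 24) # no key falls inside [11, 24]
hit = range_intersects(keys, 10, 10)   # the point range [10, 10] hits key 10
```

A filter may report an empty range as non-empty (a false positive) but must not report a non-empty range as empty, which is why the exact check is a useful reference when reasoning about false-positive rates.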
Disseminate: The Computer Science Research Podcast
High Impact in Databases with... Ali Dasdan
In this High Impact episode we talk to Ali Dasdan, CTO at Zoominfo. Tune in to hear Ali's story and learn about some of his most impactful work such as his work on "Map-Reduce-Merge". The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust. Materials mentioned on this episode: Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters (SIGMOD'07); The Art of Doing Science and Engineering: Learning to Learn, Richard Hamming; How to Solve It, George Polya; Systems Architecting: Creating & Building Complex Systems...
2024-10-08
1h 03
Disseminate: The Computer Science Research Podcast
Matt Perron | Analytical Workload Cost and Performance Stability With Elastic Pools | #57
In this episode, we dive deep into the complexities of managing analytical query workloads with our guest, Matt Perron. Matt explains how the rapid and unpredictable fluctuations in resource demands present a significant challenge for provisioning. Traditional methods often lead to either over-provisioning, resulting in excessive costs, or under-provisioning, which causes poor query latency during demand spikes. However, there's a promising solution on the horizon. Matt shares insights from recent research that showcases the viability of using cloud functions to dynamically match compute supply with workload demand without the need for prior resource provisioning. While effective for low query...
2024-07-22
52 min
Disseminate: The Computer Science Research Podcast
High Impact in Databases with... Andreas Kipf
In this High Impact episode we talk to Andreas Kipf about his work on "Learned Cardinalities". Andreas is the Professor of Data Systems at Technische Universität Nürnberg (UTN). Tune in to hear Andreas's story and learn about some of his most impactful work. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust. Papers mentioned on this episode: Learned Cardinalities: Estimating Correlated Joins with Deep Learning (CIDR'19); The Case for Learned Index Structures (SIGMOD'18); Adaptive Op...
2024-07-15
53 min
Disseminate: The Computer Science Research Podcast
Marvin Wyrich & Justus Bogner | How Software Engineering Research Is Discussed on LinkedIn | #56
In this episode, we delve into the intersection of software engineering (SE) research and professional practice with experts Marvin Wyrich and Justus Bogner. As LinkedIn stands as the largest professional network globally, it serves as a critical platform for bridging the gap between SE researchers and practitioners. Marvin and Justus explore the dynamics of how research findings are shared and discussed on LinkedIn, providing both quantitative and qualitative insights into the effectiveness of these interactions. They reveal that a significant portion of SE research posts on LinkedIn are authored by individuals outside the original research team and that a...
2024-07-08
47 min
Disseminate: The Computer Science Research Podcast
High Impact in Databases with... Joe Hellerstein
In this High Impact episode we talk to Joe Hellerstein. Joe is the Jim Gray Professor of Computer Science at UC Berkeley. Tune in to hear Joe's story and learn about some of his most impactful work. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust.
2024-07-01
52 min
Disseminate: The Computer Science Research Podcast
Harry Goldstein | Property-Based Testing | #55
In this episode, we chat with Harry Goldstein about Property-Based Testing (PBT). Harry shares insights from interviews with PBT users at Jane Street, highlighting PBT's strengths in testing complex code and boosting developer confidence. Harry also discusses the challenges of writing properties and generating random data, and the difficulties in assessing test effectiveness. He identifies key areas for future improvement, such as performance enhancements and better random input generation. This episode is essential for those interested in the latest developments in software testing and PBT's future. Links: ICSE'24 Paper, Harry's website, X: @hgoldstein95 ...
2024-06-25
49 min
Disseminate: The Computer Science Research Podcast
High Impact in Databases with... Raghu Ramakrishnan
In this High Impact episode we talk to Raghu Ramakrishnan. Raghu is CTO for Data and a Technical Fellow at Microsoft. Tune in to hear Raghu's story and learn about some of his most impactful work. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust.
2024-06-17
23 min
Disseminate: The Computer Science Research Podcast
Gina Yuan | In-Network Assistance With Sidekick Protocols | #54
Join us as we chat with Gina Yuan about her pioneering work on sidekick protocols, designed to enhance the performance of encrypted transport protocols like QUIC and WebRTC. These protocols ensure privacy but limit in-network innovations. Gina explains how sidekick protocols allow intermediaries to assist endpoints without compromising encryption. Discover how Gina tackles the challenge of referencing opaque packets with her innovative quACK tool and learn about the real-world benefits, including improved Wi-Fi retransmissions, energy-saving proxy acknowledgments, and the PACUBIC congestion-control mechanism. This episode offers a glimpse into the future of network performance and security.
2024-06-10
55 min
Disseminate: The Computer Science Research Podcast
High Impact in Databases with... Moshe Vardi
Welcome to another episode of the High Impact series - today we talk with Moshe Vardi! Moshe is the Karen George Distinguished Service Professor in Computational Engineering at Rice University where his research focuses on automated reasoning. Tune in to hear Moshe's story and learn about some of his most impactful work. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source temporal graph analytics engine for Python and Rust. You can find Moshe on X, LinkedIn, and Mastodon @vardi. Links to all his work can...
2024-06-03
47 min
Disseminate: The Computer Science Research Podcast
Tammy Sukprasert | Move Your Workloads To Sweden! | #53
In this episode, we dip our toes into the world of sustainable computing and interview Tammy Sukprasert about her research on reducing carbon emissions in cloud computing through workload scheduling. Tammy explores the concept of shifting cloud workloads across different times and locations to coincide with low-carbon energy availability. Unlike previous studies that focused on specific regions or workloads, her comprehensive analysis uses carbon intensity data from 123 regions to assess both batch and interactive workloads. She considers various factors such as job duration, deadlines, and service level objectives (SLOs). Tammy's findings reveal that while spatiotemporal workload shifting can reduce...
2024-05-27
32 min
Disseminate: The Computer Science Research Podcast
High Impact in Databases with... Ryan Marcus
Welcome to the first episode of the High Impact series! The High Impact series is inspired by a blog post “Most Influential Database Papers” by Ryan Marcus and today we talk to Ryan! Tune in to hear about Ryan's story so far. We chat about his current work before moving on to discuss his most impactful work. We also dig into what motivates him and how he handles setbacks, as well as getting his take on the current trends. The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open source te...
2024-05-20
59 min
Disseminate: The Computer Science Research Podcast
Yazhuo Zhang | SIEVE is Simpler than LRU | #52
In this episode, we explore the world of caching with Yazhuo Zhang, who introduces the game-changing SIEVE algorithm. Traditional eviction algorithms have long struggled with a trade-off between efficiency, throughput, and simplicity. However, SIEVE disrupts this balance by offering a simpler alternative to LRU while outperforming state-of-the-art algorithms in both efficiency and scalability for web cache workloads. Implemented in five production cache libraries with minimal code changes, SIEVE's superiority shines through in a comprehensive evaluation across 1559 cache traces. With up to a remarkable 63.2% lower miss ratio than ARC and surpassing nine other algorithms in over 45% of cases, SIEVE's simplicity...
2024-05-13
43 min
Disseminate: The Computer Science Research Podcast
Introducing the High Impact Series...
Introducing the High Impact Series! Hey folks, we have a new series coming soon inspired by a blog post “Most Influential Database Papers” by Ryan Marcus. The series will feature interviews with the authors of some of the most impactful work in the field of databases. We will talk about the story behind some of their most impactful work, getting them to reflect on the impact it has had over the years, as well as getting their take on the current trends in the field. Proudly sponsored by Pometry...
2024-05-06
02 min
Disseminate: The Computer Science Research Podcast
Eleni Zapridou | Oligolithic Cross-task Optimizations across Isolated Workloads | #51
In this episode, we talk to Eleni Zapridou and delve into the challenges of data processing within enterprises, where multiple applications operate concurrently on shared resources. Traditional resource boundaries between applications often lead to increased costs and resource consumption. However, as Eleni explains, the principle of functional isolation offers a solution by combining cross-task optimizations with performance isolation. We explore GroupShare, an innovative strategy that reduces CPU consumption and query latency, transforming data processing efficiency. Join us as we discuss the implications of functional isolation with Eleni and its potential to revolutionize enterprise data processing.
2024-04-29
38 min
Disseminate: The Computer Science Research Podcast
Pat Helland | Scalable OLTP in the Cloud: What’s the BIG DEAL? | #50
In this thought-provoking podcast episode, we dive into the world of scalable OLTP (OnLine Transaction Processing) systems with the insightful Pat Helland. As a seasoned expert in the field, Pat shares his insights on the critical role of isolation semantics in the scalability of OLTP systems, emphasizing its significance as the "BIG DEAL." By examining the interface between OLTP databases and applications, particularly through the lens of RCSI (READ COMMITTED SNAPSHOT ISOLATION) SQL databases, Pat talks about the limitations imposed by current database architectures and application patterns on scalability. Through a compelling thought experiment, Pat explores...
2024-04-15
1h 20
Disseminate: The Computer Science Research Podcast
Rui Liu | Towards Resource-adaptive Query Execution in Cloud Native Databases | #49
In this episode, we talk to Rui Liu and explore the transformative potential of Ratchet, a groundbreaking resource-adaptive query execution framework. We delve into the challenges posed by ephemeral resources in modern cloud environments and the innovative solutions offered by Ratchet. Rui guides us through the intricacies of Ratchet's design, highlighting its ability to enable adaptive query suspension and resumption, sophisticated resource arbitration for diverse workloads, and a fine-grained pricing model to navigate fluctuating resource availability. Join us as we uncover the future of cloud-native databases and workloads, and discover how Ratchet is poised to revolutionize the way we...
2024-04-01
53 min
Disseminate: The Computer Science Research Podcast
Yifei Yang | Predicate Transfer: Efficient Pre-Filtering on Multi-Join Queries | #48
In this episode, Yifei Yang introduces predicate transfer, a revolutionary method for optimizing join performance in databases. Predicate transfer builds on Bloom joins, extending their benefits to multi-table joins. Inspired by Yannakakis's theoretical insights, predicate transfer leverages Bloom filters to achieve significant speed improvements. Yang's evaluation shows an average 3.3× performance boost over Bloom join on the TPC-H benchmark, highlighting the potential of predicate transfer to revolutionize database query optimization. Join us as we explore the transformative impact of predicate transfer on database operations. Links: CIDR'24 Paper, Yifei's LinkedIn, Buy Me A Coffee, Listener Survey...
2024-03-18
47 min
Disseminate: The Computer Science Research Podcast
Vikramank Singh | Panda: Performance Debugging for Databases using LLM Agents | #47
In this episode, Vikramank Singh introduces the Panda framework, aimed at refining Large Language Models' (LLMs) capability to address database performance issues. Vikramank elaborates on Panda's four components (Grounding, Verification, Affordance, and Feedback), illustrating how they collaborate to contextualize LLM responses and deliver actionable recommendations. By bridging the divide between technical knowledge and practical troubleshooting needs, Panda has the potential to revolutionize database debugging practices, offering a promising avenue for more effective and efficient resolution of performance challenges in database systems. Tune in to learn more! Links: CIDR'24 Paper, Vikramank's LinkedIn...
2024-03-04
1h 08
Disseminate: The Computer Science Research Podcast
Tamer Eldeeb | Chablis: Fast and General Transactions in Geo-Distributed Systems | #46
In this episode, Tamer Eldeeb sheds light on the challenges faced by geo-distributed database management systems (DBMSes) in supporting strictly-serializable transactions across multiple regions. He discusses the compromises often made between low-latency regional writes and restricted programming models in existing DBMS solutions. Tamer introduces Chablis, a groundbreaking geo-distributed, multi-versioned transactional key-value store designed to overcome these limitations. Chablis offers a general interface accommodating range and point reads, along with writes within multi-step strictly-serializable ACID transactions. Leveraging advancements in low-latency datacenter networks and innovative DBMS designs, Chablis eliminates the need for compromises, ensuring fast read-write transactions with low...
2024-02-12
1h 02
Disseminate: The Computer Science Research Podcast
Matt Butrovich | Tigger: A Database Proxy That Bounces With User-Bypass | #45
Summary: In this episode, we chat to Matt Butrovich about his research on database proxies. We discuss the inefficiencies of traditional database proxies, which operate in user-space, causing overhead due to buffer copying and system calls. Matt introduces "user-bypass" which leverages Linux's eBPF infrastructure to move application logic into kernel-space. Matt then tells us about Tigger, a PostgreSQL-compatible DBMS proxy, showcasing user-bypass benefits. Tune in to hear about the experiments that demonstrate how Tigger can achieve up to a 29% reduction in transaction latencies and a 42% reduction in CPU utilization compared to other widely-used proxies.
2023-12-18
1h 03
Disseminate: The Computer Science Research Podcast
Gábor Szárnyas | The LDBC Social Network Benchmark: Business Intelligence Workload | #44
Summary: In this episode, Gábor Szárnyas takes us on a journey through the LDBC Social Network Benchmark's Business Intelligence workload (SNB BI). Developed through collaboration between academia and industry, the SNB BI is a comprehensive graph OLAP benchmark. It pushes the boundaries of synthetic and scalable analytical database benchmarks, featuring a sophisticated data generator and a temporal graph with small-world phenomena. The benchmark's query workload, rooted in LDBC's innovative design methodology, aims to drive future technical advancements in graph database systems. Gábor highlights SNB BI's unique features, including the adoption of "parameter curation" for stable qu...
2023-12-04
46 min
Disseminate: The Computer Science Research Podcast
Thaleia Doudali | Is Machine Learning Necessary for Cloud Resource Usage Forecasting? | #43
Summary: In this week's episode, we talk with Thaleia Doudali and explore the realm of cloud resource forecasting, focusing on the use of Long Short Term Memory (LSTM) neural networks, a popular machine learning model. Drawing from her research, Thaleia discusses the surprising discovery that, despite the complexity of ML models, accurate predictions often boil down to a simple shift of values by one time step. The discussion explores the nuances of time series data, encompassing resource metrics like CPU, memory, network, and disk I/O across different cloud providers and levels. Thaleia highlights...
2023-11-20
49 min
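The "shift of values by one time step" mentioned in the episode above is what is commonly called a persistence (or naive) forecast: each prediction is simply the previous observation. The sketch below illustrates that baseline only; the function names and the CPU values are made up for illustration and are not taken from the episode or its paper.

```python
def persistence_forecast(series):
    """Predict each next value as the last observed value (shift by one step)."""
    # The forecast for step t is the observation at step t - 1.
    return series[:-1]

def mean_absolute_error(actual, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

cpu_usage = [40.0, 42.0, 41.0, 45.0, 44.0]   # illustrative utilisation values
predicted = persistence_forecast(cpu_usage)   # forecasts for steps 1..4
actual = cpu_usage[1:]                        # what was observed at steps 1..4
mae = mean_absolute_error(actual, predicted)
```

Comparing a learned model's error against a baseline like this one is a cheap way to check whether the model has learned more than copying the latest value.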
Disseminate: The Computer Science Research Podcast
Jinkun Geng | Nezha: Deployable and High-Performance Consensus Using Synchronized Clocks | #42
Summary: In this episode Jinkun Geng talks to us about Nezha, a high-performance consensus protocol. Nezha can be deployed by cloud tenants without support from cloud providers. Nezha bridges the gap between protocols such as MultiPaxos and Raft, which can be readily deployed, and protocols such as NOPaxos and Speculative Paxos, which provide better performance but require access to technologies such as programmable switches and in-network prioritization, which cloud tenants do not have. Tune in to learn more! Links: Jinkun's Homepage, Nezha VLDB'23 Paper, Nezha GitLab Repo...
2023-10-23
55 min
Disseminate: The Computer Science Research Podcast
Dimitris Koutsoukos | NVM: Is it Not Very Meaningful for Databases? | #41
Summary: In this episode, Dimitris Koutsoukos talks to us about Persistent or Non Volatile Memory (PMEM) and we answer the question: Is it Not Very Meaningful for Databases? PMEM offers expanded memory capacity and faster access to persistent storage. However, before Dimitris's work there was no comprehensive empirical analysis of existing database engines under different PMEM modes to understand how databases can benefit from the various hardware configurations. Dimitris and his colleagues analyzed multiple different engines under common benchmarks with PMEM in AppDirect mode and Memory mode - tune in to h...
2023-10-09
48 min
Disseminate: The Computer Science Research Podcast
Mohamed Alzayat | Groundhog: Efficient Request Isolation in FaaS | #40
Summary: Security is a core responsibility for Function-as-a-Service (FaaS) providers. The prevailing approach has each function execute in its own container to isolate concurrent executions of different functions. However, successive invocations of the same function commonly reuse the runtime state of a previous invocation in order to avoid container cold-start delays when invoking a function. Although efficient, this container reuse has security implications for functions that are invoked on behalf of differently privileged users or administrative domains: bugs in a function’s implementation, third-party library, or the language runtime may leak private data from on...
2023-09-11
42 min
Disseminate: The Computer Science Research Podcast
Cuong Nguyen | Detock: High Performance Multi-region Transactions at Scale | #39
Summary: In this episode Cuong Nguyen tells us about Detock, a geographically replicated database system. Tune in to learn about its specialised concurrency control and deadlock resolution protocols, which enable processing strictly-serializable multi-region transactions with near-zero performance degradation at extremely high conflict and improve latency by up to a factor of 5. Links: SIGMOD Paper, Detock GitHub Repo, Cuong's Homepage
2023-08-28
37 min
Disseminate: The Computer Science Research Podcast
Bogdan Stoica | WAFFLE: Exposing Memory Ordering Bugs Efficiently with Active Delay Injection | #38
Concurrency bugs are difficult to detect, reproduce, and diagnose, as they manifest under rare timing conditions. Recently, active delay injection has proven efficient for exposing one such type of bug (thread-safety violations) with low overhead, high coverage, and minimal code analysis. However, how to efficiently apply active delay injection to broader classes of concurrency bugs is still an open question. In this episode, Bogdan Stoica tells us how he answered this question by focusing on MemOrder bugs, a type of concurrency bug caused by incorrect timing between a memory access to a particular object and the object’s initi...
2023-08-14
55 min
Disseminate: The Computer Science Research Podcast
Roger Waleffe | MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks | #37
Summary: In this episode, Roger Waleffe talks about Graph Neural Networks (GNNs) for large-scale graphs. Specifically, he reveals all about MariusGNN, the first system that utilises the entire storage hierarchy (including disk) for GNN training. Tune in to find out how MariusGNN works and just how fast it goes (and how much more cost-efficient it is!) Links: Marius Project, Roger's Homepage, Roger's Twitter, EuroSys'23 Paper, Support the podcast through Buy Me a Coffee
2023-07-31
1h 13
Disseminate: The Computer Science Research Podcast
Madelon Hulsebos | GitTables: A Large-Scale Corpus of Relational Tables | #36
Summary: The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need resources with tables that resemble relational database tables. In this episode, Madelon Hulsebos tells us all about such a resource! Tune in to learn more about GitTables! Links: Madelon's website, GitTables homepage, SIGMOD'23 paper
2023-07-17
45 min
Disseminate: The Computer Science Research Podcast
Tarikul Islam Papon | ACEing the Bufferpool Management Paradigm for Modern Storage Devices | #35
Summary: Compared to hard disk drives (HDDs), solid-state drives (SSDs) have two fundamentally different properties: (i) read/write asymmetry (writes are slower than reads) and (ii) access concurrency (multiple I/Os can be executed in parallel to saturate the device bandwidth). However, database operators are often designed without considering storage asymmetry and concurrency, resulting in device underutilization. In this episode, Tarikul Islam Papon tells us about his work on a new Asymmetry & Concurrency aware bufferpool management (ACE) that batches writes based on device concurrency and performs them in parallel to amortize the asymmetric write cost. Tune...
2023-06-20
47 min
Disseminate: The Computer Science Research Podcast
Jian Zhang | VIPER: A Fast Snapshot Isolation Checker | #34
Summary: Snapshot isolation is supported by most commercial databases and is widely used by applications. However, checking whether a database ensures snapshot isolation for a given set of transactions is either slow or gives up soundness. In this episode, Jian Zhang tells us about VIPER, an SI checker that is sound, complete, and fast. Tune in to learn more! Links: Paper, GitHub repo, Jian's homepage
2023-06-09
42 min
Disseminate: The Computer Science Research Podcast
Ahmed Sayed | REFL: Resource Efficient Federated Learning | #33
Summary: Federated Learning (FL) enables distributed training by learners using local data, thereby enhancing privacy and reducing communication. However, it presents numerous challenges relating to the heterogeneity of the data distribution, device capabilities, and participant availability as deployments scale, which can impact both model convergence and bias. Existing FL schemes use random participant selection to improve fairness; however, this can result in inefficient use of resources and lower quality training. In this episode, Ahmed Sayed talks about how he and his colleagues address the question of resource efficiency in FL. He talks about the...
2023-05-26
58 min
Disseminate: The Computer Science Research Podcast
Subhadeep Sarkar | Log-structured Merge Trees | #32
Summary: Log-structured merge (LSM) trees have emerged as one of the most commonly used storage-based data structures in modern data systems as they offer high throughput for writes and good utilization of storage space. In this episode, Subhadeep Sarkar presents the fundamental principles of the LSM paradigm. He tells us about recent research on improving write performance and the various optimization techniques and hybrid designs adopted by LSM engines to accelerate reads. Tune in to find out more! Links: Personal website, ICDE'23 tutorial, LinkedIn
2023-05-11
59 min
Disseminate: The Computer Science Research Podcast
Andra Ionescu | Topio: The Geodata Marketplace | #31
Summary: The increasing need for data trading across businesses nowadays has created a demand for data marketplaces. However, despite the intentions of both data providers and consumers, today’s data marketplaces remain mere data catalogs. In this episode, Andra tells us about her vision for marketplaces of the future, which require a set of value-added services, such as advanced search and discovery. She also tells us about her and her team's effort to engineer and develop an open-source modular data market platform to enable both entrepreneurs and researchers to set up and experiment with data marketplaces. Tune in to...
2023-04-25
46 min
Disseminate: The Computer Science Research Podcast
Laurens Kuiper | These Rows Are Made For Sorting | #30
Summary: Sorting is one of the most well-studied problems in computer science and a vital operation for relational database systems. Despite this, little research has been published on implementing an efficient relational sorting operator. In this episode, Laurens Kuiper tells us about his work filling this gap! Tune in to hear about a set of micro-benchmarks that explore how to sort relational data efficiently for analytical database systems, taking into account different query execution engines as well as row and columnar data formats. Laurens also tells us about his implementation of a highly optimized row-based sorting approach in the...
2023-04-12
55 min
Disseminate: The Computer Science Research Podcast
Semih Salihoğlu | Kùzu Graph Database Management System | #29
Summary: In this episode Semih Salihoğlu tells us about Kùzu, an in-process property graph database management system built for query speed and scalability. Listen to hear the vision for Kùzu and to learn more about Kùzu's factorized query processor! Links: Kùzu GitHub repo, CIDR paper, contact@kuzudb.com, Kùzu Slack, Kùzu Twitter, Kùzu Website (blog posts Semih mentioned can be found here), Semih's Homepage, Semih's Twitter
2023-04-03
1h 17
Disseminate: The Computer Science Research Podcast
Lukas Vogel | Data Pipes: Declarative Control over Data Movement | #28
Summary: Today’s storage landscape offers a deep and heterogeneous stack of technologies that promises to meet even the most demanding data-intensive workload needs. The diversity of technologies, however, presents a challenge. Parts of it are not controlled directly by the application, e.g., the cache layers, and the parts that are controlled often require the programmer to deal with very different transfer mechanisms, such as disk and network APIs. Combining these different abstractions properly requires great skill, and even so, expert-written programs can lead to sub-optimal utilization of the storage stack and unpredictable performance. In...
2023-03-28
50 min
Disseminate: The Computer Science Research Podcast
Haralampos Gavriilidis | In-Situ Cross-Database Query Processing | #27
Summary: Today’s organizations utilize a plethora of heterogeneous and autonomous DBMSes, many of those being spread across different geo-locations. It is therefore crucial to have effective and efficient cross-database query processing capabilities. In this episode, Haralampos Gavriilidis tells us about XDB, an efficient middleware system that runs cross-database analytics over existing DBMSes. Tune in to learn more! Links: Preprint, Haralampos's homepage, Support the podcast here!
2023-03-20
1h 00
Disseminate: The Computer Science Research Podcast
Paras Jain & Sarah Wooders | Skyplane: Fast Data Transfers Between Any Cloud | #26
Summary: This week Paras Jain and Sarah Wooders tell us how you can quickly transfer data between any cloud with Skyplane. Tune in to learn more! Links: Skyplane homepage, Sarah's homepage, Paras's homepage, Support the podcast here
2023-03-13
46 min
Disseminate: The Computer Science Research Podcast
Yang Wang | Rethinking Concurrency Control in Databases | #25
Summary: Many database applications execute transactions under a weaker isolation level, such as READ COMMITTED. This often leads to concurrency bugs that look like race conditions in multi-threaded programs. While this problem is well known, philosophies of how to address this problem vary a lot, ranging from making a SERIALIZABLE database faster to living with weaker isolation and the consequence of concurrency bugs. In this episode, Yang talks about the consequences of these bugs, the root causes, and how developers have fixed 93 real-world concurrency bugs in database applications. Whose responsibility is it to prevent these bugs from...
2023-03-06
55 min
Disseminate: The Computer Science Research Podcast
Suyash Gupta | Chemistry behind Agreement | #24
Summary: Agreement protocols have been extensively used by distributed data management systems to provide robustness and high availability. The broad spectrum of design dimensions, applications, and fault models has resulted in different flavours of agreement protocols. This has made it hard to argue their correctness and has unintentionally created a disparity in understanding their design. In this episode, Suyash Gupta tells us about a unified framework that simplifies expressing different agreement protocols. Listen to find out more! Links: Paper, Website, Twitter
2023-02-27
1h 03
Disseminate: The Computer Science Research Podcast
Tobias Ziegler | Is Scalable OLTP in the Cloud a Solved Problem? | #23
Summary: Many distributed cloud OLTP databases have settled on a shared-storage design coupled with a single writer. This design choice is remarkable since conventional wisdom promotes using a shared-nothing architecture for building scalable systems. In this episode, Tobias revisits the question of what a scalable OLTP design for the cloud should look like by analysing the data access behaviour of different systems. Tune in to find out more! Links: Paper, Website, Email, Twitter, Google Scholar
2023-02-20
55 min
Disseminate: The Computer Science Research Podcast
Hamish Nicholson | HetCache: Synergising NVMe Storage and GPU acceleration for Memory-Efficient Analytics | #22
Summary: In this episode, Hamish Nicholson tells us about HetCache, a storage engine for analytical workloads that optimizes the data access paths and tunes data placement by co-optimizing for the combinations of different memories, compute devices, and queries. Specifically, he presents how the increasingly complex storage hierarchy impacts analytical query processing in GPU-NVMe-accelerated servers. HetCache accelerates analytics on CPU-GPU servers for larger-than-memory datasets through proportional and access-path-aware data placement. Tune in to hear more! Links: Paper, Personal website, LinkedIn, Twitter
2023-02-13
50 min
Disseminate: The Computer Science Research Podcast
Immanuel Haffner | mutable: A Modern DBMS for Research and Fast Prototyping | #21
Summary: Few, if any, DBMSs provide extensibility together with implementations of modern concepts, such as query compilation. This is an impeding factor in academic research. In this episode, Immanuel Haffner presents mutable, a system that is fitted to academic research and education. mutable features a modular design, where individual components can be composed to form a complete system. Check out the episode to learn more! Links: Paper, Website, mutable GitHub repo, Bobby Tables xkcd
2023-02-06
1h 28
Disseminate: The Computer Science Research Podcast
Konstantinos Kallas | Practically Correct, Just-in-Time Shell Script Parallelization | #20
Summary: Recent shell-script parallelization systems enjoy mostly automated speedups by parallelizing scripts ahead-of-time. Unfortunately, such static parallelization is hampered by dynamic behavior pervasive in shell scripts—e.g., variable expansion and command substitution—which often requires reasoning about the current state of the shell and filesystem. Tune in to hear how Konstantinos Kallas and his colleagues overcame this issue (and others) with PaSH-JIT, a just-in-time (JIT) shell-script compiler! Links: OSDI paper, Personal website, Twitter, LinkedIn, PaSH homepage (you can find all associated papers here)
2023-01-30
57 min
Disseminate: The Computer Science Research Podcast
Vasily Sartakov | CAP-VMs: Capability-Based Isolation and Sharing in the Cloud | #19
Summary: Cloud stacks must isolate application components, while permitting efficient data sharing between components deployed on the same physical host. Traditionally, the memory management unit (MMU) enforces isolation and permits sharing at page granularity. MMU approaches, however, lead to cloud stacks with large trusted computing bases in kernel space, and page granularity requires inefficient OS interfaces for data sharing. Forthcoming CPUs with hardware support for memory capabilities offer new opportunities to implement isolation and sharing at a finer granularity. In this episode, Vasily talks about his work on cVMs, a new VM-like abstraction that uses memory capabilities to...
2023-01-23
36 min
Disseminate: The Computer Science Research Podcast
Haoran Ma | MemLiner: Lining up Tracing and Application for a Far-Memory-Friendly Runtime | #18
Summary: Far-memory techniques that enable applications to use remote memory are increasingly appealing in modern data centers, supporting applications’ large memory footprint and improving machines’ resource utilization. In this episode Haoran Ma tells us about the problems with current far-memory techniques and how they focus on OS-level optimizations and are agnostic to managed runtimes and garbage collection (GC) underneath applications written in high-level languages. Owing to different object-access patterns from applications, GC can severely interfere with existing far-memory techniques, breaking remote memory prefetching algorithms and causing severe local-memory misses. To address this Haoran and his colleagues deve...
2023-01-16
44 min
Disseminate: The Computer Science Research Podcast
Lexiang Huang | Metastable Failures in the Wild | #17
Summary: In this episode Lexiang Huang talks about a framework for understanding a class of failures in distributed systems called metastable failures. Lexiang tells us about his study on the prevalence of such failures in the wild and how he and his colleagues scoured publicly available incident reports from many organizations, ranging from hyperscalers to small companies. Listen to the episode to find out about his main findings and gain a deeper understanding of metastable failures and how you can identify, prevent, and mitigate them! Links: OSDI paper and t...
2023-01-09
53 min
Disseminate: The Computer Science Research Podcast
Andrew Quinn | Debugging the OmniTable Way | #16
Summary: Debugging is time-consuming, accounting for roughly 50% of a developer's time. In this episode Andrew Quinn tells us about the OmniTable, an abstraction that captures all execution state as a large queryable data table. In his research Andrew has built a query model around an OmniTable that supports SQL to simplify debugging. An OmniTable decouples debugging logic from the original execution, which SteamDrill, Andrew's prototype, uses to reduce the performance overhead of debugging (SteamDrill queries are an order-of-magnitude faster than existing debugging tools). Links: Andrew's Homepage, Debugging the OmniTable Way OSDI'22...
2023-01-02
57 min
Disseminate: The Computer Science Research Podcast
Audrey Cheng | TAOBench: An End-to-End Benchmark for Social Network Workloads | #15
Summary: This episode features Audrey Cheng talking about TAOBench, a new benchmark that captures the social graph workload at Meta. Audrey tells us about the features of the workload, how it compares with other benchmarks, and how it fills a gap in the existing space of benchmarks. Also, we hear all about the fantastic real-world impact the benchmark has already had across a range of companies. Links: Paper, Personal website, Meta blog post, GitHub repo
2022-12-12
52 min
Disseminate: The Computer Science Research Podcast
George Konstantinidis | Enabling Personal Consent in Databases | #14
Summary: Users have the right to consent to the use of their data, but current methods are limited to very coarse-grained expressions of consent, such as “opt-in/opt-out” choices for certain uses. In this episode, George talks about how he and his group identified the need for fine-grained consent management and how they formalized how to express and manage user consent and personal contracts of data usage in relational databases. Their approach enables data owners to express the intended data usage in formal specifications, called consent constraints, and enables a service provider that wants to honor these constraints, to a...
2022-12-05
55 min
Disseminate: The Computer Science Research Podcast
Per Fuchs | Sortledton: a Universal, Transactional Graph Data Structure | #13
Summary (VLDB abstract): Despite the wide adoption of graph processing across many different application domains, there is no underlying data structure that can serve a variety of graph workloads (analytics, traversals, and pattern matching) on dynamic graphs with transactional updates. In this episode, Per talks about Sortledton, a universal graph data structure that addresses the open problem by being carefully optimized for the most relevant data access patterns used by graph computation kernels. It can support millions of transactional updates per second, while providing competitive performance (1.22x on average) for the most common graph...
2022-11-28
41 min
Disseminate: The Computer Science Research Podcast
George Theodorakis | Scabbard: Single-Node Fault-Tolerant Stream Processing | #12
Summary (VLDB abstract): Single-node multi-core stream processing engines (SPEs) can process hundreds of millions of tuples per second. Yet making them fault-tolerant with exactly-once semantics while retaining this performance is an open challenge: due to the limited I/O bandwidth of a single node, it becomes infeasible to persist all stream data and operator state during execution. Instead, single-node SPEs rely on upstream distributed systems, such as Apache Kafka, to recover stream data after failure, necessitating complex cluster-based deployments. This lack of built-in fault-tolerance features has hindered the adoption of single-node SPEs. We describe Scabbard, the first single-node...
2022-11-21
45 min
Disseminate: The Computer Science Research Podcast
Kevin Gaffney | SQLite: Past, Present, and Future | #11
Summary: In this episode Kevin Gaffney tells us about SQLite, the most widely deployed database engine in existence. SQLite is found in nearly every smartphone, computer, web browser, television, and automobile. Several factors are likely responsible for its ubiquity, including its in-process design, standalone codebase, extensive test suite, and cross-platform file format. While it supports complex analytical queries, SQLite is primarily designed for fast online transaction processing (OLTP), employing row-oriented execution and a B-tree storage format. However, fueled by the rise of edge computing and data science, there is a growing need for efficient in-process online analytical...
2022-11-14
48 min
Disseminate: The Computer Science Research Podcast
Matthias Jasny | P4DB - The Case for In-Network OLTP | #10
Summary: In this episode Matthias Jasny from TU Darmstadt talks about P4DB, a database that uses a programmable switch to accelerate OLTP workloads. The main idea of P4DB is that it implements a transaction processing engine on top of a P4-programmable switch. The switch can thus act as an accelerator in the network, especially when it is used to store and process hot (contended) tuples on the switch. P4DB provides significant benefits compared to traditional DBMS architectures and can achieve a speedup of up to 8x. Questions:
2022-08-08
27 min
Disseminate: The Computer Science Research Podcast
Tobias Ziegler | ScaleStore: A Fast and Cost-Efficient Storage Engine using DRAM, NVMe, and RDMA | #9
Summary: In this episode Tobias talks about his work on ScaleStore, a distributed storage engine that exploits DRAM caching, NVMe storage, and RDMA networking to achieve high performance, cost-efficiency, and scalability. Using low latency RDMA messages, ScaleStore implements a transparent memory abstraction that provides access to the aggregated DRAM memory and NVMe storage of all nodes. In contrast to existing distributed RDMA designs such as NAM-DB or FaRM, ScaleStore stores cold data on NVMe SSDs (flash), lowering the overall hardware cost significantly. At the heart of ScaleStore is a distributed caching st...
2022-08-01
23 min
Disseminate: The Computer Science Research Podcast
Chuzhe Tang | Ad Hoc Transactions in Web Applications: The Good, the Bad, and the Ugly | #8
Summary: Many transactions in web applications are constructed ad hoc in the application code. For example, developers might explicitly use locking primitives or validation procedures to coordinate critical code fragments. In this episode, Chuzhe tells us about these ad hoc transactions: database operations coordinated by application code. Until Chuzhe’s work, little was known about them. He chats about the first comprehensive study on ad hoc transactions. By studying 91 ad hoc transactions among 8 popular open-source web applications, he and his co-authors found that (i) every studied application uses ad hoc transactions (up to 16 per ap...
2022-07-25
32 min
Disseminate: The Computer Science Research Podcast
Michael Abebe | Proteus: Autonomous Adaptive Storage for Mixed Workloads | #7
Summary: Enterprises use distributed database systems to meet the demands of mixed or hybrid transaction/analytical processing (HTAP) workloads that contain both transactional (OLTP) and analytical (OLAP) requests. Distributed HTAP systems typically maintain a complete copy of data in row-oriented storage format that is well-suited for OLTP workloads and a second complete copy in column-oriented storage format optimised for OLAP workloads. Maintaining these data copies consumes significant storage space and system resources. Conversely, if a system stores data in a single format, OLTP or OLAP workload performance suffers. In this interview, Michael talks about...
2022-07-18
27 min
Disseminate: The Computer Science Research Podcast
Hani Al-Sayeh | Juggler: Autonomous Cost Optimization and Performance Prediction of Big Data Applications | #6
Summary: Distributed in-memory processing frameworks accelerate iterative workloads by caching suitable datasets in memory rather than recomputing them in each iteration. Selecting appropriate datasets to cache as well as allocating a suitable cluster configuration for caching these datasets play a crucial role in achieving optimal performance. In practice, both are tedious, time-consuming tasks and are often neglected by end users, who are typically not aware of workload semantics, sizes of intermediate data, and cluster specification. To address these problems, Hani and his colleagues developed Juggler, an end-to-end framework, which autonomously selects appropriate datasets for caching and recommends...
2022-07-11
32 min
Disseminate: The Computer Science Research Podcast
Thomas Hütter | JEDI: These aren’t the JSON documents you’re looking for | #4
Summary: The JavaScript Object Notation (JSON) is a popular data format used in document stores to natively support semi-structured data. In this interview, Thomas talks about how he addressed the problem of JSON similarity lookup queries: given a query document and a distance threshold, retrieve all documents that are within the threshold from the query document, i.e., get me all similar documents! Unlike other hierarchical formats such as XML, JSON supports both ordered and unordered sibling collections within a single document, which poses a new challenge to the tree model...
2022-07-08
11 min
Disseminate: The Computer Science Research Podcast
Sainyam Galhotra | Causal Feature Selection for Algorithmic Fairness | #5
Summary: The use of machine learning (ML) in high-stakes societal decisions has encouraged the consideration of fairness throughout the ML lifecycle. Although data integration is one of the primary steps to generate high-quality training data, most of the fairness literature ignores this stage. In this interview Sainyam discusses why he focuses on fairness in the integration component of data management, aiming to identify features that improve prediction without adding any bias to the dataset. Sainyam works under the causal fairness paradigm and, without requiring the underlying structural causal model a priori, he has developed...
2022-07-08
12 min
Disseminate: The Computer Science Research Podcast
Draco Xu | TSUBASA: Climate Network Construction on Historical and Real-Time Data | #3
Summary: A climate network represents the global climate system by the interactions of a set of anomaly time-series. Network science has been applied to climate data to study the dynamics of a climate network. The core task and first step to enable interactive network science on climate data is the efficient construction and update of a climate network on user-defined time-windows. In this interview Draco talks about TSUBASA, an algorithm for the efficient construction of climate networks based on the exact calculation of Pearson’s correlation of large time-series. By pre-computing simple and low-overhead st...
2022-07-04
17 min
Disseminate: The Computer Science Research Podcast
Felix S Campbell | Efficient Answering of Historical What-if Queries | #2
Summary: In this interview Felix discusses "historical what-if queries", a novel type of what-if analysis that determines the effect of a hypothetical change to the transactional history of a database. For example, “how would revenue be affected if we would have charged an additional $6 for shipping?” In his research Felix has developed efficient techniques for answering these historical what-if queries, i.e., determining how a modified history affects the current database state. During the show, Felix talks about reenactment, a replay technique for transactional histories, and how he and his co-authors optimize this process usin...
2022-07-01
19 min
Disseminate: The Computer Science Research Podcast
Alex Isenko | Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines | #1
Summary: Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy. Maximizing resource utilization is becoming more challenging as the throughput of training processes increases with hardware innovations (e.g., faster GPUs, TPUs, and inter-connects) and advanced parallelization techniques that yield better scalability. At the same time, the amount of training data needed in order to train increasingly complex models is growing. As a consequence of this development, data preprocessing and provisioning are becoming a severe bottleneck in end-to-end deep learning pipelines. In this...
2022-06-27
24 min
Disseminate: The Computer Science Research Podcast
Coming Soon | ACM SIGMOD/PODS 2022 | #0
Welcome to Disseminate! The podcast bringing you the cutting edge of Computer Science research in a digestible format. Each series will focus on papers published at a specific Computer Science conference, e.g., SIGMOD, CVPR, so we will cover a wide range of topics from distributed systems to computer vision. Each episode within a series will feature an interview with the author(s) of a paper published at that conference. The podcast aims to be an alternative source of information for industry practitioners, researchers, and students. The podcast will be of particular use to practitioners as there will be...
2022-06-03
01 min