Disseminate: The Computer Science Research Podcast
Hani Al-Sayeh | Juggler: Autonomous Cost Optimization and Performance Prediction of Big Data Applications | #6
Distributed in-memory processing frameworks accelerate iterative workloads by caching suitable datasets in memory rather than recomputing them in each iteration. Selecting appropriate datasets to cache and allocating a suitable cluster configuration for caching them both play a crucial role in achieving optimal performance. In practice, both are tedious, time-consuming tasks that end users often neglect, since they are typically unaware of workload semantics, the sizes of intermediate data, and cluster specifications. To address these problems, Hani and his colleagues developed Juggler, an end-to-end framework that autonomously selects appropriate datasets for caching and recommends a correspondingly suitable cluster configuration to end users, with the aim of achieving optimal execution time and cost.
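As a rough illustration of the caching idea that Juggler automates, here is a minimal Python sketch. It is not Juggler's implementation: the function names are hypothetical, and plain Python memoization stands in for framework-level caching (e.g., marking a dataset with `.cache()`/`.persist()` in Spark), but it shows why recomputing an intermediate dataset in every iteration is wasteful.

```python
def expensive_transform(data):
    """Stand-in for a costly distributed transformation
    (e.g., a parsed and filtered intermediate dataset)."""
    return [x * 2 for x in data]

def iterate_without_cache(data, iterations):
    # The intermediate dataset is recomputed in every iteration.
    total = 0
    for _ in range(iterations):
        intermediate = expensive_transform(data)
        total += sum(intermediate)
    return total

def iterate_with_cache(data, iterations):
    # The intermediate dataset is computed once and reused across
    # iterations -- the behaviour a framework provides when the
    # dataset is cached in memory.
    intermediate = expensive_transform(data)
    total = 0
    for _ in range(iterations):
        total += sum(intermediate)
    return total

data = list(range(100))
# Both variants produce the same result; the cached version simply
# avoids repeating the expensive transformation.
assert iterate_without_cache(data, 5) == iterate_with_cache(data, 5)
```

In a real deployment the decision is harder than this sketch suggests: caching the wrong dataset, or caching on a cluster without enough memory, can hurt rather than help, which is exactly the selection and configuration problem Juggler addresses.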
1:02 - Can you introduce your work and describe the current workflow for developing big data applications in the cloud?
2:49 - What is the challenge (maybe hidden challenge) facing application developers in this workflow? What harms performance?
5:36 - How does Juggler solve this problem?
11:55 - As an end user, how do I interact with Juggler?
14:07 - Can you talk us through your evaluation of Juggler? What were the key insights?
16:30 - What other tools are similar to Juggler? How do they compare?
18:17 - What are the limitations of Juggler?
21:57 - Who will find Juggler the most useful? Who is it for?
24:05 - Is Juggler publicly available?
24:23 - What is the most interesting (maybe unexpected) lesson you learned while working on this topic?
27:50 - What is next for Juggler? What do you have planned for future research?
28:49 - What attracted you to this research area?
29:45 - What do you think is the biggest challenge now in this area?
Links:
- Juggler: Autonomous Cost Optimization and Performance Prediction of Big Data Applications (SIGMOD 2022 paper)
- Juggler SIGMOD 2022 presentation
- CherryPick: Adaptively Unearthing the Best Cloud Configurations for Big Data Analytics (NSDI 2017 paper)
- Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics (NSDI 2016 paper)