cover art for Alex Isenko | Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines | #1

Disseminate: The Computer Science Research Podcast

Alex Isenko | Where Is My Training Bottleneck? Hidden Trade-Offs in Deep Learning Preprocessing Pipelines | #1

Season 1, Ep. 1

Preprocessing pipelines in deep learning aim to provide sufficient data throughput to keep the training processes busy. Maximizing resource utilization is becoming more challenging as the throughput of training processes increases with hardware innovations (e.g., faster GPUs, TPUs, and inter-connects) and advanced parallelization techniques that yield better scalability. At the same time, the amount of training data needed in order to train increasingly complex models is growing. As a consequence of this development, data preprocessing and provisioning are becoming a severe bottleneck in end-to-end deep learning pipelines.

In this interview Alex talks about his in-depth analysis of data preprocessing pipelines from four different machine learning domains. Additionally, he discusses a new perspective on efficiently preparing datasets for end-to-end deep learning pipelines and extract individual trade-offs to optimize throughput, preprocessing time, and storage consumption. Alex and his collaborators have developed an open-source profiling library that can automatically decide on a suitable preprocessing strategy to maximize throughput. By applying their generated insights to real-world use-cases, an increased throughput of 3x to 13x can be obtained compared to an untuned system while keeping the pipeline functionally identical. These findings show the enormous potential of data pipeline tuning.


0:36 - Can you explain to our listeners what is a deep learning pipeline?

1:33 - In this pipepline how does data pre-processing become a bottleneck?

5:40 - In the paper you analyse several different domains, can you go into more details about the domains and pipelines?

6:49 - What are the key insights from your analysis?

8:28 - What are the other insights?

13:23 - Your paper introduces PRESTO the opens source profiling library, can you tell us more about that?

15:56 - How does this compare to other tools in the space?

18:46 - Who will find PRESTO useful?

20:13 - What is the most interesting, unexpected, or challenging lesson you encountered whilst working on this topic?

22:10 - What do you have planned for future research?


Contact Info:

More episodes

View all episodes

  • 10. Mohamed Alzayat | Groundhog: Efficient Request Isolation in FaaS | #40

    Summary:Security is a core responsibility for Function-as-a-Service (FaaS) providers. The prevailing approach has each function execute in its own container to isolate concurrent executions of different functions. However, successive invocations of the same function commonly reuse the runtime state of a previous invocation in order to avoid container cold-start delays when invoking a function. Although efficient, this container reuse has security implications for functions that are invoked on behalf of differently privileged users or administrative domains: bugs in a function’s implementation, third-party library, or the language runtime may leak private data from one invocation of the function to subsequent invocations of the same function.In this episode, Mohamed Alzayat tells us about Groundhog, which isolates sequential invocations of a function by efficiently reverting to a clean state, free from any private data, after each invocation. Tune in to learn more about how Groundhog works and how it improves security in FaaS!Links:Mohamed's homepageGroundhog EuroSys'23 paperGroundhog codebase
  • 9. Cuong Nguyen | Detock: High Performance Multi-region Transactions at Scale | #39

    Summary: In this episode Cuong Nguyen tells us about Detock, a geographically replicated database system. Tune in to learn about its specialised concurrency control and deadlock resolution protocols that enable processing strictly-serializable multi-region transactions with near-zero performance degradation at extremely high conflict and improves latency by up to a factor of 5.Links: SIGMOD PaperDetock Github RepoCuong's Homepage
  • 8. Bogdan Stoica | WAFFLE: Exposing Memory Ordering Bugs Efficiently with Active Delay Injection | #38

    Concurrency bugs are difficult to detect, reproduce, and diagnose, as they manifest under rare timing conditions. Recently, active delay injection has proven efficient for exposing one such type of bug — thread-safety violations — with low over-head, high coverage, and minimal code analysis. However, how to efficiently apply active delay injection to broader classes of concurrency bugs is still an open question.In this episode, Bogdan Stoica tells us about how answered this question by focusing on MemOrder bugs — a type of concurrency bug caused by incorrect timing between a memory access to a particular object and the object’s initialization or deallocation. Tune to learn about Waffle — a delay injection tool that tailors key design points to better match the nature of MemOrder bugs. Links: EuroSys'23 PaperBogdan's HomepageWaffle's GitHub Repo
  • 7. Roger Waleffe | MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks | #37

    Summary: In this episode, Roger Waleffe talks about Graph Neural Networks (GNNs) for large-scale graphs. Specifically, he reveals all about MariusGNN, the first system that utilises the entire storage hierarchy (including disk) for GNN training. Tune in to find out how MaruisGNN works and just how fast it goes (and how much more cost-efficient it is!) Links: Marius ProjectRoger's Homepage Roger's TwitterEuroSys'23 PaperSupport the podcast through Buy Me a Coffee
  • 6. Madelon Hulsebos | GitTables: A Large-Scale Corpus of Relational Tables | #36

    Summary:The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need resources with tables that resemble relational database tables. In this episode, Madelon Hulsebos tells us all about such a resource! Tune in to learn more about GitTables!! Links: Madelon's websiteGitTables homepageSIGMOD'23 paperBuy Me A Coffee!
  • 5. Tarikul Islam Papon | ACEing the Bufferpool Management Paradigm for Modern Storage Devices | #35

    Summary:Compared to hard disk drives (HDDs), solid-state drives (SSDs) have two fundamentally different properties: (i) read/write asymmetry (writes are slower than reads) and (ii) access concurrency (multiple I/Os can be executed in parallel to saturate the device bandwidth). But, database operators are often designed without considering storage asymmetry and concurrency resulting in device under utilization. In thie episode, Tarikul Islam Papon tells us about his work on a new Asymmetry & Concurrency aware bufferpool management (ACE) that batches writes based on device concurrency and performs them in parallel to amortize the asymmetric write cost. Tune in to learn more! Links:ICDE'23 PaperPapon's HomepagePapon's LinkedInBuy me a coffee
  • 4. Jian Zhang | VIPER: A Fast Snapshot Isolation Checker | #34

    Summary:Snapshot isolation is supported by most commercial databases and is widely used by applications. However, checking, if given a set of transactions, a database ensures Snapshot Isolation is either slow or gives up soundness. In this episode, Jian Zhang tells us about VIPER, an SI checker that is sound, complete, and fast. Tune in to learn more!! Links:PaperGitHub repoJian's homepage
  • 3. Ahmed Sayed | REFL: Resource Efficient Federated Learning | #33

    Summary: Federated Learning (FL) enables distributed training by learners using local data, thereby enhancing privacy and reducing communication. However, it presents numerous challenges relating to the heterogeneity of the data distribution, device capabilities, and participant availability as deployments scale, which can impact both model convergence and bias. Existing FL schemes use random participant selection to improve fairness; however, this can result in inefficient use of resources and lower quality training. In this episode, Ahmed Sayed talks about how he and his colleagues address the question of resource efficiency in FL. He talks about the benefits of intelligent participant selection, and incorporation of updates from straggling participants. Tune in to learn more!Links:EuroSys'23 PaperAhmed's LinkedIn Ahmed's HomepageAhmed's TwitterREFL Github
  • 2. Subhadeep Sarkar | Log-structured Merge Trees | #32

    Summary:Log-structured merge (LSM) trees have emerged as one of the most commonly used storage-based data structures in modern data systems as they offer high throughput for writes and good utilization of storage space. In this episode, Subhadeep Sarkar presents the fundamental principles of the LSM paradigm. He tells us about recent research on improving write performance and the various optimization techniques and hybrid designs adopted by LSM engines to accelerate reads. Tune in to find out more! Links:Personal websiteICDE'23 tutorialLinkedIn