

Disseminate: The Computer Science Research Podcast

Sainyam Galhotra | Causal Feature Selection for Algorithmic Fairness | #5

Season 1, Ep. 5
Summary:


The use of machine learning (ML) in high-stakes societal decisions has encouraged the consideration of fairness throughout the ML lifecycle. Although data integration is one of the primary steps in generating high-quality training data, most of the fairness literature ignores this stage. In this interview, Sainyam discusses why he focuses on fairness in the integration component of data management, aiming to identify features that improve prediction without adding any bias to the dataset. Sainyam works under the causal fairness paradigm. Without requiring the underlying structural causal model a priori, he has developed an approach that identifies a sub-collection of features that ensures the fairness of the dataset by performing conditional independence tests between different subsets of features.
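To make the idea of selecting features through conditional independence tests more concrete, here is a minimal sketch. It is not the method from the paper: it assumes linear relationships and uses partial correlation as a stand-in for a general conditional independence test. The names `partial_corr`, `select_features`, `admissible`, and the threshold value are all hypothetical and chosen for illustration only.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y after linearly regressing out z (shape n x k)."""
    design = np.column_stack([np.ones(len(x)), z])
    # Residuals of x and y once the linear effect of z is removed.
    rx = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    ry = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(rx, ry)[0, 1])

def select_features(candidates, sensitive, admissible, threshold=0.1):
    """Keep candidates that look independent of the sensitive attribute
    once the (assumed) admissible features are conditioned on."""
    return [name for name, col in candidates.items()
            if abs(partial_corr(col, sensitive, admissible)) < threshold]

# Synthetic data, purely illustrative.
rng = np.random.default_rng(0)
n = 5000
sensitive = rng.integers(0, 2, n).astype(float)
# An already accepted feature that itself depends on the sensitive attribute.
admissible = rng.normal(size=(n, 1)) + sensitive[:, None]
candidates = {
    "independent": rng.normal(size=n),                        # unrelated to sensitive
    "via_admissible": admissible[:, 0] + rng.normal(size=n),  # related only through admissible
    "proxy": sensitive + 0.5 * rng.normal(size=n),            # direct proxy for sensitive
}

selected = select_features(candidates, sensitive, admissible)
print(selected)
```

In this toy setup the direct proxy is rejected, while the feature that depends on the sensitive attribute only through the conditioning set is kept. A real pipeline would need a test suited to the data (for example, one that handles categorical or non-linear dependence).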


Questions:

0:35: Can you introduce your work and describe the problem you're aiming to solve?

2:39: Can you elaborate on what fairness means?

3:51: Let's dig into your solution. How does the causal approach work?

4:41: How does your approach compare to other approaches in your evaluation?

6:17: How can data scientists apply your findings to the real world?

7:54: What was the most unexpected challenge you faced while working on algorithmic fairness?

8:29: What is next for your research?

9:17: Tell us about your other publications at SIGMOD?

10:57: How can researchers get involved in algorithmic fairness?


Links:

More episodes


  • 10. Mohamed Alzayat | Groundhog: Efficient Request Isolation in FaaS | #40

    42:46
    Summary: Security is a core responsibility for Function-as-a-Service (FaaS) providers. The prevailing approach has each function execute in its own container to isolate concurrent executions of different functions. However, successive invocations of the same function commonly reuse the runtime state of a previous invocation in order to avoid container cold-start delays when invoking a function. Although efficient, this container reuse has security implications for functions that are invoked on behalf of differently privileged users or administrative domains: bugs in a function’s implementation, third-party library, or the language runtime may leak private data from one invocation of the function to subsequent invocations of the same function. In this episode, Mohamed Alzayat tells us about Groundhog, which isolates sequential invocations of a function by efficiently reverting to a clean state, free from any private data, after each invocation. Tune in to learn more about how Groundhog works and how it improves security in FaaS! Links: Mohamed's homepage, Groundhog EuroSys'23 paper, Groundhog codebase
  • 9. Cuong Nguyen | Detock: High Performance Multi-region Transactions at Scale | #39

    37:28
    Summary: In this episode, Cuong Nguyen tells us about Detock, a geographically replicated database system. Tune in to learn about its specialised concurrency control and deadlock resolution protocols, which enable processing strictly-serializable multi-region transactions with near-zero performance degradation at extremely high conflict and improve latency by up to a factor of 5. Links: SIGMOD Paper, Detock GitHub Repo, Cuong's Homepage
  • 8. Bogdan Stoica | WAFFLE: Exposing Memory Ordering Bugs Efficiently with Active Delay Injection | #38

    55:57
    Concurrency bugs are difficult to detect, reproduce, and diagnose, as they manifest under rare timing conditions. Recently, active delay injection has proven efficient for exposing one such type of bug (thread-safety violations) with low overhead, high coverage, and minimal code analysis. However, how to efficiently apply active delay injection to broader classes of concurrency bugs is still an open question. In this episode, Bogdan Stoica tells us how he answered this question by focusing on MemOrder bugs, a type of concurrency bug caused by incorrect timing between a memory access to a particular object and the object’s initialization or deallocation. Tune in to learn about Waffle, a delay injection tool that tailors key design points to better match the nature of MemOrder bugs. Links: EuroSys'23 Paper, Bogdan's Homepage, Waffle's GitHub Repo
  • 7. Roger Waleffe | MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks | #37

    01:13:06
    Summary: In this episode, Roger Waleffe talks about Graph Neural Networks (GNNs) for large-scale graphs. Specifically, he reveals all about MariusGNN, the first system that utilises the entire storage hierarchy (including disk) for GNN training. Tune in to find out how MariusGNN works and just how fast it goes (and how much more cost-efficient it is)! Links: Marius Project, Roger's Homepage, Roger's Twitter, EuroSys'23 Paper, Support the podcast through Buy Me a Coffee
  • 6. Madelon Hulsebos | GitTables: A Large-Scale Corpus of Relational Tables | #36

    45:54
    Summary: The success of deep learning has sparked interest in improving relational table tasks, like data preparation and search, with table representation models trained on large table corpora. Existing table corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need resources with tables that resemble relational database tables. In this episode, Madelon Hulsebos tells us all about such a resource! Tune in to learn more about GitTables! Links: Madelon's website, GitTables homepage, SIGMOD'23 paper, Buy Me A Coffee!
  • 5. Tarikul Islam Papon | ACEing the Bufferpool Management Paradigm for Modern Storage Devices | #35

    47:18
    Summary: Compared to hard disk drives (HDDs), solid-state drives (SSDs) have two fundamentally different properties: (i) read/write asymmetry (writes are slower than reads) and (ii) access concurrency (multiple I/Os can be executed in parallel to saturate the device bandwidth). However, database operators are often designed without considering storage asymmetry and concurrency, resulting in device underutilization. In this episode, Tarikul Islam Papon tells us about his work on a new Asymmetry & Concurrency aware bufferpool management (ACE) that batches writes based on device concurrency and performs them in parallel to amortize the asymmetric write cost. Tune in to learn more! Links: ICDE'23 Paper, Papon's Homepage, Papon's LinkedIn, Buy me a coffee
  • 4. Jian Zhang | VIPER: A Fast Snapshot Isolation Checker | #34

    42:34
    Summary: Snapshot isolation (SI) is supported by most commercial databases and is widely used by applications. However, checking whether a database ensures snapshot isolation for a given set of transactions is either slow or gives up soundness. In this episode, Jian Zhang tells us about VIPER, an SI checker that is sound, complete, and fast. Tune in to learn more! Links: Paper, GitHub repo, Jian's homepage
  • 3. Ahmed Sayed | REFL: Resource Efficient Federated Learning | #33

    58:53
    Summary: Federated Learning (FL) enables distributed training by learners using local data, thereby enhancing privacy and reducing communication. However, it presents numerous challenges relating to the heterogeneity of the data distribution, device capabilities, and participant availability as deployments scale, which can impact both model convergence and bias. Existing FL schemes use random participant selection to improve fairness; however, this can result in inefficient use of resources and lower quality training. In this episode, Ahmed Sayed talks about how he and his colleagues address the question of resource efficiency in FL. He talks about the benefits of intelligent participant selection, and incorporation of updates from straggling participants. Tune in to learn more! Links: EuroSys'23 Paper, Ahmed's LinkedIn, Ahmed's Homepage, Ahmed's Twitter, REFL GitHub
  • 2. Subhadeep Sarkar | Log-structured Merge Trees | #32

    59:27
    Summary: Log-structured merge (LSM) trees have emerged as one of the most commonly used storage-based data structures in modern data systems as they offer high throughput for writes and good utilization of storage space. In this episode, Subhadeep Sarkar presents the fundamental principles of the LSM paradigm. He tells us about recent research on improving write performance and the various optimization techniques and hybrid designs adopted by LSM engines to accelerate reads. Tune in to find out more! Links: Personal website, ICDE'23 tutorial, LinkedIn