Keynote - Building Data-Intensive Systems that Care
Speaker: Sihem Amer-Yahia, CNRS Research Director & Deputy Director of the Laboratory of Informatics of Grenoble, Univ. Grenoble Alpes
Abstract: Computer science has been about automating everything, including data science pipelines. Data science and humans do not optimize for the same “objective function”. Many of us have been claiming that we build human-centric systems. But are we? And if we are, are we doing it properly? This talk will attempt to answer these questions at various stages of the data science pipeline, illustrating the essential roles that humans take along the way, as data labelers, as domain experts, and as end-users, and providing recommendations for building data-intensive systems that truly care.
Bio: Sihem Amer-Yahia is a Silver Medal CNRS Research Director and Deputy Director of the Lab of Informatics of Grenoble. She works on exploratory data analysis and algorithmic upskilling. Prior to that she was Principal Scientist at QCRI, Senior Scientist at Yahoo! Research and Member of Technical Staff at AT&T Labs. Sihem served as PC chair for SIGMOD 2023 and as the coordinator of the Diversity, Equity and Inclusion initiative for the database community. In 2024, she received the IEEE TCDE Impact Award, the SIGMOD Contributions Award, and the VLDB Women in Database Award.
Time: TBA
Keynote - Building AI-Driven Data Catalogs: A Great Playground for Human-in-the-Loop Research
Speaker: AnHai Doan, University of Wisconsin
Abstract: Organizations today often manage large numbers of datasets scattered across many locations. When launching data science or AI projects, users typically need to find a small number of relevant datasets—yet discovering them amidst a sea of options is extremely difficult. Data catalogs have emerged as a critical solution to this problem. A catalog processes datasets to construct a graph that captures metadata and relationships among them. Users can then explore this graph through browsing, keyword search, or natural language queries.
In this talk, I will argue that building such catalogs offers a rich testbed for human-in-the-loop research. Many catalog enrichment tasks—such as expanding cryptic column names, generating table descriptions, and matching tables—can benefit significantly from human feedback. So can discovery mechanisms like browsing, search, and natural language querying. Moreover, building these systems raises many crowdsourcing challenges, underscoring the need for a declarative crowdsourcing platform. Finally, deployed catalogs often support a variety of human-driven workflows that can be enhanced through partial automation. I will illustrate these ideas with SmartCat, a recent project at Wisconsin that combines generative AI, machine learning, big data techniques, and human-in-the-loop strategies to build intelligent data catalogs. Findings from SmartCat have been incorporated into EDI, a widely used data catalog for environmental scientists.
Bio: AnHai Doan is the Vilas Distinguished Achievement Professor and Gurindar Sohi Professor of Computer Science at the University of Wisconsin–Madison. His research spans databases, AI, and the Web, with a current focus on data integration, data science, and machine learning. He received the ACM Doctoral Dissertation Award, NSF CAREER Award, and Sloan Fellowship, and co-authored "Principles of Data Integration", a widely used textbook in the field. AnHai has worked extensively at the intersection of academia and industry. He served on the advisory board of Transformic, a Deep Web startup acquired by Google in 2005; was Chief Scientist at Kosmix, a social media startup acquired by Walmart in 2011; and co-founded GreenBay Technologies, an AI-driven data integration startup acquired by Informatica in 2020. He has also served on the SIGMOD Advisory Committee and Executive Committee, and was Co-Chair of SIGMOD 2020.
Time: TBA
Keynote - SQL and Large Language Models: A Marriage Made in Heaven?
Speaker: Paolo Papotti, EURECOM
Abstract: With the rise of pre-trained Large Language Models (LLMs), there is now an effective solution to store and use information extracted from massive corpora of documents. However, for data-intensive tasks over structured data, relational DBs and SQL queries are at the core of countless applications. While these two technologies may appear distant, in this talk we will see that they can interact effectively and with promising results. LLMs can help users express SQL queries (Semantic Parsing), and, conversely, SQL queries can be used to evaluate LLMs (Benchmarking). Their combination can be further advanced, with opportunities to query both LLMs and DBs through a unified SQL interface. We present recent results on these topics and then conclude with an overview of the research challenges in effectively leveraging the combined power of SQL and LLMs.
Bio: Paolo Papotti has been an Associate Professor at EURECOM (France) since 2017 and the holder of a Chair of Artificial Intelligence at the 3IA Institute since 2024. He received his PhD from Roma Tre University (Italy) in 2007 and held research positions at the Qatar Computing Research Institute (Qatar) and Arizona State University (USA). His research is focused on data management and NLP. He has authored more than 150 publications and his work has been recognized with best paper awards (CIKM 2024, ISWC 2024), two “Best of the Conference” citations (SIGMOD 2009, VLDB 2016), three best demo awards (SIGMOD 2015, DBA 2020, SIGMOD 2022), and two Google Faculty Research awards (2016, 2020).
Time: TBA
Invited Talk - Unleashing Data Science: It's Time to Fix the Data Preparation Problem
Speaker: El Kindi Rezig, University of Utah
Abstract: When building machine learning (ML) models, data scientists face a significant hurdle: data preparation. ML models are exactly as good as the data we train them on. Unfortunately, data preparation is tedious and laborious because it often requires human judgment on how to proceed. In fact, data scientists spend at least 80% of their time locating the datasets they want to analyze, integrating them, and cleaning the results. In this talk, I will present my key contributions in data preparation for data science, which address the following problems: (1) data discovery: how to discover data of interest from a large collection of heterogeneous tables (e.g., data lakes); (2) error detection: how to find errors in the input and intermediate data in complex data workflows; and (3) data repairing: how to repair data errors with minimal human intervention. The developed systems are specifically designed to support data science development, which poses particular requirements such as interactivity and modularity.
Bio: El Kindi Rezig is an Assistant Professor at the Kahlert School of Computing at the University of Utah. Previously, he was a research scientist and a postdoctoral associate at the Computer Science and Artificial Intelligence Laboratory (CSAIL) of MIT where he worked with Michael Stonebraker. He earned his Ph.D. in Computer Science from Purdue University under the supervision of Walid Aref and Mourad Ouzzani. His research interests revolve around data management in general and data quality in particular.
Time: TBA
Invited Talk - TBA
Speaker: Yao Wan, Huazhong University of Science and Technology
Abstract: TBA
Bio:
Time: TBA
Invited Talk - TBA
Speaker: Nancy Xia, University College London
Abstract: TBA
Bio:
Time: TBA