Keynote - Building Data-Intensive Systems that Care
Speaker: Sihem Amer-Yahia, CNRS Research Director & Deputy Director of Laboratory of Informatics of Grenoble, Univ. Grenoble Alpes
Abstract: Computer science has been about automating everything, including data science pipelines. Data science and humans do not optimize for the same “objective function”. Many of us have been claiming that we build human-centric systems. But are we? And if we are, are we doing it properly? This talk will attempt to answer these questions at various stages of the data science pipeline, illustrating the essential roles that humans take along the way, as data labelers, as domain experts, and as end-users, and providing recommendations for building data-intensive systems that truly care.
Bio: Sihem Amer-Yahia is a Silver Medal CNRS Research Director and Deputy Director of the Lab of Informatics of Grenoble. She works on exploratory data analysis and algorithmic upskilling. Prior to that she was Principal Scientist at QCRI, Senior Scientist at Yahoo! Research and Member of Technical Staff at AT&T Labs. Sihem served as PC chair for SIGMOD 2023 and as the coordinator of the Diversity, Equity and Inclusion initiative for the database community. In 2024, she received the IEEE TCDE Impact Award, the SIGMOD Contributions Award, and the VLDB Women in Database Award.
Keynote - Building AI-Driven Data Catalogs: A Great Playground for Human-in-the-Loop Research
Speaker: AnHai Doan, University of Wisconsin
Abstract: Organizations today often manage large numbers of datasets scattered across many locations. When launching data science or AI projects, users typically need to find a small number of relevant datasets—yet discovering them amidst a sea of options is extremely difficult. Data catalogs have emerged as a critical solution to this problem. A catalog processes datasets to construct a graph that captures metadata and relationships among them. Users can then explore this graph through browsing, keyword search, or natural language queries.
In this talk, I will argue that building such catalogs offers a rich testbed for human-in-the-loop research. Many catalog enrichment tasks—such as expanding cryptic column names, generating table descriptions, and matching tables—can benefit significantly from human feedback. So can discovery mechanisms like browsing, search, and natural language querying. Moreover, building these systems raises many crowdsourcing challenges, underscoring the need for a declarative crowdsourcing platform. Finally, deployed catalogs often support a variety of human-driven workflows that can be enhanced through partial automation. I will illustrate these ideas with SmartCat, a recent project at Wisconsin that combines generative AI, machine learning, big data techniques, and human-in-the-loop strategies to build intelligent data catalogs. Findings from SmartCat have been incorporated into EDI, a widely used data catalog for environmental scientists.
Bio: AnHai Doan is the Vilas Distinguished Achievement Professor and Gurindar Sohi Professor of Computer Science at the University of Wisconsin–Madison. His research spans databases, AI, and the Web, with a current focus on data integration, data science, and machine learning. He received the ACM Doctoral Dissertation Award, NSF CAREER Award, and Sloan Fellowship, and co-authored "Principles of Data Integration", a widely used textbook in the field. AnHai has worked extensively at the intersection of academia and industry. He served on the advisory board of Transformic, a Deep Web startup acquired by Google in 2005; was Chief Scientist at Kosmix, a social media startup acquired by Walmart in 2011; and co-founded GreenBay Technologies, an AI-driven data integration startup acquired by Informatica in 2020. He has also served on the SIGMOD Advisory Committee and Executive Committee, and was Co-Chair of SIGMOD 2020.
Keynote - SQL and Large Language Models: A Marriage Made in Heaven?
Speaker: Paolo Papotti, EURECOM
Abstract: With the rise of pre-trained Large Language Models (LLMs), there is now an effective solution to store and use information extracted from massive corpora of documents. However, for data-intensive tasks over structured data, relational DBs and SQL queries are at the core of countless applications. While these two technologies may appear distant, in this talk we will see that they can interact effectively and with promising results. LLMs can help users express SQL queries (Semantic Parsing), and, conversely, SQL queries can be used to evaluate LLMs (Benchmarking). Their combination can be further advanced, with opportunities to query both LLMs and DBs through a unified SQL interface. We present recent results on these topics and then conclude with an overview of the research challenges in effectively leveraging the combined power of SQL and LLMs.
Bio: Paolo Papotti has been an Associate Professor at EURECOM (France) since 2017 and has held a Chair of Artificial Intelligence at the 3IA Institute since 2024. He received his PhD from Roma Tre University (Italy) in 2007 and held research positions at the Qatar Computing Research Institute (Qatar) and Arizona State University (USA). His research is focused on data management and NLP. He has authored more than 150 publications and his work has been recognized with best paper awards (CIKM 2024, ISWC 2024), two “Best of the Conference” citations (SIGMOD 2009, VLDB 2016), three best demo awards (SIGMOD 2015, DBA 2020, SIGMOD 2022), and two Google Faculty Research awards (2016, 2020).
Invited Talk - Unleashing Data Science: It's Time to Fix the Data Preparation Problem
Speaker: El Kindi Rezig, University of Utah
Abstract: When building machine learning (ML) models, data scientists face a significant hurdle: data preparation. ML models are only as good as the data we train them on. Unfortunately, data preparation is tedious and laborious because it often requires human judgment on how to proceed. In fact, data scientists spend at least 80% of their time locating the datasets they want to analyze, integrating them, and cleaning the results. In this talk, I will present my key contributions in data preparation for data science, which address the following problems: (1) data discovery: how to discover data of interest from a large collection of heterogeneous tables (e.g., data lakes); (2) error detection: how to find errors in the input and intermediate data in complex data workflows; and (3) data repairing: how to repair data errors with minimal human intervention. The developed systems are specifically designed to support data science development, which poses particular requirements such as interactivity and modularity.
Bio: El Kindi Rezig is an Assistant Professor at the Kahlert School of Computing at the University of Utah. Previously, he was a research scientist and a postdoctoral associate at the Computer Science and Artificial Intelligence Laboratory (CSAIL) of MIT where he worked with Michael Stonebraker. He earned his Ph.D. in Computer Science from Purdue University under the supervision of Walid Aref and Mourad Ouzzani. His research interests revolve around data management in general and data quality in particular.
Invited Talk - Sign2Vis: Automated Data Visualization From Sign Language
Speaker: Yao Wan, Huazhong University of Science and Technology
Abstract: Data visualizations, such as bar charts and histograms, are essential for analyzing and exploring data, enabling the effective communication of insights. While existing methods have been proposed to translate natural language descriptions into visualization queries, they focus solely on spoken languages, overlooking sign languages, which comprise about 200 variants used by 70 million Deaf and Hard-of-Hearing (DHH) individuals. To fill this gap, we propose Sign2Vis, a sign language interface that enables the DHH community to engage more fully with data analysis. We first construct a paired dataset that includes sign language pose videos and their corresponding visualization queries. Using this dataset, we evaluate a variety of models, including both pipeline-based and end-to-end approaches. Finally, we share key insights from our evaluation and highlight the need for more accessible and user-centered tools to support the DHH community in interactive data analytics.
Bio: Yao Wan is currently an Associate Professor in the College of Computer Science and Technology at Huazhong University of Science and Technology (HUST), China. He received his Ph.D. degree from Zhejiang University, China. He has held visiting positions at the University of Technology Sydney, Australia (2016), and the University of Illinois Chicago, USA (2018). At HUST, he leads the ONE Lab, which focuses on empowering machines to interact with the physical world through a unified natural language interface—Language + X, where X can represent code, vision, tables, etc. He has published over 50 papers in prestigious conferences across Artificial Intelligence, Data Mining, and Software Engineering, including ICML, NeurIPS, ICLR, SIGMOD, KDD, WWW, ACL, ICSE, FSE, and ASE.
Invited Talk - "How do you even know that stuff?" Barriers to expertise sharing among spreadsheet users
Speaker: Nancy Xia, University College London
Abstract: Spreadsheet collaboration offers valuable opportunities for colleagues to learn from one another and share expertise. Such sharing is crucial for retaining technical skillsets within organisations, yet previous studies suggest that spreadsheet experts often struggle to disseminate their knowledge. Drawing on interviews with 31 spreadsheet users, this talk highlights how social norms and beliefs about the value of spreadsheet use shape engagement in knowledge-sharing behaviours. We identify key barriers, including difficulties in self-assessing one’s own expertise, challenges in adapting highly personalised strategies to subjective standards, and the influence of dismissive attitudes toward spreadsheet use. I emphasise the need to consider both technology design and social dynamics when supporting collaborative learning in feature-rich software environments.
Bio: Qing (Nancy) Xia is a final-year PhD student in Human-Computer Interaction at UCL, supervised by Professor Duncan Brumby, Dr Advait Sarkar, and Professor Anna Cox. Her research examines how social norms around technology shape collaborative behaviours such as workplace knowledge sharing. Her work integrates perspectives from organisational psychology, knowledge management, and behavioural science. Nancy’s PhD is co-sponsored by the Engineering and Physical Sciences Research Council and Microsoft Research.