Surajit Chaudhuri
Data Management, Exploration and Mining
Microsoft Research
Research Interests
- Self-Tuning Database Systems
- Monitoring Database Systems
- Data Cleaning
- Synergy of Information Retrieval and Databases
- Query Optimization
Projects
I lead the Data Management,
Exploration and Mining group at Microsoft Research.
I am actively involved with the AutoAdmin
project that we started in 1997. The goal of this project is to make databases
self-tuning and self-administering by exploiting the knowledge of the workload.
Our primary focus has been on automated physical database design (VLDB 1997,
SIGMOD 1998, VLDB 2000) as well as on automated statistics management in
relational systems. I work closely with the other members of this project and
Microsoft SQL Server product group in doing this research. The Index Tuning
Wizards in Microsoft SQL Server 7.0 and SQL Server 2000 are based on the
technology that we developed as part of this project. They were the first
workload-driven commercial physical design tools for relational systems,
recommending indexes and indexes + materialized views, respectively. We are further
expanding the scope of the automated physical design technology in the Database
Tuning Advisor feature of the upcoming release of SQL Server 2005. In 1998, we
initiated work on exploiting execution feedback to define
“self-tuning” histograms (SIGMOD 1999, SIGMOD 2001). More recently,
I have become interested in the problem of monitoring database systems.
Specifically, we worked on the problem of estimating progress of SQL queries
(“What percentage of the query execution has been
completed?”– SIGMOD 2004, SIGMOD 2005) as well as on a broader
architecture for monitoring database servers (SQLCM – IEEE ICDE 2004).
The Data Cleaning project develops tools
and server infrastructure to effectively support data preparation, an essential
step before effective data analysis, be it simple aggregation, OLAP or data
mining, can be supported. Our work in this area strives to uncover fundamental
generic building blocks to ensure flexible ways of defining data cleaning. In
cooperation with SQL Server, we will be enabling fuzzy matching and fuzzy
de-duplication operations for the first time in the upcoming SQL Server 2005
product (as part of Data Transformation Services).
Text documents as well as structured
relational data are sources of our information. Integrated querying and browsing
of structured relational databases and that of text are of vital importance for
our ability to harness information effectively. I am investigating how
relational querying can be enriched by borrowing ideas from information
retrieval. These include supporting keyword-based search over databases as well
as auto-ranking of answers to database queries. This technology promises to
address the “empty answer” and “many answers” problems (a query returns either
no hits or too many) in databases. Our papers in IEEE ICDE 2002,
CIDR 2003, VLDB 2004 and CIDR 2005 highlight our research directions.
Finally, I am interested in
understanding database systems challenges to enable business intelligence and
decision support more effectively on database platforms. In the past, I have
worked on optimization of complex SQL queries, e.g., optimization of queries
with group-by (VLDB 1994), user-defined predicates (VLDB 1996), exploiting
factorization for index union/intersection plans (SIGMOD 2003), and data mining
predicates (IEEE ICDE 2002). My more recent focus is revisiting the fundamental
assumptions in query optimization. Brian Babcock and I have a recent paper on
this topic in SIGMOD 2005.
Selected Professional Activities
- ACM Transactions on Database Systems (TODS): Associate Editor
- IEEE Transactions on Knowledge and Data Engineering (TKDE): Associate
Editor, 2001-2005
- IEEE Data Engineering Bulletin: Associate Editor, 1998-1999
- ACM Digital Review: Member of the Editorial Board
- 2005 ACM Conference on Management of Data (SIGMOD): Program Chair
- 1999 ACM Conference on Knowledge Discovery and Data Mining (KDD): Program
Co-chair
- 2003 ACM SIGMOD Conference: Industrial Track Chair and Member of the Best
Paper Awards Committee
- 2001 ACM Conference on Knowledge Discovery and Data Mining: Industrial Track
Co-chair
- 1999 ACM SIGMOD Conference: Industrial Track Co-chair
- 1998 IEEE Conference on Data Engineering (ICDE): Industrial Track Chair
- 2002 IEEE Conference on Data Engineering (ICDE): Chair, OLAP and Data
Warehousing Track
- 2002 VLDB 10-year award committee, Member
- NSF Panelist
Invited Talks and Tutorials
- Surajit Chaudhuri, Gerhard Weikum: Foundations of automated database tuning.
Tutorial presented at ACM SIGMOD 2005.
- Surajit Chaudhuri, Benoît Dageville, Guy M. Lohman: Self-Managing Technology
in Database Management Systems. Tutorial presented at VLDB 2004.
- Databases and IR: Perspectives of a SQL Guy. NSF Information and Data
Management PI Workshop, Seattle, 2003 (pdf version of slides).
- Storage and Retrieval of XML Data Using Relational Databases. Tutorial
presented at VLDB 2001 and IEEE ICDE 2002 Conferences.
- An Overview of Data Warehousing and OLAP technology. SIGMOD Record, March
1997 (with Umesh Dayal). Tutorials presented at 1996 VLDB, 1997 SIGMOD, 1998
EDBT and 1998 IEEE ICDE Conferences (pdf version).
- An Overview of Query Optimization in Relational Systems. Proceedings of 1998
ACM PODS. Invited Tutorial at ACM PODS Conference, 1998 (pdf version of paper,
pdf version of slides).
Selected Recent Publications
For a complete list of my publications, please look up DBLP.
- Towards a Robust Query Optimizer: A Principled and Practical Approach. ACM
SIGMOD 2005. (with Brian Babcock)
- When Can We Trust Progress Estimators for SQL Queries? ACM SIGMOD 2005.
(with Raghav Kaushik, Ravishankar Ramamurthy)
- Automatic Physical Database Tuning: A Relaxation-based Approach. ACM SIGMOD
2005. (with Nicolas Bruno)
- Robust Identification of Fuzzy Duplicates. IEEE ICDE 2005. (with Venkatesh
Ganti, Rajeev Motwani)
- Effective Use of Block-Level Sampling in Statistics Estimation. ACM SIGMOD
2004. (with Gautam Das and Utkarsh Srivastava)
- Probabilistic Ranking of Database Query Results. VLDB 2004. (with Gautam
Das, Vagelis Hristidis, Gerhard Weikum)
- Estimating Progress of Long Running SQL Queries. ACM SIGMOD 2004. (with
Vivek Narasayya, Ravishankar Ramamurthy)
- SQLCM: A Continuous Monitoring Framework for Relational Database Engines.
IEEE ICDE 2004. (with Christian König, Vivek Narasayya)
- Factorizing Complex Predicates in Queries to Exploit Indexes. SIGMOD 2003.
(with Prasanna Ganesan, Sunita Sarawagi)
- Robust and efficient fuzzy match for online data
cleaning, ACM SIGMOD 2003 (with Kris Ganjam, Venkatesh Ganti, Rajeev
Motwani).
- Automated Ranking of Database Query Results. CIDR
2003 (with Sanjay Agrawal, Gautam Das, and Aristides Gionis)
- DBXplorer: A System For Keyword-Based Search Over
Relational Databases. IEEE ICDE 2002. (with Sanjay Agrawal and Gautam
Das).
- Efficient Evaluation of Queries with Mining
Predicates. Proceedings of IEEE International Conference on Data
Engineering, 2002. (with Vivek Narasayya and Sunita Sarawagi).
- STHoles: A Multidimensional Workload-Aware Histogram.
Proceedings of the ACM SIGMOD 2001. (with Nicolas Bruno and Luis
Gravano).
- Integrating Data Mining with SQL Databases: OLE DB
for Data Mining. Proceedings of 17th International Conference on Data
Engineering, 2001 (with Amir Netz, Usama M. Fayyad, Jeff Bernhardt)
- Overcoming Limitations of Sampling for Aggregation
Queries. Proceedings of 17th International Conference on Data Engineering,
2001 (with Gautam Das, Mayur Datar, Rajeev Motwani and Vivek Narasayya).
- Rethinking Database System Architecture: Towards a
Self-tuning, RISC-style Database System. Proceedings of the 26th
International Conference on Very Large Databases (VLDB00) (with
Gerhard Weikum). (pdf version)
- Automated Selection of Materialized Views and Indexes
for SQL Databases. Proceedings of the 26th International Conference on
Very Large Databases (VLDB00) (with Sanjay Agrawal and Vivek
Narasayya). (pdf version)
- Towards Estimation Error Guarantees for Distinct
Values. 19th ACM SIGMOD-SIGACT-SIGART Symp. on Principles of Database
Systems, Dallas, USA, 2000 (with Moses Charikar, Rajeev Motwani, and Vivek
Narasayya). (pdf version)
- Automating Statistics Management for Query
Optimizers. Proceedings of 16th International Conference on Data
Engineering, San Diego, USA, 2000 (with Vivek Narasayya). (pdf version)
- Evaluating Top-k Selection Queries. Proceedings of
25th VLDB Conference, Edinburgh, Scotland, UK, 1999 (with Luis Gravano)
- Self-Tuning Histograms: Building Histograms Without
Looking at Data. Proceedings of ACM SIGMOD, Philadelphia,
1999 (with Ashraf Aboulnaga). (pdf version)
- On Random Sampling over Joins. ACM SIGMOD 1999
(with Rajeev Motwani and Vivek Narasayya). (pdf version)
- Random Sampling for Histogram Construction: How much
is enough? Proceedings of ACM SIGMOD, Seattle, 1998 (with Vivek Narasayya
and Rajeev Motwani). (pdf version)
- AutoAdmin "What-If" Index Analysis Utility.
Proceedings of ACM SIGMOD, Seattle, 1998 (with Vivek Narasayya). (pdf version)
- An Efficient Cost-Driven Index Selection Tool for
Microsoft SQL Server. Proceedings of the 23rd International Conference
on Very Large Databases (VLDB97), Athens, Greece,
1997, pp. 146-155 (with Vivek Narasayya). (pdf version)
- Data Mining and Database Systems: Where is the
Intersection? IEEE Data Engineering Bulletin, March 1998
Selected Publications (Pre-MSR)
- Optimizing Queries with User-Defined Predicates, VLDB
Conference 1996 (with Kyuseok Shim)
- Optimizing Queries over Multimedia Repositories, SIGMOD
Conference 1996 (with Luis Gravano)
- Optimizing Queries with Aggregate Views, EDBT 1996
(with Kyuseok Shim)
- An Overview of Cost-based Optimization of Queries
with Aggregates, Data Engineering Bulletin 18(3): 3-9, 1995 (with
Kyuseok Shim)
- Join Queries with External Text Sources: Execution
and Optimization Techniques, SIGMOD Conference 1995: 410-422 (with
Umeshwar Dayal and Tak W. Yan)
- Optimizing Queries with Materialized Views, ICDE
1995: 190-200 (with Ravi Krishnamurthy, Spyros Potamianos, Kyuseok
Shim)
- Including Group-By in Query Optimization, VLDB
1994: 354-366 (with Kyuseok Shim)
- Optimization of Real Conjunctive Queries, PODS
1993: 59-70 (with Moshe Y. Vardi)
Microsoft Research
One Microsoft Way
Redmond, WA 98052 USA
Contact information
(please, no soliciting):
Email surajitc@microsoft.com
Telephone 425-703-1938
Fax 425-936-7329