Glass of Milk


10:30 AM

Modern systems need to handle a variety of workloads and use cases. It is very difficult for a single system architecture to cater to all of these use cases; therefore, a “one-size-fits-all” architecture does not work in practice. Even with specialized architectures, customers must make compromises on performance, functionality, and usability when running “hybrid” workloads. Further, after the initial deployment, customers’ workloads and requirements may change.
Over the last several years, at Microsoft, we have been on a quest to design practical instance optimized systems (IOS): systems which are custom-fit or “tailored” to a customer’s initial requirements and continuously “adapt” to their changing requirements. A key insight is that by “learning” from the specific data and workload distributions, a system can instance optimize itself to that specific data and workload. I’ll talk about the enabling trends for instance optimization and present our work on instance optimized indexes and storage layouts. I’ll conclude with some open research challenges.


Umar is a Principal Researcher in the Data Systems Group at Microsoft Research, Redmond. He currently works on instance optimized systems – applying ML to systems, and on improving price-performance for cloud-based transactional and analytics platforms.

11:00 AM

Xuanhe Zhou

Transforming a SQL query into a semantically equivalent query with higher performance (query rewrite) is a fundamental and important problem in query optimization. Existing methods either rely on DBAs, who can take hours to analyze and rewrite a query, or apply heuristic rewrite rules in a default order (e.g., rewriting from the root node in top-down order). However, query rewrite has been proven to be NP-hard. In this talk, we will present preliminary research on designing advanced search and learning methods to solve the query rewrite problem, and outline future work on logical query optimization.


I am Xuanhe Zhou, currently a second-year Ph.D. student in the Dept. of Computer Science and Technology (CST), Tsinghua University. I am a member of the Database Group of Tsinghua and am supervised by Guoliang Li. My research interests are in AI/ML for data management, including but not limited to: (1) tuning database management systems using advanced machine learning methods; (2) optimizing query execution in both the logical and physical stages.

11:20 AM

The predominant paradigm today for learned DBMS components is workload-driven learning, i.e., running a representative set of queries on the database and using the observations to train a machine learning model. This approach, however, has two major downsides. First, collecting the training data can be very expensive, since many queries have to be executed on potentially large databases. Second, training data has to be recollected when the workload or the database changes.
Hence, in this talk we present our vision to tackle the high costs and inflexibility of workload-driven learning. First, we introduce data-driven learning, where the idea is to learn the data distribution over a complex relational schema. In contrast to workload-driven learning, no large workload has to be executed on the database to gather training data. While data-driven learning has many applications such as cardinality estimation or approximate query processing (AQP), many tasks such as physical cost estimation cannot be supported. We thus propose a second technique called zero-shot learning, which is a general paradigm for learned DBMS components. Here, the idea is to train models that generalize to unseen databases out-of-the-box, i.e., without requiring workloads as training data or retraining. Such a model has observed a variety of workloads on different databases during training and can thus generalize. Initial results on the task of physical cost estimation suggest the feasibility of this approach. Finally, we discuss research opportunities which are enabled by zero-shot learning.


I am a PhD student at Technical University of Darmstadt working with Prof. Binnig on systems for machine learning, including topics such as learned cardinality estimation, cloud design advisors, and physical cost estimation.

11:40 AM

Scanning and filtering over multi-dimensional tables are key operations in modern analytical database engines. To optimize the performance of these operations, databases often use multi-dimensional indexes or specialized data layouts (e.g., Z-order). However, these schemes are hard to tune and their performance is inconsistent.

In this talk, I will present Flood and Tsunami, two learned multi-dimensional read-optimized indexes for in-memory analytic workloads. By automatically co-optimizing the index structure and data layout for a particular dataset and workload, Flood and Tsunami achieve up to 10X faster query performance and 100X smaller index size than optimally-tuned traditional indexes. I will conclude by giving a brief overview of ongoing work on a new instance-optimized caching mechanism that goes hand in hand with data layouts.
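
For readers unfamiliar with the Z-order layout mentioned above as a traditional baseline, the following is a minimal sketch of Z-order (Morton) encoding for two-dimensional points. It is illustrative only and is not part of Flood or Tsunami; the function names and the 16-bit coordinate width are assumptions made for this example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Z-order (Morton) code of a 2-D point.

    Assumes x and y are non-negative integers that fit in `bits` bits.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return code


def z_order_sort(points):
    """Sort 2-D points by Morton code, i.e., the order a Z-order layout stores them in."""
    return sorted(points, key=lambda p: interleave_bits(p[0], p[1]))
```

Sorting by the interleaved code keeps points that are close in both dimensions close together on disk or in memory, which is why range filters over either column can skip large parts of the table. How well this works depends on the data and query distribution, which is the tuning difficulty the abstract refers to.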


Jialin is a fourth-year PhD student in the MIT Data Systems Group, where he is advised by Prof. Tim Kraska. His research focuses on using machine learning to enhance database systems, with an emphasis on index structures and data storage layouts. His research is supported by a Facebook Fellowship. Prior to MIT, Jialin received his BS from Stanford University.

01:00 PM

In this talk, we discuss machine programming, which is principally aimed at the automation of software development. We discuss how our research team at Intel Labs, Machine Programming Research, is working toward new ways to automatically develop software based on two key tenets: (i) improving software developer productivity and (ii) improving software quality (e.g., correctness, performance, maintainability, etc.). We discuss the three pillars of machine programming, which are the foundation of all the work we do. We then explore some concrete examples of MP systems that we have recently built that have demonstrated state-of-the-art performance in code semantics similarity, debugging, and optimization.


Justin Gottschlich is a Principal AI Scientist and the Director & Founder of Machine Programming Research at Intel Labs. He also has an academic appointment as an Adjunct Assistant Professor at the University of Pennsylvania. Justin is the Principal Investigator of the upcoming Intel Machine Programming Research Center, which will focus on the automation of software development. He co-founded the ACM SIGPLAN Machine Programming Symposium (MAPS) and serves as its Steering Committee Chair. He is currently serving on two technical advisory boards: the 2020 NSF Expeditions led by MIT “Understanding the World Through Code” and Inteon, a new machine programming (MP) venture funded by Intel. Justin received his PhD in Computer Engineering from the University of Colorado-Boulder in 2011 and has 40+ peer-reviewed publications, ~50 issued patents, with 100+ patents pending. Justin and his team’s research has been highlighted in mainstream venues like Communications of the ACM, MIT Technology Review, The New York Times, and The Wall Street Journal.

01:30 PM

In the new era of ML-based systems, many works have proposed ML as a remedy to the problems of cost and cardinality estimation in query optimization. These works usually include either building an ML model to estimate the cost or cardinality of a query plan or learning the entire optimizer. Although these solutions have demonstrated promising results, they come with a new challenge which our community has largely overlooked: the low availability of training data. In ML-based query optimization, training data includes large and diverse plan workloads together with their execution time or output cardinality. Collecting such training data has a very high cost in terms of time and money due to the development and execution of thousands of realistic query plans. In this tutorial, we will discuss how we can overcome this challenge using an innovative data-driven framework for efficiently generating large labeled training datasets for ML-based query optimization tailored to users’ small input workload.


Zoi Kaoudi is a Senior Researcher in the DIMA team of the Technical University of Berlin. She has previously worked as a Scientist at the Qatar Computing Research Institute (QCRI) of the Hamad Bin Khalifa University in Qatar, at the IMIS-Athena Research Center as a research associate, and at Inria as a postdoctoral researcher. She received her PhD from the National and Kapodistrian University of Athens in 2011. Her research interests include learning-based query optimization, scalable machine learning systems, and distributed RDF data management. She has been the proceedings chair of EDBT 2019, has co-chaired the TKDE poster track co-located with ICDE 2018, and co-organized MLDAS 2019 held in Qatar. She has co-authored articles in both the database and Semantic Web communities and served as a Program Committee member for several international database conferences.

01:50 PM

In this talk, we introduce a new deep learning approach to cardinality estimation, which is the core problem in cost-based query optimization. We propose a new neural network model (MSCN) that can capture correlations between columns. Trained with past queries, our model can predict the cardinalities of future queries and significantly enhance the quality of cardinality estimation. We also briefly discuss follow-up work on a new loss function (Flow-Loss) that improves MSCN's impact on resulting query plans by focusing the model capacity on the estimates that matter.


Andreas Kipf is a postdoc researcher in the MIT Data Systems Group working with Prof. Tim Kraska. His interests are in improving systems with machine learning with a focus on indexing and query optimization. Andreas earned his PhD at TUM where he worked with Prof. Alfons Kemper and Prof. Thomas Neumann. During his PhD, he interned with Google in Mountain View & Zurich to work on query-driven materialization and lightweight secondary indexing. Andreas won the 2016 SIGMOD Best Demonstration Award and the 2017 SIGMOD Programming Contest.

02:30 PM

We revisit some fundamental and ubiquitous problems in data structures design, such as predecessor search and rank/select primitives, by exploiting a reduction from the input data to the geometry of points in a Cartesian plane [Ao et al., VLDB 11].
We introduce techniques that discover, or “learn”, the regularities in this Cartesian plane and solve the problems above in an efficient algorithmic way [Ferragina and Vinciguerra, VLDB 20; Boffa et al., ALENEX 21]. Surprisingly, we show that it is possible to both learn the data regularities and retain the space-time complexity bounds of traditional data structures, which translates into robust performance in practice, as we show by experimenting with them on some large datasets.
Motivated by these results, we dig into the core components of these structures and present the first mathematically grounded answer to why learning-based compressed data structures can outperform their traditional counterparts [Ferragina et al., Theor. Comp. Sci. 2021].
We conclude by discussing the plethora of research opportunities that these new approaches to data structures design open up.
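
The reduction described above can be illustrated with a small sketch: each sorted key is treated as a point (key, position) in the plane, a model is fitted to those points, and the model's measured maximum error bounds the range that must still be searched. The version below uses a single least-squares line and assumes distinct, sorted, numeric keys; the cited works use piecewise models with stronger guarantees, so the class and method names here are purely illustrative.

```python
import math
from bisect import bisect_right


class LinearRankModel:
    """One linear model over (key, position) points with a measured error bound."""

    def __init__(self, keys):
        if not keys:
            raise ValueError("keys must be non-empty")
        self.keys = keys  # assumed sorted and distinct
        n = len(keys)
        mean_k = sum(keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
        # Least-squares fit; the slope is non-negative for sorted keys.
        self.slope = max(0.0, cov / var) if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Largest prediction error over the stored keys.
        self.eps = max(abs(self._predict(k) - i) for i, k in enumerate(keys))

    def _predict(self, key):
        return self.slope * key + self.intercept

    def predecessor(self, q):
        """Return the largest stored key <= q, or None if there is none."""
        keys = self.keys
        if q < keys[0]:
            return None
        if q >= keys[-1]:
            return keys[-1]
        pred = self._predict(q)
        # The model is monotone, so the answer lies within eps (+1) of the prediction.
        lo = max(0, math.floor(pred - self.eps - 1))
        hi = min(len(keys), math.ceil(pred + self.eps) + 1)
        return keys[bisect_right(keys, q, lo, hi) - 1]
```

The binary search touches only a window whose width is governed by the model error rather than by the number of keys, which is the intuition behind combining learned models with classical worst-case bounds.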


Giorgio Vinciguerra is a third-year PhD student in Computer Science at the University of Pisa supervised by Prof. Paolo Ferragina. He holds a Master's degree in Computer Science from the same University. He has been a visiting student at the Harvard DASlab led by Prof. Stratos Idreos. His main research interests are compressed data structures and algorithm engineering.

02:50 PM

General-purpose index structures such as the B-tree cannot exploit common patterns in real-world data distribution, prompting the use of machine learning (ML) models for indexing. However, learned indexing is still in its infancy. In general, and similar to the analogy of “petrol vs electric” engines, adopting machine learning techniques could yield elegant methods in data management, but the design and maintenance of an ML-enhanced DBMS open numerous challenges that require comprehensive research.

We adopt a different position: We advocate that leveraging patterns in data distribution does not necessarily require transplanting and replacing the heart of a classical index structure. Therefore, instead of going back to the drawing board to fit data systems with learned models, we can develop lightweight “hybrid engines” where machine learning and prediction can be embedded into traditional index structures with minimal changes to the original algorithms. Several case studies show how the idea of learned indexes can be embedded into an existing algorithmic index without modifying the core components or the theoretical performance guarantees of well-established traditional indexes such as the B+tree.


Ali Hadian is a final-year PhD candidate at Imperial College London working on machine learning applications in modern database systems, specifically in-memory indexes. His research interests lie in the intersection of machine learning, big data systems, and knowledge graphs.

03:10 PM

In the era of increasing application diversity and data growth, designing scalable storage engines is one of the critical challenges in modern data systems. With the goal of creating efficient and tailored solutions for the problem at hand, we present Cosine, a key-value storage engine that self-designs to one of a sextillion (10^36) possible designs on the cloud based on the application context. At its core, Cosine uses analytical I/O models and learned concurrency models to precisely reason about the performance of arbitrary designs. The learned and non-learned components in Cosine feed off each other to outperform state-of-the-art storage engines by up to 60X.


Subarna Chatterjee is a post-doc at Harvard University advised by Stratos Idreos. Her research is about data systems design on the cloud using learning approaches to optimize system design. In 2016, she was selected as one of the "10 Women in Networking/Communications That You Should Watch" and is one of the young scientists to attend the Heidelberg Laureate Forum.

04:00 PM

This talk will summarize five key lessons I’ve personally learned over my time developing learned query optimization techniques. After a brief summary of work on learned query optimization, I will discuss (1) the nature of “arbitrary” queries, (2) the secret art of implementing ML techniques, (3) how to make evaluations of learned components (more) meaningful, (4) the tradeoffs between coarse- and fine-grained learning techniques, and (5) how to develop tools to make end-users more comfortable with learned components.


I'm currently a CS postdoc researcher at MIT under the supervision of Tim Kraska. I also work as a researcher at Intel Labs. I research applications of machine learning to systems, especially databases.

04:20 PM

Cloud services are increasingly adopting new programming models, such as microservices and serverless compute. While these frameworks offer several advantages, such as better modularity, ease of maintenance and deployment, they also introduce new hardware and software challenges.

In this talk, I will briefly discuss the challenges that these new cloud models introduce in hardware and software, and present some of our work on accelerating critical microservice computation, and on employing ML to improve the cloud’s performance predictability and resource efficiency. I will first discuss Seer, a performance debugging system that identifies root causes of unpredictable performance in multi-tier interactive microservices, and Sage, which improves on Seer by taking a completely unsupervised learning approach to data-driven performance debugging, making it both practical and scalable.


Christina Delimitrou is an Assistant Professor and the John and Norma Balen Sesquicentennial Faculty Fellow at Cornell University, where she works on computer architecture and computer systems. She specifically focuses on improving the performance predictability and resource efficiency of large-scale cloud infrastructures by revisiting the way these systems are designed and managed. Christina is the recipient of the 2020 TCCA Young Computer Architect Award, an Intel Rising Star Award, a Microsoft Research Faculty Fellowship, an NSF CAREER Award, a Sloan Research Scholarship, two Google Research Awards, and a Facebook Faculty Research Award. Her work has also received 4 IEEE Micro Top Picks awards and several best paper awards. Before joining Cornell, Christina received her PhD from Stanford University. She had previously earned an MS also from Stanford, and a diploma in Electrical and Computer Engineering from the National Technical University of Athens. More information can be found at:

04:40 PM

We introduce BOURBON, a log-structured merge (LSM) tree that utilizes machine learning to provide fast lookups. We base the design and implementation of BOURBON on empirically-grounded principles that we derive through careful analysis of LSM design. BOURBON employs greedy piecewise linear regression to learn key distributions, enabling fast lookup with minimal computation, and applies a cost-benefit strategy to decide when learning will be worthwhile. Through a series of experiments on both synthetic and real-world datasets, we show that BOURBON improves lookup performance by 1.23x-1.78x compared to state-of-the-art production LSMs.
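
To give a feel for what greedy piecewise linear regression over sorted keys looks like, here is a minimal sketch. It uses a simple "shrinking cone" variant in which every segment's line passes through the segment's first point; this is an assumption made for brevity and is not necessarily the exact algorithm used in BOURBON. The function names and the `delta` error bound are likewise illustrative.

```python
from bisect import bisect_right


def greedy_plr(keys, delta):
    """Greedily split sorted, distinct keys into linear segments.

    Each segment is (first_key, first_pos, slope), and its line predicts the
    position of every key it covers to within `delta`.
    """
    segments = []
    i, n = 0, len(keys)
    while i < n:
        x0, y0 = keys[i], i
        lo, hi = 0.0, float("inf")  # feasible slope range (the "cone")
        j = i + 1
        while j < n:
            dx = keys[j] - x0
            new_lo = max(lo, (j - y0 - delta) / dx)
            new_hi = min(hi, (j - y0 + delta) / dx)
            if new_lo > new_hi:  # no single slope fits this point too
                break
            lo, hi = new_lo, new_hi
            j += 1
        slope = lo if hi == float("inf") else (lo + hi) / 2
        segments.append((x0, y0, slope))
        i = j  # the next segment starts at the first point that did not fit
    return segments


def predict_position(segments, key):
    """Predict a key's position from the segment with the largest first_key <= key."""
    starts = [first_key for first_key, _, _ in segments]
    idx = max(bisect_right(starts, key) - 1, 0)
    first_key, first_pos, slope = segments[idx]
    return first_pos + slope * (key - first_key)
```

A lookup would evaluate `predict_position` and then search only the few positions within `delta` of the prediction, so the cost per lookup is one segment search plus a short local scan. Whether building such a model pays off for a given file is the kind of question a cost-benefit strategy would have to answer.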


Aishwarya Ganesan is a postdoctoral researcher at VMware Research. She recently earned her PhD from the University of Wisconsin – Madison. She is broadly interested in distributed systems and storage systems. Her research primarily focuses on building distributed storage systems that provide strong guarantees yet also perform well. Aishwarya’s research has been recognized with best-paper awards at FAST 20 and FAST 18, and a best paper award nomination at FAST 17. She was selected for the Rising Stars in EECS workshop and is a recipient of a Facebook PhD Fellowship. She also received the graduate student instructor award for teaching distributed systems at UW Madison.

05:20 PM

Summarizing a large dataset with a reduced-size data synopsis
has applications from database query optimization to approximate
query processing. Increasingly, data synopsis approaches leverage
the inherent compression properties of machine learning (ML) models
to achieve state-of-the-art results. This talk will deconstruct this
trend to understand the key mechanisms behind machine learning's
recent success in a historically well-established area of research.
(The Good) I present a series of results that suggest ML models are
astonishingly accurate at many different types of high-dimensional data
summarization. (The Bad) I show that in "medium-dimensional" regimes
it is possible to design new classical data synopsis techniques that
meet or exceed the performance of ML models. (The Ugly) I discuss the
under-appreciated reliability-gap between ML models and classical data
summarization techniques.


Sanjay Krishnan is an Assistant Professor of Computer Science at
the University of Chicago. His research studies the intersection of machine
learning and database systems. Sanjay completed his PhD and Master’s Degree
at UC Berkeley in Computer Science in 2018. Sanjay's work has received a number
of awards including the 2016 SIGMOD Best Demonstration award, 2015 IEEE GHTC
Best Paper award, and Sage Scholar award.

05:40 PM

Join is arguably the most costly and frequently used operation in relational query processing. Join algorithms usually spend the majority of their time on scanning and attempting to join the parts of the base relations that do not satisfy the join condition and do not generate any results. This causes slow response times, particularly in interactive and exploratory environments where users would like real-time performance. In this paper, we outline our vision on using online learning and adaptation to execute joins efficiently. In our approach, scan operators that precede a join learn which parts of the relations are more likely to join during query execution and produce more results faster by performing fewer I/O accesses. Our empirical studies using standard benchmarks indicate that this approach outperforms similar methods considerably.


Arash Termehchy is an Associate Professor at the School of EECS at Oregon State University. He received his PhD from the University of Illinois at Urbana-Champaign. His research interests are using ML in complex systems, learning & reasoning over heterogeneous data, and human-centric data systems. His research has been recognized by the ACM SIGMOD Research Highlight Award, Best of Conference Selections of SIGMOD and ICDE, Best Student Paper Award of ICDE, and Yahoo! Key Scientific Challenges Award.

06:00 PM

In this talk, we will present practical aspects of building systems that learn over large clouds. We will discuss our experiences from introducing learning-based features in the SCOPE and Spark query optimizers and touch upon some of the open challenges we see going forward.


Alekh Jindal is a Principal Scientist at Gray Systems Lab (GSL), Microsoft and manages the Redmond site of the lab. His research focuses on improving the performance of large-scale data-intensive systems. Earlier, he was a postdoc associate in the Database Group at MIT CSAIL. Alekh received his PhD from Saarland University, working on flexible and scalable data storage for traditional databases as well as for MapReduce. In the past 10 years, Alekh has served as a chair, PC member and reviewer at top-tier conferences in the field including SIGMOD, VLDB, ICDE, and SOCC. He received best paper awards at SIGMOD 2021, VLDB 2014 and CIDR 2011.