TU Delft publishes several papers on Opertus Mundi research
14-Jun-2021, by Kathrin Lenvain
Article by Andra Ionescu, TU Delft
Easy data access.
The data management group at TU Delft researches methods that help users discover and integrate data sources. A popular data repository today is the data lake, where users can store vast amounts of heterogeneous datasets. Users of data lakes put tremendous effort into finding the right information and integrating the datasets that are ultimately used to perform various data-related tasks. Finding such datasets is a tedious process, and sometimes the user has to inspect the data manually to find the relevant information. As data is produced at an unprecedented rate, the need and expectation to make it easily available to end users is growing.
Data discovery as the solution.
As such, data discovery has become an important subject in the data management community. Data discovery is the process of finding relevant data among thousands of disparate, heterogeneous datasets so that the user can fulfill an information need. It is the first step in any data management pipeline, as users must first discover the information before they can perform the task at hand.
The two research areas.
Despite the popularity of data lakes, and the wealth of research on dataset discovery, we still lack proper tools to aid users in the discovery of datasets as well as the augmentation of existing datasets with information extracted from multiple tables. Therefore, we research the problem from two angles.
Firstly, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. Nowadays, schema matching serves as a building block for indicating and ranking inter-dataset relationships. Moreover, the latest discovery methods employ existing schema matching algorithms in an ad-hoc fashion due to the lack of openly available datasets with ground truth, reference method implementations, and evaluation metrics. Our findings suggest that there is no one-size-fits-all solution: each algorithm performs best in specific scenarios, and even simple baselines perform well under certain conditions. We should also focus more on the human-in-the-loop approach, as users can help tremendously in discovering datasets.
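To make the idea of column-level schema matching concrete, the following is a minimal sketch of one simple instance-based baseline: scoring column pairs by the Jaccard overlap of their values. It is an illustration only, not the method evaluated in the papers listed below; the function names, the threshold of 0.5, and the toy tables are all hypothetical.

```python
from itertools import product


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_columns(source: dict, target: dict, threshold: float = 0.5) -> list:
    """Return (source column, target column, score) triples ranked by value overlap.

    Tables are represented as dicts mapping a column name to its list of values.
    The threshold is an illustrative assumption, not a recommended setting.
    """
    matches = []
    for (s_name, s_vals), (t_name, t_vals) in product(source.items(), target.items()):
        score = jaccard(set(s_vals), set(t_vals))
        if score >= threshold:
            matches.append((s_name, t_name, score))
    # Highest-scoring correspondences first.
    return sorted(matches, key=lambda m: m[2], reverse=True)


# Hypothetical toy tables from a data lake.
customers = {"cust_id": [1, 2, 3, 4], "city": ["Delft", "Athens", "Paris", "Delft"]}
orders = {"customer": [2, 3, 4, 5], "town": ["Delft", "Paris"], "amount": [10.0, 25.5, 7.2]}

ranked = match_columns(customers, orders)
for source_col, target_col, score in ranked:
    print(f"{source_col} <-> {target_col}: {score:.2f}")
```

A baseline like this ignores column names and data types entirely, which is one reason different matchers can be expected to behave differently depending on the scenario.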
To this end, our second angle aims to improve the usability and performance of the data discovery process. We propose an interactive system that facilitates data exploration, discovery, and augmentation. We aim to design a system that requires less prerequisite knowledge from users and provides them with more understandable results. Starting from a simple keyword search and continuing via point-and-click interaction, a non-expert user can explore the data lake. With users in full control of each step of the process, they can better interpret the intermediate and final results, which they can export in known formats and use directly in downstream data science applications.
All integrated in Topio.
TU Delft's work on dataset discovery will contribute greatly to the Topio marketplace, as we can show users how the datasets relate and link to each other, helping them make an informed purchase.
The research findings have been published in several scientific papers as listed below:
- Andra Ionescu, Interactive Data Discovery in Data Lakes, In the Proceedings of the 2021 PhD Workshop at the VLDB International Conference on Very Large Data Bases.
- Pedro Fortunato Silvestre, Marios Fragkoulis, Diomidis Spinellis, Asterios Katsifodimos, Clonos: Consistent Causal Recovery for Highly-Available Streaming Dataflows, In the Proceedings of the 2021 ACM SIGMOD International Conference on Management of Data.
- Christos Koutras, George Siachamis, Andra Ionescu, Kyriakos Psarakis, Jerry Brons, Marios Fragkoulis, Christoph Lofi, Angela Bonifati, Asterios Katsifodimos, Valentine: Evaluating Matching Techniques for Dataset Discovery, In the Proceedings of the 2021 IEEE International Conference on Data Engineering (ICDE).
- Paris Carbone, Marios Fragkoulis, Vasiliki Kalavri, Asterios Katsifodimos, Beyond Analytics: The Evolution of Stream Processing Systems, In the Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (tutorial).
- Christos Koutras, Marios Fragkoulis, Asterios Katsifodimos, Christoph Lofi, REMA: Graph Embeddings-based Relational Schema Matching, SEA Data workshop colocated with EDBT 2020.