NTUA (supervisor: Prof. Yannis Vassiliou), July 2023
Abstract. This thesis introduces novel indexing techniques for the visual exploration of data stored in large raw files. Data is now produced at an extraordinary pace, and the ability to process and understand it quickly is increasingly vital. Conventional data exploration tools rely heavily on Database Management Systems (DBMSs), which require data to be loaded and indexed before it can be analyzed. These steps can be expensive, time-consuming, and impractical, especially when the data may be discarded after analysis. The first part of the thesis examines the shortcomings of existing tools and methodologies for in-situ data exploration and makes the case for a more efficient system. We then present a formal visual exploration model in which user operations are translated into data access operations. We further propose novel in-memory indexing techniques and cost models, with emphasis on adaptive indexing and lightweight data structures. These techniques are designed to handle large volumes of raw data: they minimize the I/O cost of accessing the data file, and they let the user start exploring quickly by building a crude version of the index when the user first requests to analyze a file. The index then becomes more detailed and adapts to the user's exploration with each operation. In addition, to handle scenarios with limited resources, we introduce a resource-aware index initialization mechanism and propose efficient approximation algorithms for the corresponding optimization problem. Extensive experiments on both real and synthetic datasets show that the proposed techniques outperform existing solutions, addressing the need for more efficient and intuitive raw data exploration methods.
These indexing techniques and schemes form the backbone of the RawVis system, enabling efficient query processing while bypassing expensive preprocessing steps such as data loading and DBMS indexing. RawVis provides a complete client-server architecture for visual data exploration directly over raw data files, including a rich user interface with a wide range of visualization and analysis options. A user study demonstrates RawVis in use and highlights its ability to deliver immediate and meaningful analytics. In summary, this thesis contributes to the field of raw data exploration a novel system and techniques that increase data handling efficiency, reduce resource usage, and improve the user experience in terms of speed and interactivity.
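
The adaptive indexing idea summarized in the abstract (start with a crude index, then refine it where the user explores) can be illustrated with a minimal sketch. This is not the thesis's actual index or the RawVis implementation. The `Tile` class, the quadrant split, the `min_points` threshold, and the use of a row id in place of a file offset are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Tile:
    """Hypothetical index tile covering the half-open box [x0, x1) x [y0, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float
    points: list = field(default_factory=list)    # (x, y, row_id) entries
    children: list = field(default_factory=list)  # empty while the tile is a leaf

    def contains(self, x, y):
        return self.x0 <= x < self.x1 and self.y0 <= y < self.y1

    def overlaps(self, qx0, qy0, qx1, qy1):
        return self.x0 < qx1 and qx0 < self.x1 and self.y0 < qy1 and qy0 < self.y1

    def inside(self, qx0, qy0, qx1, qy1):
        return qx0 <= self.x0 and self.x1 <= qx1 and qy0 <= self.y0 and self.y1 <= qy1

    def split(self):
        """Refine this leaf into four quadrants and push its points down."""
        mx, my = (self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2
        self.children = [
            Tile(self.x0, self.y0, mx, my), Tile(mx, self.y0, self.x1, my),
            Tile(self.x0, my, mx, self.y1), Tile(mx, my, self.x1, self.y1),
        ]
        for p in self.points:
            for child in self.children:
                if child.contains(p[0], p[1]):
                    child.points.append(p)
                    break
        self.points = []


def count_in_window(tile, window, min_points=2):
    """Count points inside `window` and refine leaves it only partially covers."""
    if not tile.overlaps(*window):
        return 0
    if tile.children:
        return sum(count_in_window(c, window, min_points) for c in tile.children)
    if tile.inside(*window):
        return len(tile.points)  # fully covered leaf: no per-point check needed
    qx0, qy0, qx1, qy1 = window
    hits = sum(1 for x, y, _ in tile.points if qx0 <= x < qx1 and qy0 <= y < qy1)
    if len(tile.points) > min_points:
        tile.split()  # adapt: the next query in this area sees finer tiles
    return hits


def leaf_count(tile):
    return 1 if not tile.children else sum(leaf_count(c) for c in tile.children)


# Crude initial index: a single tile over the whole (assumed) data space.
root = Tile(0, 0, 8, 8)
root.points = [(1, 1, 0), (2, 6, 1), (5, 5, 2), (6, 2, 3), (7, 7, 4), (3, 3, 5)]
first = count_in_window(root, (0, 0, 4, 4))   # scans the root, then splits it
second = count_in_window(root, (0, 0, 4, 4))  # answered by a fully covered tile
```

In this toy setup the first query has to check every point in the single coarse tile, while the repeated query is answered from one refined tile without per-point checks. That mirrors, in a very simplified form, the trade-off the abstract describes: a cheap start, with the cost of refinement paid incrementally and only in the regions the user actually visits.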