IMSI focuses on the following research areas:

Big Data Analytics and Machine Learning

Scalable, interactive Big Data analytics. One key direction of the group is to address a number of challenges relating to the data itself, the infrastructure and the users. Data challenges include scale, heterogeneity, structure or the lack thereof, dynamic nature and privacy. In this context, the group focuses on the exploration and analysis of noisy data, including overlapping, incomplete or contradictory data from multiple sources, which has become common with, for example, the emergence of data aggregators. Approaches include query-time cleaning, repair, deduplication, clustering and exploration. Infrastructure challenges include the structure of the hardware (edge devices, distributed platforms, supercomputers), the placement of physical storage versus processing nodes, the network structure, as well as application-specific data workflows. The group focuses on extreme-scale analytics through physical optimization over several criteria, including runtime, throughput, latency, scheduling, and system and monetary resources. The main objective is to bring computation closer to the data, for example by in-situ data processing, by leveraging hardware specificities without affecting the interface to applications, and by decoupling engine primitives from the underlying data store platform. The user challenge is essentially to make these technologies accessible to non-expert data analysts. To this end, novel algorithms are investigated to support exploring, processing, visualizing and extracting insights from data on the fly, guided by user interaction. Besides exploring data technology to assist machine learning, the group also investigates learning techniques to assist or even replace traditional data engine functions, such as query optimization, scheduling and workload management.

Large-scale Machine Learning. Beyond data analytics, a second direction is to investigate Machine Learning models that make predictions at large scale. This includes learning new representations from raw data, which can be used to solve new tasks. The key challenges are the scale and diversity of the data, missing, noisy or inconsistent supervision, as well as the dynamic nature of both. Besides massively parallel processing on distributed platforms, learning on large-scale data is facilitated by continual learning on streaming data and by non-parametric models that can be adapted easily. The group studies and builds on recent advances in self-supervised learning to compensate for flawed or missing supervision, and extends the state of the art towards learning compact representations that enable scaling up. The focus is on strong mathematical foundations and interdisciplinary research to handle data of multiple modalities, including vision, language and time series, as well as structured and high-dimensional data. Several application domains are considered, including geometric modeling and CAD/CAM, scientific databases and publications, information retrieval, and bioinformatics. The results are applicable to several sectors, including health, education, environment, transportation, finance, materials, and food and agriculture.

Big Data Research Infrastructures

Generic Data infrastructures. Such infrastructures provide generic, scalable data processing services for very large and heterogeneous scientific data, ready to be used as software building blocks by other RIs. The group has a leading role in HELIX, a horizontal digital RI for data-intensive research that handles the data management, analysis, sharing, and reuse needs of Greek scientists and innovators in a cross-disciplinary, scalable, and low-cost manner. HELIX also provides its services as an autonomous RI in support of data sharing, open-access publishing, and data experimentation.

Open Science. A critical mission of the European Commission is to provide unlimited, barrier-free Open Access to research outputs financed by public funding in the EU. The group has a leading role in OpenAIRE, an RI whose mission is to fulfill the European Open Science Cloud (EOSC) vision, but which also has global outreach. Its operations already provide the glue for many user- and research-driven functionalities, whether these come from the long tail of science (repositories and local support), from domain-specific research communities, or from other RIs.

Health. The practice of the life sciences is becoming increasingly data-driven. The group has a leading role in RIs (ELIXIR-GR, Inspired-RIs, Oncopmnet) serving a range of domains, from genomics and structural biology to medicine. ELIXIR-GR is the Greek node of ELIXIR, the distributed ESFRI RI for data, tools, standards, and training, serving the life science community with open, integrated, and state-of-the-art bioinformatics and biocomputing resources. Inspired-RIs focuses on integrated structural biology, drug screening, and target functional characterization. Oncopmnet implements the Hellenic Precision Medicine Network on Oncology, providing digital tools and systems for the organization, processing, and analysis of cancer data.

Humanities and Digital Curation. The humanities, too, are going digital. Thus, a strategic action line is the development of digital research infrastructures for the humanities at the national and European levels, with an emphasis on the lifecycle of curated data. This action line has been undertaken by the Digital Curation Unit (DCU), a department of IMSI since 2009, led by Prof. Panos Constantopoulos, which now carries out its activities under the Big Data Research Infrastructures Group. Digital curation encompasses a set of activities aiming at the production of high-quality, dependable digital assets; their organization, archiving and long-term preservation; and the generation of added value from digital assets by means of resource-based knowledge elicitation. To ensure the adequate capture of the context of digital resources and their subsequent creative and effective use, the DCU adopts a multidisciplinary approach that considers the full lifecycle of digital assets, such as records, digital surrogates and scholarly/scientific datasets.

Cloud Platforms and Data Services

Data processing and analytics on the cloud. The key focus of the group is on technologies that enable scalable data analytics on the cloud. Our research focuses on the data analytics services layer, addressing scalability problems with solutions based on distributing computation across multiple cores, VMs, or containers. Parallel in-memory data analytics operators for complex data (e.g., spatial data, intervals, time series, and incomplete and heterogeneous data) are among the most active research efforts. The group is also working on developing cloud-based data analytics services in the context of different disciplines and sectors, such as energy analytics and life sciences and medical data. For energy monitoring data (time series), data analysis operations tailored to pattern detection and extraction are optimized for accuracy and scalability in cloud environments. For telco data, research has focused on end-to-end big data solutions for stream analytics on network quality data coming from IoT devices, such as drones and autonomous cars. For scholarly data, research concerns performance and accuracy optimization of entity resolution and entity interlinking in data integration workflows.

Privacy-based processing of data in the cloud. A key issue with cloud-based analytics is privacy, along with the restrictions that apply when personal data are involved. Several privacy-preserving strategies can be employed to protect personal data, including design principles, encryption, differentially private algorithms and data anonymization techniques. The group has been active in most of these aspects, providing design and governance principles for health information systems, data anonymization techniques and tools, etc. The group supports the public open-source data anonymization tool Amnesia.
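To make the anonymization idea concrete, the following minimal sketch (illustrative only, not code from Amnesia) checks the classic k-anonymity property: every combination of quasi-identifier values must occur at least k times in the released dataset.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Check whether each combination of quasi-identifier values
    appears at least k times in the dataset."""
    combos = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in combos.values())

# Toy dataset: age generalized to ranges, zip code truncated
# (hypothetical records, for illustration only).
records = [
    {"age": "20-30", "zip": "104**", "diagnosis": "flu"},
    {"age": "20-30", "zip": "104**", "diagnosis": "cold"},
    {"age": "30-40", "zip": "115**", "diagnosis": "flu"},
    {"age": "30-40", "zip": "115**", "diagnosis": "asthma"},
]

print(is_k_anonymous(records, ["age", "zip"], k=2))  # True
```

Real anonymization tools go further, searching for the generalization of quasi-identifiers that satisfies k-anonymity while losing as little information as possible.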

Data services. In modern cloud environments, data services (database-as-a-service, ML, etc.) often need to operate next to where the data is generated, e.g., to reduce data transfer overhead or because sensitive data cannot move out of the production system. In this context, new end-user data services and applications require in-situ analysis, i.e., analysis performed directly on the data residing at the edge, without the need to move and load the data into a cloud database. The group has been active in developing in-situ techniques, such as scalable interactive visualization techniques for in-situ visual analysis of data, and has developed the public open-source visualization tool VisualFacts.

Domain-specific and explainable AI services. It is often the case that applying generic, state-of-the-art ML algorithms and workflows is not adequate to effectively solve specialized tasks that are nevertheless quite significant for real-world applications. This has become evident in various scenarios, including Earth Observation and analytics settings, as well as medical image analysis. Our aim is to research how state-of-the-art ML/DL algorithms and methodologies can be properly extended, utilizing domain knowledge, to effectively solve real-world problems. In parallel, explainability is becoming a de facto requirement for several types of AI systems and services. In this context, the group implements model-agnostic explainability services, with emphasis on user interactivity and the explainability of the fairness of AI systems, and on their deployment in the form of Functions-as-a-Service.
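As an illustration of what a model-agnostic explainability service computes, the sketch below implements permutation feature importance in plain Python: shuffle one feature at a time and measure how much the model's score drops. The model, data and function names are hypothetical; a real service would wrap a similar loop around the deployed model.

```python
import random

def permutation_importance(model, X, y, metric, n_repeats=5, seed=0):
    """Model-agnostic importance: permute one feature column at a
    time and record the average drop in the model's score."""
    rng = random.Random(seed)
    baseline = metric(y, [model(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            col = [row[j] for row in X]
            rng.shuffle(col)
            # Rebuild the dataset with column j permuted.
            Xp = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, col)]
            drops.append(baseline - metric(y, [model(row) for row in Xp]))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical "model": its prediction depends only on the first feature.
model = lambda row: 1 if row[0] > 0.5 else 0
accuracy = lambda y_true, y_pred: sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)

X = [[0.9, 0.1], [0.8, 0.7], [0.2, 0.9], [0.1, 0.3]]
y = [1, 1, 0, 0]
print(permutation_importance(model, X, y, accuracy))
```

Because the technique treats the model as a black box, the same loop applies to any deployed model, which is what makes it a natural fit for a Function-as-a-Service explainability endpoint.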

Distributed and Web Information Systems

Web of Data. The Semantic Web is a collection of technologies that enable the linking and semantic annotation of various types of data from heterogeneous sources, leveraging information from standard vocabularies and ontologies. Linked Data, i.e., interrelated datasets, can boost knowledge discovery and data-driven analytics. Entity resolution and similarity joins lie at the heart of the interlinking process, as well as data integration in general. Addressing these problems raises challenges both in terms of efficiency and effectiveness. Regarding the former, scaling to very large collections of entities requires elaborate techniques for candidate selection and filtering. Achieving high accuracy is also challenging, due to the presence of various types of attributes, similarity measures and linking criteria, which leads to a large parameter space, involving different tradeoffs with respect to precision and recall.
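A minimal sketch of how blocking keeps a similarity join tractable: only entity pairs sharing at least one token become candidates, and candidates are then verified with Jaccard similarity, avoiding the full cross product. The entity collections below are illustrative.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b)

def similarity_join(left, right, threshold=0.5):
    """Token-blocking similarity join between two entity collections,
    each mapping an entity id to its set of tokens."""
    # Inverted index (one block per token) over the right collection.
    index = {}
    for rid, tokens in right.items():
        for t in tokens:
            index.setdefault(t, set()).add(rid)
    matches = []
    for lid, tokens in left.items():
        # Candidate selection: entities sharing at least one token.
        candidates = set().union(*(index.get(t, set()) for t in tokens))
        # Verification: compute the actual similarity on candidates only.
        for rid in candidates:
            if jaccard(tokens, right[rid]) >= threshold:
                matches.append((lid, rid))
    return matches

left = {"e1": {"john", "smith", "london"}, "e2": {"mary", "jones"}}
right = {"r1": {"jon", "smith", "london"}, "r2": {"alice", "brown"}}
print(similarity_join(left, right, threshold=0.5))  # [('e1', 'r1')]
```

The threshold, similarity measure and blocking key are exactly the tuning knobs mentioned above: each choice trades precision against recall, and production entity-resolution pipelines search this parameter space systematically.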

Dynamics and Evolution of the Data Web. The management of evolving information in a decentralized setting introduces problems related to the archiving and preservation of interlinked information, temporal modelling & evolution management (change detection and propagation) as well as benchmarking techniques in this area. In our view, changes are discrete objects that have complex structure and retain their semantic and temporal characteristics, rather than being isolated low-level transformations on data.
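The view of changes as structured objects rather than isolated low-level edits can be sketched as follows. The triple datasets and the "update" change type are illustrative assumptions, not the group's actual change model.

```python
def detect_changes(old_triples, new_triples):
    """Compute a structured delta between two dataset versions,
    each a set of (subject, predicate, object) triples."""
    added = set(new_triples) - set(old_triples)
    removed = set(old_triples) - set(new_triples)
    updates = []
    # Lift pairs of low-level changes into a single change object:
    # a removal and an addition on the same subject and predicate
    # become one "update" that retains its semantic structure.
    for s, p, o_old in sorted(removed):
        for s2, p2, o_new in sorted(added):
            if (s, p) == (s2, p2):
                updates.append({"change": "update", "subject": s,
                                "predicate": p, "from": o_old, "to": o_new})
                removed.discard((s, p, o_old))
                added.discard((s2, p2, o_new))
                break
    return {"added": added, "removed": removed, "updates": updates}

v1 = {("alice", "worksAt", "IMSI"), ("alice", "title", "researcher")}
v2 = {("alice", "worksAt", "IMSI"), ("alice", "title", "senior researcher")}
print(detect_changes(v1, v2))
```

The point of the lifting step is that the resulting change objects carry temporal and semantic meaning ("alice's title changed"), which supports archiving, change propagation and querying over the history of the data.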

Geosocial networks. An increasingly large portion of data on the Web is associated with a spatial and/or temporal dimension. Also, spatial and temporal attributes are often inherently present in information generated by sensor networks and peer-to-peer systems. Location data and location-based services have a significant and widely recognized value in most, if not all, sectors of the data economy. Searching, integrating and mining geospatial data and time series is an active field of research with numerous new challenges.

Leveraging Social Data. The availability of online data through social networks, especially Twitter, gives rise to several disparate and challenging problems: (a) how to leverage social data for obtaining new knowledge (data journalism, public opinion trends, brand monitoring), (b) how to use knowledge graphs to create meaningful associations and recommendations between tweets and users, and (c) how to use diffusion patterns in Twitter to detect fake news.

User-centric Systems and Applications

Futuristic Data Interfaces. Data is considered the 21st century’s most valuable commodity. Analysts exploring data sets for insight, scientists looking for patterns, and consumers looking for information are just a few examples of user groups that need to access and dig into data. Despite technological advances in the data exploration and data management domains, existing systems fall behind in bridging the chasm between data and users, making data accessible and useful only to the few. A futuristic data interface would enable interaction with data using non-traditional paradigms, including natural language and visual means; it would understand the data as well as the user’s intent, guide the user, make suggestions, and altogether help the user leverage data for all sorts of purposes, from finding answers to questions to revealing patterns and finding solutions to problems, in a more natural way. Such systems require the synergy of several technologies and innovation on all these fronts, including natural language interfaces, data analytics, visualization, conversational AI, and data management.

Intelligent Interactive Data Exploration. The group is pushing the data exploration frontier, working on mixed-initiative data exploration tools that help users quickly discover data parts or insights of interest. We consider this process a two-way communication in which: (a) an intelligent system discovers and recommends interesting data and insights, tailored to the user’s needs, and (b) the user interacts with the system, providing feedback that guides the exploration process. The research group’s goal is to study the challenges that arise from mixed-initiative exploration paradigms; to develop new interactive data exploration techniques that combine efficiency with effectiveness, novel recommendation methods in the context of data exploration, and methodologies for the evaluation of such systems; and to systematically evaluate algorithms and systems using different data and real-life use cases.

Fair and Ethical Algorithmic Systems. Algorithmic systems, driven by large amounts of data, are increasingly used in all aspects of society to assist people in forming opinions and taking decisions. For instance, search engines and recommender systems, among others, help us make all sorts of decisions, from selecting restaurants and books to choosing friends and careers. Other systems are used in school admissions, housing, pricing of goods and services, job applicant selection, and so forth. Such algorithmic systems offer enormous opportunities, but they also raise concerns regarding how fair they are. Hence, beyond the efficiency and effectiveness of systems, the group investigates models and methods for fairness in algorithmic systems. Fairness, explainability and transparency are different sides of the same problem: how to make systems trustworthy.
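As one concrete example of a fairness measure such systems can report, the sketch below computes the demographic-parity difference: the gap in positive-decision rates across groups defined by a sensitive attribute. The decision data are purely illustrative.

```python
def demographic_parity_difference(decisions, groups):
    """Gap between the highest and lowest positive-decision rate
    across groups: 0 means equal rates; larger values signal bias."""
    rates = {}
    for d, g in zip(decisions, groups):
        n_pos, n = rates.get(g, (0, 0))
        rates[g] = (n_pos + (d == 1), n + 1)
    group_rates = {g: n_pos / n for g, (n_pos, n) in rates.items()}
    return max(group_rates.values()) - min(group_rates.values())

# Toy binary decisions (e.g., admit / reject) with a sensitive
# attribute taking values "a" and "b" (hypothetical data).
decisions = [1, 1, 0, 1, 0, 0, 1, 0]
groups    = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(demographic_parity_difference(decisions, groups))  # 0.5
```

Demographic parity is only one of several competing fairness notions (equalized odds and calibration are others), and part of the research challenge is that these notions cannot, in general, all be satisfied at once.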

User-Driven Data Management. The group takes a holistic approach to user-centric systems and applications, building algorithms, systems, interfaces, and evaluation methodologies. To enable user-centric approaches, the group’s research often focuses on understanding data and queries and on learning how best to process user queries. It develops algorithms that leverage the best of both worlds, data management and deep learning, to build systems that learn from user queries and from data not only to process queries more efficiently, but also to understand user intention, adapt to users, and help users achieve their information goals more effectively.