Methods on retrieving sources and data from the Web, for supporting scientific innovation

Giorgos Giannopoulos
National Technical University of Athens, Greece
PhD Thesis

The thesis addresses re-ranking problems, including personalization, diversification, and hybrid search of entities on the Web. Specifically, we study and propose novel methods for re-ranking web search results by capturing the information needs of users or groups of users. We base our methods on ranking function training models, utilizing information extracted from users' search histories (clickstream data: queries, results and clicked results). Further, we propose methods for semi-automatic semantic annotation of documents using ontology classes, for hybrid document search (using keywords and ontology classes) and for personalization of keyword search on semantic (RDF) data. Moreover, we evaluate and propose heuristics and introduce criteria for diversification of user comments on social networks, as well as for diversification of keyword search on semantic, structured data. Finally, we propose a first-cut approach for re-ranking search results on biological entities whose names change over time. Next, we discuss each of the above methods in more detail.

Through the presented research, we implemented methods for more effective utilization of users' search histories through ranking function training. First, we proposed a method for enriching the information extracted from users' clickstream data (search history), for faster ranking function training. Next, we proposed and implemented methods for training multiple ranking functions, based either on search content or on user behavior. The novelty of these methods lies in gathering collaborative information from all users and grouping this information into clusters that represent diverse content or diverse search behavior. The final ranking of the results is obtained by combining the rankings produced by models trained on different clusters. Moreover, we studied the adaptation of the search result diversification problem to the scenario of diversifying user comments on news articles. We defined problem-specific diversification criteria and applied several heuristic diversification algorithms. To assess the effectiveness of the proposed methods, we defined problem-specific evaluation measures. Beyond that, we proposed a first-cut approach for diversifying keyword search results on semantic (RDF) data, utilizing the schema and structure characterizing the data and the properties interconnecting the data. Finally, we examined indexing schemes and ranking algorithms for entities whose naming changes through time, as is the case for certain categories of biological entities.
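The combination step mentioned above, where the final ranking merges the outputs of models trained on different clusters, can be illustrated with a minimal sketch. It is not the thesis's actual implementation: the linear scoring models, the per-cluster affinity weights, and the names `linear_score` and `combine_cluster_rankings` are all assumptions made for illustration, and a weighted sum of scores is only one plausible way to combine rankings.

```python
from typing import Dict, List, Sequence


def linear_score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Score one result with a linear ranking function (assumed model form)."""
    return sum(w * x for w, x in zip(weights, features))


def combine_cluster_rankings(
    cluster_models: Dict[str, Sequence[float]],
    cluster_affinity: Dict[str, float],
    results: Dict[str, Sequence[float]],
) -> List[str]:
    """Rank results by an affinity-weighted sum of per-cluster model scores.

    cluster_models:   hypothetical weight vector per cluster-specific model
    cluster_affinity: hypothetical weight expressing how strongly the current
                      query or user belongs to each cluster
    results:          feature vector per result identifier
    """
    combined = {}
    for doc_id, features in results.items():
        combined[doc_id] = sum(
            cluster_affinity[cluster] * linear_score(model, features)
            for cluster, model in cluster_models.items()
        )
    # Highest combined score first.
    return sorted(combined, key=lambda doc_id: combined[doc_id], reverse=True)


# Illustrative data only: two clusters, three results, two features each.
models = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
candidates = {"d1": [0.9, 0.1], "d2": [0.2, 0.9], "d3": [0.5, 0.5]}

ranking_mostly_a = combine_cluster_rankings(models, {"a": 0.8, "b": 0.2}, candidates)
ranking_mostly_b = combine_cluster_rankings(models, {"a": 0.1, "b": 0.9}, candidates)
```

In this sketch the same candidate results are ordered differently depending on which cluster the query or user is closest to, which is the intuition behind training separate ranking functions per cluster.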

The aforementioned methods were evaluated in several search scenarios and on diverse datasets, such as documents (web pages), user comments, semantic annotations on documents, and biological entities. The evaluation results showed that the above methods improved on the effectiveness of baseline methods for the specific research problems, leading to the publication of more than ten articles in international conferences, workshops and journals. Further, the work done in these areas gave rise to new, interesting problems, which are described in the individual publications and can be addressed in future work.