An Architecture for Selective Web Harvesting: The Use Case of Heritrix
1st International Workshop on Archiving Community Memories (ARCOMEM), in conjunction with iPRES2013
2013
Conference/Workshop
- Contact persons: Vassilis Plachouras, Yannis Stavrakas
Abstract.
In this paper we provide a brief overview of the crawling architecture of ARCOMEM and how it addresses the challenges arising in the context of selective web harvesting. We describe some of the main technologies developed to perform selective harvesting, focusing on a modified version of the open source crawler Heritrix, which we have adapted to fit into ARCOMEM's crawling architecture. The simulation experiments we have performed show that the proposed architecture is effective in a focused crawling setting.