An Architecture for Selective Web Harvesting: The Use Case of Heritrix

Vassilis Plachouras, Florent Carpentier, Julien Masanes, Thomas Risse, Pierre Senellart, Patrick Siehndel, Yannis Stavrakas
1st International Workshop on Archiving Community Memories (ARCOMEM), in conjunction with iPRES2013
Abstract. In this paper we provide a brief overview of the crawling architecture of ARCOMEM and how it addresses the challenges arising in the context of selective web harvesting. We describe some of the main technologies developed to perform selective harvesting, and we focus on a modified version of the open-source crawler Heritrix, which we have adapted to fit in ARCOMEM's crawling architecture. The simulation experiments we have performed show that the proposed architecture is effective in a focused crawling setting.