Semi-automatic Geocoding for Persistent Web pages

Charikleia Lontou
School of Electrical and Computer Engineering, NTUA
2008
Diploma Thesis
Abstract. This thesis presents a methodology for the semi-automatic geocoding of persistent Web pages, i.e., relating identifiers in text to geographic co-ordinates using a combined automatic and human-centered approach. Geoparsing and geocoding algorithms are applied successfully to identify phone numbers and addresses. However, when more generic geographic identifiers are involved, automatic algorithms produce a significant number of false positives (Venizelos as a person) and false negatives (Venizelos as the name of Athens International Airport). This thesis advocates human intervention to improve on automatic geocoding results and therefore develops a Web browser extension that allows for (i) the manual geocoding of text portions and (ii) the updating, including deletion, of automatically generated results. The proposed approach is especially helpful for persistent Web pages such as Wikipedia, i.e., pages that have a certain value to the community, are well cared for, and change rather slowly. For such pages, geocoding can become a regular part of Web page authoring. The geocoding of a Web page is stored in a database: the Web page is identified by its URL, the geocoded text portions by their position on the Web page and their respective co-ordinates, and the version of the Web page by the date of the geocoding. Further, the geocoding of a Web page is displayed as highlighted text and by means of a map, i.e., clicking on a text portion shows the respective position on a map. Google Maps was used for this task. In terms of technology, the developed system includes the automatic geocoding tool developed by Albert-David Angel, an Apache Web server and Tomcat servlet engine, as well as the browser extension, developed in JavaScript and Java for the interaction with the servlet. This work also includes a set of use cases that illustrate the applicability and usefulness of the overall approach.
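To make the stored record concrete, the following is a minimal Java sketch of how a page's geocoding could be represented: a URL, the date of the geocoding, and a list of text portions with their positions and co-ordinates. It is an illustration under stated assumptions, not the schema used in the thesis. The class name PageGeocodingSketch, the nested GeocodedSpan type, the use of character offsets to encode a position on the page, and the manual flag are all hypothetical.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical in-memory model of one geocoded Web page (not the thesis schema). */
public class PageGeocodingSketch {

    /** One geocoded text portion: its position on the page plus co-ordinates. */
    public static final class GeocodedSpan {
        public final int start;        // inclusive character offset (assumed position encoding)
        public final int end;          // exclusive character offset
        public final double latitude;
        public final double longitude;
        public final boolean manual;   // true if added by a person, false if produced automatically

        public GeocodedSpan(int start, int end, double latitude, double longitude, boolean manual) {
            if (start < 0 || end <= start) {
                throw new IllegalArgumentException("invalid span: " + start + ".." + end);
            }
            this.start = start;
            this.end = end;
            this.latitude = latitude;
            this.longitude = longitude;
            this.manual = manual;
        }

        public boolean contains(int offset) {
            return offset >= start && offset < end;
        }
    }

    private final String url;            // identifies the Web page
    private final LocalDate geocodedOn;  // date of the geocoding, i.e., the page version
    private final List<GeocodedSpan> spans = new ArrayList<>();

    public PageGeocodingSketch(String url, LocalDate geocodedOn) {
        this.url = url;
        this.geocodedOn = geocodedOn;
    }

    public String url() { return url; }

    public LocalDate geocodedOn() { return geocodedOn; }

    public int size() { return spans.size(); }

    /** Manual geocoding: attach co-ordinates to a text portion. */
    public void addSpan(GeocodedSpan span) {
        spans.add(span);
    }

    /** Deletion of a result, e.g. a false positive; returns true if something was removed. */
    public boolean removeSpanAt(int offset) {
        return spans.removeIf(s -> s.contains(offset));
    }

    /** Lookup for a click at the given offset, so a map can be centred on the result. */
    public Optional<GeocodedSpan> spanAt(int offset) {
        return spans.stream().filter(s -> s.contains(offset)).findFirst();
    }
}
```

In this sketch, a click on highlighted text would call spanAt with the clicked offset and pass the returned latitude and longitude to the map display. Removing a false positive, such as Venizelos used as a person's name, corresponds to removeSpanAt.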