ACM Trans. Spatial Algorithms and Systems 3(4): 12:1-12:31
Abstract. Data analytics has an ever-increasing impact on tackling various societal challenges. In this article, we investigate how data from several heterogeneous online sources can be used to discover insights and make predictions about the spatial distribution of crime in large urban environments. We address a series of research questions, following a purely data-driven approach and methodology. First, we examine how useful different types of data are for predicting crime levels, focusing especially on how prediction accuracy can be improved by combining data from multiple information sources. To that end, we not only investigate prediction accuracy across all individual areas studied, but also examine how these predictions affect the accuracy of identified crime hotspots. Then, we look into individual features, aiming to identify and quantify the most important factors. Finally, we drill down to different crime types, elaborating on how prediction accuracy and the importance of individual features vary across them. Our analysis involves six different datasets, from which more than 3,000 features are extracted, filtered, and used to learn models for predicting crime rates across 14 different crime categories. Our results indicate that combining data from multiple information sources can significantly improve prediction accuracy. They also highlight which features affect prediction accuracy the most, as well as the crime categories for which predictions are most accurate.