A MapReduce Based k-NN Joins Probabilistic Classifier

Georgios Chatzigeorgakidis, Sophia Karagiorgou, Spiros Athanasiou, and Spiros Skiadopoulos
2015 IEEE International Conference on Big Data (IEEE BigData 2015)
Abstract. The field of water management has attracted great interest, given its potential to affect long-term well-being, the societal economy, and security. At the same time, it poses specific research challenges that have not yet been met, owing to the lack of fine-grained data. Knowledge extraction and decision making for efficient management in the energy field have attracted a lot of interest in Big Data research. However, the water domain is strikingly absent, with minimal focused work on data exploitation and useful information extraction. The goal of this work is to discover persistent and meaningful knowledge from water consumption data and to provide efficient and scalable big data management and analysis services. We propose a novel methodology that exploits machine learning techniques and introduces a robust probabilistic classifier able to operate on data of arbitrary dimensionality and huge volume. It also provides added-value services and new operation models for the water management domain, inducing sustainable behavioural changes in consumers, which can further raise social awareness. It does so through a new k-Nearest Neighbour based algorithm, developed in a parallel and distributed environment, which operates over Big Data and discovers useful knowledge about consumption classes and other water-related attitudinal properties. A detailed experimental evaluation assesses the effectiveness and efficiency of the algorithm in terms of prediction precision, along with the analytics it provides. The results show that the method is promising and yields accurate and interesting results that allow us to identify useful characteristics, not only for households but also for water utilities.
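
As a rough illustration of what a k-Nearest Neighbour probabilistic classifier computes, the sketch below estimates class probabilities as the fraction of the k nearest training points that carry each label. The abstract does not say how the paper derives its probabilities, so this voting rule is an assumption. The function name `knn_class_probabilities` and the toy consumption data are hypothetical. The code is a single-machine, brute-force version, not the MapReduce k-NN join the paper proposes.

```python
from collections import Counter
from math import dist  # Euclidean distance, available since Python 3.8


def knn_class_probabilities(train, query, k):
    """Return {label: fraction of the k nearest training points with that label}.

    Assumption: probabilities are plain neighbour-vote fractions; the paper
    may weight or normalise votes differently.
    """
    # train is a list of (feature_vector, label) pairs; scan all of them.
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    counts = Counter(label for _, label in neighbours)
    return {label: n / len(neighbours) for label, n in counts.items()}


# Hypothetical toy data: (two consumption features, consumption class).
train = [
    ((1.0, 1.2), "low"),
    ((0.9, 1.0), "low"),
    ((1.1, 0.8), "low"),
    ((5.0, 5.5), "high"),
    ((5.2, 4.9), "high"),
]

probs = knn_class_probabilities(train, (1.0, 1.0), k=3)
```

Feature vectors can have any number of dimensions, since `math.dist` only requires that both points have the same length. The full scan over `train` for each query is the costly step, and it is the part a distributed k-NN join would partition across workers.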