Data Reduction in the String Space for Efficient kNN Classification Through Space Partitioning

Open Access

12 May 2020

journal article
research article
Published by MDPI AG in Applied Sciences

Vol. 10 (10), 3356
https://doi.org/10.3390/app10103356

Abstract

Within the Pattern Recognition field, two representations are generally considered for encoding data: statistical codifications, which describe elements as feature vectors, and structural representations, which encode elements as high-level symbolic data structures such as strings, trees or graphs. While the vast majority of classifiers can handle statistical spaces, only a few methods are suitable for structural representations. The kNN classifier is one of the few algorithms capable of tackling both statistical and structural spaces. This method relies on computing the dissimilarity between the query and every sample of the set, which is the main reason for its high versatility but also for its low efficiency. Prototype Generation is one way of alleviating this issue. These mechanisms produce a reduced version of the initial dataset by performing data transformation and aggregation processes on the initial collection. Nevertheless, these generation processes depend heavily on the data representation considered and are generally not well defined for structural data. In this work we present the adaptation of the generation-based reduction algorithm Reduction through Homogeneous Clusters (RHC) to the case of string data. This algorithm performs the reduction by partitioning the space into class-homogeneous clusters and then generating a representative prototype as the median value of each group. Thus, the main issue to tackle is the retrieval of the median element of a set of strings. Our comprehensive experimentation comparatively assesses the performance of this algorithm in both the statistical and the string-based spaces. Results support the relevance of our approach by showing a competitive compromise between classification rate and data reduction.
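
The reduction idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions: dissimilarity is plain Levenshtein edit distance, the "median" of a group is approximated by its set median (the member with the smallest summed distance to the others), and each split uses a single nearest-seed assignment rather than an iterated clustering. The names `edit_distance`, `set_median`, `rhc_strings` and `knn_predict` are hypothetical.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def set_median(strings):
    """Assumption: use the set median, i.e. the member of `strings`
    whose summed edit distance to all members is smallest."""
    return min(strings, key=lambda s: sum(edit_distance(s, t) for t in strings))


def rhc_strings(samples):
    """Reduce (string, label) pairs to one prototype per class-homogeneous cluster."""
    prototypes = []
    queue = [list(samples)]
    while queue:
        cluster = queue.pop()
        by_class = defaultdict(list)
        for s, y in cluster:
            by_class[y].append(s)
        if len(by_class) == 1:
            # Homogeneous cluster: replace it with its median string.
            label = next(iter(by_class))
            prototypes.append((set_median(by_class[label]), label))
            continue
        # Heterogeneous cluster: one seed per class, then assign each
        # sample to its nearest seed (simplified single-pass split).
        seeds = [set_median(members) for members in by_class.values()]
        parts = [[] for _ in seeds]
        for s, y in cluster:
            nearest = min(range(len(seeds)), key=lambda i: edit_distance(s, seeds[i]))
            parts[nearest].append((s, y))
        non_empty = [p for p in parts if p]
        if len(non_empty) == 1:
            # Degenerate split (e.g. identical strings with different labels):
            # emit one prototype per class so the loop always terminates.
            for label, members in by_class.items():
                prototypes.append((set_median(members), label))
        else:
            queue.extend(non_empty)
    return prototypes


def knn_predict(query, prototypes, k=1):
    """Classify `query` by majority vote among its k nearest prototypes."""
    nearest = sorted(prototypes, key=lambda p: edit_distance(query, p[0]))[:k]
    votes = defaultdict(int)
    for _, label in nearest:
        votes[label] += 1
    return max(votes, key=votes.get)
```

Calling `rhc_strings` on a labelled list of strings returns the reduced prototype set, which `knn_predict` can then search instead of the full collection. The set median keeps the sketch short because it only needs pairwise distances; an approximate generalized median of strings would be a closer match to the median computation the abstract identifies as the main difficulty.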

Funding Information

Generalitat Valenciana (ACIF/2019/042)
Ministerio de Economía, Industria y Competitividad, Gobierno de España (TIN2017-86576-R)

