Data Reduction in the String Space for Efficient kNN Classification Through Space Partitioning
Open Access
- 12 May 2020
- journal article
- research article
- Published by MDPI AG in Applied Sciences
- Vol. 10 (10), 3356
- https://doi.org/10.3390/app10103356
Abstract
Within the Pattern Recognition field, two representations are generally considered for encoding data: statistical codifications, which describe elements as feature vectors, and structural representations, which encode elements as high-level symbolic data structures such as strings, trees or graphs. While the vast majority of classifiers can address statistical spaces, only a few methods are suitable for structural representations. The kNN classifier is one of the few algorithms capable of tackling both statistical and structural spaces. This method is based on computing the dissimilarity between each query and all the samples of the training set, which is the main reason for its high versatility but also for its low efficiency. Prototype Generation is one of the possibilities for palliating this issue. These mechanisms generate a reduced version of the initial dataset by performing data transformation and aggregation processes on the initial collection. Nevertheless, these generation processes depend heavily on the data representation considered and are generally not well defined for structural data. In this work we present the adaptation of the generation-based reduction algorithm Reduction through Homogeneous Clusters to the case of string data. This algorithm performs the reduction by partitioning the space into class-homogeneous clusters and then generating a representative prototype as the median value of each group. Thus, the main issue to tackle is the retrieval of the median element of a set of strings. Our comprehensive experimentation comparatively assesses the performance of this algorithm in both the statistical and the string-based spaces. Results show the relevance of our approach, which achieves a competitive compromise between classification rate and data reduction.
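To make the idea in the abstract concrete, the following is a minimal Python sketch of a cluster-based reduction on labelled strings followed by kNN classification over the resulting prototypes. It is not the authors' implementation and rests on several assumptions: the dissimilarity is the plain Levenshtein edit distance, the median of a cluster is approximated by the set median (the cluster member with the smallest summed distance to the others), and non-homogeneous clusters are split in a single pass by assigning each sample to the nearest per-class median. The names `edit_distance`, `set_median`, `rhc_strings` and `knn_predict` are hypothetical.

```python
from collections import defaultdict, deque


def edit_distance(a, b):
    """Levenshtein distance between two strings (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def set_median(strings):
    """Assumption: use the set median, i.e. the member of the set that
    minimises the summed edit distance to all members."""
    return min(strings, key=lambda s: sum(edit_distance(s, t) for t in strings))


def rhc_strings(samples):
    """Reduce a list of (string, label) pairs to one prototype per
    class-homogeneous cluster."""
    prototypes = []
    queue = deque([list(samples)])
    while queue:
        cluster = queue.popleft()
        by_class = defaultdict(list)
        for s, y in cluster:
            by_class[y].append(s)
        if len(by_class) == 1:
            # Homogeneous cluster: keep its median as the prototype.
            label = next(iter(by_class))
            prototypes.append((set_median(by_class[label]), label))
            continue
        # Non-homogeneous: one seed per class, then assign to nearest seed.
        seeds = [set_median(members) for members in by_class.values()]
        groups = [[] for _ in seeds]
        for s, y in cluster:
            nearest = min(range(len(seeds)),
                          key=lambda i: edit_distance(s, seeds[i]))
            groups[nearest].append((s, y))
        groups = [g for g in groups if g]
        if len(groups) < 2:
            # Guard: identical seeds would not split the cluster,
            # so fall back to splitting by class label.
            groups = [[(s, y) for s in members]
                      for y, members in by_class.items()]
        queue.extend(groups)
    return prototypes


def knn_predict(query, prototypes, k=1):
    """Majority vote among the k prototypes closest to the query."""
    nearest = sorted(prototypes, key=lambda p: edit_distance(query, p[0]))[:k]
    votes = defaultdict(int)
    for _, y in nearest:
        votes[y] += 1
    return max(votes, key=votes.get)


train = [("aaaa", "A"), ("aaab", "A"), ("aaba", "A"),
         ("zzzz", "B"), ("zzzy", "B")]
protos = rhc_strings(train)
print(protos, knn_predict("aaac", protos))
```

In this toy example the five training strings collapse to two prototypes, so a query needs two distance computations instead of five. A set median is only a coarse stand-in for a true median string, which need not belong to the set and is considerably more expensive to compute.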
Funding Information
- Generalitat Valenciana (ACIF/2019/042)
- Ministerio de Economía, Industria y Competitividad, Gobierno de España (TIN2017-86576-R)
This publication has 26 references indexed in Scilit:
- Improving kNN multi-label classification in Prototype Selection scenarios using class proposals. Pattern Recognition, 2015
- RHC: a non-parametric cluster-based data reduction for efficient k-NN classification. Pattern Analysis and Applications, 2014
- A new iterative algorithm for computing a quality approximate median of strings based on edit operations. Pattern Recognition Letters, 2014
- The dissimilarity space: Bridging structural and statistical pattern recognition. Pattern Recognition Letters, 2012
- Prototype reduction techniques: A comparison among different approaches. Expert Systems with Applications, 2011
- Towards the unification of structural and statistical pattern recognition. Pattern Recognition Letters, 2011
- Comparison of AESA and LAESA search algorithms using string and tree-edit-distances. Pattern Recognition Letters, 2003
- Online and off-line handwriting recognition: a comprehensive survey. Published by Institute of Electrical and Electronics Engineers (IEEE), 2000
- Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998
- A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1994