Diacritics restoration based on word n-grams for Slovak texts

Open Access

1 January 2021

journal article
research article
Published by Walter de Gruyter GmbH in Open Computer Science

Vol. 11 (1), 180-189
https://doi.org/10.1515/comp-2020-0143

Abstract

Despite rapid advances in technology, people still often write text without diacritics, for two main reasons. The first is historical: entering diacritics used to be troublesome, so people became accustomed to writing without them. The second is speed: typing without diacritics is usually faster. Such text is generally easy for people to understand, but in some types of documents missing diacritics can cause problems, and they also complicate automatic text processing. In this paper, we propose an algorithm based on word n-grams (contiguous sequences of n words) that restores diacritics in text written in Slovak. We also evaluate our results and compare them with other algorithms developed for Slovak text.
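
To illustrate the general idea of restoring diacritics with word n-grams, the following is a minimal Python sketch. It is not the algorithm proposed in the paper, whose details the abstract does not give. It assumes a simple greedy bigram (n = 2) model, and the function names (`strip_diacritics`, `train`, `restore`) and the two-sentence toy corpus are hypothetical and chosen only for this example.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text):
    """Remove combining marks, e.g. 'súd' -> 'sud'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def train(sentences):
    """Count unigrams and bigrams, and map each stripped form to its candidates."""
    unigrams, bigrams = Counter(), Counter()
    candidates = defaultdict(set)
    for sentence in sentences:
        words = unicodedata.normalize("NFC", sentence).lower().split()
        for i, word in enumerate(words):
            unigrams[word] += 1
            candidates[strip_diacritics(word)].add(word)
            if i > 0:
                bigrams[(words[i - 1], word)] += 1
    return unigrams, bigrams, candidates


def restore(text, unigrams, bigrams, candidates):
    """Greedy left-to-right restoration (an assumption of this sketch).

    For each word, choose the candidate seen most often after the previously
    restored word; ties are broken by unigram count. Unknown words are kept.
    """
    restored = []
    for word in text.lower().split():
        options = candidates.get(word, {word})
        prev = restored[-1] if restored else None
        best = max(sorted(options),
                   key=lambda c: (bigrams[(prev, c)], unigrams[c]))
        restored.append(best)
    return " ".join(restored)


# Toy corpus (hypothetical): 'súd' (court) and 'sud' (barrel) share the
# stripped form 'sud', so only the preceding word can disambiguate them.
corpus = ["okresný súd rozhodol", "drevený sud vína"]
model = train(corpus)
print(restore("okresny sud rozhodol", *model))
print(restore("dreveny sud vina", *model))
```

In this sketch the preceding word alone decides between "súd" and "sud". A practical system would need a large training corpus, longer n-grams with backoff or smoothing, and handling of capitalization and punctuation, none of which is shown here.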
