Diacritics restoration based on word n-grams for Slovak texts
Open Access
- 1 January 2021
- journal article
- research article
- Published by Walter de Gruyter GmbH in Open Computer Science
- Vol. 11 (1), 180-189
- https://doi.org/10.1515/comp-2020-0143
Abstract
Despite the modern boom in technology, people still often write texts without diacritics. There are two main reasons for this. The first is historical: in the past, typing diacritics was troublesome, so people wrote text without them. The second is speed: typing without diacritics is usually faster. Text without diacritics is easy for people to understand, but for some types of documents missing diacritics can cause problems, and they are also an issue when computers process such text. In this paper, we propose an algorithm based on word n-grams (contiguous sequences of n words) that restores diacritics in text written in the Slovak language. We also evaluate our results and compare them with other algorithms developed for Slovak text.
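The abstract does not describe the algorithm itself, so the following is only a minimal sketch of the general idea of word n-gram based restoration, not the authors' method. It assumes a bigram model (n = 2), a toy training corpus invented for illustration, and hypothetical names (`strip_diacritics`, `BigramRestorer`). Each input word is mapped to the diacritized forms seen in training, and the form that most often followed the previously restored word is chosen, with the unigram count as a tie-breaker.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text):
    """Remove combining marks, e.g. 'mačka' -> 'macka'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


class BigramRestorer:
    """Toy word-bigram restorer (illustrative assumption, not the paper's model)."""

    def __init__(self):
        self.unigrams = Counter()            # word -> count
        self.bigrams = Counter()             # (previous word, word) -> count
        self.candidates = defaultdict(set)   # stripped form -> diacritized forms

    def train(self, sentences):
        """Collect counts from sentences that already contain diacritics."""
        for sentence in sentences:
            previous = "<s>"  # sentence-start marker
            for word in sentence.lower().split():
                self.unigrams[word] += 1
                self.bigrams[(previous, word)] += 1
                self.candidates[strip_diacritics(word)].add(word)
                previous = word

    def restore(self, sentence):
        """Pick, word by word, the candidate best supported by the left context."""
        restored = []
        previous = "<s>"
        for word in sentence.lower().split():
            options = self.candidates.get(strip_diacritics(word))
            if not options:
                best = word  # unseen word: leave it unchanged
            else:
                # Prefer the highest bigram count, then the highest unigram count.
                best = max(
                    sorted(options),
                    key=lambda c: (self.bigrams[(previous, c)], self.unigrams[c]),
                )
            restored.append(best)
            previous = best
        return " ".join(restored)


# Toy corpus (invented for this example): "mala" (she had) vs "malá" (small).
model = BigramRestorer()
model.train(["ona mala psa", "malá mačka spí"])
print(model.restore("mala macka spi"))  # -> malá mačka spí
```

In this sketch the ambiguous form "mala" is resolved by context: after "ona" it stays "mala", while at the start of a sentence it becomes "malá". A realistic system would need a large corpus, longer n-grams with back-off, and handling of capitalization and punctuation, none of which is shown here.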