Detecting and Correcting Real-word Errors in Tamil Sentences

Ratnasingam Sakuntharaj, Sinnathamby Mahesan

Abstract


Spell checkers deal with two types of errors: non-word errors and real-word errors. Non-word errors fall into two categories: either the word itself is invalid, or the word is valid but not present in the lexicon. A real-word error is a word that is valid but inappropriate in the context of the sentence. This paper proposes an approach to correcting real-word errors in Tamil. A bigram probability model, constructed from a 3 GB corpus of Tamil text, is used to determine whether a valid word is appropriate in the context of the sentence. If it is not, the word is marked as a real-word error, the minimum edit distance technique is used to find lexically similar words, and the appropriateness of each such word is measured by a word-level n-gram language model. A hash table with word length as the key speeds up the search for lexically similar words: only words of lengths m-1 to m+1 are considered, where m is the length of the word found to be 'inappropriate'. Test results show that the suggestions generated by the system are more than 98% accurate, as judged by a scholar of Tamil.
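The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `BigramChecker` and `edit_distance`, the toy English corpus, the zero-probability threshold, the edit-distance limit of 1, and the choice to use the top suggestion as context for the following word are all assumptions made for illustration. Word length is measured with Python's `len`, which counts Unicode code points and may differ from the notion of length the paper uses for Tamil script.

```python
from collections import Counter, defaultdict


def edit_distance(a, b):
    """Minimum edit distance (insert, delete, substitute) by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


class BigramChecker:
    """Hypothetical helper: bigram counts plus a vocabulary keyed by word length."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.by_length = defaultdict(set)  # hash table: word length -> words
        for sentence in sentences:
            words = sentence.split()
            padded = ["<s>"] + words + ["</s>"]
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded, padded[1:]))
            for w in words:
                self.by_length[len(w)].add(w)

    def prob(self, prev, word):
        """Unsmoothed maximum-likelihood estimate of P(word | prev)."""
        total = self.unigrams[prev]
        return self.bigrams[(prev, word)] / total if total else 0.0

    def candidates(self, word, max_dist=1):
        """Lexically similar words, searched only in buckets m-1, m and m+1."""
        m = len(word)
        pool = set().union(*(self.by_length.get(n, set())
                             for n in (m - 1, m, m + 1)))
        return {c for c in pool
                if c != word and edit_distance(word, c) <= max_dist}

    def check(self, sentence, threshold=0.0):
        """Return (word, suggestions) for each word judged inappropriate."""
        report = []
        prev = "<s>"
        for word in sentence.split():
            if self.prob(prev, word) <= threshold:
                # Rank similar words by bigram probability in this context.
                ranked = sorted(self.candidates(word),
                                key=lambda c: (-self.prob(prev, c), c))
                suggestions = [c for c in ranked
                               if self.prob(prev, c) > threshold]
                if suggestions:
                    report.append((word, suggestions))
                    word = suggestions[0]  # assumption: continue with top suggestion
            prev = word
        return report


# Toy English corpus standing in for the Tamil corpus (illustration only).
corpus = [
    "she is taller than him",
    "he is older than me",
    "we ate and then we left",
]
checker = BigramChecker(corpus)
print(checker.check("she is taller then him"))  # [('then', ['than'])]
```

With an unsmoothed model and a threshold of zero, every unseen bigram is flagged, so a sketch like this only behaves sensibly when the corpus covers the word pairs of interest; a real system would need a large corpus and a more carefully chosen threshold or smoothing scheme.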

Creative Commons Licence
Ruhuna Journal of Science by University of Ruhuna is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

eISSN: 2536-8400

Print ISSN: 1800-279X (Before 2014)