Detecting Errors in POS-Tagged Corpus on XGBoost and Cross Validation


KIPS Transactions on Software and Data Engineering, Vol. 9, No. 7, pp. 221-228, Jul. 2020
https://doi.org/10.3745/KTSDE.2020.9.7.221
Keywords: Error Detection, POS-tagged Corpus, XGBoost, Cross-Validation
Abstract

A part-of-speech (POS) tagged corpus is a collection of electronic text in which each word is annotated with its POS tag; such corpora are widely used as training data for natural language processing. Training data are generally assumed to be free of errors, but in reality they contain various types of errors, which degrade the performance of systems trained on them. To alleviate this problem, we propose a novel method for detecting errors in an existing POS-tagged corpus using an XGBoost classifier and cross-validation as the evaluation technique. We first train a POS-tagging classifier on the POS-tagged corpus, which contains some errors, and then apply it to the same corpus through cross-validation. The classifier cannot detect errors directly, however, because no training data labeled with POS-tagging errors exist. We therefore detect errors by comparing the classifier's outputs (POS probabilities) while adjusting hyperparameters. The hyperparameters are estimated from a small-scale error-tagged corpus, whose text is sampled from the POS-tagged corpus and in which experts have marked the POS errors. We use recall and precision, which are widely used in information retrieval, as evaluation metrics. Because not all detected errors can be checked, we show that the proposed method is valid by comparing the distributions of the sample (the error-tagged corpus) and the population (the POS-tagged corpus). In the near future, we will apply the proposed method to a dependency tree-tagged corpus and a semantic role-tagged corpus.
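The abstract does not spell out the decision rule, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that a token is flagged when the probability the classifier assigns to its annotated tag falls below a threshold, and it swaps in a simple word-tag frequency model for XGBoost so that the example runs with the standard library alone. The names `train`, `tag_probability`, `detect_errors`, `precision_recall`, and the parameters `k` and `threshold` are all hypothetical.

```python
from collections import Counter, defaultdict


def train(tokens):
    """Count how often each word carries each tag (a stand-in for XGBoost)."""
    counts = defaultdict(Counter)
    for word, tag in tokens:
        counts[word][tag] += 1
    return counts


def tag_probability(model, word, tag):
    """Estimate P(tag | word); unseen words get 1.0 so they are never flagged."""
    seen = model.get(word)
    if not seen:
        return 1.0
    return seen[tag] / sum(seen.values())


def detect_errors(corpus, k=5, threshold=0.3):
    """Flag tokens whose annotated tag looks improbable under k-fold cross-validation.

    Assumed rule: a token is suspicious when the model trained on the other
    folds gives its annotated tag a probability below `threshold`.
    """
    flagged = []
    for fold in range(k):
        held_out = [i for i in range(len(corpus)) if i % k == fold]
        training = [corpus[i] for i in range(len(corpus)) if i % k != fold]
        model = train(training)
        for i in held_out:
            word, tag = corpus[i]
            if tag_probability(model, word, tag) < threshold:
                flagged.append(i)
    return sorted(flagged)


def precision_recall(flagged, gold_errors):
    """Score flagged indices against expert-marked error indices."""
    flagged, gold_errors = set(flagged), set(gold_errors)
    hits = len(flagged & gold_errors)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(gold_errors) if gold_errors else 0.0
    return precision, recall


# Toy corpus with one injected tagging error at index 4 ("dog" tagged VB).
corpus = [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")] * 10
corpus[4] = ("dog", "VB")

flagged = detect_errors(corpus, k=5, threshold=0.3)
precision, recall = precision_recall(flagged, gold_errors=[4])
print(flagged, precision, recall)
```

In this sketch the threshold plays the role of the hyperparameter that would be tuned on the small expert-annotated sample, for example by sweeping it and keeping the value with the preferred precision and recall trade-off.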



Cite this article
[IEEE Style]
M. Choi, C. Kim, H. Park, M. Cheon, H. Yoon, Y. Namgoong, J. Kim, J. Kim, "Detecting Errors in POS-Tagged Corpus on XGBoost and Cross Validation," KIPS Transactions on Software and Data Engineering, vol. 9, no. 7, pp. 221-228, 2020. DOI: https://doi.org/10.3745/KTSDE.2020.9.7.221.

[ACM Style]
Min-Seok Choi, Chang-Hyun Kim, Ho-Min Park, Min-Ah Cheon, Ho Yoon, Young Namgoong, Jae-Kyun Kim, and Jae-Hoon Kim. 2020. Detecting Errors in POS-Tagged Corpus on XGBoost and Cross Validation. KIPS Transactions on Software and Data Engineering, 9, 7, (2020), 221-228. DOI: https://doi.org/10.3745/KTSDE.2020.9.7.221.