PlEWiC - error corpus for Polish

PlEWiC (Polish Language Errors from Wikipedia) was created by automatically extracting language errors from the edit history of the Polish Wikipedia. The method is described in:

  • Roman Grundkiewicz, Automatic Extraction of Polish Language Errors from Text Edition History, Proceedings of the 16th International Conference on Text, Speech and Dialogue TSD 2013, Springer, LNCS, pages 129--136, Czech Republic, September 2013
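
The general idea of mining errors from an edit history can be illustrated with a minimal sketch. This is not the method from the paper above, only a naive word-level comparison of two revisions of one sentence using Python's standard difflib module. The function name extract_edits and the example sentence are assumptions made for illustration and do not come from the corpus.

```python
import difflib


def extract_edits(old_sentence, new_sentence):
    """Return (before, after) pairs for word spans replaced between two revisions.

    Hypothetical helper for illustration only; it is not part of PlEWiC.
    """
    old_words = old_sentence.split()
    new_words = new_sentence.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep substitutions only; pure insertions and deletions are skipped.
        if tag == "replace":
            before = " ".join(old_words[i1:i2])
            after = " ".join(new_words[j1:j2])
            edits.append((before, after))
    return edits


if __name__ == "__main__":
    # Made-up pair of revisions with a common Polish spelling error.
    print(extract_edits("On wogóle nie przyszedł", "On w ogóle nie przyszedł"))
```

A real extraction pipeline would also need to tell genuine corrections apart from content changes and vandalism, which this sketch does not attempt.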

The corpus contains over 1.53 million sentences and about 1.71 million naturally occurring language error examples. A presentation describing the corpus is available at:

Sample

The first version of PlEWiC is publicly available in YAML format:

Scripts: