List of text corpora

Text corpora (singular: text corpus) are large, structured sets of texts that have been systematically collected. They are used by AI developers to train large language models, and by corpus linguists and researchers in other branches of linguistics for statistical analysis, hypothesis testing, finding patterns of language use, investigating language change and variation, and teaching language proficiency.[1]

English language

European languages

Slavic

East Slavic

South Slavic

West Slavic

German

Middle Eastern languages

Turkic languages

Devanagari

East Asian languages

South Asian languages

African languages

Parallel corpora of diverse languages

  • Chinese/English Political Interpreting Corpus (CEPIC)[28][29] consists of transcripts of speeches delivered by top political figures from Hong Kong, Beijing, Washington DC and London, together with their translated or interpreted texts. Developed by Jun Pan and HKBU Library.
  • Europarl Corpus - proceedings of the European Parliament from 1996 to 2012
  • EUR-Lex corpus - collection of all official languages of the European Union, created from the EUR-Lex database[30]
  • OPUS: Open source Parallel Corpus in many languages[31]
  • Tatoeba: a parallel corpus containing over 8.9 million sentences in multiple languages; 107 languages have more than 1,000 sentences each, and a further 81 languages have between 100 and 1,000 sentences each.[32]
  • NTU-Multilingual Corpus in 7 languages (ara, eng, ind, jpn, kor, mcn, vie)[33] (legacy repo)
  • SeedLing corpus - A Seed Corpus for the Human Language Project with 1000+ languages from various sources.[34]
  • GRALIS: parallel texts for various Slavic languages, compiled by the Institute for Slavic Languages at Graz University (Branko Tošović et al.)
  • The ACTRES Parallel Corpus (P-ACTRES 2.0) is a bidirectional English-Spanish corpus consisting of original texts in one language and their translation into the other. P-ACTRES 2.0 contains over 6 million words considering both directions together.[35]

Comparable corpora

L2 (English) corpora

  • Cambridge Learner Corpus[43]
  • Corpus of Academic Written and Spoken English (CAWSE),[44] a collection of Chinese students’ English language samples in academic settings. Freely downloadable online.  
  • English as a Lingua Franca in Academic Settings (ELFA),[45] an academic ELF corpus.[46][47]
  • International Corpus of Learner English (ICLE),[48] a corpus of learner written English.
  • Louvain International Database of Spoken English Interlanguage (LINDSEI),[49] a corpus of learner spoken English.
  • Trinity Lancaster Corpus, one of the largest corpora of L2 spoken English.[50][51]
  • University of Pittsburgh English Language Institute Corpus (PELIC)[52]
  • Vienna-Oxford International Corpus of English (VOICE),[53] an ELF corpus.[46]

References

  1. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  2. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  3. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  4. ^ Professor Mark Davies at BYU created an online tool to search Google's English language corpus, drawn from Google Books, at http://googlebooks.byu.edu/x.asp.
  5. ^ A search engine for the Google Books Ngram Corpus that supports wildcard queries and offers an API.
  6. ^ [1],Basque corpora
  7. ^ (in Spanish) Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  8. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  9. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  10. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  11. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  12. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  13. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  14. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  15. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  16. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  17. ^ Hadi Veisi, Mohammad MohammadAmini, Hawre Hosseini; Toward Kurdish language processing: Experiments in collecting and processing the AsoSoft text corpus, Digital Scholarship in the Humanities, fqy074, https://doi.org/10.1093/llc/fqy074
  18. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  19. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  20. ^ D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun, C. Wimalasuriya, N. de Silva, and G. Dias. 2015. Implementing a Corpus for Sinhala Language. In Symposium on Language Technology for South Asia.
  21. ^ Glossa (uio.no)
  22. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  23. ^ https://arxiv.org/pdf/2102.06991.pdf, https://wortschatz.uni-leipzig.de/en/download/Hausa
  24. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  25. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  26. ^ https://www.researchgate.net/publication/336274457_Digital_Yoruba_Corpus, https://www.sketchengine.eu/corpora-and-languages/yoruba-text-corpora/
  27. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  28. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  29. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  30. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  31. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  32. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  33. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  34. ^ Guy Emerson, Liling Tan, Susanne Fertmann, Alexis Palmer and Michaela Regneri. 2014. SeedLing: Building and using a seed corpus for the Human Language Project. In Proceedings of the use of Computational methods in the study of Endangered Languages (ComputEL) Workshop. Baltimore, USA.
  35. ^ H. Sanjurjo-González and M. Izquierdo. 2019. P-ACTRES 2.0: A parallel corpus for cross-linguistic research. In Parallel Corpora for Contrastive and Translation Studies: New resources and applications (pp. 215-231). John Benjamins Publishing.
  36. ^ Liling Tan, Marcos Zampieri, Nikola Ljubešic, and Jörg Tiedemann. Merging comparable data sources for the discrimination of similar languages: The DSL corpus collection. In Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC). 2014.
  37. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  38. ^ Belinkov, Y., Habash, N., Kilgarriff, A., Ordan, N., Roth, R., & Suchomel, V. (2013). arTenTen: a new, vast corpus for Arabic. Proceedings of WACL.
  39. ^ Kilgarriff, A., & Renau, I. (2013). esTenTen, a vast web corpus of Peninsular and American Spanish. Procedia - Social and Behavioral Sciences, 95, 12-19.
  40. ^ Хохлова, М. В. (2016). Обзор больших русскоязычных корпусов текстов. In Материалы научной конференции" Интернет и современное общество" (pp. 74-77).
  41. ^ Khokhlova, M. (2016). Comparison of High-Frequency Nouns from the Perspective of Large Corpora. RASLAN 2016 Recent Advances in Slavonic Natural Language Processing, 9.
  42. ^ Trampuš, M., & Novak, B. (2012, October). Internals of an aggregated web news feed. In Proceedings of the Fifteenth International Information Science Conference IS SiKDD 2012 (pp. 431-434)
  43. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  44. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  45. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  46. ^ a b Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  47. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  48. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  49. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  50. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  51. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).
  52. ^ Juffs, A., Han, N-R., & Naismith, B. (2020). The University of Pittsburgh English Language Corpus (PELIC) [Data set].
  53. ^ Lua error in Module:Citation/CS1/Configuration at line 2172: attempt to index field '?' (a nil value).

See also
