Tatoeba
| Type of site | Online parallel corpora |
|---|---|
| Available in | 57 interface languages; content in 426 languages (February 2025) |
| Country of origin | France |
| Owner | Association Tatoeba |
| Founder | Trang Ho |
| Key people | Allan Simon |
| URL | tatoeba |
| Commercial | No |
| Registration | Optional |
| Launched | 2006 |
| Current status | Online |
| Content license | CC BY (some sentences under CC0), audio varies |
Tatoeba is a free collection of example sentences with translations geared towards foreign language learners. It is available in more than 400 languages. Its name comes from the Japanese phrase tatoeba (例えば), meaning 'for example'. It is written and maintained by a community of volunteers through a model of open collaboration. Individual contributors are known as "Tatoebans". It is run by Association Tatoeba, a French non-profit organization funded through donations.
History and development
In 2006, Trang Ho was frustrated that, unlike some of their Japanese counterparts, German bilingual dictionaries didn't feature full-text search of usage examples with translations.[1] This led her to imagine her ideal dictionary[2] and to build a prototype hosted on SourceForge under the name "multilangdict."[3] The main focus was already the crowdsourcing of translated sentences: "A Wikipedia type of thing, except people add sentences, not articles."
Alongside her studies at the University of Technology of Compiègne, Trang Ho gradually improved her website with a few classmates. She rebuilt the project from scratch twice and rebranded it as Tatoeba. In September 2007, about 150,000 English-Japanese sentence pairs from the Tanaka Corpus — a public-domain compilation released in 2001 by Hyogo University professor Yasuhito Tanaka and maintained by Jim Breen and Paul Blay — were imported into the Tatoeba Corpus.[4] In December 2008, Trang Ho released the first version of the current codebase built around a more flexible data model.[5] The following month, the website moved to the tatoeba.org domain.[6]
Over the 2009-2010 academic year, Allan Simon, then a student at SUPINFO, became a core developer of Tatoeba. Together with Trang Ho and other young developers, he made Tatoeba more social by adding sentence lists, user profiles, private messaging, and a Facebook-inspired Wall. They also introduced significant features like sentence linking, tagging, and "translation of translation" search. In November 2010, Tatoeba passed the 600,000-sentence mark. Within a year, the number of sentences added daily had increased almost 50-fold.[7]
Between 2014 and 2016, a new team of developers formed around Trang Ho.[8] They mentored students at the Google Summer of Code 2014[9] and added features to improve corpus quality.
Over the 2018-2020 period, support from the Mozilla Foundation as part of the Common Voice project allowed Tatoeba to make its platform more open and user-friendly.[10][11]
Openness
| Year | Editors | ±% |
|---|---|---|
| 2010 | 1,399 | — |
| 2011 | 1,989 | +42.2% |
| 2012 | 2,322 | +16.7% |
| 2013 | 2,377 | +2.4% |
| 2014 | 2,248 | −5.4% |
| 2015 | 2,506 | +11.5% |
| 2016 | 2,085 | −16.8% |
| 2017 | 1,481 | −29.0% |
| 2018 | 1,583 | +6.9% |
| 2019 | 1,420 | −10.3% |
| 2020 | 1,735 | +22.2% |
| 2021 | 1,540 | −11.2% |
| 2022 | 1,377 | −10.6% |
| 2023 | 1,336 | −3.0% |
| 2024 | 1,211 | −9.4% |
| Source: Tatoeba contributions | | |
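
The ±% column above is the year-over-year relative change in the number of editors. As a minimal sketch of that arithmetic, the figures can be recomputed as follows. The function names are illustrative, and the data is only the first five rows of the table.

```python
from __future__ import annotations

# Editor counts copied from the first five rows of the table above.
editors = {2010: 1399, 2011: 1989, 2012: 2322, 2013: 2377, 2014: 2248}


def percent_change(previous: int, current: int) -> float:
    """Return the relative change from previous to current, in percent."""
    return (current - previous) / previous * 100


def yearly_changes(counts: dict[int, int]) -> dict[int, float]:
    """Map each year except the first to its change versus the prior year.

    Values are rounded to one decimal place, as in the table.
    """
    years = sorted(counts)
    return {
        year: round(percent_change(counts[prev], counts[year]), 1)
        for prev, year in zip(years, years[1:])
    }


changes = yearly_changes(editors)
for year, change in changes.items():
    print(f"{year}: {change:+.1f}%")
```

The first year has no predecessor, which is why the table shows no percentage for 2010.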
Use
Users can search for words to retrieve sentences that use them. Results can be filtered by language, number of words, tag, and other criteria.[12]
Each sentence is displayed next to its translations and "translations of translations". A comment section facilitates feedback and corrections.
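
The relationship between direct translations and "translations of translations" can be pictured as a walk over an undirected graph: direct translations are one link away from a sentence, and indirect ones are two links away. The sketch below assumes this reading. The sentence IDs, the link list, and the function names are hypothetical and do not reflect Tatoeba's actual data model or code.

```python
from collections import defaultdict

# Hypothetical sentences and links, invented for illustration only.
sentences = {
    1: ("eng", "For example."),
    2: ("jpn", "例えば。"),
    3: ("fra", "Par exemple."),
    4: ("deu", "Zum Beispiel."),
}
links = [(1, 2), (2, 3), (3, 4)]


def build_graph(pairs):
    """Build an undirected adjacency map from (id, id) link pairs."""
    graph = defaultdict(set)
    for a, b in pairs:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def translations(graph, sentence_id):
    """Return (direct, indirect) translation IDs for a sentence.

    Direct translations are linked to the sentence itself; indirect ones
    are linked to a direct translation but not to the sentence.
    """
    direct = set(graph.get(sentence_id, set()))
    indirect = set()
    for neighbour in direct:
        indirect |= graph.get(neighbour, set())
    indirect -= direct | {sentence_id}
    return direct, indirect


graph = build_graph(links)
direct, indirect = translations(graph, 1)
print("direct:", [sentences[i][1] for i in sorted(direct)])
print("indirect:", [sentences[i][1] for i in sorted(indirect)])
```

In this toy data the English sentence is linked only to the Japanese one, so the French sentence appears as a translation of a translation, while the German sentence is three links away and is not shown.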
Registered users can build downloadable lists of sentences, which can be private, public, or collaborative.
Contribution
Tatoebans are encouraged to contribute in their strongest language.[13] They can add original sentences and translate existing ones. They can proofread or comment on other users' sentences, and "adopt" sentences without an owner. Advanced contributors are also allowed to tag, link, and unlink sentences.
When the owner of a sentence does not respond to a correction request, only a corpus maintainer has the power to update or delete the sentence.
Governance
As founder of Tatoeba, Trang Ho was for many years the project's benevolent dictator for life (BDFL).
In 2011, she set up a nonprofit organization to oversee the project.
In 2022, she decided to step aside in favor of a small group of experienced Tatoebans.[14]
Languages
As of February 2025, the Tatoeba Corpus has over 12,600,000 sentences in 426 languages. 66 of these languages have 10,000 or more sentences. Over 1 million sentences have audio recordings.[15]
The sentences are interrelated within a graph that has more than 25,900,000 links. 276 language pairs have over 10,000 translated sentences.[16]
| Language | Number of links |
|---|---|
| English | |
| French | |
| Russian | |
| Esperanto | |
| German | |
| Spanish | |
| Italian | |
| Turkish | |
| Dutch | |
| Portuguese | |
| Japanese | |
| Hungarian | |
| Ukrainian | |
| Kabyle | |
| Hebrew | |
| Finnish | |
| Polish | |
| Mandarin Chinese | |
| Danish | |
| Swedish | |
Operation
Tatoeba received a grant from Mozilla Drumbeat in December 2010.[18][19]
Some work on the Tatoeba infrastructure was sponsored by Google Summer of Code, 2014 edition.[9]
Since 2014, Tatoeba has been supported by donations.[20]
In May 2018, Tatoeba received a $25,000 grant from the Mozilla Open Source Support (MOSS) program.[10]
In August 2019, it received a further $15,000 MOSS grant.[11]
Access to content
Licensing
By default, the sentences of the Tatoeba Corpus are published under a CC BY license,[21] freeing them for academic and other use. Users can also contribute sentences under CC0, though translations of those sentences currently cannot share the same license.[22]
Audio recordings of the sentences use the speaker's choice of license, such as CC BY, CC BY-SA, CC BY-NC, or no public license at all.[23]
Offline use
Visitors can download tab-delimited sentence pairs from the Tatoeba website, ready for import into Anki and similar spaced repetition software.[16]
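
As an illustration of how such tab-delimited data might be consumed outside Anki, the following sketch reads sentence pairs with Python's standard `csv` module. It assumes that the first two fields of each line hold a sentence and its translation; the actual column layout of the downloadable files may differ. The sample data and the `read_pairs` name are invented for this example.

```python
import csv
import io

# Invented sample in the assumed layout: sentence <TAB> translation.
SAMPLE = "For example.\tPar exemple.\nThank you.\tMerci.\n"


def read_pairs(stream):
    """Read (sentence, translation) tuples from a tab-delimited text stream.

    Lines with fewer than two fields are skipped. Quoting is disabled so
    that quotation marks inside sentences are kept as ordinary characters.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if len(row) < 2:
            continue
        pairs.append((row[0].strip(), row[1].strip()))
    return pairs


pairs = read_pairs(io.StringIO(SAMPLE))
for sentence, translation in pairs:
    print(f"{sentence} -> {translation}")
```

For a downloaded file, the same function could be given a handle opened with `open(path, encoding="utf-8", newline="")` instead of the in-memory sample.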
Software development tools
An unstable API is available for software developers.[24]
Related projects
Second-language acquisition
Tatoeba sentences can be used to build lexicographic references for language learners. The JMdict Japanese-English dictionary selects its example sentences from the Tatoeba Corpus.[25] OpenRussian is a free Russian dictionary built primarily from the content of Wiktionary and Tatoeba.[26] GoodExample tries to automatically extract a diverse set of high-quality example sentences from the English Tatoeba Corpus.[27]
Tatoeba datasets can power incidental learning experiences that blend the acquisition of a foreign language with the user's everyday activities like web browsing or book reading.[28][29] A team at MIT Media Lab used example sentences from Tatoeba in WordSense, a mixed reality platform that enables "serendipitous language learning in the wild."[30] More recently, Japanese researchers implemented a Tatoeba search feature in an integrated writing assistance environment.[31]
Although the sentences in the Tatoeba Corpus are not all authentic, they are sometimes used to build data-driven learning applications. BES (Basic English Sentence) Search is a non-commercial tool for finding beginner-level English sentences for use in teaching materials.[32] It has over 1 million sentences, most of them from Tatoeba.[33] Reverso uses Tatoeba parallel corpora in its commercial bilingual concordancer.[34]
Example sentences are also used as a base for exercises. Charles Kelly and Paul Raine, both EFL teachers in Japan, have developed language learning activities based on sentences curated from the Tatoeba Corpus.[35][36] Clozemaster is a language self-study program that generates gamified cloze tests from Tatoeba sentence pairs.[37] Some Anki users share flashcards that were created using Tatoeba.[38]
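
A cloze test hides one word of a sentence and asks the learner to supply it. The sketch below shows one simple way such an item could be derived from a sentence pair, using the translation as a hint. It is not Clozemaster's actual method; the `make_cloze` function and the sample pair are hypothetical.

```python
import re


def make_cloze(sentence, translation, target, blank="_____"):
    """Build a cloze item by blanking the first whole-word match of target.

    Returns a dict with the prompt, the hidden answer, and the translation
    as a hint, or None if the target word does not occur in the sentence.
    """
    pattern = re.compile(rf"\b{re.escape(target)}\b", flags=re.IGNORECASE)
    match = pattern.search(sentence)
    if match is None:
        return None
    prompt = sentence[:match.start()] + blank + sentence[match.end():]
    return {"prompt": prompt, "answer": match.group(0), "hint": translation}


item = make_cloze(
    "I read a book every week.",
    "Je lis un livre chaque semaine.",
    "book",
)
print(item["prompt"])
print("Hint:", item["hint"])
```

Choosing which word to hide is left to the caller here; a real application would need its own policy for that, for example based on word frequency.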
Regional or minority languages
Some digital language activists contribute to open collaborative projects like Tatoeba, Wikipedia, and Common Voice to promote their minority language in digital spaces.[39] Regional languages like Kabyle, Catalan, or Basque can register more than a hundred members on Tatoeba.[40]
Constructed languages
Selected content from Tatoeba in Esperanto is available in the multilingual DVD Esperanto Elektronike published by E@I.[41] As of November 2022, Esperanto is Tatoeba's fifth pivot language, with over 330,000 sentences translated into at least two languages.[16] Other constructed languages like Toki Pona, Interlingua, Klingon, Lojban, and Ido also have a significant footprint.[15]
Language technology
From 2008 to 2011, Francis Bond used the Tatoeba Corpus for his research on the Japanese language.[43][44]
Since 2013, Jörg Tiedemann has been spreading Tatoeba parallel corpora more widely in the machine translation community by sharing them on the OPUS repository and organizing the "Tatoeba Translation Challenge".[45][46] With the rise of deep learning, researchers increasingly use Tatoeba's data sets to train and evaluate their massively multilingual models in tasks like machine translation,[47] language identification,[48] semantic search,[49] and speech recognition.[50]
References
- [17] Tatoeba weekly exports
- [31] Masato Hagiwara, Takumi Ito, Tatsuki Kuribayashi, Jun Suzuki, and Kentaro Inui. 2019. TEASPN: Framework and Protocol for Integrated Writing Assistance Environments. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 229–234, Hong Kong, China. Association for Computational Linguistics.
- [33] Nishigaki, C., & Akasegawa, S. Secondary School Students: What We Can Do to Nurture Autonomous Corpus Users?
- [35] Kelly, Charles (2012). "タトエバ・プロジェクト・コーパスを使った www.ManyThings.org の語学学習教材" (PDF), 愛知工業大学研究報告 (47), 77-84.
- [43] Francis Bond, 栗林 孝行 [Takayuki Kuribayashi], 橋本 力 [Hashimoto Chikara] (2008) HPSGに基づくフリーな日本語ツリーバンクの構築 [A free Japanese Treebank based on HPSG]. In 14th Annual Meeting of The Association for Natural Language Processing, Tokyo.
- [44] Eric Nichols, Francis Bond, Darren Scott Appling and Yuji Matsumoto (2010) Paraphrasing Training Data for Statistical Machine Translation. Journal of Natural Language Processing, 17(3), pages 101–122.