Where No Corpus Has Gone Before

Much work in computational 👅 linguistics depends on digitized corpora for a given language, and machine translation is largely based on parallel corpora that are closely aligned between languages. A corpus is a compilation of written texts, usually produced by native speakers, that together contain a great many of the words and word forms used in a language. The best corpora (or corpuses) contain tens or hundreds of millions of words. A corpus can be monolingual, such as the Corpus of Contemporary American English, or can contain parallel translations of the same documents aligned sentence by sentence between languages, such as the proceedings of the European Parliament.

For a well-digitized language, a clever researcher can use a corpus to reveal many things about a language. For example, each word that appears in the corpus can be listed, ranked by frequency, and often analyzed as to its part of speech. This gives the lexicographer a launching point to know what words must appear in a dictionary.
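As a rough illustration of the frequency ranking described above, here is a minimal sketch in Python. The function name `rank_by_frequency` and the two-sentence toy corpus are assumptions made for the example, and splitting on `\w+` is a deliberately crude stand-in for real tokenization, which varies by language. Part-of-speech analysis is left out entirely.

```python
from collections import Counter
import re


def rank_by_frequency(texts):
    """Return (word, count) pairs, most frequent first."""
    counts = Counter()
    for text in texts:
        # Crude tokenization: lowercase, then keep runs of word characters.
        counts.update(re.findall(r"\w+", text.lower()))
    # Sort by descending count, then alphabetically so ties are stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy corpus; a real corpus would hold millions of words.
toy_corpus = [
    "The runner won the race.",
    "The race was long.",
]

ranking = rank_by_frequency(toy_corpus)
for word, count in ranking:
    print(word, count)
```

On a real corpus, the top of such a list is what tells the lexicographer which words cannot be missing from the dictionary.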

A parallel corpus makes it possible to see which terms in one language appear in exactly the same context as terms in another language, and are therefore probably translations of each other. That is, if "shoe" appears in 10 English sentences and "topanka" appears 10 times in the parallel Slovak versions, then the English term and the Slovak term probably both refer to the same idea of 👞. (Word nerds can have fun searching parallel corpora for 25 European languages, beautifully assembled by Linguee, to suss out translations for party terms like "an arm and a leg".)

However, parallel data tends to be based on topics of interest to those willing to pay for high-quality public translation, and is thus particularly useful for business and government affairs. No similar resource exists for many topics that matter to the general public. In sports, for example, even within the BBC, the English service and the Swahili service might report on the same marathon, but the articles are often divergent re-tellings of the event, not sentence-by-sentence reproductions of a single script. They would therefore be ineffective as parallel text if the two stories were automatically compared. Case in point: as a result of its training corpora, Google Translate generally thinks people run organizations, not races. Additionally, most published literature that has been professionally translated across languages remains under copyright, so a trove such as the Harry Potter series, in 64 languages, is not available for public use.
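The "shoe" and "topanka" reasoning in this passage can be sketched with a simple co-occurrence score over sentence-aligned pairs. This is only an illustration under stated assumptions: it uses the Dice coefficient, a basic association measure chosen for brevity, not the method of Linguee or of any particular translation system. The function names are hypothetical, and the tokens `t1` through `t6` are placeholders standing in for other words, not real Slovak.

```python
from collections import Counter
from itertools import product


def dice_scores(aligned_pairs):
    """Score every (source word, target word) pair by the Dice coefficient,
    counting each word at most once per sentence."""
    src_counts, tgt_counts, pair_counts = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in aligned_pairs:
        src_words = set(src_sentence.lower().split())
        tgt_words = set(tgt_sentence.lower().split())
        src_counts.update(src_words)
        tgt_counts.update(tgt_words)
        # Every source word co-occurs with every target word in the pair.
        pair_counts.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_counts[s] + tgt_counts[t])
        for (s, t), n in pair_counts.items()
    }


def best_candidate(scores, src_word):
    """Return the target word with the highest score for src_word, if any."""
    candidates = {t: v for (s, t), v in scores.items() if s == src_word}
    return max(candidates, key=candidates.get) if candidates else None


# Hypothetical aligned data; t1..t6 are placeholder tokens, not real words.
pairs = [
    ("the shoe is new", "topanka t1 t2"),
    ("she lost a shoe", "t3 topanka t4"),
    ("the race is long", "t5 t1 t6"),
]

scores = dice_scores(pairs)
print(best_candidate(scores, "shoe"))
```

The score is high only when two words appear in the same aligned sentences and rarely apart, which is why the approach breaks down for texts that are not aligned sentence by sentence.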

Learning from corpora is the focus of intense international study. Many of our partners do incredible work with corpora, and with Transtechno we have developed a game (which needs a sponsor to put online) to find high-quality usage examples from the Helsinki Corpus of Swahili. However, corpora run up against several limitations for documenting most languages or finding translations between them. It is difficult or impossible for corpora to:

The common thread among these problems is lack of data. Even within English, we can easily tell that "break" occurs many thousands of times, but we cannot know how many of these occurrences refer to a rest versus a fracture. Parallel texts hit their limits at the edges of professional translations - sources such as Wikipedia have similar texts in numerous languages, but the articles do not line up at a sentence level and thus do not provide reliable data. Natural speech could be brought in via transcriptions of field recordings, which is probably the best strategy for many minority and endangered languages, but doing so is extremely expensive and time-consuming, and thus unlikely to occur for most languages even if the audio is in the archives.

The biggest problem for corpus lexicography, though, is the thousands of languages for which no corpus exists. Corpora are luxuries enjoyed by languages with a lot of published documents, and especially those with strong official support from governments that have resources to invest in knowledge about their languages. Africa's 2,000 languages, and most of the thousands of indigenous languages of Asia, Australia, and the Americas, are basically left out in the cold. Essentially, the most significant lexicographic tool of the past half century is completely unavailable to most of the world's languages. Of course, building a corpus is not wizardry, and could be accomplished for any language in reasonable time with a scanner and good OCR, or a smartphone recorder and transcription software. Doing so, however, requires time and budget that the powers of the purse are not going to invest for the great bulk of languages.

Because corpus lexicography is a useless technique without corpora, Kamusi's main methods do not rely on their existence. Instead, we focus on gathering knowledge directly from the public, either as a supplement to existing data sources, or as our primary source of information. Finding the people to play our games and share their knowledge via our apps is no easy task, but, we propose, will ultimately be the most effective way of documenting languages where no corpus has gone before.

© Copyright ©

The Kamusi Project dictionaries and the Kamusi Project databases are intellectual property protected by international copyright law, ©2007 through ©2016, under the joint ownership of Kamusi Project International and Kamusi Project USA. Further explanation may be found on our © Copyright page.

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.
