Alignment

Kamusi harvests data from many sources, including existing dictionaries, open data sets, and members of the public. The fundamental problem in putting all this data together is that, until Kamusi, there has never been a way to affirm that a term in one source is equivalent to a term in another, unaffiliated source. Even within a single publisher such as Oxford Dictionaries, it would be difficult to equate the various meanings of l-i-g-h-t (not heavy, not dark, not serious, not fattening...) with the terms for those ideas in its English-Spanish and English-Arabic dictionaries, and so to work out which Spanish term expresses the same concept as which Arabic term. Kamusi resolves this problem by aligning concepts, not spellings.

When a computer looks at l-i-g-h-t, this is what it sees: 0110110001101001011001110110100001110100, the binary code for a string of five letters. Our Basque dataset matches that language's term "argitasun" with the sequence of digits you see above, and our Berber dataset does the same for "afessas". In fact, PanLex finds nearly 10,000 terms, from nearly 1600 language varieties, that match that sequence. Without further context, this spelling match is the closest we can get to forming connections among languages. This is why multilingual "translation" services such as Google Translate frequently give catastrophic results. Unlike a computer, a bilingual Filipino-English speaker who looks at Charles Nigg's 1904 Tagalog-English and English-Tagalog Dictionary can instantly tell which Tagalog term matched to l-i-g-h-t corresponds to which English sense. The person faces a different hurdle: how can someone turn their individual knowledge into actionable data that can be shared on digital systems for others to use?
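To see where that sequence of digits comes from, here is a minimal Python sketch. It assumes the spelling is stored as UTF-8 bytes (for plain English letters this is the same as ASCII). The function name `to_bits` is hypothetical, chosen for illustration, and is not part of any Kamusi system.

```python
def to_bits(text: str) -> str:
    """Return the UTF-8 bytes of `text` as one string of 0s and 1s."""
    # Each byte is written as exactly eight binary digits.
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))


if __name__ == "__main__":
    # The five letters of "light" become 5 bytes, i.e. 40 binary digits.
    print(to_bits("light"))
```

The output carries no information about which sense of l-i-g-h-t is meant, which is why a match on spelling alone cannot separate "not heavy" from "not dark".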

Kamusi has designed unique systems to match linguistic data (01100100011000010111010001100001) to language knowledge (what is in your head). From WordNet, we have a starter set of about 100,000 concepts defined in English, soon to rise to XXX by aligning to Wiktionary. We show our defined terms in DUCKS (Data Unified Conceptual Knowledge Sets), where players drag an unaligned term from their dataset to the definition that matches. In the Wiktionary version of DUCKS, a participant who is shown a Wiktionary sense of l-i-g-h-t can pick out the corresponding Kamusi sense and tie the two together. This serves three goals: first, to find missing senses; second, to provide alternative definitions in case the WordNet description is inadequate; and third, to bring in the translations to many other languages that Wiktionary volunteers have produced. For Filipino, players are shown one of the terms in their dataset that matches l-i-g-h-t, and they choose whether it means "not heavy", "not dark", "not serious", or "not fattening". When a critical mass of players reaches a consensus, we consider the alignment to be validated.
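The consensus step can be sketched roughly as follows. This is an illustration only: the text does not say how large a "critical mass" is or how strong the agreement must be, so the `min_votes` and `min_share` thresholds, and the function name `validated_sense`, are assumptions made for the example.

```python
from collections import Counter
from typing import Optional


def validated_sense(
    votes: list[str],
    min_votes: int = 5,      # assumed "critical mass" of players
    min_share: float = 0.8,  # assumed share of votes needed for consensus
) -> Optional[str]:
    """Return the agreed sense, or None if there is no consensus yet."""
    if len(votes) < min_votes:
        return None  # not enough players have answered
    sense, count = Counter(votes).most_common(1)[0]
    return sense if count / len(votes) >= min_share else None


if __name__ == "__main__":
    # Hypothetical player choices for one Filipino term matched to "light".
    choices = ["not heavy"] * 5 + ["not dark"]
    print(validated_sense(choices))
```

A term that does not reach consensus simply stays unaligned in this sketch, so it could be shown to more players.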

Because each version of DUCKS connects to the same core concept set, we can make high-probability second-generation connections among languages. We insist that our results show English, or whichever other language served as the pivot, so that we do not make uncertified truth claims. Even so, data alignment lets us assert with confidence that the Filipino term for "not heavy" is a likely match for the Vietnamese term for "not heavy" and for the term for the same idea in Amharic. (Aligned terms will advance to a game in which people verify proposed links between languages, but we certainly will not be able to find bilingual players for all 25 million language pairs.) When we have the financial resources to work with the 1.3 billion terms in PanLex, we will be able to align concepts across as many as 11,000 language varieties. By combining the computer's ability to process data with people's ability to understand it, our systems are geared to line up linguistic knowledge at the sense level across the world's languages.
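A rough sketch of how alignment to a shared concept set yields second-generation links is given below. The record layout, the concept identifiers, and the placeholder terms are all hypothetical, and the links it produces are candidates, not verified translations. The `language_pairs` helper shows the arithmetic behind a figure on the order of 25 million pairs, assuming roughly 7,100 languages; that count is an assumption and is not stated in the text.

```python
from collections import defaultdict
from itertools import combinations
from math import comb

# Hypothetical validated alignments: (language, term, concept id).
# The terms are placeholders, not real dictionary entries.
SAMPLE_ALIGNMENTS = [
    ("Filipino", "<Filipino term A>", "light/not-heavy"),
    ("Vietnamese", "<Vietnamese term A>", "light/not-heavy"),
    ("Amharic", "<Amharic term A>", "light/not-heavy"),
    ("Filipino", "<Filipino term B>", "light/not-dark"),
]


def candidate_links(alignments):
    """Pair up terms from different languages that share a concept."""
    by_concept = defaultdict(list)
    for language, term, concept in alignments:
        by_concept[concept].append((language, term))
    links = []
    for concept, entries in by_concept.items():
        for left, right in combinations(entries, 2):
            if left[0] != right[0]:  # skip pairs within one language
                links.append((concept, left, right))
    return links


def language_pairs(n_languages: int) -> int:
    """Number of unordered pairs among n languages."""
    return comb(n_languages, 2)


if __name__ == "__main__":
    for link in candidate_links(SAMPLE_ALIGNMENTS):
        print(link)
    print(language_pairs(7100))
```

Each link keeps its concept identifier, so a result can always show the pivot sense through which the two terms were connected.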

Copyright

The Kamusi Project dictionaries and the Kamusi Project databases are intellectual property protected by international copyright law, ©2007 through ©2016, under the joint ownership of Kamusi Project International and Kamusi Project USA. Further explanation may be found on our Copyright page.

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.
