What is a Word?

Does the photo show one mountain, or six? Does each peak on a mountain also have its own name? Where, exactly, is the bottom of the mountain? Lexicographers have similar problems framing the boundaries of words. What merits its own entry in a dictionary? What should be shown as additional information for an entry? What is so obvious, or cumbersome, that it should be left out? The conundrum has three faces.

First is meaning. Is swing, the dance music, a different word from swing, the playground equipment? At Kamusi, we say they are different things, so they get different entries. From a conceptual perspective, separating senses is straightforward, although it means that one spelling, such as t-a-k-e, can have dozens of different entries.
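One way to picture "one spelling, many entries" is a lookup table keyed by spelling that returns a list of separate sense entries. The sketch below is only an illustration under our own assumptions: the `Entry` and `Lexicon` names, the identifier format, and the definitions are hypothetical and do not describe Kamusi's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    entry_id: str    # hypothetical identifier, one per sense
    spelling: str    # the shared written form
    definition: str  # illustrative wording only


class Lexicon:
    """Maps one spelling to every sense entry that shares it."""

    def __init__(self):
        self._by_spelling = defaultdict(list)

    def add(self, entry):
        self._by_spelling[entry.spelling].append(entry)

    def lookup(self, spelling):
        # .get avoids creating an empty list for unknown spellings
        return list(self._by_spelling.get(spelling, []))


lexicon = Lexicon()
lexicon.add(Entry("swing-1", "swing", "a style of dance music"))
lexicon.add(Entry("swing-2", "swing", "a hanging seat on a playground"))

for entry in lexicon.lookup("swing"):
    print(entry.entry_id, "-", entry.definition)
```

Each sense keeps its own identifier, so information attached to one entry (a plural form, a translation) never leaks onto another sense that happens to be spelled the same way.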

For cataloguing, shape is the more difficult face. We can define a sense of big as "being of a substantial size" - but do we include bigger and biggest as forms of that idea within that entry, or are they separate words? A typical English verb has at most five forms, which we can easily list in an entry, hoping that all our readers will know what we mean by past (took), third person present (takes), past participle (taken), and present participle (taking) alongside the base form (take). Listing such inflections is much harder for a language like French, with 96 verb forms, many of which actually require conjugating a second verb and contracting with part of a pronoun ("I have read" = j'ai lu [je + avoir + lire]). Should each sense entry list all the possible forms, or can it refer to a common table that may not apply in all cases (e.g., swings for children have a plural form, but swing music does not)?

And then there are the languages that compound and agglutinate. German can bind several independent words together to make a brand new compound word that everyone understands, sometimes with letters changing internally according to various rules - think of the way English creates words like supermarket, but do that on the fly, with any idea you can make up, such as Autobahnmarkierungsentfernungskomitee (a committee that is responsible for removing the road marks on highways). Hundreds of African languages can pack an entire sentence into a single unit, such as "We are squeezing each other" in Swahili: tunabanana (tu=we + na=now + bana=squeeze + na=each other). Each Kinyarwanda verb can take 900,000,000 different forms. Obviously, a dictionary cannot list nearly a billion forms within an entry. Our strategy is computational: to find the rules that people use to agglutinate words in their language, and to build parsers that locate the entries for their component parts.
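To make the parsing idea concrete, here is a toy segmenter for the Swahili example above. It is a minimal sketch, not Kamusi's parser: the four tables contain only the pieces glossed in the example, and the fixed slot order (subject, tense, root, optional extension) is a simplifying assumption made for illustration.

```python
# Toy morpheme tables, taken only from the tunabanana example.
# A real parser would need far larger tables and many more rules.
SUBJECTS = {"tu": "we"}
TENSES = {"na": "now"}
ROOTS = {"bana": "squeeze"}
EXTENSIONS = {"na": "each other"}


def parse(form):
    """Return every segmentation of `form` as a list of (piece, gloss) pairs.

    Assumed slot order: subject + tense + root + optional extension.
    """
    results = []
    for subj, subj_gloss in SUBJECTS.items():
        if not form.startswith(subj):
            continue
        rest = form[len(subj):]
        for tense, tense_gloss in TENSES.items():
            if not rest.startswith(tense):
                continue
            stem = rest[len(tense):]
            for root, root_gloss in ROOTS.items():
                if not stem.startswith(root):
                    continue
                tail = stem[len(root):]
                pieces = [(subj, subj_gloss), (tense, tense_gloss), (root, root_gloss)]
                if tail == "":
                    results.append(pieces)
                elif tail in EXTENSIONS:
                    results.append(pieces + [(tail, EXTENSIONS[tail])])
    return results


for pieces in parse("tunabanana"):
    print(" + ".join(f"{piece}={gloss}" for piece, gloss in pieces))
```

The point of the design is that the dictionary stores only the component pieces; the rules reconstruct which entries a surface form points to, so the forms themselves never have to be listed.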

The third face is the most difficult for computer processing: multiword expressions. We call these party terms because they are composed of items that dance together. That is, the words in the expression each have their own meaning, but together they form a new idea - a head case is a disturbed person, for example, and a northern right whale dolphin is a type of dolphin that is neither a compass direction nor correct nor a whale. In Vietnamese, every syllable is separated by a space, so every multisyllabic word looks like a party term. Are party terms "words"? Kamusi treats a party term as an entry if its meaning cannot be gleaned from its parts. Party terms present additional challenges because they can be separated, can change shape, and can have multiple meanings. Fortunately, our architecture can handle all of that. Using GOLDbox, a variety of games, and other techniques, we can build from the basic canonical lemmas, which are the usual initial dataset or first-round user contributions for a language, toward a full set of forms associated with each concept.
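As a rough illustration of why party terms are hard, the sketch below finds only the easiest case: a known expression appearing as an unbroken run of words. The `PARTY_TERMS` table and the `find_party_terms` function are hypothetical names for this example, and the approach (longest match first over whitespace-separated tokens) is a common baseline, not a description of Kamusi's architecture. It does not handle punctuation, nor the separated, reshaped, or ambiguous terms mentioned above.

```python
# Hypothetical table of known party terms, keyed by their word sequence.
PARTY_TERMS = {
    ("head", "case"): "a disturbed person",
    ("northern", "right", "whale", "dolphin"): "a type of dolphin",
}
MAX_LEN = max(len(words) for words in PARTY_TERMS)


def find_party_terms(text):
    """Return known party terms found as contiguous word runs in `text`."""
    tokens = text.lower().split()
    found = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest possible run first, down to two words.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + n])
            if candidate in PARTY_TERMS:
                match = candidate
                break
        if match:
            found.append(" ".join(match))
            i += len(match)  # skip past the matched expression
        else:
            i += 1
    return found


print(find_party_terms("We saw a northern right whale dolphin"))
```

Trying the longest run first keeps "northern right whale dolphin" from being broken into shorter pieces that might also be in the table.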

Just don't ask us to tell you what exactly we mean by the word "word".

Copyright

The Kamusi Project dictionaries and the Kamusi Project databases are intellectual property protected by international copyright law, ©2007 through ©2016, under the joint ownership of Kamusi Project International and Kamusi Project USA. Further explanation may be found on our Copyright page.

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.
