Uicleir:Cànain - Wiktionary:Languages

Shortcut:
WT:LANG
WT:LANGCODE
WT:LANGNAME

For a list of all language codes, see Wiktionary:List of languages.

For information on how to add or remove a language from Wiktionary, see Wiktionary:Guide to adding and removing languages.

Wiktionary includes many words in many languages. This page details the conventions and practices relating to the variety of languages on Wiktionary.

Criteria for inclusion

Language information

To distinguish languages, Wiktionary gives each a unique name and a unique code, which identify it. Other information is also collected.

Language names

Wiktionary calls each language it includes by a distinct name. This name is used in headers, translation tables, categories, appendices, and some other places. Most languages only have one name, but some may be known by multiple names. In this case, one of the language's names is chosen for use in Wiktionary. This name is referred to as the canonical name of the language. Canonical language names are chosen by consensus. Whenever possible, common English names of languages are used, and diacritics are avoided. Attested names (names which meet CFI) are strongly preferred.

Canonical names must be unique, meaning that a name must refer to at most one language. When two or more languages are commonly known by the same name, Wiktionary distinguishes them by choosing different canonical names for each one, using a variety of means:

In many cases, the languages are also known by other names. One of those other names is then chosen so that it is unique. For example, the language of the Pyu city-states, though called "Pyu" by some scholars, is called "Tircul" (code: pyx) on Wiktionary, to distinguish it from the language of Papua New Guinea which is called "Pyu" (code: pby).
Alternative spellings of the same name can also be used to distinguish languages with otherwise identical names. For example, the Riang language of India and Bangladesh (code: ria) goes by the name "Reang" on Wiktionary, to distinguish it from the "Riang" of Burma/Myanmar (code: ril).
If languages cannot be distinguished by alternative names, the place where each language is spoken is appended in parentheses after its name, as in the case of "Buli (Ghana)" (code: bwu) and "Buli (Indonesia)" (code: bzq).
If languages go by the same name and are spoken in the same place, they can be disambiguated by their linguistic families. For example, "Austronesian Mor" (code: mhz) and "Papuan Mor" (code: moq) are both spoken in Indonesia.

Language codes

Each language on Wiktionary also has a unique code assigned to it, usually consisting of two or three letters. This code is used to identify languages when including templates in entries. Language names are not used in this case because they are longer and less precise, as the above section illustrates. Topical categories also use the language code as part of their names.

Wiktionary chooses codes for languages as follows, in order of priority:

If the language has a two-letter code in the ISO 639-1 standard, then that code is used. Wikipedia has a list of ISO 639-1 codes.
1. A few languages are represented on Wiktionary by 639-1 codes the ISO has deprecated. This is generally the case when the ISO has come to consider a lect a group of languages, but Wiktionary still considers it a single language. Serbo-Croatian, for example, is represented by sh.
If the language has a three-letter code in the ISO 639-3 standard, then that code is used. Wikipedia has a list of ISO 639-3 codes.
If the language has a three-letter code in the ISO 639-2 standard, then that code is used. This is quite rare. An example is Nahuatl, which is represented by the ISO 639-2 code nah.
Otherwise, a nonstandard or "exceptional" code is used. Exceptional codes are chosen as follows:
1. A few codes have been devised by the Wikimedia Foundation Language Committee for languages which have not been assigned codes by any ISO standard, but which have Wikimedia projects; these WMF codes form the subdomain part of the URL of the language's wiki projects. For example, Zamboanga Chavacano is represented on Wiktionary by cbk-zam, the code used in the URL for the Zamboanga Wikipedia, cbk-zam.wikipedia.org. Wiktionary has a list of such codes.
2. Any language which does not have an ISO or specially-devised Wikimedia code, but which is to be included in Wiktionary, has a new Wiktionary-specific code devised for it. This code consists of two parts. The first part is a relevant family code, usually consisting of three letters and usually from ISO 639-5; it is followed by a hyphen. The second part is a series of three lowercase letters which approximate the language name. (No digits, upper case letters, etc. are used: IANA tags allow these, case independent, but MediaWiki software is more restrictive.) For example, Gallo is roa-gal: "roa" is the ISO 639-5 code for Romance languages, "gal" abbreviates "Gallo". This system is used even if the relevant family code is itself an exceptional code rather than an ISO-derived code; for example, Salvadoran Lenca (of the Lencan family, code qfa-len) has the code qfa-len-slv.
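The priority order above can be sketched in Python as follows. This is only an illustration, not how Wiktionary's modules actually work: the function names choose_code and make_exceptional_code and their parameters are hypothetical, and the sketch does not model editorial exceptions such as retaining a deprecated ISO 639-1 code (for example sh).

```python
from typing import Optional


def make_exceptional_code(family_code: str, abbreviation: str) -> str:
    """Build a Wiktionary-specific code: family code, hyphen, three lowercase letters."""
    is_valid = (
        len(abbreviation) == 3
        and abbreviation.isascii()
        and abbreviation.isalpha()
        and abbreviation.islower()
    )
    if not is_valid:
        raise ValueError("abbreviation must be exactly three lowercase letters")
    return f"{family_code}-{abbreviation}"


def choose_code(
    iso639_1: Optional[str] = None,
    iso639_3: Optional[str] = None,
    iso639_2: Optional[str] = None,
    wmf_code: Optional[str] = None,
    family_code: Optional[str] = None,
    abbreviation: Optional[str] = None,
) -> str:
    """Return the first available code, following the priority order in the text."""
    # ISO 639-1 first, then ISO 639-3, then ISO 639-2, then a WMF-devised code.
    for candidate in (iso639_1, iso639_3, iso639_2, wmf_code):
        if candidate:
            return candidate
    # Last resort: devise a Wiktionary-specific exceptional code.
    if family_code and abbreviation:
        return make_exceptional_code(family_code, abbreviation)
    raise ValueError("no code available and no family code/abbreviation given")


# Examples taken from the text above.
english = choose_code(iso639_1="en", iso639_3="eng")
nahuatl = choose_code(iso639_2="nah")
chavacano = choose_code(wmf_code="cbk-zam")
gallo = choose_code(family_code="roa", abbreviation="gal")
lenca = make_exceptional_code("qfa-len", "slv")
```

With the inputs shown, English resolves to en, Nahuatl to nah, Zamboanga Chavacano to cbk-zam, Gallo to roa-gal and Salvadoran Lenca to qfa-len-slv, matching the examples given in the list.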

Reconstructed ancestor languages are assigned exceptional codes consisting of the language family's code with "-pro" added to the end. Proto-Germanic, for example, is represented by the code gem-pro.

Keep in mind that not all lects which have been assigned codes by the ISO are assigned codes or included by Wiktionary. This is the case for some constructed languages, for example. There are also many lects to which the ISO has assigned codes but which are not treated as distinct languages on Wiktionary. For example, the ISO assigned Moldovan/Moldavian the 639-1 code mo, but Wiktionary regards it as a form of Romanian and represents both it and Romanian by the same code, ro. See Wiktionary:Language treatment for more information.

In a small number of cases, there is a mismatch between the (typically ISO-derived) code used by Wiktionary to represent a language and the code used by the Wikimedia Foundation. For example, Aromanian is represented on Wiktionary and in ISO 639-3 by the code rup, but the WMF uses the code roa-rup and locates the Aromanian Wikipedia at roa-rup.wikipedia.org. The templates such as Teamplaid:Wikipedia which Wiktionary uses to link to its sister projects accept only Wiktionary codes. To enable linking to projects (such as the Aromanian Wikipedia) for which the WMF uses special codes, Module:Wikimedia languages maps Wiktionary codes to Wikimedia codes.
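The mapping idea can be illustrated with a short Python sketch. It does not reproduce Module:Wikimedia languages: the table WIKIMEDIA_CODES, the function names, and the fallback rule (use the Wiktionary code unchanged when no mapping is listed) are assumptions made for this example, and the table holds only the one mismatch mentioned above.

```python
# Hypothetical lookup table: Wiktionary code -> Wikimedia code.
# Only the Aromanian mismatch described in the text is listed here.
WIKIMEDIA_CODES = {
    "rup": "roa-rup",
}


def wikimedia_code(wiktionary_code: str) -> str:
    """Return the Wikimedia code, assuming it equals the Wiktionary code unless mapped."""
    return WIKIMEDIA_CODES.get(wiktionary_code, wiktionary_code)


def wikipedia_host(wiktionary_code: str) -> str:
    """Build the Wikipedia host name for a language from its Wiktionary code."""
    return f"{wikimedia_code(wiktionary_code)}.wikipedia.org"


aromanian_host = wikipedia_host("rup")
chavacano_host = wikipedia_host("cbk-zam")
```

Here the Wiktionary code rup resolves to roa-rup.wikipedia.org, while cbk-zam passes through unchanged because Wiktionary and the WMF use the same code for Zamboanga Chavacano.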

Language families

Wiktionary sorts languages into families. Most families are related through descent from a common ancestor, but a few are merely categories, such as "creoles and pidgins". Wiktionary records which family a language belongs to in the data modules of Module:languages. Like languages, families are represented by unique codes and have unique canonical names.

English belongs to the West Germanic languages (code: gmw).
Serbo-Croatian belongs to the South Slavic languages (code: zls).
Abenaki belongs to the Algonquian languages (code: alg).
Nahuatl belongs to the Nahuan languages (code: azc-nah).

Some languages are not naturally descended from other languages, but show other origins. These use special types of families:

The widely-used constructed language Esperanto is an artificial language (code: art).
Zamboanga Chavacano, a creole language, is grouped under the creole or pidgin languages (code: crp).

Scripts used by a language

Wiktionary also records which scripts (writing systems) a language is written in. This information is primarily used by modules to be able to automatically detect and format non-Latin-alphabet text appropriately. Scripts, too, have unique codes and canonical names.

English is written in the Latin script (code: Latn).
Serbo-Croatian is written in both the Latin and the Cyrillic scripts (codes: Latn and Cyrl).

Finding and organising terms in a language

Every language has a main category which contains all terms that Wiktionary has for that language. This category is named using the canonical name of the language, followed by the word "language". For example, the main category for English is Roinn-seòrsa:English language. If the canonical name of the language already ends in the word "language", nothing is added (hence Roinn-seòrsa:American Sign Language).
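The naming rule can be expressed as a small Python sketch. The function name main_category and the case-insensitive check on the final word are assumptions of this example; the category namespace prefix is copied from the examples above.

```python
CATEGORY_PREFIX = "Roinn-seòrsa:"  # category namespace as written on this page


def main_category(canonical_name: str) -> str:
    """Return the main category title for a language's canonical name."""
    words = canonical_name.split()
    # Assumption: the "already ends in 'language'" check ignores letter case.
    if words and words[-1].lower() == "language":
        return f"{CATEGORY_PREFIX}{canonical_name}"
    return f"{CATEGORY_PREFIX}{canonical_name} language"


english_category = main_category("English")
asl_category = main_category("American Sign Language")
```

For "English" this gives Roinn-seòrsa:English language, and for "American Sign Language" the name is left as it is.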

The main category for a language will have a variety of subcategories, which organise terms in various ways. The most important is the "lemma" category tree, which organises all lemmas in a language by their part of speech. As Wiktionary is always being expanded and improved upon, not all languages have their own categories yet, and certain subcategories may still be empty or missing. Categories are created as needed, when new entries are added to them. When content is added in a language lacking a category, the category can simply be created using the {{Langcatboiler}} template. The subcategories use various other templates, including {{Poscatboiler}}, {{Derivcatboiler}} and {{Topic cat}}.

Languages generally also have a page which contains information that is useful to users who want to create or edit entries in that language. This page is named ":en:Wiktionary:About (canonical name of language)", for example Wiktionary:About English or Wiktionary:About Spanish. These pages contain a wide variety of information, depending on what other editors have found useful to note. They may explain which templates to use, specific conventions regarding spelling, pronunciation or transliteration, and more. By convention, a shortcut redirect is created to these pages for easy access, named WT:A(language code). For example, WT:AEN redirects to Wiktionary:About English (for which the code is en).
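As a rough illustration of these two naming conventions, the following Python sketch derives both titles. The function names are hypothetical, and upper-casing the language code is inferred from the single WT:AEN example, so it may not hold for every code.

```python
def about_page(canonical_name: str) -> str:
    """Return the title of the 'About' page for a language."""
    return f"Wiktionary:About {canonical_name}"


def about_shortcut(language_code: str) -> str:
    """Return the shortcut redirect title, assuming the code is simply upper-cased."""
    return f"WT:A{language_code.upper()}"


english_about = about_page("English")
english_shortcut = about_shortcut("en")
```

For English (code en) this produces Wiktionary:About English and WT:AEN, as in the example above.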

Storing and retrieving language information

Templates and modules use a system for storing and retrieving the various pieces of information that may be associated with a language. The module Module:languages is used to retrieve all language-related information from other modules. This module cannot be used directly in a template, so instead there is another module named Module:languages/templates, which allows templates to access the information.

An overview of all basic information about a language, such as its canonical name, alternative names, code, family or scripts, can be looked up at Wiktionary:List of languages (or WT:LL for short). This is useful if you need to look up the code for a particular language, or need to know what the canonical name of a language is.

The data itself is not stored in Module:Languages, but instead is contained in a number of data modules (see Roinn-seòrsa:Language data modules). These are organised as follows:

Module:Languages/data2 contains information for all languages whose code consists of two letters. Thus, information for English (code: en) is stored here.
Module:Languages/data3/a through Module:Languages/data3/z contain information for all languages whose code consists of three letters. There are 26 submodules, divided by the first letter of the code. Thus, information for Old English (code: ang) is located in Module:Languages/data3/a, while information for Old Norse (code: non) is in Module:Languages/data3/n.
Module:Languages/datax contains all remaining languages, which have so-called "exceptional" codes.
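The division into data modules can be sketched in Python as below. This is not the lookup code used by the modules themselves; the function name data_module_for is hypothetical, and the sketch assumes that "two letters" and "three letters" mean exactly that many lowercase ASCII letters.

```python
import re


def data_module_for(code: str) -> str:
    """Return the data module expected to hold the entry for a language code."""
    if re.fullmatch(r"[a-z]{2}", code):
        # Two-letter codes all live in a single module.
        return "Module:Languages/data2"
    if re.fullmatch(r"[a-z]{3}", code):
        # Three-letter codes are split by their first letter.
        return f"Module:Languages/data3/{code[0]}"
    # Everything else is treated as an "exceptional" code.
    return "Module:Languages/datax"


english_module = data_module_for("en")
old_english_module = data_module_for("ang")
old_norse_module = data_module_for("non")
proto_germanic_module = data_module_for("gem-pro")
```

Under these assumptions en maps to Module:Languages/data2, ang to Module:Languages/data3/a, non to Module:Languages/data3/n, and the exceptional code gem-pro to Module:Languages/datax.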

For instructions on how to edit this information, see the documentation of any of the data modules.

Lects which appear only in etymology sections

Some lects (dialects, chronolects and topolects) are referred to in etymology sections without having entries. These lects are given certain exceptional codes which generally do not fit the pattern described above. The lects and their codes are stored in Module:Etymology language/data and described in Wiktionary:Dialects.