CLICS

Frequently Asked Questions

Colexification

In the context of CLICS, we use the term colexification (coined by François 2008) to refer to the situation in which two or more of the meanings in our lexical sources are covered in a language by the same lexical item. For instance, we would say that Russian рука colexifies ‘hand’ and ‘arm’, that is, concepts that are semantically related to each other. Roughly speaking, colexification can correspond to either polysemy or semantic vagueness in lexical semantic analyses. Since we have not performed the analyses that would allow us to discriminate between the two, we chose colexification as a label that deliberately makes no commitment with regard to this distinction. However, we offer measures to rule out effects of accidental homonymy.
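
As a rough illustration of this definition, the following sketch groups a wordlist by language and form and reports every pair of concepts that share a form. The rows, the romanized forms, and the function name `find_colexifications` are hypothetical and serve only as an example; this is not the procedure or the data format used by CLICS itself.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy wordlist: (language, concept, form) rows.
WORDLIST = [
    ("Russian", "hand", "ruka"),
    ("Russian", "arm", "ruka"),
    ("Russian", "foot", "noga"),
    ("Russian", "leg", "noga"),
    ("English", "hand", "hand"),
    ("English", "arm", "arm"),
]


def find_colexifications(rows):
    """Return (language, form, concept_a, concept_b) for every pair of
    concepts that are expressed by the same form in the same language."""
    concepts_by_form = defaultdict(set)
    for language, concept, form in rows:
        concepts_by_form[(language, form)].add(concept)

    result = []
    for (language, form), concepts in sorted(concepts_by_form.items()):
        # A form with a single concept yields no pairs.
        for a, b in combinations(sorted(concepts), 2):
            result.append((language, form, a, b))
    return result


for item in find_colexifications(WORDLIST):
    print(item)
```

Note that such a procedure only detects identity of form; it cannot tell polysemy, vagueness, and homonymy apart.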

Conceptual links, which constitute our main interest in colexification, are clearly not the only possible reason why two concepts can be expressed by the same form in a given language. The most prominent among the others is homonymy, which is, in very simple terms, an accidental similarity in form, often arising from sound changes that cause originally distinct lexical items to collapse phonologically. Such cases may lead to spurious links in the database (compare, for example, the links between the concepts ‘arm’ and ‘poor’, which are due to homonymy in some Germanic languages). To deal with this issue in a consistent and generally applicable way (even if the history of the lexical items in question is not known), we recommend employing a typological criterion to distinguish between homonymy and polysemy (see Croft 1990): for a semantic connection to be accepted as genuine rather than accidental, it should be detectable in more than one language family. We should, however, point out that Croft developed and originally applied this criterion in the realm of polyfunctionality of grammatical markers, that is, items belonging to a paradigmatically relatively well-structured set with a manageable semantic range. When applied to lexical meanings, there is a danger that the criterion rules out a set of genuine, but simply rare, semantic associations. Still, we feel that our approach is justified, methodologically because it offers a simple and non-subjective decision criterion, and conceptually because our approach relies on cross-linguistic data in the first place.
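
The typological criterion can be sketched as a simple filter that keeps a link only if it is attested in at least two language families. The records below, the placeholder names "Language X" and "Family X", and the function `links_in_multiple_families` are assumptions made for illustration; they are not taken from the CLICS data or code.

```python
from collections import defaultdict

# Hypothetical colexification records:
# (concept_a, concept_b, language, family).
RECORDS = [
    ("arm", "hand", "Russian", "Indo-European"),
    ("hand", "arm", "Language X", "Family X"),
    ("arm", "poor", "German", "Indo-European"),
    ("arm", "poor", "Dutch", "Indo-European"),
]


def links_in_multiple_families(records, min_families=2):
    """Map each link to its number of families, keeping only links
    attested in at least `min_families` distinct families."""
    families = defaultdict(set)
    for concept_a, concept_b, _language, family in records:
        # Links are undirected, so normalize the order of the pair.
        link = tuple(sorted((concept_a, concept_b)))
        families[link].add(family)
    return {
        link: len(found)
        for link, found in families.items()
        if len(found) >= min_families
    }


print(links_in_multiple_families(RECORDS))
```

In this toy data, the link between ‘arm’ and ‘poor’ is dropped because both attestations come from one family, which also shows how a genuine but rare association would be lost in the same way.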

Using CLICS

Using our query interface, you can check whether specific concepts are linked in the language varieties used in CLICS. You can also check how many links are reported for a given concept. If you want to view the data in a visually more appealing way, you can browse through the concept networks we extracted from the data (see How do the visualizations work? for a more detailed description of the ideas behind the visualization). You can also download parts of the data and conduct large-scale quantitative investigations (see List, Terhalle, and Urban 2013 for an example).

How do the visualizations work?

The CLICS database can be accessed through a web-based visualization that renders each community of the network in a force-directed graph layout. The visualization features a number of interactive components that help to detect areal and genealogical patterns in the database. When you hover over an edge of the graph, all languages showing the respective colexification pattern are listed together with their genealogical or areal information and the word form that expresses the concepts in question. In addition, a world map highlights all languages in which a given colexification pattern occurs, in order to make areal patterns more easily detectable. The visualizations are implemented in JavaScript using the D3 library (Bostock et al. 2011). Each community can be directly accessed via a URL and saved as SVG. A more detailed description is given in a paper by Mayer et al. (2014). You may also check the slides of the talk accompanying the paper.
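
To give an idea of the kind of input a force-directed layout works with, the sketch below converts the weighted links of one community into a node-link structure of the sort commonly passed to D3 force layouts. The example links, the field names `name`, `source`, `target`, and `value`, and the function `to_node_link` are assumptions for illustration; the format actually used by the CLICS visualizations may differ.

```python
import json

# Hypothetical weighted links of one community:
# (concept_a, concept_b, number of languages showing the colexification).
COMMUNITY = [
    ("arm", "hand", 3),
    ("hand", "finger", 1),
]


def to_node_link(links):
    """Build a dictionary with a node list and an edge list in which
    edges refer to nodes by their position in the node list."""
    names = sorted({c for a, b, _weight in links for c in (a, b)})
    index = {name: i for i, name in enumerate(names)}
    return {
        "nodes": [{"name": name} for name in names],
        "links": [
            {"source": index[a], "target": index[b], "value": weight}
            for a, b, weight in links
        ],
    }


print(json.dumps(to_node_link(COMMUNITY), indent=2))
```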

Sources of CLICS

Currently, CLICS draws on four different sources, all of which are themselves freely available online.

The Intercontinental Dictionary Series (IDS, Key & Comrie 2007 eds.) features lexical data for 233 of the world's languages. IDS data were provided mostly by experts on the respective languages, although in some cases published written sources were used. There are 1,310 entries to be filled for each language, though, of course, there are gaps in coverage for individual languages. The list of concepts is inspired by Buck (1949). Of the 233 languages in the IDS, 178 were automatically cleaned and included in CLICS.
The IDS list, in turn, provides the basis for the choice of meanings in the World Loanword Database (WOLD, Haspelmath & Tadmor 2009 eds.). The principal aim of this source is to provide a basis for generalizations on the borrowability of items in different parts of the lexicon. The WOLD data consist of vocabularies of between 1,000 and 2,000 items for 41 languages, with annotations about the borrowing history of particular items where applicable. WOLD data were coded by experts on the respective languages, in some cases with the aid of extant sources. Of the 41 languages in WOLD, 33 are included in CLICS.
The Logos Dictionary is a freely accessible multilingual online dictionary that is regularly updated by a network of professional translators. It offers lexical data for more than 60 different languages. We manually extracted lexical data for 4 languages that were present in neither IDS nor WOLD.
The Språkbanken project (University of Gothenburg) offers a number of wordlists for South Asian and Himalayan languages. The wordlists mirror the IDS format closely, and we included 6 of the currently 8 wordlists in CLICS.

Citing CLICS

CLICS can be cited as follows:

List, Johann-Mattis, Thomas Mayer, Anselm Terhalle, and Matthias Urban (2014). CLICS: Database of Cross-Linguistic Colexifications. Marburg: Forschungszentrum Deutscher Sprachatlas (Version 1.0, online available at http://CLICS.lingpy.org, accessed on ).

Reliability of the Data

The structure of the data in CLICS is a direct image of the structure of the data in IDS and WOLD and does not involve a reanalysis of any sort on our part. However, it must be emphasized that the meaning associations reported in CLICS are recovered from sheer identity of form in different cells in the sources we have used, and do not necessarily rest on language-internal semantic analysis (see also the sections on colexification and homonymy). Furthermore, we have no control over artifacts that

may have arisen in the process of data gathering itself,
were created by mapping the predefined concepts onto the actual languages, and
were introduced when cleaning parts of the data automatically in which the textual coding was not provided in a consistent way.

For these reasons, we strongly recommend checking the actual sources whenever a conceptual link reported by our database is crucial for your argument. If you find errors in the data, we would be very glad if you could either inform us via email (clics@lingpy.org) or file an issue at http://github.com/clics/clics/issues, providing the name of the language variety and its ISO code in the issue description. Unfortunately, we will not be able to correct errors immediately, but we record all reported errors in our issue tracker and will make sure to correct them before publishing the next release of CLICS.

Areal Effects in the Data

Coverage of the world’s languages in both IDS and WOLD is biased towards certain regions of the world. In the case of IDS, South American languages and languages of the Caucasus are overrepresented. In the case of WOLD, languages of Europe figure particularly prominently. Since it is possible, and even to be expected, that certain polysemies in the lexicon are frequent in, or even restricted to, certain areas of the world, we advise researchers interested in cross-linguistic diversity to take appropriate measures to rule out unwarranted generalizations due to areal effects.

Statistics on CLICS

CLICS (Version 1.0) offers information on colexification in 221 different language varieties covering 64 different language families. All language varieties in our sample comprise a total of 301,498 words covering 1,280 different concepts. Using a strictly automatic procedure, we identified 45,667 cases of colexification that correspond to 16,239 different links between the 1,280 concepts covered by our data.
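
The difference between cases of colexification and links can be made explicit with a small sketch: every attestation in a language counts as one case, whereas a link is a distinct unordered pair of concepts. The toy attestations, the placeholder language names, and the function `summarize` are hypothetical; only the two totals at the end are the figures reported above.

```python
def summarize(instances):
    """Return (number of cases, number of distinct links) for a list of
    (language, concept_a, concept_b) attestations."""
    # Normalize the pair so that (arm, hand) and (hand, arm) coincide.
    links = {tuple(sorted((a, b))) for _language, a, b in instances}
    return len(instances), len(links)


# Hypothetical toy attestations.
TOY = [
    ("Language 1", "arm", "hand"),
    ("Language 2", "hand", "arm"),
    ("Language 1", "foot", "leg"),
]
print(summarize(TOY))

# Figures reported for CLICS (Version 1.0).
REPORTED_CASES = 45667
REPORTED_LINKS = 16239
cases_per_link = REPORTED_CASES / REPORTED_LINKS
print(f"average cases per link: {cases_per_link:.2f}")
```

On the reported figures, each link is thus attested by fewer than three cases on average, although the distribution across individual links may be very uneven.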

Contact

For technical questions regarding the data, please contact Johann-Mattis List (Philipps-University Marburg, Germany).