Archive for the ‘Language’ Category

World’s biggest dictionary goes online

Posted on January 24th, 2007 at 11:01 — Filed under Books, Language

This week’s Leiden University newsletter has a story on the Dictionary of the Dutch language becoming freely available online in a few days. I forwarded this report to Mark Liberman, co-creator and senior writer of Language Log, a weblog about language (no kidding!) that I have greatly enjoyed reading since I discovered it last year. Mark found the story interesting enough to post it on LL, expanding it a bit to make it a more entertaining read than I could ever have managed.

Unfortunately, there doesn’t seem to be an English report on the WNT going online, so all the juicy details were lost on non-Dutch speakers. (Of which I’m sure there are many amongst the Language Log audience.) Not anymore, though, as I will provide a translation right here:

The Woordenboek der Nederlandsche Taal (WNT; Dictionary of the Dutch Language) will become freely available on the internet on Saturday, January 27, at wnt.inl.nl. Is this news important only to scholars of Dutch, philologists and linguists? “Magnificent,” responds Harm Beukers, professor in history of medicine.

Records
The Woordenboek der Nederlandsche Taal is a record-breaking piece of work. It required 134 years of work, from 1864 to 1998. It contains hundreds of thousands of entries with definitions of Dutch words and more than one and a half million quotes from sources from between 1500 and 1976. The dictionary was published in 686 parts collected in forty volumes. This makes it a very complete account of nearly five centuries of Dutch language history.

CD-ROM
On the other hand, this also makes it a bulky and even unwieldy dictionary; it is not one any person would readily have on their bookshelves. A trip to the university library or another academic library is required to consult it. This situation improved when the dictionary was published on CD-ROM in 2000. (An incomplete edition, up to the W, had already been published in 1995.) However, this CD-ROM edition had its own disadvantages, certainly compared to the online availability soon to be realized.

Useful sources
Professor Beukers is very happy the WNT will soon be available on the internet. Up to now, he had to cycle to the university library to do research. “There was the cd-rom, of course,” he says, “but I just never got around to buying it. The biggest advantage is that one can now consult the dictionary while writing a paper.” Rob Visser, professor in history of the natural sciences [and no relative of mine, --Ruud], is also delighted. “I only used the WNT sporadically, but if it becomes more easily accessible, I will certainly consult it more often. The WNT uses sources that are not always obvious for my area of work.” Visser recalls a student who quickly found a list of sources in the WNT they could use for their research on evolution.

Magnifying glass
Marietje van der Schaar, a researcher at the university’s philosophy department, also makes frequent use of the WNT and–because she often writes in English–the Oxford English Dictionary (OED), the English equivalent to the WNT. Van der Schaar: “It is wonderful that the WNT will be available online. I have the OED at home, but I can only read it with the magnifying glass that came with it. It is important for me to know how certain words were used in the past, and these dictionaries provide a lot of information on the development of words like kennen and weten. In modern English there is no distinction between these words; both are translated as to know. The OED tells me there was a distinction in the past: to ken and to wit.”

Definitions
All words in the online WNT can be looked up using the original 1863 spelling rules or modern rules. It is also possible to look for parts of words, like suffixes and prefixes, for word categories, like interjections and conjunctions, or for terms used in the definitions, like all words that have the term plant or ship in their definition.

Information outside the dictionary
An important advantage of the online WNT over the CD-ROM edition is that links could be added to information outside the dictionary. For instance, all words that have been published so far in the Etymologisch Woordenboek van het Nederlands (Etymological Dictionary of the Dutch Language), with the most recent developments in etymological research, are coupled to their equivalents in the WNT. Further links are available to similar words in Afrikaans, to figures of plants and animals, and to dialect charts. The source list of the online WNT was completely revised: it contains a large number of new works, which turned out to have been used for the printed WNT as well. This new source list allowed many entries in the WNT to be dated more accurately.

Using the online WNT will be free of charge. After a one-time registration as a user, the dictionary can be consulted wherever and whenever one wants to.

The newsletter article also contains two pieces of text set apart from the main body. The first piece explains how the WNT came to be:

Historical dictionary
The WNT is a historical dictionary. For every word, it lists the grammatical characteristics, the origin, the original meaning, and other meanings that developed over time. The WNT also gives derivations and compound words and information concerning usage in expressions and proverbs. Of particular note is the fact that the descriptions are fully based on an independent collection of source material: almost ten thousand literary and non-literary sources with millions of quotes. However, the WNT is also a historical dictionary in another sense.

New spelling rules
Matthias de Vries and Lammert te Winkel, the driving forces behind the WNT, created a new set of spelling rules to be used in the dictionary. These rules are appropriately known nowadays as the De Vries and Te Winkel spelling. In 1863, Te Winkel published De grondbeginselen der Nederlandsche spelling. Ontwerp der spelling voor het aanstaande Nederlandsch Woordenboek (The foundations of the Dutch spelling. Design for the spelling rules for the upcoming Dictionary of the Dutch Language). These rules soon became very popular and were adopted in Belgium as early as November 21, 1863. De Vries and Te Winkel published the Woordenlijst voor de spelling der Nederlandsche taal (List of words for the spelling rules in the Dutch language) in 1866 to be used by the common man. The entire WNT was written according to these rules, which thus outlived two spelling reforms before the WNT was completed in 1998.

1921
In order to finish before 2000, the board of the Instituut voor Nederlandse Lexicologie (Institute for Dutch Lexicology), founded in 1967 and overseeing work on the WNT ever since, decided in 1976 that no words first used after 1921 would be added. Words like vacantiegeld and zappen are therefore absent.

The second additional bit of text compares the WNT to some other large dictionaries, but I’ll leave that out here, because for some reason my weblog refuses to display the table properly. Suffice it to say the WNT is of equal size to the Oxford English Dictionary (OED), the Deutsches Wörterbuch (DWB) by the Grimm brothers and the Dai Kan-Wa Jiten (DKWJ; a Chinese-Japanese dictionary) by Tetsuji Morohashi. It has been said the WNT is actually the world’s biggest dictionary; in terms of pages, that certainly seems to be true, but the OED contains more entries. As often with size comparisons, the winner depends on the exact definition of “biggest”.

Plutoed planet gets Word of the Year honours

Posted on January 10th, 2007 at 13:01 — Filed under Language, Science: Astronomy

Astronomy meets linguistics! The American Dialect Society picked plutoed, from the verb to pluto, as 2006’s Word of the Year:

In its 17th annual words of the year vote, the American Dialect Society voted plutoed as the word of the year, in a run-off against climate canary. To pluto is to demote or devalue someone or something, as happened to the former planet Pluto when the General Assembly of the International Astronomical Union decided Pluto no longer met its definition of a planet.

Pluto’s demotion to dwarf planet is getting this little ball of rock and ice more fame in a single year than in all 76 years combined since it was first discovered!

Teens use 20 words for third of speech… as do we all

Posted on December 18th, 2006 at 13:12 — Filed under Language

Last week, the BBC, which if anything has a reputation for reliability, ran a story on teenagers’ extremely poor language skills. The article begins as follows:

Britain’s teenagers risk becoming a nation of “Vicky Pollards” held back by poor verbal skills, research suggests. And like the Little Britain character the top 20 words used, including yeah, no, but and like, account for around a third of all words, the study says.

It then goes on to provide some nuances, which of course were ignored by most other media reporting on this research. Hence, the main point came to be the factoid that teenagers use only 20 words for a third of their speech and writing.

Over at Language Log, there are two posts showing why this is hugely misleading. First off, Mark Liberman writes:

I’m sure that Britain’s teens would benefit from additional vocabulary instruction. But the assertion that they “use just 20 words for a third of everything they say” is a spectacularly lousy argument for this conclusion.

Here’s why. The Zipf’s-law distribution of words, whether in speech or in writing, whether produced by teens or the elderly or anyone in between, means that the commonest few words will account for a substantial fraction of the total number of word-uses. And in modern English, the fraction accounted for by the commonest 20 orthographical word-forms is in the range of 25-40%, with the 33% claimed for the British teens being towards the low side of the observed range.

For example, in the Switchboard corpus — about 3 million words of conversational English collected from mostly middle-aged Americans in 1990-91 — the top 20 words account for 38% of all word-uses. In the Brown corpus, about a million words of all sorts of English texts collected in 1960, the top 20 words account for 32.5% of all word-uses. In a collection of around 120 million words from the Wall Street Journal in the years around 1990, the commonest 20 words account for 27.5% of all word-uses.
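The Zipf's-law point can be made concrete with a rough sketch. It assumes an idealized distribution in which the word of rank r occurs with frequency proportional to 1/r; the function name, the exponent of 1 and the vocabulary sizes are illustrative assumptions, not figures from the Language Log post. Real corpora only approximate this shape.

```python
def zipf_top_share(top_n: int, vocab_size: int, exponent: float = 1.0) -> float:
    """Share of all word-uses taken by the top_n ranks under an idealized Zipf law.

    Assumption: the word of rank r has frequency proportional to 1 / r**exponent.
    """
    weights = [1.0 / rank ** exponent for rank in range(1, vocab_size + 1)]
    return sum(weights[:top_n]) / sum(weights)


if __name__ == "__main__":
    # Hypothetical vocabulary sizes, chosen only to show the trend.
    for vocab_size in (10_000, 50_000, 200_000):
        share = zipf_top_share(20, vocab_size)
        print(f"vocabulary of {vocab_size:>7} words: top 20 cover {share:.1%}")
```

Under these assumptions the top 20 words cover somewhere between a quarter and two fifths of all word-uses for each of the three vocabulary sizes, which is the range quoted above. The share shrinks only slowly as the vocabulary grows.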

Following up on that, Geoff Pullum had a closer look at the original BBC article:

I took the entire text of the actual BBC article (…), computed the top 20 most frequent words in it, and worked out what percentage of the total it was. The answer is between 36 and 40 percent. (The difference depends on how much you collapse different word forms together into lexemes. Collapsing genitives and plurals with non-genitive singulars makes hardly any difference to the results, but treating is, are, was, and were as different words rather than as representatives of the verb be lowers the figure slightly. If you do the collapsing, the top 20 words make up over 39.5% of the text. If you don’t, the top 20 account for just over 36%.)

So this is the situation. This staggeringly stupid news report states that Britain’s teenagers are “held back by poor verbal skills” because the evidence shows that the top 20 words in their speech account for 33% of all the words they use — the implication being that they aren’t using enough words, they’re just repeating a few words like “yeah” and “no” and “but” and “like”. But in the staggeringly stupid article itself, the top 20 words account for substantially more than that. So Britain’s science writers (at least at the BBC) are even more verbally retarded.

In case you want to see the results I got (which you can easily check for yourself), here they are (with the lexeme collapsing done). There are 402 words in the text (if you replace hyphens by spaces), and this table shows the numbers of occurrences for the top 20 in frequency:

25   the
16   forms of the verb be
13   of
10   and
10   in
10   to
9    forms of the noun word
8    a
7    but
6    as
6    forms of the pronoun it
5    forms of the pronoun he
5    no
5    forms of the verb say
5    speech
4    by
4    forms of the noun school
4    that
4    which
4    with

These words account for 25 + 16 + 13 + 10 + 10 + 10 + 9 + 8 + 7 + 6 + 6 + 5 + 5 + 5 + 5 + 4 + 4 + 4 + 4 + 4 = 160 occurrences, and 160/402 = 39.8%.

Even if you insist on going with raw word forms with not even the singulars and plurals collapsed, my count shows the percentage only going down to 36%, which is still higher than the teenagers’ alleged 33%.

Ergo, the teenagers sampled in the study reported by the BBC are more verbally skilled than the writers of the BBC article.
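For anyone who wants to repeat the check on a text of their own, here is a minimal sketch of the calculation. It works on raw word forms only (no collapsing into lexemes), and the tokenizer is a simplifying assumption: lowercase everything, replace hyphens by spaces, and keep runs of letters and straight apostrophes. The function and variable names are made up for this illustration. The list of counts is copied from the table above, so the last lines merely re-do the arithmetic.

```python
import re
from collections import Counter


def top_share(text: str, top_n: int = 20) -> float:
    """Fraction of all word-uses accounted for by the top_n most frequent word forms."""
    # Simplified tokenizer (an assumption): hyphens become spaces, case is ignored.
    words = re.findall(r"[a-z']+", text.lower().replace("-", " "))
    if not words:
        return 0.0
    counts = Counter(words)
    top_total = sum(count for _, count in counts.most_common(top_n))
    return top_total / len(words)


# Occurrence counts for the top 20 entries, copied from the table above.
TABLE_COUNTS = [25, 16, 13, 10, 10, 10, 9, 8, 7, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4]
TABLE_TOTAL_WORDS = 402

if __name__ == "__main__":
    occurrences = sum(TABLE_COUNTS)
    print(f"{occurrences} occurrences, {occurrences / TABLE_TOTAL_WORDS:.1%} of the text")
    print(f"{top_share('the cat and the dog', top_n=1):.0%}")
```

Feeding it the full text of an article (or a transcript of teenagers talking) gives the raw-word-form figure. Getting the lexeme-collapsed figure would need an extra step that maps is, are, was and were onto be, and so on.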

Parentheses and slashes

Posted on December 11th, 2006 at 10:12 — Filed under Language

I once read that one of the differences between Dutch and English (apart from several obvious ones) is the Dutch practice of using parentheses and slashes to allow for several possibilities in one sentence without having to write them all down explicitly. A simple example:

The student(s) that fail(s) the exam can try again in a few months.

Although there’s technically nothing wrong with this sentence in English, native speakers typically wouldn’t write it. In Dutch, however, this is quite common. Sometimes, though, people try to do too much. I found this sentence in a contract for my new savings account:

De bank is bevoegd (één van) de/het bankdienst(en) en/of product(en) te beëindigen en/of (de daarvoor verschuldigde vergoeding) te wijzigen.

I can’t even translate that literally, but I’ll try as best as I can:

The bank is allowed to terminate and/or change (one of) the service(s) and/or product(s) or its/their corresponding compensation.

I guess what they mean is they can either terminate or change any of the services and products I’m signing up for. In addition, they can change the compensation (e.g. interest). (As I translated it, the compensation can also be terminated, but the original Dutch sentence doesn’t allow for that.)

And there’s more. A few lines down, there’s this beauty:

Voor zover met betrekking tot de in deze overeenkomst vermelde (aanvragen (tot wijziging) van) product(en) reeds eerder met dezelfde rekeninghouder één/meer overeenkomst(en) is/zijn aangegaan, …

Translated as literally as possible, that becomes:

If one/more contract(s) was/were signed by this account owner in relation to (requests for (changes in)) the services mentioned in the present contract, …

Note, in particular, the nested parentheses. I really wonder if the person that wrote these sentences thought this was the clearest and easiest way to put it.

Buffaloing Buffalo buffaloes

Posted on December 4th, 2006 at 16:12 — Filed under Language

Did you know that Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo?

Boten Anna

Posted on October 18th, 2006 at 20:10 — Filed under Language, Random musings

If you want to know what this song really is about, have a look here: the Boten Anna video clip with Dutch or English subtitles.

Also available for Basshunter’s new single, Vi sitter i Ventrilo och spelar DotA: Dutch or English subtitles.

Russian English

Posted on February 10th, 2006 at 11:02 — Filed under Language, Science: Astronomy

Astronomers typically publish their results in peer-reviewed journals like the European Astronomy & Astrophysics or the American Astrophysical Journal. In addition, most papers are placed on arXiv.org, where they are freely accessible from all over the world. Not all the works that are placed on that website are of the high standard expected from journal papers. Today, I found a submission from a Russian scientist on two of Saturn’s moons, Ijiraq and Kiviuq. The abstract is as follows:

The problem of origin of outer irregular satellites of large planets is considered. The capture way of their origin most probable, however there is not detail theory. There are a number of irregular satellites, discovered in recent time. It gives an ability to investigate the statistics of orbital interaction and try to reconstruct real collision history of these objects We restrict this consideration by pair of orbits with close elements: Kiviuq and Ijiraq and determine period of close encounters between this satellites. It may be considered as a first step on road to the construction of theory of origin of the abundant class of irregular satellites.

The English in the first paragraph is just as Russian. The “genetic relations” are especially good… they hint at an entirely new way of star, planet and moon formation.

Nesvorny et all [1] research evolution of orbit orientations in asteroid families and prove fact of recent catastrophic destruction in asteroid main belt. On the other hand, the method of investigations of genetic relations between minor bodies successfully applied in our previous works [2]. It is naturally to apply these two ideas to irregular satellites orbits with close elements for determination their parent body possible catastrophic destruction epoch.

You can find the rest of this work here: http://arxiv.org/abs/astro-ph/0602011.

Switching to English (2)

Posted on January 28th, 2006 at 23:01 — Filed under Language, Weblog/Homepage

Err… didn’t I just say two weeks ago that I was switching to English? Well, yes, I did. And promptly forgot about it. Shame on me. Let’s see if I can stick with it this time.

Typing deficiency

Posted on January 24th, 2006 at 16:01 — Filed under Computers/Internet, Language

How bad does Google think my typing is? I search for psRRKM (a small program mentioned in an article). Google finds only one page and wonders whether I might have meant park by mistake.

Dutch!

Posted on January 23rd, 2006 at 11:01 — Filed under Language

“Speaking Dutch in the street is very important. I get emails from many people saying they feel unheimisch in the street,” said minister Rita Verdonk at a VVD congress last Saturday. Sounds nicely Dutch, that unheimisch.