CGN and Frequentielijsten_corpora_4.0.1

The Frequentielijsten_corpora documentation states:
“The product Frequentielijsten Corpora is a collection of lists of the 5000 most frequent words and their frequencies in a number of corpora that are available at the TST-Centrale.”

Inspecting CGN.woordvorm.txt, I see that the word “is” has two entries:
is 141417
is/uncertain 404

“The Corpus Gesproken Nederlands (CGN) is a collection of 900 hours (almost 9 million words) of contemporary Dutch speech, from Flemish and Dutch speakers.”

How can the word “is” have a frequency of only 141417 in a corpus of almost 9 million words?

In SONAR500.wordfreqlist.1-gram.total.top5000.tsv, I see the following:
is 5736376 122784975 22.9366

This seems more realistic than 141417.

Can anyone explain the CGN frequency number?

Thanks!
Brian

Dear Brian,

I’m including IvdNT colleague @vincent so that he can take a look at your
question.

best regards,
Dieter

  • I assume the is/uncertain entry covers cases that were hard to understand (CGN is a transcribed spoken corpus) and were therefore transcribed as is/uncertain.
  • I’ve double-checked the CGN.woordvorm.txt frequency of “is” against the online version of CGN (link: CLARIN Discovery Service). The number there is the same, and it amounts to 1.41% of all tokens.
  • The three numbers in the SONAR500 line are the frequency of the word, the cumulative frequency, and the cumulative relative frequency. SONAR500 contains about 500 million tokens. The frequency of “is” amounts to 1.09% of the total corpus, which is actually quite similar.

So it seems to me that these numbers are correct, and that the word form “is” accounts for 1.41% of the tokens in spoken Dutch and 1.09% in written Dutch.
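
For anyone who wants to redo the arithmetic, here is a minimal Python sketch. It assumes the third column of the SONAR500 line is a percentage of all tokens in the corpus, so that the corpus size can be derived from the two cumulative columns. The function names are made up for this illustration and are not part of any corpus tooling.

```python
# Values copied from the SONAR500 line quoted in this thread.
freq = 5_736_376          # frequency of "is"
cum_freq = 122_784_975    # cumulative frequency up to and including "is"
cum_rel_pct = 22.9366     # cumulative relative frequency (assumed to be a percentage)


def implied_total(cum_freq: int, cum_rel_pct: float) -> float:
    """Corpus size implied by a cumulative count and its cumulative percentage."""
    return cum_freq / (cum_rel_pct / 100)


def rel_freq_pct(freq: int, total: float) -> float:
    """Relative frequency of a token as a percentage of the corpus."""
    return 100 * freq / total


total = implied_total(cum_freq, cum_rel_pct)
sonar_is_pct = rel_freq_pct(freq, total)

print(f"implied SONAR500 size: {total:,.0f} tokens")
print(f'relative frequency of "is": {sonar_is_pct:.2f}%')
```

Under that assumption the implied corpus size comes out somewhat above 500 million tokens, and the share of “is” lands at roughly 1.1%, in the same range as the figure quoted above.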

Hope this helps.
v.

Hi Vincent,
Thanks for your answer. That helped me understand the numbers. Appreciated.

All the same, I am perplexed. The word “is” (third person singular,
derde persoon enkelvoud) must be one of the most frequently used words
in Dutch. I know this is the case in English, as it is for “ist” in German.

It seems odd that “is” only appears as 1.41% of the tokens in the CGN.

Does this mean that people on average only use the word “is” (in Dutch)
1.41% of the time?

I must be missing something really obvious…

namaste
Brian

Well, it is one of the most frequently used words. In the top ten list of tokens (including the full stop and “uh”), “is” is in 9th position. If we look at the top ten (according to the link provided above), we get:

token freq relfreq
. 938.711 9.31%
ja 309.405 3.07%
dat 262.852 2.61%
de 261.036 2.59%
en 227.629 2.26%
uh 205.174 2.03%
één 188.816 1.87%
ik 184.726 1.83%
is 141.806 1.41%
van 138.414 1.37%
So, these together already give us 28.35% of all tokens.
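
The 28.35% can be checked with a short Python sketch that parses the rows exactly as listed above. It assumes the dots in the freq column are thousands separators (Dutch number formatting); `parse_row` is a helper made up for this illustration.

```python
# Top-ten rows copied from the table above: token, freq, relfreq.
ROWS = """\
. 938.711 9.31%
ja 309.405 3.07%
dat 262.852 2.61%
de 261.036 2.59%
en 227.629 2.26%
uh 205.174 2.03%
één 188.816 1.87%
ik 184.726 1.83%
is 141.806 1.41%
van 138.414 1.37%
"""


def parse_row(line: str) -> tuple[str, int, float]:
    """Split one row into (token, absolute frequency, relative frequency in %)."""
    token, freq, rel = line.split()
    # Assumption: "938.711" means 938711, i.e. the dot is a thousands separator.
    return token, int(freq.replace(".", "")), float(rel.rstrip("%"))


top10 = [parse_row(line) for line in ROWS.splitlines()]

total_freq = sum(freq for _, freq, _ in top10)
total_rel = round(sum(rel for _, _, rel in top10), 2)

# Token count implied by the summed counts and the summed percentages.
implied_tokens = total_freq / (total_rel / 100)

print(f"top ten: {total_freq:,} tokens, {total_rel}% of the corpus")
print(f"implied corpus size: about {implied_tokens / 1e6:.1f} million tokens")
```

The implied size is roughly 10 million tokens. That would fit a corpus of almost 9 million words plus the “.” tokens, but this reading is an inference from the rounded percentages, not something the CGN documentation quoted here states.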

If I look up the word “is” in the English TenTen corpus in SketchEngine, I get a relfreq of 1% (619,382,856 cases out of 61,450,334,702 tokens), so that is the same order of magnitude for English.

Hi Vincent,
That clears it up for me. Appreciated.

Out of curiosity… does the 9.31% token “.” have a special significance?

namaste
Brian

It is just the sentence or utterance separator, as, of course, in spoken language there is no punctuation ;-)
