CGN and Frequentielijsten_corpora_4.0.1

Brian · 30 November 2024 16:13

The Frequentielijsten_corpora doc states:
The product Frequentielijsten Corpora is a collection of lists of the 5000 most frequent words and their frequencies in a number of corpora available at the TST-Centrale.

Inspecting CGN.woordvorm.txt, I see that the word “is” has two entries:
is 141417
is/uncertain 404

“The Corpus Gesproken Nederlands (CGN) is a collection of 900 hours (almost 9 million words) of contemporary Dutch speech, from Flemish and Dutch speakers.”

How can the word “is” have only a frequency of 141417 in a list of almost 9 million words?

In SONAR500.wordfreqlist.1-gram.total.top5000.tsv, I see the following:
is 5736376 122784975 22.9366

This seems more realistic than 141417.

Can anyone explain the CGN frequency number?

Thanks!
Brian

dieter · 3 December 2024 14:25

Dear Brian,

I’m including IvdNT colleague @vincent so that he can take a look at your
question.

best regards,
Dieter

vincent · 3 December 2024 14:47

I assume “is/uncertain” marks an occurrence that was hard to understand (it is a transcribed spoken corpus, after all) and was therefore transcribed as is/uncertain.
I’ve double-checked the CGN.woordvorm.txt frequency of “is” against the online version of CGN (CLARIN Discovery Service). The number there is the same, and it amounts to 1.41% of all tokens.
The numbers in SONAR500 are the frequency of the word, the cumulative frequency, and the cumulative relative frequency. SONAR500 contains about 500 million tokens. The total frequency of “is” amounts to 1.09% of the total corpus, which is actually quite similar.

So it seems to me that these numbers are correct and that the word form “is” accounts for 1.41% of the tokens in spoken Dutch and 1.09% in written Dutch.
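
For anyone who wants to redo the arithmetic, here is a minimal Python sketch. It assumes the three SONAR500 columns mean what is described above (frequency, cumulative frequency, cumulative relative frequency in percent), so that the corpus total can be recovered from the last two. The function names are made up for illustration and are not part of any corpus tooling.

```python
def total_tokens(cum_freq: int, cum_rel_pct: float) -> float:
    """Estimate the corpus size from a cumulative count and its cumulative percentage."""
    return cum_freq / (cum_rel_pct / 100.0)


def rel_freq_pct(freq: int, total: float) -> float:
    """Relative frequency of a single token, as a percentage of all tokens."""
    return 100.0 * freq / total


# SONAR500 row for "is": frequency, cumulative frequency, cumulative relative frequency (%)
sonar_freq, sonar_cum, sonar_cum_pct = 5736376, 122784975, 22.9366
sonar_total = total_tokens(sonar_cum, sonar_cum_pct)
sonar_is_pct = rel_freq_pct(sonar_freq, sonar_total)

# CGN: count of "is" from CGN.woordvorm.txt and the 1.41% share quoted above.
# Dividing one by the other gives the token total implied by those two figures.
cgn_is_freq, cgn_is_pct = 141417, 1.41
cgn_total_estimate = total_tokens(cgn_is_freq, cgn_is_pct)

print(f"SONAR500 implied total tokens: {sonar_total:,.0f}")
print(f"SONAR500 share of 'is': {sonar_is_pct:.2f}%")
print(f"CGN implied total tokens: {cgn_total_estimate:,.0f}")
```

Under these assumptions the SONAR500 share of “is” comes out at a little over 1%, and the CGN figures imply a total on the order of 10 million tokens, so a count of 141417 is consistent with a share of about 1.4%.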

Hope this helps.
v.

Brian · 3 December 2024 15:28

Hi Vincent,
Thanks for your answer. That helped me understand the numbers. Appreciated.

All the same, I am perplexed. The word “is” (third person singular,
derde persoon enkelvoud) must be one of the most frequently used words
in Dutch. I know this is the case for “is” in English, as it is for “ist” in German.

It seems odd that “is” only appears as 1.41% of the tokens in the CGN.

Does this mean that people on average only use the word “is” (in Dutch)
1.41% of the time?

I must be missing something really obvious…

namaste
Brian

vincent · 3 December 2024 15:54

Well, it is one of the most frequently used words. In the top ten list of tokens (including the full stop and “uh”), “is” is in 9th position. If we look at the top ten (according to the link provided above), we get:

token	freq	relfreq
.	938.711	9.31%
ja	309.405	3.07%
dat	262.852	2.61%
de	261.036	2.59%
en	227.629	2.26%
uh	205.174	2.03%
één	188.816	1.87%
ik	184.726	1.83%
is	141.806	1.41%
van	138.414	1.37%
So, these together already give us 28.35% of all tokens.

If I look up the word “is” in the English TenTen corpus in SketchEngine, I get a relfreq of 1% (619,382,856 cases out of 61,450,334,702 tokens), so that is the same order of magnitude for English.
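
The figures above can be checked with a short Python sketch. The rows are copied from the top ten list, where the counts use “.” as a thousands separator (938.711 means 938,711). The helper name `parse_rows` is hypothetical, chosen only for this example.

```python
TOP_TEN = """\
.\t938.711\t9.31%
ja\t309.405\t3.07%
dat\t262.852\t2.61%
de\t261.036\t2.59%
en\t227.629\t2.26%
uh\t205.174\t2.03%
één\t188.816\t1.87%
ik\t184.726\t1.83%
is\t141.806\t1.41%
van\t138.414\t1.37%
"""


def parse_rows(text: str) -> list[tuple[str, int, float]]:
    """Split tab-separated rows into (token, count, relative frequency in %)."""
    rows = []
    for line in text.strip().splitlines():
        token, freq, relfreq = line.split("\t")
        # Drop the thousands separator before converting the count to an integer.
        rows.append((token, int(freq.replace(".", "")), float(relfreq.rstrip("%"))))
    return rows


rows = parse_rows(TOP_TEN)

# Combined share of the ten most frequent tokens.
top_ten_share = sum(relfreq for _, _, relfreq in rows)

# Rank of "is" in the list (1-based).
is_rank = [token for token, _, _ in rows].index("is") + 1

# English TenTen figures quoted above: occurrences of "is" and total tokens.
tenten_is_pct = 100.0 * 619_382_856 / 61_450_334_702

print(f"Top ten share: {top_ten_share:.2f}%")
print(f"Rank of 'is': {is_rank}")
print(f"English 'is' share: {tenten_is_pct:.2f}%")
```

The point of the comparison is that even a very frequent word takes up only a percent or so of all tokens, because the remaining share is spread over a very large vocabulary.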

Brian · 4 December 2024 13:54

Hi Vincent,
That clears it up for me. Appreciated.

Out of curiosity… does the 9.31% token “.” have a special significance?

namaste
Brian

vincent · 4 December 2024 14:33

It is just the sentence or utterance separator, since spoken language of course has no punctuation.
