(Cross posting from !stupid_questions@lemmings.world https://lemmings.world/post/25535229)
A brand new sentence is a sentence which we consider to have never been spoken, written, or thought of before (at least among those which are not recorded). And while just swapping one noun for another (for example the name of a place or person) may technically produce a new sentence, it doesn't really match the spirit of a brand new sentence.
For the linguists: can you try to come up with a better estimate (better than just (number of words)^(average sentence length))? Maybe by using the different classes of verbs (like we consider in NLP: verbs which take a DP, a CP), then adding standard adjectives, and finishing with the remaining grammar (sorry if I am getting it all wrong, it has been a while since I took my intro to linguistics class). Also, consider a morpheme-less form. The point of this exercise is a more realistic guess.
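For scale, the naive bound is trivial to compute; here's a rough Python sketch (the vocabulary size and the sentence length are just assumed placeholders):

```python
# Naive upper bound on distinct sentences: vocab ** avg_length.
vocab_size = 170_000          # assumed rough size of the English vocabulary
avg_sentence_length = 18      # assumed average words per sentence
naive_bound = vocab_size ** avg_sentence_length
print(f"~{naive_bound:.2e} possible word strings")  # ~1.4e94
```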
This is really, really messy to calculate; it depends on a lot of factors.
With all of that said, I've calculated this for a specific situation: the corpus contains a single one-word sentence, you utter a new one-word sentence, and you want to know the odds they're identical.
The chance both sentences are identical because they both use the word of Zipf rank n will be pₙ = (k/n)², where "k" is the frequency of the most common word.
The odds both sentences are identical will be the sum of all the odds above, so p = (k/1)² + (k/2)² + (k/3)² + … = k²·Σ 1/n².
Technically this is not an infinite series, because the vocab isn't infinite, but it's easier if we pretend that it is - because then the second factor becomes a convergent series, known to converge to π²/6 ≈ 1.64. So we can simplify the formula to p = 1.64·k².
For example, the first link contains Zipf’s Law data for English, with “the” at 7% and “of” at 3.5%; so for English k=0.07. So the odds both one-word sentences are identical, in English, are (0.07)²*1.64 = 8.0*10⁻³ = 0.8%.
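As a sanity check, here's a minimal Python sketch of that calculation; the 170,000-word cutoff for the finite sum is just an assumption:

```python
import math

# Odds two independent one-word sentences use the same word,
# assuming Zipf's law: P(word of rank n) = k / n, with k = 0.07
# being the frequency of "the" in English.
k = 0.07

# Infinite-vocabulary approximation: p = k² · Σ 1/n² = k² · π²/6
p_approx = k**2 * math.pi**2 / 6
print(f"approximation: {p_approx:.4f}")   # ~0.0081, i.e. ~0.8%

# Truncating at a finite vocabulary barely changes the result.
vocab_size = 170_000  # assumed vocabulary size
p_finite = k**2 * sum(1 / n**2 for n in range(1, vocab_size + 1))
print(f"finite sum:    {p_finite:.4f}")
```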
Once you enlarge the corpus from one previous sentence to two or more, or try to handle different sentence sizes, my brain turns to mush.
I think you have already given a pretty good and structured method to analyse the situation. Thanks!
Let's consider a world where everyone speaks English. Let's also assume some average sentence length. (I don't think sentence lengths follow a normal distribution; I think they follow a Poissonian or something alike, essentially peaking around the mean, with sentences getting rarer as they get longer, but with the rate of decrease becoming practically constant.) I think this is also a better fit for a language like English, where with enough conjuncts and clause phrase substitutions (with verbs and DPs here and there) sentences can get to virtually infinite length. English even has the loophole of ';', which is kind of like a full stop but does not really count as one. (I do not really know how this is classified in linguistics; my guess is that it would be a conjunction, but then some overpowered kind, which allows you to break regular grammar rules.)
If we pick a sentence structure for each verb class (intransitive (not sure if this is what it is called, but the ones which only require a subject or an object) and transitive (normal and di-)), and then get similar Zipfian data for the occurrence of these, we can cover a lot of sentences. Since the sentences do not really have to mean anything ("I ate broccoli" is just as valid as "broccoli ate me"), we can then just have a list of all DPs: pick 0, 1, 2 or 3 adjectives, and maybe 1 numP (IIRC, we were told that we usually do not use more than 3 adjectives), so we have a complete DP; then take all permutations of DPs with each verb class, weighted by Zipf.
If we have m = the number of adjectives, n = the number of numPs, and o = the number of distinct nouns (let's say just all nouns in a big source like Wikipedia), then the number of possible DPs is (m + 1)³ · (n + 1) · o (the plus 1 is for null). Let's call this some constant D. Then for a verb v with Zipf frequency z, all the possible frames would be D-v, v-D, D-v-D, and D-v-D-D. This is assuming CPs don't exist. Adding all the cases would give a very big number; then just divide by the number of speakers (available, in a speaking condition, every second). I think this should give some kind of estimate.
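To make the counting concrete, here's a minimal Python sketch; every count below is an invented placeholder, and it ignores the Zipf weighting and CPs entirely:

```python
# Rough count of possible sentences under the scheme above.
# All sizes are invented placeholders, not real lexicon counts.
m = 30_000    # distinct adjectives
n = 100       # distinct numPs
o = 100_000   # distinct nouns (say, everything in a big source like Wikipedia)

# A DP: up to 3 adjective slots (each may be null, hence the +1),
# an optional numP, and one noun.
D = (m + 1) ** 3 * (n + 1) * o

verbs = 10_000  # assumed number of verbs
# Frames per verb: D-v and v-D (intransitive), D-v-D (transitive),
# D-v-D-D (ditransitive); CPs are ignored.
frames_per_verb = 2 * D + D**2 + D**3

total = verbs * frames_per_verb
print(f"D ≈ {D:.2e}, total sentences ≈ {total:.2e}")
```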
Please correct me on stuff I got wrong; I am very new to this stuff.
Poisson does make more sense, and it would be easier to work with. In that case the odds of a single sentence having a specific length n would be

p = (λ^n)·e^(−λ) / n!

For English, λ should be around 18 words/sentence.

The semicolon is simply punctuation; a conjunction would be a word, like "and". Since the semicolon is mostly used to connect related albeit independent sentences, I think it's fair to treat it like a full stop.
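A quick Python sketch of that distribution, just to see the numbers (λ = 18 as above):

```python
import math

# Poisson probability that a sentence is exactly n words long.
def sentence_length_prob(n: int, lam: float = 18.0) -> float:
    return lam**n * math.exp(-lam) / math.factorial(n)

print(f"P(length = 18) = {sentence_length_prob(18):.4f}")  # ~0.0936, the peak
print(f"P(length = 5)  = {sentence_length_prob(5):.6f}")   # ~0.000240
print(f"P(length = 40) = {sentence_length_prob(40):.6f}")  # ~3e-06, long sentences get rare fast
```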
So am I - my main area of interest is Historical Linguistics, so I’m completely clueless about this stuff. I never thought the statistics classes I got 20y ago in a Chemistry grad would help me with this, but here we are.
Isn't that really huge? Does an average sentence really have 18 words? I would love the source.
My statistics comes from QM 1, QM 2, and optics classes.
I remember this number from style manuals, but the sources I've found online are actually consistent with it - this one for example claims 15~20 words. It seems to vary an awful lot depending on the topic and the author, though; plus the source above is mostly prescriptive, so take it with a grain of salt.
My guess would have been something like 5-10 words (maybe 7). Maybe in literature it would be much higher, since the writing ability of people writing literature (technical or not) is much better than the average stuff an average person says. Averages have to include kids under 10, and even 5-year-olds, who might have a hard time stringing 10 words together in a logical manner. Still seems like a crazy fact to me.