(Cross posting from !stupid_questions@lemmings.world https://lemmings.world/post/25535229)
A brand new sentence is a sentence which we consider to have never been spoken, written, or thought of before (at least among those which are not recorded). And while just swapping one noun for another (for example the name of a place or person) may technically produce a new sentence, it doesn't really match the spirit of a brand new sentence.
For the linguists: can you try to come up with a better estimate (better than just (number of words)^(average sentence length))? Maybe by using the different classes of verbs (like we consider in NLP: verbs which take a DP, a CP), then adding standard adjectives, and finishing with the remaining grammar (sorry if I am getting it all wrong, it has been a while since I took my intro to linguistics class). Also, consider a morpheme-less form. The point of this exercise is a more realistic guess.
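For scale, the naive bound is trivial to compute; here's a rough Python sketch (the vocabulary size and the sentence length are just assumed placeholders):

```python
# Naive upper bound on distinct sentences: vocab ** avg_length.
vocab_size = 170_000          # assumed rough size of the English vocabulary
avg_sentence_length = 18      # assumed average words per sentence
naive_bound = vocab_size ** avg_sentence_length
print(f"~{naive_bound:.2e} possible word strings")  # ~1.4e94
```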
This is really, really messy to calculate; it depends on a lot of factors.
With all of that said, I've calculated this for a specific situation: the corpus contains a single one-word sentence, you utter a new one-word sentence, and you want to know the odds they're identical.
The chance both sentences are identical because they both use the word of Zipf rank n will be pₙ = (k/n)², where "k" is the frequency of the most common word.
The odds both sentences are identical will be the sum of all the odds above, so p = (k/1)² + (k/2)² + (k/3)² + … = k²·Σ 1/n².
Technically this is not an infinite series, because the vocab isn't infinite, but it's easier if we pretend that it is - because then the second factor becomes a convergent series, known to converge to π²/6 ≈ 1.64. So we can simplify the formula to p = 1.64·k².
For example, the first link contains Zipf’s Law data for English, with “the” at 7% and “of” at 3.5%; so for English k=0.07. So the odds both one-word sentences are identical, in English, are (0.07)²*1.64 = 8.0*10⁻³ = 0.8%.
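As a sanity check, here's a minimal Python sketch of that calculation; the 170,000-word cutoff for the finite sum is just an assumption:

```python
import math

# Odds two independent one-word sentences use the same word,
# assuming Zipf's law: P(word of rank n) = k / n, with k = 0.07
# being the frequency of "the" in English.
k = 0.07

# Infinite-vocabulary approximation: p = k² · Σ 1/n² = k² · π²/6
p_approx = k**2 * math.pi**2 / 6
print(f"approximation: {p_approx:.4f}")   # ~0.0081, i.e. ~0.8%

# Truncating at a finite vocabulary barely changes the result.
vocab_size = 170_000  # assumed vocabulary size
p_finite = k**2 * sum(1 / n**2 for n in range(1, vocab_size + 1))
print(f"finite sum:    {p_finite:.4f}")
```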
Once you enlarge the corpus from one previous sentence to two or more, or try to handle different sentence sizes, my brain turns to mush.
I think you have already given a pretty good and structured method to analyse the situation. Thanks!
Let's consider a world where everyone speaks English. Let's also assume some average sentence length. (I don't think sentence lengths follow a normal distribution; I think they follow a Poissonian or something alike, essentially peaking around the mean, with sentences getting rarer as they get longer, but with the rate of decrease becoming practically constant.) I think this is also a better fit for a language like English, where with enough conjuncts and clause phrase substitutions (with verbs and DPs here and there) sentences can get to virtually infinite length. English even has the loophole of ';', which is kind of like a full stop but does not really count as one. (I do not really know how this is classified in linguistics; my guess is that it would be a conjunction, but then some overpowered kind, which allows you to break regular grammar rules.)
If we pick a sentence structure for each verb class (intransitive (not sure if this is what it is called, but the ones which only require a subject or an object) and transitive (normal and di-)), and then get similar Zipfian data for the occurrence of these, we can cover a lot of sentences. Since the sentences do not really have to mean anything ("I ate broccoli" is just as valid as "broccoli ate me"), we can then just have a list of all DPs: pick 0, 1, 2 or 3 adjectives, and maybe 1 numP (IIRC, we were told that we usually do not use more than 3 adjectives), so we have a complete DP; then take all permutations of DPs with each verb class, weighted by Zipf.
If we have m = the number of adjectives, n = the number of numPs, and o = the number of distinct nouns (let's say just all nouns in a big source like Wikipedia), then the number of possible DPs is (m + 1)³ · (n + 1) · o (the plus 1 is for null). Let's call this some constant D. Then for a verb v with Zipf frequency z, all the possible frames would be D-v, v-D, D-v-D, and D-v-D-D. This is assuming CPs don't exist. Adding all the cases would give a very big number; then just divide by the number of speakers (available, in a speaking condition, every second). I think this should give some kind of estimate.
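To make the counting concrete, here's a minimal Python sketch; every count below is an invented placeholder, and it ignores the Zipf weighting and CPs entirely:

```python
# Rough count of possible sentences under the scheme above.
# All sizes are invented placeholders, not real lexicon counts.
m = 30_000    # distinct adjectives
n = 100       # distinct numPs
o = 100_000   # distinct nouns (say, everything in a big source like Wikipedia)

# A DP: up to 3 adjective slots (each may be null, hence the +1),
# an optional numP, and one noun.
D = (m + 1) ** 3 * (n + 1) * o

verbs = 10_000  # assumed number of verbs
# Frames per verb: D-v and v-D (intransitive), D-v-D (transitive),
# D-v-D-D (ditransitive); CPs are ignored.
frames_per_verb = 2 * D + D**2 + D**3

total = verbs * frames_per_verb
print(f"D ≈ {D:.2e}, total sentences ≈ {total:.2e}")
```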
Please correct me on stuff I got wrong; I am very new to this stuff.
Poisson does make more sense, and it would be easier to work with. In that case the odds of a single sentence having a specific length n would be

p = (λ^n)·e^(−λ) / n!

For English, λ should be around 18 words/sentence.

The semicolon is simply punctuation; a conjunction would be a word, like "and". Since the semicolon is mostly used to connect related albeit independent sentences, I think it's fair to treat it like a full stop.
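A quick Python sketch of that distribution, just to see the numbers (λ = 18 as above):

```python
import math

# Poisson probability that a sentence is exactly n words long.
def sentence_length_prob(n: int, lam: float = 18.0) -> float:
    return lam**n * math.exp(-lam) / math.factorial(n)

print(f"P(length = 18) = {sentence_length_prob(18):.4f}")  # ~0.0936, the peak
print(f"P(length = 5)  = {sentence_length_prob(5):.6f}")   # ~0.000240
print(f"P(length = 40) = {sentence_length_prob(40):.6f}")  # ~3e-06, long sentences get rare fast
```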
So am I - my main area of interest is Historical Linguistics, so I’m completely clueless about this stuff. I never thought the statistics classes I got 20y ago in a Chemistry grad would help me with this, but here we are.
Isn't that really huge? Does an average sentence really have 18 words? I would love the source.
My statistics comes from QM 1, QM 2, and optics classes.
I remember this number from style manuals, but the sources I've found online are actually consistent with it - this one for example claims 15~20 words. It seems to vary an awful lot depending on the topic and the author, though; plus the source above is mostly prescriptive, so take it with a grain of salt.
My guess would have been something like 5-10 words (maybe 7). Maybe in literature it would be much higher, since the writing ability of people writing literature (technical or not) is much better than the average stuff an average person says. Averages have to include kids under 10, and even 5-year-olds, who might have a hard time stringing 10 words together in a logical manner. Still seems like a crazy fact to me.