The faulty logic was supported by a previous study from 2019
It can’t. It just fucking can’t. We’re all pretending it does, but it fundamentally can’t.
https://appleinsider.com/articles/24/10/12/apples-study-proves-that-llm-based-ai-models-are-flawed-because-they-cannot-reason
Creative thinking is still a long way beyond reasoning as well. We’re not close yet.
This directly applies to the human journalist. Studies on other models from six years ago are pretty much irrelevant, and this one apparently tested very small distilled models that you can run on consumer hardware at home (Llama3 8B, lol).
Anyway, this study seems like trash if its conclusion is that small, fine-tuned models failing to account for human misdirection (user compliance includes not suspecting intentionally misleading prompts) somehow means “no evidence of formal reasoning”. “Formal” means formal logic and formal operations, not reasoning in general; we use informal reasoning for the vast majority of what we do daily, and we also rely on “sophisticated pattern matching”, lmao, it’s called cognitive heuristics. Kahneman won the Nobel prize for recognizing type 1 and type 2 thinking in humans.
Why don’t you go repeat the experiment yourself on Hugging Face (accounts are free, there are over ten models to test, and many are the same ones the study used) and see what actually happens? Try it with reasoning models that show their chain of thought, like R1 or QwQ, see for yourself, and report back. It would be intellectually honest to verify things, since we’re talking about critical thinking here.
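For anyone who wants to try it, here is a minimal sketch of that experiment using the transformers library. The model name is just one small open chat model picked as an example (swap in an R1 distill to see visible chain of thought), the prompt is a paraphrase of the widely quoted distractor example from the Apple preprint, and exact output handling may vary with your transformers version.

```python
# Minimal sketch of the GSM-NoOp-style distractor test discussed above.
# Model choice is an arbitrary example of a small open chat model on Hugging Face;
# swap in any other (e.g. a DeepSeek-R1 distill to see visible chain of thought).
from transformers import pipeline

PROMPT = (
    "Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. On Sunday he "
    "picks double the number of kiwis he picked on Friday, but five of them "
    "are a bit smaller than average. How many kiwis does Oliver have?"
)

chat = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B-Instruct")
result = chat([{"role": "user", "content": PROMPT}], max_new_tokens=512, do_sample=False)
print(result[0]["generated_text"][-1]["content"])

# The "smaller than average" clause is irrelevant: the answer is 44 + 58 + 88 = 190.
# A model that subtracts the five kiwis (answering 185) has been pulled off course
# by the distractor, which is the kind of failure the paper reports.
```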
Oh, and add a control group: a comparison with average human performance, to see the really funny but hidden part. Pro tip: CS STEMlords catastrophically suck when larping as cognitive scientists.
So you say I should be intellectually honest and do the experiment myself, then say that my experiment is going to be shit anyway? Sure… That’s also intellectually honest.
Here’s the thing.
My education is in physics, not CS. I know enough to know that whatever I try isn’t going to be really valid.
But unless you have peer-reviewed research to show otherwise, I’d take your home-grown experiment to be about as valid as mine.
And here’s experimental verification that humans lack formal reasoning when sentences don’t precisely spell it out for them: all the models they tested except the ChatGPT-4 and o1 variants are 27B parameters or below, all the way down to Phi-3, which is an SLM, a small language model with only 3.8B parameters. ChatGPT-4 reportedly has 1.8T parameters.
1.8 trillion > 3.8 billion
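To make those parameter counts concrete, here is a quick back-of-the-envelope sketch of the weight memory they imply; the per-precision byte counts are standard, and the 1.8T GPT-4 figure is the rumored number cited above, not an official one.

```python
# Rough weight-memory footprint: parameters x bytes per parameter.
# Parameter counts are the ones cited in this thread; the GPT-4 number is a rumor.
MODELS = {
    "Phi-3-mini (SLM)": 3.8e9,
    "Mistral 7B": 7e9,
    "largest open model tested (~27B)": 27e9,
    "GPT-4 (rumored)": 1.8e12,
}
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "4-bit": 0.5}

for name, params in MODELS.items():
    sizes = ", ".join(
        f"{prec}: {params * b / 1e9:,.1f} GB" for prec, b in BYTES_PER_PARAM.items()
    )
    print(f"{name}: {sizes}")

# Phi-3 at ~4 bits per weight is roughly 1.9 GB, i.e. small enough for a laptop,
# while 1.8T parameters at fp16 would be ~3.6 TB of weights alone.
```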
ChatGPT-4’s performance difference (accuracy drop) from the regular benchmark was a whopping −0.3 points, versus a −9.2 drop for Mistral 7B.
Yes, there were massive differences. No, they didn’t show significance, because they barely did any real stats. The models I suggested you try for yourself are not included in the test, and the ones they did use are known to have significant limitations. Intellectual honesty would require reading the actual “study”, though, instead of doubling down.
Maybe consider the possibility that
a. STEMlords in general may know how to run benchmarks, but not how to do cognitive-testing-style testing or how to use the statistical methods from that field
b. this study is an example of the “I’m just messing around trying to confuse LLMs with sneaky prompts instead of doing real research because I need a publication without doing work” type of study, the equivalent of students making ChatGPT do their homework
c. a 3.8B-parameter model is between about 1.8 and 2.2 gigabytes on disk
d. not that “peer review” is required for criticism, lol, but that’s a preprint on arXiv; the “study” itself hasn’t been peer reviewed or properly published anywhere (how many months is it from October 2024 to May 2025?)
e. showing some qualitative difference between quantitatively different things without reporting p-values or using any weighting is garbage statistics (a minimal sketch of what that could look like follows this list)
f. you can try the experiment yourself, because the models I suggested have visible chain of thought, so you’ll see whether, and over what, they get confused
g. when there are graded performance differences, with several models reliably not getting confused at least more than half the time, and you still say they “fundamentally can’t reason”, you may be fundamentally misunderstanding what the word means
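On point e, a minimal example of what “real stats” could look like: treat each benchmark question as a Bernoulli trial and test whether the accuracy drop is bigger than noise. The counts below are invented purely for illustration (they are not from the paper), and statsmodels is an assumed dependency.

```python
# Hypothetical illustration of point (e): per-item significance testing of an
# accuracy drop. The counts are made up for illustration; they are NOT from the paper.
from statsmodels.stats.proportion import proportions_ztest

n_items = 1000                 # hypothetical number of benchmark questions
correct_original = 840         # hypothetical: 84.0% accuracy on the original set
correct_perturbed = 815        # hypothetical: 81.5% accuracy on the perturbed set

z_stat, p_value = proportions_ztest(
    count=[correct_original, correct_perturbed],
    nobs=[n_items, n_items],
)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")

# With these made-up numbers, a 2.5-point drop on 1000 items is not clearly
# significant (p ~ 0.14), whereas a 9-point drop would be (p << 0.001).
# That is the difference a p-value or a bootstrap CI makes visible.
```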
Need more clarifications instead of reading the study or performing basic fun experiments? At least be intellectually curious or something.
And still nothing peer reviewed to show?
Synthetic benchmarks mean nothing. I don’t care how much context something can store when the context being stored is putting glue on pizza.
Again, I’m looking for some academic sources (doesn’t have to be STEM; education would be preferred here) showing that the current tech is close to useful.
It can, and it has, done creative mathematical proof work. Nothing spectacular, but at least on par with a mathematics grad student.
Specialized AI like that is not what most people know as AI. When most people say AI, they’re referring to LLMs.
Specialized AI, like the one showcased, is still decades away from generalized creative thinking. You can’t ask it to do a science experiment with a class, because it just can’t. It’s only built for math proofs.
Again, my argument isn’t that it will never exist.
Just that it’s so far off that it’d be like trying to write smartphone laws in the 90s. We would have had only pipe dreams as to what the tech could be, never mind its broader social context.
So talk to me when it can deliver, in the case of this thread, clinically validated ways of teaching. We’re still decades from that.
Show me a human that can do it.
https://scholar.google.com/scholar?hl=en&as_sdt=0%2C10&q=children+learning+from+humans#d=gs_qabs&t=1747921831528&u=%23p%3DDqyOK2jEfjQJ
EDIT: you can literally get a PhD in many forms of education and have an entire career studying it.
It’s already capable of doing a lot, and there is reason to expect it will get better over time. If we stick our fingers in our ears and pretend that’s not possible, we will not be prepared.
If you actually read about it, it’s capable of very little beneath the surface of what it appears to be.
Show me one that is well studied, at something like clinical-trial level, and then we’ll talk.
We’re decades away at this point.
My overall point is that it’s just as meaningless to talk about now as it was in the 90s, because we can’t conceive of what a functioning product will be, never mind its context in a greater society. When we have it, we can discuss it then, since we’ll have something tangible to discuss. But where we’ll be in decades is hard to regulate now.
AlphaFold. We’re not decades away. We’re years away at worst.
If you assume the unlimited power that would be needed right now to run AlphaFold at the scale of all human education.
We have, at best, proofs of concept that computers can talk. But LLMs don’t have any way of actually knowing anything behind what they say. That’s kinda the problem.
And it’s not a “we’ll figure out the one trick” situation; more fundamentally, how it works doesn’t allow for that to happen.