• dejpivo
    +6 / -41 · 1 day ago

    How is this kind of testing relevant anymore? Isn’t it creating an unrealistic situation, given the brave new world of AI everywhere?

      • FourWaveforms@lemm.ee
        +10 / -22 · edited · 1 day ago

        But what good is that if AI can do it anyway?

        That is the crux of the issue.

        Years ago the same thing was said about calculators, then graphing calculators. I had to drop a stat class and take it again later because the dinosaur didn’t want me to use a graphing calculator. I have ADD (undiagnosed at the time) and the calculator was a big win for me.

        Naturally they were all full of shit.

        But this? This is different. AI is currently as good as a graphing calculator for some engineering tasks, horrible for some others, excellent at still others. It will get better over time. And what happens when it’s awesome at everything?

        What is the use of being the smartest human when you’re easily outclassed by a machine?

        If we get fully automated yadda yadda, do many of us turn into mush-brained idiots who sit around posting all day? Everyone retires and builds Adirondack chairs and sips mint juleps and whatever? (That would be pretty sweet. But how to get there without mass starvation and unrest?)

        Alternately, do we have to do a Butlerian Jihad to get rid of it, and threaten execution to anyone who tries to bring it back… only to ensure we have capitalism and poverty forever?

        These are the questions. You have to zoom out to see them.

        • Natanael@infosec.pub
          +24 / -1 · edited · 1 day ago

          Because if you don’t know how to tell when the AI succeeded, you can’t use it.

          To know when it succeeded, you must know the topic.

          A calculator is predictable and verifiable. An LLM is not.

          • FourWaveforms@lemm.ee
            +2 / -13 · 1 day ago

            I’m not sure what you’re implying. I’ve used it to solve problems that would’ve taken days to figure out on my own, and my solutions might not have been as good.

            I can tell whether it succeeded because its solutions either work, or they don’t. The problems I’m using it on have that property.
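
            Concretely, "work or don't" looks something like this: treat the model's output as untrusted and only accept it if it passes checks you wrote yourself. A rough sketch, with a toy sorting task and a placeholder llm_generate() standing in for whatever model or API is actually used:

```python
# Rough sketch: accept an LLM-suggested solution only if it passes tests you control.
# llm_generate() is a placeholder; in practice it would call whatever model you use.

def llm_generate(prompt: str) -> str:
    # Canned "model answer" so the sketch runs end to end without a real API call.
    return "def my_sort(xs):\n    return sorted(xs)"

def passes_my_tests(source: str, cases) -> bool:
    """Run the candidate code in a scratch namespace and compare it to known-good answers."""
    namespace = {}
    try:
        exec(source, namespace)              # the candidate must define my_sort(xs)
        my_sort = namespace["my_sort"]
        return all(my_sort(list(c)) == sorted(c) for c in cases)
    except Exception:
        return False

cases = [[3, 1, 2], [], [5, 5, 1], list(range(10, 0, -1))]
answer = llm_generate("Write a Python function my_sort(xs) that returns xs sorted ascending.")
print("accepted" if passes_my_tests(answer, cases) else "rejected")
```

            (Obviously untrusted code shouldn't be exec'd outside a sandbox; the point is only that acceptance is decided by the tests, not by trusting the model.)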

            • shoo@lemmy.world
              +4 / -1 · 23 hours ago

              The problem is offloading critical thinking to a black box of questionably motivated design. Did you use it to solve problems, or did you use it to find a sufficient approximation of a solution? If you can't deduce why the given solution works, then whether your problem is solved is literally unknowable; you're just putting faith in an algorithm.

              There are also political reasons we'll never get luxury gay space communism from it. General AI is the wet dream of every authoritarian: an unverifiable, omnipresent, first-line source of truth that will shift the narrative to whatever you need.

              The brain is a muscle and critical thinking is trained through practice; not thinking will never be a shortcut for thinking.

            • Natanael@infosec.pub
              +3 · 23 hours ago

              That says more about you.

              There are a lot of cases where you cannot know if it worked unless you have expertise.

              • FourWaveforms@lemm.ee
                +1 / -1 · edited · 3 hours ago

                This still seems too simplistic. You say you can’t know whether it’s right unless you know the topic, but that’s not a binary condition. I don’t think anyone “knows” a complex topic to its absolute limits. That would mean they had learned everything about it that could be learned, and there would be no possibility of there being anything else in the universe for them to learn about it.

                An LLM can help fill in gaps, and you can use what you already know as well as credible resources (e.g., textbooks) to vet its answer, just as you would use the same knowledge to vet your own theories. You can verify its work the same way you'd verify your own. The value is that it may add information or some part of a solution that you wouldn't have. The risk is that it misunderstands something, but that risk exists for your own theories as well.

                This approach requires skepticism. The risk would be that the person using it isn’t sufficiently skeptical, which is the same problem as relying too much on their own opinions or those of another person.

                For example, someone studying statistics for the first time would want to vet any non-trivial answer against the textbook or the professor rather than assuming the answer is correct. Whether the answer comes from themselves, the student in the next row, or an LLM doesn't matter.
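
                To make the statistics example concrete, "vet the answer" can be as simple as checking a claimed result by brute force instead of taking it on faith. A rough sketch; the specific claim (the chance that two dice sum to 9) and the numbers are purely illustrative:

```python
# Rough sketch: vet a claimed probability by exact enumeration rather than trusting the source.
from fractions import Fraction
from itertools import product

# Claim to vet (it could come from an LLM, a classmate, or your own scratch work):
#   P(two fair dice sum to 9) = 1/9
claimed = Fraction(1, 9)

outcomes = list(product(range(1, 7), repeat=2))          # all 36 equally likely rolls
favourable = sum(1 for a, b in outcomes if a + b == 9)   # (3,6), (4,5), (5,4), (6,3)
exact = Fraction(favourable, len(outcomes))

print(f"claimed {claimed}, exact {exact}:", "matches" if exact == claimed else "does not match")
```

                The same move works for a textbook formula: plug in a small case you can count by hand, or simulate it, before relying on it.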

          • pinkapple@lemmy.ml
            +1 / -1 · 5 hours ago

            The faulty logic was supported by a previous study from 2019

            This directly applies to the human journalist; studies on other models from 6 years ago are pretty much irrelevant, and this one apparently tested very small distilled models that you can run on consumer hardware at home (Llama3 8B, lol).

            Anyway, this study seems trash if their conclusion is that small and fine-tuned models (user compliance includes not suspecting intentionally wrong prompts) failing to account for human misdirection somehow means "no evidence of formal reasoning". "Formal reasoning" means using formal logic and formal operations, not reasoning in general; we use informal reasoning for the vast majority of what we do daily, and we also rely on "sophisticated pattern matching", lmao, it's called cognitive heuristics. Kahneman won the Nobel prize for recognizing type 1 and type 2 thinking in humans.

            Why don’t you go repeat the experiment yourself on huggingface (accounts are free, over ten models to test, actually many are the same ones the study used) and see what actually happens? Try it on model chains that have a reasoning model like R1 and Qwant and just see for yourself and report back. It would be intellectually honest to verify things since we’re talking about critical thinking in here.

            Oh, and add a control group: a comparison with average human performance, to see what the really funny but hidden part is. Pro tip: CS STEMlords catastrophically suck at larping as cognitive scientists.
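
            For anyone who does try it, a rough sketch of the setup: ask the same word problem with and without an irrelevant clause and compare the answers. The model id (Phi-3-mini, one of the small models the study used) is just one plausible choice, and the prompts mimic the much-quoted kiwi example from this line of work; swap in whichever chat or reasoning model you want to poke at.

```python
# Rough sketch of a perturbation test: does an irrelevant clause change the model's answer?
# Assumes `pip install transformers torch`; the model id is one plausible choice among many.
from transformers import pipeline

pipe = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct")

plain = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. On Sunday he picks "
         "double the number of kiwis he picked on Friday. How many kiwis does he have? ")
distracted = ("Oliver picks 44 kiwis on Friday and 58 kiwis on Saturday. On Sunday he picks "
              "double the number of kiwis he picked on Friday, but five of them are a bit "
              "smaller than average. How many kiwis does he have? ")

for label, prompt in [("plain", plain), ("with irrelevant clause", distracted)]:
    out = pipe(prompt, max_new_tokens=200, do_sample=False)[0]["generated_text"]
    print(f"--- {label} ---\n{out[len(prompt):].strip()}\n")
```

            The correct answer is 190 in both versions; the thing to watch is whether the model subtracts the five smaller kiwis in the second one.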

            • DoPeopleLookHere@sh.itjust.works
              +1 / -1 · 5 hours ago

              So you say I should be intellectually honest by doing the experiment myself, then say that my experiment is going to be shit anyways? Sure… That’s also intellectually honest.

              Here’s the thing.

              My education is in physics, not CS. I know enough to know what I try isn’t going to be really valid.

              But unless you have peer-reviewed research to show otherwise, I would take your home-grown experiment to be about as valid as mine.

              • pinkapple@lemmy.ml
                +1 / -1 · 4 hours ago

                And here’s experimental verification that humans lack formal reasoning when sentences don’t precisely spell it out for them: all the models they tested except chatGPT4 and o1 variants are from 27B and below, all the way to Phi-3 which is an SLM, a small language model with only 3.8B parameters. ChatGPT4 has 1.8T parameters.

                1.8 trillion > 3.8 billion

                ChatGPT4’s performance difference (accuracy drop) with regular benchmarks was a whooping -0.3 versus Mistral 7B -9.2 drop.

                Yes there were massive differences. No, they didn’t show significance because they barely did any real stats. The models I suggested you try for yourself are not included in the test and the ones they did use are known to have significant limitations. Intellectual honesty would require reading the actual “study” though instead of doubling down.

                Maybe consider the possibility that:
                a. STEMlords in general may know how to do benchmarks, but not cognitive-testing-style testing or how to use statistical methods from that field
                b. this study is an example of the "I'm just messing around trying to confuse LLMs with sneaky prompts instead of doing real research because I need a publication without work" type of study, equivalent to students making chatGPT do their homework
                c. 3.8B models = the size in bytes is between 1.8 and 2.2 gigabytes
                d. not that "peer review" is required for criticism, lol, but that's a preprint on arxiv; the "study" itself hasn't been peer reviewed or properly published anywhere (how many months are there between October 2024 and May 2025?)
                e. showing some qualitative difference between quantitatively different things without showing p and without using weights is garbage statistics (a minimal version of such a check is sketched at the end of this comment)
                f. you can try the experiment yourself, because the models I suggested have visible Chain of Thought and you'll see if and over what they get confused
                g. when there are graded performance differences, with several models reliably not getting confused at least more than half the time, but you say "fundamentally can't reason", you may be fundamentally misunderstanding what the word means

                Need more clarifications instead of reading the study or performing basic fun experiments? At least be intellectually curious or something.
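
                And so point (e) doesn't sound hand-wavy: for accuracy numbers, the minimal honest check is something like a two-proportion test on how many items were answered correctly before and after the perturbation. The counts below are made up purely to show the mechanics (with paired per-question results you would reach for McNemar's test instead):

```python
# Rough sketch: two-sided two-proportion z-test on accuracy before/after perturbation.
# The counts are made up for illustration; they are NOT taken from the paper.
import math

def two_proportion_z(correct_a: int, n_a: int, correct_b: int, n_b: int):
    """Return (z, two-sided p) for H0: the two accuracies are equal (normal approximation)."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Example: 87/100 correct on plain questions vs. 64/100 once distractor clauses are added.
z, p = two_proportion_z(87, 100, 64, 100)
print(f"accuracy 0.87 -> 0.64, z = {z:.2f}, two-sided p = {p:.4g}")
```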

                • DoPeopleLookHere@sh.itjust.works
                  +1 · 3 hours ago

                  And still nothing peer reviewed to show?

                  Synthetic benchmarks mean nothing. I don't care how much context something can store when the context being stored is putting glue on pizza.

                  Again, I’m looking for some academic sources (doesn’t have to be stem, education would be preferred here) that the current tech is close to useful.

          • FourWaveforms@lemm.ee
            +7 / -10 · 1 day ago

            It’s already capable of doing a lot, and there is reason to expect it will get better over time. If we stick our fingers in our ears and pretend that’s not possible, we will not be prepared.

            • DoPeopleLookHere@sh.itjust.works
              +3 / -3 · edited · 22 hours ago

              If you read up on it, it's capable of very little under the surface of what it appears to be.

              Show me one that is well studied, like clinical trial levels, then we’ll talk.

              We’re decades away at this point.

              My overall point is that it's just as meaningless to talk about now as it was in the 90s, because we can't conceive of what a functioning product will be, never mind its context in a greater society. When we have it, we can discuss it then, as we'll have something tangible to discuss. But where we'll be in decades is hard to regulate now.

                • DoPeopleLookHere@sh.itjust.works
                  +1 · edited · 5 hours ago

                  If you assume the unlimited power needed right now to run AlphaFold at the scale of all human education.

                  We have, at best, proofs of concept that computers can talk. But LLMs don't have any way of actually knowing anything behind what they say. That's kinda the problem.

                  And it’s not a “we’ll figure out the one trick” but more fundamentally how it works doesn’t allow for that to happen.

        • HobbitFoot @thelemmy.club
          +11 / -1 · 1 day ago

          If you want to compare a calculator to an LLM, you could at least reasonably expect the calculator result to be accurate.

          • Zexks@lemmy.world
            +2 · 7 hours ago

            Why? Because you put trust in the producers of said calculators not to fuck it up? Or because you trust others to vet those machines, or are you personally validating them? Unless you're disassembling those calculators and inspecting their chipsets, you're just putting your trust in someone else and claiming "this magic box is more trustworthy".

            • HobbitFoot @thelemmy.club
              +1 · 5 hours ago

              A combination of personal vetting via analyzing output and the vetting of others. For instance, the Pentium calculation error was in the news. Otherwise, calculation by computer processor is understood, and the technology is accepted for use in cases involving human lives.

              In contrast, there are several documented cases in the news where LLMs have been incorrect, to the point where I don't need personal vetting. No one is anywhere close to stating that LLMs can be used in cases involving human lives.