• DoPeopleLookHere@sh.itjust.works
    link
    fedilink
    English
    arrow-up
    1
    arrow-down
    1
    ·
    10 hours ago

    So you say I should be intellectually honest by doing the experiment myself, then say that my experiment is going to be shit anyways? Sure… That’s also intellectually honest.

    Here’s the thing.

    My education is in physics, not CS. I know enough to know what I try isn’t going to be really valid.

    But unless you have peer reviewed searches to show otherwise, because I would take your home grown experiment to be as valid as mine.

    • pinkapple@lemmy.ml
      link
      fedilink
      English
      arrow-up
      1
      arrow-down
      1
      ·
      9 hours ago

      And here’s experimental verification that humans lack formal reasoning when sentences don’t precisely spell it out for them: all the models they tested except chatGPT4 and o1 variants are from 27B and below, all the way to Phi-3 which is an SLM, a small language model with only 3.8B parameters. ChatGPT4 has 1.8T parameters.

      1.8 trillion > 3.8 billion

      ChatGPT4’s performance difference (accuracy drop) with regular benchmarks was a whooping -0.3 versus Mistral 7B -9.2 drop.

      Yes there were massive differences. No, they didn’t show significance because they barely did any real stats. The models I suggested you try for yourself are not included in the test and the ones they did use are known to have significant limitations. Intellectual honesty would require reading the actual “study” though instead of doubling down.

      Maybe consider the possibility that a. STEMlords in general may know how to do benchmarks but not cognitive testing type testing or how to use statistical methods from that field b. this study being an example of a few “I’m just messing around trying to confuse LLMs with sneaky prompts instead of doing real research because I need a publication without work” type of study, equivalent to students making chatGPT do their homework c. 3.8B models = the size in bytes is between 1.8 and 2.2 gigabytes d. not that “peer review” is required for criticism lol but uh, that’s a preprint on arxiv, the “study” itself hasn’t been peer reviewed or properly published anywhere (how many months are there between October 2024 to May 2025?) e. showing some qualitative difference between quantitatively different things without showing p and using weights is garbage statistics f. you can try the experiment yourself because the models I suggested have visible Chain of Thought and you’ll see if and over what they get confused about g. when there are graded performance differences with several models reliably not getting confused at least more than half the time but you say “fundamentally can’t reason” you may be fundamentally misunderstanding what the word means

      Need more clarifications instead of reading the study or performing basic fun experiments? At least be intellectually curious or something.

      • DoPeopleLookHere@sh.itjust.works
        link
        fedilink
        English
        arrow-up
        1
        arrow-down
        1
        ·
        8 hours ago

        And still nothing peer reviewed to show?

        Synethic benchmarks mean nothing. I don’t care how much context someone can store, when the context being stored is putting glue on pizza.

        Again, I’m looking for some academic sources (doesn’t have to be stem, education would be preferred here) that the current tech is close to useful.

        • pinkapple@lemmy.ml
          link
          fedilink
          English
          arrow-up
          1
          arrow-down
          1
          ·
          4 hours ago

          You made huge claims using a non peer reviewed preprint with garbage statistics and abysmal experimental design where they put together 21 bikes and 4 race cars to bury openAI flagship models under the group trend and go to the press with it. I’m not going to go over all the flaws but all the performance drops happen when they spam the model with the same prompt several times and then suddenly add or remove information, while using greedy decoding which will cause artificial averaging artifacts. It’s context poisoning with extra steps i.e. not logic testing but prompt hacking.

          This is Apple (that is falling behind in its AI research) attacking a competitor with fake FUD and doesn’t even count as research, which you’d know if you looked it up and saw you know, opinions of peers.

          You’re just protecting an entrenched belief based on corporate slop so what would you do with peer reviewed anything? You didn’t bother to check the one you posted yourself.

          Or you post corporate slop on purpose and now trying to turn the conversation away from that. Usually the case when someone conveniently bypasses absolutely all your arguments lol.

          • DoPeopleLookHere@sh.itjust.works
            link
            fedilink
            English
            arrow-up
            1
            ·
            1 hour ago

            Okay, here’s a non apple source since you want it.

            https://arxiv.org/abs/2402.12091

            5 Conclusion In this study, we investigate the capacity of LLMs, with parameters varying from 7B to 200B, to com- prehend logical rules. The observed performance disparity between smaller and larger models indi- cates that size alone does not guarantee a profound understanding of logical constructs. While larger models may show traces of semantic learning, their outputs often lack logical validity when faced with swapped logical predicates. Our findings suggest that while LLMs may improve their logical reason- ing performance through in-context learning and methodologies such as COT, these enhancements do not equate to a genuine understanding of logical operations and definitions, nor do they necessarily confer the capability for logical reasoning.