• @Maggoty@lemmy.world
    link
    fedilink
    English
    57
    edit-2
    11 months ago

    Oh, OpenAI is just as illegal as Sci-Hub. More so, because they’re making money off of stolen IP. It’s just that the oligarchs get to pick and choose, so of course they choose the arrangement that gives them more control over knowledge.

    • Lemminary
      link
      fedilink
      English
      -26
      edit-2
      11 months ago

      They’re not serving you the exact content they scraped, and that makes all the difference.

      • @localhost443@discuss.tchncs.de
        link
        fedilink
        English
        21
        11 months ago

        Well, if you believe that, you should look at the Times lawsuit.

        Word-for-word reproduction across hundreds or thousands of pages of stolen content. It’s damning.

        • Lemminary
          link
          fedilink
          English
          -7
          11 months ago

          Why do you assume that I haven’t? The case hasn’t been resolved, and it’s not clear how The NY Times did what they claim, which may as well be manipulation. It’s a fair rebuttal by OpenAI. The Times hasn’t provided the steps they used to achieve that.

          So unless that’s cleared up, it’s not damning in the slightest. Not yet, anyway. And that still doesn’t invalidate my statement above, because it only happens under very specific circumstances.

          • @Emy@lemmy.world
            link
            fedilink
            English
            2
            11 months ago

            Also, intention is pretty important when determining guilt for many crimes. OpenAI doesn’t intentionally spit back an author’s exact words; their intention is to summarize and create unique content.

              • Lemminary
                link
                fedilink
                English
                3
                11 months ago

                No, the real defense is “that’s not how LLMs work,” but you are all hinging on the wrong idea. If you think that an LLM is capable of doing what you claim, I’d love to hear the mechanism in detail and the steps to replicate it.

              • @whofearsthenight@lemm.ee
                link
                fedilink
                English
                2
                11 months ago

                I mean, I’m not sure why this conversation even needs to get this far. If I write an article about the history of Disney movies, and make it very clear the way I got all of those movies was to pirate them, this conversation is over pretty quick. OpenAI and most of the LLMs aren’t doing anything different. The Times isn’t Wikipedia, most of their stuff is behind a paywall with pretty clear terms of service and nothing entitles OpenAI to that content. OpenAI’s argument is “well, we’re pirating everything so it’s okay.” The output honestly seems irrelevant to me, they never should have had the content to begin with.

                • Lemminary
                  link
                  fedilink
                  English
                  2
                  11 months ago

                  That’s not the claim that the Times is making. They’re arguing that OpenAI retains work that the Times made publicly available. OpenAI claims that this is fair use because the work is wholly transformed into nodes, weights and biases, and that they don’t store those articles in a database for reuse. But the Times’ other argument is that OpenAI created a system that threatens its business, which is just ludicrous.

        • Lemminary
          link
          fedilink
          English
          -4
          11 months ago

          What a colorful mischaracterization. It sounds clever at face value but it’s really naive. If anything about this is deceptive, it’s the lengths that people go to to slander what they dislike.

          • @jacksilver@lemmy.world
            link
            fedilink
            English
            2
            11 months ago

            Actually, content laundering is the best term I’ve heard to describe the process. Just like money laundering, you no longer know the source, and now it’s technically legal to use and distribute.

            I mean, if the copyrighted content wasn’t so critical, they would train models without it. They’re essentially derivative works, but no one wants to acknowledge it because it would either require changing our copyright laws or make this potentially lucrative and important work illegal.

            • Lemminary
              link
              fedilink
              English
              4
              11 months ago

              Content laundering is not a good way to describe it; it’s misleading because it oversimplifies and mischaracterizes what a language model actually does. It’s a fundamental misunderstanding of how it works. Training language models is typically a transparent and well-documented process, as described by the mountains of research over the past decades. The real value comes from the weights of the nodes in the neural network, not from spitting out in its entirety the source it was trained on. The source material is evaluated and wholly transformed into new data in the form of nodes and weights. The original content does not exist as it was within the network because there’s no way to encode it that way. It’s a statistical system that compounds information.

              And while LLMs do have the capacity to create derivative works in other ways, that’s not all that they do, or what they always do. It’s only one of the many functions they have. What you say would probably be true if a model were trained on only a single source, but that’s not even feasible. When you train it on millions of sources, what remains are the overall patterns of language within those works. It’s much more sophisticated and flexible than what you describe.
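
              As a rough illustration of the single-source versus many-sources point, here is a toy sketch in Python. It is a word-level bigram counter, not an LLM, and the names `train` and `generate` are made up for this example. It only shows how aggregated statistics behave differently depending on how many sources go in; it says nothing either way about what a real model with billions of weights retains.

```python
from collections import Counter, defaultdict


def train(sources):
    """Count word-to-next-word transitions across all sources (toy stand-in for 'weights')."""
    counts = defaultdict(Counter)
    for text in sources:
        words = text.split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def generate(counts, start, max_words=20):
    """Greedily follow the most frequent transition, starting from `start`."""
    words = [start]
    while len(words) < max_words and counts.get(words[-1]):
        words.append(counts[words[-1]].most_common(1)[0][0])
    return " ".join(words)


# Hypothetical example data, chosen so there are no ties between transitions.
single = "quick brown foxes jump over lazy dogs"
many = [single, "quick brown bears sleep", "quick brown bears sleep soundly"]

print(generate(train([single]), "quick"))  # one source: the counts can only replay it
print(generate(train(many), "quick"))      # several sources: the majority pattern wins
```

              With one source, the only statistics available are that source’s own word order, so greedy generation replays it. With several overlapping sources, the counts blend and the output follows whichever continuation is most common.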

              So no, if it were cut and dried, there would be grounds for a legitimate lawsuit. The problem is that people are arguing points that do not apply but sound reasonable when you haven’t seen a neural network work under the hood. If anything, new laws need to be created to address what LLMs do if you’re so concerned about proper compensation.

          • Jilanico
            link
            fedilink
            English
            2
            11 months ago

            I feel most people critical of AI don’t know how a neural network works…

            • Lemminary
              link
              fedilink
              English
              -1
              11 months ago

              That is exactly what’s going on here. Or they hate it enough that they don’t mind making stuff up or mischaracterizing what it does. Seems to be a common thread on the Fediverse. It’s not the first time this week I’ve seen it.

      • Cethin
        link
        fedilink
        English
        5
        11 months ago

        It’s great how most of us are taught that just changing the order of words is still plagiarism. These models frequently end up using the exact same words as other works, and people still argue that it’s somehow intelligent and somehow not plagiarism.

        • Lemminary
          link
          fedilink
          English
          2
          11 months ago

          “Changing the order of words” is what it does? That’s news to me. And do you have examples of it “using the exact same words as other things” without prompt manipulation?

          • @asret@lemmy.zip
            link
            fedilink
            English
            0
            11 months ago

            Why does the prompting matter? If I “prompt” a band to play copyrighted music does that mean they get a free pass?

            • Lemminary
              link
              fedilink
              English
              3
              edit-2
              11 months ago

              That’s not a very good analogy because the band would be reproducing an entire work of art, which an LLM does not and cannot. And by prompt manipulation I mean purposely making it seem like the LLM is doing something it wouldn’t do on its own. The operative word is “seem,” which is what I meant by manipulation. The fact that it was prompted is irrelevant; how it was prompted is not. So unless The Times releases the steps they used to get ChatGPT to output what it did, you can’t really claim that that’s what it does.

              In a blog post, OpenAI said the Times “is not telling the full story.” It took particular issue with claims that its ChatGPT AI tool reproduced Times stories verbatim, arguing that the Times had manipulated prompts to include regurgitated excerpts of articles. “Even when using such prompts, our models don’t typically behave the way The New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked their examples from many attempts,” OpenAI said.

            • @stewsters@lemmy.world
              link
              fedilink
              English
              2
              edit-2
              11 months ago

              If you passed them a sheet of music, I’d say that’s on you; it would be your responsibility not to sell recordings of them playing it.

              Just like if I typed the first chapter of Harry Potter into Word, it is not Microsoft’s intent to breach copyright; it would have been my intent to make it do so. It would be my responsibility not to sell that first chapter, and they should come after me if I did, even though MS is the corporation that supplied the tools.