OpenAI says it’s “impossible” to create useful AI models without copyrighted material

@sculd@beehaw.org · 11 months ago

OpenAI says it’s “impossible” to create useful AI models without copyrighted material

@lily33@lemm.ee · 11 months ago

This is not REALLY about copyright - this is an attack on free and open AI models, which would be IMPOSSIBLE if copyright was extended to cover the case of using the works for training.
It’s not stealing. There is literally no resemblance between the training works and the model. IP rights have been continuously strengthened due to lobbying over the last century and are already absurdly strong, I don’t understand why people on here want so much to strengthen them ever further.

BraveSirZaphod · 11 months ago

There is literally no resemblance between the training works and the model.

This is way too strong a statement when some LLMs can spit out copyrighted works verbatim.

https://www.404media.co/google-researchers-attack-convinces-chatgpt-to-reveal-its-training-data/

A team of researchers primarily from Google’s DeepMind systematically convinced ChatGPT to reveal snippets of the data it was trained on using a new type of attack prompt which asked a production model of the chatbot to repeat specific words forever.

Often, that “random content” is long passages of text scraped directly from the internet. I was able to find verbatim passages the researchers published from ChatGPT on the open internet: Notably, even the number of times it repeats the word “book” shows up in a Google Books search for a children’s book of math problems. Some of the specific content published by these researchers is scraped directly from CNN, Goodreads, WordPress blogs, on fandom wikis, and which contain verbatim passages from Terms of Service agreements, Stack Overflow source code, copyrighted legal disclaimers, Wikipedia pages, a casino wholesaling website, news blogs, and random internet comments.

Beyond that, copyright law was designed under the circumstances where creative works are only ever produced by humans, with all the inherent limitations of time, scale, and ability that come with that. Those circumstances have now fundamentally changed, and while I won’t be so bold as to pretend to know what the ideal legal framework is going forward, I think it’s also a much bolder statement than people think to say that fair use as currently applied to humans should apply equally to AI and that this should be accepted without question.

MudMan · 11 months ago

I’m gonna say those circumstances changed when digital copies and the Internet became a thing, but at least we’re having the conversation now, I suppose.

I agree that ML image and text generation can create something that breaks copyright. You for sure can duplicate images or use copyrighted characterrs. This is also true of Youtube videos and Tiktoks and a lot of human-created art. I think it’s a fascinated question to ponder whether the infraction is in what the tool generates (i.e. did it make a picture of Spider-Man and sell it to you for money, whcih is under copyright and thus can’t be used that way) or is the infraction in the ingest that enables it to do that (i.e. it learned on pictures of Spider-Man available on the Internet, and thus all output is tainted because the images are copyrighted).

The first option makes more sense to me than the second, but if I’m being honest I don’t know if the entire framework makes sense at this point at all.

@lily33@lemm.ee · edit-2 11 months ago

The infraction should be in what’s generated. Because the interest by itself also enables many legitimate, non-infracting uses: uses, which don’t involve generating creative work at all, or where the creative input comes from the user.

MudMan · 11 months ago

I don’t disagree on principle, but I do think it requires some thought.

Also, that’s still a pretty significant backstop. You basically would need models to have a way to check generated content for copyright, in the way Youtube does, for instance. And that is already a big debate, whether enforcing that requirement is affordable to anybody but the big companies.

But hey, maybe we can solve both issues the same way. We sure as hell need a better way to handle mass human-produced content and its interactions with IP. The current system does not work and it grandfathers in the big players in UGC, so whatever we come up with should work for both human and computer-generated content.

@intensely_human@lemm.ee · 11 months ago

I can spit out copyrighted work verbatim.

“No Lieutenant, your men are already dead”

See?

@lily33@lemm.ee · 11 months ago

But AI isn’t all about generating creative works. It’s a store of information that I can query - a bit like searching Google; but understands semantics, and is interactive. It can translate my own text for me - in which case all the creativity comes from me, and I use it just for its knowledge of language. Many people use it to generate boilerplate code, which is pretty generic and wouldn’t usually be subject to copyright.

@intensely_human@lemm.ee · 11 months ago

This is how I use the AI: I learn from it. Honestly I just never got the bug on wanting it to generate creative works I can sell. I guess I’d rather sell my own creative output, you know? It’s more fun than ordering a robot to be creative for me.

FaceDeer · 11 months ago

I have used it as a collaborator when doing creative work. It’s a great brainstorming buddy, and I use it to generate rough drafts of stuff. Usually I use it while developing roleplaying scenarios for TTRPGs I run for my friends. Generative AI is great for illustrating those scenarios, too.

@AndrasKrigare@beehaw.org · 11 months ago

I know it inherently seems like a bad idea to fix an AI problem with more AI, but it seems applicable to me here. I believe it should be technically feasible to incorporate into the model something which checks if the result is too similar to source content as part of the regression.

My gut would be that this would, at least in the short term, make responses worse on the whole, so would probably require legal action or pressure to have it implemented.

BraveSirZaphod · 11 months ago

The key element here is that an LLM does not actually have access to its training data, and at least as of now, I’m skeptical that it’s technologically feasible to search through the entire training corpus, which is an absolutely enormous amount of data, for every query, in order to determine potential copyright violations, especially when you don’t know exactly which portions of the response you need to use in your search. Even then, that only catches verbatim (or near verbatim) violations, and plenty of copyright questions are a lot fuzzier.

For instance, say you tell GPT to generate a fan fiction story involving a romance between Draco Malfoy and Harry Potter. This would unquestionably violate JK Rowling’s copyright on the characters if you published the output for commercial gain, but you might be okay if you just plop it on a fan fic site for free. You’re unquestionably okay if you never publish it at all and just keep it to yourself (well, a lawyer might still argue that this harms JK Rowling by damaging her profit if she were to publish a Malfoy-Harry romance, since people can just generate their own instead of buying hers, but that’s a messier question). But, it’s also possible that, in the process of generating this story, GPT might unwittingly directly copy chunks of renowned fan fiction masterpiece My Immortal. Should GPT allow this, or would the copyright-management AI strike it? Legally, it’s something of a murky question.

For yet another angle, there is of course a whole host of public domain text out there. GPT probably knows the text of the Lord’s Prayer, for instance, and so even though that output would perfectly match some training material, it’s legally perfectly okay. So, a copyright police AI would need to know the copyright status of all its training material, which is not something you can super easily determine by just ingesting the broad internet.

@lily33@lemm.ee · 11 months ago

skeptical that it’s technologically feasible to search through the entire training corpus, which is an absolutely enormous amount of data

Google, DuckDuckGo, Bing, etc. do it all the time.

@AndrasKrigare@beehaw.org · 11 months ago

I don’t see why it wouldn’t be able to. That’s a Big Data problem, but we’ve gotten very very good at searches. Bing, for instance, conducts a web search on each prompt in order to give you a citation for what it says, which is pretty close to what I’m suggesting.

As far as comparing to see if the text is too similar, I’m not suggesting a simple comparison or even an Expert Machine; I believe that’s something that can be trained. GANs already have a discriminator that’s essentially measuring how close to generated content is to “truth.” This is extremely similar to that.

I completely agree that categorizing input training data by whether or not it is copyrighted is not easy, but it is possible, and I think something that could be legislated. The AI you would have as a result would inherently not be as good as it is in the current unregulated form, but that’s not necessarily a worse situation given the controversies.

On top of that, one of the common defenses for AI is that it is learning from material just as humans do, but humans also can differentiate between copyrighted and public works. For the defense to be properly analogous, it would make sense to me that it would need some notion of that as well.

FaceDeer · 11 months ago

It’s actually the other way around, Bing does websearches based on what you’ve asked it and then the answer it generates can incorporate information that was returned by the websearching. This is why you can ask it about current events that weren’t in its training data, for example - it looks the information up, puts it into its context, and then generates the response that you see. Sort of like if I asked you to write a paragraph about something that you didn’t know about, you’d go look the information up first.

but humans also can differentiate between copyrighted and public works

Not really. Here’s a short paragraph about sailboats. Is it copyrighted?

Sailboats, those graceful dancers of the open seas, epitomize the harmonious marriage of nature and human ingenuity. Their billowing sails, like ethereal wings, catch the breath of the wind, propelling them across the endless expanse of the ocean. Each vessel bears the scars of countless journeys, a testament to the resilience of both sailor and ship.

@AndrasKrigare@beehaw.org · 11 months ago

Bing does, but it still has a pre trained model that it’s using in its answer; you can give it prompts that it will answer without having to perform a search at all. That’s not a huge distinction, but I think the majority of the concern is on those types of responses. If it’s just responding with the results of a web search, I don’t think anyone is particularly concerned.

I was being specific with my word choice there, and should have emphasized more. Humans can differentiate between them, not humans always can differentiate. Copyright as a concept is something we have awareness of than (to my knowledge) is not part of the major AI models. I don’t know that an AI needs to be better than a human at that task.

sour · 11 months ago

deleted by creator

@sculd@beehaw.org · 11 months ago

Sorry AIs are not humans. Also executives like Altman are literally being paid millions to steal creator’s work.

@lily33@lemm.ee · 11 months ago

I didn’t say anything about AIs being humans.

@intensely_human@lemm.ee · 11 months ago

They’re also not vegetables 😡

@MNByChoice@midwest.social · 11 months ago

I don’t understand why people on here want so much to strengthen them ever further.

It is about a lawless company doing lawless things. Some of us want companies to follow the spirit, or at least the letter, of the law. We can change the law, but we need to discuss that.

@explodicle@local106.com · 11 months ago

IANAL, why isn’t it fair use?

@maynarkh@feddit.nl · 11 months ago

The two big arguments are:

Substantial reproduction of the original work, you can get back substantial portions of the original work from an AI model’s output.
The AI model replaces the use of the original work. In short, a work that uses copyrighted material under fair use can’t be a replacement for the initial work.

@intensely_human@lemm.ee · 11 months ago

you can get back substantial portions of the original work from an AI model’s output

Have you confirmed this yourself?

@chaos@beehaw.org · 11 months ago

In its complaint, The New York Times alleges that because the AI tools have been trained on its content, they sometimes provide verbatim copies of sections of Times reports.

OpenAI said in its response Monday that so-called “regurgitation” is a “rare bug,” the occurrence of which it is working to reduce.

“We also expect our users to act responsibly; intentionally manipulating our models to regurgitate is not an appropriate use of our technology and is against our terms of use,” OpenAI said.

The tech company also accused The Times of “intentionally” manipulating ChatGPT or cherry-picking the copycat examples it detailed in its complaint.

https://www.cnn.com/2024/01/08/tech/openai-responds-new-york-times-copyright-lawsuit/index.html

The thing is, it doesn’t really matter if you have to “manipulate” ChatGPT into spitting out training material word-for-word, the fact that it’s possible at all is proof that, intentionally or not, that material has been encoded into the model itself. That might still be fair use, but it’s a lot weaker than the original argument, which was that nothing of the original material really remains after training, it’s all synthesized and blended with everything else to create something entirely new that doesn’t replicate the original.

@intensely_human@lemm.ee · 11 months ago

So that’s a no? Confirming it yourself here means doing it yourself. Have you gotten it to regurgitate a copyrighted work?

FaceDeer · 11 months ago

You said:

Substantial reproduction of the original work, you can get back substantial portions of the original work from an AI model’s output.

If an AI is trained on a huge number of NYT articles and you’re only able to get it to regurgitate one of them, that’s not a “substantial portion of the original work.” That’s a minuscule portion of the original work.

Chahk · 11 months ago

Agreed on both counts… Except Microsoft sings a different tune when their software is being “stolen” in the exact same way. They want to have it both ways - calling us pirates when we copy their software, but it’s “without merit” when they do it. Fuck’em! Let them play by the same rules they want everyone else to play.

@intensely_human@lemm.ee · 11 months ago

That sounds bad. Do you have evidence for MS behaving this way?

Chahk · 11 months ago

https://www.computerworld.com/article/3121736/microsoft-sues-repeat-software-pirate-who-owes-company-12m-from-prior-case.html

Literally first hit on google (after the NYT links).