OpenAI says it’s “impossible” to create useful AI models without copyrighted material

@sculd@beehaw.org · 11 months ago

OpenAI says it’s “impossible” to create useful AI models without copyrighted material

BraveSirZaphod · 11 months ago

The key element here is that an LLM does not actually have access to its training data, and at least as of now, I’m skeptical that it’s technologically feasible to search through the entire training corpus, which is an absolutely enormous amount of data, for every query, in order to determine potential copyright violations, especially when you don’t know exactly which portions of the response you need to use in your search. Even then, that only catches verbatim (or near verbatim) violations, and plenty of copyright questions are a lot fuzzier.

For instance, say you tell GPT to generate a fan fiction story involving a romance between Draco Malfoy and Harry Potter. This would unquestionably violate JK Rowling’s copyright on the characters if you published the output for commercial gain, but you might be okay if you just plop it on a fan fic site for free. You’re unquestionably okay if you never publish it at all and just keep it to yourself (well, a lawyer might still argue that this harms JK Rowling by damaging her profit if she were to publish a Malfoy-Harry romance, since people can just generate their own instead of buying hers, but that’s a messier question). But, it’s also possible that, in the process of generating this story, GPT might unwittingly directly copy chunks of renowned fan fiction masterpiece My Immortal. Should GPT allow this, or would the copyright-management AI strike it? Legally, it’s something of a murky question.

For yet another angle, there is of course a whole host of public domain text out there. GPT probably knows the text of the Lord’s Prayer, for instance, and so even though that output would perfectly match some training material, it’s legally perfectly okay. So, a copyright police AI would need to know the copyright status of all its training material, which is not something you can super easily determine by just ingesting the broad internet.

@lily33@lemm.ee · 11 months ago

skeptical that it’s technologically feasible to search through the entire training corpus, which is an absolutely enormous amount of data

Google, DuckDuckGo, Bing, etc. do it all the time.

@AndrasKrigare@beehaw.org · 11 months ago

I don’t see why it wouldn’t be able to. That’s a Big Data problem, but we’ve gotten very very good at searches. Bing, for instance, conducts a web search on each prompt in order to give you a citation for what it says, which is pretty close to what I’m suggesting.

As far as comparing to see if the text is too similar, I’m not suggesting a simple comparison or even an Expert Machine; I believe that’s something that can be trained. GANs already have a discriminator that’s essentially measuring how close to generated content is to “truth.” This is extremely similar to that.

I completely agree that categorizing input training data by whether or not it is copyrighted is not easy, but it is possible, and I think something that could be legislated. The AI you would have as a result would inherently not be as good as it is in the current unregulated form, but that’s not necessarily a worse situation given the controversies.

On top of that, one of the common defenses for AI is that it is learning from material just as humans do, but humans also can differentiate between copyrighted and public works. For the defense to be properly analogous, it would make sense to me that it would need some notion of that as well.

FaceDeer · 11 months ago

It’s actually the other way around, Bing does websearches based on what you’ve asked it and then the answer it generates can incorporate information that was returned by the websearching. This is why you can ask it about current events that weren’t in its training data, for example - it looks the information up, puts it into its context, and then generates the response that you see. Sort of like if I asked you to write a paragraph about something that you didn’t know about, you’d go look the information up first.

but humans also can differentiate between copyrighted and public works

Not really. Here’s a short paragraph about sailboats. Is it copyrighted?

Sailboats, those graceful dancers of the open seas, epitomize the harmonious marriage of nature and human ingenuity. Their billowing sails, like ethereal wings, catch the breath of the wind, propelling them across the endless expanse of the ocean. Each vessel bears the scars of countless journeys, a testament to the resilience of both sailor and ship.

@AndrasKrigare@beehaw.org · 11 months ago

Bing does, but it still has a pre trained model that it’s using in its answer; you can give it prompts that it will answer without having to perform a search at all. That’s not a huge distinction, but I think the majority of the concern is on those types of responses. If it’s just responding with the results of a web search, I don’t think anyone is particularly concerned.

I was being specific with my word choice there, and should have emphasized more. Humans can differentiate between them, not humans always can differentiate. Copyright as a concept is something we have awareness of than (to my knowledge) is not part of the major AI models. I don’t know that an AI needs to be better than a human at that task.