Amazing how every new generation of technology has a generation of users of the previous technology who do whatever they can to stop its advancement. This technology takes human creativity and output to a whole new level. It will advance medicine and science in ways that are difficult to even imagine, and it will provide personalized educational tutoring to every student regardless of income. Yet these people are worried about the technicality of what the AI is trained on, and often don’t understand enough about AI to even make an argument about it. If people like this win, whatever country’s legal system they win in will not see the benefits that AI can bring. That society is shooting itself in the foot.
Your favorite musician listened to music that inspired them when they made their songs. Listening to other people’s music taught them how to make music. They paid for the music (or somebody did via licensing fees or it was freely available for some other reason) when they listened to it in the first place. When they sold records, they didn’t have to pay the artist of every song they ever listened to. That would be ludicrous. An AI shouldn’t have to pay you because it read your book and millions like it to learn how to read and write.
You’re humanizing the software too much. Comparing software to human behavior is just plain wrong. GPT can’t even reason properly yet. I can’t see this as anything other than a more advanced collage process.
OpenAI used intellectual property without consent of the owners. Major fucked.
If ‘anybody’ does anything similar to tracing, copy&pasting or even sampling a fraction of another person’s imagery or written work, that anybody is violating copyright.
Ok, but tracing is literally a part of the human learning process. If you trace a work and sell it as your own that’s bad. If you trace a work to learn about the style and let that influence your future works that is what every artist already does.
The artistic process isn’t copyrighted, only the final result. The exact same standards can apply to AI generated work as already do to anything human generated.
i don’t know the specifics of the lawsuit but i imagine this would parallel piracy.
in a way you could say that OpenAI has pirated software directly from multiple intellectual properties. OpenAI has distributed software which emulates skills and knowledge. remember this is a tool, not an individual.
It’s not exactly the same thing, but here’s an article by Kit Walsh, a senior staff attorney at the EFF, that explains how image generators work within the law. The two aren’t exactly the same, but you can see how the same ideas would apply. The EFF is a digital rights group that most recently won a historic case: border guards now need a warrant to search your phone.
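Here are some excerpts:
First, copyright law doesn’t prevent you from making factual observations about a work or copying the facts embodied in a work (this is called the “idea/expression distinction”). Rather, copyright forbids you from copying the work’s creative expression in a way that could substitute for the original, and from making “derivative works” when those works copy too much creative expression from the original.
Second, even if a person makes a copy or a derivative work, the use is not infringing if it is a “fair use.” Whether a use is fair depends on a number of factors, including the purpose of the use, the nature of the original work, how much is used, and potential harm to the market for the original work.
And:
Like copying to create search engines or other analytical uses, downloading images to analyze and index them in service of creating new, noninfringing images is very likely to be fair use. When an act potentially implicates copyright but is a necessary step in enabling noninfringing uses, it frequently qualifies as a fair use itself. After all, the right to make a noninfringing use of a work is only meaningful if you are also permitted to perform the steps that lead up to that use. Thus, as both an intermediate use and an analytical use, scraping is not likely to violate copyright law.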
I’d like to hear your thoughts.
thanks for the sauce. It’s very enlightening.
it does trouble me to think that the creators of stable diffusion could be financially punished. Did they at least try to compensate the artists in any way?
It “feels” as though it parallels consultation. These creatives are literally paid for their creations. If a software constructs a neural network to emulate intellectual property, does that count as consultation? Could/Should it apply to the software developers or individuals using the software?
From the technical side, I don’t understand how all the red flags aren’t already there. the source material was taken, and now any individual could acquire that exact material or anything “in the spirit of” that material through a single service. Is this a new way to pirate?
stable diffusion is a great opportunity for small businesses. especially in an increasingly anti-small business america (maybe that’s just california?) I’d hate for it to become inaccessible to creators that would wield it properly.
as long as creatives retain the ability to sue the bad actors, i’m glad. I personally don’t need OpenAI or whoever is directly responsible for stable diffusion and its training data to be punished.
In the US, fair use lets you use copyrighted material without permission for purposes such as criticism, research, and artistic expression like literature, art, music, satire, and parody. It balances the interests of copyright holders with the public’s right to access and use information. There are rights people can maintain over their work, and there are rights they do not maintain. We are allowed to analyze people’s publicly published works, and that’s always been to the benefit of artistic expression. It would be awful for everyone if IP holders could take down any criticism, reverse engineering, or indexes they don’t like. That would be the dream of every corporation, bully, troll, or wannabe autocrat.
The consultation angle is interesting, but I’m not sure it applies here. Consultation usually involves a direct and intentional exchange of information and expertise, whereas this is an original analysis of data that doesn’t emulate any specific intellectual property.
I also don’t think this is a new way to pirate, as long as you don’t reproduce the source material. If you wanted to do that, you could just right-click and “save as”. What this does is lower the bar for entry to let people more easily exercise their rights. Like print media vs. internet publication and TV/Radio vs. online content, there will be winners and losers, but if done right, I think this will all be in service of a more decentralized and open media landscape.
So citing is a copyright violation? A scientific discussion on a specific text is a copyright violation? This makes no sense. It would mean your work couldn’t build on anything else, and that’s plain stupid.
Also to your first point about reasoning and advanced collage process: you are right and wrong. Yes, an LLM doesn’t have the ability to use all the information a human has or be as precise, therefore it can’t reason the same way a human can. BUT, and that is a huge caveat, the inherent goal of AI, and in its simplest form neural networks, was to replicate human thinking. If you look at the brain and then at AIs, you will see how close the process is. Usually you give the AI an input, the AI tries to give the desired output, then the AI gets told what the output should have looked like, and then it backpropagates to reinforce its process. This is already pretty advanced and human-like (even look at how the brain is made up and then how AI models are made up, it’s basically the same concept).
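For anyone who wants to see that loop concretely, here is a minimal sketch. It is not how any real LLM is trained: it is a single made-up "neuron" with one weight learning y = 2x by plain gradient descent, which is the kind of update backpropagation feeds in larger networks. The data, learning rate, and step count are all assumptions picked for the illustration.

```python
# Toy illustration only: one "neuron" with a single weight learning y = 2x.
# The data, learning rate, and step count are made-up values for this sketch.

def train(steps=200, lr=0.05):
    w = 0.0  # the weight starts out knowing nothing
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, desired output)
    for _ in range(steps):
        for x, target in data:
            prediction = w * x            # the model tries to give the output
            error = prediction - target   # it is told how far off it was
            gradient = 2 * error * x      # derivative of error**2 with respect to w
            w -= lr * gradient            # nudge the weight to reduce the error
    return w

learned_w = train()
print(round(learned_w, 3))  # ends up very close to 2.0
```

Whether this resembles what a brain does is exactly what the thread is arguing about; the sketch only shows the mechanical "guess, get corrected, adjust" cycle described above.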
Now you would be right to say “well, in its simplest form LLMs like GPT are just predicting which character or word comes next”, and you would be partially right. But in that process it incorporates all of the “knowledge” it got from its training sessions and a few valuable tricks to improve. The truth is, differences between a human brain and an AI are marginal, and it mostly boils down to efficiency and training time.
And to say that LLMs are just “an advanced collage process” is like saying “a car is just an advanced horse”. You’re not technically wrong but the description is really misleading if you look into the details.
And for detail’s sake, this is what the paper for Llama2 looks like; the latest big LLM from Facebook that is said to be the current standard for LLM development:
https://arxiv.org/pdf/2307.09288.pdf
You’re mystifying and mythologising humans too much. The learning process is very equivalent.
amazing
Well, there’s still a shit ton we don’t understand about humans.
We do, however, understand everything about machine learning.
LOL
We understand less about how LLMs generate a single output than we do about the human brain. You clearly have no experience developing models.
Well, given that we’re the ones who developed the models, that they are deterministic (we can save and reproduce the random weights they are given during training), and that we can use a debugger to step through every single step the model makes in learning and “thinking”, yes, we understand them.
We cannot, however, do that for the human brain.
You really don’t understand how these models work and you should learn about them before you make statements about them.
Machine learning models are, almost by definition, non-deterministic.
We know the input, we can set the model to save the weights in checkpoints during training and can view them any time, and we can see weights of the finished model, and we can see the code.
If what you said about LLMs being completely black box were true, we wouldn’t be able to reproduce models, and each model would be unique.
But we can control every step of the training process, and we can reproduce not just the finished model, but the model at every single step during training.
We created the math, we created the training sets, we created the code and we can see and modify the weights and any other property of the model.
What exactly do we not understand?
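To make the reproducibility claim concrete, here is a small sketch, assuming a toy one-weight model on a single CPU where all randomness comes from one seed. The function name, data, and checkpoint interval are hypothetical. Large-scale training on parallel hardware may need extra care to get bit-identical runs, so treat this as an illustration of the idea only.

```python
import random

def train_toy_model(seed, steps=100, lr=0.05):
    """Train a one-weight model from a random start; return weight checkpoints."""
    rng = random.Random(seed)          # all randomness comes from this seed
    w = rng.uniform(-1.0, 1.0)         # random initial weight
    data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # made-up (input, target) pairs
    checkpoints = []
    for step in range(steps):
        x, target = rng.choice(data)   # random sampling order, also seeded
        w -= lr * 2 * (w * x - target) * x
        if step % 10 == 0:
            checkpoints.append(w)      # save the weight every 10 steps
    return checkpoints

run_a = train_toy_model(seed=42)
run_b = train_toy_model(seed=42)
run_c = train_toy_model(seed=7)
print(run_a == run_b)  # same seed: identical at every checkpoint
print(run_a == run_c)  # different seed: a different trajectory
```

Being able to replay a run weight for weight is a separate question from being able to say why a particular weight ended up with its value, which is the part the rest of this thread disagrees on.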
Look, I understand why you think this. I thought this too when I was first beginning to learn machine learning and data science. But I’ve now been working with machine learning models, including neural networks, for nearly a decade, and the truth is that it is nearly impossible to track the path of an input to a given output in machine learning models other than regression-based models and decision tree-based models.
There is an entire field of data science devoted to explaining how these models arrive at their conclusions. It’s called “explainable AI” or “xAI”, and I’ve published a few papers exploring the utility of its methods. The basic explanation for how they work is that we run hundreds of thousands of different models and then do statistical analysis to estimate why the models arrived at their conclusion. It isn’t an exact science, however.
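As a rough illustration of the "perturb it many times and do statistics" idea, here is a sketch of permutation importance, which is one common xAI technique and not necessarily the one used in the papers mentioned above. The `black_box` function is a hypothetical stand-in for an opaque model, and the data is randomly generated for the example.

```python
import random

def black_box(features):
    """Stand-in for an opaque model: we only call it, never look inside."""
    a, b, c = features
    return 3.0 * a + 0.5 * b + 0.0 * c

def permutation_importance(model, rows, targets, n_repeats=50, seed=0):
    """Estimate each feature's importance by shuffling it and measuring added error."""
    rng = random.Random(seed)

    def mean_sq_error(data):
        return sum((model(r) - t) ** 2 for r, t in zip(data, targets)) / len(data)

    baseline = mean_sq_error(rows)
    importances = []
    for col in range(len(rows[0])):
        increase = 0.0
        for _ in range(n_repeats):
            column = [r[col] for r in rows]
            rng.shuffle(column)  # break the link between this feature and the target
            shuffled = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, column)]
            increase += mean_sq_error(shuffled) - baseline
        importances.append(increase / n_repeats)
    return importances

rng = random.Random(1)
rows = [(rng.random(), rng.random(), rng.random()) for _ in range(200)]
targets = [black_box(r) for r in rows]
scores = permutation_importance(black_box, rows, targets)
print([round(s, 3) for s in scores])  # first feature matters most, third not at all
```

Note that the result is a statistical estimate from repeated shuffles, which fits the point that this isn’t an exact science.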
No that’s not how it works. It stores learned information like “word x is more likely to follow word y than word a” or “people from country x are more likely to consume food a than b”. That is what is distributed when the AI model is shared. To learn that, it just reads books zillions of times and updates its table of likelihoods. Just like an artist might listen to a Lil Wayne album hundreds of times and each time they learn a little bit more about his rhyme style or how beats work or whatever. It’s more complicated than that, but that’s a layperson’s explanation of how it works. The book isn’t stored in there somewhere. The book’s contents aren’t transferred to other parties.
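A literal version of that "table of likelihoods" can be sketched in a few lines, with the big caveat that this is the layperson’s picture only: real LLMs store neural network weights, not a lookup table of word pairs. The corpus below is made up for the example and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_bigram_table(text):
    """Count how often each word follows each other word."""
    words = text.lower().split()
    table = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        table[current][following] += 1
    return table

def next_word_probabilities(table, word):
    """Turn raw follow-counts for `word` into probabilities."""
    counts = table[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Made-up training text for the sketch.
corpus = "the cat sat on the mat and the cat ate the fish"
table = build_bigram_table(corpus)
probs = next_word_probabilities(table, "the")
print(probs)  # "cat" follows "the" half the time in this tiny corpus
```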
It’s less about copying the work, it’s more like looking at patterns that appear in a work.
To bring a very rudimentary example: if I wanted a word and the first letter was Q, what would the second letter be?
Of course, statistically, the next letter is u, and it’s not common for words starting with Q to have a different letter after that. ML/AI is like taking these small situations, but having a ridiculous number of parameters to come up with something based on several internal models. These parameters of course generally have some context.
It’s like if you were told to read a book thoroughly, and then afterward were told to reproduce the same book. You probably couldn’t make it 1:1, but you could probably get the general gist of the story. The difference between you and the machine is that the machine has read a lot of books and contextually knows patterns, so it can generate something similar faster and more accurately, but not the original one-for-one.
When you download Vicuna or Stable Diffusion XL, they’re a handful of gigabytes. But when you go download LAION-5B, it’s 240TB. So where did that data go if it’s being copy/pasted and regurgitated in its entirety?
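The back-of-the-envelope arithmetic behind that question looks something like this. The 240 TB figure is the one quoted above, the image count is LAION-5B’s published size of about 5.85 billion image-text pairs, and the 7 GB model size is an assumed round number standing in for "a handful of gigabytes". It also assumes, for simplicity, that the model saw the whole dataset.

```python
# Rough numbers only; see the assumptions stated above.
DATASET_BYTES = 240e12      # 240 TB, as quoted in the comment
NUM_IMAGES = 5.85e9         # image-text pairs in LAION-5B
MODEL_BYTES = 7e9           # assumed ~7 GB checkpoint

bytes_per_image_in_dataset = DATASET_BYTES / NUM_IMAGES
bytes_per_image_in_model = MODEL_BYTES / NUM_IMAGES
size_ratio = DATASET_BYTES / MODEL_BYTES

print(f"{bytes_per_image_in_dataset / 1e3:.0f} KB per image in the dataset")
print(f"{bytes_per_image_in_model:.2f} bytes per image in the model")
print(f"dataset is about {size_ratio:,.0f}x larger than the model")
```

Under these assumptions the model has a little over one byte of capacity per training image, which is the point being made: there is no room to store the images wholesale.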
Exactly! If it were just outputting exact data, they wouldn’t care about making new works and would just pivot to being the world’s greatest source of compression.
Though researchers have done some work to heavily modify these models to overfit and do exactly this.
I don’t think that Sarah Silverman and the others are saying that the tech shouldn’t exist. They’re saying that the input to train them needs to be negotiated as a society. And the businesses also care about the input to train them because it affects the performance of the LLMs. If we do allow licensing, watermarking, data cleanup, synthetic data, etc. in a way that is transparent, I think it’s good for the industry and it’s good for the people.
I don’t need to negotiate with Sarah Silverman if I’m handed her book by a friend, and neither should an AI.
But you do need to negotiate with Sarah Silverman if you take that book, rearrange the chapters, and then try to sell it for profit. Obviously that’s extremified, but it’s the argument they’re making.
I agree. But that isn’t what AI is doing, because it doesn’t store the actual book and it isn’t possible to reproduce any part in a format that is recognizable as the original work.
Definitely not how that output works. It will come up with something that seems like a Sarah Silverman created work but isn’t. It’s like calling copyright on impersonations. I don’t buy it.
Yes. Imagine how much trouble ANY actor would be in if they could be sued for impersonating someone, being nearly identical to that person but not that person. If Sarah Silverman ever interacted with a person and then imitated that person on stage for her own personal benefit without the other person’s express consent, it would be no different. And comedians pick up their comedy from everything around them, both natural and imitation.
100%. I just can’t get behind any of these arguments against AI from this segment of workers. This is no different than other rallies against technological evolution due to fear of job losses. Their scarce commodity will soon disappear and that’s what they’re actually afraid of.
It’s easy. They’re grasping at straws because their career isn’t what it used to be. It’s something new and viral, so it must be an easy target to exploit for money. Personally I’d be on top of it and setting up contracts to allow AI to use my likeness for a small fraction of the usual pay. I just can’t imagine not taking advantage of the ability to do absolutely nothing and still get paid for it. Instead they appear to actively be trying to tear it down. If they wanted to set guidelines, they would be rallying Congress, not suing a company based on how they FEEL it should be.
That’s not what this is. To use your example it would be like taking her book and rearranging ALL of the words to make another book and selling that book. But they’re not selling the book or its contents, they’re selling how their software interprets the book for the benefit of the user. This would be like suing teachers for teaching about their book.
An LLM isn’t human and shouldn’t be treated the same as a human. It’s as foolish as corporate personhood.
The argument is less that an LLM is a human and more that it is not a copyright violation to use a material to train the LLM. By current legal definitions, it is fair use unless the material is able to be reproduced in its entirety (or at least, in some meaningful way).
Yeah, definitions that were written before this technology existed. I don’t base my opinions on what is legal; legality is nothing more than rules determined by those in power.
Instead, I base them on what is ethical, and the consumption of material by LLMs and other AIs without the express permission of its creator is unethical.
Amazing how every generation of technology has an asshole billionaire or two stealing shit to be the first in line to try and monopolize society’s progress.
No, it doesn’t. There’s nothing “human” or “creative” about the output of AI.