Laundering Copyright At Scale

The Authors Guild just filed its own class-action lawsuit against OpenAI, claiming that using authors’ copyrighted works to train ChatGPT infringed their copyrights. This is essentially the same argument made in two other lawsuits filed a few months ago and in the class-action lawsuit filed by artists against Stability AI. As I said with the Stable Diffusion case, case law suggests that training an AI is fair use, though it’s far from settled. Either way, I’m sure the big players are busy training “clean” models using only public domain and licensed content (particularly content they already “own”), both as a hedge and because uncertainty about fair use will naturally tamp down any competitors who don’t have the resources to make their own clean versions. There’s already word that Getty Images is partnering with Nvidia to create its own generative AI system trained only on its own library, and I’m sure they aren’t the only ones.

But don’t expect clean training data to make artists and authors any happier, because this whole debate isn’t really about how these models were trained; it’s about what they can do. Copyright law protects a fixed expression of an idea (the words on a page, the placement of ink in a drawing, even the composition of a photograph) but not the idea itself. That’s by design, because art inherently builds on what came before it, “stealing” the best ideas and remixing them into something new. If copyright were extended too broadly we might never have seen another detective story after Edgar Allan Poe’s The Murders in the Rue Morgue, or another Pointillist painting after Georges Seurat’s A Sunday Afternoon on the Island of La Grande Jatte.

In general, artists are comfortable with this kind of “stealing” so long as it pushes the art in new directions. As T. S. Eliot said, “Immature poets imitate; mature poets steal; bad poets deface what they take, and good poets make it into something better, or at least something different.” Austin Kleon, author of Steal Like An Artist, put it more succinctly: “Imitation is not flattery. Transformation is flattery.” Copyright law tries to capture this distinction between good stealing and, well, just plain stealing by requiring that there be “substantial similarity” to a copied work for there to be infringement, and by carving out fair-use exceptions for reasonable sampling in transformative works. Factual works have to be especially similar to be infringing, to the point where it’s perfectly legal (and, as long as credit is given, perfectly acceptable) for newspapers to rewrite their competitors’ reporting in their own voice. For art and fiction, less similarity is needed before a copy counts as infringement, but it’s still legal to copy an artist’s style and general form as long as there are enough differences overall.

The hope, presumably, is that this similarity threshold is a way to allow good copying and outlaw bad copying without forcing judges to decide on the artistic merit of the changes that were made. But what about works that add nothing useful to the original and tweak it just enough to avoid copyright infringement? Take much of what comes out of content farms like Demand Media (eHow.com, answers.com), which is essentially regurgitated content from blogs, Reddit and Wikipedia with just enough rewriting to pass copyright muster, or at least to pass the filters that Google uses to deprioritize such low-value-added content in search results.

In theory, content mills could add value above and beyond the original, but the business model only prioritizes quantity and high search-engine rank (preferably higher than whoever you copied from). In the early 2010s these sites relied mostly on severely underpaid contractors to churn out blog posts for pennies per word, but nowadays more and more of this work is being handed over completely to generative AI. For example, take Content At Scale, which advertises a service that uses generative AI to write a search-engine-optimized blog post or article based just on the set of keywords you want to rank for in web searches. Or they can write articles based on your competitor’s content: “Have a competitor that’s crushing it with their content marketing? Or have awesome thought leaders or content sites in your niche? … Take any existing article, and have a freshly written article created that uses the source URL as context for the all new article.” They can also go straight from podcast or YouTube video to blog post. And just in case you missed what this was really about, they advertise that one of their advantages over existing content mill services (besides price) is that they automatically integrate scans to make sure their posts aren’t flagged as plagiarized or AI-written.

Rewriting someone else’s material, either to avoid copyright infringement or to avoid its detection, is being called copyright laundering, by analogy to money laundering. But unlike money laundering, it’s perfectly legal as long as you change enough to stay below the substantial similarity threshold. And it’s not just news articles and blog posts that are being generated anymore. Just last week Amazon announced that it was limiting the number of books an author can self-publish on Kindle to three per day because of AI-generated content.

No wonder authors are pissed!