British author Hari Kunzru posted a link to prosecraft.io on X (formerly Twitter), alleging that the website appears “to have stolen a lot of books, trained an AI, and are now offering a service based on that data.” He threaded a screenshot of the site's analysis of his novel White Tears, which included measures such as number of words, vividness, passive voice, and use of adverbs, and added: “I did not consent to this use of my work.”
Seven hours later, Prosecraft creator Benji Smith took down the website. “Your feelings are legitimate, and I hope you’ll accept my sincerest apologies. I care about stories. I care about publishing. I care about authors,” he wrote in an Aug. 7 blog post.
During those seven hours, several other authors said their books were used by the site without permission. Celeste Ng of Little Fires Everywhere fame pointed out that the site was inundated with data from scores of books by authors big and small, including Stephen King and Jodi Picoult, adding that “there’s a limit to what data can teach you about writing” anyway. Amid the uproar, authors are also urging publishers to take AI theft seriously, and a handful are even threatening class-action lawsuits against Smith.
But deleting Prosecraft has only buried the problem, not solved it.
Prosecraft’s evolution, by the digits
20+ years: How long Benji Smith has been working in the field of computational linguistics and machine learning. During this time, he says, “I was always frustrated that the fancy tools were only accessible to big businesses and government spy agencies. I wanted to bring that magic to everyone.”
10+ years: How long ago Smith started the project, which took shape as he began writing his memoir. It started with him manually counting words in paperbacks off his own bookshelves. Later, the spreadsheet-turned-cloud-database was populated with more books he found via web crawlers across authors and genres.
25,000: Number of books that were used in the linguistic analysis of literature on the Prosecraft website in the six years that it was up and running. The data was meant to serve as “a suite of ‘lexicographic’ tools that budding writers could use, to compare their own writing with the writing of authors they admire,” Smith wrote.
The loose ends left over in life after Prosecraft
Before taking the site offline, Smith briefly suggested that authors who wanted their work taken down send him an email with a link to their work, and he’d remove it. Authors, already annoyed, were infuriated by the suggestion that this database be opt-out rather than opt-in.
Kunzru had questioned whether Smith used shadow libraries (online databases that provide access to millions of books and articles that are often out of print, hard to obtain, or paywalled) to pirate content for the database. In his blog post, Smith denied it. Regardless, the claim that he only used “publicly available” works is shaky at best. “The ‘most vivid page’ excerpt from my book was literally the most spoilery moment of the climax, not published publicly, not scrapable… unless you were scraping book pirating sites?” author Diane Urban wrote. Others chimed in with similar stories.
Moreover, some authors are worried he’s still holding on to the data. Even with the website gone, there’s no guarantee the data is no longer accessible, especially after a user revealed that Smith had asked for help in March to “finetune/train” a large language model called GPT-NeoX.
While Smith was apologetic about stepping on authors’ toes, he still maintains that his now-defunct project fell under “fair use.” He acknowledged that “the arrival of AI on the scene has been tainted by early use-cases that allow anyone to create zero-effort impersonations of artists” but continued to tout his collection and analysis of entire books as separate from other AI uses. How exactly? He didn’t elaborate.
Person of interest: Jane Friedman
One author is tackling a different tech-and-plagiarism menace over at behemoth Amazon.
Jane Friedman, a veteran publisher and the author of The Business of Being a Writer, found about half a dozen books with her name printed on them selling on Amazon. The titles were also on Goodreads under her author profile. The only problem was: She didn’t write them. “Most likely they’ve been generated by AI,” she wrote in a blog post.
Hours after she complained, the titles disappeared from her Goodreads profile. But they remained available for sale on Amazon, because she has no copyright claim to these texts and no trademark for her name.
“We desperately need guardrails on this landslide of misattribution and misinformation,” she lamented. “Amazon and Goodreads, I beg you to create a way to verify authorship, or for authors to easily block fraudulent books credited to them. Do it now, do it quickly.”
Quotable: AI is coming for authors, visual artists, musicians, and more
“I also would like to remind you that the works of thousands of living visual artists are still in giant databases scraped from the web without their permissions, at least most of them never consent to such thing, and they are victimized just like the authors victimized by this ‘creatively curious engineer.’ If you are an author using Midjourney to do your covers please remember that MJ most likely uses such database or its variations, although they never publicly stated it. We are all on the same boat when it comes to genAI.”
— Kürşat Yilmaz, a senior 3D CGI artist who has worked on DC’s Black Adam