Fear of AI just killed a useful tool

I found this article frustratingly vague on how prosecraft.io actually worked. As far as I can tell, the author scraped the web for books, including in-copyright books. Then he analyzed it with techniques based on "classical" natural language processing techniques, rather than transformers or deep learning. He appears to have retained the books he scraped for future analysis. The site itself seems to use only snippets.

However, the apology [0] says that the creator did not "intend" to participate in AI that can "create zero-effort impersonations of artists." I'm not sure if the wording is unintentionally vague, or if there is some way his project could be used in that way.

For what it's worth, the Computational Story Lab's hendometer [1] seems to have largely out-of-copyright books from Project Gutenberg, plus the Harry Potter series.

[0]: https://blog.shaxpir.com/taking-down-prosecraft-io-37e189797...

[1]: https://hedonometer.org/books/v3/863/

Edit: Apparently he was working on an LLM project. https://twitter.com/stealcase/status/1688721685585809408. It's unclear whether he was planning to use the books he scraped (although as @stealcase points out, GPT-Neox itself was trained on books that were pirated).

randallsquared · 3 years ago

If he says he didn't do something, the pitchfork mob will simply tell each other that he is lying. They will do this in the most confused and twisted way possible, driven by lack of understanding of what was happening combined with a need to drive outrage and thereby advertise their work.

If he says he didn't intend to do that thing, this is still compatible with a later update that he didn't do that thing, but immediately dampens the outrage machine. The reader who knows nothing about either side relaxes -- "No need for me to get worked up, because we won". Conveniently, saying he didn't intend to do the thing is also compatible with a later reveal that he was doing the thing (perhaps for later release, since he wasn't clearly doing the thing here).

Therefore, regardless of whether he was doing what he was accused of doing, this is the lowest energy response, and probably the default unless this was the hill he wanted to defend.

b112 · 3 years ago

The best response, for us all collectively, is to always ignore everyone's opinion online. There is zero value in anything on reddit, twitter, facebook, the media these days.

Just ignore it. All of it. Outrage or not.

I see downvotes, but I mean it. You know who you listen to? Your friends. Your neighbours. Your local community. You listen to PEOPLE, not sockpuppets. You listen to legitimate human beings, not AI generated blather, or curated news stories, or groups working together to generate hate, outrage, to stoke anger, upset.

You listen to actual, real PEOPLE.

You want to go to reddit? Twitter? Anything? Fine. But treat it as 100% fiction, pure entertainment, and never let it affect YOU.

fenomas · 3 years ago

> However, the apology [0] says that the creator did not "intend" to participate in AI that can "create zero-effort impersonations of artists." I'm not sure if the wording is unintentionally vague, or if there is some way his project could be used in that way.

This seems FUDdy. "Intend" isn't in the apology at all, and the wording that is there says clearly that generative AI came after prosecraft, so there's no way the tool could be used for it.

> It's unclear whether he was planning to use the books he scraped

This also seems unwarranted. The tweet about fine-tuning an LLM came 5-6 years after the guy made prosecraft; why suggest they might involve the same dataset?

actuallyalys · 3 years ago

I apologize for the quotes around intend. I wrote it without, then I forgot it was a paraphrase and added them back again. Unfortunately, I cannot edit my comment to fix that.

I do think “intend” is a reasonable paraphrase of “never wanted to.”

(Edited to add) I don’t think prosecraft was a finished project and he was definitely still working on his other tool for writers that incorporates some of the same tools.

> The tweet about fine-tuning an LLM came 5-6 years after the guy made prosecraft; why suggest they might involve the same dataset?

The reason being that he had mentioned he was planning to use the scraped books for future analysis.

h11h · 3 years ago

Even Facebook's Llama was trained on books3, a dump of pirated books.

stusmall · 3 years ago

It's so mind blowing to me that it made it past corporate legal. I don't get what defense there could be besides "lmao try and stop me, nerds"

If you want to do this kind of thing, let authors opt-in (or publishers).

Yes, it will take effort and probably go slow, but if the tool is really useful and amazing, it should be doable.

I suspect the authors are put-off by a couple things:

- the text of the works scanned seems like it may be from pirated sources. That poisons the project, no matter what it does with the scans, for many authors.

- the use of these scans in a commercial product

The article itself is clueless… it doesn’t engage authors’ concerns at all, and just portrays authors as stoopid AI-fearful luddites.

bogwog · 3 years ago

> If you want to do this kind of thing, let authors opt-in (or publishers).

If it's fair use, why should you have to do that? The same copyright law protecting author's ownership rights over their art also provide "fair use" to other people. Someone may disagree with current fair use laws (and I suspect many outraged here do not), but that's a broader issue not related to this particular tool. It just 100% seems like misdirected AI outrage.

> the text of the works scanned seems like it may be from pirated sources.

Do you have a source for this? I didn't see that mentioned in the article.

jmull · 3 years ago

> Do you have a source for this? I didn't see that mentioned in the article.

The person who runs prosecraft says "I looked to the internet for more text that I could analyze, and I used web crawlers to find more books." [0]

I'm just inferring, but if they had, say, purchased each of these books, or borrowed them from the library, or only sourced from sites that ensure the copyright is satisfied, then they might have mentioned it.

(FWIW, the blog post says the other source for the 25K works was their personal library, so I'm assuming the bulk of the 25K come from the internet, though I know some people have prodigious personal libraries.)

[0] https://blog.shaxpir.com/taking-down-prosecraft-io-37e189797...

JohnFen · 3 years ago

> If it's fair use, why should you have to do that?

You may not be legally required to do that, but it can be an excellent move that benefits you nonetheless.

Much like how Weird Al isn't legally required to get permission to make a parody of a popular song, but he does so anyway.

But in this case, I don't think you even need to invoke Fair Use. I think what he did simply isn't a copyright violation in the first place.

In reality, the legality of this was never the issue anyway. The issue was that doing this made the authors angry, and the dev didn't want that.

dingaling · 3 years ago

> also provide "fair use" to other people

"How much of someone else's work can I use without getting permission?

Under the fair use doctrine of the U.S. copyright statute, it is permissible to use limited portions of a work including quotes, for purposes such as commentary, criticism, news reporting, and scholarly reports."

https://www.copyright.gov/help/faq/faq-fairuse.html

Limited portions, not the entire work.

macNchz · 3 years ago

> The article itself is clueless… it doesn’t engage authors’ concerns at all, and just portrays authors as stoopid AI-fearful luddites.

Going off of some of the tweets about this that initially whipped up the outrage about this…it’s not like they were making a nuanced case about their concerns, they were basically just stomping their feet and shouting.

jmull · 3 years ago

That's twitter generally.

If your engagement only reaches the level of twitter, you aren't really engaging at all.

Deleted Comment

jasonlotito · 3 years ago

> If you want to do this kind of thing, let authors opt-in (or publishers).

"This kind of thing" is factual information about the book, such as page or word count, ly-adverb count, etc. Small snippets, something permissible under copyright law today, that were heavily editorialized and commented on were displayed.

To suggest that counting words and pages is something that should not be allowed is silly.

> The article itself is clueless…

Says the person making stuff up to force a narrative.

The person doing this had the rights to do this, and was very clearly within his rights to do this under copyright law. Counting words is not a crime.

tensor · 3 years ago

> The article itself is clueless… it doesn’t engage authors’ concerns at all, and just portrays authors as stoopid AI-fearful luddites.

The authors quotes speak for themselves. They very clearly and ignorantly claimed that this was an "AI training project" when it was nothing of the sort.

dxbydt · 3 years ago

> it doesn’t engage authors’ concerns at all, and just portrays authors as stoopid AI-fearful luddites.

https://twitter.com/scumbelievable/status/168915466478730444...

So the two authors who are gloating about "killed that stupid fuckin AI thing" - I'm supposed to be engaged with their concerns ? Please.

rockemsockem · 3 years ago

Statistical analysis is only useful if you have enough data to analyse, so there is in fact a threshold of number of books to cross before the tool can even really exist. If you read his post, the initial goal was to get stats about typical word count, typical amount of passive speech, etc. Requiring opt-in for these broad statistics, through outrage only since this project is CLEARLY legal in the United States, means that tools like this will never exist. Which seems net bad to me.

If you are saying it should be opt-in only for the pages analyzing specific books, like the instigator of this outrage screen-shotted, well that seems to fall squarely into the critical analysis bucket, so that is also quite ridiculous.

I understand some folks being unhappy that a portion of the works were pirated, but it seems like most of the outraged would be outraged even if he personally purchased each and every ebook.

Also, if you read through the Twitter thread a lot of the authors (not 100%, but a LOT) are doing a really great job portraying themselves as "stoopid AI-fearful luddites". Many of them think the site is somehow like ChatGPT and they don't bother to dig any deeper, or really at all.

dfxm12 · 3 years ago

Yeah, the article represents the voice of the authors in two tweets, from authors not apparently notable enough to have a wikipedia page. One I couldn't even find on Goodreads. It's obvious there's more to this than just the tweets presented. The article is unhelpful in this regard.

aredox · 3 years ago

Jeff VanderMeer is not notable enough?

paulluuk · 3 years ago

While I would agree in theory that a project like this would be best with opt-in, in reality that would just not work. Publishers would never opt-in to it, if they even respond to your requests at all.

wokwokwok · 3 years ago

Then don't do it?

Or, if you do it, do it privately and don't share it on the internet?

I'm not sure why this is a difficult idea; if asking for something and getting permission to do it is so difficult that 'would just not work. Publishers would never opt-in to it'

...then, it seems really obvious that even if you want to do it, technically can do it and you could maybe make a legal argument to doing it doesn't violate any laws...

...why would you do it? Why would you post about doing it?

Come on, that's literally being a selfish dick; spitting in people's faces and waving a 'too bad, you can't sue me' flag.

There are so many things, so many mannnny things that you could work on, why would you choose to pick something that you knew would upset people and you knew you wouldn't get permission to do if you asked?

zarzavat · 3 years ago

Authors are not demigods, they don’t have a right to control the use of their works, only the reproduction.

When you publish a book you “consent” to the fact that people are going to take it apart, talk about it, review it, quote from it, and yes run statistics on it. If an author doesn’t want that to happen then they shouldn’t publish a book. Just keep it private, only distribute it to people you trust after they sign an NDA.

As far as anyone knows, no piracy has occurred. In the US you are allowed to scan books, index them, and post excerpts - it’s called Google Books and there was a big case that affirmed that it is legal. Downloading a book from a pirate website for the purpose of indexing by a computer program is not piracy, you have simply outsourced the scanning stage to someone else. It is only an issue if you download from some p2p protocol (such as a torrent) that also uploads and shares the book.

OkayPhysicist · 3 years ago

Because the authors were AI-fearful luddites. From "Book" to "Program that judges books" lies well beyond any argument that the use of the derivative work could supersede the original. It's such clear cut transformative use that the authors come across as grossly misinformed about copyright law as a whole.

Perhaps there is an argument for generative AI possibly superseding the original, in that people might start asking an AI to generate them stories "in the style of x" instead of buying the author's books, but this wasn't that. It was just some fun data analysis of books.