For the past couple months I've been working on a website with two main features:
- https://book.sv - put in a list of books and get recommendations on what to read next from a model trained on over a billion reviews
- https://book.sv/intersect - put in a list of books and find the users on Goodreads who have read them all (if you don't want to be included in these results, you can opt-out here: https://book.sv/remove-my-data)
Technical info available here: https://book.sv/how-it-works
Note 1: If you only provide one or two books, the model doesn't have a lot to work with and may include a handful of somewhat unrelated popular books in the results. If you want recommendations based on just one book, click the "Similar" button next to the book after adding it to the input book list on the recommendations page.
Note 2: This is uncommon, but if you get an unexpected non-English titled book in the results, it is probably not a mistake and it very likely has an English edition. The "canonical" edition of a book I use for display is whatever one is the most popular, which is usually the English version, but this is not the case for all books, especially those by famous French or Russian authors.
* UI - once someone clicks "Add" you really should remove that item from the suggested list - it's very confusing to still see it.
* Beam search / diversification -- Your system threw like 100 books at me of which I'd read 95 and heard of 2 of the other 3, so it worked for me as a predictor of what I'd read, but not so well for discovery.
I'd be interested in recommendations that pushed me into a new area, or gave me a surprising read. This is easier to do if you have a fairly complete list of what someone's read, I know. But off the top of my head, I'm imagining finding my eigenfriends, then finding books that are either controversial (very wide rating differences amongst my fellow readers) or possibly ghettoized, that is, some portion of similar readers also read this X or Y subject, but not all.
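The eigenfriends idea could be sketched roughly like this, assuming ratings live in a `{user: {book: stars}}` dict. All names, data shapes, and thresholds here are hypothetical, not the site's actual code:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse rating dicts."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[x] * b[x] for x in common)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def eigenfriends(me, ratings, k=2):
    """Top-k users most similar to `me` by rating overlap."""
    scored = [(cosine(ratings[me], ratings[u]), u)
              for u in ratings if u != me]
    return [u for _, u in sorted(scored, reverse=True)[:k]]

def controversial(friends, ratings, min_raters=2):
    """Books whose ratings vary most widely among my eigenfriends."""
    by_book = {}
    for u in friends:
        for book, stars in ratings[u].items():
            by_book.setdefault(book, []).append(stars)
    out = []
    for book, stars in by_book.items():
        if len(stars) >= min_raters:
            mean = sum(stars) / len(stars)
            var = sum((s - mean) ** 2 for s in stars) / len(stars)
            out.append((var, book))
    return [b for _, b in sorted(out, reverse=True)]
```

The "ghettoized" variant would be similar: instead of rating variance, score each book by the fraction of eigenfriends who shelved it, and surface books read by some but not most.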
Anyway, thanks, this is fun! Hook up a VLM and let people take pictures of their bookshelf next.
This is how filmaffinity works, which is the best recommendation system I've tried. They have a group of several dozen 'soulmates', which are users with the most similar set of films seen and ratings given; recommendations are other stuff they also liked, and you get direct access to their lists.
>then finding books that are either controversial or possibly ghettoized
Naively, I’d say the surprises are going to be better if you filter for more-different friends, rather than for more-controversial books among your friends. As in “find me a person who’s like me only in some ways, tell me what they love”. Long term this method is much better at exposing you to new ideas rather than just finding your clique's holy wars.
To be useful, the "Intersect" page should have:
- find near matches when there is no exact match for every book,
- ignore fake users (can any human really have read 80k books in many languages?),
- do not ignore users' ratings (my input was books I liked; I expected to find users who rated them highly).
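A rough sketch of how a near-match Intersect could work, assuming shelves are stored as `{user: {book: stars}}`. The function names, thresholds, and the 80k bot cutoff are all made-up illustrations, not the site's actual logic:

```python
def intersect(input_books, shelves, min_overlap=2, min_avg=4.0, max_shelf=80000):
    """Rank users by how many of the input books they read and
    how highly they rated them, instead of requiring all of them."""
    results = []
    for user, rated in shelves.items():
        if len(rated) > max_shelf:  # skip implausible "users" (likely bots)
            continue
        hits = [rated[b] for b in input_books if b in rated]
        if len(hits) >= min_overlap:
            avg = sum(hits) / len(hits)
            if avg >= min_avg:
                results.append((len(hits), avg, user))
    # most books matched first, then highest average rating
    return [u for _, _, u in sorted(results, reverse=True)]
```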
With the "Recommend" page I had the same problem as the GP, and all the recommendations were useless. To fix that, I think some features are needed:
- do not list books by authors from my list (I don't need recommendations for them),
- add a button for marking a suggested book as "disliked" (at the bare minimum, it should remove the book from the suggestions, and ideally it should influence the suggestions as much as a "liked" book),
- do not suggest several books by the same author,
- add a button to hide a suggestion or show more suggestions (there were dozens of books I'd read but wouldn't rate high).
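Most of those features could live in a post-processing pass over the raw model output. A minimal sketch, assuming `recs` is a list of `(title, author)` pairs in model-ranked order (all names and shapes here are hypothetical):

```python
def filter_recs(recs, input_authors, disliked, per_author=1):
    """Drop books by authors already in the user's list, drop disliked
    titles, and keep at most `per_author` books per remaining author."""
    seen = {}
    out = []
    for title, author in recs:
        if author in input_authors or title in disliked:
            continue
        if seen.get(author, 0) >= per_author:
            continue
        seen[author] = seen.get(author, 0) + 1
        out.append((title, author))
    return out
```

Feeding "disliked" back into the model itself, as suggested above, would need more than this filter, but the filter alone already covers the first three bullets.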
"[...] you agree not to sell, license, rent, modify, distribute, copy, reproduce, transmit, publicly display, publicly perform, publish, adapt, edit or create derivative works from any materials or content accessible on the Service. Use of the Goodreads Content or materials on the Service for any purpose not expressly permitted by this Agreement is strictly prohibited."
Also, did the reviewers give you permission to feed their content into an LLM?
In the US there’s a major precedent [0] which held that scraping public-facing pages isn’t a CFAA "unauthorized access" issue. That’s a big part of why we’ve seen entire venture-backed scraping companies pop up - it’s not considered hacking if the data is already public.
[0] https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn
> However, after further appeal in another court, hiQ was found to be in breach of LinkedIn's terms, and there was a settlement.
So why would the same not apply here?
This was part of the terms of the settlement.
Your textbook versus reality conceptualization of things is dogshit. It’s exploitation to do what OP did. You’re endorsing it and minimizing the ethics and this certainly shall poison the well from which you drink. Godspeed.
Out of curiosity, is your point about TOS out of concern for the poster or for Goodreads?
You could try to argue that this falls under "create derivative works from any materials or content accessible on the Service" but even then it seems really flimsy to say that recommending books based on Goodreads reviews is an infringement.
It's just not that different to a youtuber saying "I read reviews for 50 books, here's the ones to read"
I visit your garden and take 1000 apples from your tree.
Not that different.
I suspect we need to wait for the NYT (and others) case to be decided before we know whether scraping sites in contravention of their terms is also fair use for LLM training.
My own opinion (as someone who creates written content on an occasional professional basis) is that if you can’t monetise your content in some other way than blocking people from accessing it then your content probably isn’t as valuable as you think.
But at the same time that’s tricky when it’s genuine journalism, as in NYT’s case.
Obviously user generated content reviewing books online is rather different because the motivation of the reviewers was (presumably) not to generate money. And, indeed, with goodreads there’s a strong argument that people have already been screwed over after their good faith review submissions were packaged up as an asset and flogged to Amazon. A lot of people were quite upset by that when it happened a decade or so back.
So from a ‘moral arguments’ perspective I don’t think scraping goodreads is as problematic as other scraping examples.
(Sorry, none of this was aimed at you - your comment just got me thinking and it seemed as good a place as any to put it!)
IMO, your definition is overbroad
I did notice that when I put in a single book in a series (in my case Going Postal, Discworld #33) that tended to dominate the rest of the selection. That does make sense, but I don't want recommendations for a series I'm already well into.
Also noticed that a few books (Spycraft by Nadine Akkerman and Pete Langman, Tribalism is Dumb by Andrew Heaton) that I know are on Goodreads and reviewed didn't show up in the search. I tried both the authors' names and the titles of the books. Maybe they aren't in the dataset.
It did stumble with some more niche books (The Complete Yes Minister). Trying the "Similar" button gave me more books that were _technically_ similar because they were novelizations of British comedy shows, but not what I was looking for.
For more common books though it lined up very well with books already on my wishlist!
The recommendations from other authors are good, but as far as I can tell I’ve read every single one of them.
Continuing to aggressively add everything it recommends eventually does seem to result in some interesting books I wasn’t familiar with, but I also end up with more and more books that are of zero interest to me.
For what it’s worth, I started with:
It is possible that there simply aren’t many books like these in existence, so the pool of relevant recommendations gets exhausted fairly quickly. I’d guess trending towards unrelated popular books is also just a feature of the source data; that largely sums up my experience with Goodreads anyway. Very cool project though. I did end up ordering a couple of new books, so thank you very much.
I’ve only had a short play but a solution to this problem might be to show authors rather than books. Or select authors outside of the list the user has shared and then a top n (1,3,5) for each of those.
I feel like that’s how you’d recommend to someone else - type of book -> unknown author -> best matching few books from them.
After that, the other side would be trying to find some diversity (if you think I’d like author X, you might suggest three different styles of book from them rather than three very similar ones).
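The author-first idea above could be sketched like this, assuming the model already produces scored candidates as `(score, title, author)` tuples (a made-up shape, purely for illustration):

```python
def by_new_authors(candidates, known_authors, n=3):
    """Group scored candidates by author, drop authors the user
    already knows, and keep the top-n titles per new author."""
    grouped = {}
    for score, title, author in sorted(candidates, reverse=True):
        if author in known_authors:
            continue
        grouped.setdefault(author, [])
        if len(grouped[author]) < n:
            grouped[author].append(title)
    return grouped
```

A diversity-aware version would pick the n titles per author to be maximally dissimilar to each other rather than simply the n highest-scored.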
If you haven't already read it, you might like Lawrence Durrell's Antrobus [1].
[1] https://www.goodreads.com/book/show/759709.Antrobus_complete
My advice, from someone who has built recommendation systems: now comes the hard part! It seems like a lot of the feedback here is that it's operating pretty heavily like a content-based system, which is fine. But this is where you can probably start evaluating on other metrics like serendipity, novelty, etc. One of the best things I did for recommender systems in production was having different ones for different purposes, then aggregating them together into a final ranking. Have a heavy content-based one to keep people in the rabbit hole. Have a heavy graph-based one to traverse and find new stuff. Have one that is heavily tuned on a specific metric for a specific purpose. Hell, throw in a pure TF-IDF/BM25/SPLADE-based one.
The real trick of rec systems is that different people want to be recommended things differently. Having multiple systems that you can weigh differently per user is one way to achieve that; a single algorithm usually can't do it effectively.
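A minimal sketch of that blending step, assuming each recommender is just a function returning `{item: score}` and the per-user weights come from elsewhere (all names and weights are illustrative):

```python
def blend(recommenders, weights, user):
    """Combine several recommenders into one ranking by taking a
    per-user weighted sum of their item scores."""
    combined = {}
    for name, rec in recommenders.items():
        w = weights.get(name, 1.0)
        for item, score in rec(user).items():
            combined[item] = combined.get(item, 0.0) + w * score
    # highest combined score first
    return sorted(combined, key=combined.get, reverse=True)
```

Tuning `weights` per user (e.g. up-weighting the graph-based one for users who click on novel items) is where the "recommended differently" part comes in.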
No, or at least make it configurable.
I’d agree for series, but not for Authors, just because I once read a book by someone doesn’t mean I even know they have other stuff, the list of Authors I read and enjoyed is very long.
VERY few authors write consistently good books.
If you liked one book by an author, it is not at all likely that you will like the other books as well. For example, Neal Stephenson is probably my favorite author alive today, but I hate almost half of his books.
The only author that I can think of where I read and liked every single book was Terry Pratchett, and that might have been a case of "I was still young and easy to impress".
I also need a way to describe its recommendations as "meh". For example, if I put Gone Girl in, I get Girl on a Train. Which, personally, I thought was bad. I want to exclude that from all future rec sets, and ideally align my preferences to the intersection of liked A and disliked B. vOv
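One simple way to use that "liked A, disliked B" signal: score a candidate by its similarity to liked books minus its similarity to disliked ones. A sketch, assuming some book-to-book similarity function `sim(a, b)` in [0, 1] exists (everything here is a made-up illustration):

```python
def score(candidate, liked, disliked, sim):
    """Reward closeness to liked books, penalize closeness to
    disliked ones; negative scores can be dropped outright."""
    pos = max((sim(candidate, b) for b in liked), default=0.0)
    neg = max((sim(candidate, b) for b in disliked), default=0.0)
    return pos - neg
```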
https://www.goodreads.com/robots.txt
So legalities aside, this seems unethical.
This obsession with "everything must be commercialized" is really killing creativity.
Now if the author were commercializing other people's reviews, sure, it's potentially(!) unethical. But scraping a website for reviews that are publicly(!) posted, training a recommendation LLM and then sharing it, for free, seems ... exactly the ideal use case for this technology.
At the same time, everything you ever posted online has already been scraped by hundreds (maybe thousands) of entities and distributed/sold to countless other entities. The only difference is that OP shared his project here.
Blindly violating it is bad manners, but deliberately scraping a single website over a month isn't the worst.
I would also really like a possibility to add negative signal. It did also recommend books that seemed interesting to me but I ultimately didn’t like.
Overall quite impressive.
Sadly my experience with the book recommender isn't too great because of the 64-book limit. If I import either the most recent or least recent 64 books, 95% of the books it recommends to me are books I've read. Though it was helpful for spotting a few books I'd read but hadn't logged on Goodreads. Guess I'm pretty consistent.