For the past couple months I've been working on a website with two main features:
- https://book.sv - put in a list of books and get recommendations on what to read next from a model trained on over a billion reviews
- https://book.sv/intersect - put in a list of books and find the users on Goodreads who have read them all (if you don't want to be included in these results, you can opt-out here: https://book.sv/remove-my-data)
Technical info available here: https://book.sv/how-it-works
Note 1: If you only provide one or two books, the model doesn't have a lot to work with and may include a handful of somewhat unrelated popular books in the results. If you want recommendations based on just one book, click the "Similar" button next to the book after adding it to the input book list on the recommendations page.
Note 2: This is uncommon, but if you get an unexpected non-English titled book in the results, it is probably not a mistake and it very likely has an English edition. The "canonical" edition of a book I use for display is whatever one is the most popular, which is usually the English version, but this is not the case for all books, especially those by famous French or Russian authors.
* UI - once someone clicks "Add" you really should remove that item from the suggested list - it's very confusing to still see it.
* Beam search / diversification -- Your system threw like 100 books at me of which I'd read 95 and heard of 2 of the other 3, so it worked for me as a predictor of what I'd read, but not so well for discovery.
I'd be interested in recommendations that pushed me into a new area, or gave me a surprising read. This is easier to do if you have a fairly complete list of what someone's read, I know. But off the top of my head, I'm imagining finding my eigenfriends, then finding books that are either controversial (very wide rating differences amongst my fellow readers) or possibly ghettoized, that is, some portion of similar readers also read this X or Y subject, but not all.
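The eigenfriends idea could be sketched roughly like this, assuming ratings live in a `{user: {book: stars}}` dict. All names, data shapes, and thresholds here are hypothetical, not the site's actual code:

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two sparse rating dicts."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[x] * b[x] for x in common)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def eigenfriends(me, ratings, k=2):
    """Top-k users most similar to `me` by rating overlap."""
    scored = [(cosine(ratings[me], ratings[u]), u)
              for u in ratings if u != me]
    return [u for _, u in sorted(scored, reverse=True)[:k]]

def controversial(friends, ratings, min_raters=2):
    """Books whose ratings vary most widely among my eigenfriends."""
    by_book = {}
    for u in friends:
        for book, stars in ratings[u].items():
            by_book.setdefault(book, []).append(stars)
    out = []
    for book, stars in by_book.items():
        if len(stars) >= min_raters:
            mean = sum(stars) / len(stars)
            var = sum((s - mean) ** 2 for s in stars) / len(stars)
            out.append((var, book))
    return [b for _, b in sorted(out, reverse=True)]
```

The "ghettoized" variant would be similar: instead of rating variance, score each book by the fraction of eigenfriends who shelved it, and surface books read by some but not most.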
Anyway, thanks, this is fun! Hook up a VLM and let people take pictures of their bookshelf next.
This is how filmaffinity works, which is the best recommendation system I've tried. They have a group of several dozen 'soulmates', which are users with the most similar set of films seen and ratings given; recommendations are other stuff they also liked, and you get direct access to their lists.
>then finding books that are either controversial or possibly ghettoized
Naively, I’d say the surprises are going to be better if you filter for more-different friends, rather than for more-controversial books among your friends. As in “find me a person who’s like me only in some ways, tell me what they love”. Long term this method is much better at exposing you to new ideas rather than just finding your clique's holy wars.
To be useful, the "Intersect" page should have:
- find near matches when there is no exact match for every book,
- ignore fake users (can any human really have read 80k books in many languages?),
- do not ignore users' ratings (my input was books I liked; I expected to find users who rated them highly).
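A rough sketch of how a near-match Intersect could work, assuming shelves are stored as `{user: {book: stars}}`. The function names, thresholds, and the 80k bot cutoff are all made-up illustrations, not the site's actual logic:

```python
def intersect(input_books, shelves, min_overlap=2, min_avg=4.0, max_shelf=80000):
    """Rank users by how many of the input books they read and
    how highly they rated them, instead of requiring all of them."""
    results = []
    for user, rated in shelves.items():
        if len(rated) > max_shelf:  # skip implausible "users" (likely bots)
            continue
        hits = [rated[b] for b in input_books if b in rated]
        if len(hits) >= min_overlap:
            avg = sum(hits) / len(hits)
            if avg >= min_avg:
                results.append((len(hits), avg, user))
    # most books matched first, then highest average rating
    return [u for _, _, u in sorted(results, reverse=True)]
```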
With the "Recommend" page I had the same problem as the GP, and all the recommendations were useless. To fix that, I think some features are needed:
- do not list books by authors from my list (I don't need recommendations for them),
- add a button for marking a suggested book as "disliked" (at the bare minimum, it should remove the book from the suggestions, and ideally it should influence the suggestions as much as a "liked" book),
- do not suggest several books by the same author,
- add a button to hide a suggestion or show more suggestions (there were dozens of books I'd read but wouldn't rate high).
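Most of those features could live in a post-processing pass over the raw model output. A minimal sketch, assuming `recs` is a list of `(title, author)` pairs in model-ranked order (all names and shapes here are hypothetical):

```python
def filter_recs(recs, input_authors, disliked, per_author=1):
    """Drop books by authors already in the user's list, drop disliked
    titles, and keep at most `per_author` books per remaining author."""
    seen = {}
    out = []
    for title, author in recs:
        if author in input_authors or title in disliked:
            continue
        if seen.get(author, 0) >= per_author:
            continue
        seen[author] = seen.get(author, 0) + 1
        out.append((title, author))
    return out
```

Feeding "disliked" back into the model itself, as suggested above, would need more than this filter, but the filter alone already covers the first three bullets.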
"[...] you agree not to sell, license, rent, modify, distribute, copy, reproduce, transmit, publicly display, publicly perform, publish, adapt, edit or create derivative works from any materials or content accessible on the Service. Use of the Goodreads Content or materials on the Service for any purpose not expressly permitted by this Agreement is strictly prohibited."
Also, did the reviewers give you permission to feed their content into an LLM?
In the US there’s a major precedent [0] which held that scraping public-facing pages isn’t a CFAA "unauthorized access" issue. That’s a big part of why we’ve seen entire venture-backed scraping companies pop up - it’s not considered hacking if the data is already public.
[0] https://en.wikipedia.org/wiki/HiQ_Labs_v._LinkedIn
> However, after further appeal in another court, hiQ was found to be in breach of LinkedIn's terms, and there was a settlement.
So why would the same not apply here?
This was part of the terms of the settlement.
Your textbook versus reality conceptualization of things is dogshit. It’s exploitation to do what OP did. You’re endorsing it and minimizing the ethics and this certainly shall poison the well from which you drink. Godspeed.
Out of curiosity, is your point about TOS out of concern for the poster or for Goodreads?
You could try to argue that this falls under "create derivative works from any materials or content accessible on the Service" but even then it seems really flimsy to say that recommending books based on Goodreads reviews is an infringement.
It's just not that different to a youtuber saying "I read reviews for 50 books, here's the ones to read"
I visit your garden and take 1000 apples from your tree.
Not that different.
I suspect we need to wait for the NYT (and others) case to be decided before we know whether scraping sites in contravention of their terms is also fair use for LLM training.
My own opinion (as someone who creates written content on an occasional professional basis) is that if you can’t monetise your content in some other way than blocking people from accessing it then your content probably isn’t as valuable as you think.
But at the same time that’s tricky when it’s genuine journalism, as in NYT’s case.
Obviously user generated content reviewing books online is rather different because the motivation of the reviewers was (presumably) not to generate money. And, indeed, with goodreads there’s a strong argument that people have already been screwed over after their good faith review submissions were packaged up as an asset and flogged to Amazon. A lot of people were quite upset by that when it happened a decade or so back.
So from a ‘moral arguments’ perspective I don’t think scraping goodreads is as problematic as other scraping examples.
(Sorry, none of this was aimed at you - your comment just got me thinking and it seemed as good a place as any to put it!)
IMO, your definition is overbroad
I did notice that when I put in a single book in a series (in my case Going Postal, Discworld #33) that tended to dominate the rest of the selection. That does make sense, but I don't want recommendations for a series I'm already well into.
Also noticed that a few books (Spycraft by Nadine Akkerman and Pete Langman, Tribalism is Dumb by Andrew Heaton) that I know are on Goodreads and reviewed didn't show up in the search. I tried both the authors' names and the titles of the books. Maybe they aren't in the dataset.
It did stumble with some more niche books (The Complete Yes Minister). Trying the "Similar" button gave me more books that were _technically_ similar because they were novelizations of British comedy shows, but not what I was looking for.
For more common books though it lined up very well with books already on my wishlist!
The recommendations from other authors are good, but as far as I can tell I’ve read every single one of them.
Continuing to aggressively add everything it recommends eventually does seem to result in some interesting books I wasn’t familiar with, but I also end up with more and more books that are of zero interest to me.
For what it’s worth, I started with:
It is possible that there simply aren’t many books like these in existence, so the pool of relevant recommendations gets exhausted fairly quickly. I’d guess trending towards unrelated popular books is also just a feature of the source data; that largely sums up my experience with Goodreads anyway. Very cool project though. I did end up ordering a couple of new books, so thank you very much.
I’ve only had a short play but a solution to this problem might be to show authors rather than books. Or select authors outside of the list the user has shared and then a top n (1,3,5) for each of those.
I feel like that’s how you’d recommend to someone else - type of book -> unknown author -> best matching few books from them.
After that, the other side would be trying to find some diversity (if you think I’d like author X, you might suggest three different styles of book from them rather than three very similar ones).
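The author-first idea above could be sketched like this, assuming the model already produces scored candidates as `(score, title, author)` tuples (a made-up shape, purely for illustration):

```python
def by_new_authors(candidates, known_authors, n=3):
    """Group scored candidates by author, drop authors the user
    already knows, and keep the top-n titles per new author."""
    grouped = {}
    for score, title, author in sorted(candidates, reverse=True):
        if author in known_authors:
            continue
        grouped.setdefault(author, [])
        if len(grouped[author]) < n:
            grouped[author].append(title)
    return grouped
```

A diversity-aware version would pick the n titles per author to be maximally dissimilar to each other rather than simply the n highest-scored.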
If you haven't already read it, you might like Lawrence Durrell's Antrobus [1].
[1] https://www.goodreads.com/book/show/759709.Antrobus_complete
My advice, from someone who has built recommendation systems: now comes the hard part! It seems like a lot of the feedback here is that it's operating pretty heavily like a content-based system, which is fine. But this is where you can probably start evaluating on other metrics like serendipity, novelty, etc. One of the best things I did for recommender systems in production was having different ones for different purposes, then aggregating them together into a final ranking. Have a heavy content-based one to keep people in the rabbit hole. Have a heavy graph-based one to traverse and find new stuff. Have one that is heavily tuned on a specific metric for a specific purpose. Hell, throw in a pure TF-IDF/BM25/SPLADE-based one.
The real trick of rec systems is that different people want to be recommended things differently. Having multiple systems that you can weigh differently per user is one way to achieve that; a single algorithm usually can't do it effectively.
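A minimal sketch of that blending step, assuming each recommender is just a function returning `{item: score}` and the per-user weights come from elsewhere (all names and weights are illustrative):

```python
def blend(recommenders, weights, user):
    """Combine several recommenders into one ranking by taking a
    per-user weighted sum of their item scores."""
    combined = {}
    for name, rec in recommenders.items():
        w = weights.get(name, 1.0)
        for item, score in rec(user).items():
            combined[item] = combined.get(item, 0.0) + w * score
    # highest combined score first
    return sorted(combined, key=combined.get, reverse=True)
```

Tuning `weights` per user (e.g. up-weighting the graph-based one for users who click on novel items) is where the "recommended differently" part comes in.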
No, or at least make it configurable.
I’d agree for series, but not for Authors, just because I once read a book by someone doesn’t mean I even know they have other stuff, the list of Authors I read and enjoyed is very long.
VERY few authors write consistently good books.
If you liked one book by an author, it is not at all likely that you will like the other books as well. For example, Neal Stephenson is probably my favorite author alive today, but I hate almost half of his books.
The only author that I can think of where I read and liked every single book was Terry Pratchett, and that might have been a case of "I was still young and easy to impress".
I also need a way to describe its recommendations as "meh". For example, if I put Gone Girl in, I get Girl on a Train. Which, personally, I thought was bad. I want to exclude that from all future rec sets, and ideally align my preferences to the intersection of liked A and disliked B. vOv
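One simple way to use that "liked A, disliked B" signal: score a candidate by its similarity to liked books minus its similarity to disliked ones. A sketch, assuming some book-to-book similarity function `sim(a, b)` in [0, 1] exists (everything here is a made-up illustration):

```python
def score(candidate, liked, disliked, sim):
    """Reward closeness to liked books, penalize closeness to
    disliked ones; negative scores can be dropped outright."""
    pos = max((sim(candidate, b) for b in liked), default=0.0)
    neg = max((sim(candidate, b) for b in disliked), default=0.0)
    return pos - neg
```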
https://www.goodreads.com/robots.txt
So legalities aside, this seems unethical.
This obsession with "everything must be commercialized" is really killing creativity.
Now if the author were commercializing other people's reviews, sure, it's potentially(!) unethical. But scraping a website for reviews that are publicly(!) posted, training a recommendation LLM and then sharing it, for free, seems ... exactly the ideal use case for this technology.
At the same time, everything you ever posted online has already been scraped by hundreds (maybe thousands) of entities and distributed/sold to countless other entities. The only difference is that OP shared his project here.
Blindly violating it is bad manners, but deliberately scraping a single website over a month isn't the worst.
I would also really like a possibility to add negative signal. It did also recommend books that seemed interesting to me but I ultimately didn’t like.
Overall quite impressive.
Sadly my experience with the book recommender isn't too great because of the 64-book limit. If I import either the most recent or least recent 64 books, 95% of the books it recommends to me are books I've read. Though it was helpful for spotting a few books I'd read but hadn't logged on Goodreads. Guess I'm pretty consistent.