For years now I've wanted to develop a program that would be able to find similar websites. I can write a crawler, no problem, but I haven't been able to figure out the classification side of things. I tried a Bayesian approach without much success. Would ChatGPT or Llama be able to do this?
See also this post by marginalia[1] (HN discussion[2]) which discusses the same thing.
[1] https://memex.marginalia.nu/log/69-creepy-website-similarity...
[2] https://news.ycombinator.com/item?id=34143101
1. The current embedding models are designed for relatively small amounts of content rather than whole web pages. For example, most of the popular Hugging Face sentence-transformers max out at 256 tokens, which is only enough for a small part of most web pages. There are models for larger inputs, e.g. Longformer and BigBird, but they're still only 4096 tokens, and there aren't any ready-made implementations for cosine or dot product similarity at the moment, so you'd need to roll your own. Some of the new OpenAI embedding models do go to 8191 tokens and can be used for cosine similarity, and costs may be acceptable for the content embeddings; but for the query embeddings you'd have to be super-careful about costs, because if you put any kind of search on the public internet you could soon find yourself overwhelmed by SEO spam bots (most of which even Cloudflare can't block), and that could become prohibitively expensive to service via a paid API like OpenAI's. A popular approach to the token-length issue is to chunk your pages into multiple blocks.
2. Even if you get the embedding and chunking approach working, that is still only page-level comparison, and single web pages are not necessarily representative of whole web sites. For example, if you compare home pages, many blogs just link to the posts page from the home page without any of the actual post content. And some websites cover an eclectic range of topics; if your goal is to find other sites covering a similarly eclectic range, you need an embedding for the whole site rather than for individual pages. Not sure about the best way to address this one. Apparently averaging embeddings can work surprisingly well, but I haven't tried it yet (a rough sketch of chunking plus averaging follows below). Other options might include summarising (chunked) pages and then summarising the summaries for the site, or using topic modelling (e.g. BERTopic) alongside some kind of website taxonomy, or something like that. Keen to read of other possible approaches here.
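To make points 1 and 2 concrete, here is a minimal sketch of the chunk-then-average idea, assuming a small sentence-transformers model (all-MiniLM-L6-v2 is my choice; any similar model works) and invented page text:

    # Sketch: split pages into chunks, embed each chunk, average the
    # chunk vectors into one site-level vector, then compare sites by
    # cosine similarity. Page text below is made up for illustration.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")   # ~256-token input window

    def chunk(text, max_words=180):
        """Naively split a page into word-count-limited chunks."""
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

    def site_embedding(pages):
        """Embed every chunk of every page and average into one site vector."""
        chunks = [c for page in pages for c in chunk(page)]
        vecs = model.encode(chunks, normalize_embeddings=True)
        site = vecs.mean(axis=0)
        return site / np.linalg.norm(site)

    site_a = site_embedding(["long text of blog post one ...", "post two ..."])
    site_b = site_embedding(["an article about something else ..."])
    print(float(site_a @ site_b))   # cosine similarity between the two sites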
Not had the same success with word models, although it is probably largely a matter of finding an appropriate embedding.
The data is pretty stale, but here is a demo: https://explore2.marginalia.nu/
OP (or anyone else): Shoot me an email and I'll give you a copy of Marginalia's link graph to play with.
There are plenty of ways. For example, you can render the home page of a website to an image, then run CLIP to get a feature vector, then use an approximate nearest-neighbor search library like FAISS or HNSWlib to index it.
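A minimal sketch of that screenshot route, assuming you already have home-page screenshots on disk (the filenames are hypothetical, and I'm using the CLIP model shipped via sentence-transformers plus FAISS):

    # Embed screenshots with CLIP and index them for nearest-neighbor search.
    import faiss
    import numpy as np
    from PIL import Image
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("clip-ViT-B-32")

    shots = ["siteA.png", "siteB.png", "siteC.png"]      # hypothetical screenshots
    vecs = np.asarray(model.encode([Image.open(p) for p in shots]), dtype="float32")
    faiss.normalize_L2(vecs)                             # so inner product = cosine

    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(vecs)

    scores, ids = index.search(vecs[:1], 3)              # sites most like siteA
    print([(shots[i], float(s)) for i, s in zip(ids[0], scores[0])])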
Or you can ask ChatGPT, or another neural network, to summarize the web pages into a short description and then turn that into a feature vector. Old-school approaches are things like bags of words for document classification. Then you run a hierarchical clustering algorithm (something like hierarchical k-means). This lets you present things that are similar, but not so similar that they are duplicates.
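A rough sketch of the old-school route with scikit-learn (agglomerative clustering instead of hierarchical k-means, and invented example pages):

    # Bag-of-words (TF-IDF) vectors plus hierarchical clustering.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import AgglomerativeClustering

    pages = [
        "rust async runtime benchmarks and profiling",
        "sourdough starter and bread baking tips",
        "tokio vs async-std performance comparison",
        "easy weeknight pasta recipes",
    ]

    X = TfidfVectorizer(stop_words="english").fit_transform(pages)

    # AgglomerativeClustering needs a dense array; fine for small demos.
    labels = AgglomerativeClustering(n_clusters=2).fit_predict(X.toarray())
    print(labels)   # e.g. [0, 1, 0, 1]: programming pages vs cooking pages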
Interesting distances between websites are often described by considering which other websites they link to and which other websites link to them. Graph neural networks let you build a feature vector from these links between websites. This is also related to the well-known PageRank algorithm.
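You don't need a GNN to get started with link-based similarity; a simpler sketch (with a made-up toy link graph) is to treat each site as a bag of the domains it links to and compare those:

    # Two sites are "similar" if they link out to many of the same domains.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    outlinks = {
        "siteA.com": ["news.ycombinator.com", "github.com", "arxiv.org"],
        "siteB.com": ["github.com", "arxiv.org", "stackoverflow.com"],
        "siteC.com": ["pinterest.com", "instagram.com"],
    }

    sites = list(outlinks)
    # Represent each site as a binary bag of the domains it links to.
    vec = CountVectorizer(analyzer=lambda links: links, binary=True)
    X = vec.fit_transform(outlinks.values())

    sim = cosine_similarity(X)           # pairwise similarity between sites
    print(dict(zip(sites, sim[0])))      # how similar each site is to siteA.com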
Finally, gathering metadata about websites can also be an interesting axis of similarity: Who owns the site? How often do they update? How much money does it generate? How do they generate money? How big is it? How fast does it render? What do people think about the site? Basically, answer the 5Ws about the website and build a database from it, and an LLM can help answer those questions automatically (do a web search about the site, summarize the results, put them in the LLM's context, ask the question in a prompt, and index the answer).
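A hedged sketch of that last step, assuming you've already collected some text about the site (search snippets, an "about" page, whois output); it uses the legacy-style OpenAI ChatCompletion call (openai<1.0), and the snippets and question are made up:

    # Ask an LLM one metadata question about a site, grounded in gathered text.
    import openai

    openai.api_key = "sk-..."   # your key

    snippets = "Example.com is run by Example Inc., updated weekly, funded by display ads."
    question = "Who owns this site and how does it make money? Answer in one sentence."

    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"Context:\n{snippets}\n\nQuestion: {question}"},
        ],
    )
    print(resp["choices"][0]["message"]["content"])   # store this in your metadata DB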
Two approaches I'm currently trying (both need users' browsing history):
- Don't try to recommend similar websites; instead recommend users who like similar things to you, and then list the websites those users like
- Create tags with an accuracy score. For example, you tag a website "product management", "startup" and "b2b". You can go one step further and ask users to rate how well each tag matches the website, say 90% for "b2b", 50% for "startup" and 20% for "product management". Then you can let users search by tags and their accuracy ("I want 'product management' at an average above 50%"); see the sketch below.
Like you, I feel like something can be done with LLMs but I just haven't found it yet; maybe use one to suggest a website's tags from a restricted list, and then to suggest tags from a description of what the user is searching for and search on those tags.
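A toy sketch of the tag-accuracy search (sites, tags and thresholds are all made up): each site stores averaged user ratings per tag, and a query asks for minimum average scores.

    # Filter sites by minimum average rating for each requested tag.
    sites = {
        "example-saas.com": {"b2b": 0.9, "startup": 0.5, "product management": 0.2},
        "pm-weekly.com":    {"product management": 0.8, "startup": 0.4},
    }

    def search(min_scores):
        """Return sites whose average tag rating meets every requested threshold."""
        return [
            site for site, tags in sites.items()
            if all(tags.get(tag, 0.0) >= score for tag, score in min_scores.items())
        ]

    print(search({"product management": 0.5}))   # -> ['pm-weekly.com']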
But the problem is how you feed data into them. Some websites are vast depots of quite different types of text spanning a huge number of domains. If you use data from /r/programming you will classify Reddit as a programming website; if you use data from /r/food you will classify it as a culinary website.
Some websites, like Pinterest or Dailymotion, are media-heavy, so using just text might not be helpful.
What I want to say is that the actual classification is the last problem to solve; the real problem is feeding relevant data into it.
In that case, I agree that LLMs are not the solution in and of themselves, but the word/sentence/document embeddings obtained from their inner layers might actually be useful. Back when BERT came out, the team I was on at the time had pretty good results using BERT-derived sentence vectors for a machine learning project.
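For reference, a small sketch of pulling sentence vectors out of BERT's inner layers (mean pooling over the last hidden state), which is roughly what the sentence-embedding libraries do under the hood; the model choice and example texts are mine:

    # Masked mean pooling over BERT's last hidden state -> sentence vectors.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")

    def embed(texts):
        enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**enc).last_hidden_state       # (batch, tokens, 768)
        mask = enc["attention_mask"].unsqueeze(-1)       # ignore padding tokens
        vecs = (hidden * mask).sum(1) / mask.sum(1)      # masked mean pooling
        return torch.nn.functional.normalize(vecs, dim=1)

    a, b = embed(["a blog about woodworking", "a site about carpentry projects"])
    print(float(a @ b))   # cosine similarity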
As some have pointed out, embedding your documents using an LLM would be a good bet.
If you take the time to manually annotate a portion of your data, you could then fine-tune a model. You could also try some few-shot / zero-shot classification with ChatGPT. You could also try clustering your embeddings to see if categories emerge, and attribute labels to them afterwards.
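A sketch of that "cluster the embeddings, then label the clusters" idea; the embeddings could come from any of the models discussed above (here I reuse a sentence-transformers model, and the documents are invented):

    # Cluster document embeddings, then inspect each cluster to name it.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    docs = [
        "kubernetes deployment tutorial", "terraform modules best practices",
        "sourdough hydration explained", "how to proof pizza dough",
    ]

    X = SentenceTransformer("all-MiniLM-L6-v2").encode(docs, normalize_embeddings=True)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    for cluster in range(2):
        members = [d for d, l in zip(docs, km.labels_) if l == cluster]
        print(cluster, members)   # read the members to name each emergent category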
It's not an explanation, but it shows a possible way to cluster (and thus classify) websites based on how they appear. If you want your classification based on content, maybe you need something different.
[1] https://news.ycombinator.com/item?id=35073603