For years now I've wanted to develop a program that would be able to find similar websites. I can write a crawler, no problem, but I haven't been able to figure out the classification side of things. I tried a Bayesian approach without much success. Would ChatGPT or Llama be able to do this?
See also this post by marginalia[1] (HN discussion[2]) which discusses the same thing.
[1] https://memex.marginalia.nu/log/69-creepy-website-similarity...
[2] https://news.ycombinator.com/item?id=34143101
1. The current embedding models are designed for relatively small amounts of content rather than whole web pages. For example, most of the popular Hugging Face sentence-transformers max out at 256 tokens, which is only enough for a small part of most web pages. There are models for larger inputs, e.g. Longformer and BigBird, but they're still only 4096 tokens, and there aren't any ready-made implementations for cosine or dot product similarity at the moment, so you'd need to roll your own. Some of the new OpenAI embedding models do go to 8191 tokens and can be used for cosine similarity, and costs may be acceptable for the content embeddings; but for the query embeddings you'd have to be super-careful about costs, because if you put any kind of search on the public internet you could soon find yourself overwhelmed by SEO spam bots (most of which even Cloudflare can't block), and that could become prohibitively expensive to service via a paid API like OpenAI's. A popular approach to the token-length issue is to chunk your pages into multiple blocks.
2. Even if you get the embedding and chunking approach working, that is still only page-level comparison, and single web pages are not necessarily representative of whole web sites. For example, if you compare home pages, many blogs just link to the posts page from the home page without any of the actual post content. And some websites cover an eclectic range of topics; if your goal is to find other sites covering a similarly eclectic range, you need an embedding for the whole site rather than for individual pages. Not sure about the best way to address this one. Apparently averaging embeddings can work surprisingly well, but I haven't tried it yet (a rough sketch of chunking plus averaging follows below). Other options might include summarising (chunked) pages and then summarising the summaries for the site, or using topic modelling (e.g. BERTopic) alongside some kind of website taxonomy, or something like that. Keen to read of other possible approaches here.
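To make points 1 and 2 concrete, here is a minimal sketch of the chunk-then-average idea, assuming a small sentence-transformers model (all-MiniLM-L6-v2 is my choice; any similar model works) and invented page text:

    # Sketch: split pages into chunks, embed each chunk, average the
    # chunk vectors into one site-level vector, then compare sites by
    # cosine similarity. Page text below is made up for illustration.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")   # ~256-token input window

    def chunk(text, max_words=180):
        """Naively split a page into word-count-limited chunks."""
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

    def site_embedding(pages):
        """Embed every chunk of every page and average into one site vector."""
        chunks = [c for page in pages for c in chunk(page)]
        vecs = model.encode(chunks, normalize_embeddings=True)
        site = vecs.mean(axis=0)
        return site / np.linalg.norm(site)

    site_a = site_embedding(["long text of blog post one ...", "post two ..."])
    site_b = site_embedding(["an article about something else ..."])
    print(float(site_a @ site_b))   # cosine similarity between the two sites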
Not had the same success with word models, although it is probably largely a matter of finding an appropriate embedding.
The data is pretty stale, but here is a demo: https://explore2.marginalia.nu/
OP (or anyone else): Shoot me an email and I'll give you a copy of Marginalia's link graph to play with.
There are plenty of ways. For example, you can render the home page of a website to an image, then run CLIP to get a feature vector, then use an approximate nearest-neighbor search library like FAISS or HNSWlib to index it.
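A minimal sketch of that screenshot route, assuming you already have home-page screenshots on disk (the filenames are hypothetical, and I'm using the CLIP model shipped via sentence-transformers plus FAISS):

    # Embed screenshots with CLIP and index them for nearest-neighbor search.
    import faiss
    import numpy as np
    from PIL import Image
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("clip-ViT-B-32")

    shots = ["siteA.png", "siteB.png", "siteC.png"]      # hypothetical screenshots
    vecs = np.asarray(model.encode([Image.open(p) for p in shots]), dtype="float32")
    faiss.normalize_L2(vecs)                             # so inner product = cosine

    index = faiss.IndexFlatIP(vecs.shape[1])
    index.add(vecs)

    scores, ids = index.search(vecs[:1], 3)              # sites most like siteA
    print([(shots[i], float(s)) for i, s in zip(ids[0], scores[0])])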
Or you can ask ChatGPT, or another neural network, to summarize the web pages into a short description and then turn that into a feature vector. Old-school approaches are things like bags of words for document classification. Then you run a hierarchical clustering algorithm (something like hierarchical k-means). This lets you present things that are similar, but not so similar that they are duplicates.
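A rough sketch of the old-school route with scikit-learn (agglomerative clustering instead of hierarchical k-means, and invented example pages):

    # Bag-of-words (TF-IDF) vectors plus hierarchical clustering.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import AgglomerativeClustering

    pages = [
        "rust async runtime benchmarks and profiling",
        "sourdough starter and bread baking tips",
        "tokio vs async-std performance comparison",
        "easy weeknight pasta recipes",
    ]

    X = TfidfVectorizer(stop_words="english").fit_transform(pages)

    # AgglomerativeClustering needs a dense array; fine for small demos.
    labels = AgglomerativeClustering(n_clusters=2).fit_predict(X.toarray())
    print(labels)   # e.g. [0, 1, 0, 1]: programming pages vs cooking pages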
Interesting distances between websites are often described by considering which other websites they link to and which other websites link to them. Graph neural networks let you build a feature vector from these links between websites. This is also related to the well-known PageRank algorithm.
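You don't need a GNN to get started with link-based similarity; a simpler sketch (with a made-up toy link graph) is to treat each site as a bag of the domains it links to and compare those:

    # Two sites are "similar" if they link out to many of the same domains.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    outlinks = {
        "siteA.com": ["news.ycombinator.com", "github.com", "arxiv.org"],
        "siteB.com": ["github.com", "arxiv.org", "stackoverflow.com"],
        "siteC.com": ["pinterest.com", "instagram.com"],
    }

    sites = list(outlinks)
    # Represent each site as a binary bag of the domains it links to.
    vec = CountVectorizer(analyzer=lambda links: links, binary=True)
    X = vec.fit_transform(outlinks.values())

    sim = cosine_similarity(X)           # pairwise similarity between sites
    print(dict(zip(sites, sim[0])))      # how similar each site is to siteA.com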
Finally, gathering metadata about websites can also be an interesting axis of similarity: Who owns the site? How often do they update? How much money does it generate? How do they generate money? How big is it? How fast does it render? What do people think about the site? Basically, answer the 5Ws about the website and build a database from it, and an LLM can help answer those questions automatically (do a web search about the site, summarize the results, put them in the LLM's context, ask the question in a prompt, and index the answer).
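A hedged sketch of that last step, assuming you've already collected some text about the site (search snippets, an "about" page, whois output); it uses the legacy-style OpenAI ChatCompletion call (openai<1.0), and the snippets and question are made up:

    # Ask an LLM one metadata question about a site, grounded in gathered text.
    import openai

    openai.api_key = "sk-..."   # your key

    snippets = "Example.com is run by Example Inc., updated weekly, funded by display ads."
    question = "Who owns this site and how does it make money? Answer in one sentence."

    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"Context:\n{snippets}\n\nQuestion: {question}"},
        ],
    )
    print(resp["choices"][0]["message"]["content"])   # store this in your metadata DB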
Two approaches I'm currently trying (both need users' browsing history):
- Don't try to recommend similar websites; instead recommend users who like similar things to you, and then list the websites those users like
- Create tags with an accuracy score. For example, you tag a website "product management", "startup" and "b2b". You can go one step further and ask users to rate how well each tag matches the website, say 90% for "b2b", 50% for "startup" and 20% for "product management". Then you can let users search by tags and their accuracy ("I want 'product management' at an average above 50%"); see the sketch below.
Like you, I feel like something can be done with LLMs but I just haven't found it yet; maybe use one to suggest a website's tags from a restricted list, and then to suggest tags from a description of what the user is searching for and search on those tags.
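A toy sketch of the tag-accuracy search (sites, tags and thresholds are all made up): each site stores averaged user ratings per tag, and a query asks for minimum average scores.

    # Filter sites by minimum average rating for each requested tag.
    sites = {
        "example-saas.com": {"b2b": 0.9, "startup": 0.5, "product management": 0.2},
        "pm-weekly.com":    {"product management": 0.8, "startup": 0.4},
    }

    def search(min_scores):
        """Return sites whose average tag rating meets every requested threshold."""
        return [
            site for site, tags in sites.items()
            if all(tags.get(tag, 0.0) >= score for tag, score in min_scores.items())
        ]

    print(search({"product management": 0.5}))   # -> ['pm-weekly.com']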
But the problem is how you feed data into them. Some websites are vast depots of quite different types of text spanning a huge number of domains. If you use data from /r/programming you will classify Reddit as a programming website; if you use data from /r/food you will classify it as a culinary website.
Some websites, like Pinterest or Dailymotion, are media-heavy, so using just text might not be helpful.
What I want to say is that the actual classification is the last problem to solve; the real problem is feeding relevant data into it.
In that case, I agree that LLMs are not the solution in and of themselves, but the word/sentence/document embeddings obtained from their inner layers might actually be useful. Back when BERT came out, the team I was on at the time had pretty good results using BERT-derived sentence vectors for a machine learning project.
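For reference, a small sketch of pulling sentence vectors out of BERT's inner layers (mean pooling over the last hidden state), which is roughly what the sentence-embedding libraries do under the hood; the model choice and example texts are mine:

    # Masked mean pooling over BERT's last hidden state -> sentence vectors.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    bert = AutoModel.from_pretrained("bert-base-uncased")

    def embed(texts):
        enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**enc).last_hidden_state       # (batch, tokens, 768)
        mask = enc["attention_mask"].unsqueeze(-1)       # ignore padding tokens
        vecs = (hidden * mask).sum(1) / mask.sum(1)      # masked mean pooling
        return torch.nn.functional.normalize(vecs, dim=1)

    a, b = embed(["a blog about woodworking", "a site about carpentry projects"])
    print(float(a @ b))   # cosine similarity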
As some have pointed out, embedding your documents using an LLM would be a good bet.
If you take the time to manually annotate a portion of your data, you could then fine-tune a model. You could also try some few-shot / zero-shot classification with ChatGPT. You could also try clustering your embeddings to see if categories emerge, and attribute labels to them afterwards.
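A sketch of that "cluster the embeddings, then label the clusters" idea; the embeddings could come from any of the models discussed above (here I reuse a sentence-transformers model, and the documents are invented):

    # Cluster document embeddings, then inspect each cluster to name it.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    docs = [
        "kubernetes deployment tutorial", "terraform modules best practices",
        "sourdough hydration explained", "how to proof pizza dough",
    ]

    X = SentenceTransformer("all-MiniLM-L6-v2").encode(docs, normalize_embeddings=True)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    for cluster in range(2):
        members = [d for d, l in zip(docs, km.labels_) if l == cluster]
        print(cluster, members)   # read the members to name each emergent category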
It's not an explanation, but it shows a possible way to cluster (and thus classify) websites based on how they appear. If you want your classification based on content, maybe you need something different.
[1] https://news.ycombinator.com/item?id=35073603