We've had the best success by first converting the HTML to a simpler format (e.g. Markdown) before passing it to the LLM.
There are a few ways to do this that we've tried, namely Extractus[0] and dom-to-semantic-markdown[1].
Internally we use Apify[2] and Firecrawl[3] for Magic Loops[4] that run in the cloud, both of which have options for simplifying pages built-in, but for our Chrome Extension we use dom-to-semantic-markdown.
Similar to the article, we're currently exploring a user-assisted flow to generate XPaths for a given site, which we can then use to extract specific elements before hitting the LLM.
By simplifying the "problem" we've had decent success, even with GPT-4o mini.
[0] https://github.com/extractus
[1] https://github.com/romansky/dom-to-semantic-markdown
[2] https://apify.com/
[3] https://www.firecrawl.dev/
[4] https://magicloops.dev/
If you're open to it, I'd love to hear what you think of what we're building at https://browserbase.com/ - you can run a Chrome extension on a headless browser, so you can do the semantic-markdown conversion within the browser before pulling anything off.
We even have an iFrame-able live view of the browser, so your users can get real-time feedback on the XPaths they're generating: https://docs.browserbase.com/features/session-live-view#give...
Happy to answer any questions!
This is super neat and I think I've seen your site before :)
Do you handle authentication? We have lots of users who want to automate some part of their daily workflow, but the pages are often behind a login and/or require a few clicks to reach the desired content.
Happy to chat: username@gmail.com
Am I correct that the use case for doing this is 1. scale and 2. defeating Cloudflare et al.?
I do scraping, but I struggle to see what these tools are offering; maybe I'm just not the target audience. If the websites don't have much anti-scraping protection to speak of, and I only do a few pages per day, is there still something I can get out of using a tool like Browserbase? And given all this talk about semantic markdown and LLMs, what's the benefit over writing (or even having an AI write) standard fetching and parsing code using playwright/beautifulsoup/cheerio?
I was just a bit confused that the sign-up buttons for the Hobby and Scale plans are grey; I thought they were disabled until I happened to hover over them.
Have you compared markdown to just stripping the HTML down (strip tag attributes, unwrap links, remove obvious non-displaying elements)? My experience has been that performance is pretty similar to markdown, and it’s an easier transformation with fewer edge cases.
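For illustration, here's a minimal sketch of that kind of stripping, assuming BeautifulSoup (the comment doesn't name a library, so this is not the commenter's actual code):

    from bs4 import BeautifulSoup

    NON_DISPLAYING = ["script", "style", "noscript", "template", "head"]

    def strip_html(html: str) -> str:
        soup = BeautifulSoup(html, "html.parser")
        for tag in soup.find_all(NON_DISPLAYING):
            tag.decompose()    # remove obvious non-displaying elements
        for a in soup.find_all("a"):
            a.unwrap()         # unwrap links, keeping their text
        for tag in soup.find_all(True):
            tag.attrs = {}     # strip all tag attributes
        return str(soup)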
That’s what I’ve done for quite a few [non-LLM] applications. The remaining problem is that HTML is verbose compared to other formats, which means a higher per-token cost. So, maybe stripping followed by substituting HTML tags with a compressed notation.
First I’ve heard of Semantic Markdown [0]. It appears to be a way to embed RDF data in Markdown documents.
The page I found is labeled “Alpha Draft,” which suggests there isn’t a huge corpus of Semantic Markdown content out there. This might impede LLMs’ ability to understand it due to lack of training data. However, it seems sufficiently readable that LLMs could get by pretty well by treating its structured metadata as parentheticals.
=====
What is Semantic Markdown?
Semantic Markdown is a plain-text format for writing documents that embed machine-readable data. The documents are easy to author and both human and machine-readable, so that the structured data contained within these documents is available to tools and applications.
Technically speaking, Semantic Markdown is "RDFa Lite for Markdown" and aims at enhancing the HTML generated from Markdown with RDFa Lite attributes.
Design Rationale:
Embed RDFa-like semantic annotation within Markdown
Ability to mix unstructured human-text with machine-readable data in JSON-LD-like lists
Ability to semantically annotate an existing plain Markdown document with semantic annotations
Keep human-readability to a maximum
About this document
=====
[0] https://hackmd.io/@sparna/semantic-markdown-draft
I've been wanting to try the same approach and have been looking for the right tools.
We did something similar, although in a somewhat different context.
Translating a complex JSON representing an execution graph to a simpler graphviz dot format first and then feeding it to an LLM. We had decent success.
Source: I have a toddler at home.
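As a hedged illustration of the JSON-to-graphviz-dot simplification described above (the actual JSON schema isn't shown, so the "nodes"/"edges" shape below is an assumption):

    def to_dot(graph: dict) -> str:
        # Flatten an execution graph into graphviz dot, which costs far
        # fewer tokens than the raw JSON when fed to an LLM.
        lines = ["digraph G {"]
        for node in graph["nodes"]:  # assumed schema
            lines.append(f'  "{node["id"]}" [label="{node.get("label", node["id"])}"];')
        for edge in graph["edges"]:  # assumed schema
            lines.append(f'  "{edge["from"]}" -> "{edge["to"]}";')
        lines.append("}")
        return "\n".join(lines)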
OpenAI recently announced a Batch API [1] which allows you to prepare all your prompts and then run them as a batch. This reduces costs, as it's just 50% of the normal price. I used it a lot with GPT-4o mini in the past and was able to run 3,000 items through it in less than 5 minutes. Could be great for non-realtime applications.
[1] https://platform.openai.com/docs/guides/batch
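For reference, a sketch of that flow against the documented Batch API (the file name, prompts, and model choice here are illustrative):

    import json
    from openai import OpenAI

    client = OpenAI()

    # One JSONL line per prompt, each a normal chat-completions request body.
    with open("items.jsonl", "w") as f:
        for i, page_text in enumerate(["page one text", "page two text"]):
            f.write(json.dumps({
                "custom_id": f"item-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": "gpt-4o-mini",
                    "messages": [{"role": "user",
                                  "content": f"Extract the data: {page_text}"}],
                },
            }) + "\n")

    batch_file = client.files.create(file=open("items.jsonl", "rb"), purpose="batch")
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",  # batches finish within 24 hours at half price
    )
    print(batch.id)  # poll client.batches.retrieve(batch.id) for results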
I hope some of the open-source inference servers start supporting that endpoint soon. I know vLLM has added some "offline batch mode" support with the same format; they just haven't gotten around to implementing it on the OpenAI endpoint yet.
There is no need for a new API endpoint. Just send multiple requests at once.
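That alternative is only a few lines with the async client; a hedged sketch (the model name and concurrency cap are illustrative):

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI()
    sem = asyncio.Semaphore(20)  # cap concurrency to stay under rate limits

    async def extract(text: str) -> str:
        async with sem:
            resp = await client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": f"Extract the data: {text}"}],
            )
            return resp.choices[0].message.content

    async def main(pages: list[str]) -> list[str]:
        # e.g. results = asyncio.run(main(pages))
        return await asyncio.gather(*(extract(p) for p in pages))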
That's a great proposition by OpenAI.
I think, however, that it is still one to two orders of magnitude too expensive compared to traditional text extraction at very similar precision and recall levels.
Yeah, this was a phenomenal decision on their part. I wish some of the other cloud platforms like Azure would offer the same thing; it just makes so much sense!
For structured content (e.g. lists of items, simple tables), you really don’t need LLMs.
I recently built a web scraper that automatically works on any website [0]. I built the initial version using AI, but I found that using heuristics based on element attributes and positioning ended up being faster, cheaper, and more accurate (no hallucinations!).
For most websites, the non-AI approach works incredibly well so I’d make sure AI is really necessary (e.g. data is unstructured, need to derive or format the output based on the page data) before incorporating it.
[0] https://easyscraper.com
The LLM is resistant to website updates that would break normal scraping. If you do what the author did and ask it to generate XPaths, you can run the LLM once and use the XPaths it generated for regular scraping. Once those break, fall back to the LLM to regenerate the XPaths, and fall back one more time to alerting a human if the data doesn't start flowing again, or if something breaks further down the pipeline because the data arrives in an unexpected format.
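A minimal sketch of that fallback chain (llm_generate_xpaths and alert_human are hypothetical helpers, and the all()-based health check stands in for real validation):

    from lxml import html

    def scrape(page_source: str, xpaths: dict) -> dict:
        tree = html.fromstring(page_source)
        return {name: tree.xpath(xp) for name, xp in xpaths.items()}

    def scrape_with_fallback(page_source, xpaths, llm_generate_xpaths, alert_human):
        data = scrape(page_source, xpaths)
        if all(data.values()):
            return data, xpaths  # cheap path: stored XPaths still work
        xpaths = llm_generate_xpaths(page_source)  # expensive path: regenerate
        data = scrape(page_source, xpaths)
        if all(data.values()):
            return data, xpaths
        alert_human(page_source)  # last resort: the page changed too much
        return None, xpaths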
This is absolutely true, but it does have to be weighed against the performance benefits of something that doesn't require invoking an LLM to operate.
If the cost of updating some XPaths every now and then is relatively low (which I guess means your target site is not actively and deliberately obfuscating its markup specifically to stop people scraping it), running a basic XPath scraper would be maybe multiple orders of magnitude more efficient.
Using LLMs to monitor the changes and generate new XPaths is an awesome idea though - it takes the expensive part of the process and (hopefully) automates it away, so you get the best of both worlds.
> Turns out, a simple table from Wikipedia (Human development index) breaks the model because rows with repeated values are merged
I've seen another website like this with this feature on Hacker News, but that was in a retrospective. These websites have a nasty habit of ceasing operations.
Is there an "HTML reducer" out there? I've been considering writing one. If you take a page's source, it's going to be 90% garbage tokens -- random JS, ads, unnecessary properties, aggressive nesting for layout rendering, etc.
I feel like if you used a DOM parser to walk the tree and only kept nodes with text, the HTML structure, and the necessary tag properties (class/id only, maybe?), you'd have significant savings. Perhaps the XPath thing might work better too. You could even drop the tag syntax entirely and represent it as an indented text file.
We use readability for things like this, but you lose the DOM structure, and its quality degrades with JS-heavy websites and pages with actions like "continue reading" which expand the text. What's the gold standard for something like this?
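As a rough sketch of the reducer idea above - keep only text-bearing nodes plus class/id and emit an indented outline instead of HTML syntax - assuming BeautifulSoup (a hypothetical starting point, not an existing library):

    from bs4 import BeautifulSoup, Tag

    def reduce_html(html: str) -> str:
        soup = BeautifulSoup(html, "html.parser")
        for t in soup(["script", "style", "noscript"]):
            t.decompose()  # drop the obvious garbage tokens first
        lines = []

        def walk(node, depth):
            for child in node.children:
                if not isinstance(child, Tag):
                    continue
                if not child.get_text(strip=True):
                    continue  # skip layout-only subtrees with no text
                ident = child.get("id") or " ".join(child.get("class") or [])
                label = child.name + (f"#{ident}" if ident else "")
                own_text = " ".join(
                    s.strip() for s in child.find_all(string=True, recursive=False)
                ).strip()
                lines.append("  " * depth + (label + " " + own_text).rstrip())
                walk(child, depth + 1)

        walk(soup, 0)
        return "\n".join(lines)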
Jina.ai offer a really neat (currently free) API for this - you add https://r.jina.ai/ to the beginning of any URL and it gives you back a Markdown version of the main content of that page, suitable for piping into an LLM.
Here's an example: https://r.jina.ai/https://simonwillison.net/2024/Sep/2/anato... - for this page: https://simonwillison.net/2024/Sep/2/anatomy-of-a-textual-us...
Their code is open source so you can run your own copy if you like: https://github.com/jina-ai/reader - it's written in TypeScript and uses Puppeteer and https://github.com/mozilla/readability
I've been using Readability (minus the Markdown bit) myself to extract the title and main content from a page - I have a recipe for running it via Playwright using my shot-scraper tool here: https://shot-scraper.datasette.io/en/stable/javascript.html#...
I snuck in an edit about readability before I saw your reply. The quality of that one in particular is very meh, especially for most news sites, and then you lose all of the DOM structure in case you want to do more with the page. Though now I'm curious how it works on the weather.com page the author tried. A puppeteer -> screenshot -> OCR (or even multi-modal; many models do OCR first anyway) -> LLM pipeline might work better there.
It's adapted from vimium and works like a charm. Distill the HTML down to its important bits, and handle a ton of edge cases along the way haha
https://github.com/mozilla/readability
Only works insofar as sites are being nice. A lot of sites do things like: render all text via JS, render article text via API, paywall content by showing a preview snippet of static text before swapping it for the full text (which lives in a different element), lazyload images, lazyload text, etc etc.
DOM parsing wasn't enough for Google's SEO algo, either. I'll even see Safari's "reader mode" fail utterly on site after site for some of these reasons. I tend to have to scroll the entire page before running it.
If these readers do not use already rendered HTML to parse the information on the screen, then...
It strips all JS/event handlers, most attributes, and most CSS, and only keeps important text nodes.
I needed this because I was using an LLM to reimplement portions of a page using just Tailwind, so I needed to minimise input tokens.
That’s easy to do with BeautifulSoup in Python - look up tutorials on that. Use it on non-essential tags. That will at least work when the content is in the HTML rather than procedurally generated (e.g. via JavaScript).
It's very surprising that the author of this post does 99% of the work and writing and then doesn't go the final 1%: downloading Ollama (or some other llama.cpp-based engine) and testing how a decent local LLM does on this use case. Maybe a 7B or 30B model would do great here, and that's cheap enough to run: no GPT-4o needed.
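That test is only a few lines against Ollama's local HTTP API; a quick sketch (the model name is illustrative):

    import requests

    def ask_local(prompt: str, model: str = "llama3.1") -> str:
        # Ollama serves a JSON API on localhost:11434 once a model is pulled.
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
        )
        return resp.json()["response"]

    print(ask_local("Extract the table rows from this HTML as JSON: ..."))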
We've been working on AI-automated web scraping at Kadoa [0] and our early experiments were similar to those in the article. We started when only the expensive and slow GPT-3 was available, which pushed us to develop a cost-effective solution at scale.
Here is what we ended up with:
- Extraction: We use codegen to generate CSS selectors or XPath extraction code. Using an LLM for every data extraction would be expensive and slow, but using LLMs to generate the scraper code and subsequently adapt it to website modifications is highly efficient.
- Cleansing & transformation: We use small fine-tuned LLMs to clean and map data into the desired format.
- Validation: Unstructured data is a pain to validate. Alongside traditional data validation methods like reverse search, we use LLM-as-a-judge to evaluate the data quality (a rough sketch follows below).
We quickly realized that doing this for a few data sources with low complexity is one thing, doing it for thousands of websites in a reliable, scalable, and cost-efficient way is a whole different beast.
Combining traditional ETL engineering methods with small, well-evaluated LLM steps was the way to go for us.
[0] https://kadoa.com
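Here's a minimal sketch of that LLM-as-a-judge step - not Kadoa's actual implementation; the prompt and model are illustrative:

    import json
    from openai import OpenAI

    client = OpenAI()

    def judge(record: dict, source_snippet: str) -> bool:
        # Ask a small model whether the extracted record matches the source.
        prompt = (
            "Does this extracted record faithfully reflect the source snippet? "
            'Answer with JSON {"valid": true/false, "reason": "..."}.\n\n'
            f"Record: {json.dumps(record)}\n\nSource: {source_snippet}"
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        return json.loads(resp.choices[0].message.content)["valid"]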
I've had good luck with giving it an example of HTML I want scraped and asking for a beautifulsoup code snippet. Generally the structure of what you want to scrape remains the same, and it's a tedious exercise coming up with the garbled string of nonsense that ends up parsing it.
Using an LLM for the actual parsing is simultaneously overkill and a risk of polluting your results with hallucinations.
As others have mentioned here, you might get better results more cheaply (this probably wasn't the point of the article, so just FYI) if you preprocess the HTML first. I personally have had good results with trafilatura [1], which I don't see mentioned yet.
[1] https://trafilatura.readthedocs.io/en/latest/
I heartily second trafilatura. Sending just the extracted text to the LLM will save a huge amount of money.
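The core of it really is just a couple of calls to trafilatura's documented API:

    import trafilatura

    downloaded = trafilatura.fetch_url("https://example.com/article")
    text = trafilatura.extract(downloaded)  # main content, boilerplate removed
    print(text)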
I used it on this recent project (shameless plug): https://github.com/philippe2803/contentmap. It's a simple Python library that creates a vector store for any website, using a domain's XML sitemap as a starting point. The challenge was that each domain has its own HTML structure, and to create a vector store we need the actual content, removing HTML tags, etc. Trafilatura basically does that for any URL, in just a few lines of code.
Good to know! Yes, trafilatura is great - sure, it breaks sometimes, but everything breaks on some website; the real questions are how often and what the extent of the breakage is. For general information, the library is described in the paper here [1], where Table 1 provides some benchmarks.
I also forgot to mention another interesting scraper that's an LLM-based service. A quick search here tells me it was mentioned once by simonw, but I think it should be better known just for the convenience! Prepend "r.jina.ai" to any URL to extract text. For example, check out [2] or [3].
[1] https://aclanthology.org/2021.acl-demo.15.pdf
[2] https://r.jina.ai/news.ycombinator.com/
[3] (this discussion) https://r.jina.ai/news.ycombinator.com/item?id=41428274