Show HN: Defuddle, an HTML-to-Markdown alternative to Readability

Interesting, as I was researching this recently and was certainly not impressed with the quality of the Readability implementations in various languages. Although Readability.js was clearly the best, it being Javascript didn't suit my project.

In the end I found that the Python trifatura library extracted the best-quality content, with accurate metadata.

You might want to compare your implementation to trifatura to see if there is room for improvement.

acrophobic · 3 months ago

> ...it being Javascript didn't suit my project.

If you're using Go, I maintain Go ports of Readability[0] and Trafilatura[1]. They're actively maintained, and for Trafilatura, the extraction performance is comparable to the Python version.

[0]: https://github.com/go-shiori/go-readability

[1]: https://github.com/markusmobius/go-trafilatura

derekperkins · 3 months ago

We've been active users of go-trafilatura and love it

breadchris · 3 months ago

this is what i came here to see, thanks!

fabmilo · 3 months ago

reference to the library: https://trafilatura.readthedocs.io/en/latest/

for the curious: Trafilatura means "extrusion" in Italian.

| This method creates a porous surface that distinguishes pasta trafilata for its extraordinary way of holding the sauce. search maccheroni trafilati vs maccheroni lisci :)

(btw I think you meant trafilatura not trifatura)

thm · 3 months ago

Been using it since day one but development has stalled quite a bit since 2.0.0.

winddude · 3 months ago

It's a bit old, but I benchmarked a number of the web extraction tools years ago, https://github.com/Nootka-io/wee-benchmarking-tool, and resiliparse-plain was my clear winner at the time.

Really nice work. I appreciate the example with JSDOM as that’s exactly how I use readability, and this looks like a nice drop-in replacement.

Question: How did you validate this? You say it works better than readability but I don’t see any tests or datasets in the repo to evaluate accuracy or coverage. Would it be possible to share that as well?

kepano · 3 months ago

Currently I am relying on manual testing and user feedback, but yes, I'd like to add tests.

Defuddle works quite differently from Readability. Readability tends to be overly conservative and tends to remove useful content because it tests blocks to find the beginning and end of the "main" content.

Defuddle is able to run multiple passes and detect if it returned no content to try and expand its results. It also uses a greater variety of techniques to clean the content — for example, by using a page's mobile styles to detect content that can be hidden.

Lastly, Defuddle is not only extracting the content but also standardizing the output (which Readability doesn't do). For example footnotes and code blocks all aim to output a single format, whereas Readability keeps the original DOM intact.
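
The "multiple passes" idea described above can be sketched roughly as follows. This is a hypothetical illustration, not Defuddle's actual code: the `Block` shape, the thresholds, and the names `extract` and `extractWithFallback` are all assumptions made up for the example. It only shows the general pattern of retrying with looser settings when a strict pass returns nothing.

```typescript
// Hypothetical sketch of a retry-with-looser-settings loop; not Defuddle's actual code.
interface Block {
  text: string;
  // Fraction of the block's text that sits inside links (0..1).
  linkDensity: number;
}

interface PassOptions {
  minLength: number;      // drop blocks shorter than this many characters
  maxLinkDensity: number; // drop blocks that are mostly links
}

// Keep only blocks that look like prose under the given thresholds.
function extract(blocks: Block[], opts: PassOptions): string {
  return blocks
    .filter(b => b.text.length >= opts.minLength && b.linkDensity <= opts.maxLinkDensity)
    .map(b => b.text)
    .join("\n\n");
}

// Try the strictest settings first, then looser ones, until a pass returns content.
function extractWithFallback(
  blocks: Block[],
  passes: PassOptions[]
): { content: string; pass: number } {
  let pass = 0;
  for (const opts of passes) {
    const content = extract(blocks, opts);
    if (content.trim().length > 0) {
      return { content, pass };
    }
    pass++;
  }
  // Every pass came back empty.
  return { content: "", pass: -1 };
}

// Illustrative thresholds, ordered from strict to permissive.
const passes: PassOptions[] = [
  { minLength: 140, maxLinkDensity: 0.2 },
  { minLength: 40, maxLinkDensity: 0.5 },
  { minLength: 1, maxLinkDensity: 1 },
];

const shortPage: Block[] = [
  { text: "Home | About | Contact", linkDensity: 1 },
  { text: "A short note that a strict pass would throw away.", linkDensity: 0 },
];

const result = extractWithFallback(shortPage, passes);
console.log(result.pass, result.content);
```

On this toy page the strict pass drops both blocks, and the second pass keeps the short note while still rejecting the link-only navigation line.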

honodk123 · 3 months ago

This looks great!

I would love to give Defuddle a try as a Readability replacement. However, for my use case I want to do this in a Chrome extension background script (service worker). I have not been able to get Defuddle to work, while Readability does (when combined with linkedom). So basically, while this works:

  import { parseHTML } from 'linkedom';
  ...
  private extractArticleWithReadability(html: string) {
      const { document } = parseHTML(html);
      const reader = new Readability(document);
      return reader.parse();
  }

This does not:

  import { parseHTML } from 'linkedom';
  ...
  private async extractArticleWithDefuddle(html: string) {
      const { document } = parseHTML(html);
      const result = new Defuddle(document);
      result.parse();
      return result;
  }

I get errors like:

- Error in findExtractor: TypeError: Failed to construct 'URL': Invalid URL

- Defuddle: Error evaluating media queries: TypeError: undefined is not iterable (cannot read property Symbol(Symbol.iterator))

- Defuddle Error processing document: TypeError: b.getComputedStyle is not a function

Is there a way to run Defuddle in a chrome extension background script/service worker? Or do you have any plans of adding support for that?
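
For context on the first and third errors: they are consistent with code that expects a full browser window (an absolute page URL, `getComputedStyle`) being handed a lighter DOM, though that is an assumption and not a confirmed diagnosis. A minimal sketch of the kind of guards that avoid such throws is below. The helper names `safeParseUrl` and `canComputeStyles` are hypothetical and are not part of Defuddle or linkedom.

```typescript
// Hypothetical guards; names are illustrative and not part of Defuddle or linkedom.

// new URL() throws a TypeError on empty or relative input, so wrap it.
function safeParseUrl(raw: string | undefined, base?: string): URL | null {
  if (!raw) return null;
  try {
    return base ? new URL(raw, base) : new URL(raw);
  } catch {
    return null;
  }
}

// Minimal shape of the window-like object we want to probe.
interface WindowLike {
  getComputedStyle?: unknown;
}

// Feature-detect instead of assuming a full browser window is present.
function canComputeStyles(win: WindowLike | null | undefined): boolean {
  return typeof win?.getComputedStyle === "function";
}

console.log(safeParseUrl(""));                                     // null
console.log(safeParseUrl("/post/1"));                              // null (relative, no base)
console.log(safeParseUrl("/post/1", "https://example.com")?.href); // https://example.com/post/1
console.log(canComputeStyles({}));                                 // false
```

Code that checks for these capabilities up front can skip the style-based cleanup steps instead of failing the whole parse.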

tmpfs · 3 months ago

creakingstairs · 3 months ago

I was just looking at Obsidian Web Clipper's source code because I've been quite impressed by its Markdown conversion results, and came across Defuddle in there. I'll be using it for my bespoke read-it-later/knowledge-base app, so thank you in advance :D

Tsarp · 3 months ago

Been using the Obsidian clipper since it came out and this is really neat. The per-website, profile-based extraction is awesome.

Even if you are not an Obsidian user, the Markdown extraction quality is the most reliable I've seen.

audessuscest · 3 months ago

thanks for the tip!

jeanlucas · 3 months ago

Obsidian Web Clipper is a great tool to turn ChatGPT conversations into Markdown, or to just print them (believe me, it is a use case)

emaro · 3 months ago

Not sure about other clients, but Kagi Assistant directly offers to save a conversation as Markdown. Using Obsidian's web-clipper is a good idea too though.

T0Bi · 3 months ago

I just ask ChatGPT to provide the summary or whatever I need as a markdown file.

kouru225 · 3 months ago

Is that a paid plugin?

It is free and open source: https://github.com/obsidianmd/obsidian-clipper

binarymax · 3 months ago

shrinks99 · 3 months ago

I've been super happy with Obsidian Web Clipper! It's worked really well for me, with the one exception of importing publish dates (which is more than forgivable!)

Is Mozilla's Readability really abandoned? The latest release (v0.6.0) was just 2 months ago, and its maintainer (Gijs) is pretty active in responding to issues.

khasan222 · 3 months ago

That codebase definitely leaves much to be desired, I’ve already had to fork it for work in order to fix some bugs.

One such bug: take a page in a language that uses commas instead of periods between the digits of a number, like Dutch (I think), with a lot of prices on it. It’ll think all the numbers are relevant text.

And of course I tried to open a PR and get it merged, but they require tests, and of course the tests don’t work on the page I’m testing. It’s just very snafu imho
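
To illustrate the kind of failure described above, here is a toy sketch that assumes the scorer treats commas as a sign of running prose. That is an assumption about the heuristic involved, and the code is neither Readability's source nor the fix from the fork. The function names and sample strings are made up for the example.

```typescript
// Illustrative only: a toy comma-based prose signal, not Readability's actual source.

// Naive signal: every comma counts as evidence of running prose.
function countCommas(text: string): number {
  return (text.match(/,/g) ?? []).length;
}

// Variant: drop commas that sit between two digits (e.g. "2,50") before counting.
function countProseCommas(text: string): number {
  const withoutNumericCommas = text.replace(/(\d),(?=\d)/g, "$1");
  return countCommas(withoutNumericCommas);
}

const priceList = "Koffie 2,50 Thee 2,25 Taart 4,75";
const sentence = "Yes, it works, mostly.";

console.log(countCommas(priceList), countProseCommas(priceList)); // 3 0
console.log(countCommas(sentence), countProseCommas(sentence));   // 2 2
```

With the naive count, a price list scores higher than an ordinary sentence. The variant leaves normal punctuation untouched and ignores the numeric commas.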

fabrice_d · 3 months ago

This seems to be https://github.com/mozilla/readability/pull/853#issuecomment... and I think their expectations are pretty reasonable.

rcarmo · 3 months ago

The Python analogues seem to be well maintained. I did my own implementation of the Readability algorithm years ago and dropped it in favor of them, and I have a few scrapers going strong with regular updates.

Are there any in particular you can recommend?

khimaros · 3 months ago

not parent, but this one looks maintained https://github.com/buriy/python-readability