_pdp_ · 2 years ago
My $0.02: correct chunking can improve accuracy, but it does not change the fact that retrieval is still a single-shot operation. I have commented on this before, so I am repeating myself, but what RAG systems are trying to do is the equivalent of looking up some information (say, via a search engine) and happening to have the correct answer in the first 5 results - not the links, but the actual excerpts from the crawled pages. You don't need many evals to figure out that this will only sometimes work. So chunking improves performance as long as the search phrase can discover the correct information, but it does not account for the possibility that the search itself is wrong or needs further refinement. Add to the mix that vectorisation of the records does not work well for non-tokens, made-up words, foreign languages, etc., and you start to get an idea of the complexity involved. This is why more context is better, but only up to a limit.

IMHO, in most use cases, chunking optimisation strategies will not substantially improve performance. What I think might improve performance is running N search strategies with multiple variations of the search phrase and picking up the best answer. But this is currently expensive and slow.
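A minimal sketch of that "N variations of the search phrase" idea, with two loud assumptions: `rewrite` stands in for an LLM query rewriter and `retrieve` for a vector/keyword search backend, neither of which is specified in the comment. Reciprocal-rank fusion is one common (but here, my own) choice for merging the per-variant result lists:

```python
from collections import defaultdict

def multi_query_search(question, rewrite, retrieve, top_k=5):
    """Run retrieval over several rewrites of the question and fuse the
    result lists with reciprocal-rank fusion (RRF), so documents that
    rank well across many phrasings float to the top."""
    scores = defaultdict(float)
    for variant in rewrite(question):
        for rank, doc in enumerate(retrieve(variant)):
            scores[doc] += 1.0 / (60 + rank)  # RRF with the usual k=60
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

RRF rewards documents that appear near the top for many phrasings, which is one way to "pick the best answer" without scoring answers directly - at the cost of N retrieval calls, hence the expense the comment mentions.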

Having spent the last year and a half developing a RAG platform, I find many of these challenges strikingly familiar.

serjester · 2 years ago
There's far more to a RAG pipeline than chunking documents; chunking is just one way to interface with a file. In our case we use query decomposition, document summaries and chunking to achieve strong results.

You're right that chunking is just one piece of this. But without quality chunks you're either going to miss context come query time (bad chunks) or use 100X the tokens (full-file context).
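The shape of the query-decomposition step can be sketched as follows - to be clear, this is not the author's pipeline: a real implementation would prompt an LLM to do the splitting, and the regex here is only a toy stand-in:

```python
import re

def decompose(query):
    """Split a compound question into sub-questions that can each be
    retrieved (and answered) independently, then recombined. The naive
    split on 'and' / ';' is a placeholder for an LLM call."""
    parts = re.split(r"\band\b|;", query)
    return [p.strip().rstrip("?") + "?" for p in parts if p.strip()]
```

Each sub-question then gets its own retrieval pass, which is what lets a compound query hit chunks that no single search phrase would surface.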

register · 2 years ago
Can you describe in a bit more detail what your strategy is for query decomposition?
mistermann · 2 years ago
> What I think might improve performance is running N search strategies with multiple variations of the search phrase and picking up the best answer. But this is currently expensive and slow.

Eerily similar to Thinking Fast and Slow, and may help explain (when combined with biological and social evolutionary theory) why people have such a strong aversion to System 2 thinking.

_pdp_ · 2 years ago
Ha, never thought of that. Thank you :)
vikp · 2 years ago
This looks great! You might be interested in surya - https://github.com/VikParuchuri/surya (I'm the author). It does OCR (much more accurate than tesseract), layout analysis, and text detection.

The OCR is slow on CPU (working on it), but on GPU it's faster than tesseract (which is CPU-only).

You could probably replace pymupdf, tesseract, and some layout heuristics with this.

Happy to discuss more, feel free to email me (in profile).

nicklo · 2 years ago
OP: please don't poison your MIT license w/ surya's GPL license
vikp · 2 years ago
It should be possible to call a GPL library in a separate process (surya can batch process from the CLI) and avoid GPL - ocrmypdf does this with ghostscript.
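That process-isolation pattern can be sketched in a few lines; note the example surya invocation at the bottom is an assumption about its CLI, so check the project's docs for the real entry point and flags:

```python
import subprocess
import sys

def run_isolated(argv):
    """Run a GPL-licensed tool as a separate process and consume only
    its output, so the MIT-licensed caller communicates at arm's length
    instead of linking against GPL code."""
    proc = subprocess.run(argv, capture_output=True, text=True, check=True)
    return proc.stdout

# Illustrative only -- consult surya's docs for its actual CLI:
# text = run_isolated(["surya_ocr", "input.pdf"])
```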
barfbagginus · 2 years ago
Can I send a PR extending the benchmark against doctr and potentially textract? I believe these represent the SOTA for open and proprietary OCR.

The benefit is to let people evaluate surya against the open source and commercial SOTA, improving the integrity and applicability of the benchmark in a business or research setting.

There's a risk: it could make surya's benchmark look less attractive. Also, picking textract to represent the proprietary SOTA might be dicey, since it has competitors (Google Cloud OCR, Azure OCR).

Still, ranking surya against doctr, textract, and tesseract would be a really nice baseline. As a research user, business user, or open-source contributor, those are the results I need to quickly understand surya's potential.

vikp · 2 years ago
I've benchmarked against google cloud ocr, but the results are on Twitter, not the repo yet - https://twitter.com/VikParuchuri/status/1765440195124691339 . The reason I didn't benchmark against doctr is language support.
marban · 2 years ago
The recent Real Python pod has some anecdotal insights from a real-world project with respect to dealing with decades-old unstructured PDFs. https://realpython.com/podcasts/rpp/199/
cpursley · 2 years ago
Neat and timely. My biggest challenge is tables contained in PDFs.

Are there any similar projects that are lower level (for those of us not using Python)? Something in Rust that I could call out to, for example?

d-z-m · 2 years ago
Very cool!

I see this in the README under the "How is this different from other layout parsers" section.

> Commercial Solutions: Requires sharing your data with a vendor.

But I also see that to use the Semantic Processing example, you have to have an OpenAI API key. Are there any plans to support locally hosted embedding models for this kind of processing?

willj · 2 years ago
Relatedly, the OCR component relies on PyMuPDF, which has a license that requires releasing source code, which isn’t possible for most commercial applications. Is there any plan to move away from PyMuPDF, or is there a way to use an alternative?
kkielhofner · 2 years ago
FWIW PyMuPDF doesn't do OCR. It extracts embedded text from a PDF, which in some cases is either non-existent or done with poor quality OCR (like some random implementation from whatever it was scanned with).

This implementation bolts on Tesseract which IME is typically not the best available.

serjester · 2 years ago
Coming soon!
zby · 2 years ago
What I want is dynamic chunking - I want to search a document for a word, and then get the largest chunk that fits into my limits and contains the found word. Has anyone worked on such a thing?
Y_Y · 2 years ago

    grep -C $n word document
will get you $n lines of context on either side of the matching lines.

zby · 2 years ago
Yeah - the idea is simple, but there are so many variations as to what makes a good chunk. If it's a program, then lines are good, but maybe you'd like to set the boundaries at block endings or something. For regular text, maybe sentences would be better than lines? Or paragraphs. And maybe it should not go beyond the boundary of a section or chapter. And then there are tables. With tables, the good solution would be to fit some rows - but maybe the headers should also be copied along with rows from the middle? Though if a previous chunk with the headers was already loaded, then maybe don't duplicate the headers.
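The sentence-boundary variant of this can be sketched like so; the regex splitter is deliberately naive, and the idea is that you'd swap in line, paragraph, or section boundaries (or table-row logic) per document type:

```python
import re

def chunk_around(text, word, max_chars=500):
    """Find `word`, then grow the chunk outward one sentence at a time,
    alternating before/after, until the character budget is hit."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    hit = next((i for i, s in enumerate(sentences) if word in s), None)
    if hit is None:
        return ""
    lo = hi = hit

    def size(a, b):
        return len(" ".join(sentences[a:b + 1]))

    grown = True
    while grown:
        grown = False
        if lo > 0 and size(lo - 1, hi) <= max_chars:
            lo -= 1
            grown = True
        if hi < len(sentences) - 1 and size(lo, hi + 1) <= max_chars:
            hi += 1
            grown = True
    return " ".join(sentences[lo:hi + 1])
```

With a token budget instead of `max_chars`, `size` would call the tokenizer; the growth loop stays the same.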
snorkel · 2 years ago
OpenSearch perhaps? The search query returns a list of hits (matches), each with a text_entry field containing the matching excerpt from the source doc.
dleeftink · 2 years ago
Do you need to find the longest common substring? Because there are several methods to accomplish that.

[0]: https://en.m.wikipedia.org/wiki/Longest_common_substring
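For reference, the classic dynamic-programming method for this looks like the following (quadratic in the input lengths; suffix-automaton or suffix-array approaches are faster for long documents):

```python
def longest_common_substring(a, b):
    """dp[i][j] is the length of the common substring ending at a[i-1]
    and b[j-1]; track the best (length, end position) seen in `a`."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best_len, best_end = 0, 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > best_len:
                    best_len, best_end = dp[i][j], i
    return a[best_end - best_len:best_end]
```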

Oras · 2 years ago
How accurate is table detection/parsing in PDFs? I found this part the most challenging, and none of the open-source PDF parsers worked well.
serjester · 2 years ago
Author here. We optionally use UniTable, which represents the current state of the art in table parsing. Camelot / Tabula use much simpler, traditional extraction techniques.

UniTable itself has shockingly good accuracy, although we're still working on better table detection, which sometimes negatively affects results.

xyzjgf · 2 years ago
Is this the UniTable you mentioned? https://github.com/poloclub/unitable
verdverm · 2 years ago
I've been using Camelot, which builds on the lower-level Python PDF libraries, to extract tables from PDFs. I haven't tried anything exotic, but it seems to work. The tables I parse tend to be full-page or the most dominant element.

https://camelot-py.readthedocs.io/en/master/

I like Camelot because it gives me back pandas dataframes. I don't want markdown; I can make that from a dataframe if needed.
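Making markdown from the frame on demand is indeed simple. A stdlib-only sketch (the helper name is made up, and the `tables[0].df` shape it expects comes from Camelot's `read_pdf` returning tables whose `.df` is a DataFrame):

```python
def rows_to_markdown(rows):
    """Render a list-of-lists table (e.g. `tables[0].df.values.tolist()`
    from Camelot) as a Markdown table; the first row is the header."""
    header, *body = rows
    lines = ["| " + " | ".join(map(str, header)) + " |",
             "|" + "|".join(" --- " for _ in header) + "|"]
    lines += ["| " + " | ".join(map(str, r)) + " |" for r in body]
    return "\n".join(lines)
```

(With pandas installed alongside `tabulate`, `df.to_markdown()` does the same thing directly.)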

passion__desire · 2 years ago
Have you checked Surya ?
Oras · 2 years ago
I did and I had issues when tables had mixed text and numbers.

Example:

£243,234 would be £234,

Or £243 234

Or £243,234 (correct).

Some cells weren't even detected.

saliagato · 2 years ago
worked 100% of the time for me
filkin · 2 years ago
which software?
mind-blight · 2 years ago
One thing I've noticed with pdfminer is that it can have horrible load times for some PDFs. I've seen 20-page PDFs take upwards of 45 seconds due to layout analysis. Its analysis engine is also decent, but it sometimes takes newlines into account in weird ways - especially if you're asking for vertical text analysis.