evanhu_ commented on Show HN: Talk to any ArXiv paper just by changing the URL   github.com/evanhu1/talk2a... · Posted by u/evanhu_
gtsnexp · 2 years ago
This is great! We could also point it to https://biorxiv.org/. Awesome work!
evanhu_ · 2 years ago
Thank you so much. Yes, I will have that up soon as well.
arbitrandomuser · 2 years ago
I wonder if they knew that they could get HTML versions of the paper by just changing the link from ...arxiv.. to ar5iv..
evanhu_ · 2 years ago
I did try that at first, but it was hard to parse through the HTML, organize it into logical sections (authors, references, abstract), and then clean up the text to prepare it for chunking and embedding. Once I found GROBID I just went with that route, because it handled all of that for me.
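To illustrate what "handled all that for me" looks like: GROBID returns TEI XML with the paper already segmented, which can be walked with a standard XML parser. A minimal sketch — the TEI snippet is a hand-made stand-in, not real GROBID output, and the `sections` helper is my own illustration:

```python
import xml.etree.ElementTree as ET

# Hand-made stand-in for the TEI XML that GROBID's
# processFulltextDocument endpoint returns (heavily simplified).
TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <profileDesc><abstract><p>We study X.</p></abstract></profileDesc>
  </teiHeader>
  <text><body>
    <div><head>Introduction</head><p>Parsing PDFs is hard.</p></div>
    <div><head>Method</head><p>We use GROBID.</p></div>
  </body></text>
</TEI>"""

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def sections(tei_xml: str) -> dict:
    # Map each logical section name to its text, with the abstract
    # pulled from the TEI header and body sections from <div> blocks.
    root = ET.fromstring(tei_xml)
    out = {}
    abstract = root.find(".//tei:abstract", NS)
    if abstract is not None:
        out["Abstract"] = " ".join(p.text for p in abstract.findall("tei:p", NS))
    for div in root.findall(".//tei:body/tei:div", NS):
        head = div.find("tei:head", NS)
        out[head.text] = " ".join(p.text for p in div.findall("tei:p", NS))
    return out

print(sections(TEI))
```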
schneehertz · 2 years ago
I used the example on the GitHub page, but found that I needed to wait for embedding. It would reduce latency and save API costs if there were a shared cache.
evanhu_ · 2 years ago
There is a cache! You hit a new PDF, but at least you won't have to wait for that one again ;)
pushfoo · 2 years ago
You might be able to drop the PDF backend since they're close to getting HTML running well: https://news.ycombinator.com/item?id=38713215

Using that might be easier than a multi-modal approach. Bonus points for:

* Multiple papers at once

* Comparing the PDF and HTML output with the LLM, as input for correcting similar converter code

evanhu_ · 2 years ago
Definitely. I'll move to the LaTeX source instead of a PDF backend, since that allows better support for non-textual data that GROBID scrapes poorly. That's a really cool development I didn't know about; there's also https://ar5iv.labs.arxiv.org/, which already has most arXiv papers as HTML documents. I chose GROBID because it not only parses the PDF but also organizes the text into logical sections for me (intro, abstract, references), which I didn't want to do manually with heuristics I'd have to devise.
Aachen · 2 years ago
I thought this would be for contacting the authors or chatting about the paper with other readers, but apparently RAG is a new important TLA to take note of here, meaning chatbot. You need to enter an API key from "Open"AI to use the service, and it's about it answering your questions about the paper.
evanhu_ · 2 years ago
Oops, sorry for the miscommunication. Actually, you don't need to enter an API key for now. Feel free to just try it out!
zzleeper · 2 years ago
Looks great! It would be very interesting to understand a bit of the why/how of some of the steps, such as the reranking and how you arrived at your chunking algorithm.
evanhu_ · 2 years ago
Thank you :). I updated the README to have some more explanation of the steps.

The chunking algorithm chunks by logical section (intro, abstract, authors, etc.) and also uses recursive subdivision chunking (chunk at 512 characters, then 256, then 128...). It's still quite naive, but it works OK for now. An improvement might involve more advanced techniques like knowledge-graph precomputation.
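A naive sketch of that two-level scheme, assuming plain strings per section (the function names are mine, not from the repo):

```python
def subdivide(text: str, sizes=(512, 256, 128)) -> list:
    # Recursive-subdivision chunking: window the text at each
    # progressively smaller granularity (512, then 256, then 128 chars).
    chunks = []
    for size in sizes:
        for i in range(0, len(text), size):
            chunks.append(text[i:i + size])
    return chunks

def chunk_paper(sections: dict) -> list:
    # One chunk per whole logical section, plus multi-scale sub-chunks,
    # so both broad and narrow questions have a matching granularity.
    out = []
    for name, body in sections.items():
        out.append(f"{name}\n{body}")  # the entire section as one chunk
        out.extend(subdivide(body))
    return out
```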

Reranking works differently: instead of embedding each text chunk as a vector and performing cosine-similarity nearest-neighbor search, you use a cross-encoder model that compares two texts directly and outputs a similarity score. Specifically, I chose Cohere's reranker, which specializes in comparing query/answer chunk pairs.
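The control flow looks roughly like this. The scoring function below is a toy token-overlap stand-in for a real cross-encoder such as Cohere's reranker, purely to show where the model slots in:

```python
def cross_encoder_score(query: str, passage: str) -> float:
    # Toy stand-in for a cross-encoder: a real model attends over the
    # (query, passage) pair jointly; here we just use token overlap.
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def rerank(query: str, candidates: list, top_k: int = 3) -> list:
    # Unlike bi-encoder retrieval (embed once, cosine-compare vectors),
    # the cross-encoder scores every (query, candidate) pair directly.
    scored = sorted(candidates,
                    key=lambda c: cross_encoder_score(query, c),
                    reverse=True)
    return scored[:top_k]
```

The trade-off is cost: a bi-encoder embeds each chunk once, while a cross-encoder must run the model per query-chunk pair, which is why reranking is usually applied only to a small shortlist of retrieved candidates.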

gorkish · 2 years ago
Very nice; it appears to work well. Just an FYI that I did get a couple of errors where the max context length was exceeded, one using the demo summarization task as the first query. I was using my own API key when the error occurred.
evanhu_ · 2 years ago
Thank you, and thanks for pointing that out! Since the underlying RAG is rather naive (simple embedding cosine-similarity lookup, as opposed to knowledge graphs or other advanced techniques), I opted to embed both "small" chunks (512 characters and below) and entire section chunks (e.g. the whole introduction) in order to support questions like "Please summarize the introduction". Since I also use 5 chunks for each context, I suspect this can add up to a massive amount of text on papers with huge sections.
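One possible guard against this, sketched below as my own suggestion rather than what the repo currently does: keep the top-5 selection, but skip any chunk that would push the context past a rough size budget, so a giant whole-section chunk can't overflow the model's window on its own. The 12,000-character budget is an illustrative stand-in for a real token count.

```python
def build_context(ranked_chunks: list, k: int = 5,
                  max_chars: int = 12_000) -> list:
    # Walk the reranked chunks in order, taking up to k of them but
    # skipping any that would push the running total past the budget.
    picked, used = [], 0
    for chunk in ranked_chunks:
        if len(picked) == k:
            break
        if used + len(chunk) > max_chars:
            continue  # e.g. an oversized whole-section chunk
        picked.append(chunk)
        used += len(chunk)
    return picked
```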
aendruk · 2 years ago
Any plans for bioRxiv?
evanhu_ · 2 years ago
Yes! I'll set up talk2biorxiv.org very soon, as it would be simple to port over. I also plan on making the underlying research-PDF RAG framework available as an independent module.
skeptrune · 2 years ago
This is the first time I have seen someone use GROBID. It seems like an incredibly cool solution.
evanhu_ · 2 years ago
I spent forever looking at various PDF-parsing solutions like Unstructured, and eventually stumbled across GROBID, which was a perfect fit since it's made entirely for scientific papers and has header/section-level segmentation capabilities (splitting the paper into Abstract, Introduction, References, etc.). It's lightweight and fast too!

u/evanhu_ · Karma: 76 · Cake day: October 5, 2021