evanhu_ commented on Show HN: Talk to any ArXiv paper just by changing the URL   github.com/evanhu1/talk2a... · Posted by u/evanhu_
gtsnexp · 2 years ago
This is great! We could also point it to https://biorxiv.org/. Awesome work!
evanhu_ · 2 years ago
Thank you so much. Yes, I will have that up soon as well.
arbitrandomuser · 2 years ago
I wonder if they knew that they could get HTML versions of the paper by just changing the link from ...arxiv.. to ar5iv..
evanhu_ · 2 years ago
I did try that at first, but it was hard to parse through the HTML, organize it into logical sections (authors, references, abstract), and then clean up the text to prepare it for chunking and embedding. Once I found GROBID I just went with that route, because it handled all of that for me.
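To illustrate what "handled all that for me" looks like: GROBID returns TEI XML with the paper already segmented, which can be walked with a standard XML parser. A minimal sketch — the TEI snippet is a hand-made stand-in, not real GROBID output, and the `sections` helper is my own illustration:

```python
import xml.etree.ElementTree as ET

# Hand-made stand-in for the TEI XML that GROBID's
# processFulltextDocument endpoint returns (heavily simplified).
TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <profileDesc><abstract><p>We study X.</p></abstract></profileDesc>
  </teiHeader>
  <text><body>
    <div><head>Introduction</head><p>Parsing PDFs is hard.</p></div>
    <div><head>Method</head><p>We use GROBID.</p></div>
  </body></text>
</TEI>"""

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def sections(tei_xml: str) -> dict:
    # Map each logical section name to its text, with the abstract
    # pulled from the TEI header and body sections from <div> blocks.
    root = ET.fromstring(tei_xml)
    out = {}
    abstract = root.find(".//tei:abstract", NS)
    if abstract is not None:
        out["Abstract"] = " ".join(p.text for p in abstract.findall("tei:p", NS))
    for div in root.findall(".//tei:body/tei:div", NS):
        head = div.find("tei:head", NS)
        out[head.text] = " ".join(p.text for p in div.findall("tei:p", NS))
    return out

print(sections(TEI))
```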
schneehertz · 2 years ago
I used the example on the GitHub page, but found that I needed to wait for embedding. It would reduce latency and save API costs if there were a shared cache.
evanhu_ · 2 years ago
There is a cache! You hit a new PDF, but at least you won't have to wait for that one again ;)
pushfoo · 2 years ago
You might be able to drop the PDF backend since they're close to getting HTML running well: https://news.ycombinator.com/item?id=38713215

Using that might be easier than a multi-modal approach. Bonus points for:

* Multiple papers at once

* Comparing the PDF and HTML output with the LLM, as input for correcting similar converter code

evanhu_ · 2 years ago
Definitely. I'll move to the LaTeX source instead of a PDF backend, since that allows better support for non-textual data that GROBID scrapes poorly. That's a really cool development I didn't know about; there's also https://ar5iv.labs.arxiv.org/, which already has most arXiv papers as HTML documents. I chose GROBID because it not only parses the PDF but also organizes the text into logical sections for me (intro, abstract, references), which I didn't want to do manually with heuristics I'd have to devise.
Aachen · 2 years ago
I thought this would be for contacting the authors or chatting about the paper with other readers, but apparently RAG is a new important TLA to take note of here, meaning chatbot. You need to enter an API key from "Open"AI to use the service, and it's about it answering your questions about the paper.
evanhu_ · 2 years ago
Oops, sorry for the miscommunication. Actually, you don't need to enter an API key for now. Feel free to just try it out!
zzleeper · 2 years ago
Looks great! It would be very interesting to understand a bit of the why/how of some of the steps, such as the reranking and how you arrived at your chunking algorithm.
evanhu_ · 2 years ago
Thank you :). I updated the README to have some more explanation of the steps.

The chunking algorithm chunks by logical section (intro, abstract, authors, etc.) and also uses recursive subdivision chunking (chunk at 512 characters, then 256, then 128...). It's still quite naive, but it works OK for now. An improvement might involve more advanced techniques like knowledge-graph precomputation.
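A naive sketch of that two-level scheme, assuming plain strings per section (the function names are mine, not from the repo):

```python
def subdivide(text: str, sizes=(512, 256, 128)) -> list:
    # Recursive-subdivision chunking: window the text at each
    # progressively smaller granularity (512, then 256, then 128 chars).
    chunks = []
    for size in sizes:
        for i in range(0, len(text), size):
            chunks.append(text[i:i + size])
    return chunks

def chunk_paper(sections: dict) -> list:
    # One chunk per whole logical section, plus multi-scale sub-chunks,
    # so both broad and narrow questions have a matching granularity.
    out = []
    for name, body in sections.items():
        out.append(f"{name}\n{body}")  # the entire section as one chunk
        out.extend(subdivide(body))
    return out
```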

Reranking works differently: instead of embedding each text chunk as a vector and performing cosine-similarity nearest-neighbor search, you use a cross-encoder model that compares two texts directly and outputs a similarity score. Specifically, I chose Cohere's reranker, which specializes in comparing query/answer chunk pairs.
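The control flow looks roughly like this. The scoring function below is a toy token-overlap stand-in for a real cross-encoder such as Cohere's reranker, purely to show where the model slots in:

```python
def cross_encoder_score(query: str, passage: str) -> float:
    # Toy stand-in for a cross-encoder: a real model attends over the
    # (query, passage) pair jointly; here we just use token overlap.
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(len(q), 1)

def rerank(query: str, candidates: list, top_k: int = 3) -> list:
    # Unlike bi-encoder retrieval (embed once, cosine-compare vectors),
    # the cross-encoder scores every (query, candidate) pair directly.
    scored = sorted(candidates,
                    key=lambda c: cross_encoder_score(query, c),
                    reverse=True)
    return scored[:top_k]
```

The trade-off is cost: a bi-encoder embeds each chunk once, while a cross-encoder must run the model per query-chunk pair, which is why reranking is usually applied only to a small shortlist of retrieved candidates.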

gorkish · 2 years ago
Very nice; it appears to work well. Just an FYI that I did get a couple of errors where the max context length was exceeded, one using the demo summarization task as the first query. I was using my own API key when the error occurred.
evanhu_ · 2 years ago
Thank you, and thanks for pointing that out! Since the underlying RAG is rather naive (simple embedding cosine-similarity lookup, as opposed to knowledge graphs or other advanced techniques), I opted to embed both "small" chunks (512 characters and below) and entire section chunks (e.g. the whole introduction) in order to support questions like "Please summarize the introduction". Since I also use 5 chunks for each context, I suspect this can add up to a massive amount of text on papers with huge sections.
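One possible guard against this, sketched below as my own suggestion rather than what the repo currently does: keep the top-5 selection, but skip any chunk that would push the context past a rough size budget, so a giant whole-section chunk can't overflow the model's window on its own. The 12,000-character budget is an illustrative stand-in for a real token count.

```python
def build_context(ranked_chunks: list, k: int = 5,
                  max_chars: int = 12_000) -> list:
    # Walk the reranked chunks in order, taking up to k of them but
    # skipping any that would push the running total past the budget.
    picked, used = [], 0
    for chunk in ranked_chunks:
        if len(picked) == k:
            break
        if used + len(chunk) > max_chars:
            continue  # e.g. an oversized whole-section chunk
        picked.append(chunk)
        used += len(chunk)
    return picked
```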
aendruk · 2 years ago
Any plans for bioRxiv?
evanhu_ · 2 years ago
Yes! I'll set up talk2biorxiv.org very soon, as it would be simple to port over. I also plan on making the underlying research-PDF RAG framework available as an independent module.
skeptrune · 2 years ago
This is the first time I have seen someone use GROBID. It seems like an incredibly cool solution.
evanhu_ · 2 years ago
I spent forever looking at various PDF-parsing solutions like Unstructured, and eventually stumbled across GROBID, which was a perfect fit since it's made entirely for scientific papers and has header/section-level segmentation capabilities (splitting the paper into Abstract, Introduction, References, etc.). It's lightweight and fast too!

u/evanhu_ · Karma: 76 · Cake day: October 5, 2021