Koaisu commented on Show HN: Talk to any ArXiv paper just by changing the URL   github.com/evanhu1/talk2a... · Posted by u/evanhu_
evanhu_ · 2 years ago
I'll definitely move to the LaTeX source instead of a PDF backend, since that gives better support for the non-textual data that GROBID scrapes poorly. That's a really cool development I didn't know about. There's also https://ar5iv.labs.arxiv.org/, which already has most arXiv papers as HTML documents. I chose GROBID because it not only parses the PDF but also organizes the text into logical sections for me (intro, abstract, references), which I didn't want to do manually with heuristics I'd have to devise myself.
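For anyone curious what "organized into logical sections" looks like in practice, here is a rough sketch of reading section titles and text out of GROBID's TEI XML output. It is not the code talk2arxiv uses. The function name `sections_from_tei` and the sample document are made up for illustration, and it assumes the usual TEI layout (abstract under `profileDesc`, body sections as `div` elements with a `head`).

```python
import xml.etree.ElementTree as ET

# GROBID emits TEI XML; every element lives in this namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def _clean(element) -> str:
    """Join all nested text of an element and collapse whitespace."""
    return " ".join("".join(element.itertext()).split())


def sections_from_tei(tei_xml: str) -> dict[str, str]:
    """Hypothetical helper: map section title -> section text.

    Assumes the abstract sits under teiHeader/profileDesc and body
    sections are top-level <div> elements with an optional <head>.
    Sections sharing the same title overwrite each other.
    """
    root = ET.fromstring(tei_xml)
    sections: dict[str, str] = {}

    abstract = root.find(".//tei:profileDesc/tei:abstract", TEI_NS)
    if abstract is not None and _clean(abstract):
        sections["Abstract"] = _clean(abstract)

    for div in root.findall(".//tei:text/tei:body/tei:div", TEI_NS):
        head = div.find("tei:head", TEI_NS)
        title = _clean(head) if head is not None else "Untitled"
        paragraphs = [_clean(p) for p in div.findall("tei:p", TEI_NS)]
        sections[title] = "\n".join(paragraphs)

    return sections


# Tiny hand-written sample in the assumed TEI shape (not real GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc><abstract><p>We study X.</p></abstract></profileDesc></teiHeader>
  <text><body>
    <div><head>Introduction</head><p>First paragraph.</p><p>Second paragraph.</p></div>
  </body></text>
</TEI>"""

if __name__ == "__main__":
    for title, text in sections_from_tei(SAMPLE_TEI).items():
        print(f"== {title} ==\n{text}\n")
```

Formulas and figures are where this falls short, since the TEI only carries whatever text GROBID managed to extract from the PDF.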
Koaisu · 2 years ago
Maybe you could use: https://github.com/facebookresearch/nougat/tree/main or https://github.com/VikParuchuri/marker

Both are tools that convert PDFs into LaTeX or Markdown with LaTeX formulas. Maybe that helps.
