ArXiv now offers papers in HTML format

shrimpx · 2 years ago

Since the article doesn't link to any example HTML article, here's a random link:

https://browse.arxiv.org/html/2312.12451v1

It's cool that it has a dark mode. Didn't see a toggle but renders in the system mode.

Overall will make arXiv a lot more accessible on mobile.

burkaman · 2 years ago

And here's the PDF of the same paper for comparison: https://arxiv.org/pdf/2312.12451.pdf
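
For anyone who wants to compare other papers the same way, here is a minimal sketch that builds both links from an arXiv ID. It only mirrors the pattern of the two URLs quoted above; the function name `arxiv_urls` and the default version number are assumptions for illustration, and not every paper necessarily has an HTML rendering.

```python
def arxiv_urls(paper_id: str, version: int = 1) -> dict[str, str]:
    """Build HTML and PDF links for one paper, following the two URLs quoted above."""
    return {
        # Pattern taken from the example HTML link in this thread.
        "html": f"https://browse.arxiv.org/html/{paper_id}v{version}",
        # Pattern taken from the PDF link in this thread.
        "pdf": f"https://arxiv.org/pdf/{paper_id}.pdf",
    }


if __name__ == "__main__":
    for kind, url in arxiv_urls("2312.12451").items():
        print(kind, url)
```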

FredPret · 2 years ago

The contrast is massive. I'm much more likely to read the html version; that PDF is deeply off-putting in some hard to define way. Maybe it's the two columns, or the font, or the fact that the format doesn't adjust to fit different screen sizes.

znpy · 2 years ago

I prefer the pdf version, mostly. I can annotate it on the side both in print and digitally with my iPad. I can also invert colors in pdf readers to get some kind of “dark mode” easily.

The html version is wasting a lot of space on the right side and the color scheme is awful (dark grey on a brown background, seriously? How is that any better? Edit: disabling dark mode yields a better reading experience wrt color scheme). Also, somehow links to references make another http request and have no backlink?

The html version could make sense if it had more dynamic functionalities: change fonts/line spacing, toggle color schemes, maybe a mini map or some other navigational tool? Also, some kind of support for highlighting and/or annotating?

jez · 2 years ago

It would be neat if they offered submitters the chance to upload their own HTML version alongside the PDF version, instead of always relying on an automatic conversion process.

- I can imagine authors feeling frustrated if someone reaches out about a problem in the HTML version of their paper, but they have no way to correct it except by hoping that a change to the PDF fixes a change to the generated HTML. Easier to just fix the formatting problem in the PDF outright.

- It would be neat to allow people to experiment with alternative formatting for their papers. For example, imagine a paper about a programming language that embeds a sandbox you can use to play around with the language under discussion. Or a paper about multivariable calculus and you can interact with a three dimensional plot of some function.

IlliOnato · 2 years ago

No, it would not. It's critically important that there is only one "logical" article, albeit with different representations. In other words, a single "source of truth".

With "sideloading" of HTML there is no way in general to make sure that the contents of LaTeX (and PDF) on one side and HTML on the other side is the same.

felixfbecker · 2 years ago

Maybe some day for some papers HTML could be the source of truth instead of LaTeX. After all, the original use case for HTML and the web was academics. The HTML and CSS specs have evolved a lot since then, with support for the typesetting features you need for papers (justified text, hyphenation, page breaks, page numbers, ...) and even math formulas are possible now again natively with MathML thanks to Igalia. Diagrams can be accessible vector SVGs instead of raster images. Referencing, linking, citing, figures, tables, etc have always been native to HTML. It's trivial nowadays too to wrap a headless chromium in a CLI to convert an HTML document to PDF rendered in the exact same way that the browser would (i.e. not some static conversion tool that lags behind standards or has render implementation differences).
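
As a rough sketch of the "headless Chromium in a CLI" idea: the snippet below builds and runs a command using Chromium's `--headless` and `--print-to-pdf` switches. The executable name `chromium` is an assumption (it differs between platforms and installs), and the helper names are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_pdf_command(html_path: str, pdf_path: str, browser: str = "chromium") -> list[str]:
    """Return the argument list that asks a headless browser to print a page to PDF."""
    html_uri = Path(html_path).resolve().as_uri()  # file:// URI for the local document
    pdf_out = Path(pdf_path).resolve()
    # "browser" is an assumed executable name; adjust it for your system.
    return [browser, "--headless", f"--print-to-pdf={pdf_out}", html_uri]


def html_to_pdf(html_path: str, pdf_path: str, browser: str = "chromium") -> None:
    """Run the conversion; raises CalledProcessError if the browser exits with an error."""
    subprocess.run(build_pdf_command(html_path, pdf_path, browser), check=True)


if __name__ == "__main__":
    # Only prints the command, so it is safe to run without a browser installed.
    print(build_pdf_command("paper.html", "paper.pdf"))
```

Page size, margins and page breaks would then be controlled from the document's own print CSS rather than from the command line.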

dataflow · 2 years ago

> With "sideloading" of HTML there is no way in general to make sure that the contents of LaTeX (and PDF) on one side and HTML on the other side is the same.

Is it not possible to write LaTeX code that produces different contents in HTML vs. PDF?

GoblinSlayer · 2 years ago

Huh? What's the point of html version if you define it as source of deception?

diffeomorphism · 2 years ago

> It would be neat if they offered submitters the chance to upload their own HTML version alongside the PDF version, instead of always relying on an automatic conversion process.

Please don't. Then you will have a mismatch between the source and the "own html" which ruins the point of uploading the source.

eviks · 2 years ago

Pdf isn't the source

layer8 · 2 years ago

They’d have to define and document a “safe” subset of HTML, and implement a filter/checker for it. Otherwise we’d end up with papers containing ads and tracking and XSS vulnerabilities and whatnot.
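
A minimal sketch of what such a checker could look like, using only Python's standard `html.parser`. The allowlist below is a made-up example, not anything arXiv has defined, and a real sanitizer would need far more care (CSS, SVG, URL schemes, and so on). It flags tags outside the allowlist, inline event-handler attributes such as `onerror`, and `javascript:` URLs, since the latter two can run script without any `<script>` or `<iframe>` tag.

```python
from html.parser import HTMLParser

# Hypothetical allowlist for illustration only.
ALLOWED_TAGS = {
    "html", "head", "title", "body", "section", "h1", "h2", "h3",
    "p", "a", "em", "strong", "ul", "ol", "li", "img", "figure",
    "figcaption", "table", "tr", "th", "td", "math", "code", "pre",
}
URL_ATTRS = {"href", "src"}


class SafeSubsetChecker(HTMLParser):
    """Collects a list of problems instead of rewriting the document."""

    def __init__(self) -> None:
        super().__init__()
        self.problems: list[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag not in ALLOWED_TAGS:
            self.problems.append(f"disallowed tag <{tag}>")
        for name, value in attrs:
            if name.startswith("on"):
                self.problems.append(f"event handler attribute {name} on <{tag}>")
            elif name in URL_ATTRS and (value or "").strip().lower().startswith("javascript:"):
                self.problems.append(f"javascript: URL in {name} on <{tag}>")


def check_html(source: str) -> list[str]:
    """Return every problem found in the given HTML source."""
    checker = SafeSubsetChecker()
    checker.feed(source)
    checker.close()
    return checker.problems


if __name__ == "__main__":
    print(check_html('<p>Fine</p><img src="x.png" onerror="alert(1)">'))
```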

digging · 2 years ago

Those are issues with JavaScript, not HTML. Wouldn't filtering out iframes pretty much keep us in the clear?

kjkjadksj · 2 years ago

Most authors probably have no interest in learning html. Also, most authors want nothing to do with the work by the time it's submitted. It was probably hell getting the project to that point of publishing; they want to be done with it and move on to the next thing going on in their career asap.

jez · 2 years ago

I think this is an argument in favor of doing automatic PDF -> HTML conversion for the authors that don't want to touch it, but I don't think it's an argument against letting those who are fine with HTML provide their own.

bookofjoe · 2 years ago

You hit on an unappreciated truth. By the time my papers appeared in print, I was so sick of them and the endless effort involved in taking them from raw data to finished, edited, proofed, rewritten a zillion times to meet the reviewers' and editors' requests and corrections and suggestions, that I didn't even read the published paper when it arrived as preprints and in the journal.

Enough!

My proof: https://scholar.google.com/citations?user=5DdrMc8AAAAJ&hl=en

tiagod · 2 years ago

I was under the impression the source authors publish to arxiv was a latex file

jez · 2 years ago

Ah, thanks for clarifying!

I looked up the submission formats, and it looks like if you authored the paper in TeX/LaTeX, they do not accept pre-rendered versions of the document.

https://info.arxiv.org/help/submit/index.html#formats-for-te...

But if you did not author it in TeX/LaTeX (e.g., Word, Google Docs, etc.) it appears you can upload a PDF or HTML yourself.

jraph · 2 years ago

It is.

thomasahle · 2 years ago

> It would be neat if they offered submitters the chance to upload their own HTML version alongside the PDF version, instead of always relying on an automatic conversion process.

Can you recommend a system I can use to compile my latex, while also making sure the html is going to look good? I'd like some kinds of css style @media queries to switch between certain parts of the layout, while keeping a single latex file.

turing_complete · 2 years ago

With the shelf life of web technologies, authors would constantly have to maintain their "papers" or they just would not be accessible after a while.

erik_seaberg · 2 years ago

Knuth’s stated intent in maintaining TeX is only to fix bugs, not evolve the system in a way that might break old documents. Not sure if this is equally true for Lamport’s LaTeX macros but it wouldn’t surprise me.

pasc1878 · 2 years ago

Plain html from mid 90s still renders and looks as good as it ever was.

I think CSS is also backwards compatible.

It is the JavaScript bits that change.

svag · 2 years ago

The tool being used for this offering is this one, https://github.com/arXiv/arxiv-readability, just to save a few clicks :)

IshKebab · 2 years ago

Wow I did not know they have the LaTeX for all the papers and compile it themselves! That's pretty crazy. What if they don't have packages you need? What if your paper isn't written with LaTeX?

r4indeer · 2 years ago

> What if they don't have packages you need?

Unlikely. But if so, you can provide the packages yourself: https://info.arxiv.org/help/submit_tex.html#wegotem

> What if your paper isn't written with LaTeX?

Then they still accept PDF or HTML. See: https://info.arxiv.org/help/submit/index.html#formats-for-te...

aragilar · 2 years ago

They specify what version of texlive they use. This is significantly better than what publishers offer (usually a really old latex version, not even pdflatex).

dginev · 2 years ago

That's it in spirit, but in practice it's refreshed:

https://github.com/arXiv/arxiv-view-as-html

ofou · 2 years ago

I wonder how much better this is compared to Pandoc's.

injuly · 2 years ago

For anyone who needs it, arxiv-vanity is amazing: https://www.arxiv-vanity.com/

westurner · 2 years ago

arxiv-sanity-lite: https://github.com/karpathy/arxiv-sanity-lite

jll29 · 2 years ago

It's a cool feature because it makes the papers more findable, more easily navigable, easier to read online and faster to scroll through. I am also happy for blind people that they can more easily use arXiv with Braille readers now.

(I'm still a fan of printing the PDFs, because I annotate on paper and refer to page numbers, but the HTML feature is in addition to PDF download, not a replacement.)

One thing that still sucks (not ArXiv related though) is reading mathematical formulae on the Kindle - wonder if someone with rendering expertise could have a look into the MOBI format.

isaacfung · 2 years ago

This would never happen, but in an ideal world we should be able to click on a citation to jump to the part of the paper that is being referenced, and each paper page should have a discussion board so we can easily communicate with the authors and group the discussion in one place, instead of having to google to see if there is relevant discussion on twitter/reddit. We could even put links to talks, tutorials, blogs, github repo, demo, paperswithcode/google scholar/open review, background material, and a timeline of citations in tree form on the same page (actually I am seeing more machine learning papers that have a project page that does some of these), or even turn it into a mini wiki. I just think html has so much more potential (especially now that with LLMs we can do semantic search). I wonder if there would be interest in such a Chrome extension overlay.

Related projects:

https://github.com/ahrm/sioyek

https://github.com/arxiv-vanity/engrafo

https://github.com/dginev/ar5iv

https://academ.us/article/2111.15588/ (powered by https://github.com/jgm/pandoc I believe)

me_jumper · 2 years ago

I think https://web.hypothes.is/ would be of interest to you.

astrolx · 2 years ago

This is excellent news. Their HTML formatting is also more pleasant than the HTML articles offered by most journals in my field (e.g. arXiv HTML footnotes are displayed as sidenotes on large displays!)

tarboreus · 2 years ago

One of the reasons is to make the papers more accessible to people with disabilities, especially the blind. I participated in a conference they hosted on this a few months ago; I recommend taking a look at the recordings if you're interested in thinking on this.

https://accessibility2023.arxiv.org/

miki123211 · 2 years ago

Blind person here, can confirm this. Reading PDFs with a screen reader is bad, reading PDFs that come from LaTeX is worse, reading LaTeX math is pretty much impossible. All the semantic info you need is just thrown away.

You can make decently accessible PDFs but it's lots of work, you need Acrobat on the producer's side and might also need it on the consumer's side. Free tools don't even come close. There's also the fact that the process of making accessible PDFs in Acrobat isn't itself accessible.

With that said, the way screen readers treat HTML math certainly isn't perfect, it's geared more towards school children than anything above calculus. I'm probably going to stay with my LaTeX source files for now. At least ArXiv offers those, not many sites do. To be fair, that approach also has its own set of problems (particularly when people use some extra fancy formatting in their math equations, making the markup hard to read), but I find this to be the best approach for me so far, at least on AI/ML papers.

kkylin · 2 years ago

I teach math at a university. A couple years ago I had two blind students in my section of first-year calculus, and I really struggled with the tooling. Using latexml, I could produce documents that one of the students could use with a screen reader, but the other student never managed to make it work on their machine. Both students prefer braille but I didn't find anything open source that could typeset mathematical braille easily. Our disability resource office sends things out to a contractor to typeset into braille; the turn-around is measured in weeks.

Anyway, if you (or anyone else reading this) have suggestions I'd really appreciate it!

Blikkentrekker · 2 years ago

I made these arguments two decades ago when I was still in university that PDF is a horrible format because it's purely præsentational, especially for people with disabilities whose software relies on semantic information. LaTeX last time I used it didn't even have a different symbol for uppercase Alpha and A because the glyphs are indistinguishable.

They argued that PDF was superior because the publisher could control how it looked and it looked the same everywhere but the point is that it should not. Things such as font size and line spacing should be at the control of the consumer, not the publisher. This isn't simply blind people but for instance also persons with dyslexia who use particular fonts to make it easier to read for them. Or in my case, someone who simply gets a headache from fonts and line-spacing that is too big. I've also been using darkmode everywhere for so long now that reading black text on a white surface on a screen gives me a headache.

ldenoue · 2 years ago

I wrote an app called PDF Reflow that reflows the original PDF using image processing to cut out words into tiles so you see the reflowed version of the text in their original look.

https://www.appblit.com/pdfreflow
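
To illustrate the layout half of that idea (this is not how PDF Reflow is implemented, just a sketch): assuming the word images have already been cut out into tiles of known size, a greedy pass can place them left to right and wrap whenever the next tile would overflow the target width. The `Tile` type, the gap values and the function name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Tile:
    """One cut-out word image, described only by its pixel size."""
    width: int
    height: int


def reflow(tiles: list[Tile], page_width: int, gap: int = 4, line_gap: int = 6) -> list[tuple[int, int]]:
    """Return the top-left (x, y) position of each tile on a page of the given width."""
    positions: list[tuple[int, int]] = []
    x = y = 0
    line_height = 0
    for tile in tiles:
        # Wrap to a new line if this tile would overflow (never wrap an empty line).
        if x > 0 and x + tile.width > page_width:
            x = 0
            y += line_height + line_gap
            line_height = 0
        positions.append((x, y))
        x += tile.width + gap
        line_height = max(line_height, tile.height)
    return positions


if __name__ == "__main__":
    words = [Tile(40, 10), Tile(50, 12), Tile(30, 10)]
    print(reflow(words, page_width=100))
```

Because each tile keeps its original pixels, the text keeps its original look; only the positions change with the screen width.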

ahepp · 2 years ago

Do you think there's potential for language models to play a role here? I know that AI can get tossed around as a buzzword, but hasn't it proved quite successful in fields like computer vision?

I'm not deeply familiar with the state of that art, but it seems like recovering the metadata from a PDF generated by LaTeX would be no more impressive than many other things we're currently seeing language models achieve?

jakderrida · 2 years ago

Hold on... Are you telling me that all these complex sentences are being typed out based on your voice alone? That's insane.

phlakaton · 2 years ago

For the math equations, I'm curious: does MathML do any better for you than LaTeX?

saurik · 2 years ago

Huh. It would seem like, of all the things which should make it easy to generate the correct accessibility information, the pipeline of compiling a paper from source code in LaTeX should nail it... maybe we should all pitch in to some pool to pay someone to put in the required effort to connect all the dots?

spookie · 2 years ago

Yup LaTeX math doesn't make sense. I've been trying to hack my way into getting a voice model to read it but no real progress.

anthk · 2 years ago

Emacs with Emacspeak has a math reading module.

wilg · 2 years ago

For accessibility purposes (and regular reading), it would be so much better to drop the justified text. Ragged edge is the way to go!

https://www.boia.org/blog/why-justified-or-centered-text-is-...

jonatanheyman · 2 years ago

Not necessarily:

https://heyman.info/2023/fill-justified-text-on-the-web

reqo · 2 years ago

A lot of AI/ML papers these days have an accompanying interactive page like [0]. Will we see anything like these directly in arXiv now?

[0] https://voyager.minedojo.org/

z2h-a6n · 2 years ago

I think then arXiv would have to deal with maintaining the tech stack and providing the presumably much higher server capacity to serve the more varied web pages that would result, so it seems like a tall order. arXiv already has an experimental integration with Papers with Code [0], which I guess provides similar results for the reader, though the authors have to figure out their own web hosting.

[0] https://info.arxiv.org/labs/showcase.html#arxiv-links-to-cod...

MahiShafiullah · 2 years ago

Second that. Something I put out recently had an (admittedly video heavy) webpage that had 1TB of traffic over the past month. Cloudflare handled it for free for me, but at ArXiv’s scale it’s bound to be a problem.