Are "human intelligence" and "intelligence" the same thing?
And: How does one become conscious before being intelligent?
Or: If intelligence is a side effect, how often can this side effect not be observed?
Xor: What if an intelligent being denies being conscious?
Seems like there would be low-hanging fruit in heavier preprocessing then? Something deterministic like a reading-level score. Or even a tiny model trained for the task of picking out good data?
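To make the "deterministic score" idea concrete, here is a minimal sketch using the Flesch Reading Ease formula as the reading-level measure. The syllable counter is a crude vowel-group heuristic, and the function names and the `min_score`/`max_score` thresholds are made up for illustration, not taken from any real pipeline.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count runs of vowels, with a floor of one."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Crude rule: treat a trailing "e" as silent (e.g. "make").
    if lowered.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher means easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text has no words or sentences")
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


def keep_document(text: str, min_score: float = 30.0, max_score: float = 90.0) -> bool:
    """Hypothetical filter: keep text whose score falls inside a band.

    The thresholds are placeholder assumptions and would need tuning
    against real data.
    """
    try:
        score = flesch_reading_ease(text)
    except ValueError:
        return False
    return min_score <= score <= max_score
```

A score like this is cheap and reproducible, but it only measures sentence and word length, so it would probably serve best as a first-pass filter ahead of something smarter, such as the tiny trained model.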
An example: More than ten years ago a friend of mine was fascinated by the German edition of the book "A Cultural History of Physics" by Károly Simonyi. He scanned the book (600+ pages) and created a PDF with (nearly) the same layout.
Against my advice he used Adobe tools for it instead of creating an epub or something like DocBook.
The PDF looks great, but the text inside is impossible to use as training data for a small LLM. The lines from the two columns are interleaved and many spaces are inserted at random, which is particularly troublesome because mathematical formulas often appear inline in the text.
After many attempts (with RegEx and LLMs), I gave up, rendered each page, and had a large LLM extract the text.
The author is a frontend designer and has a nice website, too: https://dbushell.com/
I like the personal, individual style of both pages.
I agree it is a clever approach. But it also shows exactly how hard it is to use XML and XSLT in a "proper way": formally, everything is fine done this way (except that the server sends 'content-type: application/xml' for /index.xsl; it should be 'application/xslt+xml').
Almost all implementations in XML and XSLT that I have seen in my career showed a nearly complete lack of understanding of how they were intended to be used and how they should work together. Starting with completely pointless key/value XMLs (I'm looking at you, Apple and Nokia), through call-template orgies (IBM), to 'yet-another-element-open/-close' implementations (almost every in-house application developed in PHP, Java or .NET).
I started using XSLT before the first specification had been published. Initially, I only used it in the browser. Years later, I was able to use XSLT to create XSDs and modify them at runtime.
With Safari (standard and tech preview) the rendering looks strange, to say the least. The root sign does not have a straight line at the top (for many fonts), and at least the partial derivative is not rendered in italics (for all fonts).
The article has some good practical tips, and it's not on the author, but man, I really wish we'd stop abusing the term "engineering" in a desperate attempt to stroke our own egos and/or convince people to give us money. It's pathetic. Coming up with good inputs to LLMs is more art than science, and it's a craft. Call a spade a spade.
But: Interestingly, the behavior of LLMs in different contexts is also the subject of scientific research.