joakleaf commented on I Tested the M5 iPad Pro's Neural-Accelerated AI, and the Hype Is Real   macstories.net/stories/ip... · Posted by u/alwillis
joakleaf · 24 days ago
Related, with tests on MacBook Pro M5 vs M4:

https://machinelearning.apple.com/research/exploring-llms-ml...

"Exploring LLMs with MLX and the Neural Accelerators in the M5 GPU"

joakleaf commented on The last European train that travels by sea   bbc.com/travel/article/20... · Posted by u/1659447091
kibwen · 2 months ago
> In August, the Italian government revived long-standing plans to build a vast €13.5bn (£11.7bn) suspension bridge over the strait – one of the world's most ambitious engineering projects.

What makes it particularly ambitious? The strait of Messina is two miles across, and I don't think that even cracks the top 100 of the world's longest bridges.

joakleaf · 2 months ago
At 3,300 meters, it will be by far the longest span of any suspension bridge.

The current longest is in Turkey at 2,023 meters.

Each of the pylons of the Messina Bridge will be around 400 meters tall, which is taller than the Empire State Building.

The strait is too deep, with too much current and seismic activity, to place the pylons in the water. So they have to be on the shore, as I understand it.

joakleaf commented on Steve Jobs and Cray-1 to be featured on 2026 American Innovations $1 coin   usmint.gov/news/press-rel... · Posted by u/maguay
contrarian1234 · 2 months ago
yeah, i don't wanna shit on steve jobs. I'm sure he reflects on stuff. (though this thing seems to suggest.. he needs to reflect on some real basic human stuff..) I'm sure you can find some cute quotes from Bill Gates too. It's just not really what he's known for

just out of pure curiosity.. what's the context of this? He wrote a poem.. to email to himself? and.. how did he get access to his private emails?

I can't think of any other example of people writing and mailing poems to themselves

joakleaf · 2 months ago
It was released posthumously by the Steve Jobs Archive:

https://stevejobsarchive.com/

The archive was launched by Laurene Powell Jobs in 2022.

joakleaf commented on Europe's EV sales surge 26% in 2025 while Tesla faces decline   notebookcheck.net/Europe-... · Posted by u/doener
joakleaf · 3 months ago
Here is a list of EV sales in Europe by country for the first half of 2025:

https://www.best-selling-cars.com/europe/2025-half-year-euro...

This gives a much more nuanced picture, and it isn't as clear-cut as the headline implies.

For example, Spain saw an increase of 83.9% and France a decline of 6.9%.

... And then you see that Denmark bought as many EVs as Spain, although Spain has 10x the population.

joakleaf commented on Scientist exposes anti-wind groups as oil-funded, now they want to silence him   electrek.co/2025/08/25/sc... · Posted by u/xbmcuser
Havoc · 4 months ago
The sudden US pivot towards actively suppressing wind energy is absolutely wild.

There are farms that are nearing completion and now are just in limbo.

https://edition.cnn.com/2025/08/26/business/wind-project-can...

joakleaf · 4 months ago
I listened to the press conference from Trump's cabinet meeting the other day.

It is bizarre to listen to.

Robert F. Kennedy Jr. claimed that windmills had killed 100+ whales. I tried to find out what he referred to, but couldn't find anything but articles debunking any claim that windmills affect whales (after construction).

He also claimed that the price per kWh of wind energy is above $0.30, which is quite far from the $0.03 ($0.12 offshore) price per kWh listed on Wikipedia [1] for the United States.

At the same meeting, Trump stated that the only viable solution is fossil fuel, "... and maybe a little nuclear, but mostly fossil fuel", and that wind is about 10x more expensive than natural gas (again contradicting the prices listed in the Wikipedia reference, where the prices for onshore wind and natural gas are almost identical).

[1] https://en.wikipedia.org/wiki/Cost_of_electricity_by_source

joakleaf commented on So you want to parse a PDF?   eliot-jones.com/2025/8/pd... · Posted by u/UglyToad
reactordev · 5 months ago
We tried the xml structured route, only to end up with pea soup afterwards. Rasterizing and OCR was the only way to get standardized output.

joakleaf · 5 months ago
I know OCR is easier to set up, but you lose a lot going that way.

We process several million pages from newspapers and magazines from all over the world, with medium to very high complexity layouts.

We built the PDF parser on top of open source PDF libraries, and this gives many advantages:
• We can accurately extract headlines and other text placed on top of images. OCR is generally hopeless with text placed on top of images or on complex backgrounds.
• We distinguish letters accurately (e.g. the digit 1 vs. "I" vs. "l", or "O" vs. zero).
• OCR will pick up ghost letters from images, where the OCR program believes there is text even if there isn't. We don't.
• We have much higher accuracy than OCR because we don't depend on the OCR program's ability to recognize the letters.
• We can utilize font information and accurate color information, which helps us distinguish elements from each other.
• We have accurate bounding box locations of each letter, word, line, and block (in pts).

To do it, we completely abandon the PDF text-structure and only use the individual location of each letter. Then we combine letter positions to words, words to lines, and lines to text-blocks using a number of algorithms.
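As a rough illustration of the letters-to-words-to-lines idea (not our actual code), here is a minimal sketch. The `Glyph` record, the function names, and both thresholds are made up for the example, and it assumes horizontal text with y growing downward:

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x0: float        # left edge, in points
    x1: float        # right edge, in points
    baseline: float  # baseline y position, in points (assumed to grow downward)
    size: float      # font size, in points


def group_into_lines(glyphs, tolerance=0.3):
    """Cluster glyphs whose baselines differ by at most tolerance * font size."""
    lines = []
    for g in sorted(glyphs, key=lambda g: (g.baseline, g.x0)):
        if lines and abs(lines[-1][-1].baseline - g.baseline) <= tolerance * g.size:
            lines[-1].append(g)
        else:
            lines.append([g])
    # Within each line, order glyphs left to right.
    return [sorted(line, key=lambda g: g.x0) for line in lines]


def split_into_words(line, gap_ratio=0.25):
    """Start a new word when the horizontal gap exceeds gap_ratio * font size."""
    words, current = [], [line[0]]
    for prev, g in zip(line, line[1:]):
        if g.x0 - prev.x1 > gap_ratio * g.size:
            words.append(current)
            current = [g]
        else:
            current.append(g)
    words.append(current)
    return ["".join(g.char for g in word) for word in words]


def extract_text(glyphs):
    """Return one string per detected line, with words joined by single spaces."""
    return [" ".join(split_into_words(line)) for line in group_into_lines(glyphs)]
```

Note that the spaces in the output come purely from the measured gaps, not from any space characters stored in the PDF.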

Afterwards we run machine learning on the structured blocks we generated, so this is just the first step in analyzing the page.

It may seem like a large undertaking, but it only took a few months to build initially, and we have very rarely touched the code over the last 10 years. So it was a very good investment for us.

Obviously, you can achieve a lot of the same with OCR -- but you lose information, accuracy, and computational efficiency, and you depend on the OCR program you use. The best OCR programs are commercial and somewhat pricey at scale.

joakleaf commented on So you want to parse a PDF?   eliot-jones.com/2025/8/pd... · Posted by u/UglyToad
petesergeant · 5 months ago
Very interesting. How often do you encounter PDFs that are just scanned pages? I had to make heavy use of pdfsandwich last time I was accessing journal articles.

> quality is higher because we don't depend on vision based text recognition

This surprises me a bit; outside of an actual scan leaving the computer I’d expect PDF->image->text in a computer to be essentially lossless.

joakleaf · 5 months ago
This happens, including variants that have already been processed with OCR.

If a page is purely scanned, it contains just a single image and no text.

OCR programs will commonly create a PDF where the text/background and detected images are separate. The OCR program then inserts transparent (no-draw) letters in place of the text it has identified, or (less frequently) places the letters behind the scanned image in the PDF (i.e. with lower z).

We can detect if something has been generated by an OCR program by looking at the "Creator" data in the PDF, which describes the program used to create the PDF. So we can handle that differently (and we do handle that a little bit differently).
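A minimal sketch of that kind of check, assuming the PDF's document information has already been read into a plain dict. The marker list is hypothetical and deliberately short, and also looking at `/Producer` is an assumption of this example rather than something we necessarily do:

```python
# Hypothetical substrings that suggest an OCR tool wrote the file.
OCR_MARKERS = ("ocr", "tesseract", "abbyy", "finereader")


def looks_ocr_generated(metadata):
    """Return True if the /Creator or /Producer entry mentions a known OCR tool."""
    fields = (metadata.get("/Creator"), metadata.get("/Producer"))
    text = " ".join(f for f in fields if f).lower()
    return any(marker in text for marker in OCR_MARKERS)
```

A plain substring match like this will produce false positives, so in practice the list would need tuning against real files.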

PDF->image->text is 100% not lossless.

When you rasterize the PDF, you lose information because you are going from a resolution-independent format to a specific resolution:
• Text must be rasterized into letters at the target resolution
• Images must be resampled at the target resolution
• Vector paths must be rasterized to the target resolution

So for example the target resolution must be high enough that small text is legible.
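To put numbers on that: a PDF point is 1/72 inch, so the pixel height of text follows directly from the font size and the DPI. A small sketch (the function names and the 20 px legibility threshold used below are just assumptions for illustration):

```python
def text_height_px(font_size_pt, dpi):
    """Nominal pixel height of text: one PDF point is 1/72 inch."""
    return font_size_pt * dpi / 72


def min_dpi_for(font_size_pt, min_height_px):
    """Lowest DPI at which text of the given size reaches min_height_px."""
    return min_height_px * 72 / font_size_pt
```

For example, 6 pt text is only 6 px tall at 72 DPI and 25 px tall at 300 DPI. If an OCR engine wanted at least 20 px, you would need `min_dpi_for(6, 20)`, i.e. 240 DPI.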

If you perform OCR, you depend on the ability of the OCR program to accurately identify the letters based on the rasterized form.

OCR is not 100% accurate, because it is a computer vision recognition problem, and:
• There are hundreds of thousands of fonts in the wild, each with different details and appearances.
• Two letters can look the same. A simple example where trivial OCR/recognition fails is the capital letter "I" and the lower case "l". These are both vertical lines, so you need the context (letters nearby). Same with "O" and zero.
• OCR is also pretty hopeless with e.g. headlines/text written on top of images, because it is hard to distinguish letters from the background. But even regular black-on-white text fails sometimes.
• OCR will also commonly identify "ghost" letters in images that are not really there, i.e. it picks up a bunch of pixels as a letter when they are really just some pixel structure in the image (not even necessarily text on the image) -- a form of hallucination.

joakleaf commented on So you want to parse a PDF?   eliot-jones.com/2025/8/pd... · Posted by u/UglyToad
petesergeant · 5 months ago
> instead of just using the "quality implementation" to actually get structured data out?

I suggest spending a few minutes using a PDF editor program with some real-world PDFs, or even just copying and pasting text from a range of different PDFs. These files are made up of cute-tricks and hacks that whatever produced them used to make something that visually works. The high-quality implementations just put the pixels where they're told to. The underlying "structured data" is a lie.

EDIT: I see from further down the thread that your experience of PDFs comes from programmatically generated invoice templates, which may explain why you think this way.

joakleaf · 5 months ago
We do a lot of parsing of PDFs and basically break the structure into 'letter with font at position (box)' because the "structure" within the PDF is unreliable.

We have algorithms that combine the individual letters into words, words into lines, and lines into boxes, all by looking at it geometrically. This obviously includes identifying the spaces between words.
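For the last step (lines to boxes), a purely geometric grouping could look roughly like this. It is a sketch, not our implementation: the `Line` record, the function name, and the `max_gap` threshold are invented for the example, and y is assumed to grow downward:

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    x0: float      # left edge, in points
    x1: float      # right edge, in points
    top: float     # top edge, in points (y assumed to grow downward)
    bottom: float  # bottom edge, in points


def group_into_blocks(lines, max_gap=4.0):
    """Attach each line to a block whose last line overlaps it horizontally
    and sits at most max_gap points above it; otherwise start a new block."""
    blocks = []
    for line in sorted(lines, key=lambda l: l.top):
        for block in blocks:
            last = block[-1]
            overlaps = line.x0 < last.x1 and last.x0 < line.x1
            close = -max_gap <= line.top - last.bottom <= max_gap
            if overlaps and close:
                block.append(line)
                break
        else:
            # No existing block accepted the line.
            blocks.append([line])
    return blocks
```

The horizontal-overlap test is what keeps neighbouring columns apart, and the vertical gap is what separates paragraphs within a column.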

We handle hidden text and problematic glyph-to-unicode tables.

The output is similar to OCR, except we don't do the rasterization, and quality is higher because we don't depend on vision-based text recognition.

I made the base implementation of all this in less than a month 10 years ago, and we rarely, if ever, touch it.

We do machine learning afterwards on the structure output too.

joakleaf commented on So you want to parse a PDF?   eliot-jones.com/2025/8/pd... · Posted by u/UglyToad
throwaway4496 · 5 months ago
I do PDF for a living, millions of PDFs per month, this is complete nonsense. There is no way you get better results from rastering and OCR than rendering into XML or other structured data.

joakleaf · 5 months ago
I sort of agree... I do the same.

We also parse millions of PDFs per month in all kinds of languages (both Western and Asian alphabets).

Getting the basics of PDF parsing to work is really not that complicated -- a few months' work. And it is an order of magnitude more efficient than generating an image at 300-600 DPI and doing OCR or using a visual LLM.

But some of the challenges (which we have solved) are:

• Glyph-to-unicode tables are often limited or incorrect.
• "Boxing" blocks of text into "paragraphs" can be tricky.
• Handling extra spaces and missing spaces between letters and words. Often PDFs do not include the spaces, or they are incorrect, so you need to identify gaps yourself.
• Often graphic designers of magazines/newspapers will hide text behind e.g. a simple white rectangle, and place a new version of the text above. So you need to keep track of z-order and ignore hidden text.
• Common text can be embedded as vector paths -- not just logos, we also see it with regular text. So you need a way to handle that.
• Dropcaps and similar "artistic" choices can be a bit painful.
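To illustrate the hidden-text point, here is a minimal sketch of a z-order filter. The `Box` and `Item` records and the function names are hypothetical, and it assumes every rectangle is an opaque fill and that a glyph only counts as hidden when it is fully covered:

```python
from dataclasses import dataclass


@dataclass
class Box:
    x0: float
    y0: float
    x1: float
    y1: float


@dataclass
class Item:
    kind: str       # "glyph" or "rect" (rects assumed to be opaque fills)
    box: Box
    z: int          # paint order; a higher value is drawn later, i.e. on top
    char: str = ""


def covers(outer, inner):
    """True if outer fully contains inner."""
    return (outer.x0 <= inner.x0 and outer.y0 <= inner.y0
            and outer.x1 >= inner.x1 and outer.y1 >= inner.y1)


def visible_glyphs(items):
    """Drop glyphs that are fully covered by a rectangle painted after them."""
    rects = [i for i in items if i.kind == "rect"]
    return [
        g for g in items
        if g.kind == "glyph"
        and not any(r.z > g.z and covers(r.box, g.box) for r in rects)
    ]
```

Real pages also hide text with images, clipping paths, and partially overlapping shapes, which this sketch ignores.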

There are a lot of other smaller issues -- but they are generally edge cases.

OCR handles some of these issues for you. But we found that OCR often misidentifies letters (all major OCR programs do), and they are certainly not perfect with spaces either. So if you are going for quality, you can get better results if you parse the PDFs.

Visual Transformers are not good with accurate coordinates/boxing yet -- at least we haven't seen a good enough implementation, even though it is getting better.

joakleaf commented on Tesla has yet to start testing its robotaxi without driver weeks before launch   electrek.co/2025/05/14/te... · Posted by u/TheAlchemist
jansan · 7 months ago
There aren't even many autonomous trains, although you would think this is the easiest to do. In Tokyo the Yurikamome started operating about 30 years ago (it is fun to ride at night on the front seats when Odaiba is almost empty), and in Germany there are two autonomous subway lines in Nürnberg. I wonder why transition is taking so long, considering that the shortage of train drivers will become much worse very soon.

joakleaf · 7 months ago
Copenhagen's metro started with 2 lines in 2002 (now 4 lines).

There are plans for more autonomy on the city's commuter trains (non-metro) in 2030-2037, starting with the first line in 2030/2031.

u/joakleaf

Karma: 1090 · Cake day: November 2, 2011