Readit News
vient commented on Ask HN: Why hasn't x86 caught up with Apple M series?    · Posted by u/stephenheron
Chiikawa · 3 days ago
I also own a 155H laptop using Linux Mint! Would you share your settings for TLP and LPMD? I am not getting much longer battery life than on Windows 11 even after some tinkering, so seeing somebody else's setup may help a lot. Thanks!
vient · 2 days ago
Won't say I got much longer battery life, and even what I got may as well be explained as "TLP made energy profile management almost as good as on Windows, and then Windows's tendency to accumulate a bunch of junk processes sipping on your battery tipped the scales in Linux's favor". Also, I ended up switching back to Windows because of never-ending hardware issues with Linux: installing it on the 155H back in February 2024 was especially rough, and even 6 months later Bluetooth randomly stopped working after an Ubuntu update.

My TLP and LPMD configs: https://gist.github.com/vient/f8448d56c1191bf6280122e7389fc1...

TLP: don't remember the details now; as I recall, the scaling governor does not do anything on modern CPUs when an energy perf policy is used. CPU_MAX_PERF_ON_BAT=30 seems to be crucial for battery savings, sacrificing performance (not too much for everyday use, really) for joules in the battery. CPU_HWP_DYN_BOOST_ON_BAT=0 further prohibits using turbo on battery, just in case.
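For reference, the battery-related knobs mentioned here would look something like this in /etc/tlp.conf (a sketch of the settings named above, not the exact contents of the gist):

```ini
# /etc/tlp.conf (excerpt, hypothetical)
CPU_ENERGY_PERF_POLICY_ON_BAT=power
CPU_MAX_PERF_ON_BAT=30            # cap CPU at 30% of max performance on battery
CPU_BOOST_ON_BAT=0                # no turbo on battery
CPU_HWP_DYN_BOOST_ON_BAT=0        # no HWP dynamic boost either
```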

LPMD: again, did not use it much in the end, so not sure what is even written in this config. May need additional care to run alongside TLP.

Also, I used these boot parameters. For performance, I think the beneficial ones are *mitigations, nohz_full, rcu*:

    quiet splash sysrq_always_enabled=1 mitigations=off i915.mitigations=off transparent_hugepage=always iommu=pt intel_iommu=on nohz_full=all rcu_nocbs=all rcutree.enable_rcu_lazy=1 rcupdate.rcu_expedited=1 cryptomgr.notests no_timer_check noreplace-smp page_alloc.shuffle=1 tsc=reliable

vient commented on Ask HN: Why hasn't x86 caught up with Apple M series?    · Posted by u/stephenheron
aurareturn · 3 days ago

  Where are you getting M4 die sizes from?
M1 Pro is ~250mm2. M4 Pro likely increased in size a bit. So I estimated 300mm2. There are no official measurements but should be directionally correct.

  AMD's multicore passmark score is more than 40% higher.
It's an out-of-date benchmark that not even AMD endorses and the industry does not use. Meanwhile, AMD officially endorses Cinebench 2024 and Geekbench. Let's use those.

   The AMD is an older fab process and does not have P/E cores. What are you measuring?
Efficiency. Fab process does not account for the 3.65x efficiency deficit. N4 to N3 is roughly 20-25% more efficient at the same speed.
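Back-of-the-envelope, taking the node benefit at its upper bound (figures are the ones quoted above):

```python
# How much of the measured efficiency gap can the node difference explain?
efficiency_gap = 3.65   # M4 Pro vs Strix Halo perf/W ratio, as quoted
node_benefit = 1.25     # N4 -> N3: ~20-25% better at the same speed (upper bound)

unexplained = efficiency_gap / node_benefit
print(f"left unexplained by process: {unexplained:.2f}x")
```

Even crediting the newer node with the full ~25%, close to a 3x gap remains unexplained by fab process alone.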

  The P/E design choice gives different trade-offs e.g. AMD has much higher average single core perf.
Citation needed. Furthermore, macOS uses P cores for all the important tasks and E cores for background tasks. I fail to see how a higher average ST score, even if AMD has one, would translate to a better experience for users.

  14.8 TFLOPS vs. M4 Pro 9.2 TFLOPS.
TFLOPs are not the same between architectures.

  19% higher 3D Mark
Equal in 3DMark Wildlife, loses vs M4 Pro in Blender.

  34% higher GeekBench 6 OpenCL
OpenCL has long been deprecated on macOS. 105727 is the score for Metal, which is supported by macOS. 15% faster for M4 Pro.

The GPUs themselves are roughly equal. However, Strix Halo is still a bigger SoC.

vient · 3 days ago
> TFLOPs are not the same between architectures.

Shouldn't they be the same if we are speaking about the same precision? For example, [0] shows M4 Max at 17 TFLOPS FP32 vs MAX+ 395 at 29.7 TFLOPS FP32 - not sure what exact operation was measured, but at least it should be the same operation. Hard to make definitive statements without access to both machines.

[0] https://www.cpu-monkey.com/en/compare_cpu-apple_m4_max_16_cp...
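Those headline numbers can be roughly reproduced from public specs (core counts, ALU widths, and clocks below are assumptions taken from spec sheets, not measurements), which also illustrates why raw TFLOPS aren't directly comparable: RDNA3's figure is doubled by dual-issue, which real shader code rarely sustains.

```python
def fp32_tflops(units, lanes_per_unit, ops_per_lane, clock_ghz):
    """Theoretical FP32 throughput in TFLOPS (an FMA counts as 2 ops)."""
    return units * lanes_per_unit * ops_per_lane * clock_ghz / 1000.0

# M4 Max GPU: 40 cores x 128 FP32 ALUs, FMA = 2 ops/clock, ~1.66 GHz (assumed)
m4_max = fp32_tflops(40, 128, 2, 1.66)        # ~17.0

# Radeon 8060S (Strix Halo): 40 CUs x 64 lanes, FMA = 2 ops/clock,
# doubled again by RDNA3 dual-issue, ~2.9 GHz (assumed)
strix_halo = fp32_tflops(40, 64, 2 * 2, 2.9)  # ~29.7
```

So the 29.7 figure already bakes in an architectural multiplier that a per-operation comparison would not grant.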

vient commented on Ask HN: Why hasn't x86 caught up with Apple M series?    · Posted by u/stephenheron
gettingoverit · 3 days ago
My CPU is at over 5GHz, 1% load and 70C at the moment. That's in a "power-saving mode".

If nothing were wrong, it'd be at something like 1.5GHz with most of the cores unpowered.

vient · 3 days ago
Something is wrong with your power governor then. I have the opposite experience: I was able to tune Linux on a Core Ultra 155H laptop so it lasts longer than the Windows one. Needed kernel 6.11+ and TLP [0] with pretty aggressive energy-saving settings. Also played a bit with Intel LPMD [1] but did not notice much improvement.

[0] https://github.com/linrunner/TLP

[1] https://github.com/intel/intel-lpmd

vient commented on Ask HN: Why hasn't x86 caught up with Apple M series?    · Posted by u/stephenheron
aurareturn · 3 days ago
The reason I keep reposting this table is that people so often post incorrect statements about AMD/Apple, often with zero data backing them.

For Blender numbers, the M4 Pro figures came from Max Tech's review.[0] I don't remember where I got the Strix Halo numbers from. Could have been from another YouTube video or some old Notebookcheck article.

Anyway, Blender has official GPU benchmark numbers now:

M4 Pro: 2497 [1]

Strix Halo: 1304 [2]

So M4 Pro is roughly 90% faster in the latest Blender. The most likely reason Blender's official numbers favor the M4 Pro even more is more recent optimizations.

Sources:

[0]https://youtu.be/0aLg_a9yrZk?si=NKcx3cl0NVdn4bwk&t=325

[1] https://opendata.blender.org/devices/Apple%20M4%20Pro%20(GPU...

[2] https://opendata.blender.org/devices/AMD%20Radeon%208060S%20...

vient · 3 days ago
Weren't we comparing CPUs though? Those Blender benchmarks are for GPUs.

Here is M4 Max CPU https://opendata.blender.org/devices/Apple%20M4%20Max/ - median score 475

Ryzen MAX+ PRO 395 shows a median score of 448 (can't link because the site does not seem to cope well with + or / in product names)

Resulting in the M4 Max winning by about 6%
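For reference, the margin works out as:

```python
m4_max_cpu = 475       # Blender Open Data median score, M4 Max CPU
ryzen_max_395 = 448    # Blender Open Data median score, Ryzen AI MAX+ PRO 395

advantage = (m4_max_cpu / ryzen_max_395 - 1) * 100
print(f"M4 Max leads by {advantage:.0f}%")
```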

vient commented on The first Media over QUIC CDN: Cloudflare   moq.dev/blog/first-cdn/... · Posted by u/kixelated
joshcartme · 7 days ago
It reproduces for me in FF 142 on Windows. When I first went to https://cloudflare-quic.com/ it said HTTP/3, but after a few hard refreshes it says HTTP/2 and hasn't gone back to 3
vient · 7 days ago
Oh, I see - hard refresh consistently shows HTTP/2 but after one or two soft refreshes it becomes HTTP/3 for me until next hard refresh.

Edit: it is always second soft refresh for me that starts showing HTTP/3. Computers work in mysterious ways sometimes.

vient commented on The first Media over QUIC CDN: Cloudflare   moq.dev/blog/first-cdn/... · Posted by u/kixelated
brycewray · 7 days ago
Semi-related and just FYI for Firefox users who visit Cloudflare-hosted, HTTP/3-using sites:

https://bugzilla.mozilla.org/show_bug.cgi?id=1979683

vient · 7 days ago
Limited to macOS? Does not reproduce in FF 141 and 142 on Windows.
vient commented on Test Results for AMD Zen 5   agner.org/forum/viewtopic... · Posted by u/matt_d
ashvardanian · a month ago
> All vector units have full 512 bits capabilities except for memory writes. A 512-bit vector write instruction is executed as two 256-bit writes.

That sounds like a weird design choice. Curious if this will affect memcpy-heavy workloads.

Writes aside, Zen5 is taking much longer to roll out than I thought, and some of AMD's positioning is (almost expectedly) misleading, especially around AI.

AMD's website claims Zen5 is the "Leading CPU for AI" (<https://www.amd.com/en/products/processors/server/epyc/ai.ht...>), but I strongly doubt that. First, they compare Zen5 (9965), which is still largely unavailable, to Xeon2 (8280), a processor two generations older. Xeon4 is abundantly available and comes with AMX, a feature exclusive to Intel. I doubt AVX-512 support with a 512-bit physical path and even twice as many cores will be enough to compete with that (if we consider just the ALU throughput rather than the overall system & memory).

vient · a month ago
AMX is indeed a very strong feature for AI. I've compared a Ryzen 9950X with a w7-2495X using single-thread inference of some fp32/bf16 neural networks, and while Zen 5 is clearly better than Zen 4, the Xeon is still a lot faster, even considering that its frequency is almost 1GHz lower.

Now, if we say "Zen5 is the leading consumer CPU for AI" then no objections can be made; consumer Intel models do not even support AVX-512.

Also, note that for inference they compare with Xeon 8592+ which is the top Emerald Rapids model. Not sure if comparison with Granite Rapids would have been more appropriate but they surely dodged the AMX bullet by testing FP32 precision instead of BF16.

vient commented on Transition to using 16 KB page sizes for Android apps and games   android-developers.google... · Posted by u/ingve
danudey · a month ago
It can be an issue of behavior; for example, Redis recommended disabling transparent huge page support in Linux because of (among other things?) copy-on-write memory page behaviors, and still does if you're going to persist data to disk.

1. You have a redis instance with e.g. 1GB of mapped memory in one 1GB huge page

2. Redis forks a copy of itself when it tries to persist data to disk so it can avoid having to lock the entire dataset for writes

3. The new Redis process does anything to modify any of the data anywhere in that 1GB

4. The OS has to now allocate a new 1GB page and copy the entire data set over

5. Oops, we're under memory pressure! Better page out 1GB of data to the paging file, or flush 1GB of data from the filesystem cache, so that I can allocate this 1GB page for the next 200ms.

You could imagine how memory allocators that try to be intelligent about what they're allocating and how much in order to optimize performance might care; when a custom allocator is trying to allocate many small pages and keep them in a pool so it can re-use them without having to request new pages from the OS, getting 100x 2M pages instead of 100x 4k pages is a colossal waste of memory and (potentially) performance.

It's not necessarily that the allocators will break or behave in weird, incorrect ways (they may) but often that the allocators will say "I can't work under these conditions!" (or will work but sub-optimally).
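The cost asymmetry in step 4 can be sketched with some illustrative arithmetic (sizes assumed, not measured): after fork(), the kernel copies a whole page on the first write to it, so the copied volume scales with page size rather than with the bytes actually modified.

```python
def cow_copied(pages_touched, page_size):
    """Bytes the kernel copies on first write after fork() (copy-on-write)."""
    return pages_touched * page_size

KB, GB = 1024, 1024 ** 3

# 1000 scattered small writes into 1 GB of mapped memory after fork():
with_4k_pages = cow_copied(pages_touched=1000, page_size=4 * KB)  # ~4 MB copied
with_huge_page = cow_copied(pages_touched=1, page_size=1 * GB)    # the whole 1 GB
ratio = with_huge_page / with_4k_pages
```

With 4 KiB pages the kernel copies about 4 MB; with one 1 GB huge page it copies the full 1 GB, roughly 260x more, which is the memory-pressure spike described above.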

vient · a month ago
True, but that has nothing to do with tagged pointers.
vient commented on Transition to using 16 KB page sizes for Android apps and games   android-developers.google... · Posted by u/ingve
majke · a month ago
A lot of software won't work if you do that. Many JITs and memory allocators have opinions on page size. Also, tagged pointers are very common.
vient · a month ago
Memory page size should be transparent to tagged pointers (any pointers, really); I don't see how they can be affected. You have an object at address 0xAB0BA: does the size of the underlying page matter?
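A minimal sketch of why (values illustrative): tag bits typically come from the low bits freed by allocation alignment, or from unused high bits of the virtual address, and neither depends on the page size.

```python
ALIGNMENT = 16            # typical allocator alignment -> 4 spare low bits
TAG_MASK = ALIGNMENT - 1  # 0xF

def tag(ptr, t):
    assert ptr % ALIGNMENT == 0 and 0 <= t <= TAG_MASK
    return ptr | t

def untag(tagged):
    return tagged & ~TAG_MASK, tagged & TAG_MASK

addr = 0xAB0B0            # a 16-byte-aligned object address
tagged = tag(addr, 0x3)   # smuggle a 2-bit type tag into the low bits
ptr, t = untag(tagged)
# note: the page size (4 KiB vs 16 KiB) never appears in any of this
```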

u/vient

Karma: 258 · Cake day: October 20, 2020

About: reverse engineer