You have to stop the leak into side channels in the first place; it's simply not practical to prevent secrets from escaping once they're already in a side channel. This is, unfortunately, the much harder problem, with much worse performance implications (and indeed the reason why Spectre v1 is still almost entirely unmitigated).
ldr x2, [x2]     /* if this load misses, the branch below stays unresolved for a long time */
cbnz x2, skip    /* predicted not taken, so the code below runs ahead speculatively */
/* bunch of slow operations */
/* probe loads: these only get launched if the speculation window is long, i.e. x2 missed */
ldr x1, [x1]
add x1, x1, CACHE_STRIDE
ldr x1, [x1]
add x1, x1, CACHE_STRIDE
ldr x1, [x1]
add x1, x1, CACHE_STRIDE
ldr x1, [x1]
add x1, x1, CACHE_STRIDE
skip:
Here, if the branch condition is predicted not taken and ldr x2 misses in the cache, the CPU will speculatively execute long enough to launch the four other loads. If x2 is in the cache, the branch condition will resolve before we execute the loads. This gives us a 4x signal amplification using absolutely no external timing, just exploiting the fact that misses lead to longer speculative windows.
After repeating this procedure enough times and amplifying your signal, you can then directly measure how long it takes to load all of these amplified lines (no mispredicted branches required!). Simply start the clock, load each line one by one in a for loop, and then stop the clock.
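For concreteness, here's a minimal sketch of that final measurement step in C, assuming the amplified probe lines sit CACHE_STRIDE bytes apart in a probe buffer. The names (probe, NUM_LINES, now_ns) and the clock_gettime-based timer are mine, not part of the gadget above:

  #include <stdint.h>
  #include <stddef.h>
  #include <time.h>

  #define CACHE_STRIDE 4096
  #define NUM_LINES    4

  static uint64_t now_ns(void) {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
  }

  /* Time how long it takes to touch every amplified line: a large value means
     the lines were not cached, i.e. the speculative loads never ran. */
  static uint64_t probe_all(volatile uint8_t *probe) {
      uint64_t start = now_ns();              /* start the clock           */
      for (size_t i = 0; i < NUM_LINES; i++)  /* load each line one by one */
          (void)probe[i * CACHE_STRIDE];
      return now_ns() - start;                /* stop the clock            */
  }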
As I mentioned earlier, unless your plan is to treat every hit as a miss to DRAM, you can't hide this information.
The current sentiment around Spectre mitigations is that once information has leaked into side channels, you can't do anything to stop attackers from extracting it. There are simply too many ways to expose uarch state (and caches are not the only side channel!). Instead, your best and only bet is to prevent important information from leaking in the first place.
This is not to mention the fact that you can use transient execution itself (without any side channels) to amplify a single cache line being present/not present into >100ms of latency difference. Unless your plan is to burn 100ms of compute time to hide such an issue (nobody is going to buy your core in that case), you can't solve this problem like this.
You are not reading it correctly. It is not code as everyone knows it. It's like an electrical circuit with variable names attached to each conductor, and the code propagates information like electricity would.
There are tools dedicated to this that can draw pictures of such code circuits (e.g. Simulink, ASCET), and those pictures can be automatically translated into C code that looks even worse than anything translated by hand.
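A hypothetical sketch of the flavor of C such generators tend to emit (all names here are invented for illustration): every wire in the diagram becomes a variable, and a step function propagates values block by block once per cycle.

  /* Invented example: one "wire" variable per block output. */
  typedef struct {
      double Sum1_out;         /* wire leaving the Sum1 block        */
      double Gain2_out;        /* wire leaving the Gain2 block       */
      double Saturation1_out;  /* wire leaving the Saturation1 block */
  } Signals;

  static Signals sig;

  /* Called once per cycle: push the input through the "circuit". */
  void model_step(double sensor_in) {
      sig.Sum1_out        = sensor_in + 0.5;                /* Sum1        */
      sig.Gain2_out       = sig.Sum1_out * 3.7;             /* Gain2       */
      sig.Saturation1_out = sig.Gain2_out > 100.0 ? 100.0   /* Saturation1 */
                                                  : sig.Gain2_out;
  }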
In the end, the tests of course prove that the code behaves exactly as the picture of the circuit shows, and therefore the car must work correctly! This spares anyone working only on the code from having to understand the car.
In reality, things usually work in the end only because everything is kept so simple and because of the sheer number of iterations.
This only appears true because we've made our entire world so safe that we can call one fatality every 100 million miles dangerous. Given everything we use cars for, their immense utility, and the fact that they're driven almost exclusively by amateurs ... cars are remarkably safe.
And if you could find a way to reliably remove the 1% that cause most of the problems, it would be even safer.
https://www.reddit.com/r/RISCV/comments/z6xzu0/multi_core_im...
When I got my ECE degree in 1999, I was so excited to start an open source project for at least a 256+ core (MIPS?) processor in VHDL on an FPGA to compete with GPUs so I could mess with stuff like genetic algorithms. I felt at the time that too much emphasis was being placed on manual layout, when even then, tools like Mentor Graphics, Cadence and Synopsys could synthesize layouts that were 80% as dense as what humans could come up with (sorry if I'm mixing terms, I'm rusty).
Unfortunately the Dot Bomb, 9/11 and outsourcing pretty much gutted R&D and I felt discouraged from working on such things. But supply chain issues and GPU price hikes for crypto have revealed that it's maybe not wise to rely on the status quo anymore. Here's a figure that shows just how far behind CPUs have fallen since Dennard scaling ended when smartphones arrived in 2007 and cost/power became the priority over performance:
https://www.researchgate.net/figure/The-Dennard-scaling-fail...
FPGA performance on embarrassingly parallel tasks scales linearly with the number of transistors, so it more closely approaches the top line in that figure.
I did a quick search and found these intros:
https://www.youtube.com/watch?v=gJno9TloDj8
https://www.hackster.io/pablotrujillojuan/creating-a-risc-v-...
https://en.wikipedia.org/wiki/Field-programmable_gate_array
https://www.napatech.com/road-to-fpga-reconfigurable-computi...
Looks like the timeline for the largest FPGAs went:
2000: 100-500 million transistors
2010: 3-5 billion transistors
2020: 50-100 billion transistors
https://www.umc.com/en/News/press_release/Content/technology...
https://www.design-reuse.com/news/27611/xilinx-virtex-7-2000...
https://www.hpcwire.com/off-the-wire/xilinx-announces-genera...
I did a quick search on Digi-Key, and it looks like FPGAs are overpriced by a factor of about 10-100, with prices as high as $10,000. Since most of the patents have probably expired by now, that would be a huge opportunity for someone like Micron to use Inflation Reduction Act money and introduce a 100+ billion transistor, 1 GHz FPGA at a similar price to something like an Intel i9, say $500 or less.
Looks like about 75 transistors per gate, so I'm mainly interested in how many transistors it takes to make a 32- or 64-bit ALU, and how many for SRAM or DRAM. I'm envisioning an 8x8 array of RISC-V cores, each with perhaps 64 MB of memory for 4 GB total. That would compete with Apple's M1, but with no special heterogeneous computing hardware, so we could get back to generic multicore desktop programming and not have to deal with proprietary GPU drivers and the function-coloring problems around CPU code vs shaders.
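A rough back-of-envelope for that transistor budget, using assumed (not sourced) rules of thumb: ~30 transistors per full-adder bit for the adder core of an ALU, 6 transistors per SRAM bit, and 1 transistor (plus a capacitor) per DRAM bit:

  #include <stdio.h>
  #include <stdint.h>

  int main(void) {
      const uint64_t cores    = 8 * 8;                    /* 8x8 array of cores      */
      const uint64_t mem_bits = 64ull * 1024 * 1024 * 8;  /* 64 MB per core, in bits */

      const uint64_t adder64  = 64 * 30;       /* ~30T per full-adder bit (adder only) */
      const uint64_t sram     = mem_bits * 6;  /* 6T SRAM cell                         */
      const uint64_t dram     = mem_bits * 1;  /* 1T1C DRAM cell                       */

      printf("64-bit adder   : ~%llu transistors\n", (unsigned long long)adder64);
      printf("64 MB as SRAM  : ~%.1f billion transistors per core\n", sram / 1e9);
      printf("64 MB as DRAM  : ~%.1f billion transistors per core\n", dram / 1e9);
      printf("64 cores, SRAM : ~%.0f billion transistors\n", cores * sram / 1e9);
      printf("64 cores, DRAM : ~%.0f billion transistors\n", cores * dram / 1e9);
      return 0;
  }

Under those assumptions the on-chip memory dominates the budget: 6T SRAM for 64 MB per core would blow well past a 100-billion-transistor die, while DRAM-style cells would fit with room to spare (at the cost of needing a DRAM-capable process).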
But, for what it's worth, there do seem to be some practical reasons why your idea of a hugely parallel computer would not meaningfully rival the M1 (or any other modern processor). The issue everyone has struggled with for decades now is that lots of tasks are simply very difficult to parallelize. Hardware people would love to just give software N times more cores and have it go N times faster, but that's not how it works. The most famous formulation of this is Amdahl's Law [2]. So, for most programs people use today, 1024 tiny, slow cores may very well be significantly worse than the eight fast, wide cores you can get on an M1.
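To put a number on that: Amdahl's Law says speedup = 1 / ((1 - p) + p/n), where p is the fraction of the work that parallelizes and n is the core count. A quick sketch (the 95% parallel fraction is just an illustrative assumption):

  #include <stdio.h>

  /* Amdahl's Law: p = parallel fraction, n = number of cores. */
  static double amdahl(double p, double n) { return 1.0 / ((1.0 - p) + p / n); }

  int main(void) {
      double p = 0.95;  /* assume 95% of the work parallelizes */
      printf("8 cores    : %.1fx\n", amdahl(p, 8));     /* ~5.9x  */
      printf("64 cores   : %.1fx\n", amdahl(p, 64));    /* ~15.4x */
      printf("1024 cores : %.1fx\n", amdahl(p, 1024));  /* ~19.6x */
      return 0;
  }

Even with only 5% serial work, 1024 cores top out under a 20x speedup, which is why a few fast, wide cores often win for everyday software.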
[1] https://chipyard.readthedocs.io/en/stable/Chipyard-Basics/in...