For example, if you're running experiments in one big batch overnight, making that faster doesn't seem very helpful. But with a big enough improvement, you can now run several batches of experiments during the day, which is much more productive.
But when I do design code, dataflow, and memory layout in a SIMD-friendly way, it's pretty much the same for x86_64 and ARM.
Then I can just use `a + b` and `f32x4` (or its C equivalent) instead of `_mm_add_ps` and `__m128` (x86_64) or `vaddq_f32` and `float32x4_t` (ARM).
Portable SIMD means I don't need to write this code twice and memorize arcane runes for basic arithmetic operations.
For more specialized stuff you have intrinsics.
However, we still end up writing different code for different target SoCs: the microarchitectures differ, and we want to maximize throughput and take advantage of any ISA support for dedicated instructions or types.
One big challenge is targeting in-order cores: the compiler often does a terrible job of register allocation (we need pretty much all the architectural registers to cover vector instruction latencies), so the model breaks down somewhat there and we have to drop to inline assembly.
Reasoning about performance can be hard anyway, though.
For the uninitiated, most high-performance CPUs of recent years:
- Are massively out-of-order: the core will run any operation whose inputs are all ready in the next available slot of the right type.
- Have multiple functional units. A recent Apple CPU can and will run 5+ integer ops, 3+ loads/stores, and 3+ floating-point ops per cycle if it can feed them all, and it may do register renames on the fly for free.
- Have pipelined functional units: you can feed one op into the front of a pipe each cycle, but the result may not be available for consumption until 3-20 cycles later (latency depends on the type of op and whether it can bypass into the next op).
- Speculate on branch outcomes, and when they get one wrong they must flush the pipeline and re-execute down the correct path.
- Are subject to assorted hazards that can add or subtract cycles relative to what the same code gets in a different situation.