Has anyone experimented with this yet? I'd like to know how this is resolved when the architecture doesn't support Intel's SIMD approach; the objects map pretty closely to the instructions (SIMD.float32x4.sub and the like).
I'm trying to figure out what happens when you port this to ARM NEON, and how you handle architectures that don't support NEON (it's often missing on Marvell and Allwinner chips).
I'm a Mozilla engineer involved in this. NEON support is very important and we're designing the spec to support it well.
CPUs that lack SIMD units can still support the functionality (though not the performance, of course), and there's even a polyfill library that lowers this API into scalar operations for SIMD-less browsers (see https://www.dartlang.org/articles/simd/ for the analogous design in Dart).
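A minimal sketch of what such a scalar fallback could look like, assuming the draft API shape used elsewhere in this thread (lane access via .x/.y/.z/.w); this is illustrative only, not the actual polyfill library's implementation:

```javascript
// Hypothetical scalar fallback for a 4-lane float32 vector type.
// Only defined when no native SIMD object exists.
if (typeof SIMD === 'undefined') {
  var SIMD = {};
  // Each lane is rounded to float32 precision with Math.fround,
  // matching what a real hardware lane would hold.
  SIMD.float32x4 = function (x, y, z, w) {
    return { x: Math.fround(x), y: Math.fround(y),
             z: Math.fround(z), w: Math.fround(w) };
  };
  // Lane-wise operations lowered to four scalar operations each.
  SIMD.float32x4.add = function (a, b) {
    return SIMD.float32x4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
  };
  SIMD.float32x4.mul = function (a, b) {
    return SIMD.float32x4(a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w);
  };
}
```

Code written against the vector API keeps working unchanged; it just runs four scalar operations per vector operation.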
It would be great if the engine could detect SIMD-able operations in classic JS (e.g. in loops) and use SIMD to execute them. I think adding low-level features to a high-level language is bad practice.
The primitives are pretty generic, just a few new vector types based on typed arrays. Operations on those types still work on CPUs without a SIMD unit; they're just slower, though no slower than writing the same code with non-SIMD operations.
What about 8- and 16-bit ints? How about signed vs. unsigned? Or what about pixel-like data that clamps instead of overflowing? What about 64-bit IEEE floats? What if the SIMD unit is 64 bits wide? Or 256? It just doesn't seem future-proof or robust against varying implementations.
> architectures that don't support NEON (it's often missing on Marvell and Allwinner chips)
I'm probably nitpicking here, but:
* All Allwinner SoCs have NEON (http://linux-sunxi.org/Allwinner_SoC_Family)
* Most current ARMv7 processors have NEON. Of the current ARM cores, only Cortex-A5 and Cortex-A9 don't have mandatory NEON support (it's optional). Cortex-A5 is intended for embedded applications. Of the existing Cortex-A9 processors, AFAIK the only somewhat popular one without NEON support is NVIDIA Tegra 2, which is retired. Out of the third party cores, all Qualcomm and Apple ones have NEON support.
It happens in the browser, right? I think ultimately there just needs to be a unified API, or maybe more domain-specific APIs (like BLAS), that map to NEON or SSE instructions where available and fall back to a slow path where they aren't.
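In practice that kind of dispatch can be done once at load time. A sketch, assuming the draft 4-lane float32x4 API discussed in this thread (the function names here are illustrative):

```javascript
// Build a "multiply every element by factor" function, choosing the
// SIMD path if the API exists and a scalar path otherwise.
function makeScale(factor) {
  var useSIMD = typeof SIMD !== 'undefined' && SIMD.float32x4;
  if (useSIMD) {
    var f = SIMD.float32x4(factor, factor, factor, factor);
    return function (arr) {
      var out = new Float32Array(arr.length);
      var i = 0, n = arr.length - (arr.length % 4);
      // Main loop: four lanes at a time.
      for (; i < n; i += 4) {
        var v = SIMD.float32x4(arr[i], arr[i + 1], arr[i + 2], arr[i + 3]);
        var r = SIMD.float32x4.mul(v, f);
        out[i] = r.x; out[i + 1] = r.y; out[i + 2] = r.z; out[i + 3] = r.w;
      }
      // Scalar tail for the remaining 0-3 elements.
      for (; i < arr.length; i++) out[i] = arr[i] * factor;
      return out;
    };
  }
  // Slow path: same result, one element at a time.
  return function (arr) {
    var out = new Float32Array(arr.length);
    for (var i = 0; i < arr.length; i++) out[i] = arr[i] * factor;
    return out;
  };
}
```

Callers never see which path was picked; both return the same Float32Array.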
The SIMD.float32x4 and SIMD.int32x4 classes are available in Firefox Nightly, but without Float32x4Array and Int32x4Array, loads and stores are horribly slow: about 100x slower than normal JavaScript in my tests.
I think their effort would be much more useful if they focused on WebCL. It is already standardized (unlike their "SIMD" object). A CPU implementation of WebCL that utilizes SIMD would probably offer much better performance than any current JavaScript engine.
The WebCL community is geared towards exposing platform OpenCL backed by the GPU; that would be stealing their fire.
Though it might still be the smartest thing to do, given the poor state and lack of recent progress of GPU OpenCL drivers. Even desktop apps that would like to use OpenCL are just barely limping along. See, e.g., Blender.
I agree. Even a CPU implementation of WebCL could be much faster than the fastest JS engine (because it is low-level C code). There are currently several very good OpenCL implementations running on the CPU.
Stuff like this, and asm.js, and WebKit's crazy LLVM-backed FTL optimizations, all lead me to think Native Client will end up being looked at as a transitional, niche tech: you keep tuning the JS engine until it's not unacceptably slower than native for asm.js-type code, and you keep hooking webpages up with more and more native capabilities and code (graphics, SIMD, crypto/compression), until eventually very few NaCl use cases are left. I don't see other vendors getting on the NaCl bandwagon, because they don't want to depend on Google's code and it's a lot of work to implement, so maybe NaCl ends up remembered as the toolchain some companies used to port some apps to Chrome OS, and that's mostly it. Sort of a shame; I _like_ NaCl, I just don't see a great path for it compared to iterating on existing technologies.
Given that Nitro is being converted to LLVM bitcode and NaCl was using it as well, it seems like a good idea to declare LLVM bitcode/assembly the web's standard language; then any tool that generates LLVM bitcode could be used with browsers. This would open up a lot of scope for other languages too, bringing the choice of languages the server side enjoys to the client side as well.
Threads... Native Client has threads; JavaScript does not. I'm writing a scheduler for Emscripten that supports pthreads, but it's crazy, because execution is still single-threaded and the output no longer looks like readable JavaScript the way it used to. Bending over backwards. ...and don't get me started on web workers.
Please do start; I'm curious why web workers don't fit the need. I have fairly basic knowledge of web workers and not much practice with them, but I'd like to know, if possible, what their limitations are.
> It will, however, run (...) on the platforms that support SIMD. This includes both the client platforms (...) as well as servers that run JavaScript, for example through the Node.js V8 engine.

...and:

> A major part of the SIMD.JS API implementation has already landed in Firefox Nightly and our full implementation of the SIMD API for Intel Architecture has been submitted to Chromium for review.

...and:

> Google, Intel, and Mozilla are working on a TC39 ECMAScript proposal to include this JavaScript SIMD API in the future ES7 version of the JavaScript standard.

So, yes, there's definitely an intention to put it into V8/Node.js/ES7 (my guess: in that exact order).
Considering that the architecture of Node.js is not suited to compute-heavy tasks, I wonder what kind of code you'd want to optimize with SIMD in Node.js?
Any improvement is still welcome, of course. There are a number of people/entities building desktop apps on Node, and those do tend to have compute-heavy tasks; a developer whose piece of code runs in a mean of 10 seconds would also welcome an optimisation that brings it under that.
Not knowing much, I think it'll be interesting to see how general-purpose applications would benefit from SIMD when it's accessed from a higher level. Does that mean that if I want to loop through 103 items and run arithmetic operations on them, I'd have to do the following? (Let's say I'm multiplying each item in items[] by 2, and items.length % 4 !== 0.)
var results = [],
    len = items.length,
    mod = len % 4,
    b = SIMD.float32x4(2, 2, 2, 2),
    a, c, i;

// Handle the remainder (len % 4 items) with plain scalar code first.
for (i = 0; i < mod; i++) {
    results.push(items[i] * 2);
}

// Process the rest four at a time: pack, multiply, unpack.
for (; i < len; i += 4) {
    a = SIMD.float32x4(items[i], items[i + 1], items[i + 2], items[i + 3]);
    c = SIMD.float32x4.mul(a, b);
    results.push(c.x, c.y, c.z, c.w);
}
Of course, this is the interpretation of a non-CS graduate who taught himself JS; some of the stuff mentioned at https://01.org/node/1495 seems a bit over my head. It'd be great if V8 would (unless it already does) transparently generate SIMD-optimised code when one is looping through an array or the like.
The architecture is suited just fine; it all depends on how you decide to use Node.js. If you're using it for handling lots of asynchronous IO (e.g. for a webserver), then it's probably unwise to mix in lots of blocking compute-heavy tasks on the main thread. But that's not the only way to use Node.js.
For example, we have a media library that implements HTML5 canvas2d, WebGL, WebAudio, video/image/sound encode and decode and a bunch of other media related compute-heavy tasks. We've done this as a native module because it lets us control all those APIs using the same code we would use in the browser, but we run that code in Node.js as a separate worker process where we don't care about things like IO latency. It works great.
I was doing research on parallel computing last summer. Does anybody know if this SIMD object is similar to the ParallelArray object Intel made in River Trail? Or are there any similarities between the two APIs?
Yet another parallel framework. Without a third-party ecosystem of APIs for matrix math, any framework is doomed to just add noise, not value. Sure, there's some benefit in getting a marginal speedup on some algorithms, but for real speedup you need to know the parallel architecture of the processor (GPU, CPU, or APU), which means a learning curve. The GPGPU industry has long been trying to abstract away the fine details and offer a plug-and-play, easy-to-learn framework, but then we suffer performance losses, and it really doesn't make sense to invest in GPUs for the kind of performance gains you get with these high-level APIs.
Read the release: this is a collaboration with Google and Mozilla. But you are right, one of the main reasons CUDA is so popular is cuBLAS. And it is a big pipe dream that you could program a GPU without being aware of communication and memory-transfer behavior.
Won't we have HSA in the future? HSA is supposed to provide unified coherent memory access to both CPU and GPU. Do you think HSA is a pipe dream? If so why?
It's not parallel, a framework, or a GPU feature. It's single-instruction-multiple-data (SIMD) which is used to speed up single threaded execution on a CPU when working with lists of numbers.
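To make that distinction concrete: a scalar loop issues one operation per element, while a single 4-lane SIMD operation does the same work in one instruction. A sketch of the scalar side, with the vector equivalent noted in comments (using the draft API shape discussed in this thread):

```javascript
// Scalar: four separate adds, one per loop iteration.
function add4Scalar(a, b) {
  var out = [0, 0, 0, 0];
  for (var i = 0; i < 4; i++) {
    out[i] = a[i] + b[i]; // one add per element
  }
  return out;
}

// SIMD equivalent, one instruction for all four lanes (per the draft API):
//   var r = SIMD.float32x4.add(SIMD.float32x4(a[0], a[1], a[2], a[3]),
//                              SIMD.float32x4(b[0], b[1], b[2], b[3]));
```

Same single thread either way; the win is purely in how many elements each instruction touches.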
MPEG-1 in JavaScript is pretty fast and runs at native speeds on phones:
https://github.com/phoboslab/jsmpeg
but MPEG-4/H.264 is not:
https://github.com/mbebenita/Broadway
In any case, the two are not mutually exclusive, different people are working on each.
BTW, GPUs use SIMD too :)
I'm developing an MMO server in Node.js, so this is welcome for, for example, vector calculations.
The reason I chose JS is that I can write "dumbed down" code that just about anyone with JS experience can manage.
Hopefully, JS will be just as fast as optimized C++ in the near future... first they ignore you, then they laugh at you...
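For what it's worth, the vector work a game server does tends to reduce to loops like the one below. Packing entity state into flat typed arrays (the layout here is hypothetical) is exactly what would let a SIMD version process four components per lane group later:

```javascript
// Entity state packed into flat Float32Arrays: [x0, y0, x1, y1, ...].
// A SIMD version could load four consecutive components into one
// float32x4 and update them with a single multiply-add per group.
function integrate(positions, velocities, dt) {
  for (var i = 0; i < positions.length; i++) {
    positions[i] += velocities[i] * dt; // scalar form of the same update
  }
  return positions;
}
```

The scalar and SIMD forms give identical results; only the throughput differs.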
The author of FFTS, for example, chose a different strategy on ARM than on x86_64: http://anthonix.com/ffts/preprints/tsp2013.pdf
He found himself writing the NEON code in assembly entirely by hand, because the vector intrinsics didn't even expose the CPU features he wanted to use; and even in C, vector intrinsics are CPU-specific.
Having access to SIMD is definitely better than not having it, but it really should be paired with good optimized implementations of things like BLAS and FFT libraries.