Has anyone experimented with this yet? I'd like to know how this is resolved when the architecture doesn't support Intel's SIMD approach; the objects map pretty closely to the instructions (SIMD.float32x4.sub and the like).
I'm trying to figure out what happens when you port this to ARM NEON, and how you handle architectures that don't support NEON (it's often missing on Marvell and Allwinner chips).
I'm a Mozilla engineer involved in this. NEON support is very important and we're designing the spec to support it well.
CPUs that lack SIMD units can still support the functionality (though not the performance, of course), and there's even a polyfill library that lowers this API into scalar operations for SIMD-less browsers (see https://www.dartlang.org/articles/simd/ for the analogous design in Dart).
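A minimal sketch of what such a scalar fallback could look like, assuming the draft API shape used elsewhere in this thread (lane access via .x/.y/.z/.w); this is illustrative only, not the actual polyfill library's implementation:

```javascript
// Hypothetical scalar fallback for a 4-lane float32 vector type.
// Only defined when no native SIMD object exists.
if (typeof SIMD === 'undefined') {
  var SIMD = {};
  // Each lane is rounded to float32 precision with Math.fround,
  // matching what a real hardware lane would hold.
  SIMD.float32x4 = function (x, y, z, w) {
    return { x: Math.fround(x), y: Math.fround(y),
             z: Math.fround(z), w: Math.fround(w) };
  };
  // Lane-wise operations lowered to four scalar operations each.
  SIMD.float32x4.add = function (a, b) {
    return SIMD.float32x4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
  };
  SIMD.float32x4.mul = function (a, b) {
    return SIMD.float32x4(a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w);
  };
}
```

Code written against the vector API keeps working unchanged; it just runs four scalar operations per vector operation.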
It would be great if the engine could detect SIMD-able operations in classic JS (e.g. in loops) and use SIMD to execute them. I think adding low-level features to a high-level language is bad practice.
The primitives are pretty generic, just a few new vector types based on typed arrays. Operations on those types still work on CPUs without a SIMD unit; they're just slower, though no slower than writing the same code with non-SIMD operations.
What about 8- and 16-bit ints? How about signed vs. unsigned? Or what about pixel-like data that clamps instead of overflowing? What about 64-bit IEEE floats? What if the SIMD unit is 64 bits wide? Or 256? It just doesn't seem future-proof or robust against varying implementations.
> architectures that don't support NEON (it's often missing on Marvell and Allwinner chips)
I'm probably nitpicking here, but:
* All Allwinner SoCs have NEON (http://linux-sunxi.org/Allwinner_SoC_Family)
* Most current ARMv7 processors have NEON. Of the current ARM cores, only Cortex-A5 and Cortex-A9 don't have mandatory NEON support (it's optional). Cortex-A5 is intended for embedded applications. Of the existing Cortex-A9 processors, AFAIK the only somewhat popular one without NEON support is NVIDIA Tegra 2, which is retired. Out of the third party cores, all Qualcomm and Apple ones have NEON support.
It happens in the browser, right? I think ultimately there just needs to be a unified API, or maybe more domain-specific APIs (like BLAS), that map to NEON or SSE instructions where available and fall back to a slow path where they aren't.
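In practice that kind of dispatch can be done once at load time. A sketch, assuming the draft 4-lane float32x4 API discussed in this thread (the function names here are illustrative):

```javascript
// Build a "multiply every element by factor" function, choosing the
// SIMD path if the API exists and a scalar path otherwise.
function makeScale(factor) {
  var useSIMD = typeof SIMD !== 'undefined' && SIMD.float32x4;
  if (useSIMD) {
    var f = SIMD.float32x4(factor, factor, factor, factor);
    return function (arr) {
      var out = new Float32Array(arr.length);
      var i = 0, n = arr.length - (arr.length % 4);
      // Main loop: four lanes at a time.
      for (; i < n; i += 4) {
        var v = SIMD.float32x4(arr[i], arr[i + 1], arr[i + 2], arr[i + 3]);
        var r = SIMD.float32x4.mul(v, f);
        out[i] = r.x; out[i + 1] = r.y; out[i + 2] = r.z; out[i + 3] = r.w;
      }
      // Scalar tail for the remaining 0-3 elements.
      for (; i < arr.length; i++) out[i] = arr[i] * factor;
      return out;
    };
  }
  // Slow path: same result, one element at a time.
  return function (arr) {
    var out = new Float32Array(arr.length);
    for (var i = 0; i < arr.length; i++) out[i] = arr[i] * factor;
    return out;
  };
}
```

Callers never see which path was picked; both return the same Float32Array.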
The SIMD.float32x4 and SIMD.int32x4 classes are available in Firefox Nightly, but without Float32x4Array and Int32x4Array, loads and stores are horribly slow: about 100x slower than normal JavaScript in my tests.
I think their effort would be much more useful if they focused on WebCL. It is already standardized (unlike their "SIMD" object). A CPU implementation of WebCL that utilizes SIMD would probably offer much better performance than any current JavaScript engine.
The WebCL community is geared towards exposing platform OpenCL backed by the GPU; that would be stealing their fire.
Though it might still be the smartest thing to do, given the poor state and lack of recent progress of GPU OpenCL drivers. Even desktop apps that would like to use OpenCL are just barely limping along. See, e.g., Blender.
I agree. Even a CPU implementation of WebCL could be much faster than the fastest JS engine (because it is low-level C code). There are currently several very good OpenCL implementations running on the CPU.
Stuff like this, and asm.js, and WebKit's crazy LLVM-backed FTL optimizations, all lead me to think Native Client will end up being looked at as a transitional, niche tech: you keep tuning the JS engine until it's not unacceptably slower than native for asm.js-type code, and you keep hooking webpages up with more and more native capabilities and code (graphics, SIMD, crypto/compression), until eventually very few NaCl use cases are left. I don't see other vendors getting on the NaCl bandwagon, because they don't want to depend on Google's code and it's a lot of work to implement, so maybe NaCl ends up remembered as the toolchain some companies used to port some apps to Chrome OS, and that's mostly it. Sort of a shame; I _like_ NaCl, I just don't see a great path for it compared to iterating on existing technologies.
Given that Nitro is being converted to LLVM bitcode and NaCl was using it as well, it seems like a good idea to declare LLVM bitcode/assembly the web's standard language; then any tool that generates LLVM bitcode could be used with browsers. This would open up a lot of scope for other languages too, bringing the choice of languages the server side enjoys to the client side as well.
Threads... Native Client has threads; JavaScript does not. I'm writing a scheduler for Emscripten that supports pthreads, but it's crazy, because execution is still single-threaded and the output no longer looks like readable JavaScript the way it used to. Bending over backwards. ...and don't get me started on web workers.
Please do start; I'm curious why web workers don't fit the need. I have fairly basic knowledge of web workers and not much practice with them, but I'd like to know, if possible, what their limitations are.
> It will, however, run (...) on the platforms that support SIMD. This includes both the client platforms (...) as well as servers that run JavaScript, for example through the Node.js V8 engine.

...and:

> A major part of the SIMD.JS API implementation has already landed in Firefox Nightly and our full implementation of the SIMD API for Intel Architecture has been submitted to Chromium for review.

...and:

> Google, Intel, and Mozilla are working on a TC39 ECMAScript proposal to include this JavaScript SIMD API in the future ES7 version of the JavaScript standard.

So, yes, there's definitely an intention to put it into V8/Node.js/ES7 (my guess: in that exact order).
Considering that the architecture of Node.js is not suited to compute-heavy tasks, I wonder what kind of code you'd want to optimize with SIMD in Node.js?
Any improvement is still welcome, of course. There are a number of people/entities building desktop apps on Node, and those do tend to have compute-heavy tasks; a developer whose piece of code runs in a mean of 10 seconds would also welcome an optimisation that brings it under that.
Not knowing much, I think it'll be interesting to see how general-purpose applications would benefit from SIMD when it's accessed from a higher level. Does that mean that if I want to loop through 103 items and run arithmetic operations on them, I'd have to do the following? (Let's say I'm multiplying each item in items[] by 2, and items.length % 4 !== 0.)
var results = [],
    len = items.length,
    mod = len % 4,
    b = SIMD.float32x4(2, 2, 2, 2),
    a, c, i;

// Handle the remainder (len % 4 items) with plain scalar code first.
for (i = 0; i < mod; i++) {
    results.push(items[i] * 2);
}

// Process the rest four at a time: pack, multiply, unpack.
for (; i < len; i += 4) {
    a = SIMD.float32x4(items[i], items[i + 1], items[i + 2], items[i + 3]);
    c = SIMD.float32x4.mul(a, b);
    results.push(c.x, c.y, c.z, c.w);
}
Of course, this is the interpretation of a non-CS graduate who taught himself JS; some of the stuff mentioned at https://01.org/node/1495 seems a bit over my head. It'd be great if V8 would (unless it already does) transparently generate SIMD-optimised code when one is looping through an array or the like.
The architecture is suited just fine; it all depends on how you decide to use Node.js. If you're using it for handling lots of asynchronous IO (e.g. for a webserver), then it's probably unwise to mix in lots of blocking compute-heavy tasks on the main thread. But that's not the only way to use Node.js.
For example, we have a media library that implements HTML5 canvas2d, WebGL, WebAudio, video/image/sound encode and decode and a bunch of other media related compute-heavy tasks. We've done this as a native module because it lets us control all those APIs using the same code we would use in the browser, but we run that code in Node.js as a separate worker process where we don't care about things like IO latency. It works great.
I was doing research on parallel computing last summer. Does anybody know if this SIMD object is similar to the ParallelArray object Intel made in River Trail? Or are there any similarities between the two APIs?
Yet another parallel framework. Without a third-party ecosystem of APIs for matrix math, any framework is doomed to just add noise, not value. Sure, there's some benefit in getting a marginal speedup on some algorithms, but for real speedup you need to know the parallel architecture of the processor (GPU, CPU, or APU), which means a learning curve. The GPGPU industry has long been trying to abstract away the fine details and offer a plug-and-play, easy-to-learn framework, but then we suffer performance losses, and it really doesn't make sense to invest in GPUs for the kind of performance gains you get with these high-level APIs.
Read the release: this is a collaboration with Google and Mozilla. But you are right, one of the main reasons CUDA is so popular is cuBLAS. And it is a big pipe dream that you could program a GPU without being aware of communication and memory-transfer behavior.
Won't we have HSA in the future? HSA is supposed to provide unified coherent memory access to both CPU and GPU. Do you think HSA is a pipe dream? If so why?
It's not parallel, a framework, or a GPU feature. It's single-instruction-multiple-data (SIMD) which is used to speed up single threaded execution on a CPU when working with lists of numbers.
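To make that distinction concrete: a scalar loop issues one operation per element, while a single 4-lane SIMD operation does the same work in one instruction. A sketch of the scalar side, with the vector equivalent noted in comments (using the draft API shape discussed in this thread):

```javascript
// Scalar: four separate adds, one per loop iteration.
function add4Scalar(a, b) {
  var out = [0, 0, 0, 0];
  for (var i = 0; i < 4; i++) {
    out[i] = a[i] + b[i]; // one add per element
  }
  return out;
}

// SIMD equivalent, one instruction for all four lanes (per the draft API):
//   var r = SIMD.float32x4.add(SIMD.float32x4(a[0], a[1], a[2], a[3]),
//                              SIMD.float32x4(b[0], b[1], b[2], b[3]));
```

Same single thread either way; the win is purely in how many elements each instruction touches.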
MPEG-1 in JavaScript is pretty fast and runs at native speeds on phones:
https://github.com/phoboslab/jsmpeg
but MPEG-4/H.264 is not:
https://github.com/mbebenita/Broadway
In any case, the two are not mutually exclusive, different people are working on each.
BTW, GPUs use SIMD too :)
I'm developing an MMO server in Node.js, so this is welcome for, for example, vector calculations.
The reason I chose JS is that I can write "dumbed down" code that just about anyone with JS experience can manage.
Hopefully, JS will be just as fast as optimized C++ in the near future... first they ignore you, then they laugh at you...
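For what it's worth, the vector work a game server does tends to reduce to loops like the one below. Packing entity state into flat typed arrays (the layout here is hypothetical) is exactly what would let a SIMD version process four components per lane group later:

```javascript
// Entity state packed into flat Float32Arrays: [x0, y0, x1, y1, ...].
// A SIMD version could load four consecutive components into one
// float32x4 and update them with a single multiply-add per group.
function integrate(positions, velocities, dt) {
  for (var i = 0; i < positions.length; i++) {
    positions[i] += velocities[i] * dt; // scalar form of the same update
  }
  return positions;
}
```

The scalar and SIMD forms give identical results; only the throughput differs.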
The author of FFTS, for example, chose a different strategy on ARM than on x86_64: http://anthonix.com/ffts/preprints/tsp2013.pdf
He found himself writing the NEON code in assembly entirely by hand, because the vector intrinsics didn't even expose the CPU features he wanted to use; and even in C, vector intrinsics are CPU-specific.
Having access to SIMD is definitely better than not having it, but it really should be paired with good optimized implementations of things like BLAS and FFT libraries.