ZeroCool2u · 2 months ago
I've had to repeatedly tell our AWS account reps that we're not even a little interested in the Trainium or Inferentia instances unless they have a provably reliable track record of working with the standard libraries we have to use, like Transformers and PyTorch.

I know they claim they work, but that's only on their happy path with their very specific AMIs and the nightmare that is the Neuron SDK. You try to do any real work with them and use your own dependencies, and things tend to fall apart immediately.

It was just in the past couple years that it really became worthwhile to use TPUs if you're on GCP, and that's only because of the huge investment on Google's part into software support. I'm not going to sink hours and hours into beta testing AWS's software just to use their chips.
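
For anyone who hasn't touched it, here's a rough sketch of what the Neuron "happy path" for inference looks like (my own illustration, assuming the torch-neuronx package and a Neuron-capable instance; exact APIs and supported ops vary by Neuron SDK release). Instead of just calling the model, you compile the graph ahead of time for the NeuronCores, and any op or dependency the compiler doesn't handle is where things fall apart:

    import torch_neuronx  # AWS Neuron PyTorch integration (assumed installed)
    from transformers import AutoModel, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased").eval()

    inputs = tok("hello world", return_tensors="pt")
    example = (inputs["input_ids"], inputs["attention_mask"])

    # On a GPU this would just be model(*example); on Inferentia/Trainium the
    # graph is compiled ahead of time for the NeuronCores by the Neuron compiler.
    neuron_model = torch_neuronx.trace(model, example)
    print(neuron_model(*example))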

ecshafer · 2 months ago
IMO, once you get off the core services, AWS is full of beta services. S3, Dynamo, Lambda, ECS, etc. are all solid. But there are a lot of services they have that have some big rough patches.
jeffparsons · 2 months ago
RDS, Route53, and ElastiCache are decent, too. But yes, I've also been bitten badly in the distant past by attempting to rely on their higher-level services. I guess some things don't change.

I wonder if the difference is stuff they dogfood versus stuff they don't?

kentm · 2 months ago
I'd add SQS to the solid category.

But yes, the less a specific service is a core building block (or widely used internally at Amazon), the more likely you are to run into significant issues.

BOOSTERHIDROGEN · 2 months ago
Lightsail fortunately behaves like a core service.

Deleted Comment

weird-eye-issue · 2 months ago
True with Cloudflare too. Just stick with Workers, R2, Durable Objects, etc...
belter · 2 months ago
>But there are a lot of services they have that have some big rough patches.

Enlighten us...

nextworddev · 2 months ago
Kinesis is decent
hnlmorg · 2 months ago
This. 100 times this.
mountainriver · 2 months ago
Agree, Google put a ton of work into making TPUs usable with the ecosystem. Given Amazon’s track record I can’t imagine they would ever do that.
klysm · 2 months ago
There might be enough market pressure right now to make them think about it, but the stock price went up enough from just announcing it, so whatever.
htrp · 2 months ago
Spoiler alert: they don't work without a lot of custom code.
cmiles8 · 2 months ago
AWS keeps making grand statements about Trainium, but not a single customer comes on stage to say how amazing it is. Everyone I talked to who tried it said there were too many headaches, and they moved on. AWS pushes it hard, but “more price performant” isn’t a benefit if it’s a major PITA to deploy and run relative to other options. Chips without a quality developer experience aren’t gonna work.

Seems AWS is using this heavily internally, which makes sense, but I'm not seeing it get traction outside that. Glad to see Amazon investing there though.

phamilton · 2 months ago
The inf1/inf2 spot instances are so unpopular that they cost less than the equivalent CPU instances. Exact same (or better) hardware, but 10-20% cheaper.

We're not quite seeing that on the trn1 instances yet, so someone is using them.

kcb · 2 months ago
Heh, I was looking at an EKS cluster recently that was using the Cast AI autoscaler. I was scratching my head because there were a bunch of inf instances. Then I realized it must be cheap spot pricing.
giancarlostoro · 2 months ago
Not just AWS; it looks like Anthropic uses it heavily as well. I assume they get plenty of handholding from Amazon, though. I'm surprised any cloud provider does not invest drastically more into their SDK and tooling; nobody will use your cloud if they literally cannot.
cmiles8 · 2 months ago
Well AWS says Anthropic uses it but Anthropic isn’t exactly jumping up and down telling everyone how awesome it is, which tells you everything you need to know.

If Anthropic walked out on stage today and said how amazing it was and how they’re using it, the announcement would have a lot more weight. Instead… crickets from Anthropic in the keynote.

IshKebab · 2 months ago
> I'm surprised any cloud provider does not invest drastically more into their SDK and tooling

I used to work for an AI startup. This is where Nvidia's moat is - the tens of thousands of man-hours that have gone into making the entire AI ecosystem work well with Nvidia hardware and not much else.

It's not that they haven't thought of this; it's just that they don't want to hire another 1k engineers to do it.

logicchains · 2 months ago
>I'm surprised any cloud provider does not invest drastically more into their SDK and tooling, nobody will use your cloud if they literally cannot.

Building an efficient compiler from high-level ML code to a TPU is actually quite a difficult software engineering feat, and it's not clear that Amazon has the kind of engineering talent needed to build something like that. Not like Google, which has developed multiple compilers and language runtimes.

Deleted Comment

deepsquirrelnet · 2 months ago
Heavens to Betsy, I don’t know if you can hear me, but try supporting these things if you actually want them to be successful. You’re about three days into rolling your own LMI container in SageMaker because they haven’t updated the vLLM version in 6 months, and you can’t run a regular SageMaker endpoint because of a ridiculous 60s timeout that was determined to be adequate 8 years ago. I can only imagine the hell that awaits the developer who decides to try their custom silicon.
smilekzs · 2 months ago
Single-chip specs according to:

https://awsdocs-neuron.readthedocs-hosted.com/en/latest/abou...

https://awsdocs-neuron.readthedocs-hosted.com/en/latest/nki/...

Eight NeuronCore-v4 cores that collectively deliver:

    2,517 MXFP8/MXFP4 TFLOPS
    671 BF16/FP16/TF32 TFLOPS
    2,517 FP16/BF16/TF32 sparse TFLOPS
    183 FP32 TFLOPS
HBM: 144 GiB HBM @ 4.9 TB/sec (4 stacks)

SRAM: 32 MiB * 8 = 256 MiB (ignoring 2 MiB * 8 = 16 MiB of PSUM which is not really general-purpose nor DMA-able)

Interconnect: 2560 GB/s (I think bidirectional, i.e. Jensen Math™)

----

At a 3nm process node, the FLOP/s is _way_ lower than the competition's. Compare to the B200, which does 2250 BF16, x2 FP8, x4 FP4. TPU7x does 2307 BF16, x2 FP8 (no native FP4). HBM also lags behind (vs ~192 GiB in 6 stacks for both TPU7x and B200).

The main redeeming qualities seem to be the software-managed SRAM size (double that of TPU7x; GPUs have L2, so not directly comparable) and the on-paper raw interconnect BW (double that of TPU7x and more than the B200).
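
Some quick back-of-the-envelope math on those figures (my own arithmetic, using only the numbers quoted above):

    bf16_tflops = 671        # dense BF16 TFLOP/s per chip, from the list above
    hbm_tb_per_s = 4.9       # HBM bandwidth in TB/s

    # FLOPs per byte read from HBM needed to stay compute-bound (roofline ridge point)
    ridge = (bf16_tflops * 1e12) / (hbm_tb_per_s * 1e12)
    print(f"~{ridge:.0f} FLOP/byte to be compute-bound")          # ~137

    # Dense BF16 gap vs. the parts it's being compared against
    for name, tflops in [("B200", 2250), ("TPU7x", 2307)]:
        print(f"{name}: ~{tflops / bf16_tflops:.1f}x Trainium3 on dense BF16")   # ~3.4x each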

landl0rd · 2 months ago
Anyone considering using trainium should view this Completely Factual Infomercial: https://x.com/typedfemale/status/1945912359027114310

Pretty accurate in my experience, especially re: the Neuron SDK. Do not use.

jauntywundrkind · 2 months ago
Amazon aside, there's an interesting future here with NVLink getting more and more folks using it. Intel is also on board with NVLink. This is like a PCI -> AGP moment, but Nvidia's AGP.

AMD felt like they were so close to nabbing the accelerator future back in the HyperTransport days. But the recent version, Infinity Fabric, is all internal.

There's Ultra Accelerator Link (UALink) getting some steam. Hypothetically CXL should be good for uses like this: it uses the PCIe PHY but is lower latency and lighter weight, close to RAM latency, not bad! But it's still a mere PCIe speed, not nearly enough, with PCIe 6.0 just barely emerging now. Ideally, IMO, we'd also see more chips come with integrated networking: it was so amazing when Intel Xeons had 100Gb Omni-Path for barely any price bump. UltraEthernet feels like it should be on core, gratis.

wmf · 2 months ago
NVLink Fusion sounds like a total trap where you pay to become Jensen's slave. It may make sense for Intel because they're desperate. It's not a good look for AWS to put themselves in the same category.

> UltraEthernet feels like it should be on core, gratis.

I've been saying for a while that AMD should put a SolarFlare NIC in their I/O die. They already have switchable PCIe/SATA ports; why not switchable PCIe/Ethernet? UEC might be too niche though.

trebligdivad · 2 months ago
The block floating point (MXFP8/4) stuff is interesting; AI is really pushing basic data types that haven't moved for decades.

https://en.wikipedia.org/wiki/Block_floating_point
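
Roughly, the MX formats store a small block of values (32 elements in the OCP MX spec) with one shared power-of-two scale plus a low-precision element per value. A toy sketch of the idea (my own illustration, not the exact MXFP8 encoding):

    import numpy as np

    def block_quantize(x, block=32, mantissa_bits=7):
        """Quantize with one shared power-of-two scale per block of values."""
        xb = x.reshape(-1, block)
        # shared scale per block, chosen so the largest element fits in [-1, 1]
        scales = 2.0 ** np.ceil(np.log2(np.max(np.abs(xb), axis=1, keepdims=True) + 1e-30))
        steps = 2 ** mantissa_bits
        q = np.round(xb / scales * steps) / steps   # coarse per-element part
        return (q * scales).reshape(x.shape)        # dequantized approximation

    x = np.random.randn(1024).astype(np.float32)
    print("max abs error:", np.abs(block_quantize(x) - x).max())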

aaa_aaa · 2 months ago
Interesting that in the article, they do not say what the chip actually does. Not even once.
Symmetry · 2 months ago
A bunch of 128x128 systolic arrays at its heart. More details:

https://newsletter.semianalysis.com/p/amazons-ai-self-suffic...
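
For anyone unfamiliar with the term, here's a toy simulation of the dataflow in an output-stationary systolic array; it's purely illustrative (generic textbook dataflow, not Trainium's actual microarchitecture). Each PE holds one output element, operands march through the array one neighbor per cycle, and every PE does one multiply-accumulate per cycle:

    import numpy as np

    def systolic_matmul(A, B):
        """Output-stationary n x n systolic array computing C = A @ B."""
        n = A.shape[0]
        C = np.zeros((n, n))
        a_reg = np.zeros((n, n))   # A values flowing rightward through the array
        b_reg = np.zeros((n, n))   # B values flowing downward through the array
        for t in range(3 * n - 2):                # enough cycles to drain the pipeline
            # shift operands one PE per cycle (reversed order = read last cycle's values)
            for i in reversed(range(n)):
                for j in reversed(range(n)):
                    if j > 0:
                        a_reg[i, j] = a_reg[i, j - 1]
                    else:
                        a_reg[i, j] = A[i, t - i] if 0 <= t - i < n else 0.0
                    if i > 0:
                        b_reg[i, j] = b_reg[i - 1, j]
                    else:
                        b_reg[i, j] = B[t - j, j] if 0 <= t - j < n else 0.0
            C += a_reg * b_reg                    # every PE does one MAC per cycle
        return C

    A, B = np.random.randn(4, 4), np.random.randn(4, 4)
    assert np.allclose(systolic_matmul(A, B), A @ B)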

trebligdivad · 2 months ago
Nice article; I hate to think what the DC bus bars look like! ~50V at ~25kW/rack means 500A bus bars - I guess split, but still!
wmf · 2 months ago
Training. It's in the name.
cobolcomesback · 2 months ago
Ironically these chips are being targeted at inference as well (the AWS CEO acknowledged the difficulties in naming things during the announcement).
djmips · 2 months ago
The I stands for Inference then?

Deleted Comment

Kye · 2 months ago
Vector math
egorfine · 2 months ago
Probably because the only task this chip has to perform is to please shareholders, hence there is no need to explain anything to us peasant developers.
caminante · 2 months ago