So as far as I can tell, the biggest "bottleneck"/limiting factor with using FPGAs for LLMs is the available memory -- with current large models exceeding 40 GiB of parameters, GPUs and TPUs with large DRAM pools look like the only way forward for the months to come ... Thoughts?
An interesting twist is that this DRAM might not need to be a central pool where bandwidth must be shared globally -- e.g. the Tenstorrent strategy seems to be aiming for smaller chips that each have their own memory. Splitting up memory should yield very high aggregate bandwidth even with slower DRAM, which is great as long as they can figure out the cross-chip data flow to avoid networking bottlenecks.
That being said, the techniques discussed here are not totally irrelevant (yet). There is still hardware with fast float/int conversion instructions but no rsqrt, sqrt, pow, or log instructions, all of which can be approximated with this nice bit trick.
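For concreteness, here's a sketch of the best-known instance of that trick, the classic "fast inverse square root": reinterpreting a float's bits as an integer gives a cheap approximation of its log2, so integer shift-and-subtract against a magic constant approximates x^(-1/2), refined by one Newton step. (The function name and constants follow the well-known Quake III version; this is illustrative, not production code.)

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) using only integer ops on the float's bit pattern.
 * Works because the IEEE-754 bit pattern of a positive float is roughly a
 * scaled-and-offset log2(x), so (magic - bits/2) approximates -0.5*log2(x). */
static float fast_rsqrt(float x) {
    uint32_t i;
    float y;
    memcpy(&i, &x, sizeof i);           /* reinterpret float bits as integer */
    i = 0x5f3759dfu - (i >> 1);         /* shift/subtract ~ exponent arithmetic */
    memcpy(&y, &i, sizeof y);           /* back to float: rough 1/sqrt(x) */
    y = y * (1.5f - 0.5f * x * y * y);  /* one Newton-Raphson step (~0.2% error) */
    return y;
}
```

The same exponent-field arithmetic generalizes: sqrt is `bits/2 + magic`, and log/pow can be built from the raw bits-to-log2 mapping plus a polynomial correction, which is why hardware with only fast float/int moves can still do these cheaply.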