For OpenAI, I'd assume that a GPU is dedicated to your task from the point you press enter to the point it finishes writing. I would think most of the 700 million users barely use ChatGPT, while a small proportion use it a lot and would likely need to pay due to the limits. Most of the time you have the website/app open, I'd think you're either reading what it has written, writing something, or it's just sitting open in the background, so ChatGPT isn't doing anything during that time. If we assume 20 queries a week taking 25 seconds each, that's 8.33 minutes a week. That would mean a single GPU could serve up to 1,209 users, so for 700 million users you'd need at least 578,703 GPUs. Sam Altman has said OpenAI is due to have over a million GPUs by the end of the year.
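For what it's worth, here is the back-of-envelope maths spelled out as a quick sketch; the 20 queries per week and 25 seconds per query are just the assumptions stated above, not measured figures:

```python
# Back-of-envelope version of the numbers above. The per-user query count and
# per-query GPU time are assumptions, not measurements.
SECONDS_PER_WEEK = 7 * 24 * 3600        # 604,800
queries_per_week = 20                   # assumed queries per user per week
seconds_per_query = 25                  # assumed GPU time per query

busy_seconds = queries_per_week * seconds_per_query     # 500 s, about 8.33 min
users_per_gpu = SECONDS_PER_WEEK / busy_seconds         # about 1209.6
gpus_needed = 700_000_000 / users_per_gpu               # about 578,704

print(f"GPU-busy time per user: {busy_seconds / 60:.2f} min/week")
print(f"Users per GPU: {users_per_gpu:.1f}")
print(f"GPUs for 700M users: {gpus_needed:,.0f}")
```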
I've found that inference speed on newer GPUs is barely faster than on older ones (perhaps it's memory-bandwidth limited?). They could be using older clusters of V100, A100 or even H100 GPUs for inference if they can get the model to fit on one GPU, or multiple GPUs if it doesn't. A100s were available in 40GB and 80GB versions.
I would think they use a queuing system to allocate your message to a GPU. Slurm is widely used in HPC clusters, so they might use that, though they have likely rolled their own system for inference.
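Just to illustrate the kind of allocation I mean (a toy sketch, not how OpenAI actually does it): requests go into a shared queue, and a worker per GPU pulls the next one whenever it's free.

```python
# Toy dispatcher: one worker thread per GPU pulls requests off a shared FIFO
# queue, so a GPU is only "dedicated" to a request while it is generating.
# Purely illustrative -- the GPU count and handle_request are made up.
import queue
import threading

requests = queue.Queue()

def handle_request(gpu_id, prompt):
    # Placeholder for running the model on this GPU and streaming the reply.
    print(f"GPU {gpu_id} handling: {prompt!r}")

def gpu_worker(gpu_id):
    while True:
        prompt = requests.get()        # blocks until a request is queued
        handle_request(gpu_id, prompt)
        requests.task_done()

for gpu_id in range(4):                # pretend we have 4 GPUs
    threading.Thread(target=gpu_worker, args=(gpu_id,), daemon=True).start()

for i in range(10):                    # simulate incoming user messages
    requests.put(f"user message {i}")

requests.join()                        # wait until everything has been served
```

In practice an inference scheduler would also batch many requests onto one GPU rather than serving them one at a time, but the queueing idea is the same.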
It's not clear, but I assume it sends commands to the actual film hardware and isn't doing any real-time control.
"The software shows a handful of controls for the projectionist to queue up the film and control the platters that feed film at six feet per second. " [0]
[0] https://www.extremetech.com/mobile/imax-using-20-year-old-pa...
[1] https://history-computer.com/palm-pilot-guide/
[2] https://www.zdnet.com/article/pocket-pc-sales-1-million-and-...
The latest Nvidia driver no longer supports the K40, so you'll have to use version 470 (or lower; officially Nvidia says 460, but 470 seems to work). That supports CUDA 11.4 natively, and newer CUDA 11.x versions are also supported (see https://docs.nvidia.com/deploy/cuda-compatibility/index.html), though CUDA 12 is not.
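If you want to double-check what a K40 box ended up with, something like this works (a rough sketch; it assumes a GPU build of TensorFlow and that nvidia-smi is on the PATH):

```python
# Quick sanity check of the driver / CUDA combination on an older GPU box.
# Assumes a GPU build of TensorFlow and that nvidia-smi is on the PATH.
import subprocess
import tensorflow as tf

# Driver version as reported by nvidia-smi (should be 470.x for the K40 setup above).
smi = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(smi.stdout.strip())

print("GPUs visible to TensorFlow:", tf.config.list_physical_devices("GPU"))
print("CUDA version TF was built against:",
      tf.sysconfig.get_build_info().get("cuda_version"))  # want 11.x, not 12.x
```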
In my testing, a system with a single RTX3060 was faster in TensorFlow than one with 3 K40s, and probably close to the performance of 4 K40s.
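To give an idea of the kind of comparison (not my exact script; the model and data sizes here are arbitrary), timing the same short training run on each box is enough to see the gap, with MirroredStrategy spreading the work over however many GPUs the machine has:

```python
# Rough timing comparison: train the same small model on whatever GPUs the
# machine has and compare wall-clock time per epoch between machines.
import time
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()   # uses every visible GPU
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(2048,)),
        tf.keras.layers.Dense(4096, activation="relu"),
        tf.keras.layers.Dense(4096, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Synthetic data; only the throughput matters for this comparison.
x = np.random.rand(20000, 2048).astype("float32")
y = np.random.randint(0, 10, size=(20000,))

start = time.time()
model.fit(x, y, batch_size=256, epochs=1, verbose=0)
print(f"One epoch took {time.time() - start:.1f} s")
```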
If you are considering other GPUs, there are some good benchmarks here (the RTX3060 isn't listed, though the GTX1080Ti had almost the same performance as the RTX3060 in the TensorFlow test they run): https://lambdalabs.com/gpu-benchmarks
As others have said, Google Colab is a free option you can use.
Back then, the whole professional world had switched to 64-bit, both from a performance and a memory-size perspective. That is why the dotcom era basically ran on Sun SPARC machines. The Itanium was way late, but it was still Intel's only offering in that domain, until x86-64 came along and very quickly entered the professional compute centers. The performance race in the consumer space then sealed the deal by providing faster CPUs than the classic RISC processors of the time, including Itanium.
It is a bit sad to see it go; I wonder how well the architecture would have performed on modern processes. After all, an iPhone has a much larger transistor count than those "large" and "hot" Itaniums.
Doesn't this have more to do with the daemon than the user executing commands?
Given that understanding, could someone please explain what "rootless" would mean? I want to understand these in simpler terms:)
(Thank you in advance)
At a highway rest stop, why would a fast charger that can charge 1 car in 6 minutes be any more expensive than a bank of slow chargers that can charge 10 cars in 60 minutes?