itamarst · 3 years ago
There's just a huge amount of waste in many cases which is very easy to fix. For example, if we have a list of fractions (0.0-1.0):

* Python list of N Python floats: 32×N bytes (approximate, the Python float is 24 bytes + 8-byte pointer for each item in the list)

* NumPy array of N double floats: 8×N bytes

* Hey, we don't need that much precision, let's use 32-bit floats in NumPy: 4×N bytes

* Actually, values of 0-100 are good enough, let's just use uint8 in NumPy and divide by 100 if necessary to get the fraction: N bytes

And now we're down to 3% of original memory usage, and quite possibly with no meaningful impact on the application.

(See e.g. https://pythonspeed.com/articles/python-integers-memory/ and https://pythonspeed.com/articles/pandas-reduce-memory-lossy/ for longer prose versions that approximate the above.)
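
Concretely, a rough sketch of the progression above (assuming 64-bit CPython and NumPy; the uint8 step is the lossy one):

  import sys
  import numpy as np

  N = 1_000_000
  fractions = [i / N for i in range(N)]        # values in 0.0-1.0

  # Python list of Python floats: an 8-byte pointer per slot plus a 24-byte float object
  list_bytes = sys.getsizeof(fractions) + sum(map(sys.getsizeof, fractions))

  f64 = np.array(fractions)                    # 8 bytes per value
  f32 = f64.astype(np.float32)                 # 4 bytes per value
  u8 = np.round(f64 * 100).astype(np.uint8)    # 1 byte per value, stored as 0-100

  print(list_bytes, f64.nbytes, f32.nbytes, u8.nbytes)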

justinlloyd · 3 years ago
I was hired by Sony at one point to help optimize a piece of video editing software that the outsourced team had created and could not figure out how to make a) go faster and b) use less RAM. This was back when 16 GB was at the upper end of what workstation-class machines could handle.

The desktop application, written in an early, circa-2004 version of C# and .NET, regularly brought the workstations to their knees, thrashing memory and pegging the CPU whenever large 1080p images were loaded in or moved around.

Each RGBA image, stored as a PNG on the HDD, was loaded in, and each 32-bit RGBA value was unpacked into its R, G, B, and alpha components, each held in its own 32-bit unsigned integer, which was then boxed (native value type into an object) and appended to an ArrayList allocated dynamically at read time. Deep copies were made of each image any time one of them was resized or manipulated, keeping the original untouched image in RAM in case it was needed. Another copy of the image, taken before each transformation, was stored on the Undo stack. And the 1080p working surface of the screen was super-sampled at 8x resolution to support defringing of images when layered. All of it stored as 8-bit RGBA components in boxed 32-bit integers in dynamically allocated ArrayLists.

eckza · 3 years ago
Don't leave us hanging - what did you _do?_ Did you fix this?
deckard1 · 3 years ago
Interesting. Python doesn't use tagged pointers? I would think most dynamic languages would store immediate char/float/int values in a single tagged 32-bit/64-bit word. That's some crazy overhead.
nneonneo · 3 years ago
Absolutely everything in CPython is a PyObject, and that can’t be changed without breaking the C API. A PyObject contains (among other things) a type pointer, a reference count, and a data field; none of these things can be changed without (again) breaking the C API.

There have definitely been attempts to modernize; the HPy project (https://hpyproject.org/), for instance, moves towards a handle-oriented API that keeps implementation details private and thus enables certain optimizations.

acdha · 3 years ago
This has been talked about for years but I believe it's still complicated by C API compatibility. The most recent discussion I see is here:

https://github.com/faster-cpython/ideas/discussions/138

Victor Stinner's experiment showed some performance regressions, too:

https://github.com/vstinner/cpython/pull/6#issuecomment-6561...

justinlloyd · 3 years ago
I love Python as a language, and all its packages, and have been using it since the late 90's, but Python's legacy decisions are one step away from leaving the language face down in a dirty ditch after an all-night bender.

ikiris · 3 years ago
I mean, it's Python. If you're expecting it to be efficient, I don't really know what to tell you.
adamsmith143 · 3 years ago
Ok now I have 100s of columns. I should do this for every single one in every single dataset I have?
itamarst · 3 years ago
It takes like 5 minutes, and once you're in the habit it's something you do automatically as you write the code, so it doesn't actually cost you extra time.

Efficient representation should be something you build into your data model; it will save you time in the long run.

(Also, if you have 100s of columns you're hopefully already benefiting from something like NumPy or Arrow or whatever, so you're already doing better than you otherwise would be... )
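
For what it's worth, shrinking lots of columns doesn't have to be manual either. A rough sketch with pandas, assuming a DataFrame df (the integer downcast is lossless; the float downcast to float32 trades precision for space):

  import numpy as np
  import pandas as pd

  def shrink(df: pd.DataFrame) -> pd.DataFrame:
      out = df.copy()
      for col in out.columns:
          if np.issubdtype(out[col].dtype, np.integer):
              # e.g. int64 -> int8/int16/int32, whichever the values allow
              out[col] = pd.to_numeric(out[col], downcast="integer")
          elif np.issubdtype(out[col].dtype, np.floating):
              # float64 -> float32 (lossy, in the spirit of the original suggestion)
              out[col] = pd.to_numeric(out[col], downcast="float")
      return out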

dvfjsdhgfv · 3 years ago
You'll need to decide on a case-by-case basis. Many datasets I work with are generated by machines, come from network cards etc. - these are quite consistent. Occasionally I deal with datasets prepared by humans, and these are mediocre at best; in those cases I spend a lot of time cleaning them up. Once that's done, I can clearly see whether some columns can be stored in a more efficient way or not. If the dataset is large, I do it, because it gives me extra freedom if I can fit everything in RAM. If it's small, I don't bother; my time is more expensive than the potential gains.
nomel · 3 years ago
Assuming your data is not ephemeral and you have some way to ingest it from a full-precision data store, why not?

Store at full precision, process at fractional precision, a story as old as time.

staticassertion · 3 years ago
Yes?

shapefrog · 3 years ago
> which is very easy to fix

go on amazon and buy another stick of RAM

BLanen · 3 years ago
You're describing operations done on data in memory to save memory. That list of fractions still needs to be in memory at some point. And if you're batching, this whole discussion goes out of the window.
rcoveson · 3 years ago
Why would the whole original dataset need to be in memory all at once to operate on it value-by-value and put it into an array?
forrestthewoods · 3 years ago
What an unhelpful post.

The realization that modern servers can easily persist multiple terabytes of data is profound.

The fact that some datasets are just floats and you can quantize some floats from 32-bits down to 8-bits is true but not a helpful observation.

I also don’t know where you get “Python float is 24 bytes + 8-byte pointer for each item in the list”. Wat.

tomatotomato37 · 3 years ago
There's still a difference between terabytes of data spread across RAM sticks multiple inches away from the CPU die and the megabytes of cache data close enough to the compute silicon to experience quantum effects. A modern CPU can burn through hundreds of instructions in the time it takes to resolve a miss from a thrashing cache.
minitech · 3 years ago
> I also don’t know where you get “Python float is 24 bytes + 8-byte pointer for each item in the list”. Wat.

Not sure how many ways there are to reword that. A CPython float takes 24 bytes of memory, and storing them in a list means 8 bytes per item for the pointer. So in CPython, a list of n floats takes 32n bytes of memory.

  >>> sys.getsizeof(1.0)
  24
  >>> l = list(map(float, range(62_500_000)))  # memory use goes up by >2 GB
  >>> del l  # memory use goes down by >2 GB
(no need to go straight to NumPy to avoid this when relevant, though – array.array is built in.)
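For example (a rough sketch; array.array stores plain C doubles at 8 bytes each, about a quarter of the list-of-floats footprint):

  >>> import array
  >>> a = array.array('d', range(62_500_000))  # memory use goes up by ~500 MB
  >>> a.itemsize
  8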

marcinzm · 3 years ago
We went with this approach. Pandas hit GIL limits which made it too slow. Then we moved to Dask and hit GIL limits on the scheduler process. Then we moved to Spark and hit JVM GC slowdowns on the amount of allocated memory. Then we burned it all down and became hermits in the woods.
mumblemumble · 3 years ago
I have decided that all solutions to questions of scale fall into one of two general categories. Either you can spend all your money on computers, or you can spend all your money on C/C++/Rust/Cython/Fortran/whatever developers.

There's one severely under-appreciated factor that favors the first option: computers are commodities that can be acquired very quickly. Almost instantaneously if you're in the cloud. Skilled lower-level programmers are very definitely not commodities, and growing your pool of them can easily take months or years.

jbverschoor · 3 years ago
Buying hardware won't give you the same performance benefits as a better implementation/architecture.

And if the problem is big enough, buying hardware will cause operational problems, so you'll need more people. And most likely you're not gonna wanna spend on people, so you get a bunch of people who won't fix the problem, but buy more hardware.

hinkley · 3 years ago
On the other hand, you can't argue with a computer. That may explain why some of my coworkers seem to behave as if they wish they were computers...

It's too difficult to renegotiate with computers, too easy to renegotiate with people. When you don't actually know what you need to do, you need people. When you think you know what you need to do, but you're wrong, then you really need people. Most of us are in the latter category, most of the time.

thrwyoilarticle · 3 years ago
Then you discover - too late to change your mind - that you're throwing hardware at an NP algo
nomel · 3 years ago
At some point, you just mmap shared system memory, read-only, giving direct global access to your dataset, like the good old days (or any embedded system).
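
A minimal sketch of that in Python (the path and dtype are made up; numpy.memmap pages the file in lazily, and every process mapping it shares the OS page cache):

  import numpy as np

  # Map the file read-only; nothing is read from disk until pages are touched.
  data = np.memmap("/data/fractions.f32", dtype=np.float32, mode="r")
  print(data[:10].mean())
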
mritchie712 · 3 years ago
Did you consider ClickHouse? Joins are slow, but if your data is in a single table, it works really well.
marcinzm · 3 years ago
We were trying to keep everything on one machine in (mostly) memory for simplicity. Once you open up the Pandora's box of distributed compute there are a lot of options, including other ways of running Spark. But yes, in retrospect, we should have opened that box first.
blub · 3 years ago
Have you tried vaex? It is a pandas-like Python library that uses C++ underneath and memory mapping, and optimizes its memory access patterns. It's allegedly very fast at least up to 1 TB; I've used it for 10-15 GB.
rbanffy · 3 years ago
> Then we burned it all down and became hermits in the woods.

Did the results of the calculations drive that decision?

louwrentius · 3 years ago
The original site made by lukegb inspired me because of the down-to-earth simplicity. Scaling vertically is often so much easier and better in so many dimensions than creating a complex distributed computing setup.

This is why I recreated the site when it went down quite a while ago.

The recent article "Use One Big Server"[0] inspired me to (re)submit this website to HN because it addresses the same topic. I like this new article so much because in this day and age of the cloud, people tend to forget how insanely fast and powerful modern servers have become.

And if you don't have the budget for new equipment, the second-hand stuff from a few years back is still beyond amazing and the prices are very reasonable compared to cloud costs. Sure, running bare metal co-located somewhere has its own costs, but it's not that big of a deal and many issues can be dealt with using 'remote hands' services.

To be fair, the article admits that in the end it's really about your organisation's specific circumstances and thus your requirements. Physical servers and/or vertical scaling may not (always) be the right answer. That said, do yourself a favour and at least take this option seriously. You can even run an experiment: buy some second-hand gear just to gain some hardware experience if you don't have it already, and do a trial in a co-location.

While we're at it: yourdatafitsinram.net runs on a Raspberry Pi 4, which in turn runs on solar power.[1] (The blog and this site are both running on the same host.)

[0]: https://news.ycombinator.com/item?id=32319147

[1]: https://louwrentius.com/this-blog-is-now-running-on-solar-po...

karamanolev · 3 years ago
> many issues can be dealt with using 'remote hands' services.

I have a few second-hand HP/Dell/Supermicro systems running colocated. I find that for all software issues, remote management / IPMI / KVM over IP is perfectly sufficient. Remote hands are needed only for actual hardware issues, most of which are "replace this component with an identical one". Usually an HDD, if you're running those. Overall, I'm quite happy with the setup and it's very high on the value/$ spectrum.

toast0 · 3 years ago
IPMI is nice, although the older you go, the more particular it gets. I had professional experience with the Supermicro Xeon E5-2600 series v1-v4, and recently started renting a previous-generation server[1], and it's worse than the ones I used before. It's still serviceable, though; but I'm not sure if it's using a dedicated LAN, because the KVM and the SOL drop out when the OS starts or ends; it'll come back, but you miss early boot messages.

It's definitely worth the effort to script starting the KVM, and maybe even the SOL. If you've got a bunch of servers, you should script the power management as well; if nothing else, you want to rate-limit power commands across your fleet to prevent accidental mass restarts. Intentional mass restarts can probably happen through the OS, so one power command per second across your fleet is probably fine. (You can always hack out the rate limit if you're really sure.)
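
A rough sketch of that kind of rate limiting (hosts.txt, the credentials, and the one-second pause are placeholders; the ipmitool invocation is the usual lanplus chassis power command):

  import subprocess
  import time

  hosts = open("hosts.txt").read().split()

  for host in hosts:
      # One power command per second across the fleet to avoid accidental mass restarts.
      subprocess.run(
          ["ipmitool", "-I", "lanplus", "-H", host,
           "-U", "admin", "-P", "changeme", "chassis", "power", "cycle"],
          check=True,
      )
      time.sleep(1)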

[1] I don't need a whole server, but for $30/month when I wanted to leave my VPS behind for a few reasons anyway...

louwrentius · 3 years ago
Yes, I bet a lot of people aren't even aware of the IPMI/KVM-over-IP capabilities that servers have had for decades, which make hardware management (manual or automated!) much easier.

Remote hands is for the inevitable hardware failure (Disk, PSU, Fan) or human error (you locked yourself out somehow remotely from IPMI).

P.S. I have an HP ProLiant DL380 G8 with 128 GB of memory and 20 physical cores as a lab system for playing with many virtual machines. I turn it on and off on demand using IPMI.

bob1029 · 3 years ago
This kind of realization that "yes, it probably will" has recently inspired me to hand-build various database engines wherein the entire working set lives in memory. I do realize others have worked on this idea too, but I always wanted to play with it myself.

My most recent prototypes use a hybrid mechanism that dramatically increases the supported working set size. Any property larger than a specific cutoff becomes a separate read operation against the durable log; for these properties, only the log's 64-bit offset is stored in memory. There is an alternative heuristic that allows the developer to add attributes signifying whether properties are to be maintained in-memory or permitted to be secondary lookups.

As a consequence, that 2TB worth of ram can properly track hundreds or even thousands of TB worth of effective data.

If you are using modern NVMe storage, those reads to disk are stupid-fast even in the worst case. There's still a really good chance you will get a hit in the IO cache if your application isn't ridiculous and has some predictable access patterns.
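
A toy sketch of that hybrid layout (not the actual engine; the class, the 64-byte cutoff, and the log format are invented for illustration): small values stay in the in-memory index, large ones are appended to a log and only their 8-byte offset is kept.

  import os

  CUTOFF = 64  # bytes; anything larger lives only in the durable log

  class HybridStore:
      def __init__(self, log_path):
          self.log = open(log_path, "a+b")
          self.index = {}  # key -> bytes (inline value) or int (offset into the log)

      def put(self, key, value: bytes):
          if len(value) <= CUTOFF:
              self.index[key] = value
          else:
              self.log.seek(0, os.SEEK_END)
              offset = self.log.tell()
              self.log.write(len(value).to_bytes(8, "little") + value)
              self.index[key] = offset  # 8 bytes of RAM tracks an arbitrarily large value

      def get(self, key) -> bytes:
          entry = self.index[key]
          if isinstance(entry, bytes):
              return entry
          self.log.seek(entry)
          size = int.from_bytes(self.log.read(8), "little")
          return self.log.read(size)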

saltcured · 3 years ago
I don't mean to discourage personal exploration in any way, but when doing this sort of thing it can also be illuminating to consider the null hypothesis... what happens if you let the conventional software use a similarly enlarged RAM budget or fast storage?

SQLite or PostgreSQL can be given some configuration/hints to be more aggressive about using RAM while still having their built-in capability to spill to storage rather than hit a hard limit. Or on Linux (at least), just allowing the OS page cache to sprawl over a large RAM system may make the IO so fast that the database doesn't need to worry about special RAM usage. For PostgreSQL, this can just be hints to the optimizer to adjust the cost model and consider random access to be cheaper when comparing possible query plans.

Once you do some sanity check benchmarks of different systems like that, you might find different bottlenecks than expected, and this might highlight new performance optimization quests you hadn't even considered before. :-)
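
For PostgreSQL, the hints being alluded to look roughly like this (illustrative postgresql.conf values for a large-RAM box, not recommendations; the setting names are standard):

  # postgresql.conf (illustrative values only)
  shared_buffers = 64GB          # let PostgreSQL itself cache far more pages
  effective_cache_size = 400GB   # tell the planner the OS page cache is huge
  work_mem = 1GB                 # per-sort/per-hash memory before spilling to disk
  random_page_cost = 1.1         # random IO on NVMe/cached data is nearly as cheap as sequential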

lmwnshn · 3 years ago
To add to this thread, some people have done this exploration before, see slides 11 and 20 of [0] or Figure 1 of [1]. Could you just throw more RAM at your problems in a disk-backed system? In practice, probably. But there are distinct advantages to designing upfront for an in-memory scenario.

[0] https://15721.courses.cs.cmu.edu/spring2020/slides/02-inmemo...

[1] https://15721.courses.cs.cmu.edu/spring2020/papers/02-inmemo...

bob1029 · 3 years ago
> what happens if you let the conventional software use a similarly enlarged RAM budget or fast storage?

Oh I absolutely have gone down this road as well.

The biggest thing for me is taking advantage of the other benefits you can get with in-memory working sets, such as arbitrary pointer machine representations.

When working with a traditional SQL engine (even one tuned for memory-only operation), there are many rules you have to play by or things will not go well.

louwrentius · 3 years ago
Extra anecdote:

Around 2000, a guy told me he had been asked to help with very significant performance issues on a server running a critical application. He quickly figured out that the server was running out of memory. Option one was to rewrite the application to use less memory. He chose option two: increase the server memory from 64 MB to 128 MB (yes, MB).

At that time, 128 MB was an ungodly amount of memory and memory was very expensive. But it was still cheaper to just throw RAM at the problem than to spend many hours rewriting the application.

navjack27 · 3 years ago
In 2000? 128 ungodly? What!
nomel · 3 years ago
The price curve was in sharp decline around that time, easily making "around 2000" (+/- 5 years) possible: https://jcmit.net/memoryprice.htm

For example, four years earlier, it would be >$3k. A year before that, >$6k.

This was an amazing, and terrible, time to be into computers.

none_to_remain · 3 years ago
Several years ago, my job at the time got a dev and a prod server with a terabyte of RAM. I liked the dev server because a few times I found myself thinking "this would be easy to debug if I had an insane amount of RAM" and then remembering that I did.
MathYouF · 3 years ago
What kind of things are easier to debug with lots of RAM and how would you do it?
none_to_remain · 3 years ago
Basically working with code manipulating a couple dozen GB of data and then keeping a couple dozen copies of that to examine it after various stages of manipulation.
AdamJacobMuller · 3 years ago
https://www.redbooks.ibm.com/redpapers/pdfs/redp5510.pdf

I want one of these.

A system with 1 TB of RAM is $133k, and $8.5M for a system with 64 TB of RAM?

chaxor · 3 years ago
Absolutely not. You can purchase a system with 1 TB of RAM and some decent CPUs, etc., for ~$25k. My lab just did this. That's far overpriced; $133k is closer to what you would spend if you used a machine with 1 TB "in the cloud".
didgetmaster · 3 years ago
I still remember the first advertisement I saw for 1 TB of disk space. I think it was about 1997, and the biggest individual drive you could buy was about 2 GB. The system was the size of a couple of server racks, and they put 500 of those disks in it. It cost over $1M for the whole system.
nimish · 3 years ago
That's insanely overpriced. A 128 GB LRDIMM is $1000, so a TB on a commodity board with 8 memory slots would be $8k, plus a few thousand for the CPU and chassis.
adgjlsfhk1 · 3 years ago
Note that when you start talking about multiple tens of TiB of RAM, you have to start buying super-high-density DIMMs, which are very expensive (because not many of them get made, and anyone who needs one has lots of money).
vlunkr · 3 years ago
Amazing. This has been the solution to Postgres issues for me: just add enough memory that everything, or at least everything that is accessed frequently, can fit in RAM. Suddenly everything is cached and fast.