Readit News
Rietty commented on Ask HN: Who here is not working on web apps/server code?    · Posted by u/ex-aws-dude
jftuga · 2 months ago
> General day to day is creating jobs that will process large amounts of input data and storing them into Snowflake

About how long do these typically take to execute? Minute, Tens of Minutes, Hours?

My work is very iterative; the feedback loop is only a few minutes long.

Rietty · 2 months ago
Some of the largest are a few billion rows; we sample randomly when developing code, then execute it on the full dataset.
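That sample-then-full-run loop can be sketched in Python terms (the hash-based sampling scheme, names, and percentages below are invented for illustration; Snowflake can also do the sampling server-side with its SAMPLE clause, e.g. `SELECT * FROM t SAMPLE (5)`):

```python
import hashlib

def in_dev_sample(row_id: str, percent: float = 1.0) -> bool:
    """Deterministically keep ~percent% of rows by hashing their id,
    so repeated dev runs see the same sample."""
    h = int(hashlib.md5(row_id.encode()).hexdigest(), 16)
    return (h % 10_000) < percent * 100

def transform(row: dict) -> dict:
    # Stand-in for the real per-row processing.
    return {**row, "amount_cents": round(row["amount"] * 100)}

def run_job(rows, dev_mode: bool):
    # In dev mode, iterate only the reproducible sample; in prod, everything.
    selected = [r for r in rows
                if not dev_mode or in_dev_sample(r["id"], percent=5.0)]
    return [transform(r) for r in selected]
```

Hashing the id (rather than calling a random generator) makes the dev sample stable across runs, which keeps iteration reproducible.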
Rietty commented on Ask HN: Who here is not working on web apps/server code?    · Posted by u/ex-aws-dude
doom2 · 2 months ago
Hello fellow data engineer! I feel like I don't see a lot of us around / don't see many popular submissions dealing with data engineering. I also work with financial datasets (think aggregated consumer transaction data) for use by investors and corporate clients.
Rietty · 2 months ago
Many of my datasets are similar!
Rietty commented on Ask HN: Who here is not working on web apps/server code?    · Posted by u/ex-aws-dude
jftuga · 2 months ago
> General day to day is creating jobs that will process large amounts of input data and storing them into Snowflake

About how long do these typically take to execute? Minute, Tens of Minutes, Hours?

My work is very iterative; the feedback loop is only a few minutes long.

Rietty · 2 months ago
Depends on the dataset: anywhere from seconds to tens of minutes, depending on the preprocessing needed.
Rietty commented on Ask HN: Who here is not working on web apps/server code?    · Posted by u/ex-aws-dude
Rietty · 2 months ago
Working in a Data Engineering/Operations role which focuses heavily on financial datasets. Everything is within AWS and Snowflake, and each table can easily have >100M records of any type of random data (there is a lot of breadth). General day to day is creating jobs that process large amounts of input data and store the results in Snowflake, sending out tons of automated reports and emails to decision makers, as well as gathering more data from the web.

All of this is done in a Python environment, with Rust used to speed up critical code/computations. (The Rust code is delivered as Python modules.)

The work is interesting, and different challenges arise when having to process and compute datasets that are updated with tens of TBs of fresh data daily.
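The "Rust delivered as Python modules" pattern usually means a compiled extension (commonly built with PyO3/maturin) imported like any other module, often with a pure-Python fallback for environments where the wheel isn't available. A minimal sketch, where the module name `fast_etl` and function `normalize_amounts` are hypothetical:

```python
try:
    # Hypothetical compiled extension (Rust via PyO3), shipped as a wheel.
    from fast_etl import normalize_amounts  # type: ignore
except ImportError:
    # Pure-Python fallback with the same signature, used when the
    # compiled module is unavailable (and handy for debugging).
    def normalize_amounts(amounts: list[float]) -> list[int]:
        return [round(a * 100) for a in amounts]

def preprocess(batch: list[dict]) -> list[dict]:
    # Hot loop delegated to the (possibly native) function; the rest
    # of the job stays in plain Python.
    cents = normalize_amounts([row["amount"] for row in batch])
    return [{**row, "amount_cents": c} for row, c in zip(batch, cents)]
```

Keeping the fallback behaviorally identical to the native path means tests exercise the same logic either way.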

Rietty commented on High-income job losses are cooling housing demand   jbrec.com/insights/job-gr... · Posted by u/gmays
vel0city · 2 months ago
Where in the US would $140k become $70k after tax? Or do you mean after all other pre-tax adjustments such as insurance, 401ks, in addition to taxes?
Rietty · 2 months ago
According to a friend who lives in the tri-state area, that is what happens to them, but they max out their 401(k), have insurance, etc.
Rietty commented on x86 architecture 1 byte opcodes   sandpile.org/x86/opc_1.ht... · Posted by u/eklitzke
charcircuit · 3 months ago
>Not sure why you're being downvoted.

I downvote people when they say they don't know what something is when they could have used an LLM to explain it to them.

Rietty · 3 months ago
What if the LLM gives them bad information and they don't know it? I personally would rather just ask in a thread than risk bad LLM info.
Rietty commented on SIMD within a register: How I doubled hash table lookup performance   maltsev.space/blog/012-si... · Posted by u/axeluser
mizmar · 6 months ago
Something similar is used in Swiss tables - metadata bucket entries are a 1-bit occupancy marker and the 7 MSB bits of the hash (I don't remember how tombstones are represented). The metadata table is scanned first; the upper part of the hash should filter out most colliding entries, and the "false positives" lead to a probe in the entry table and a key comparison (possibly optimized with a full-hash comparison).

Buckets use 16 bytes because SSE2 and ARM NEON SIMD are basically guaranteed.

I was shocked to read how Swiss tables - proclaimed to be the fastest hash tables - work. It's just open-addressing linear probing with no technique to deal with collisions and clustering. Plus the initial version rounded hash % capacity to bucket size, thus using 4 fewer bits of hash, leading to even more collisions and clustering. Yet the super fast probe apparently made it a non-issue? Mind-boggling. (A later version allowed scanning from an arbitrary position by mirroring the first bucket as the last.)
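The control-byte scan described above can be illustrated with a SWAR (SIMD-within-a-register) sketch in Python. Real Swiss-table implementations use SSE2/NEON intrinsics on a 16-byte group; this only shows the idea of comparing all 16 metadata bytes against a 7-bit tag at once, using an exact zero-byte detector that avoids cross-byte carries:

```python
LANES = 16  # one control group: 16 one-byte metadata entries
LO = int.from_bytes(b"\x01" * LANES, "little")   # 0x0101...01
HI = int.from_bytes(b"\x80" * LANES, "little")   # 0x8080...80
LOW7 = LO * 0x7F                                  # 0x7F7F...7F

def match_tag(group: bytes, tag: int) -> int:
    """Return a bitmask with bit i set where group[i] == tag (tag is 7-bit)."""
    word = int.from_bytes(group, "little")
    x = word ^ (LO * tag)          # bytes equal to the tag become 0x00
    # (x & 0x7F..) + 0x7F.. sets each byte's high bit iff its low 7 bits
    # are nonzero; OR-ing x back in catches bytes that were exactly 0x80.
    nonzero = (((x & LOW7) + LOW7) | x) & HI
    zeros = nonzero ^ HI           # high bit set iff the byte was 0x00
    # Compress the per-byte high bits down to one bit per lane.
    return sum(1 << i for i in range(LANES) if (zeros >> (8 * i + 7)) & 1)
```

With SSE2 the same thing is a `_mm_cmpeq_epi8` plus `_mm_movemask_epi8`; the per-byte addition trick here exists only because Python integers have no fixed lanes.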

Rietty · 6 months ago
> Yet the super fast probe apparently made it a non-issue?

Could you explain how that makes it a non-issue? It seems counterintuitive to me that it solves the problem just by probing faster.

Rietty commented on Why I wrote the BEAM book   happihacking.com/blog/pos... · Posted by u/lawik
toast0 · 8 months ago
Hardware issues happen, but if you're lucky it's a simple failure and the box stops dead. Not much fun, but recovery can be quick and automated.

What's real trouble is when the hardware fault is like one of the 16 NIC queues stopping, so most connections work, but not all (depends on the hash of the 4-tuple), or some bit in the RAM failed and now you're hitting thousands of ECC correctable errors per second and your effective CPU capacity is down to 10% ... the system is now too slow to work properly, but manages to stay connected to dist and still attracts traffic it can't reasonably serve.

But OS/thread hangs are avoidable in my experience. If you run your beam system with very few OS processes, there's no reason for the OS to cause trouble.

But on the topic of a 15ms pause... it's likely that that pause is causally related to cascading pauses; it might be the beginning or the end or the middle. But when one thing slows down, others do too, and some processes can't recover when the backlog gets over a critical threshold, which is kind of unknowable without experiencing it. WhatsApp had a couple of hacks to deal with this.

A) Our gen_server aggregation framework used our hacky version of priority messages to let the worker determine the age of requests and drop them if they're too old.

B) We had a hack to drop all messages in a process's mailbox through the introspection facilities, and sometimes we automated that with cron. Very few processes can work through a mailbox with 1 million messages; dropping them all gets to recovery faster.

C) We tweaked garbage collection to run less often when the mailbox was very large. I think this is addressed by off-heap mailboxes now, but when GC looks through the mailbox every so many iterations and the mailbox is very large, it can drive an unrecoverable cycle, as eventually GC time limits throughput below accumulation and you'll never catch up.

D) We added process stats so we could see accumulation and drain rates, estimate time to drain (or whether the process won't drain), and built monitoring around that.
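The age-based drop in hack (A) can be sketched outside the BEAM. WhatsApp's version lived inside their Erlang gen_server framework; this Python sketch only shows the shape of the idea, with invented names and an assumed deadline:

```python
import time
from collections import deque

MAX_AGE_S = 5.0  # assumed: callers give up after ~5s, so older work is wasted

class Worker:
    def __init__(self):
        self.mailbox = deque()  # (enqueue_time, request) pairs

    def submit(self, request):
        # Timestamp on enqueue so the worker can later judge request age.
        self.mailbox.append((time.monotonic(), request))

    def drain(self, now=None):
        """Process the backlog, discarding requests nobody is still waiting on."""
        now = time.monotonic() if now is None else now
        handled, dropped = [], 0
        while self.mailbox:
            enqueued, req = self.mailbox.popleft()
            if now - enqueued > MAX_AGE_S:
                dropped += 1   # caller has already timed out; skip the work
            else:
                handled.append(req)
        return handled, dropped
```

The point is that once a backlog exceeds what can drain before callers time out, doing the old work only delays recovery; shedding it is what lets throughput catch up.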

Rietty · 8 months ago
> we had a hack to drop all messages in a process's mailbox through the introspection facilities and sometimes we automated that with cron...

What happens to the messages? Do they get processed at a slower rate, or on a subsystem that works in the background without more messages constantly being added? Or do you just nuke them from orbit and not care? That doesn't seem like a good idea to me, since it loses information. Would love to know more about this!

Rietty commented on Boeing Revokes Health Benefits for Striking Workers   commondreams.org/news/boe... · Posted by u/speckx
frankharv · a year ago
Should they be contributing to their 401K too???

Health Insurance is a benefit of employment.

You chose to unemploy yourself now go pay your own way.

What do you expect after they rejected a 30 percent raise.

They want their pension back and they ain't getting it.

Rietty · a year ago
Does striking mean you are unemployed, and if so, why? I mean, I understand that you're not gonna get paid if you strike... but it feels weird to also have other things affected? Then again, I don't know how the US works. (Genuine question)
Rietty commented on Today Microsoft Banned My Country Iran from Minecraft   old.reddit.com/r/GirlGame... · Posted by u/iosystem
fernandopj · 2 years ago
Minecraft is an online game, so in today's world, he didn't purchase the game, just a license to play it online. No server access, no game.
Rietty · 2 years ago
Minecraft is not an online game; it has a multiplayer option, sure... but it doesn't strictly need to be online. There has been a single-player mode since basically forever.

u/Rietty
Karma: 65 · Cake day: December 27, 2018
About: Just someone who likes to read HN. Visit my blog at https://rietty.com