AFAIK people didn't take MongoDB seriously from the start, especially with the "web scale database" joke circulating. The Neo4j Community version has been under GPLv3 for quite some time, while the Enterprise version has always been somewhat closed, regardless of whether the source code was available on GitHub (the mentioned license change affected the Enterprise version).
Regarding CockroachDB, I must admit that I've only heard about it on HN and don't know anyone who uses it seriously. As for Kafka, there are two versions: Apache Kafka, the open-source version that almost everyone uses (under the Apache license), and Confluent Kafka, which is Apache Kafka enhanced with many additional features from Confluent; the license change affected only the Confluent version. In short, maybe the majority simply didn't care enough about these projects, so there was no major fork.
> It cannot be because the Redis and Elasticsearch install base is so much larger than these other systems, and therefore, there were more people upset by the change since the number of MongoDB and Kafka installations was equally as large when they switched their licenses.
I can’t speak for MongoDB, but the Confluent Kafka install base is significantly smaller than that of Apache Kafka, Redis and ES.
> Dana, Bohan, and I worked on this research project and startup for almost a decade. And now it is dead. I am disappointed at how a particular company treated us at the end, so they are forever banned from recruiting CMU-DB students. They know who they are and what they did.
Call me a skeptic, but I can't see this as a fair approach. If your company fails, for whatever reasons, you should not enlist the university department/group/students against your peers (I also can't find any indication that CMU-DB itself was one of the founders of OtterTune).
Wrt Andy, here [1] are some somewhat interesting views from (presumably) former employees.
[1] https://www.reddit.com/r/Database/comments/1dgaazw/comment/l...
As a student who chose to stay at CMU for a PhD because of this group, I see it as quite the opposite situation - you may also be misunderstanding the nature of the "ban" (students can still apply to the company directly).
From the student perspective, we benefit from knowing the reputation of potential employers. For example: CompanyX went back on their promises so don't trust them unless they give it to you right away, CompanyY has a culture of being stingy, the people who went to CompanyZ love it there, and so on.
So it's more like (1) providing additional data about the company's past behavior, and (2) not actively giving the company a platform. I personally find this great for students.
(VLDB Volume 17 no 2 is 2023)
> All papers published in this issue will be presented at the 50th International Conference on Very Large Data Bases, Guangzhou, China, 2024.
And that's because the submission guidelines [1] say:
> The last three revision deadlines will be May 15, June 1, and July 15, 2023. Note that the June deadline is on the 1st instead of the 15th, and it is the final revision deadline for consideration to present at VLDB 2023; submissions received after this deadline will roll over to VLDB 2024.
So whether it is (2023) or (2024) is a little ambiguous.
[0] https://www.vldb.org/pvldb/vol17/FrontMatterVol17No2.pdf
...but--and I don't know where the fault here lies--payroll garnished too much. Pay remained 25% less than it should have been, and my employer's hands were tied unless I had release forms faxed over, and then I had to go harass the state for a refund. Meanwhile, my autopay regimen was disrupted, so some bills went unpaid. But they were more responsible than I was and paid out in 4-6 weeks as promised.
All the time I didn't spend just sitting down and paying the taxes, I ended up spending on phone hold trying to reclaim overpayments and reactivate services. Would have been easier to just pay the taxes in the first place.
My TLDR would be: existing research has focused on trying to develop better models of database system behavior, but look at recent trends in modeling. Transformers, foundation models, AutoML -- modeling is increasingly "solved", as long as you have the right training data. Training data is the bottleneck now. How can we optimize the training data collection pipeline? Can we engineer training data that generalizes better? What opportunities arise when you control the entire pipeline?
Elaborating on that, I think you can abstract existing training data collection pipelines into these four modules:
- [Synthesizer]: The field has standardized on various synthetic workloads (e.g., TPC-C, TPC-H, DSB) and on common trace formats for real-world workloads (e.g., postgres_log, the MySQL general query log). Research on workload forecasting and dataset scaling exists. So in 2023, why can't I say "assuming trends hold, show me what my workload and database state will look like 3 months from now"? (A toy sketch of what I mean follows after this list.)
- [Trainer]: Given a workload and state (e.g., from the Synthesizer), existing research executes the workload on the state to produce training data. But executing workloads in real time kind of sucks. Maybe you have a workload trace that's one month long; well, I don't want to wait one month for training data. I can't just smash all the queries together either, since that wouldn't be representative of actual deployment conditions. So right now, I'm intrigued by the idea of executing workloads faster than real time. Think of the fast-forward button on a physics simulator, where you can reduce simulation fidelity in exchange for speed. Can we do that for databases? (A naive sketch of the idea also follows after this list.) I'm also interested in playing tricks to help the training data generalize across different hardware, and in general, there seems to be a lot of unexplored opportunity here. Actively working on this!
- [Planner]: Given the training data (e.g., from the Trainer) and an objective function (e.g., latency, throughput), you might consider a set of tuning actions that improve the objective (e.g., build some indexes, change some knob settings). But how should you represent these actions? For example, a number of papers one-hot encode the possible set of indexes, but (1) you cannot actually do this in practice because there are far too many candidate indexes, and (2) you lose any notion of "distance" between actions (e.g., indexes on the same table should probably be considered "related" in some way). Our research group is currently exploring some ideas here; see the toy encoding sketch after this list.
- [Decider]: Finally, once you're done applying all this domain-specific stuff to encode the states and actions, you're solidly in the realm of "learning to pick the best action" and can probably hand it off to an ML library. Why reinvent the wheel? :P That said, you can still do interesting work here (e.g., UDO is intelligent about batched action evaluation), but it's not something that I'm currently that interested in (relative to the other stuff above, which is more uncharted territory).
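To make the Synthesizer point concrete, here is a toy sketch of the "assuming trends hold" query I have in mind: made-up per-template daily counts plus a naive linear trend. A real forecaster would need to handle seasonality, template churn, and the database state too, which is exactly why it's interesting.

    # Toy sketch: extrapolate per-template query volumes 90 days out.
    # The trace format and the linear-trend assumption are illustrative only.
    import numpy as np

    # daily_counts[template] = executions per day, oldest first (made up)
    daily_counts = {
        "SELECT * FROM orders WHERE id = ?": [1000, 1100, 1180, 1300, 1420],
        "UPDATE inventory SET qty = qty - ?": [400, 410, 405, 430, 445],
    }

    def forecast(counts, days_ahead=90):
        """Fit a per-template linear trend and extrapolate."""
        days = np.arange(len(counts))
        slope, intercept = np.polyfit(days, counts, deg=1)  # least-squares line
        future_day = len(counts) - 1 + days_ahead
        return max(0.0, slope * future_day + intercept)

    for template, counts in daily_counts.items():
        print(f"{template:40s} -> ~{forecast(counts):,.0f} executions/day in 90 days")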
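For the Trainer's fast-forward idea, the naive version just compresses the think time between queries in a trace. Everything this sketch ignores (buffer pool warmth, background jobs, the fact that query execution time itself doesn't shrink) is where the actual research is; run_query and the trace format are placeholders, not a real API.

    # Toy sketch: replay a timestamped workload trace "faster than real time"
    # by shrinking the idle gaps between statements.
    import time

    def replay(trace, run_query, speedup=60.0):
        """trace: list of (timestamp_seconds, sql) sorted by timestamp.
        run_query: callable that executes one statement (hypothetical)."""
        prev_ts = trace[0][0]
        for ts, sql in trace:
            gap = (ts - prev_ts) / speedup  # compress think time only
            if gap > 0:
                time.sleep(gap)
            run_query(sql)                  # execution time is NOT compressed
            prev_ts = ts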
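And for the Planner's representation problem, one toy alternative to one-hot encoding: featurize each candidate index by its table and key columns, so indexes on the same table or columns land near each other in vector space. The schema and the feature choices here are made up for illustration; this is not a claim about what our group is actually building.

    # Toy sketch: feature-encode candidate index actions instead of one-hot,
    # so a notion of "distance" between actions exists.
    import numpy as np

    # Hypothetical schema: table -> ordered list of columns.
    schema = {
        "orders":    ["o_id", "o_custkey", "o_date", "o_status"],
        "customers": ["c_id", "c_name", "c_region"],
    }
    tables = sorted(schema)

    def encode_index(table, columns):
        """One slot per table (which table the index lives on) plus one slot
        per (table, column), weighted by key position so column order matters."""
        table_part = [1.0 if t == table else 0.0 for t in tables]
        col_part = []
        for t in tables:
            for c in schema[t]:
                if t == table and c in columns:
                    col_part.append(1.0 / (columns.index(c) + 1))
                else:
                    col_part.append(0.0)
        return np.array(table_part + col_part)

    a = encode_index("orders", ["o_custkey", "o_date"])
    b = encode_index("orders", ["o_custkey"])
    c = encode_index("customers", ["c_region"])
    # Indexes on the same table/columns end up close; unrelated ones end up far.
    print(np.linalg.norm(a - b), np.linalg.norm(a - c))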
If anyone is at SIGMOD this week, I'm happy to chat! :)
https://duckdb.org/pdf/SIGMOD2019-demo-duckdb.pdf
EDIT: oh the article is old
That said, OP's book looks more conversational in tone, which I personally have a slight preference for.
I can look at someone’s finished pivot table and reproduce it from the data through other means, but any explanation of what a pivot table actually is and does reads like pure gibberish to me.
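In code terms, those "other means" are basically group-by + aggregate + reshape, which, as far as I can tell, is all a pivot table is. A minimal pandas sketch with made-up sales data:

    # Made-up sales data; the point is that a pivot table is just
    # group-by + aggregate + reshape (long format -> wide format).
    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["East", "East", "West", "West", "West"],
        "product": ["A", "B", "A", "A", "B"],
        "amount":  [100, 50, 70, 30, 90],
    })

    # The "other means": group, aggregate, then spread the groups into columns.
    by_hand = sales.groupby(["region", "product"])["amount"].sum().unstack()

    # The spreadsheet-style one-liner that produces the same table.
    pivoted = sales.pivot_table(index="region", columns="product",
                                values="amount", aggfunc="sum")

    print(by_hand)
    print(pivoted)  # same numbers either way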