AFAIK people didn't take MongoDB seriously from the start, especially with the "web scale database" joke circulating. The Neo4j Community version has been under GPLv3 for quite some time, while the Enterprise version has always been somewhat closed, regardless of whether the source code was available on GitHub (the mentioned license change affected the Enterprise version).
Regarding CockroachDB, I must admit that I've only heard about it on HN and don't know anyone who uses it seriously. As for Kafka, there are two versions: Apache Kafka, the open-source version that almost everyone uses (under the Apache license), and Confluent Kafka, which is Apache Kafka enhanced with many additional features from Confluent; the license change affected only the Confluent version. In short, maybe the majority simply didn't care enough about these projects, so there was no major fork.
> It cannot be because the Redis and Elasticsearch install base is so much larger than these other systems, and therefore, there were more people upset by the change since the number of MongoDB and Kafka installations was equally as large when they switched their licenses.
I can’t speak for MongoDB, but the Confluent Kafka install base is significantly smaller than that of Apache Kafka, Redis and ES.
> Dana, Bohan, and I worked on this research project and startup for almost a decade. And now it is dead. I am disappointed at how a particular company treated us at the end, so they are forever banned from recruiting CMU-DB students. They know who they are and what they did.
Call me a skeptic, but I can't see this as a fair approach. If your company fails, for whatever reasons, you should not enlist the university department/group/students against your peers (I also can't find any indication that CMU-DB itself was one of the founders of OtterTune).
Wrt Andy, here [1] are some somewhat interesting views from (presumably) former employees.
[1] https://www.reddit.com/r/Database/comments/1dgaazw/comment/l...
As a student who chose to stay at CMU for a PhD because of this group, I see it as quite the opposite situation - you may also be misunderstanding the nature of the "ban" (students can still apply to the company directly).
From the student perspective, we benefit from knowing the reputation of potential employers. For example: CompanyX went back on their promises so don't trust them unless they give it to you right away, CompanyY has a culture of being stingy, the people who went to CompanyZ love it there, and so on.
So it's more like (1) providing additional data about the company's past behavior, and (2) not actively giving the company a platform. I personally find this great for students.
(VLDB Volume 17 no 2 is 2023)
> All papers published in this issue will be presented at the 50th International Conference on Very Large Data Bases, Guangzhou, China, 2024.
And that's because the submission guidelines [1] say:
> The last three revision deadlines will be May 15, June 1, and July 15, 2023. Note that the June deadline is on the 1st instead of the 15th, and it is the final revision deadline for consideration to present at VLDB 2023; submissions received after this deadline will roll over to VLDB 2024.
So whether it is (2023) or (2024) is a little ambiguous.
[0] https://www.vldb.org/pvldb/vol17/FrontMatterVol17No2.pdf
...but--and I don't know where the fault here lies--payroll garnished too much. Pay remained 25% less than it should have been, and my employer's hands were tied unless I had release forms faxed over, and then I had to go harass the state for a refund. Meanwhile, my autopay regimen was disrupted, so some bills went unpaid. But they were more responsible than I was and paid out in 4-6 weeks as promised.
All the time I didn't spend just sitting down and paying the taxes, I ended up spending on phone hold trying to reclaim overpayments and reactivate services. Would have been easier to just pay the taxes in the first place.
My TLDR would be: existing research has focused on trying to develop better models of database system behavior, but look at recent trends in modeling. Transformers, foundation models, AutoML -- modeling is increasingly "solved", as long as you have the right training data. Training data is the bottleneck now. How can we optimize the training data collection pipeline? Can we engineer training data that generalizes better? What opportunities arise when you control the entire pipeline?
Elaborating on that, I think you can abstract existing training data collection pipelines into these four modules:
- [Synthesizer]: The field has standardized on various synthetic workloads (e.g., TPC-C, TPC-H, DSB) and on common trace formats for real-world workloads (e.g., postgres_log, the MySQL general query log). Research on workload forecasting and dataset scaling exists. So in 2023, why can't I say "assuming trends hold, show me what my workload and database state will look like 3 months from now"? (A toy sketch of what I mean follows after this list.)
- [Trainer]: Given a workload and state (e.g., from the Synthesizer), existing research executes the workload on the state to produce training data. But executing workloads in real time kind of sucks. Maybe you have a workload trace that's one month long; well, I don't want to wait one month for training data. I can't just smash all the queries together either, since that wouldn't be representative of actual deployment conditions. So right now, I'm intrigued by the idea of executing workloads faster than real time. Think of the fast-forward button on a physics simulator, where you can reduce simulation fidelity in exchange for speed. Can we do that for databases? (A naive sketch of the idea also follows after this list.) I'm also interested in playing tricks to help the training data generalize across different hardware, and in general, there seems to be a lot of unexplored opportunity here. Actively working on this!
- [Planner]: Given the training data (e.g., from the Trainer) and an objective function (e.g., latency, throughput), you might consider a set of tuning actions that improve the objective (e.g., build some indexes, change some knob settings). But how should you represent these actions? For example, a number of papers one-hot encode the possible set of indexes, but (1) you cannot actually do this in practice because there are far too many candidate indexes, and (2) you lose any notion of "distance" between actions (e.g., indexes on the same table should probably be considered "related" in some way). Our research group is currently exploring some ideas here; see the toy encoding sketch after this list.
- [Decider]: Finally, once you're done applying all this domain-specific stuff to encode the states and actions, you're solidly in the realm of "learning to pick the best action" and can probably hand it off to an ML library. Why reinvent the wheel? :P That said, you can still do interesting work here (e.g., UDO is intelligent about batched action evaluation), but it's not something that I'm currently that interested in (relative to the other stuff above, which is more uncharted territory).
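To make the Synthesizer point concrete, here is a toy sketch of the "assuming trends hold" query I have in mind: made-up per-template daily counts plus a naive linear trend. A real forecaster would need to handle seasonality, template churn, and the database state too, which is exactly why it's interesting.

    # Toy sketch: extrapolate per-template query volumes 90 days out.
    # The trace format and the linear-trend assumption are illustrative only.
    import numpy as np

    # daily_counts[template] = executions per day, oldest first (made up)
    daily_counts = {
        "SELECT * FROM orders WHERE id = ?": [1000, 1100, 1180, 1300, 1420],
        "UPDATE inventory SET qty = qty - ?": [400, 410, 405, 430, 445],
    }

    def forecast(counts, days_ahead=90):
        """Fit a per-template linear trend and extrapolate."""
        days = np.arange(len(counts))
        slope, intercept = np.polyfit(days, counts, deg=1)  # least-squares line
        future_day = len(counts) - 1 + days_ahead
        return max(0.0, slope * future_day + intercept)

    for template, counts in daily_counts.items():
        print(f"{template:40s} -> ~{forecast(counts):,.0f} executions/day in 90 days")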
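For the Trainer's fast-forward idea, the naive version just compresses the think time between queries in a trace. Everything this sketch ignores (buffer pool warmth, background jobs, the fact that query execution time itself doesn't shrink) is where the actual research is; run_query and the trace format are placeholders, not a real API.

    # Toy sketch: replay a timestamped workload trace "faster than real time"
    # by shrinking the idle gaps between statements.
    import time

    def replay(trace, run_query, speedup=60.0):
        """trace: list of (timestamp_seconds, sql) sorted by timestamp.
        run_query: callable that executes one statement (hypothetical)."""
        prev_ts = trace[0][0]
        for ts, sql in trace:
            gap = (ts - prev_ts) / speedup  # compress think time only
            if gap > 0:
                time.sleep(gap)
            run_query(sql)                  # execution time is NOT compressed
            prev_ts = ts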
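And for the Planner's representation problem, one toy alternative to one-hot encoding: featurize each candidate index by its table and key columns, so indexes on the same table or columns land near each other in vector space. The schema and the feature choices here are made up for illustration; this is not a claim about what our group is actually building.

    # Toy sketch: feature-encode candidate index actions instead of one-hot,
    # so a notion of "distance" between actions exists.
    import numpy as np

    # Hypothetical schema: table -> ordered list of columns.
    schema = {
        "orders":    ["o_id", "o_custkey", "o_date", "o_status"],
        "customers": ["c_id", "c_name", "c_region"],
    }
    tables = sorted(schema)

    def encode_index(table, columns):
        """One slot per table (which table the index lives on) plus one slot
        per (table, column), weighted by key position so column order matters."""
        table_part = [1.0 if t == table else 0.0 for t in tables]
        col_part = []
        for t in tables:
            for c in schema[t]:
                if t == table and c in columns:
                    col_part.append(1.0 / (columns.index(c) + 1))
                else:
                    col_part.append(0.0)
        return np.array(table_part + col_part)

    a = encode_index("orders", ["o_custkey", "o_date"])
    b = encode_index("orders", ["o_custkey"])
    c = encode_index("customers", ["c_region"])
    # Indexes on the same table/columns end up close; unrelated ones end up far.
    print(np.linalg.norm(a - b), np.linalg.norm(a - c))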
If anyone is at SIGMOD this week, I'm happy to chat! :)
https://duckdb.org/pdf/SIGMOD2019-demo-duckdb.pdf
EDIT: oh the article is old
That said, OP's book looks more conversational in tone, which I personally have a slight preference for.
I can look at someone’s finished pivot table and reproduce it from the data through other means, but any explanation of what a pivot table actually is and does reads like pure gibberish to me.
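In code terms, those "other means" are basically group-by + aggregate + reshape, which, as far as I can tell, is all a pivot table is. A minimal pandas sketch with made-up sales data:

    # Made-up sales data; the point is that a pivot table is just
    # group-by + aggregate + reshape (long format -> wide format).
    import pandas as pd

    sales = pd.DataFrame({
        "region":  ["East", "East", "West", "West", "West"],
        "product": ["A", "B", "A", "A", "B"],
        "amount":  [100, 50, 70, 30, 90],
    })

    # The "other means": group, aggregate, then spread the groups into columns.
    by_hand = sales.groupby(["region", "product"])["amount"].sum().unstack()

    # The spreadsheet-style one-liner that produces the same table.
    pivoted = sales.pivot_table(index="region", columns="product",
                                values="amount", aggfunc="sum")

    print(by_hand)
    print(pivoted)  # same numbers either way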