I highly recommend people give this paper a read. I think it points the way to a radical redesign of fundamental parts of the system stack over the next 5-10 years. If you work in systems and you aren’t thinking about this stuff, you’re about to be lapped.
However, it's certainly possible that the time for this idea has come. Google is probably in the best position to apply it.
I will say that after having worked at Google for over a decade, some of it on data center performance, there are plenty of inefficiencies that have nothing to do with algorithms or code, but have more to do with very coarse-grained resource allocation. In other words, they are "boring management problems" rather than sexy CS problems.
I should also add that you're expanding the claims made in the paper, which appears to be a common mistake. If you look at the related work and citations, you'll see that it's mostly about B-Trees, bloom filters, and indices. But I have seen people generalizing this to "a revolution in systems research", which I don't believe is justified.
Indexed data structures are important but a revolution in them is not going to cause a revolution in all of systems. Off the top of my head, the general area of "scheduling" (in time and space) is probably more important and more potentially impactful.
Have you seen the results when they let a trained model manage Borg? The power reductions were immediate, non-trivial, and performance stayed the same. There's your scheduling result for you.
Look at it this way. As the paper points out, a hash table is just a heuristic that works fairly well in the worst case and reasonably well in the average case. No one would argue that you couldn't hand-roll an algorithm that is better for a specific, narrowly defined data set. What this paper demonstrates is that you don't have to do it by hand: you can train a model automatically, without the expected caveats of size (the data set doesn't have to be huge), latency (the model trains very quickly), or specialized hardware (they used bog-standard CPUs, not GPUs or TPUs).
This is obviously just my opinion, but I think it's pretty big. It's not big because of the paper itself, although that's amazing in isolation (they improved on hash tables!). As I said above, it _points the way_. They have a throw-away sentence about co-optimizing data layout along with lookup, without showing those results. My guess is that beats a hash table in every way. More importantly, if a model can beat bloom filters, B-trees, and hash tables, you'd better hold onto your socks, because most algorithms are just waiting to fall. To me this paper is Deep Blue vs. Kasparov. We all know what comes next.
The article you link ends on a pretty weak claim: math didn't obsolete biologists, and ML won't obsolete systems folks. But every time you write a heuristic, you would probably get better results from a model.
Here's an interesting article I read from Zalando which, in the same vein as the Learned Indexes paper, takes a previously computationally expensive routine (Optimal Cart Pick) and learns an approximation of that complex function which, in the end, runs faster.
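To make that concrete, here's a toy sketch (my own code, not the paper's): a tiny "model" of the key distribution, just a sampled empirical CDF, spreads a heavily skewed key set across buckets far more evenly than naively scaling the raw key into the bucket range, and it "trains" in milliseconds on a plain CPU.

    import numpy as np

    rng = np.random.default_rng(1)
    keys = rng.lognormal(mean=0.0, sigma=1.0, size=50_000)   # heavily skewed keys
    n_buckets = 50_000

    # "training": sort a sample of the keys to get a piecewise-linear CDF estimate
    sample = np.sort(rng.choice(keys, size=5_000, replace=False))
    quantiles = np.linspace(0.0, 1.0, len(sample))

    def learned_bucket(key):
        rank = np.interp(key, sample, quantiles)          # estimated CDF in [0, 1]
        return min(int(rank * n_buckets), n_buckets - 1)

    def naive_bucket(key):
        # scale the raw key into the bucket range, ignoring its distribution
        return min(int(key / keys.max() * n_buckets), n_buckets - 1)

    for name, fn in [("naive scaling", naive_bucket), ("learned CDF", learned_bucket)]:
        load = np.bincount([fn(k) for k in keys], minlength=n_buckets)
        print(f"{name:14s} max bucket load: {load.max():4d}   "
              f"empty buckets: {int((load == 0).sum())}")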
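The general pattern looks something like this (my own toy sketch, nothing to do with Zalando's actual system; expensive_score is a made-up stand-in): sample the expensive routine offline, fit a cheap surrogate, and serve the surrogate wherever a small approximation error is acceptable.

    import time
    import numpy as np

    def expensive_score(x):
        # stand-in for a costly routine: a crude numerical integration
        t = np.linspace(0.0, 5.0, 200_000)
        return float(np.sum(np.exp(-t) * np.sin(x + t)) * (t[1] - t[0]))

    # offline: sample the routine and fit a cheap polynomial surrogate
    xs = np.linspace(0.0, 10.0, 200)
    ys = np.array([expensive_score(x) for x in xs])
    surrogate = np.polynomial.Polynomial.fit(xs, ys, deg=13)

    # online: the surrogate answers in a fraction of the time
    x = 3.7
    t0 = time.perf_counter()
    exact = expensive_score(x)
    t1 = time.perf_counter()
    approx = surrogate(x)
    t2 = time.perf_counter()
    print(f"exact  {exact:.4f} in {t1 - t0:.6f}s")
    print(f"approx {approx:.4f} in {t2 - t1:.6f}s")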
Yo! This is what I'm talking about! Replace directories with self-consistent neural nets [a neural black box of weights and activation values]. It's cool, but for every convenience we gain on the disk-space side we backpedal in processing [meaning: although the directory structure could be super condensed this way with a learned index, the approach is less flexible for cold or non-CPU-bound analysis]. It's actually very cool to consider the three branches: processing, disk, and memory cache... You want a balance between them, but in the ideal computer we do no calculation, and that is something worth mentioning.
Reading the abstract again, allow me to add that it's an optimization beyond structures such as B-trees, which are already not really decipherable to the naked eye. So any optimization in a space that isn't designed to be human-legible can take whatever shape it likes, and this is certainly a welcome achievement!
1) The Morning Paper - blog.acolyer.org - a new, super interesting paper review every day
2) This is my personal ritual: every weekday I check out recent entries in the "Hardware Architecture" and "Emerging Technologies" sections of arXiv.org.
Many other sections are gaining traction with me, like Databases.
I've read that paper, and I wasn't particularly impressed. Is there any concrete evidence that learned indices scale to databases with complex schemata and a large transactional volume? Could learned indices be profitably used, say, in the database backend of an ERP system?
There are immediate applications to data warehousing. I hope that's obvious to you, but if not I'm happy to unpack that.
Whether or not it applies to online databases probably depends on several factors. The ones I can think of off the top of my head are the read/write ratio, the write rate in general, and obviously the size of your data set (although the approach works at GB scale, not just TB, so most people can take advantage).
Here's how I, a systems engineer who doesn't design databases, could imagine using it immediately:
1) Do I have any indexed queries? If not, nothing to do here.
2) Does the data change rapidly? If so, nothing to do here.
3) Okay, now I'm in the zone of profitability:
1) Investigate just letting the model do all the work. This is unlikely to be successful in many cases, but it's pretty easy to try.
2) Investigate "model-assist": in this mode I let the model give me an answer and, if I don't like it, I use the B-tree. The model is fast enough that this is not a huge hit to my overall performance (a rough sketch of this mode is below).
3) Investigate "model-replace": in this mode I build my B-tree and do everything as normal. In the meantime I'm training my model. Once my model seems well trained (I could hand it a randomized subset of queries and, once its accuracy and performance beat the B-tree, consider it well trained), I switch to using it. As the data changes I could potentially go back to the B-tree, update the model (literally throw it away and rebuild), and switch again.
4) Investigate "co-opetition": make the B-tree a component of the model. If you read the paper, you see how they layer different models. A B-tree is just a particular kind of model that happens to be computationally expensive but has high accuracy. You could include it in the model graph and let it compete for queries like everything else.
Anyway, I'm speculating here, but it seems obvious this has immediate impact in one large space relating to databases (data warehousing) and potential impact in many of the rest.
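To make the "model-assist" idea a bit more concrete, here's a rough sketch (my own toy code: a linear fit stands in for the paper's staged models, and plain binary search stands in for the B-tree fallback). It assumes a sorted, read-mostly array of keys.

    import bisect
    import numpy as np

    rng = np.random.default_rng(0)
    keys = np.sort(rng.uniform(0.0, 1e6, size=100_000))
    positions = np.arange(len(keys))

    # "train": least-squares fit of position as a function of key, then record
    # the worst-case prediction error so lookups can bound their search window
    slope, intercept = np.polyfit(keys, positions, deg=1)
    predicted = (slope * keys + intercept).astype(np.int64)
    max_err = int(np.abs(predicted - positions).max())

    def lookup(key):
        guess = int(slope * key + intercept)
        lo = max(0, guess - max_err)
        hi = min(len(keys), guess + max_err + 1)
        # the model narrows the search to a small window; the classic
        # structure (here plain binary search) finishes the job
        i = bisect.bisect_left(keys, key, lo, hi)
        if i < len(keys) and keys[i] == key:
            return i
        return None

    print(lookup(keys[12_345]), "searched a window of", 2 * max_err + 1, "slots")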
Data warehousing is one obvious application: the data is immutable and has a sort key, so a neural net as a built-in index structure, built once, is perfect for that case. It is only a matter of time before commercial analytic DBs pick up this idea, I would presume.
Leslie Lamport's 1978 "State the Problem Before Describing the Solution" [0]. On his web page the author adds that "The title says it all. This one-page note is as relevant today as when I wrote it. Replace "describing the solution" by "writing the program" and it becomes a practical recipe for improving software."
Herewith, the paper in full:
"After several years of writing papers in computer science, I discovered the basic expository rule embodied in the title of this note. As obvious as this rule may seem, there are fields in which it is seldom observed. (Computer networking is one example.) A typical paper in such a field is organized as follows:
(1) a brief informal statement of the problem;
(2) the solution;
(3) a statement and proof of the precise correctness properties satisfied by the solution.
In order to abide by the rule, the following organization should instead be used:
(1) a brief informal statement of the problem;
(2) the precise correctness conditions required of a solution;
(3) the solution;
(4) a proof that the solution satisfies the requisite conditions.
Although it may not be obvious at first glance, there is a profound difference between these two approaches. In the first, the precise correctness conditions can be (and usually are) stated in terms of the solution itself. Some results are proved about the solution, but it is often not clear exactly what problem is being solved. This makes the comparison of two different solutions rather difficult. With the second approach, one is forced to specify the precise problem to be solved independently of the method used in the solution. This can be a surprisingly difficult and enlightening task. It has on several occasions led me to discover that a "correct" algorithm did not really accomplish what I wanted it to. I strongly urge everyone to observe the rule.
(I am ignoring as unworthy of consideration the disturbingly large number of papers that never even attempt a precise statement of what problem they are solving.)"
Thanks for this; it's very relevant to industry (or at least to mine). The number of software designs and programs that don't actually fit the technical/business problem from the get-go is astounding.
Even though it's from 1986, it's enlightening how useful it is to think about reactive systems using statecharts. Especially as a UI developer, this seems to make total sense, and it was an ah-ha moment when I first read it a couple of months ago.
Elegant and simple. Thank you for sharing it even though it is over 30 years old.
Next time you happen to be at a store that has the rare, old-fashioned type of automatic doors that swing out (not slide in/out), hurry out before your friends and stand on the pressure pad that lies in the direction the door swings out. No one will be able to exit.
This is a simple example of a state machine in the real world that I like to use to explain more complex state machines.
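If it helps to see it written down, here's a tiny toy model of that door as a state machine (my own framing of the parent's description, nothing more):

    from enum import Enum, auto

    class Door(Enum):
        CLOSED = auto()
        OPEN = auto()

    def step(state, inside_pad, swing_path_pad):
        """Return the next state given which pressure pads are occupied."""
        if state is Door.CLOSED:
            # someone wants out, but the door refuses to swing into a person
            if inside_pad and not swing_path_pad:
                return Door.OPEN
            return Door.CLOSED
        # Door.OPEN: close again once both pads are clear
        if not inside_pad and not swing_path_pad:
            return Door.CLOSED
        return Door.OPEN

    # the prank above: the swing-path pad stays occupied, so the door never opens
    print(step(Door.CLOSED, inside_pad=True, swing_path_pad=True))   # Door.CLOSED
    print(step(Door.CLOSED, inside_pad=True, swing_path_pad=False))  # Door.OPEN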
It's an old one, but this year I read The Emperor's Old Clothes by C.A.R. Hoare for the first time, and it sent me down a Hoare rabbit hole; he's just a brilliant guy.
seL4 is about 9000 LOC. So this gives a good indication of what formal verification (Isabelle/HOL) is currently capable of. seL4 is also quite fast as a result of removing unnecessary checks.
Note that this is formal verification of an imperative, C-based program manipulating low-level structures, done years ago. I wouldn't say that represents what we're currently capable of in the general case. For starters, a functional program, or an imperative one without pointers (e.g. in SPARK), is much easier to verify than a C program. Plus, the seL4 organization recently reduced the effort for filesystems by a few multiples with a functional language.
Truth is, the seL4 team set themselves up for a harder-than-usual job. For good reason, but most folks (especially those using managed languages) won't see as much difficulty.
I want to learn formal verification on my own. Do you know of any resources? I got Rolf Drechsler's book on circuit verification, but I would like to master both hardware and software. Would you kindly provide me with some pointers?
It's a broad subject and I'm only interested in software verification. It's specialized enough that you should be reading papers and reading about systems. My favorite systems+papers are:
* Coq/dependent types. Check out Software Foundations by Benjamin Pierce et al. Everything's online. Also see Idris, which has a good book from Manning.
* SMT-solver-based verification of existing languages. See SPARK/Ada (there's a good book but I can't remember the name presently) and the GNAT website. Also, Frama-C, although the documentation is more spotty. Then there's Rustan Leino's work on Dafny and Boogie. Oh, and Why3. And some work on verifying Java code that I haven't played with yet.
I think this paper is amazing because it solves a complex problem with a simple solution. How do you create a hash function that adjusts to a varying number of underlying buckets? Solution: hash to a circle.
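For anyone who hasn't seen it in code, here's a minimal sketch of the "hash to a circle" idea (my own toy code, not from the paper), with virtual nodes to keep the key distribution even:

    import bisect
    import hashlib

    def point(s):
        # place a string deterministically on the circle [0, 2^128)
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    class Ring:
        def __init__(self, nodes, replicas=100):
            # each node appears at many positions ("virtual nodes") on the circle
            self._points = sorted(
                (point(f"{node}#{i}"), node)
                for node in nodes
                for i in range(replicas)
            )
            self._hashes = [h for h, _ in self._points]

        def node_for(self, key):
            # first node clockwise from the key's position, wrapping around
            i = bisect.bisect(self._hashes, point(key)) % len(self._points)
            return self._points[i][1]

    ring = Ring(["cache-a", "cache-b", "cache-c"])
    print(ring.node_for("user:42"), ring.node_for("user:43"))

Adding or removing a node only remaps the keys adjacent to its positions on the circle, which is the whole point.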
The Raft consensus algorithm article [1] and the follow-up that modifies it slightly with some changes that are useful when implementing it [2]. Both are written very clearly and are very focused on actually using the algorithm in practice. Problems that arise when implementing it are discussed, and solutions are proposed as well.
I found Knuth's “Dancing Links” paper [1] very well written and a somewhat easy read (I had to reread certain parts a couple of times). I had to write a sudoku solver for one of my classes, and I read that dancing links and Algorithm X were one way to do it [2]. I then read some things around the internet about applying dancing links to sudoku solving [3] [4]. The Wikipedia entry on exact cover problems has a section on sudoku as an exact cover problem [5], which is one thing you need to understand in order to implement a sudoku solver with Algorithm X and dancing links.
Knuth even talks about dancing links in his new 2017 Christmas Tree Lecture [6] specifically here at 4:28 [7]. Basically he uses dancing links to solve for certain n in the problem he sets up in that lecture.
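In case it helps anyone going the sudoku route: the search itself is just Algorithm X over an exact-cover instance; dancing links is the linked-list machinery that makes the cover/uncover steps cheap. Here's a bare-bones sketch of the recursion without that machinery (my own code, on a toy instance; for sudoku the columns would be the four constraint families and each row a candidate digit placement):

    def exact_cover(columns, rows, partial=()):
        """Yield tuples of row names that cover every column exactly once.

        columns: frozenset of constraint identifiers still unsatisfied
        rows:    dict mapping row name -> frozenset of columns it satisfies
        """
        if not columns:
            yield partial
            return
        # Knuth's heuristic: branch on the column with the fewest candidates
        col = min(columns, key=lambda c: sum(c in r for r in rows.values()))
        for name, covered in rows.items():
            if col not in covered:
                continue
            # keep only rows that don't overlap the row we just chose
            remaining = {n: r for n, r in rows.items() if not (r & covered)}
            yield from exact_cover(columns - covered, remaining, partial + (name,))

    # tiny demo: cover {1, 2, 3}; the two covers are ('A', 'B') and ('C', 'D')
    rows = {"A": frozenset({1}), "B": frozenset({2, 3}),
            "C": frozenset({1, 2}), "D": frozenset({3})}
    print(list(exact_cover(frozenset({1, 2, 3}), rows)))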
You may like to know that Knuth's fascicle 5C of The Art of Computer Programming Volume 4 is going to be entirely about Dancing Links. He's still working on it, but the “incomplete draft” is available at the (hidden) link https://cs.stanford.edu/~knuth/fasc5c.ps.gz — check it out if you're interested; it's already 130 pages of fun.
Wow, this really is good. I started it and skimmed the rest: the first 34 pages are explanations and algorithms for dancing links, and the remainder of the 130 pages is all exercises and their answers. Truly awesome, thanks again for the link.
I think he mentioned that it would be in his new book, but I had no idea there would be over 100 pages of dancing links. That is sweet! I’d love to take a look at it, so I could get a better grasp of how it applies to other problems.
The Case for Learned Index Structures:
https://arxiv.org/pdf/1712.01208v1.pdf
https://www.sigarch.org/the-unreasonable-ineffectiveness-of-...
AFAIK storage is not the system bottleneck it used to be. We always want more, but network and cores are relatively plentiful.
If we could magically (and safely) modify the software stack, which areas could give 2x or 3x improvements?
https://jobs.zalando.com/tech/blog/accelerating-warehouse-op...
Can you expand on this bit? I'm not sure I understand what you mean
Besides HN, how can one find great papers like this? How did you find out about it?
Hope this helps.
https://github.com/papers-we-love/papers-we-love
http://learningsys.org/nips17/assets/slides/dean-nips17.pdf
You can generally find a paper for each section.
Additionally, there is http://www.arxiv-sanity.com/ which sorts new machine learning papers by popularity.
[0] https://lamport.azurewebsites.net/pubs/state-the-problem.pdf
Indeed. Many -- perhaps even a majority of -- startup landing pages break this rule.
I read the annotated version on Fermat's Library - great source of interesting papers (http://fermatslibrary.com/s/the-emperors-old-clothes).
https://www.sigops.org/sosp/sosp09/papers/klein-sosp09.pdf
https://sel4.systems/
seL4 is smaller than L4Ka::Pistachio, and it's also capability-based, which L4 isn't. They could have called it L5 or seL5.
[1] https://raft.github.io/raft.pdf
[2] http://openlife.cc/system/files/4-modifications-for-Raft-con...
[1] https://arxiv.org/abs/cs/0011047
[2] https://en.wikipedia.org/wiki/Sudoku_solving_algorithms#Exac...
[3] http://garethrees.org/2007/06/10/zendoku-generation/
[4] https://www.ocf.berkeley.edu/~jchu/publicportal/sudoku/sudok...
[5] https://en.wikipedia.org/wiki/Exact_cover#Sudoku
[6] https://www.youtube.com/watch?v=BxQw4CdxLr8
[7] https://youtu.be/BxQw4CdxLr8?t=4m28s