kybernetikos · 5 years ago
I always fancied a database that worked like a dynamic version of Prolog, with facts that could be added over time, queries that can use inference, etc. I couldn't really find anything that worked like that, though; everything seemed to either have a fixed set of facts or didn't support inference. Looking for something like that is what first got me interested in differential dataflow.

The fact that this is bottom up makes me think that you probably can't go crazy with the number of rules.

triska · 5 years ago
Prolog provides this with so-called dynamic predicates.

For example, if we declare the predicate f/1 dynamic using the standard directive dynamic/1:

    :- dynamic(f/1).
then we can assert new clauses for f/1 using the standard predicates asserta/1 and assertz/1:

    ?- asserta(f(a)), assertz(f(b)).
       true.
We then get:

    ?- f(X).
       X = a
    ;  X = b.
Rules can also be asserted:

    ?- use_module(library(lists)), asserta((f(X):-length(_,X))).
       true.
yielding:

    ?- f(X).
       X = 0
    ;  X = 1
    ;  X = 2
    ;  X = 3
    ;  ...
I used Scryer Prolog to run these examples.

Dynamic changes to the database have advantages and drawbacks. On the plus side, they let us incorporate new knowledge and rules, and we automatically benefit from built-in features such as argument indexing. However, running the same query twice can lead to different results, changes in the order of goals can affect the meaning of the conjunction, and we can no longer reason about different predicates in isolation. Dynamic changes of the database are said to be side-effects, since they affect the program in a way that is outside the logical interpretation of the given definitions.

YeGoblynQueenne · 5 years ago
>> Dynamic changes to the database have advantages and drawbacks. On the plus side, they let us incorporate new knowledge and rules, and we automatically benefit from built-in features such as argument indexing. However, running the same query twice can lead to different results, changes in the order of goals can affect the meaning of the conjunction, and we can no longer reason about different predicates in isolation. Dynamic changes of the database are said to be side-effects, since they affect the program in a way that is outside the logical interpretation of the given definitions.

I wonder if that is part of the benefits advertised by DDlog: the ability to update a program without having to worry about clause ordering (which one does not have to with datalog and bottom-up evaluation) or side effects generally.

Very tentatively, I'm guessing that, since datalog programs lack negation, any new facts or rules added to the (extensional) database will update the "output" of the database (its Herbrand model) monotonically, i.e. you'll see new facts coming out but no facts will "disappear". I'm a bit unsure because the DDlog GitHub repo talks about functions and typed datalog and I'm not really sure how that works now.
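That monotonicity guess can be illustrated with a tiny naive bottom-up evaluator in Python (the edge/path relations and the rule are invented for illustration; this is naive evaluation, not DDlog's actual algorithm):

```python
# Naive bottom-up evaluation of a negation-free program, to illustrate
# monotonicity: adding an input fact can only grow the derived model.

def fixpoint(facts, rules):
    """Apply rules until no new facts are derived (naive evaluation)."""
    derived = set(facts)
    while True:
        new = {head for rule in rules for head in rule(derived)} - derived
        if not new:
            return derived
        derived |= new

# Rules: edge(X,Y) -> path(X,Y); edge(X,Y), path(Y,Z) -> path(X,Z).
def path_rules(db):
    out = set()
    for (p, x, y) in db:
        if p == "edge":
            out.add(("path", x, y))
    for (p, x, y) in db:
        for (q, y2, z) in db:
            if p == "edge" and q == "path" and y == y2:
                out.add(("path", x, z))
    return out

m1 = fixpoint({("edge", "a", "b")}, [path_rules])
m2 = fixpoint({("edge", "a", "b"), ("edge", "b", "c")}, [path_rules])
# Without negation, adding a fact can only grow the model:
assert m1 <= m2
```

With negation in the rules this guarantee would be lost, since a new fact could falsify a previously derived one.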

kybernetikos · 5 years ago
This is interesting and Scryer Prolog looks like a great project. I wonder if it's possible to run it in a DB-like mode - with durable changes, and accessible as a server.
aminechikhaoui · 5 years ago
There is already https://developer.logicblox.com/ but it's proprietary owned by Infor nowadays.
ludsan · 5 years ago
I have been very interested in CHR (constraint handling rules) for this kind of activity. It was developed (I think) to provide a systematic basis for the various constraint libraries to 'speak the same language' instead of being implemented without knowledge of each other.

I don't see much literature out there on it (after the mid-2000s) but it always seemed a conceptually elegant way of doing forward-chaining (and inference) in the way of most rules engines. It only has a few simple rule types: simplification (which removes constraints), propagation (which adds new ones), and simpagation (a combination of the two).

The best part is that it embeds very easily in Prolog.

flyingsilverfin · 5 years ago
The company I work for, Grakn Labs (grakn.ai), builds a database that does datalog + negation style rules! You can add/remove data as you go, and also add/remove rules in the schema over time. The terminology the post here uses is different from ours, but roughly we do "backward chaining", which works backward from the query to the set of facts that were inserted, which the article calls "top-down".
namibj · 5 years ago
Do you handle incremental changes to the input data? Because that's the major selling point of DDlog: low-latency responses to changes in inputs.
seanmcdirmid · 5 years ago
Isn’t datalog with backward chaining just prolog? Datalog was known for forward chaining (bottom up) while prolog was known for backward chaining, datalog with backward chaining sounds like a misnomer.
namibj · 5 years ago
This gives you dynamic facts, feed-forward rule application, and soon even dynamic rules.

It doesn't actually care much about the number of rules; what matters more is how much memory it costs to materialize them. There are delta-join techniques to do at least most of that in linear memory (w.r.t. input size), though, and DDlog has at least experimental support for them.

sriku · 5 years ago
It's great to see differential-dataflow and timely-dataflow (both by Frank McSherry) come to life recently with this and materialize.com
fulafel · 5 years ago
Aside from their custom syntax in DDlog, how much semantic overlap is there with the query language of Datomic/Crux/Datalevin/Datascript flavour? Apart from the differential features. This one seems to be big on static type definitions at least.

edit: I guess this explains some of the static nature: "A DDlog program is compiled into a Rust library"

namibj · 5 years ago
You don't really use dynamic datalog queries with DDlog. Instead you write rules that handle your queries in a streaming fashion (a "prepare" statement at compile time; stream tuples of parameters at runtime and listen for outputs), for when you can't afford to keep up-to-date a materialized view that would turn your queries into simple KV-lookup operations.
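As a toy sketch of that pattern (in Python, with invented relation names; DDlog actually compiles to Rust, this only shows the shape of the idea): the query is "prepared" by materializing an indexed view once, and each runtime query becomes a simple lookup.

```python
# Sketch: a grandparent query as a "prepared statement" over a
# materialized view, instead of an ad-hoc datalog query.
from collections import defaultdict

parent = {("maria", "kostas"), ("kostas", "yannis")}

# "Compile time": materialize grandparent(X, Y) once, indexed by X.
by_parent = defaultdict(set)
for x, y in parent:
    by_parent[x].add(y)

grandparent = defaultdict(set)
for x, children in by_parent.items():
    for z in children:
        for y in by_parent.get(z, ()):
            grandparent[x].add(y)

# "Runtime": stream parameter tuples in, listen for output tuples.
def answer(query_stream):
    for x in query_stream:
        for y in grandparent.get(x, ()):
            yield (x, y)

print(list(answer(["maria"])))  # [('maria', 'yannis')]
```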

But, do not worry: Support for run-time rule updates is in the works and would allow for arbitrary dynamic datalog queries.

pjungwir · 5 years ago
So could you use this to implement always-up-to-date materialized views in Postgres? It says it uses relational data for inputs and outputs. It says it operates in-memory but also that it starts up from a set of known facts (which sounds like it could just read the current matview at startup). This would be a really cool extension for someone to build!
namibj · 5 years ago
Yes. Almost exactly what you're describing exists [0] already.

[0]: https://materialize.com/

aidos · 5 years ago
Looks like it’s Postgres protocol compatible rather than running as a Postgres extension?

I’ve been keeping an eye on work that’s been done to support this internally in Postgres. Looks like it’s still moving along. I guess it’s a pretty tricky puzzle (especially once you add aggregates). Depending on the performance implications, it would sit in a pretty sweet spot for some use cases. For example, at the moment I have a form of materialised view that’s simply too slow to build from scratch. My compromise is to run periodically, figure out which subset of the table needs refreshing, and update just those rows.

https://wiki.postgresql.org/wiki/Incremental_View_Maintenanc...
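The compromise described above (track which rows are stale, refresh only those) can be sketched roughly like so; the table, view, and the ×10 derivation are invented placeholders:

```python
# Sketch of dirty-key tracking for a hand-rolled incremental view.
base = {"a": 1, "b": 2, "c": 3}              # base table: key -> value
view = {k: v * 10 for k, v in base.items()}  # "matview": key -> derived value
dirty = set()                                # keys changed since last refresh

def write(key, value):
    base[key] = value
    dirty.add(key)                           # record the change, don't recompute yet

def refresh():
    for key in dirty:                        # recompute only the affected rows
        view[key] = base[key] * 10
    dirty.clear()

write("b", 5)
refresh()
assert view == {"a": 10, "b": 50, "c": 30}
```

Systems like DDlog generalize this: the "dirty set" is a stream of input deltas, and the rules propagate them to output deltas automatically.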

tda · 5 years ago
What a horrible landing page. Is it a stand alone product? Does it integrate with my SQL database? Is it a server side application or a client side thing? What is the pricing model? So many questions, zero answers

Am in the process of exploring Kafka to do something similar, and I think it will work, but it requires quite a complex setup (compared to an incrementally updated materialized view).

YeGoblynQueenne · 5 years ago
>> It says it operates in-memory but also that it starts up from a set of known facts (which sounds like it could just read the current matview at startup).

That "it starts up from a set of known facts" refers to the "bottom-up" evaluation of datalog programs, in contrast with the top-down evaluation in Prolog.

Consider for instance the following database of facts and rules:

  grandparent(X,Y):- parent(X,Z), parent(Z,Y). % (1)
  parent(maria, kostas).
  parent(kostas,yannis).
Prolog's top-down evaluation starts with the "head" of rule (1), grandparent(X,Y) and proceeds to its "body", (parent(X,Z), parent(Z,Y)) to instantiate the variables {X,Y,Z} according to the facts in the database and derive the new fact grandparent(maria,yannis).

Datalog's bottom-up evaluation instead starts with the facts in the program: parent(maria,kostas), parent(kostas, yannis) and adds those to the set of immediate consequences of the program (TP is called the "immediate consequence operator"):

  TP₀ = {parent(maria,kostas), parent(kostas,yannis)}
Now the "parent" facts in the body of the "grandparent" rule can be instantiated as follows:

  (parent(maria,kostas), parent(kostas,yannis)).
And finally the head of the rule can be instantiated to derive the same new atom as with Prolog's top-down evaluation:

  TP₁ = TP₀ ∪ {grandparent(maria,yannis)}
The advantage of bottom-up evaluation is decidability, as long as a program is datalog, which means no function symbols other than constants (which are function symbols of arity 0, i.e. with 0 arguments) and no negation. For instance, given the following program with a left-recursive clause, datalog's bottom-up evaluation terminates:

  ancestor(X,Y):- parent(X,Y). % (2)
  ancestor(X,Y):- ancestor(X,Z), ancestor(Z,Y). % (3)
  parent(maria,kostas).
  parent(kostas,yannis).
Given this program, datalog's bottom-up evaluation will first add the two facts of "parent" to the immediate consequences of the program; then instantiate the fact parent(X,Y) in the body of rule (2) to the two facts of "parent" in the program; and finally instantiate the head of rule (2), deriving two new facts of "ancestor":

  TP₁ = {ancestor(maria,kostas), ancestor(kostas,yannis), parent(maria,kostas),
  parent(kostas,yannis)}
Then the two facts of "ancestor" in the body of rule (3) will be instantiated to the facts derived in the previous step, and a new fact derived from the instantiation of the variables in the head of the rule:

  TP₂ = TP₁ ∪ {ancestor(maria,yannis)}
And then the program will terminate because there are no more new facts to derive.

By contrast, Prolog's top-down evaluation will first correctly derive the two new "ancestor" facts from the head of rule (2), but then will try to evaluate the first ancestor(X,Z) goal in the body of rule (3). This will take it back to the head of rule (3) and evaluation will be stuck in an infinite recursion.

The trade-off is that bottom-up evaluation cannot handle negation or functions (including the cons function used to construct lists) and so is, in that sense, incomplete, whereas Prolog can do both. For this reason, datalog is preferred as a database language that offers a limited ability for reasoning (certainly richer than what is possible e.g. with SQL, that essentially only allows facts) whereas Prolog is preferred for general-purpose programming.
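For what it's worth, the TP iteration walked through above can be sketched as a naive fixpoint loop in Python (hard-coded for just this ancestor program; real engines use semi-naive or, in DDlog's case, differential evaluation):

```python
# One application of the immediate consequence operator TP for the
# ancestor program: rules (2) and (3) from the comment above.
def tp(db):
    out = set(db)
    for (p, x, y) in db:
        if p == "parent":                         # rule (2)
            out.add(("ancestor", x, y))
    for (p, x, z) in db:
        for (q, z2, y) in db:
            if p == q == "ancestor" and z == z2:  # rule (3)
                out.add(("ancestor", x, y))
    return out

# Iterate TP from the facts until no new atoms appear (the fixpoint).
db = {("parent", "maria", "kostas"), ("parent", "kostas", "yannis")}
while True:
    nxt = tp(db)
    if nxt == db:     # no new facts to derive: terminate
        break
    db = nxt

# The left-recursive rule (3) is harmless here, unlike in Prolog:
assert ("ancestor", "maria", "yannis") in db
```

The loop terminates because the Herbrand base of a function-free program is finite, which is exactly the decidability argument made above.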

lykahb · 5 years ago
I wonder how DDlog compares to Souffle. I think that the biggest difference is that Souffle is batch-oriented. It is less clear how different the expressiveness, type systems, or recommended patterns are. The syntax of DDlog seems to be closer to Souffle than to Datomic with EDN.
ben_pfaff · 5 years ago
I don't know anything about Souffle so I can't directly answer the question. But there is some related material.

The DDlog repo includes a Souffle-to-DDlog converter: https://github.com/vmware/differential-datalog/blob/v0.38.0/...

The Souffle converter is somewhat incomplete: https://github.com/vmware/differential-datalog/issues/174

jitl · 5 years ago
Very interesting! I started playing with Rust, timely-dataflow and differential-dataflow last weekend, but it was a lot to learn all at once. This looks like it's solved or suggested a path to productionizing differential-dataflow without needing a separate rust service and lots of operation tooling. It would be cool to use something like this for Notion's user-defined databases and queries.
namibj · 5 years ago
You'll be happy to hear that logic/rule updates at runtime recently transitioned from the "todo... at some time" to the "todo... working on it" state.

Do these databases ever grow beyond what you can afford to keep in RAM/swap? I'm working towards high-throughput out-of-core support in differential-dataflow, but can't get much actual coding done in my off time. Discussing the concepts and design isn't affected by that, though, so feel free to contact me.

jitl · 5 years ago
I would love to have the option to suspend computation for unwanted outputs somehow (to disk) so I can scale way, way down with only partial data. My solution for “fits in RAM” is just “shard it”... but really I haven’t gotten far enough in my experiments to need to yet.
lukeswissman · 5 years ago
> You'll be happy to hear that logic/rule updates at runtime recently transitioned from the "todo... at some time" to the "todo... working on it" state.

What exactly do you mean with "logic/rule updates" ...in dd I can change/modify the actual dataflow?

0x70dd · 5 years ago
Rego, the language used in Open Policy Agent, is based on and extends Datalog. It's gaining a lot of traction in the past couple of years for evaluating authorization policies.
ben_pfaff · 5 years ago
There's an indirect relationship between Rego and DDlog, at least people-wise. OPA comes from Styra, which was founded by Tim Hinrichs and Teemu Koponen. Teemu designed nlog, which is also a Datalog variant, for use in NVP, which was the network virtualization product at Nicira and was later renamed NSX after VMware acquired Nicira. Tim also worked with Teemu (and me) on NSX at VMware. And Teemu was one of the architects of OVN (the biggest open source application for DDlog), with Tim also having some influence. And Teemu also knows Leonid and Mihai (the principal authors of DDlog).

Some of the episodes of my OVS Orbit podcast may be relevant:

* Episode 5: nlog, with Teemu Koponen from Styra and Yusheng Wang from VMware (https://ovsorbit.org/#e5)

* Episode 44: Cocoon-2, with Leonid Ryzhyk from VMware Research (https://ovsorbit.org/#e44)

* Episode 58: Toward Leaner, Faster ovn-northd, with Leonid Ryzhyk from VMware Research Group (https://ovsorbit.org/#e58)

simplify · 5 years ago
Oso[1] is another authorization language, but based on Prolog. It'd be interesting to compare the capabilities between the two.

[1] https://www.osohq.com

userbinator · 5 years ago
I'm curious what VMware uses this for.
jacques_chester · 5 years ago
I don't know how much I can say since I work for VMware, but it's finding new uses all the time and I've personally brought it up with other teams.

I'd known of it a while before seeing a presentation on how one service team used it. They had a stream processing problem where they were beginning to lag the input by up to an hour. With differential datalog the lag dropped to sub-second.

It's seriously cool technology.

namibj · 5 years ago
I'm told it's primarily used for "networking stuff and big data processing", but "log processing" and "network switches" were also mentioned as (I assume active) use-cases.
lstamour · 5 years ago
Section 4 of [1] lists applications, including OVN, but it appears to be used primarily for experimental or research projects at the moment. That should change given enough time… maybe?

1: https://github.com/vmware/differential-datalog/blob/master/d...

rrdharan · 5 years ago
ben_pfaff · 5 years ago
Yes, and I'm really impressed with what we've been able to do with it in OVN lately. The patches aren't posted yet (probably Monday) but (with Leonid Ryzhyk's help) I managed to take one benchmark from about 110 seconds in the C version of OVN to about 35 seconds with DDlog.

We have other internal-to-VMware use cases for DDlog. The OVN use case is the one that's easiest to evangelize since it's all open source.

dataflow · 5 years ago
Maybe for hot migration?
rrdharan · 5 years ago
Pretty sure not. Posted in reply to the sibling. This is a VMware research project but looks like the target use case is for OVN:

https://blogs.vmware.com/opensource/2018/07/26/open-source-o...

https://blogs.vmware.com/research/2020/12/07/whats-new-diffe...

https://twitter.com/Ben_Pfaff/status/1047905803250724865

ben_pfaff · 5 years ago
Speaking with my VMware hat on, I don't know of use cases there.