Introduction to Datalog

In the last few months the mention of Datalog has increased, I wondered how it differed from graph databases and found a clear answer in SO [1]. I am not an incumbent but found graph databases and clause approaches interesting.

[1] https://stackoverflow.com/questions/29192927/a-graph-db-vs-a... (2015)

refset · 3 years ago

XTDB, which is mentioned in the post, is subtly different from the other Clojure-based Datalog systems in this respect, because its Datalog engine executes in terms of multi-way joins using a "Worst-Case Optimal Join" implementation that is ideal for graph processing (vs. a tree of binary hash joins). Therefore, based on statistics and query planning heuristics, it will often perform graph pattern matching before resolving the logic/horn clauses. (source: I work on the XTDB team)

eternalban · 3 years ago

Interesting architecture:

https://raw.githubusercontent.com/xtdb/xtdb/master/docs/conc...

Btw, is that 'RocksDB or ?' for the local store current or other storage engines can get plugged in?

p.s. this is datomic's architecture for comparison.

https://docs.datomic.com/on-prem/images/clientarch_orig.svg

felixyz · 3 years ago

I did an interview [1] with Kevin Feeney, one of the founders (no longer active) of TerminusDb, which goes into some depth about the difference between RDF stores and (property) graph databases, where the former is more closely aligned with datalog and logic programming. There are links to a really excellent series of blog posts by Kevin on this topic in the show notes.

[1] https://thesearch.space/episodes/5-kevin-feeney-on-terminusd...

forks · 3 years ago

I love The Search Space. Waiting patiently for new episodes!

westurner · 3 years ago

With RDF* and SPARQL* ("RDF-star" and "SPARQL-star") how are triple (or quad) stores still distinct from property graphs?

RDFS and SHACL (and OWL) are optional in a triple store, which expects the subject and predicate to be string URIs, and there is an object datatype and optional language:

  (?s ?p ?o <datatype> [lang])

  (?subject:URI, ?predicate:URI, ?object:datatype, object_datatype, [object_language])

RDFS introduces rdfs:domain and rdfs:range type restrictions for Properties, and rdfs:Class and rdfs:subClassOf.

`a` means `rdf:type`; which does not require RDFS:

  ("#xyz", a,        "https://schema.org/Thing")
  ("#xyz", rdf:type, "https://schema.org/Thing")

Quad stores have a graph_id string URI "?g" for Named Graphs:

  (?g ?s ?p ?o)

  ("https://example.org/ns/graphs/0", "#xyz", a, "https://schema.org/Thing")

  ("https://example.org/ns/graphs/1", "#xyz", a, "https://schema.org/ScholarlyArticle")

There's a W3C CG (Community Group) revising very many of the W3C Linked Data specs to support RDF-star: https://www.w3.org/groups/wg/rdf-star

Looks like they ended up needing to update basically most of the current specs: https://www.w3.org/groups/wg/rdf-star/tools

"RDF-star and SPARQL-star" (Draft Community Group Report; 08 December 2022) https://w3c.github.io/rdf-star/cg-spec/editors_draft.html

GH topics: rdf-star, rdfstar: https://github.com/topics/rdf-star, https://github.com/topics/rdfstar

pyDatalog does datalog with SQLAlchemy and e.g. just the SQLite database: https://github.com/pcarbonn/pyDatalog ; and it is apparently superseded by IDP-Z3: https://gitlab.com/krr/IDP-Z3/

From https://twitter.com/westurner/status/1000516851984723968 :

> A feature comparison of SQL w/ EAV, SPARQL/SPARUL, [SPARQL12 SPARQL-star, [T-SPARQL, SPARQLMT,]], Cypher, Gremlin, GraphQL, and Datalog would be a useful resource for evaluating graph query languages.

> I'd probably use unstructured text search to identify the relevant resources first.

noduerme · 3 years ago

That's a really neat example of something I'm not familiar with. Going up a tree from child to parent is often the heaviest part of dealing with regular datasets, and usually requires a mix of queries and application logic. The idea of flattening the data along some pattern like that is of course always possible in a relational db, but it's not usually efficient, especially not for heavy writing. Lateral joins and window partitions can help. But this seems like an interesting approach to removing the app code completely.

flyingsilverfin · 3 years ago

I work on TypeDB (https://vaticle.com/typedb), and it sits somewhere at this intersection. The exposed query language has elements of both logic programming constructs and graph-like structures. Both amount to a kind of "constraint" programming.

rapnie · 3 years ago

A quick peek shows it seems along similar lines as TerminusDB sorta kinda, but they have WOQL [0]. At this time I start to worry again about all the different kinds and flavours of query languages that are emerging.

[0] https://en.wikipedia.org/wiki/TerminusDB#Query_language

felixyz · 3 years ago

I really like TypeDB! Haven't been able to use it for anything serious yet, but have a couple of project brewing where it might fit :)

cmrdporcupine · 3 years ago

You might be interested in https://relational.ai/

Treats graph edges as binary relations ("graph normal form"), has a Datalog-ish language. Built for managing large interconnected knowledge sets in a declarative way.

I recommend this talk: https://www.youtube.com/watch?v=WRHy7M30mM4

felixyz · 3 years ago

Great project, not open source alas. This is another great talk about RelationalAI (and its precursor), highlighting how using powerful databases can simplify complex applications: https://www.hytradboi.com/2022/experience-report-building-en...

So I struggled with this.

I guess the intention is to be better than SQL but then I was left with "under which circumstances?".

With that question in mind I didn't feel the article addressed the issue.

The author might do better to think in terms of "what burning problem are we trying to fix and how did we fix it".

ekidd · 3 years ago

> I guess the intention is to be better than SQL but then I was left with "under which circumstances?"

Excellent question.

Two of the most common use cases for databases are "transactional processing" (manipulating small numbers of rows in real time) and "analytical processing" (querying enormous numbers of rows, typically in a read-only fashion).

SQL is generally fine for transactional workloads.

But analytical queries sometimes involve multi-page queries, with lots of JOINs and CTEs. And these queries are often automatically generated.

And once you start writing actual multi-page "programs" in SQL, you may decide that it's a fairly clunky and miserable programming language. What Datalog typically buys you is a way to cleanly decompose large queries into "subroutines." And it offers a simpler syntax for many kinds of complex JOINs.

Unfortunately, there isn't really a standard dialect of Datalog, or even a particular dialect with mainstream traction. So choosing Datalog is a bit of a tradeoff: does it buy you enough, for your use case, that it's worth being a bit outside the mainstream? Maybe! But I'd love to see something like Logica gain more traction: https://logica.dev/

sirwhinesalot · 3 years ago

That's a good point, to be honest I'm not sure there's a particularly good answer to that beyond "it's a much better underlying model". SQL is an ugly bastardization of the relational model, whereas Datalog is a clean application of the logical model to the relational setting.

One thing that's really nice about Datalog is that it puts the focus on modelling relations between data instead of "tables" which often end up with a jumble of data and then need to be normalized to work properly. It pushes you into good structures by default, which is a really good property to have. It's also much much simpler than SQL, no need to think in terms of inner joins and outer joins and whatever else, it is all relations.

It also easily produces derived data from existing data, without needing any kind of procedural process, and is highly composable.

It's not so much that it's solving a burning problem than SQL, it's just better than SQL...

(note: the clojure syntax used in that page is much more confusing than the native Datalog syntax IMO)

zh217 · 3 years ago

Another problem with Clojure-based Datalog is performance. Yet another is you are pretty much tied to the Clojure ecosystem. And I don't really like the Clojure-fused Datalog syntax either. These pains spurred me to write my own: https://docs.cozodb.org/en/latest/ (FOSS), and so far I'm satisified with my own work.

crabbone · 3 years ago

I'm not the author, and I don't know how author would answer this question, but some things are really on the surface:

1. Modularity. In Datalog it's very easy to name a common (sub-)query and reuse it in multiple queries. You can have stored procedures / functions in SQL, but the syntax is very different implementation to implementation and users are typically afraid of using the feature (the SQL/PSM does not go far enough to define what users would normally want). Also, SQL/PSM is an imperative extension of otherwise declarative language. The two just don't work well together.

2. Structures other than primitives and tables (you get lists, vectors, hash-maps and sets, but you can also make your own). A lot of practical SQL extensions also offer some of these structures, but they aren't in the standard.

3. Operations on primitive types fall into the same category / work in the same way as operations on complex types. I.e. select / update / insert / delete in SQL only apply to tables, but strings or numbers don't work in the same way. It's more uniform in Datalog (not 100%, but still better).

conor-23 · 3 years ago

A researchy perspective: Datalog was invented to extend relational algebra with recursion. Since it started out as an academic tool, people have been studying recursion-specific optimizations you can do for decades so it is extremely well suited to recursive use-cases e.g. iterative graph algorithms. Using Datalog for network algorithms won the thesis award in databases almost 20 years ago https://boonloo.cis.upenn.edu/papers/boon_interview.pdf .

tejtm · 3 years ago

This is the answer I subscribe to. CTEs and recursive CTEs are SQL's answer to a limitation of plain relational algebra; no loops.

CTEs are a great and most welcome addition to SQL but they are a bolt-on patch as compared with Datalog where it is a core feature.

felixyz · 3 years ago

> Datalog was invented to extend relational algebra with recursion.

I'm not sure that is exactly right. Do you have a reference? (Not trying to put you on the spot, I'm just curious to learn the history!)

mmcdermott · 3 years ago

> I guess the intention is to be better than SQL but then I was left with "under which circumstances?".

The way I see it, SQL's great strength as a query language is selection and projection whereas Datalog's great strength is inference.

It takes a shift in mindset to take advantage of inference. I've started using Datalog when analyzing unfamiliar codebases. So I might set up something like:

    writes_to(microservice1, db1).
    writes_to(microservice2, microservice1).
    writes_to(microservice3, microservice2).

    depends_on(X, Y) :- writes_to(X, Z), depends_on(Z, Y).
    depends_on(X, Y) :- writes_to(X, Y).

Allowing you to query something like:

    depends_on(microservice3, X)

and get back db1, microservice1 and microservice2. Of course, you can ultimately do this in SQL as well, but it's far more compact in Datalog.

Which brings me to the second half of the motivation - the principal advantage to the logical database model is the open world assumption. The relational model in practice tends to be fairly buttoned down - data is expected to suit a particular schema (which has its own advantages as a transactional system). This makes it easy to extend a logical database and ask more questions without changing the fact formats already specified.

Of course, I can ultimately do all the things that feel natural in Datalog in SQL. I can work through queries like the one above by building out the data model and writing recursive CTEs. It's about strengths, not possibilities which I suppose brings me back to why SQL 'won'. A disproportionate amount of code is written for transactional line-of-business systems. It's really not hard to see why the validation layer and transactional focus would win in those areas.

yaantc · 3 years ago

In 2021 Google introduced Logica, a Datalog variant. In their introduction blog they contrast it with SQL, so this may answer your question:

https://opensource.googleblog.com/2021/04/logica-organizing-...

In short, it's easier to compose (and decompose), which helps with complex queries that can be assembled from simpler, independently tested parts.

No personal experience in this, I just remembered that blog contrasting Datalog and SQL.

dgb23 · 3 years ago

One of the biggest advantages is re-use, extensibility and being robust to change, because you define things bottom up. Think of aggregates as "hashmaps with typed keys" rather than "structs with places in them".

Aggregates are implicit groupings, not explicit slots. This wards you from risk of change and speeds up your feedback loop of making changes and experimenting with ideas. Plus you can compose new aggregate variants out of the same primitives (attributes) which is a fairly typical use-case.

Another is query composition. I'm more familiar with SQL and I'm somewhat comfortable with writing nested queries. But we all know how both query builders and similar are often hard to re-use and compose. With these datalog dialects, composition, refactoring and extraction of logical units is much more straight forward, even as a beginner. Queries are "flat", and consist clauses that can be moved around and composed. Think of having more associative and commutative operations.

On top of these, the mentioned technologies (in the article) are temporal and support time travel out of the box. That's something you can do on top of SQL of course but it's quite powerful to have it in-built as a primary feature. Whenever you have an application that does things like (non-exclusive), reporting, undo, revisions etc. You might want that.

anon291 · 3 years ago

Sql doesn't compose at all. You can't take two sql programs and combine them with a single operator and get a new program.

In datalog you can. Either by using the comma or semicolon operator.

crabbone · 3 years ago

You are thinking about the declarative part. SQL/STP, the standard that's the basis for definition of the imperative part of SQL (stored procedures) can compose in a similar way (eg. you can call a stored procedure from a stored procedure), but the standard doesn't go far enough, and so every practical database has its own extensions which deters people from using stored procedures because they'd end up with non-portable code.

felixyz · 3 years ago

> then I was left with "under which circumstances?"

Foremost, Datalog has great compositionality, which is one of SQL:s big weaknesses (it was designed with completely different goals). Also, things like recursive queries are completely natural in Datalog, whereas in SQL they are bolted on with recursive common table expressions.

refset · 3 years ago

> Datalog has great compositionality, which is one of SQL's big weaknesses

There's a good write-up on this aspect (amongst others) here: https://www.scattered-thoughts.net/writing/against-sql/

> recursive queries are completely natural in Datalog, whereas in SQL they are bolted on with recursive common table expressions

YeGoblynQueenne · 3 years ago

>> The author might do better to think in terms of "what burning problem are we trying to fix and how did we fix it".

I have no idea who first came up with the name "datalog", and what exactly was their motivation (I've looked, but couldn't find the original reference to datalog), but the burning problem that datalog does fix is Prolog's semi-decidability, or in other words, its tendency to enter infinite recursions.

One half of the reason for that is Prolog's "compound terms" (known as functions, in First-Order Logic terminology) and the other half is Prolog's use of depth-first search for evaluation. Functions, along with logic variables, mean that a predicate's Herbrand base (its set of logical atoms) can be expanded forever: if f(x) is a term, f(f(x)) is a term, f(f(f(x))) is a term... and so on, ad infinitum. The depth-first search evaluation just gets stuck in left-recursions when the first body literal of a clause is a recursive literal: p(x):- p(x), q(x) loops forever, in Prolog with depth-first search.

Datalog is Prolog without functions other than constants, and it can be evaluated "bottom up", both of which overcome Prolog's semi-decidability (but eliminate its completeness).

But of course this is not a very clear motivation for its use as a database language, only its use as an alternative to Prolog.

More to the point, there are many datalog variants. Some that accept negation, some that do not, some that allow recursion, some that do not, and so on. There's a lot of information on different datalogs, seen from the point of view of database research in this free ebook on databases:

http://webdam.inria.fr/Alice/

See chapters 12 - 15.

And see these lecture notes for a more logic-programm-y discussion of fixpoint semantics of definite logic programs:

https://www.doc.ic.ac.uk/~mjs/teaching/KnowledgeRep491/Fixpo...

falsissime · 3 years ago

> but the burning problem that datalog does fix is Prolog's semi-decidability, or in other words, its tendency to enter infinite recursions.

Prolog's built-in search is not semidecidable. https://en.wikipedia.org/wiki/Decidability_(logic)#Semidecid...

As you observe, Prolog gets stuck in left-recursions. But iterative deepening for Prolog is semidecidable (among other fair methods). This indeed is restricted to the pure, monotonic subset of (full) Prolog together with all pure, monotonic extensions.

bambataa · 3 years ago

I had the exact same reaction.

I did a bit of reading about Datalog and I get that it’s useful in things like static analysis where you get the fixed point analysis for free, but apart from that I am not really sure why I’d use it.

maweki · 3 years ago

The main reasons I'd say are the very intuitive JOINs and queries. Like you give some partial relation and you get every "fit". So as long as you mostly do equalities in JOIN and WHERE, the queries will be very intuitive and obvious to the layperson.

If you work with VIEWs or CTEs a lot, then Datalog is your friend, as every derived relationship is just a VIEW. If you like to encapsulate and reuse queries, then Datalog is ideal. We had plenty discussions on HN about SQL code reuse. Not an issue in Datalog.

And lastly, there are many ways to compile Datalog programs to "executables" that take maybe a few CSV-files as input and give you the results. I'm not saying that this would be always faster than loading the CSVs into SQLite and running SQL against the data, but a lot of work has been done on Datalog query optimization and compilers can emit very efficient code.