Implementing a Merkle Tree for an Immutable Verifiable Log

let $node := jn:doc('mycol.jn','mydoc.jn')=>fieldName[[1]] let $result := for $node-in-rev in jn:all-times($node) return if ((not(exists(jn:previous($node-in-rev)))) or (sdb:hash($node-in-rev) ne sdb:hash(jn:previous($node-in-rev)))) then $node-in-rev else () return [ for $jsonItem in $result return { "node": $jsonItem, "revision": sdb:revision($jsonItem) } ]

I knew that Go had a checksum database but not that it had a tamper-proof log. This seems like something that other package management systems should do as well.

gumby · 4 years ago

What do you think your git repo is?

colburnmh · 4 years ago

True to an extent--meaning mostly true in practice.

You can mutate the Git data if you chose to using commands like "filter-branch". "filter-branch" isn't used frequently, since it causes issues with every up/down-stream replica if the data has been pushed/pulled, but it is possible. But, even some commonly used commands like "amend", "rebase", and "squash" cause limited data mutations which are broadly considered appropriate and useful.

skybrian · 4 years ago

How often do packages refer to their dependencies using git hashes? Very seldom in my experience, and for good reason. (Unless you're using git submodules, which is not the usual way.)

Go's checksum server does something different by making sure that the names used in module files refer to the same things for different people. Also, it works even if some packages don't come from git repos.

jacques_chester · 4 years ago

I know that at least RubyGems, NPM, PyPI and Maven Central are in various stages of using Sigstore for signing and transparent logging.

How is this different from a blockchain (serious question)?

greiskul · 4 years ago

A blockchain is a merkle tree with a distributed consensus algorithm.

charcircuit · 4 years ago

A blockchain is just a data structure. A consensus algorithm can agree what the current state / last block of a block chain is, but they are separate.

Similarly if one says B tree or LSM tree that doesn't mean they are also talking about a consensus algorithm.

Deleted Comment

Groxx · 4 years ago

As others have pointed out: very little is different. Depending on if you include consensus in your use of "blockchain", there may be no difference whatsoever.

Currently it's a massively popular buzzword that frequently means very little in reality. It's just being tacked onto everything whether or not it makes sense because e.g. the Long Island Iced Tea company switched their name to "Long Blockchain" and immediately had massive stock gains: https://arstechnica.com/tech-policy/2017/12/iced-tea-company... . Even in the cases where it is technically accurate, it's rarely a good or useful idea... but it helps get funding.

So in practice, "blockchain" currently means "you will magically get rich". Technically it's almost always a Merkle tree, or a less efficient structure (e.g. Bitcoin's core "chain" is basically a linear version, which is dramatically less efficient to verify to the root for any given block... because it doesn't need that quality. Though it also uses Merkle trees within each block).

a1369209993 · 4 years ago

A blockchain provides a guarantee of (statisical eventual) write availability, since anyone can mine new blocks (eventually). A verifiable log only accepts new entries from a limited, generally closed set of writers (usually just one), and you can only add new entries by going through one of those writers. (Blockchains also tend to have better read availability due to being more widely distributed, but there's no reason why you couldn't destribute a verifiable log that widely, it just tends not to happen in practice.)

This makes a verifiable log unsuitable for financial purposes since a adversary with lead pipe capabilities (or corrupt judge capabilities as the case may be) can block undesirable transactions at a single (or few) point of failure, whereas against a blockchain they'd have to target anyone with significant compute capacity.

On the other hand, verifiable logs don't need proof of work/stake/etc to limit hostile forking, so if the log is describing things rather than acting a ground source of truth (so you can just ignore it if it starts refusing writes), it's much more efficient than a blockchain.

MrStonedOne · 4 years ago

None of those things are features of blockchains, they are all features of cryptocurrencies.

Blockchains refer to any time you use a chain of hashes to prove a block of data and any data that came before it, has not been mutated.

The time keeper service that takes out classified ads in the new york times with hash blocks for the documents they timestamped to prove they were timestamped before a certain date, is a blockchain, because they include the previous day's hash.

Git, is a blockchain.

It's just hashes all the way down.

The random entropy driver in linux, is technically a blockchain, with how it works.

Deleted Comment

est31 · 4 years ago

There is no token/currency associated to it, and it's made for a single specific purpose only, compared to most blockchains which try to run programs, manage wallets, etc. It also doesn't have a decentralized consensus algorithm, nor is there a gossip network to inform about new transactions that are to be added to the log. It's centralized per design.

Which of these differences you see as important to the blockchain / non-blockchain distinction, depends on the precise definition for blockchain that you want to use.

That being said, even the CT website uses the term "ledger" to explain what the logs are: https://certificate.transparency.dev/howctworks/

radicalbyte · 4 years ago

The term "ledger" pre-dates crypto. An append-only log is a ledger if the writers are trusted.

lichtenberger · 4 years ago

We're also storing a similar hash tree (a hash is stored in each node -- an insert/update/delete triggers a recomputation of the hash in all ancestor nodes) to verify if a subtree has changed in SirixDB[1]. However, only pointers to neighbour nodes as well as the content is used for recomputation of the ancestor hashes instead of all children during updates. Furthermore checksums of child page-fragments are stored in parent references.

The hashes are used for fast diff calculations along with an optional DeweyID node scheme to quickly get diffs in a resource in preorder or to check if a subtree has changed without fetching all ancestor nodes (due to the stored DeweyIDs).

For instance you can simply check for updates of a node in a JSON structure with the following query:

It iterates through all revisions of a node and returns the node and the revision (and we could also add the author) when it has been updated.

To make this even easier I could write a native Java function for use in the query.

[1] https://sirix.io

j-pb · 4 years ago

We do something similar to accelerate set operations over the indices of our in-memory immutable binary triple store.

Since our indices are sets, we only have to compute the hashes for the leaves and can then combine them via XOR in our inner nodes. Due to the self inverse nature of XOR we can then efficiently maintain them over the operations.

Normally this would be a big no no, but since we are using sets, we have the invariant that hashes will never be the same for two different leaves. (Under the assumption of no hash collisions, which is justified for the 128bit SipHash that we use.)

I'm using a rolling hash with SHA265 but only take 128 bits, as it should be enough to avoid collisions. Leaf node hashes are computed on demand, only inner node hashes are stored persistently.

Can you elaborate a bit more? You mean bitset indexes?

bradrn · 4 years ago

Just out of curiosity, what language is this? I don’t recognise it at all.

Basically JSONiq, with a few minor syntax differences.

Our query engine/compiler is and can be used by other data stores as well:

http://brackit.io

londons_explore · 4 years ago

> By using a tamper-evident to store compliance records, you can keep them in one place and simplify presenting them to an auditor. You can cryptographically prove they haven't been tampered with.

Is this realistically the case? Won't most auditors instead say "Microsoft Access with a password is fine. But your homegrown cryptographic black box we can't trust."?

jaboutboul · 4 years ago

“Recently I'd been reading on the application of immutable databases for tracking changes in sensitive data, and for auditing purposes.”

Pretty sure he is referring to https://immudb.io.

SemanticStrengh · 4 years ago

there is also https://aws.amazon.com/qldb/ Thoses tech make blockchain obscolete for many uses.

Yes could be. We should point out that QLDB isn’t open source and only runs as an AWS service.

losteric · 4 years ago

Isn't QLDB a blockchain tech? Albeit a real-world practical one.

cryptonector · 4 years ago

Or Git. Or...

This is a common pattern now.

TedDoesntTalk · 4 years ago