I have seen it used to mean WAL before, so I am taking this with a dose of skepticism.
- The order of log entries does not matter.
- Users of the log are peers. No client / server distinction.
- When appending a log entry, you can send a copy of the append to all your peers.
- You can ask your peers to refresh the latest log entries.
- When creating a new entry, it is a very good idea to include a nonce field. (I use a nano ID for this along with a timestamp; together they are probabilistically unique. See the sketch after this list.)
- If you want to do database-style queries of the data, load all the log entries into an in-memory database and query away (see the sqlite3 sketch below).
- You can append a log entry containing a summary of all log entries you have so far. For example: you’ve been given 10 new customer entries. You can create a log entry of “We have 10 customers as of this date.”
- When creating new entries, prepare the entry or list of entries in memory and let the user edit/revise them as a draft; when the user clicks “Save”, the entries enter the permanent record.
- To fix a mistake in an entry, create a new entry that “negates” it (also sketched below).
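A minimal sketch of what an entry, a negation, and an append might look like, assuming JSON-lines storage on disk; the function names are made up, and `secrets.token_urlsafe` plus a timestamp stands in for a nano ID:

```python
import json
import secrets
import time

def new_entry(payload: dict) -> dict:
    # The nonce (random token + timestamp) makes each entry
    # probabilistically unique, so peers can deduplicate copies
    # that arrive over multiple paths.
    return {
        "id": f"{secrets.token_urlsafe(12)}-{time.time_ns()}",
        "payload": payload,
    }

def negate(entry: dict) -> dict:
    # Never mutate the log: to fix a mistake, append a new entry
    # that points at the one it cancels.
    return new_entry({"negates": entry["id"]})

def append(log_path: str, entry: dict) -> None:
    # Append-only: one JSON object per line, never rewritten.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```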
A lot of parallelism / concurrency problems just go away with this design.
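To make the “load everything into an in-memory database” step concrete, here is a sketch using Python’s built-in sqlite3. The entry shape and the `type` field are assumptions carried over from the sketch above, and `json_extract` requires a SQLite build with the JSON1 extension (the default in modern builds):

```python
import json
import sqlite3

def query_log(log_path: str) -> int:
    # Load every log entry into an in-memory database, then query away.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE entries (id TEXT PRIMARY KEY, payload TEXT)")
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            e = json.loads(line)
            # INSERT OR IGNORE deduplicates entries received from
            # several peers, courtesy of the nonce in "id".
            db.execute("INSERT OR IGNORE INTO entries VALUES (?, ?)",
                       (e["id"], json.dumps(e["payload"])))
    # Example query: count customer entries, e.g. to feed a summary
    # entry like "We have 10 customers as of this date."
    (count,) = db.execute(
        "SELECT count(*) FROM entries "
        "WHERE json_extract(payload, '$.type') = 'customer'").fetchone()
    return count
```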
I am wondering how portable the Parquet format is, and how interchangeable it is now.
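For what it’s worth, basic interchange already works well across engines. A minimal sketch, with made-up file and column names, writing with Arrow and reading the same file back with DuckDB:

```python
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

# Write a Parquet file with one engine...
pq.write_table(pa.table({"customer": ["a", "b", "a"],
                         "amount": [10, 20, 30]}), "sales.parquet")

# ...and query the same file with another, no conversion step needed.
print(duckdb.sql(
    "SELECT customer, sum(amount) AS total "
    "FROM 'sales.parquet' GROUP BY customer"))
```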
There are the likes of Comet and Blaze, which replace Spark's execution backend with DataFusion, and then you have single-process alternatives like Sail trying to settle into the "not so big data" category.
I am watching the evolution of projects powered by DataFusion and compatible with Spark with a keen eye. Early days, but quite exciting.
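For context, this is roughly what the single-process experience looks like via the datafusion Python package (file and table names here are made up); the Spark-compatible layers mentioned above build on an engine like this:

```python
from datafusion import SessionContext

# DataFusion runs entirely in-process: no cluster, no JVM.
ctx = SessionContext()
ctx.register_parquet("sales", "sales.parquet")
ctx.sql("SELECT customer, sum(amount) AS total "
        "FROM sales GROUP BY customer").show()
```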
I am quite curious about the plans for a Python dataframe-like API for DuckDB, and for the Python ecosystem in general.
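DuckDB's Python relational API already gives a lazy, dataframe-like feel; a small sketch with made-up names (treat the exact method names as an assumption about current duckdb releases, not a spec):

```python
import duckdb

# Build the query lazily, method by method, like a dataframe API...
rel = duckdb.read_parquet("sales.parquet")
rel = rel.filter("amount > 10").aggregate("sum(amount) AS total", "customer")
rel.show()

# ...or drop to SQL and pull the result out as a pandas DataFrame.
df = duckdb.sql("SELECT count(*) AS n FROM 'sales.parquet'").df()
```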