Bit thin on details and not looking like they’ll open source it, but if someone clicked the post because they’re looking for their “replace ES” thing:
Both https://typesense.org/ and https://duckdb.org/ (with their spatial plugin) are excellent for geo performance; the latter now seems really production-ready, especially when the data doesn’t change that often. Both are fully open source, including clustered/sharded setups.
These are great projects, we use DuckDB to inspect our data lake and for quick munging.
We will have some more blog posts in the future describing different parts of the system in more detail. We were worried too much density in a single post would make it hard to read.
These are great. I am eternally grateful that projects like this are open source. I do, however, find it hard to integrate them into my own projects.
A while ago I tried to create something that has duckdb + its spatial and SQLite extensions statically linked and compiled in. I realized I was a bit in over my head when my build failed because both of them required SQLite symbols but from different versions.
Good point, and that claim was mostly re Typesense (can't edit the comment anymore).
But given that duckdb handles "take this n GB parquet file/shard from a random location, load it into memory and be ready in < 1 sec" very well I'd argue it's quite easy to build something that scales horizontally.
We use it for both the importer pipeline that processes the 2B row / 200GB compressed GBIF.org parquet dataset and queries like https://www.meso.cloud/plants/pinophyta/cupressales/pinopsid... and the sheer number of functions[1] beyond simple stuff like "how close is a/b to x/y" or "is n within area x" is just a joy to work with.
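For anyone wondering what those two "simple stuff" checks boil down to, here is a minimal pure-Python sketch. It is not DuckDB's spatial API and not how DuckDB implements anything; the function names are made up for illustration, and it assumes plain lat/lon points plus a simple polygon given as a list of vertices.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def point_in_polygon(x, y, polygon):
    """Ray casting: True if (x, y) lies inside polygon [(x0, y0), (x1, y1), ...]."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```

A real spatial engine adds indexing, projections and robust edge-case handling on top of this; the sketch only shows the geometry behind "how close" and "is it within".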
Can you share what makes it better than competitors? And what's great about the dev experience? Did you use their cloud offering? The marketing material looks great, but I want to hear a user's experience.
Slightly meta, but I find it's a good sign that we're back to designing and blogging about in-house data storage systems/query engines again. There was an explosion of these in the 2010s, which seemed to slow down/refocus on AI recently.
It slowed down not because of AI, but because it turned out it was mostly pointless. Highly specialized stacks that could usually be matched in performance by tweaking an existing system or scaling a different way.
In-house storage/query systems that are not a product being sold by itself are NIH syndrome by a company with too many engineering resources.
I don't disagree that rock solid is a good choice, but there is a ton of innovation necessary for data stores.
Especially in the context of embedding search, which this article is also trying to do. We need databases that can efficiently store/query high-dimensional embeddings and handle the nuances of real-world applications, such as filtered ANN. There is a ton of innovation in this space and it's crucial to powering the next-generation architectures of just about every company out there. At this point, data stores are becoming a bottleneck for serving embedding search, and I cannot overstate how important advancements here are for enabling these solutions. This is why there is an explosion of vector databases right now.
This article is a great example of where the actual data-providers are not providing the solutions companies need right now, and there is so much room for improvement in this space.
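To make "filtered ANN" concrete: the query is "nearest vectors to this embedding, but only among items matching a metadata filter". Below is a minimal exact (brute-force) baseline, not an approximate index and not anything from the article; the item layout and function names are assumptions for illustration. The hard part real systems face is getting the same answer without scanning everything.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filtered_top_k(query, items, predicate, k):
    """Exact pre-filtered search.

    items: iterable of (item_id, vector, metadata) tuples (hypothetical layout).
    predicate: function metadata -> bool; only matching items are scored.
    Returns the ids of the k most similar matching items, best first.
    """
    scored = [
        (cosine_similarity(query, vector), item_id)
        for item_id, vector, metadata in items
        if predicate(metadata)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]
```

Usage would look like `filtered_top_k(query_vec, items, lambda m: m["country"] == "de", 10)`. An ANN index typically breaks down when the filter is very selective, which is the nuance the comment is pointing at.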
Agreed. The only caveat to that being a global rule is: 'At scale in a particular niche, even an excellent generalist platform might not be good enough'
But then the follow-on question is: "Am I really suffering the same problems that a niche, already-scaled business is suffering?"
A question that is relevant to all decision making. I'm looking at you, people who use the entire react ecosystem to deploy a blog page.
Lol I "love" that the first benefit this company lists in their jobs page is "In-Office Culture". Do people actually believe that having to commute is a benefit?
You can't reduce the in-office or remote experience purely to commuting. It's just one aspect of how and where you work, and of work-life balance in general.
But since you asked, yes, I actually enjoy commuting when it is less than 30 minutes each way and especially when it involves physical activities. My best commutes have been walking and biking commutes of around 20-25 minutes each way. They give me exercise, a chance to clear my head, and provide "space" between work and home.
During 2020, I worked from home the entire time and eventually I found it just mentally wasn't good for me to work and live in the same space. I couldn't go into the office, so I started taking hour long walks at the end of every day to reset. It helped a lot.
That said, I've also done commutes of up to an hour each way by crowded train and highway driving and those are...not good.
I don't get this. This idea that 'work life balance' should mean that the two should be compartmentalised to specific blocks of time seems counterproductive to me. To me it feels like an unnatural way of living. 8 hours in which I should only focus on work, 8 hours I should focus on everything else followed by 8 hours of sleep. I don't think that is how we are supposed to operate. Even the 8 hours of sleep in one block is not natural and a recent invention. Before industrialisation people used to sleep in multiple blocks (wikipedia: polyphasic sleeping)
The idea that you have to be 'on' for 8 hours at a time seems extremely stressful to me. No wonder you need an hour afterwards just to unwind. Interleaving blocks of work and personal time over the day feels much more natural and less stressful to me. WFH makes this possible. If I'm stuck on something, I can do something else for a while, maybe even take a short nap. The ability to focus and do mentally straining work comes in waves for me. Being able to go with my natural flow makes me both happier, more relaxed and more productive.
The key to work/life balance to me is not stricter separation but instead better integration.
This is part of the company culture. If the company respects the boundary between work and personal life, and it's a cultural value, then it shouldn't be a problem for you establishing a space even without going to the office. You just close down your work laptop, put it aside and open it up next time when it's time to work again. Of course, there's stuff like on-call shifts, and there's a temptation to just stay later and finish this one thing, but if the company culture does not expect you to be tethered to work 24x7 then it's doable. If the culture is right, you don't need a physical barrier for this to be doable.
> so I started taking hour long walks at the end of every day to reset. It helped a lot.
A good habit. I don't see why any remote worker couldn't do that.
1. It's extremely cold and dark! I must wear extra clothes when going inside and I get depressed at wasting a day of nice weather in what looks like a WW1 bunker.
2. Terrible accessibility for disabled people! (such as myself)
3. Filthy toilets!
4. Internet is slower than at home!
5. Half the team lives somewhere else so all meetings are on Teams anyway!
6. They couldn't afford a decent headset so I get pain in my head after 5 minutes, but I don't have a laptop so I can't move to a meeting room.
The HR really can't understand why after all these great perks I insist on wanting to work from home. I am such an illogical person!
> Do people actually believe that having to commute is a benefit?
Everything is subjective here. I don't love commuting, but I'm remote now and there are days I kind of miss it. I got a lot more podcast listening in when I commuted, which I really do miss, and I enjoyed getting out of the house, on a schedule, and seeing my city and area.
As for BEING in the office, yes I also miss that. I miss the friendships with people from other parts of the org that I made; I miss the getting together at lunch and talking about both work and non-work stuff; I miss the pinball machines that one enthusiast set up.
THAT SAID, I abhor the _requirement_ to be in an office; it's a top down, heavy handed, hamfisted attempt at trying to force something that IMO can only come naturally, under the guise of "CuLtUrE!", and unless forced to I won't consider any job that requires it. (NB: This, too, is a tradeoff - if it's close to my house and I've got some latitude as to what time to make it there so I can have some freedom to avoid the heaviest of traffic, sure.)
This is just another example of the "open office" concept. When that came out everyone hated it except for the C-suite that didn't have to do it, under the mistaken idea that it forces "collaboration, which is good", when the reality was that the "good" part was emergent, holistic, and natural, and any forcing function kills it. But of course we also know that it was nothing but a cost-savings issue, and the "collaboration" argument was a gaslight retcon of the highest order. Open offices actually worked when PART of the office was open, allowing collaboration _as needed_ and driven by the teams/groups that wanted to do it, not by management. RTO is exactly the same.
This article is lacking detail. For example, how is the data sharded, how much time between indexing and serving, and how does it handle node failure, and other distributed systems questions? How does the latency compare? Etc. etc.
Author here! We were really motivated to turn a "distributed system" problem into a "monolithic system" from an operations perspective and felt this was achievable with current hardware, which is why we went with in-process, embedded storage systems like RocksDB and Tantivy.
Memory-mapping lets us get pretty far, even with global coverage. We are always able to add more RAM, especially since we're running in the cloud.
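As a rough illustration of why memory-mapping goes a long way: with fixed-size records on disk, a lookup is just an offset calculation, and the OS pages in only what gets touched. This is a toy sketch with a made-up record layout (two float64 values per record), not the actual RocksDB/Tantivy file formats.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout: each record is (lat, lon) as two little-endian float64s.
RECORD = struct.Struct("<dd")


def write_records(path, points):
    """Write (lat, lon) pairs back to back as fixed-size binary records."""
    with open(path, "wb") as f:
        for lat, lon in points:
            f.write(RECORD.pack(lat, lon))


def read_record(path, index):
    """Fetch one record by index through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # No explicit read of the whole file: only the touched page is faulted in.
            return RECORD.unpack_from(mm, index * RECORD.size)
```

In a long-running server you would keep the map open instead of reopening it per lookup; the per-call open here just keeps the example short.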
Backfills and data updates are also trivial and can be performed in an "immutable" way without having to reason about what's currently in ES/Mongo, we just re-index everything with the same binary in a separate node and ship the final assets to S3.
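One way to picture that "immutable" re-index flow is: build a complete new index in a fresh location, then flip a pointer to it in one atomic step. The sketch below uses a local directory as a stand-in for S3 and a JSON file as a stand-in for the real index; the names (`publish`, `CURRENT`, and so on) are hypothetical, not from the article.

```python
import json
import os
import tempfile


def build_index(records, out_dir):
    """Stand-in for the real indexer: writes a complete, self-contained index."""
    os.makedirs(out_dir)  # fails if the version already exists, so builds never overwrite
    with open(os.path.join(out_dir, "index.json"), "w") as f:
        json.dump(sorted(records), f)


def publish(root, version, records):
    """Build into a fresh versioned directory, then atomically repoint CURRENT."""
    build_index(records, os.path.join(root, version))
    tmp = os.path.join(root, "CURRENT.tmp")
    with open(tmp, "w") as f:
        f.write(version)
    # os.replace is an atomic rename when both paths are on the same filesystem.
    os.replace(tmp, os.path.join(root, "CURRENT"))


def load_current(root):
    """Return (version, index contents) for whatever CURRENT points at."""
    with open(os.path.join(root, "CURRENT")) as f:
        version = f.read().strip()
    with open(os.path.join(root, version, "index.json")) as f:
        return version, json.load(f)
```

Readers never see a half-written index, and rolling back is just pointing CURRENT at the previous version, which is still on disk.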
Why not just use an open source solution like paradedb...?
Paradedb = postgres pg_search plugin (the base is tantivy). Need anything else like vectors or whatever, get the plugins for postgres.
The only thing you're missing is an LSM solution like RocksDB. See Orioledb, which is supposed to become a pluggable storage engine for postgres but is not yet out of beta.
In my experience, the care and feeding that goes into an Elasticsearch cluster feels like it's often substantially higher than that involved in the primary data store, which has always struck me as a little odd (particularly in cases where the primary data store is an RDBMS).
I'd be very happy to use simpler more bulletproof solutions with a subset of ES's features for different use cases.
To add another data point: After working with ES for the past 10 years in production, I have to say that ES has never given us any headaches. We've had issues with ScyllaDB, Redis etc. but ES is just chugging along and just works.
The one issue I remember: early on, with ES 5, it regularly went down. It turned out that some _very long_ input was being passed into the search by some scraper and killed the cluster.
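The usual guard against that kind of failure is to bound the input before it ever reaches the search backend. A minimal sketch, assuming a hypothetical limit and hypothetical names (nothing here is an Elasticsearch setting):

```python
MAX_QUERY_CHARS = 256  # hypothetical limit; tune to what real users actually type


class QueryTooLong(ValueError):
    """Raised when a search query exceeds the configured length limit."""


def sanitize_query(raw, max_chars=MAX_QUERY_CHARS):
    """Collapse whitespace and reject empty or oversized queries."""
    query = " ".join(raw.split())
    if not query:
        raise ValueError("empty query")
    if len(query) > max_chars:
        raise QueryTooLong(f"query has {len(query)} chars, limit is {max_chars}")
    return query
```

Rejecting rather than silently truncating makes scraper traffic show up in error metrics instead of hiding as slow queries.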
I'm interested in this detail because a few years back I was involved in a major big data project at a health insurance company and I cooked up a workable solution that involved Elasticsearch, only for it to be shot down. It was political: they had to do it with Kafka, full stop. The problem was, at that time, Kafka wasn't very mature and it wasn't a good solution for the problem, regardless. So our ES version got shelved.
Nice... it's cool to see how different companies are putting together best fit solutions. I'm also glad that they at least started out with off the shelf apps instead of jumping to something like a bespoke solution early on.
Quickwit[1] looks interesting, found via the Tantivy reference. Kind of like ES w/ Lucene.
No affiliation at all, just really happy camper.
[1] https://duckdb.org/docs/stable/core_extensions/spatial/funct...
You can also attach DuckDB to Apache Flight which will make it work beyond local operation.
It's a mini-revolution in the OSM world, where most apps have a bad search experience where typos aren't handled.
https://github.com/komoot/photon
Learning from smart people, making friends, free food and drinks, a DDR machine
My last office job had none of that. Instead it was just sort of like a depressing scaled up version of my home office
Feels like people reinvent the wheel very often.
[1] https://github.com/quickwit-oss/quickwit