I wonder if they hired an SEO company and said "Hey, we want to be on top of the organic front page results for any Hadoop ecosystem related searches" and paid them a bunch of money for this.
Sure, Presto/Trino has better market share, but Drill is still a useful project. I think the main issue is one of marketing - Presto has it, and Drill does not.
The other issue I've seen is lacking documentation. I've been a Drill committer for several years and co-authored the O'Reilly book, Learning Apache Drill. I'm STILL finding undocumented functionality in Drill.
For me, it's the ease of use and versatility. Having tried the others (Presto, Impala, Dremio), none really work as well for my use cases. The fact that I can write a query to join data from an API with a PCAP, with zero schema prep, is really incredible.
I've also always appreciated that Drill uses standard SQL rather than a proprietary dialect of SQL. That is a huge win because it means the learning curve is much lower.
Bottom line, a user can unzip Drill and with very minimal work, explore their data no matter where it is, with minimal config. This works as well on a laptop as a big cluster.
A wide variety of sources queryable through standard SQL, including some not found in alternatives. The competition for querying the common data sources is, as you note, stiff! We want to do those well, but also chase the long tail of more obscure systems, interfaces and formats that are out there and still contain stuff people are interested in. It even works for people who'd like to run a query or two before they're required to declare a schema, though admittedly this latter trick is helpful for exploring but no magic wand once you need to start doing computation (and data types really do become indispensable).
For me, Drill is the best fit, for many reasons.
1. Its simplicity - Simple installation, simple to use, smaller learning curve. Works both in stand alone and in cluster mode.
2. Documentation - Simple and to the point documentation.
3. Support for multiple data sources - It supports almost all required data sources and uses the same standard SQL to query all of them. The best part is being able to join multiple data sources (different databases of different types) in a single query.
4. Active community - Raise the question and someone is there to help you resolve the issue.
5. Integration options - It provides both JDBC and a REST API to connect to the Drill data store and access it.
So for my use case this really fits, and I guess it will for others too.
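To make the REST option a bit more concrete, here is a minimal Python sketch, not an official client. It assumes a Drill instance on the default web port 8047 and posts SQL to the `/query.json` endpoint. The join query itself is purely hypothetical: the `mongo` and `dfs` plugin names, the paths and the column names are placeholders for whatever storage plugins you have configured.

```python
import json
import urllib.request

# Assumption: Drill is running locally with its web/REST port at the default 8047.
DRILL_URL = "http://localhost:8047/query.json"

# Hypothetical cross-source join. Plugin names, paths and columns are
# placeholders, not something taken from a real deployment.
QUERY = """
SELECT u.name, o.total
FROM mongo.shop.`users` AS u
JOIN dfs.`/data/orders.parquet` AS o ON u.id = o.user_id
"""


def build_request(sql, url=DRILL_URL):
    """Build the POST request that carries a SQL query to Drill."""
    body = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL):
    """Send the query and return the result rows (needs a reachable Drill)."""
    with urllib.request.urlopen(build_request(sql, url)) as resp:
        return json.load(resp).get("rows", [])
```

With a running instance, `run_query(QUERY)` would return the rows as a list of dicts; without one, `build_request` can still be inspected on its own.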
> so you can't fault anyone to think the project is dead.
Why? It has regular releases, including one in 2021. That looks like the opposite of dead to me. The only time there were more frequent releases was 2018, when it was new, so the release page by itself screams "mature and stable".
I always judge life by looking at how regular the releases are rather than the frequency. If they release every year or two or three years consistently for the last few years then that's a good enough sign of life to me. Everyone goes at a different pace.
The last release (Drill 1.19.0) was a very significant one. First, it was the first release without MapR. Second, it included a lot of refactoring and some major improvements, including connectivity with Elasticsearch, Splunk and Cassandra, as well as readers for XML files and more.
The next Drill release will happen before the end of the year and also promises to be a big one. Some significant improvements are major rewrites of the Mongo connector, the ability to write to JDBC data sources, an Iceberg reader, Parquet updates and more.
I used Apache Drill to query JSON Lines log files stored in Azure Blob. It is very easy to configure and run. I used it in embedded mode for not-so-big queries and some visualisation in Apache Superset. It worked really well. I created some views in Parquet to speed it up.
Be aware, there is no such thing as schemaless. Drill is schema-on-read, and if your files contain changing schemas, it is painful to work around all the errors you face. JSON is too ambiguous when it comes to types.
If the “Director of Customer Solutions” for the company selling Presto feels the need to write a blog post declaring Apache Drill dead, that might be a sign that it is actually competing pretty well with Presto.
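A small Python sketch of the kind of ambiguity I mean. This is not how Drill infers types; it just scans JSON Lines records and reports fields whose Python-level type changes from one record to the next. The sample records and field names are made up.

```python
import json
from collections import defaultdict


def infer_field_types(lines):
    """Collect the set of value types seen for each top-level field."""
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return dict(seen)


def conflicting_fields(lines):
    """Return only the fields that appeared with more than one type."""
    return {
        key: sorted(types)
        for key, types in infer_field_types(lines).items()
        if len(types) > 1
    }


# Made-up log records: "id" flips from number to string, and "latency"
# shows up as an integer, a float and a null.
sample = [
    '{"id": 1, "latency": 12}',
    '{"id": 2, "latency": 12.5}',
    '{"id": "3", "latency": null}',
]

print(conflicting_fields(sample))
```

Running something like this over a sample of files before querying them shows which columns will probably need explicit casts or cleanup.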
Presto has been rebranded to Trino. Trino supports MATCH_RECOGNIZE from the SQL:2016 standard, which is pretty cool. The only other database engine that does that so far is Oracle.
Drill is one of those technologies that people don't think to turn to today if they're starting a new project.
What does Drill offer that would make it out-compete the other candidates for a fresh new project?
There's been only one release per year in the past so you can't fault anyone to think the project is dead.
https://github.com/apache/drill/releases
reports of the death of the English language are increasing.