Since starting my career, the tech stack I use has changed numerous times. Technologies have come, gone, evolved. But except for a brief period in 2010 where we thought we could kill it, SQL is the one technology that has stayed around. It might be the only TECHNICAL knowledge I learned in the first year of my career that is still relevant. You're probably not making the wrong decision if you decide to build on top of SQL.
> except for a brief period in 2010 where we thought we could kill it
Unfortunately, we never actually tried to kill SQL. We tried to kill relational databases, with the death of SQL as a side effect. This proved to be a stupid idea - non-relational data stores have value of course, but so do relational databases.
SQL, as you point out, is 48 years old. It shows. The ergonomics are awful. It uses natural-language-inspired syntax that has fallen completely out of favor (with good reason imho). The ordering of fields and tables precludes the use of autocomplete. It uses orthogonal syntax for identical constructions, i.e.:
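Presumably the elided example here is the INSERT/UPDATE pair: both say "set these columns to these values," yet the shapes are unrelated. A minimal sketch, with illustrative table and column names:

    -- column list and values are spelled apart
    INSERT INTO employee (name, wage) VALUES ('John', '50k');

    -- same intent, completely different column = value shape
    UPDATE employee SET name = 'John', wage = '50k' WHERE id = 1;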
And worst of all, it comes with such a woefully inadequate standard that half of the universally supported features are expressed in a completely different fashion for each vendor.
It is very common to argue that its longevity is due to its excellence as a language. I think its longevity is due to its broad target audience, which includes plenty of people in less technical roles who are high on the org chart. Tech workers are happy to jump on new technology that improves ergonomics (or shiny new toys that are strictly more interesting because they are new). Executives whose primary way of interacting with the tech stack is querying a read-only replica are much less open to learning a new language.
SQL expresses relational algebra [0], the language defining data relationships and manipulation. It hasn't changed in 50 years for the same reason B-trees and sorting algorithms haven't: it's foundational to computer science.
Database-specific "SQL extensions" are in my experience just administration commands, e.g. `VACUUM` or `CREATE EXTENSION` in Postgres. They help operate the DB, but have little to do with the actual data manipulation.
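For instance, in Postgres these are operational statements rather than queries (the table name is a placeholder; pg_trgm is a stock Postgres extension):

    -- operational commands, not data manipulation
    VACUUM (ANALYZE, VERBOSE) orders;
    CREATE EXTENSION IF NOT EXISTS pg_trgm;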
Killing SQL is like trying to kill computer science: you better come up with something better than a synonym.

[0] https://en.wikipedia.org/wiki/Relational_algebra
> Unfortunately, we never actually tried to kill SQL. We tried to kill relational databases, with the death of SQL as a side effect.
I don’t know who “we” is, but, before NoSQL tried to replace RDBMSs (well, starting before; these things overlapped), there were efforts to provide alternatives to SQL for RDBMSs, some examples:

Query-by-Example. https://en.wikipedia.org/wiki/Query_by_Example

the D class of data languages. https://en.wikipedia.org/wiki/D_(data_language_specification...

https://prql-lang.org/
I think your last sentence gets at the crux of it. I think SQL is truly great for that use case of analytical queries over a read-only database / data warehouse / parquet / whatever (i.e. "OLAP" use cases). I think it's less great for the "OLTP" use cases. It still works fine, which is why it sticks around, but other approaches are more convenient for doing lots of smaller inserts, updates, and lookups and fewer table scans and aggregations.
I also feel like part of the staying power of SQL is because it's the lingua franca of the database. I wonder if it's time to build and popularize some lower-level language that could be targeted by ORMs and others to speak to DB engines. If we had a more flexible assembly language we might get more freedom at the programming-language level.
I've been using SQL for ~25 years and don't think the ergonomics are nearly as bad as people say. In your example the insert is done that way for good reason.
I know people hate listing the fields before the tables, but when I deal with data that's how I tend to think: I need X, Y, Z; now where do I get it from, and do I need to filter?
From a vendor standpoint, it can take time getting new standards implemented, but it's definitely not half. One of the big issues here is that MySQL was and still is in some ways woefully inadequate. Its shortcomings also often make people think they are relational-database shortcomings in general.
The reason we can't replace SQL is that it is the lingua franca between many client tools and data sources. From programming languages to analytics tools to ETL tools to bulk data-transfer tools, everything knows how to define SQL queries. On the data storage side, almost everything is standardizing on SQL as its API: RDBMSes, but also NoSQL storage (e.g., Hadoop + Hive), data virtualization layers (Denodo, AWS Athena, Steampipe).
You can't change this by providing a better query language in one of the datasource products, or by providing a better ORM. Because of the network effects it will be very hard to come up with something better that is universally adopted. Many have tried, but it's always limited to a small stack: MDX, DAX, LINQ, dplyr, GraphQL, etc.
There may be opportunities to replace SQL as soon as we need something that moves away from relational algebra. Currently you see a lot of adoption of graph storage and graph query languages in the Data Fabric space, as users need to build queries reasoning about relationships between different datasets in the enterprise.
The other reason Data Fabrics could offer an opportunity here is that they're basically adding a layer of abstraction between all data sources and data consumers, and they have the possibility to translate SQL into something else, e.g. GraphQL.
I disagree; the people high on the org chart are clamouring for cloud-based low-structure stores on Microsoft Azure maintained in India, while we warn them that we can migrate at a giant cost for a net loss in efficiency.
SQL and DBs are fine, they work. We could discuss beautifying SQL for sure, but frankly we shouldn't be solving data-structuring problems that were already solved 50 years ago: we have real business to support instead.
When I was first learning to code, I once told someone that working with SQL felt like trying to wrestle with a mainframe, and I stand by that assertion.
To my primitive brain, NoSQL always sounded like "I am going to make up the data model and integrity on the fly and let the user figure out how to handle it". No thanks. But maybe I don't get it. GraphQL honestly gives me the same vibes. Happy to be corrected.
I have plenty of beef with SQL, but your insert versus update example doesn't make sense to me. That's a perfectly reasonable difference; insert is always a dense operation while update is usually a sparse operation.
A huge benefit of SQL is that it's a declarative language (you describe what the output should be); whereas with imperative languages like C/Python/Ruby you have to describe how to generate the output with specific instructions and procedures.
Is there a formal specification of what “what” and “how” means in this context? I’m familiar with your description, but it appears subjective to me because the phrases “what the output should be” and “how to generate the output” are poorly defined.
You might notice that the only way to create a declarative language is to use an imperative language.
Declarative languages are basically a DSL, which (hopefully) translates the desired result into efficient instructions. Nonetheless, your CPU will execute imperative code at the end.
SQL is an example of a very well established and generally well done declarative language, but that doesn't mean that declarative languages are inherently better.
I'm sure this will ruffle some feathers, but the technical skill that has been the most persistently valuable across the past 25 years of my career is object-oriented programming and design. It's not the only paradigm I use, but being able to organize large programs in terms of OOP has been incredibly valuable and hugely productive for me across several languages and domains.
On the contrary, I think that's a self-fulfilling perception. If assets are already organized as objects, then it'll continue to work well with OOP. But if you take the approach from the get-go to use data and FP approaches, things can be quite nice. Examples:
- interface/trait-based programming / structural subtyping (widely used in Go/Rust, increasingly in TS and Python)
- Terraform
- React
- AWS Lambda / cloud functions
- flow-based data processing (the whole of deep learning, Spark/Hadoop)
- and of course, anything declarative-DSL based (SQL, jq and friends)
So I would counter that the more valuable skill is "how do I solve problems in terms of applying and composing transforms to data"
To clarify, since everyone has their own definition of OOP: of the four pillars, Abstraction and Polymorphism aren't at all unique to OOP, and Encapsulation is just Abstraction. The defining features of OOP, then, are inheritance and poking-and-prodding state into opaque objects. Inheritance is subsumed by interfaces / structural subtyping, and poking at state is contrasted with reactor patterns, event sourcing, persistent data structures, etc.
OOP really shines at the middling-low level in languages without lifetimes (state for things like IO resources), at the GUI widget level, and at the (micro)service level, which is more like the original Smalltalk sort of objects, in which case inheritance isn't a thing.
Everyone has a different experience! Object oriented programming and design have had approximately no impact on my own career, while SQL knowledge has been indispensable. I don't think they're really in competition.
I feel my abilities as a developer really took off when I took the time to really learn SQL and take advantage of the features that my database offers. For example, learning to use window functions felt like that galaxy brain meme where you become one with the universe.
Being able to take 50 lines of Ruby code (manual (anti-)joins and aggregates, result partitioning, etc...) and replace it with a couple lines of SQL that is much faster and less buggy is a life-changing experience.
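A sketch of the kind of collapse meant here, against a hypothetical orders table: the "latest row per group" chore that eats dozens of lines of application code becomes one window function.

    -- latest order per customer via a window function
    SELECT *
    FROM (
      SELECT o.*,
             row_number() OVER (PARTITION BY customer_id
                                ORDER BY created_at DESC) AS rn
      FROM orders o
    ) ranked
    WHERE rn = 1;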
The only other time I had such a dramatic shift in the way that I look at building applications is when I learned Scala (functional programming).
I’ve been using HTML since the mid 1990s. That is probably the oldest technology that I still use on a regular basis. SQL would be a close second (or maybe third - I probably started learning SQL around 2001 and I can’t remember when JavaScript came out. Surely before that.)
Honestly, mine has mostly stayed the same. I've played with various techs, but I mostly work with the same underlying tech. Some items get added, but very rarely does something get replaced or removed. The tech gets improved and new features get added.
The last time I can remember something getting replaced is Vagrant for Docker. And honestly, I think that's a mistake for many companies. Before that it was Git for SVN.
The only way I can see tech coming and going is if you're always an early adopter and need to replace things that failed. If you build on stable tech, that stuff stays around for decades even if it was terrible in the first place. A good example is PHP. It was terrible at first, but it's still around and improving year on year.
Can you elaborate on why vagrant to docker is a mistake? I find the docker workflow infinitely better and faster to iterate for making dev environments.
The problem I see here is building SQL on top of the cloud, which is like building SQL on top of a database where the tables are always changing - not just adding columns but the meaning of the columns and rows, or even the existence of the table.
I think this is a good idea, but the cloud doesn't seem to care too much about backwards compatibility. If you use MySQL/PostgreSQL/SQL Server, you can be fairly sure in 20 years time your SQL will still work :-).
The need for this library is more an indictment of "the cloud" than anything. I am mostly an Azure user so AWS/Google might be better, but man, it changes its UI and CLI interfaces a lot - way too much.
SQL is a technology that engineers should know inside and out, and while in the early 2000s you could not move an inch without knowing it, I find that the new generation of devs really considers database calls a freebie - hidden behind abstractions and ORMs. It's just a line of code, right?
This leads to highly non-performant software since, unless we have bent the laws of physics in the last 15 years, your data access layer is the most expensive one.
"Since starting my career, the tech stack I use has changed numerous times. Technologies have come, gone, evolved. But except for a brief period in 2010 where we thought we could kill it, SQL is the one technology that has stayed around."
Ethernet comes to mind when I think of technologies that just can't be killed ...
> It might be the only TECHNICAL knowledge I learned in the first year of my career that is still relevant.
Agree. IIRC, I wrote my first SQL in 1996 and wrote some today. While not perfect, it’s amazing at what it does. I suggest reading Joe Celko to really up your SQL skills.
That's not random. SQL is basically math. The syntax is a little old school and arcane (but not terrible) but the principles behind it are set theory and logic.
It's kind of random. There were plenty of competing languages in use until the 90s. Even our beloved Postgres used QUEL for its first decade of life. But Oracle took the dominant lead and its competition either disappeared or started adding SQL support to be compatible with it.
Had a different butterfly flapped its wings all those years ago, things could have turned out very different.
> SQL is basically math.
All programming languages are basically math.
> but the principles behind it are set theory
Thing is, one of the challenges with SQL is that it doesn't adhere to set theory. This leads to some unintuitive traps that could have been avoided had it carefully stuck to the relational algebra.
Such deviation may have been a reasonable tradeoff to help implementations, perhaps, but from a pure language perspective it does make it more terrible than it should have been.
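Two classic examples of the deviation (these should run as-is in Postgres, say): SQL tables are bags rather than sets, and NULL drags in three-valued logic.

    -- bags, not sets: this returns two identical rows
    SELECT 1 AS x UNION ALL SELECT 1;

    -- NULL = NULL evaluates to UNKNOWN, not true: zero rows
    SELECT 'match' WHERE NULL = NULL;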
Eh, I’ve seen various things like Git come and go. I spent the majority of my career in the days before git existed and we used things like SVN. I wouldn’t be surprised to see Git supplanted. I would be surprised to see SQL supplanted.
Git is a bit newer, and there have been many attempts to solve the same problem it solves, but it solves that specific problem VERY well... it'll be a long time before it's dethroned in my opinion.
Excited to see Steampipe shared here - thanks kiyanwang! I'm a lead on the project, so sharing some quick info below and happy to answer any questions.
Steampipe is open source [1] and uses Postgres foreign data wrappers under the hood [2]. We have 84+ plugins to SQL query AWS, GitHub, Slack, HN, etc [3]. Mods (written in HCL) provide dashboards as code and automated security & compliance benchmarks [3]. We'd love your help & feedback!

1 - https://github.com/turbot/steampipe
2 - https://steampipe.io/docs/develop/overview
3 - https://hub.steampipe.io/
Hey Nathan. Can you comment on some of the known performance pitfalls of steampipe? I'm not super familiar with the Postgres foreign data wrappers API. I assume steampipe inherits many of its technical limitations from this API.
Having done some work in this space, I'm aware that it's no small thing to compile a high-level SQL statement that describes some analytics task to be executed on a dataset into a low-level program that will efficiently perform that task. This is especially true in the context of big data and distributed analytics. Also true if you're trying to blend different data sources that themselves might not have efficient query engines.
Would love to use this tool, but just curious about some of the details of its implementation.
The premise of virtual tables (Postgres FDW) is to not store the data but instead query it from the original source. So, the primary performance challenge is API throttling. Steampipe optimizes these API calls with smart caching, massive parallelization through goroutines, and calling the minimum set of APIs to hydrate the exact columns requested. For example, "select instance_id from aws_ec2_instance" will do 2 API calls to get 100 instances in 2 pages, while "select instance_id, tags from aws_ec2_instance" would do 2 calls (instance paging) + 100 tag API calls (one per instance). We've also implemented support for qualifiers (i.e. where clauses) so API calls can be reduced even further - e.g. get 1 EC2 instance without pagination etc.
The Postgres planner is not really optimized for foreign tables, but we can give it hints to indicate optimal paths. We've gradually ironed out many cases here in our FDW implementation, particularly for joins etc.

If you can tolerate my Aussie accent, I explain many of the details in this Yow Data talk - https://www.youtube.com/watch?v=2BNzIU5SFaw
To add somewhat of a counterpoint to the other response, I've tried the Steampipe CSV plugin and got 50x slower performance vs OctoSQL [0], which is itself 5x slower than something like DataFusion [1]. The CSV plugin doesn't contact any external APIs so it should be a good benchmark of the plugin architecture, though it might just not be optimized yet.
That said, I don't imagine this ever being a bottleneck for the main use case of Steampipe - in that case I think the APIs themselves will always be the limiting part. But it does - potentially - speak to what you can expect if you'd like to extend your usage of Steampipe to more than just DevOps data.
I've used the benchmark available in the OctoSQL README.

[0]: https://github.com/cube2222/octosql

[1]: https://github.com/apache/arrow-datafusion

Disclaimer: author of OctoSQL
Hey!
Steampipe looks great and I think the architecture you chose is very smart. Some feedback from myself:
I've tried setting up Steampipe with Metabase to do some dashboarding. However, I’ve found that it mostly exposes "configuration" parameters, so to speak. I couldn't find dynamic info, like S3 bucket size or autoscaling group instance count.
Have I done something backwards or not noticed a table, or is that a design decision of some sort? That was half a year ago, so things might've changed since then, too.
Steampipe mostly exposes underlying APIs or data through SQL. So, "select * from aws_s3_bucket" will return bucket names, tags, policies, etc all together. But, bucket size is not a direct API so doesn't have a direct table.

In some high value cases we've added tables to simplify / abstract / normalize data - for example AWS IAM policies - https://steampipe.io/blog/normalizing-aws-iam-policies-for-a...

This example query returns the number of instances attached to an autoscaling group - https://hub.steampipe.io/plugins/turbot/aws/tables/aws_ec2_a...

BTW, we recently published a Metabase integration guide - https://steampipe.io/docs/cloud/integrations/metabase
It might just be a coincidence, but an hour before this HN post, I discovered it way back in our queue of things to review for https://golangweekly.com/ and featured it in today's issue. Hopefully kiyanwang is one of our readers :-D
Congratulations! This looks incredibly powerful and I'm excited to check it out.
Although this is pitched primarily as a "live" query tool, it feels like we could get the most value out of combining this with our existing warehouse, ELT, and BI toolchain.
Do you see people trying to do this, and any advice on approaches? For example, do folks perform joins in the BI layer? (Notably, Looker doesn't do that well.) Or do people just do bulk queries to basically treat your plugins as a Fivetran/Stitch competitor?
While it's extensible, the primary use case for Steampipe so far is to query DevOps data (e.g. AWS, GitHub, Kubernetes, Terraform files, etc) - so often they are working within the Steampipe DB itself and combining with CSV data etc.
But, because it's just Postgres, it can be integrated into many different data infrastructure strategies. Many users query Steampipe from standard BI tools, others use it to extract data into S3, it has also been connected with BigQuery - https://briansuk.medium.com/connecting-steampipe-with-google...
As opposed to a lake or a warehouse, we think of it as a "Data Rainbow" - structured, ephemeral queries on live API data. Because it doesn't have to import the data it works uniquely well for small, wide data and joining large data sets (e.g. search, querying logs). I spoke about this in detail at the Yow Data conference - https://www.youtube.com/watch?v=2BNzIU5SFaw
I'm biased, but writing plugins is really fun - and then you can create mods / dashboards for your users as well! Please give it a go - we have a guide [1], 84+ open source plugins to copy [2], many authors in our Slack community [3] and we're happy to get your plugin added to the Steampipe Hub [4].

1 - https://steampipe.io/docs/develop/writing-plugins
2 - https://github.com/topics/steampipe-plugin
3 - https://steampipe.io/community/join
4 - https://hub.steampipe.io/plugins
We see users running Steampipe on their desktop, in pipelines (e.g. to scan Terraform files) and also as a central service. See "service mode" in the CLI for a Postgres endpoint and a dashboard web server [1] or try docker / kubernetes [2]. We also have a cloud hosted option with dashboard hosting, multi-user support, etc [3].

1 - https://steampipe.io/docs/managing/service
2 - https://steampipe.io/docs/managing/containers
3 - https://steampipe.io/docs/cloud/overview
My guess is that it makes sense to get regular reports, e.g. weekly. But you also want to experiment and develop queries. So probably both. Not sure if there is something like a notebook for Steampipe.
Steampipe is pretty much "just" PostgreSQL with a foreign data wrapper and some fancy extras on top. The data is in tables from the database's perspective, so pretty much everything you can do with PostgreSQL, you can do with steampipe, including creating your own tables, views, functions and whatnot.
Steampipe killed my startup. I couldn't be happier about it.
I started a company six years ago to do exactly this -- make a SQL interface for the AWS API. Steampipe out-executed us and made a way better product, in part because their technology choices were much smarter than ours. I shut down the startup as soon as I saw steampipe.
They use Postgres foreign data wrappers and caching, with live queries against the API. We were doing periodic scanning and then storing it in Elastic. Then trying to convert the SQL to Elastic queries.
The Elastic part let us do free-text queries, but converting from SQL was challenging (this was before Elastic had it built in). And also everything was only as fresh as our last scan.

I wish them all the best, it's a great product!
Writing custom scripts for even the simplest of queries comes nowhere near the convenience of using PostgreSQL to get what you want. If you wanted to find and report all instances in an AWS account across all regions, with steampipe it's just:
    SELECT * FROM aws_ec2_instance;
Even the simplest implementation with a Python script won't be nearly as nice, and once you start combining tables and doing more complicated queries, there's just no competition.
Tried steampipe out to replace my simple Python script. It wasn't able to handle our 1k+ accounts and multiple regions; initialization took over 1 hour before I was even able to do a query.
Meanwhile my Python script can run any API calls in all our accounts and regions and finish in a few minutes. Maybe 100 lines of Python using boto3 and multiprocessing that outputs to JSON or YAML.
Python and boto3 are very powerful in many cases for sure - glad they work well for you. No doubt that multi-region for 1k+ accounts is a large surface area, so we'd love to learn from your experience! Performance is an ongoing focus area for us, including radical improvements in memory usage and startup time in our most recent version - https://steampipe.io/blog/release-0-16-0
I don't agree that this simple query wouldn't be as nice with boto3.
But combining different services is definitely a nice feature, if you have the use case for it.
The fun part of that query is that it'll search across all regions that you have configured Steampipe to use. If you did that in Python, you'd at the very least have to manually iterate over each region, handle pagination yourself, and combine the results.
It gets much more convenient when you want to ask more complicated questions like "What is the most used instance type per region" or "How much EBS capacity do instances use across all regions, grouped by environment tag and disk type?"
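The first of those questions is plausibly a one-liner against Steampipe's aws_ec2_instance table (a sketch; column names as documented on the Steampipe Hub, not verified here):

    -- most-used instance type per region
    SELECT region, instance_type, count(*) AS instances
    FROM aws_ec2_instance
    GROUP BY region, instance_type
    ORDER BY region, instances DESC;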
I'm the author of CloudQuery (https://github.com/cloudquery) and I believe we started at the same time, though we took some different approaches.
PG FDW is def an exciting PostgreSQL tech and is great for things like on-demand querying and filtering.
We took a more standard ELT approach where we have a protocol between source plugins and destination plugins, so we can support multiple databases, data lakes, storage layers - kinda similar to Fivetran and Airbyte but with a focus on infrastructure tools.
Also, as a more standard ELT approach, our policies use standard SQL directly and dashboards re-use existing BI tools rather than implementing dashboarding in-house.
I can see the power of all-in-one and FDW and def will be following this project!
This made me curious and I started doing research on other problematic name collisions. Did you know that Valve’s game distribution service has a name that’s already widely used in the scientific community for the gaseous form of heated water? It looks like steam has been around for several centuries at least so Valve really ought to have known better.
Anyway, if you’re interested, I’m putting together a long investigation on whether Microsoft’s flagship operating system is in collision with a very common term used in the construction and glass industry. Hope to have it done soon
Both things called “steampipe” are cloud-based technology services. They even both involve a CDN. They're close enough to be confusing. The name of a gas isn't. This is silly.
How does this handle pagination and also how does it behave towards rate limiters? For example, my API (keygen.sh) has a page size limit of 100, and often times there can be hundreds or even thousands of pages for certain resources. How would Steampipe handle `select count(*) from keygen_licenses` while obeying the rate limiter? What if a heavy query is run twice?
Regarding your complaint ... those aren't identical constructions.
Insert has to have the syntax it does because it lets you insert multiple rows all with different values in a single statement:
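A sketch of what that elided statement presumably showed (table and column names are illustrative):

    INSERT INTO employee (name, wage)
    VALUES ('John',  '50k'),
           ('Jane',  '55k'),
           ('Maria', '70k');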
Update doesn't allow that. The syntax for each of the above makes sense for the specific intention of each statement.
I think SQL is so popular and has staying power for the same reason that our ridiculous Gregorian calendar has staying power.
That's irrelevant and a fallacy. Plenty of amazing things are old and have stood the test of time. Vim and SQL have empowered me my entire career.
There is an alternative INSERT syntax matching the UPDATE SET one:
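Presumably this refers to the MySQL/MariaDB extension, which mirrors UPDATE's column = value pairs:

    -- MySQL / MariaDB only, not in the SQL standard
    INSERT INTO employee SET name = 'John', wage = '50k';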
I do like SQL the syntax, but it's hard to discard the ecosystem around SQL.
For instance, I made better-sql [1] recently, which generates SQL from a language similar to GraphQL and the EdgeDB query language.
[1] https://better-sql.surge.sh
Also, surely allowing FROM and SELECT clauses to be swapped in order shouldn't stress out any parser too much.
/s
> A huge benefit of SQL is that it's a declarative language (you describe what the output should be)
If we take as an example a table with a single row:

In imperative (SQL):

    INSERT INTO TABLE1 (name, wage) VALUES ('John', '50k');

Then one day John gets a raise:

    UPDATE TABLE1 SET wage = '60k' WHERE name = 'John';

In declarative (pseudo code in YAML):

    - table: TABLE1
      name: John
      wage: 50k

Then one day John gets a raise:

    - table: TABLE1
      name: John
      wage: 60k
you have snatched the pea.
But we are crushing LeetCode, yeah?
And I wrote my first QUEL in 1984 at BLI. And SQL still won the war.
The real power of SQL is locked until you can join different data sources.
Immediately thought of CloudQuery when I saw this and sure enough, here you are :)
No way they did not know about that name collision as SteamPipe has been around since 2013-ish.
aCtUaLLy, aPpLes hAvE eXisTeD bEfoRe 1976
https://en.wikipedia.org/wiki/Stephenson%27s_Rocket
The plugin SDK provides a default retry/backoff mechanism, and the plugin author can enhance that.
If a heavy query runs twice it'll load from cache the second time, if within the (user-configurable) cache TTL.