I like it, it's readable, unlike some SQL alternatives I've seen it doesn't make me feel like I'm dumb and don't understand what a query even is.
I can't decide if it would be better or worse if it stuck more closely to SQL keywords. You use "from" and "select", but not "where", "order by", "group by". There's some danger of it being in an uncanny valley of SQLish, but I'm pretty sure I'd prefer just using those terms verbatim (including the space in "order by"... that style is less common in modern languages but it's not really that much harder to parse).
I'd like to see more examples of joins and composing SQL. Does this language make it easier to make more general SQL queries? Can I take two queries and squash them together in a reliable way? I feel like I end up with a lot of theme and variation in my queries, often involving optional filters.
I might even like a notion of encapsulation that could help this query language when it's embedded in other languages. Like if I could say, in the language itself, that a query has certain unbound variables (and not just ? or other placeholders). This language seems like it would be better for generating than SQL, and sometimes generation is just necessary (like in any application that supports data exploration), but for most common cases I'd hope to avoid that. Defining inputs and then making whole filter sections or other statements conditional on those inputs would help here.
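For the "unbound inputs with optional filters" case, the workaround today is assembling a parameterized query in the host language. A minimal sketch against SQLite (table and column names are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, country TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [("ann", "USA", 100.0), ("bob", "USA", 50.0), ("cho", "DE", 80.0)])

def find_employees(country=None, min_salary=None):
    # Each optional input conditionally contributes a WHERE clause;
    # values always travel as bind parameters, never string interpolation.
    clauses, params = ["1 = 1"], []
    if country is not None:
        clauses.append("country = ?")
        params.append(country)
    if min_salary is not None:
        clauses.append("salary >= ?")
        params.append(min_salary)
    sql = "SELECT name FROM employees WHERE " + " AND ".join(clauses)
    return sorted(row[0] for row in conn.execute(sql, params))
```

The appeal of declaring those inputs in the query language itself is precisely that this host-language boilerplate (and its foot-guns) would disappear.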
Yup, I like a lot of things about the way this looks. In particular, I like how friendly this looks for things like autocomplete (pretty annoying to need to type practically the entire SQL query only to go back and fix up the columns in order to get autocomplete to work).
Specific things I'd like to see:
How do you handle column ambiguity? In the examples, they show a join of positions to employees on employee_id == id. But what happens when you have two columns with the same name that you are joining on (like employee_id to employee_id in some mapping table)?
Subqueries are pretty important in what I do, so what do those look like? (Perhaps covered by the "thinking about CTEs" section.)
How about opportunities for optimization hints? In T-SQL you can hint at which index the optimizer should prefer for a specific query.
Common SQL patterns would also be interesting. Like, how would you do keyset pagination?
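(For reference, keyset pagination in plain SQL is usually a row-value comparison against the last row of the previous page. A sketch against SQLite with made-up columns; the same shape works in Postgres and MySQL 8+:)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(i, "2022-01-%02d" % i) for i in range(1, 8)])

# Keyset (seek) pagination: instead of OFFSET, seek past the last row of
# the previous page using a row-value comparison on the ordering key.
PAGE = """
    SELECT id, created_at FROM events
    WHERE (created_at, id) > (?, ?)
    ORDER BY created_at, id
    LIMIT 3
"""
page1 = conn.execute(PAGE, ("", 0)).fetchall()   # first page: seek past nothing
last_created, last_id = page1[-1][1], page1[-1][0]
page2 = conn.execute(PAGE, (last_created, last_id)).fetchall()
```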
Edit: Also, I'd like a discussion about null. SQL null handling rules are terrible. I understand them, I work with them, but at the same time, they are so different from other languages' concept of "null" that they are easy to trip over.
> SQL null handling rules are terrible. I understand them, I work with them, but at
> the same time, they are so different from other languages' concept of "null" that
> they are easy to trip over.
Could you elaborate? I'm really only versed in the MySQL accent, but I don't find anything unusual or unexpected about NULLs in MySQL. If there are any pitfalls that I should be aware of, I'd love to know about them here before my users start complaining about bugs.
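Not the parent, but the classics are three-valued logic leaking into comparisons and `NOT IN`. Demonstrated here against SQLite; MySQL behaves the same way:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

def scalar(sql):
    return conn.execute(sql).fetchone()[0]

# 1. `x = NULL` is never true: comparison with NULL yields NULL, not false.
assert scalar("SELECT count(*) FROM t WHERE x = NULL") == 0
# 2. NOT IN with a NULL in the list silently matches *nothing*.
assert scalar("SELECT count(*) FROM t WHERE x NOT IN (2, NULL)") == 0
# 3. count(*) counts rows, count(x) skips NULLs.
assert scalar("SELECT count(*) FROM t") == 3
assert scalar("SELECT count(x) FROM t") == 2
# 4. Aggregates skip NULLs too: avg is 1.5, not 1.0.
assert scalar("SELECT avg(x) FROM t") == 1.5
```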
I think supporting variables and functions already solves most of my composability gripes with SQL.
Another problem that I have with composing SQL is that large queries quickly become unreadable, and error messages are also often not terribly helpful. I think having a more expressive type system would help with the error messages.
Do you have any plans on adding a type system to PRQL?
> Can I take two queries and squash them together in a reliable way? I feel like I end up with a lot of theme and variation in my queries, often involving optional filters.
That's essentially what SQL views do. Each view is a query and then you can treat it like a table and filter/join on it.
Of course, then the problem becomes whether or not the query planner can see through the view to the underlying tables to optimize correctly.
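A minimal sketch of the view-as-composition pattern, against SQLite with made-up names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, country TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [("ann", "USA", 100.0), ("bob", "DE", 80.0)])

# The shared base query becomes a view...
conn.execute("""
    CREATE VIEW usa_employees AS
    SELECT * FROM employees WHERE country = 'USA'
""")
# ...and each variation layers its own filters on top of it.
rows = conn.execute("SELECT name FROM usa_employees WHERE salary > 50").fetchall()
```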
> I can't decide if it would be better or worse if it stuck more closely to SQL keywords. You use "from" and "select", but not "where", "order by", "group by". There's some danger of it being in an uncanny valley of SQLish, but I'm pretty sure I'd prefer just using those terms verbatim (including the space in "order by"... that style is less common in modern languages but it's not really that much harder to parse)
I agree 100% here. As a SQL veteran, it would make the transition a lot easier if you used common SQL keywords like group by, order by, limit, etc.
e.g.
from employees
where country = "USA"
derive [
    gross_salary: salary + payroll_tax,
    gross_cost: gross_salary + benefits_cost
]
where gross_cost > 0
group by:[title, country] [
    average salary,
    sum salary,
    average gross_salary,
    sum gross_salary,
    average gross_cost,
    sum_gross_cost: sum gross_cost,
    count,
]
order by:sum_gross_cost
where count > 200
limit 20
I like the flow direction compared to standard SQL. SQL is supposed to read like a sentence I suppose but I have many times looked at it and really wanted things to be in a more logical order.
My main suggestion would be to be a bit less terse and introduce a bit firmer formatting. I'm not a huge fan of the term "split" and feel like jazzing that up to "split over" or even just reviving "group by" would improve readability. Additionally, the aliasing could use work; I'd suggest reversing the assignment to be something closer to `use salary + payroll_tax as gross_salary`. In terms of firm formatting, unless I'm missing something there isn't any reason to allow a filter statement before any aliases - so you can force two fixed positions for filter clauses, which would make it always legal to reference aliases in filters.
On the brief topic of form vs. flexibility. SQL is a thing that, when complex, is written by many people over the course of its lifetime - removing the ability to make bad decisions is better than enabling the ability to write simple things even simpler - those silly do-nothing queries like `SELECT * FROM customer WHERE deleted='f'` are written once[1] in a moment's time and never inspected again. The complex queries are what you want to optimize for.
1. If they even are - with ORMs available a lot of those dead simple queries just end up being done through an ORM.
> On the brief topic of form vs. flexibility. SQL is a thing that, when complex, is written by many people over the course of its lifetime - removing the ability to make bad decisions is better than enabling the ability to write simple things even simpler
Hallelujah! But, to your footnote, this is a major reason why I despise ORMs. In my mind they make writing simple code slightly easier, but they make complicated SQL statements, especially when you get some weird issue under load and you're trying to debug why your DB is falling over, a ton more difficult and you spend so much time just battling your ORM.
On ORMs, the best use I see of them is for “transparent” queries that you don’t define.
Like fetching a record by id, or a single record and all of its related properties. Or a list of all the records in a table matching a simple filter.
That’s 98% of what we do against the DB, and I’m all for having it basically invisible.
Then let’s just bypass the ORM altogether the minute we think about joining or grouping things together. There are libs in most languages that help sanitize queries, so it’s not really difficult.
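Agreed; even bare DB-API placeholders get you most of the safety without an ORM in the way. A minimal sketch (names made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

# A hostile value is harmless when it travels as a bind parameter
# rather than being spliced into the SQL string.
evil = "Robert'); DROP TABLE users;--"
conn.execute("INSERT INTO users VALUES (?, ?)", (1, evil))
row = conn.execute("SELECT id FROM users WHERE name = ?", (evil,)).fetchone()
```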
I like the flow direction specifically for intellisense/autocomplete. I'm sure it would be easier to provide hints when the table name is known immediately.
Now this is actually nice, unlike the other suggestion posted today[1].
[1]: https://news.ycombinator.com/item?id=30053860
Maybe I'm just too used to non-standard extensions of our database but the SQL example could, at least for our db, be rewritten as
SELECT TOP 20
    title,
    country,
    AVG(salary) AS average_salary,
    SUM(salary) AS sum_salary,
    AVG(gross_salary) AS average_gross_salary,
    SUM(gross_salary) AS sum_gross_salary,
    AVG(gross_cost) AS average_gross_cost,
    SUM(gross_cost) AS sum_gross_cost,
    COUNT(*) AS count
FROM (
    SELECT
        title,
        country,
        salary,
        (salary + payroll_tax) AS gross_salary,
        (salary + payroll_tax + healthcare_cost) AS gross_cost
    FROM employees
    WHERE country = 'USA'
) emp
WHERE gross_cost > 0
GROUP BY title, country
HAVING COUNT(*) > 200
ORDER BY sum_gross_cost
This cuts down the repetition a lot, and can also help the optimizer in certain cases. Could do another nesting to get rid of the HAVING if needed.
Still, think the PRQL looks very nice, especially with a "let" keyword as mentioned in another thread here.
with usa_employees as (
    SELECT
        title,
        country,
        salary,
        (salary + payroll_tax) AS gross_salary,
        (salary + payroll_tax + healthcare_cost) AS gross_cost
    FROM employees
    WHERE country = 'USA'
        AND (salary + payroll_tax + healthcare_cost) > 0
)
select
    title,
    country,
    AVG(salary) AS average_salary,
    SUM(salary) AS sum_salary,
    AVG(gross_salary) AS average_gross_salary,
    SUM(gross_salary) AS sum_gross_salary,
    AVG(gross_cost) AS average_gross_cost,
    SUM(gross_cost) AS sum_gross_cost,
    COUNT(*) as emp_count
from usa_employees
group by title, country
having count(*) > 200
order by sum_gross_cost
limit 3
Readability is pretty similar to PRQL. It would really help in SQL if you could refer to column aliases so you don't have to repeat the expression.
In some cases for removing repeating (intermediate) calculations, I generally find it easier to use a lateral join (in postgres), like
select
    title,
    country,
    avg(salary) as average_salary,
    sum(salary) as sum_salary,
    avg(gross_salary) as average_gross_salary,
    sum(gross_salary) as sum_gross_salary,
    avg(gross_cost) as average_gross_cost,
    sum(gross_cost) as sum_gross_cost,
    count(*) as emp_count
from
    employees,
    lateral (select
        (salary + payroll_tax) as gross_salary,
        (salary + payroll_tax + healthcare_cost) as gross_cost
    ) employee_ext
where
    country = 'USA'
    and gross_cost > 0
group by title, country
having count(*) > 200
order by sum_gross_cost
limit 3;
Column aliases would have saved me hundreds of hours over the course of my career. Sorely missing from standard SQL, and would make the need for PRQL less acute.
Snowflake lets you refer to column aliases, and it's great!
There's the slight issue of shadowing of table column names, which they resolve by preferring columns to aliases if both are named the same. So sometimes my aliases end up prefixed with underscores, but that's not a big deal.
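For what it's worth, a few engines besides Snowflake already accept output-column aliases in GROUP BY and ORDER BY (MySQL and SQLite among them). A quick check against SQLite with made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (dept TEXT, salary REAL)")
conn.executemany("INSERT INTO emp VALUES (?, ?)",
                 [("eng", 10.0), ("eng", 20.0), ("ops", 5.0)])

# `total` is a select-list alias, reused in ORDER BY instead of
# repeating SUM(salary) -- exactly the repetition lamented above.
rows = conn.execute("""
    SELECT dept, SUM(salary) AS total
    FROM emp
    GROUP BY dept
    ORDER BY total DESC
""").fetchall()
```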
Not all database systems can optimize queries well over CTE boundaries. I believe this is still true for PostgreSQL (no longer true, see below -- it was true a few years ago). So there's a potential performance hit for (the otherwise excellent advice of) writing with CTE's.
In Microsoft SQL cross apply can be used for this in even more situations and with less repetition:
select top(20)
    title,
    country,
    ...
    avg(gross_salary) as average_gross_salary,
    ...
from employees
cross apply (select
    gross_salary = employees.salary + employees.payroll_tax, -- or "as .."
    gross_cost = ...
) v -- some name required, but you don't need to use it if column names are unique
where ...
Very cool! A couple questions/suggestions off the top of my head:
1. Did you consider using a keyword like `let` for column declarations, e.g. `let gross_salary = salary + payroll_tax` instead of just `gross_salary = salary + payroll_tax`? It's nice to be able to scan for keywords along the left side of the window, even if it's a bit more verbose.
2. How does it handle the pattern where you create two moderately complex CTEs or subqueries (maybe aggregated to different levels of granularity) and then join them to each other? I always found that pattern awkward to deal with in dplyr - you have to either assign one of the "subquery" results to a separate dataframe or parenthesize that logic in the middle of a bigger pipeline. Maybe table-returning functions would be a clean way to handle this?
> Did you consider using a keyword like `let` for column declarations
Yeah, the current design for that is not nice. Good point re the keyword scanning. I actually listed `let` as an option in the notes section. Kusto uses `extend`; dplyr uses `mutate`; pandas uses `assign`.
> 2. How does it handle the pattern where you create two moderately complex CTEs or subqueries (maybe aggregated to different levels of granularity) and then join them to each other? I always found that pattern awkward to deal with in dplyr - you have to either assign one of the "subquery" results to a separate dataframe or parenthesize that logic in the middle of a bigger pipeline. Maybe table-returning functions would be a clean way to handle this?
I don't have an example on the Readme, but I was thinking of something like (toy example):
table newest_employees = (
    from employees
    sort tenure
    take 50
)

from newest_employees
join salary [id]
select [name, salary]
Or were you thinking something more sophisticated? I'm keen to get difficult examples!
There, each variable can be referenced by downstream steps. Generally, the prior step is referenced. Without table variables, your language implicitly pipes the most recent one. With table references, you can explicitly pipe any prior one. That way, you can reference multiple prior steps for a join step.
I haven't thought through that fully, so there may be gotchas in compiling such an approach down to SQL, but you can already do something similar in SQL CTEs anyway, so it should probably work.
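My hand translation of the toy example above into the CTE it would presumably compile to (a guess, not actual compiler output; `take 50` shrunk to 2 to fit the toy data), runnable against SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, tenure INTEGER)")
conn.execute("CREATE TABLE salary (id INTEGER, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [(1, "ann", 1), (2, "bob", 2), (3, "cho", 3)])
conn.executemany("INSERT INTO salary VALUES (?, ?)",
                 [(1, 10.0), (2, 20.0), (3, 30.0)])

# `table newest_employees = (...)` becomes a named CTE; the trailing
# pipeline becomes the main SELECT that joins against it.
rows = conn.execute("""
    WITH newest_employees AS (
        SELECT * FROM employees ORDER BY tenure LIMIT 2
    )
    SELECT name, salary
    FROM newest_employees
    JOIN salary USING (id)
""").fetchall()
```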
Maybe you'd like to check FunSQL.jl, my library for compositional construction of SQL queries. It also follows an algebraic approach and covers many analytical features of SQL, including aggregates/window functions, recursive queries, and correlated subqueries/lateral joins. One thing where it differs from dplyr and similar packages is how it separates aggregation from grouping (by modeling GROUP BY with a universal aggregate function).
I like the explicit pipelining idea, seems much easier to reason about. Some comments:
I found the "# `|` can be used rather than newlines." a bit odd. So when using let, you're only transforming one column? I think the example would look weird with returns instead of |.
Depending on your intended target, it might help adoption if you stay closer to the naming conventions of that target. If you're targeting mainstream Java/Python/C#/Javascript etc. then functions need parentheses, "take 20" may be worse than slice, etc.
I think annotating microversions would get tiresome fast. I think the right way to think of this is that you put in a single version number like 1, and then only ever change that if you need to do backwards-compatible changes that cannot be handled by clever hacks in the runtime.
Also I think you should try writing one or more native wrappers in your intended target languages to make sure it's easy to interface between the two, even if it means you'd have to use dots in that language.
I could imagine an end game where the ergonomics were so good that a database like Postgres ends up with a native PRQL frontend. Not sure you're there yet, though. IMHO SQL as a query language suffers from a) sometimes really bad ergonomics, b) it's hard to wrap in another programming language (also ergonomics), c) it has far too many concepts - it's not orthogonal.
> I think annotating microversions would get tiresome fast. I think the right way to think of this is that you put in a single version number like 1, and then only ever change that if you need to do backwards-compatible changes that cannot be handled by clever hacks in the runtime.
Thanks, good idea - I just changed this to remove the microversions. If we use SemVer, then before `1` we'd hold versions compatible within each 0.x, and after that within each major version.
IMO you are at the forefront of where query languages need to and will go.
Some programmers like you see that SQL ordering is backwards to human thinking, except in the simplest cases. But many people with practice and sunk costs in their SQL expertise will be resistant. The resistance usually wins the day.
But sometimes, a useful tool gets created by one person, and a rift is created in that resistance. Think John Resig creating jQuery, leading to LINQ and many other similar patterns. You could be that person for database query languages, but how do you ensure that?
Maybe imagine what made jQuery easy to adopt and indispensable for programmers: easy availability as a simple .js download; solved the problem of DOM differences between browsers. Good luck to you, and thanks for sharing.
Re LINQ to SQL: the difference is that LINQ works by mapping objects to tables and .NET primitives to SQL types, often producing really poor queries as a result.
This is a nice idea, especially given all the work people have done recently to make in-language querying nicer (Spark comes to mind).
My only gripe is the 'auto-generated' column names for aggregates. This seems like a recipe for disaster - what if there is already a column (as there almost certainly will be) named "sum_gross_cost"? The behavior also just seems rather unexpected and implicit. My suggestion would be simple syntax that lets you optionally give a name to a particular aggregate column:
...
filter gross_cost > 0
aggregate by:[title, country] [
    average salary,
    sum gross_salary,
    average gross_cost,
    let sum_gc = sum gross_cost,
    count,
]
sort sum_gc
While it might seem a little uglier, it seems much more sustainable in the long run. If this is really too gross, I'd advocate some token other than underscore that is reserved for aggregation variables; perhaps `sum@gross_cost` or `sum#gross_cost`.
I'm quite opposed to the idea "from should be first".
I want to understand what exactly the query returns, not the implementation detail of the source of this data (that can later be changed).
Literally first example from page - I have no idea what is being returned:
from employees
filter country = "USA" # Each line transforms the previous result.
let gross_salary = salary + payroll_tax # This _adds_ a column / variable.
let gross_cost = gross_salary + benefits_cost # Variables can use other variables.
filter gross_cost > 0
aggregate by:[title, country] [ # `by` are the columns to group by.
    average salary,          # These are the calcs to run on the groups.
    sum salary,
    average gross_salary,
    sum gross_salary,
    average gross_cost,
    sum gross_cost,
    count,
]
sort sum_gross_cost # Uses the auto-generated column name.
filter count > 200
take 20
Of course, similar things are happening in SQL too, with CTEs becoming more widespread and the "real" list of columns hidden somewhere inside, but it's still parseable.
Quick, what is this query about? What's ironic is that I think you have it backwards: the columns are the implementation detail, not the table. The table is the context: you can't change that without having to change everything else. But columns are the last step, the selection after the filters, joins, etc. They can be changed at any time without affecting the logic.
This is... an odd choice. I'd assume I'm not looking at a query without the context to know why I would want those columns.
And the autocomplete story is backwards. Often I know what columns I want, but I'm not clear what table I need to get them from. Such that, if you made a smarter suggestion in the from that only includes tables that have the columns, I'd be much happier.
It's a query asking for the id, name, and author fields. Very straightforward; I have no idea how this is confusing.
> The table is the context: you can't change that without having to change everything else.
Except even in the provided single-table example this isn't true - you're getting subselected/CTEd results. No functional joins are demonstrated unfortunately.
For example:
from employees
left_join positions [id=employee_id]
...is equivalent to...
SELECT * FROM employees LEFT JOIN positions ON id = employee_id
No data is selected from positions in either example, and it's unclear on why we're joining that table (other than just for the heck of it). It's not a workable example.
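For what it's worth, the workable version of that join aliases each side and selects qualified columns. A sketch with made-up columns, runnable against SQLite:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE positions (employee_id INTEGER, title TEXT)")
conn.execute("INSERT INTO employees VALUES (1, 'ann')")
conn.execute("INSERT INTO positions VALUES (1, 'dev')")

# Alias each table and select qualified columns, so it's explicit
# which side every output column comes from.
rows = conn.execute("""
    SELECT e.name, p.title
    FROM employees e
    LEFT JOIN positions p ON e.id = p.employee_id
""").fetchall()
```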
The big advantage of "from first" like we have in Kusto KQL (a query language we use at Microsoft) is that it provides much better autocomplete (if I write the `from`, it can easily autocomplete the projection).
If you want an interesting example of how a query language built for developer experience and autocompletion looks, definitely check it out!
That's interesting because it also explains why I was going to say I do like having from first. When trying to reason about a query, I mentally go through the following:
1. What tables are being pulled from? This speaks to the potential domain of the query.
2. What data is being selected (I can now know what is or isn't being pulled from the aforementioned tables...)
3. What operations, aggregations, groupings, etc. are being performed to work on the pulled data
Of course from vs. select ordering is completely arguable, but my thinking process seems to follow that of the autocomplete: in other words, my cognitive load when looking at the select statement is lessened when I know what the columns are being selected from.
It also follows (at least to me) the mental process of writing the query. First look at the tables, then decide what columns, then decide what functions to apply.
I said it in a sibling, but I feel this is somewhat missed. Auto complete that simply lists the tables is easier if from is first. But... Auto complete that helps me know what tables can give me my requested columns works the other direction.
I think it's quite a common convention in engineering - not just software - that the input to a process "goes in the top and out the bottom". We humans read top->bottom (regardless of left/right/vertical, I don't know any languages that write bottom up). Conventional voltage in circuit diagrams usually flow top to bottom. Gravity loads in schematics flow top to bottom. Chemical pathways are usually written top to bottom. And of course functions take arguments up top and return at the bottom, maybe with some short circuits. I think the only counter example of note is distillation columns.
Where is the data coming from? Employees table. What's coming out? 20 rows of sum_gross_cost.
What could improve this is function signatures. It's kind of nice to have the whole abstraction up top...like an abstract.
I agree that the columns of the results should be more obvious. But I am a proponent of "from should be first". I have never written a SQL query without thinking about the contents of a table or its relations. If it was my way, I would describe where the data I'm pulling from, then describe any filters/joins, then describe the columns that I'm interested in (last).
It's a fair sentiment, but it can be handled without losing directional flow and composability, some of the bigger advantages of reworking SQL.
One idea would be along the lines of a function prototype: a declaration, up front, about the columns and types that a query is expected to return. It's a good place to put documentation, it's redundant information which should protect against mistakes but not so redundant that it would be too taxing - the author should know what the query returns. The prototype would only be used for validation of column names and types.
Another idea would be requiring the last element in a query to be a projection, a bit like the return statement in a function body: here's what I'm returning out of the grand set of symbols available (e.g. via various joins) in scope from previous operations in the flow.
Without the `let` I would imagine having trouble reading it as well, I'm not sure if that would go away with familiarity but my instinct is that it's a useful addition.
This feels like an English-language thing. In English we tend to put our adjectives first; it feels natural: "Where is my red, round ball?", rather than some other languages (like German) where you put the subject first - the equivalent of "Where is my ball, red & round?"
While it inherently feels unnatural, I do agree with the others here that the context is actually easier to understand once you're over the initial discomfort.
_from_ is one of the most important pieces of context about the data being returned. It provides the type information. Columns you select are just properties of that type.
In SQL, where _from_ is placed at the end, we are essentially writing the equivalent of 'property.object', e.g. name.person, age.person.
Both CTEs and this idea address the same problem: poor readability of complex SQL queries. Compared to CTEs, the author takes the idea to split the complex query into parts to the next level.
To your point - a solid IDE will show you what's being processed at each line (or returned, if the cursor is on the last line) - in an autocomplete window or a side panel.
Thanks.
I just fleshed out composing CTEs, which is a small step towards the broader goal of making composition easier: https://github.com/max-sixty/prql/commit/dc68fcaaceef26cc078...
Let me know if you have a good case of the sort of composition you find difficult in SQL (either here or in an issue). Thank you!
And I agree with you on both the assignments and `split` being a bit awkward. Kusto just uses `by`, WDYT?
I just look at something like this and I immediately know what's going on. If it's nested subqueries, it always takes me much longer.
The DB we use supports those, I just learned about them too late so keep forgetting they exist :(
> It would really help in SQL if you could refer to column aliases so you don't have to repeat the expression.
The DB we use supports that, so in your CTE you could write
We do that all the time, which will be a pain now that we're migrating to a different DB server which doesn't.
I opened an issue here: https://github.com/max-sixty/prql/issues/2
I don't have an example on the Readme, but I was thinking of something like (toy example):
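A sketch, where `table` names an intermediate result that later steps can join (the `table` keyword and the join shorthand are hypothetical):

```
table newest_employees = (
  from employees
  sort tenure
  take 50
)

table average_salaries = (
  from salaries
  aggregate by:country [country_salary: average salary]
)

from newest_employees
join average_salaries [country]
select [name, salary, country_salary]
```

Each named table compiles down to a CTE, and the final pipeline joins the two.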
Or were you thinking of something more sophisticated? I'm keen to get difficult examples!

Edit: formatting
There, each variable can be referenced by downstream steps. Generally, the prior step is referenced. Without table variables, your language implicitly pipes the most recent one. With table references, you can explicitly pipe any prior one. That way, you can reference multiple prior steps for a join step.
I haven't thought through that fully, so there may be gotchas in compiling such an approach down to SQL, but you can already do something similar in SQL CTEs anyway, so it should probably work.
Let me know any feedback — as you can see it's still at the proposal stage. If it gains some traction I'll write an implementation.
I guess the biggest difference between FunSQL (and similarly dbplyr) and PRQL is that the former needs a Julia (or R) runtime to run.
I really respect the library and am keen to see how it develops.
I found the "# `|` can be used rather than newlines." a bit odd. So when using let, you're only transforming one column? I think the example would look weird with returns instead of |.
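For reference, my reading of the proposal is that these two forms are equivalent (syntax as I understand it, so details may be off):

```
from employees | filter country = "USA" | take 20
```

```
from employees
filter country = "USA"
take 20
```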
Depending on your intended target, it might help adoption if you stay closer to the naming conventions of that target. If you're targeting mainstream Java/Python/C#/Javascript etc. then functions need parentheses, "take 20" may be worse than slice, etc.
I think annotating microversions would get tiresome fast. I think the right way to think of this is that you put in a single version number like 1, and then only ever change that if you need to make backwards-incompatible changes that cannot be handled by clever hacks in the runtime.
Also I think you should try writing one or more native wrappers in your intended target languages to make sure it's easy to interface between the two, even if it means you'd have to use dots in that language.
I could imagine an end game where the ergonomics were so good that a database like Postgres ends up with a native PRQL frontend. Not sure you're there yet, though. IMHO SQL as a query language suffers from a) sometimes really bad ergonomics, b) it's hard to wrap in another programming language (also ergonomics), c) it has far too many concepts - it's not orthogonal.
Thanks, good idea - I just changed this to remove the microversions. If we use SemVer, then before `1.0` we'd hold versions compatible within each 0.X, and afterwards within each major version X.
Also want "OFFSET 20" from PG.
Some programmers like you see that SQL ordering is backwards to human thinking, except in the simplest cases. But many people with practice and sunk costs in their SQL expertise will be resistant. The resistance usually wins the day.
But sometimes, a useful tool gets created by one person, and a rift is created in that resistance. Think John Resig creating jQuery, leading to LINQ and many other similar patterns. You could be that person for database query languages, but how do you ensure that?
Maybe imagine what made jQuery easy to adopt and indispensable for programmers: easy availability as a simple .js download; solved the problem of DOM differences between browsers. Good luck to you, and thanks for sharing.
And then dump the queries via https://stackoverflow.com/questions/1412863/how-do-i-view-th... or https://www.linqpad.net/ ?
My only gripe is the 'auto-generated' column names for aggregates. This seems like a recipe for disaster - what if there is already a column (as there almost certainly will be) named "sum_gross_cost"? The behavior also just seems rather unexpected and implicit. My suggestion would be simple syntax that lets you optionally give a name to a particular aggregate column:
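Something like this (hypothetical naming syntax, mirroring the aggregate form from the README):

```
aggregate by:[title, country] [
  total_cost: sum gross_cost,   # explicit name, no auto-generated sum_gross_cost
  average salary                # unnamed: falls back to the auto name
]
```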
While it might seem a little uglier, it seems much more sustainable in the long run. If this is really too gross, I'd advocate some token other than underscore that is reserved for aggregation variables; perhaps `sum@gross_cost` or `sum#gross_cost`.

For what it's worth, a similar problem already exists with SQL. Something simple like
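(taking PostgreSQL as the dialect:)

```sql
SELECT count(*) FROM my_table;
```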
automatically aliases the column to `count`, even if `my_table` has a column called `count`.

In practice, I don't think this is a major problem.
I'm not sure whether we should force naming? When I'm writing a query often I'm fine with something auto-generated when starting out.
I want to understand what exactly the query returns, not the implementation detail of the source of this data (that can later be changed).
Literally first example from page - I have no idea what is being returned:
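For concreteness, that first example is roughly this (reconstructed from memory, so details may differ):

```
from employees
filter country = "USA"
derive [
  gross_salary: salary + payroll_tax,
  gross_cost:   gross_salary + benefits_cost
]
aggregate by:[title, country] [
  average salary,
  sum gross_cost
]
sort sum_gross_cost
filter count > 200
take 20
```

Nothing up top tells you which columns come out at the end; you have to trace the whole pipeline.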
of course, similar things are happening to SQL too, with CTEs becoming more widespread and the "real" list of columns hidden somewhere inside, but it's still parseable.

And the autocomplete story is backwards. Often I know what columns I want, but I'm not clear what table I need to get them from. Such that, if you make the suggestions in the from smarter, to only include tables that have the columns, I'd be much happier.
> The table is the context: you can't change that without having to change everything else.
Except even in the provided single-table example this isn't true - you're getting subselected/CTEd results. No functional joins are demonstrated unfortunately.
For example:
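The join example looks roughly like this (reconstructed; the join condition matches the README's employee_id = id):

```
from employees
join positions [employee_id=id]
filter country = "USA"
select [name, salary]
```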
No data is selected from positions in either example, and it's unclear why we're joining that table (other than just for the heck of it). It's not a workable example.

If you want an interesting example of how a query language built for developer experience and autocompletion looks, definitely check it out!
1. What tables are being pulled from? This speaks to the potential domain of the query.
2. What data is being selected? (I can now know what is or isn't being pulled from the aforementioned tables.)
3. What operations, aggregations, groupings, etc. are being performed to work on the pulled data?
Of course, from-vs-select ordering is completely arguable, but my thinking process seems to follow that of the autocomplete - in other words, the cognitive load of looking at the select statement is lessened when I know from what tables the columns are being selected.
It also follows (at least to me) the mental process of writing the query. First look at the tables, then decide what columns, then decide what functions to apply.
The language should be right for human understanding, not automated mad-lib generation.
Where is the data coming from? Employees table. What's coming out? 20 rows of sum_gross_cost.
What could improve this is function signatures. It's kind of nice to have the whole abstraction up top...like an abstract.
For example:
One idea would be along the lines of a function prototype: a declaration, up front, about the columns and types that a query is expected to return. It's a good place to put documentation, it's redundant information which should protect against mistakes but not so redundant that it would be too taxing - the author should know what the query returns. The prototype would only be used for validation of column names and types.
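A sketch of what such a prototype might look like (entirely invented syntax, just to illustrate the idea):

```
# Invented syntax: declare the expected output shape up front.
# Used only to validate the result's column names and types.
returns [title: text, country: text, sum_gross_cost: float]

from employees
aggregate by:[title, country] [sum_gross_cost: sum gross_cost]
```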
Another idea would be requiring the last element in a query to be a projection, a bit like the return statement in a function body: here's what I'm returning out of the grand set of symbols available (e.g. via various joins) in scope from previous operations in the flow.
Reading the comment however, it would seem that `let` adds columns which are implicitly returned in the order they are defined.
I do see benefits in this, and can imagine pitfalls. Hard to judge without kicking the tires.
Update: It's quite possible we saw different syntax!
https://news.ycombinator.com/item?id=30063266
Without the `let` I would imagine having trouble reading it as well, I'm not sure if that would go away with familiarity but my instinct is that it's a useful addition.
While it inherently feels unnatural, I do agree with the others here that the context is actually easier to understand once you're over the initial discomfort.
In SQL, where _from_ comes after the column list, we are essentially writing the equivalent of 'property.object'; e.g. name.person, age.person.
To your point - a solid IDE will show you what's being processed at each line (or returned, if the cursor is on the last line) - in an autocomplete window or a side panel.