One of the wildest R features I know of comes as a result of lazy argument evaluation combined with the ability to programmatically modify the set of variable bindings. This means that functions can define local variables that are usable by their arguments (i.e. `f(x+1)` can use a value of `x` that is provided from within `f` when evaluating `x+1`). This is used extensively in practice in the dplyr, ggplot, and other tidyverse libraries.
I think software engineers often get turned off by the weird idiosyncrasies of R, but there are surprisingly unique (arguably helpful) language features most people don't notice. Possibly because most of the learning material is data-science focused and so it doesn't emphasize the bonkers language features that R has.
I saw a funny presentation where Doug Bates said something like: "This kind of evaluation opens the door to do many strange and unspeakable things in R... for some reason Hadley Wickham is very excited about this."
One of the stranger behaviours for me is that R allows you to combine infix operators with assignments, even though there are no implemented instances of it in R itself. For example:
`%in%<-` <- function(x, y, value) { x[x %in% y] <- value; x}
x <- c("a", "b", "c", "d")
x %in% c("a", "c") <- "o"
x
[1] "o" "b" "o" "d"
Or slightly crazier:
`<-<-` <- function(x, y, value) paste0(y, "_", value)
"a" -> x <- "b"
x
[1] "a_b"
Antoine Fabri and I created a package that uses this behaviour for some clever replacement operators [1], but beyond that I don't see where this could be useful in real practice.
That sounds like asking for trouble. Someone coming from any other programming language could easily forget that expression evaluation is stateful. Better to be explicit and create an object representing an expression. Tell me, at least, that the variable is immutable in that context?
The good news is that most variables in R are immutable with copy-on-write semantics. Therefore, most of the time everything here will be side-effect-free and any weird editing of the variable bindings is confined to within the function. (The cases that would have side effects are very uncommonly used in my experience)
Asking out of lack of experience with R: how does such an invocation handle the case when `x` is defined with a different value at the call site?
In pseudocode:
f =
let x = 1 in # inner vars for f go here
arg -> arg + 1 # function logic goes here
# example one: no external value
f (x+1) # produces 3 (arg := (x+1) = 2; return arg +1)
# example two: x is defined in the outer scope
let x = 4 in
f (x+2) # produces 7 (arg := (x+2) = 6; return arg + 1)? Or 4 if inner x wins as in example one?
If the function chooses to overwrite the value of a variable binding, it doesn't matter how it is defined at the call site (so inner x wins in your example). In the tidyverse libraries, they often populate a lazy list variable (think python dictionary) that allows disambiguating in the case of name conflicts between the call site and programmatic bindings. But that's fully a library convention and not solved by the language.
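To make the "inner x wins" answer concrete, here is a rough Python analogy. Python has no lazy arguments, so the expression is passed as a string and only evaluated inside the function. The names `f`, `inner` and `outer` are made up for this sketch; it imitates the behaviour and is not how R or the tidyverse implement it.

```python
def f(expr, caller_vars):
    """Evaluate expr with f's own binding of x shadowing the caller's x."""
    inner = {"x": 1}  # bindings supplied by the function itself
    # Later keys win, so inner bindings override the caller's on a name clash.
    # "inner" and "outer" are hypothetical handles for explicit disambiguation.
    scope = {**caller_vars, **inner, "inner": inner, "outer": caller_vars}
    arg = eval(expr, {"__builtins__": {}}, scope)  # evaluated only now, inside f
    return arg + 1


no_outer = f("x + 1", {})                  # x comes from f
shadowed = f("x + 2", {"x": 4})            # f's x wins over the caller's x
explicit = f('outer["x"] + 2', {"x": 4})   # caller's x requested explicitly
print(no_outer, shadowed, explicit)        # 3 4 7
```

The `outer["x"]` form plays the role of the disambiguating variable mentioned above: it is purely a convention of the function, not something the language enforces.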
Well the point is that the function can define its own logic to determine the behaviour. Users can also (with some limits) restrict the variable scope.
A lot of the time you're not actually using what is passed to the function, but instead the name of the argument passed to the function (f(x) instead of f('x')), which helps the user with their query (dplyr) or configuration (ggplot2).
> I think software engineers often get turned off by the weird idiosyncrasies of R
That was at least true when I was looking at it. I didn't get it, but the data guys came away loving it. I came away from that whole experience really appreciating how far you can get with an "unclean" design if you persist, and how my gut feeling of good (with all the heuristics for quality that entails) is really very domain specific.
I had a colleague at Google who used to say: "The best thing about R is that it was created by statisticians. The worst thing about R is that it was created by statisticians."
(author here, still getting over the first time I've seen one of my own posts on this site)
The many recommendations for J here are a great nudge for me to give it a proper go. I've taken quite a liking to the traditional APL glyphs ( see a photo of the stickers on my laptop keys in this post https://jcarroll.com.au/2023/12/10/advent-of-array-elegance/ ) so I'm not looking for a way to avoid them.
Another detraction I've seen around is about the ambivalence of APL glyphs (taking either 1 or 2 arguments and doing something different in each case). I don't particularly mind it because I think it becomes more natural to "understand" how a function is being used the more familiar you become with it, but without the limitation on the number of glyphs, I can see the benefit of separating those.
I have not. I've done at least one problem (for some, many more) in each of 32 languages on Exercism so far, though. Looking at your example, there are some familiar features from the lisp and ML families.
>> find the GCD (greatest common divisor) of the smallest and largest numbers in an array
Just for a short comparison, in J the analogous code is <./ +. >./
Where / is for reduce, +. is for the GCD, the LCM is *.
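For readers who don't read J, the same computation (GCD of the smallest and largest numbers in an array) might look like this in Python. The function name `gcd_min_max` is made up for illustration, and `reduce` is used only to mirror the J reduce; plain `min(numbers)` and `max(numbers)` would do the same job.

```python
from functools import reduce
from math import gcd


def gcd_min_max(numbers):
    """GCD of the smallest and largest values in a non-empty sequence."""
    smallest = reduce(min, numbers)  # fold the list with min
    largest = reduce(max, numbers)   # fold the list with max
    return gcd(smallest, largest)


print(gcd_min_max([12, 18, 30, 42]))  # 6
```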
The basic idea of J notation is using some small change to mean the contrary: for example {. for first and {: for last, {. for take and }. for drop (one symbol can be used as a unary or binary operator with a different meaning). So if floor is <. you can guess what the symbol for ceiling will be. For another example, /:~ sorts in ascending order, and I imagine you can guess the symbol for sorting in descending order. In a sense, J notation includes some semantic meaning, so an LLM could use that notation to try to change an algorithm. Perhaps someone could think about how to expand this idea so that LLMs can generate new algorithms.
The matrix m, the sum of the rows, and the maximum of the sum of the rows in J (separated by ;)
m ; (+/ m) ; >./ +/ m
┌─────┬───────┬──┐
│0 1 2│9 12 15│15│
│3 4 5│ │ │
│6 7 8│ │ │
└─────┴───────┴──┘
To understand this you need to know that >. and <. are the max and min functions, and that in J three functions separated by spaces, f g h, constitute a new function mathematically defined by (f g h)(x) = g(f(x), h(x)). An example is (+/ % #), which applied to a list gives the mean of the list. Here +/ gives the total, # gives the number of elements and % is the quotient.
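The three-function rule can be imitated in Python as a small sketch. The helper name `fork` is borrowed from J terminology but is a made-up function here, and the nested list stands in for the 3-by-3 matrix m shown in the boxed output.

```python
from operator import truediv


def fork(f, g, h):
    """Build the function x -> g(f(x), h(x)), like J's three-function train."""
    return lambda x: g(f(x), h(x))


# (+/ % #): total divided by the number of elements
mean = fork(sum, truediv, len)
print(mean([1, 2, 3, 4]))  # 2.5

# +/ m adds the rows of m together; >./ then takes the maximum of that result
m = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
row_sum = [sum(col) for col in zip(*m)]
print(row_sum, max(row_sum))  # [9, 12, 15] 15
```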
> "So, would APL be “readable” if I was more familiar with it? Let’s find out!"
An alternative test for this hypothesis might have been using the language J, which is an array language based on APL and by the designer of APL but only using ASCII characters.
R itself could be considered a test of this hypothesis, too. It’s been said that elegant, powerful Lisp would be more widely adopted if it wasn’t for all those gosh-darned parentheses.
Well, at its core R is a Lisp (specifically, Scheme) but with a more traditional syntax (infixed operators, function calls, etc). And it’s fair to say the adoption of R has, indeed, been more widespread than that of Lisp.
I’m not totally convinced that being ‘secretly a lisp’ is what was good about R. I think the easy vectorisation is good, and the consequences of the bizarre function argument evaluation are good. I don’t know of lisps that do the vectorisation stuff so naturally, and while I guess fexprs are a thing, I think they are possibly too general in the syntax they can accept – basically the simplicity of lisp syntax allows macros to have more tree-structured input in a way you wouldn’t want for a language with non-lisp syntax (where the head lives outside the list), and I think the flexibility makes the syntax more confusingly non-uniform.
I'm not sure I would come to this conclusion. R has some adoption, but it's also really not used as a generic programming language, which most Lisp dialects are.
As someone who loved learning lisp and regrets that the long course of my programming career has never led me to use it in a professional capacity, I just don't buy it when people say that parentheses are the reason people didn't adopt lisp more widely. I would say the main reasons are:
1) The language is so frikkin massive. Common lisp is a huge language with hundreds and hundreds of built-in functions etc and the standard came very late in its evolution so there is a bunch of back compat cruft and junk that everyone has to live with. The object system is a whole epic journey in itself. You could probably kill or at least seriously injure someone with the impact if they were lying down and you dropped a copy of Guy Steele's excellent book[1] on them from a standing height.
2) The ecosystem is so fragmented. First you have Common Lisp, which isn't very common at all. Then you have all the vendor lisps. Then you have whether they have or don't have clos to contend with. Elisp is a lisp but is not common lisp and differs in some important ways that I don't quite remember. Then there's scheme, and guile scheme (which isn't quite the same) then clojure, etc etc.
3) That meant that the tooling was basically all simultaneously amazing and awful. As an example my uncle wrote a tcp/ip stack in lisp for the symbolics lisp machine[2] for a project when he worked at xerox. He told me in the late 80s about features in the symbolics debugger that just totally blew my mind and are only now available in IDEs for other languages, like being able to step backwards, alter variables, then step forward again, jump to any stack frame and just resume execution from there etc etc. On the other hand he had to write the TCP/IP stack himself because they didn't have one. I think that perfectly encapsulates the lisp experience for me around 2000 when I last used it - some things worked amazingly and were way better than anything else (eg I remember at the time the things you could do with serialization being just extraordinary compared to other languages) but a bunch of basic stuff was painful, janky or just completely missing.
4) Some of the concepts are very powerful but result in programs that are incredibly hard to understand. Macros, continuation passing, multiple dispatch.. etc etc. This puts a lot of people off because they just hit the learning cliff face-first and give up.
This is part of why python saw such wide adoption in my opinion. Not because it was in any sense the best language, but it was a very easy, practical choice for doing a bunch of things.
I personally think APL is wonderful simply because of the original APL specific keyboard [1]
I've looked briefly at R and found the syntax and semantics to be less than stellar. Obviously there's going to be some bias in that sentiment due to me not generally doing "array programming", but I don't believe the things that irked me were entirely a result of that.
The more annoying stuff for R is entirely second hand. As far as I can tell R (or at least RStudio) maintains implicit state between runs, which means you can get to a position where the same code works on some runs and then not on later runs. My friend was having to do a lot of bioinformatics processing (many of the libraries for this are in R) and was constantly fighting to get the code she wrote to process the data or produce charts to keep working (publications in bioinformatics have an acceptance bias for "looks like it came from R" that is similar to what CS [used to?] have for gnuplot). You could run the same scripts on the same input and have them fail where previously they worked. This is before you deal with inter-version compatibility problems, which also seemed frequent.
What was irksome to me, looking at a lot of the stuff they were doing, is that it was fundamentally mostly basic scripting stuff you could do in other languages trivially (and more cleanly imo). There were a bunch of functions (builtin or from libraries?) that did the work, but those functions weren't in R, so the claims that R was "necessary" seemed fairly bogus to me.
You can save your workspace (state) in R. It's generally bad practice to do so.
R is VERY VERY good at handling tabular data. Python can get kind of close with Pandas but IMO, it's still more awkward than base R data frames and way worse than data.table.
R also has a lot of built-ins geared for statistics and built by statisticians. If you're doing statistics there's value in not having to find a library or libraries that do that.
> [R/RStudio] maintains implicit state between runs...
That can be turned off, and in fact it is widely recommended not to keep one's workspace between runs.
> This is before you deal with inter-version compatibility problems which also seemed frequent.
Yeah, that can be a problem with libraries (as it is with python dependencies). It really afflicts long-running projects. R has taken a cue from the python world there. renv is the best way (IMHO) to maintain a reproducible environment in R (https://rstudio.github.io/renv/articles/renv.html).
R is nicely cogent in syntax and largely "just works" once you accept its idiosyncrasies.
R has a lot of high-quality packages which implement e.g. frequently used, sophisticated regression analysis algorithms. Python has these too, but in my experience they are not as well tested and suffer from bugs.
> what if we just generate all products from the set of numbers 2:n and exclude those as "not prime" from all the numbers up to n?
It's fun to translate terse APL to somewhat terse numpy. The result still can be very compact and you can parse it easily if you're used to looking at numpy:
from numpy import arange, outer
s = arange(2, 50); p = outer(s, s).ravel(); sorted(set(s) - set(p))
What's interesting there is that numpy is inspired (more than a little) by APL and aims to bring that 'array' thinking to python. I agree that thinking in this 'array' way helps to better construct a solution in any language, so I'm leaning towards 'designing' with APL glyphs, even if that's not the language I'm implementing the thing in.
You can ignore this, given that I haven't used either APL or J seriously, but if I were to truly dive in, I'd lean towards APL exactly because of its non-ASCII, symbolic leanings. The only similar thing I know of is operator overloading, and whenever that is used, I have to relearn what each operator does in a certain context. It only sticks if you use it regularly, like regex, which changes the meaning of the operators but, being an entire DSL, is too different for me to think + means sum. If an entirely different symbol is introduced, I'm not assigning any prior functionality to it, which is why I think it should be easier.
[1]: https://github.com/moodymudskipper/inops
https://blog.moertel.com/posts/2006-01-20-wondrous-oddities-...
Based on the examples, no, I cannot. It could be either of <: and >.
[1] https://www.cs.cmu.edu/Groups/AI/html/cltl/cltl2.html . Paul Graham (yes that Paul Graham) wrote a good lisp book also, although for me Steele is the one.
[2] https://en.wikipedia.org/wiki/Symbolics
[1] https://en.wikipedia.org/wiki/APL_(programming_language)#/me...
https://www.jsoftware.com/indexno.html
https://code.jsoftware.com/wiki/System/Installation <- install
https://code.jsoftware.com/wiki/Guides/Getting_Started <- help