Readit News
lynndotpy · 2 years ago
Scientist and programmer here, and my experiences are the opposite. I value keeping things "boringly simple", but I desperately wish there was any kind of engineering discipline.

First is the reproducibility issue. I think I've spent about as much time simply _trying_ to get the dependencies of research code to run as I have writing code or doing actual research in my PhD. The simple thing is to write a requirements.txt file! (For Python, at least.)
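For anyone unfamiliar: even a minimal pinned requirements.txt goes a long way (package versions here are just illustrative):

```text
# requirements.txt -- pin exact versions so the environment is reproducible
numpy==1.24.3
pandas==2.0.1
matplotlib==3.7.1
```

`pip freeze > requirements.txt` captures the environment that worked; `pip install -r requirements.txt` restores it later.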

Second, two anecdotes where not following best practices ruined the correctness of research code:

- Years ago, I was working on research code which simulated a power-grid. We needed to generate randomized load profiles. I noticed that each time it ran, we got the same results. As a software engineer, I figured I had to re-set the `random` seed, but that didn't work. I dug into the code, talked to the researcher, and found the load-profile algorithm: It was not randomly generated, but a hand-coded string of "1" and "0".
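For contrast, a genuinely randomized but still reproducible load profile is only a few lines. This is a hypothetical sketch of what such code could look like, not the original:

```python
import random

def random_load_profile(n_steps: int, on_probability: float = 0.5,
                        seed: int = 42) -> list[int]:
    """Generate a reproducible randomized on/off load profile.

    The seed makes every run produce the same profile (good for
    reproducible experiments); vary the seed to get new profiles.
    """
    rng = random.Random(seed)  # local generator: doesn't touch the global state
    return [1 if rng.random() < on_probability else 0 for _ in range(n_steps)]
```

Using a local `random.Random` instance instead of re-seeding the global module also avoids the exact confusion described above, where seeding `random` has no visible effect.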

- I later had the pleasure of adapting someone's research code. They had essentially hand-engineered IPC. It worked by calling a bash script from Python, which would open other Python processes and generate a random TCP/IP socket, the value of which was saved to an ENV variable. Assuming the socket was open, the Python scripts would then share their socket names via files for the other processes to read and open. To prevent concurrency issues, sleep calls were sprinkled throughout the Python and bash scripts. This was four Python scripts and two shell scripts, and to this day, I do not understand why it wasn't just one Python script.

cauch · 2 years ago
My problem with this discussion is that a lot of people just say "I'm a scientist (or I'm working with scientists) and I'm observing X so I can say 'scientists blahblahblah'".

Different scientific research fields use widely different software environments and have their own habits and traditions. The way a biologist uses programming has no reason to be similar to the way an astrophysicist does: they have not experienced the same software environment at all. It may even be misleading to talk about "scientists" within a single field, as two labs working in the same field may have very different approaches (though this is less likely when they share frameworks).

So I'm not at all surprised that you observed the opposite. The same way I'm not surprised to see someone report the opposite experience when someone else claims "European people use a lot of 'g' and 'k' in their words" just because they observed what happened in Germany.

pphysch · 2 years ago
I don't think there is much variance in quality of software among (radically different) fields of science.

One of the most poorly engineered products I work with was created by a few academic CS guys. The core algorithms are sophisticated and ostensibly implemented well, but the overall product is a horrible mess.

The incentives of academia make this obvious. You need to write some code that plausibly works just enough to get a manuscript out of it, but not much else. Reproducibility is not taken that seriously, and "productization"/portability/hardening is out of the question.

pennomi · 2 years ago
Absolutely my experience as well. Scientists write code that works, but is a pain to reproduce in any sort of scalable way. However it’s been getting better over time as programming is becoming a less niche skill.
BobbyJo · 2 years ago
The problem I've run into over and over with research code is fragility. We ran it on test A, but when we try test B nothing works and we have no idea why because god forbid there is any error handling, validation, or even just comprehensible function names.
astrobe_ · 2 years ago
This is partly because, in my opinion, some "best practices" are superstitions.

A practice was "best" because of some issue with 80s-era computing but is now completely obsolete; the problem has been solved in better ways or has disappeared entirely thanks to, e.g., better tooling or better, well, practices. Hungarian notation, for example. Yet it is still passed down as a best practice and followed blindly because that's what they teach in schools. But nobody can tell why it is "good", because it is no longer relevant.

Scientific code has no superstitions (as expected, I would say), but not for the best reasons: scientists didn't learn the still-relevant good practices either.

tchalla · 2 years ago
I wish we communicated the intent of the “best practice” instead of the practice itself.
DragonStrength · 2 years ago
Actually, when I’ve followed those guidelines, it’s because the tech lead graduated in the 1980s, almost certainly learned it all on the job, and has always done it that way. Others just do what they’ve done before. School talked about those things, but not in a “this is the right way” sort of way.
quickthrower2 · 2 years ago
There is no best practice. It is good to know the tools. In a dojo, do that crazy design pattern shit, and also do the crazy one long function. Do some C#, Java, JS, Go, TypeScript, Haskell, Ruby, Rust (not necessarily those, but a big variety). I want the next person to understand my code - this is very important, probably more important than time spent or performance. Spending another 10% refactoring to make the code easier to understand, even if that's just adding good comments, is well worth it. Make illegal state impossible, if you can (e.g. don't store the calculated value, and if you do then design it so it can't be wrong!). Make it robust. Pretend it'll page you at 2am if it breaks!
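The "don't store the calculated value" advice can be sketched in a few lines of Python (a toy example, names are made up):

```python
from dataclasses import dataclass

@dataclass
class Order:
    unit_price: float
    quantity: int

    @property
    def total(self) -> float:
        # Derived on demand: there is no stored `total` field that
        # could ever drift out of sync with unit_price * quantity.
        return self.unit_price * self.quantity
```

The illegal state (a `total` that contradicts its inputs) simply cannot be represented, so no code path needs to keep it consistent.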
jayd16 · 2 years ago
Such as what? I don't really know of any such superstitions that are based on nothing.

I see a lot of opinion/taste presented as something more, but I really can't think of superstitions.

thethimble · 2 years ago
OOP madness? XML? Web scale databases?

Perhaps not superstition but certainly fundamentalist/hype-based thinking.

astrobe_ · 2 years ago
I saw an example the other day that annoyingly escapes my mind now, as it has been sort of overwritten by the "why the heck do some people name Makefiles with a capital M!?" pet peeve.

But I'd say a bit of everything listed in TFA. For instance, global variables are the type of thing which makes a little voice say "if you do that, something bad will eventually happen". The voice of experience sometimes says things like that, though.

ndriscoll · 2 years ago
I don't know if I'd call it a superstition exactly, but there's a subset of people who are fine with foo1.plus(foo2) and bar1.plus(bar2) where foo and bar are different types, but for some reason, "foo1 + foo2" and "bar1 + bar2" is "confusing" or somehow evil. It feels a bit like they're superstitious about it. I get a similar vibe from people who have an aversion to static type inference.
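In Python, for instance, the two spellings are literally the same method (a toy sketch):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Vector:
    x: float
    y: float

    def __add__(self, other: "Vector") -> "Vector":
        # `v1 + v2` is just sugar for `v1.__add__(v2)`; nothing mystical.
        return Vector(self.x + other.x, self.y + other.y)

    # An explicit-name alias for readers who prefer `v1.plus(v2)`.
    plus = __add__
```

Whether `a + b` or `a.plus(b)` is "confusing" comes down to taste: the dispatch is identical either way.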
chaxor · 2 years ago
It is important to have popular and powerful tools that can reduce the amount of code needed for things like caching and building.

For example, Snakemake (an OS-independent make) combined with data version control based on torrents (removing the complication of having to pay for AWS, etc.) for caching build steps would be a HUGE win in the field. *No one has done it yet* (some have danced around the idea), but done well and correctly, it could cut the code and pain of reproducing work by thousands of lines in some projects.

It's important for the default of a data version control system to be either IPFS or torrent, because it's prohibitive to make everyone set up all these accounts and pay storage companies just to run some package. IPFS, torrent, or some other decentralized solution is the only real option.

leptons · 2 years ago
Today's "best practice" is tomorrow's worst practice.

jusssi · 2 years ago
Two more to the scientists' tab:

1. No tests of any kind. "I know what the output should look like." Over time people who know what it should look like leave, and then it's untouchable.

2. No regard for the physical limits of hardware. "We can always get more RAM on everyone's laptops, right?" (You wouldn't need to if you just processed the JSONs one at a time, instead of first loading all of them into memory and then processing them one at a time.)

Also the engineers' tab has a strong smell of junior in it. When you have spent some time maintaining such code, you'll learn not to make that same mess yourself. (You'll overcorrect and make another, novel kind of mess; some iterations are required to get it right.)

lozenge · 2 years ago
Yes, the claim that the scientists' hacked-together code is well tested and even uses valgrind gave me pause. It's more likely there are no tests at all. They made a change, they saw that a linear graph became exponential, and they went bug hunting. But there's no way they have spotted every regression caused by every change.
Asraelite · 2 years ago
Agree with those two problems on the scientist side. I would also add that they often don't use version control.

I think a single semester of learning the basics of software development best practices would save a lot of time and effort in the long term if it was included in physics/maths university courses.

squarepizza · 2 years ago
> I would also add that they often don't use version control.

Working for corporate R&D, I once received a repo on a flash drive. The team would merge changes manually by copy-pasting.

I should've just turned around and left.

2devnull · 2 years ago
1 and 2 are features. Re 1, if someone doesn’t know what the output should look like, they shouldn’t be reusing the code. Re 2, just think a bit more about it and you’ll realize that fretting over RAM that isn’t needed until it’s needed is actually just premature optimization.
gregopet · 2 years ago
Sounds like the non-programmers are good at what they are supposed to be good at (solving the actual problem, if perhaps not always in the most elegant manner) while the programmers should be producing a highly maintainable, understandable, testable and reliable code base (and potentially have problems with advanced algorithms that rely on complicated theorems), but they are not. The OP has a case of bad programmers - the techniques listed as bad can be awesome if used with prudence.

A good programmer has a very deep knowledge of the various techniques they can use and the wisdom to actually choose the right ones in a given situation.

The bad programmers learn a few techniques and apply them everywhere, no matter what they're working on or whom they're working with. Good programmers learn from their mistakes and adapt; bad programmers blame others.

I've worked with my share of bad programmers and they really suck. A good programmer's code is a joy to work with.

galaxyLogic · 2 years ago
Right and I think "scientists" simply are more intelligent than average Joe Coder. Intelligent people produce better software.

It is easy to learn some coding, not so easy to become a scientist.

To become a scientist you must write and get your PhD thesis approved, which must already be about scientific discoveries you made while doing that thesis. Only people with above-average IQ can accomplish something like that, I think.

9dev · 2 years ago
Being intelligent in one domain doesn’t automatically make you good in any others. Exceptional biologists can be astoundingly bad at maths, and the other way around. Like most skills, being good at writing software requires not only intelligence, but lots of experience too. Maybe smarter people will pick it up faster, but they aren’t intrinsically better.

It’s a bit surprising you’d have to explain such a basic conclusion here.

_Wintermute · 2 years ago
In my experience getting a PhD doesn't require above average intelligence, it does require a lot of perseverance and a good amount of organisation though.

I honestly think most skilled tradespeople are more intelligent than me and my PhD holding colleagues.

palata · 2 years ago
> Right and I think "scientists" simply are more intelligent than average Joe Coder. Intelligent people produce better software.

The vast majority of papers I read on topics I know are complete bullshit. Maybe making a PhD was more elitist before, but now it surely isn't.

If we define "scientist" as anyone who publishes papers, then they have the same problem as software engineering: it's mostly made by juniors.

laserbeam · 2 years ago
I agree with the feelings of the author, most software is overengineered (including most of my software).

That being said, most scientific code I've encountered doesn't compile/run. It ran once at some point, produced results, worked for the authors, and got a paper published. The goal for that code was satisfied, and then it somehow rotted away (doesn't work with other compilers, how it gets built was never properly documented, unclear what dependencies were used, dependencies were preprocessed at some point and you can't find the preprocessed versions anywhere, hardcoded data files that are not in the published repos, etc.). I wouldn't use THAT as my compass for how to write higher-quality code.

ShamelessC · 2 years ago
Yeah somehow I suspect this author hadn't yet had to deal with colab notebooks.
noobermin · 2 years ago
Yeah, well, GNOME 2 also doesn't compile or run on my machine. It ran once at some point, but only one of the two is considered a "worse" class of software.
jakobnissen · 2 years ago
I'm a scientist-programmer working in a field composed of biologists and computer scientists, and what I've experienced is almost exactly the opposite of the author.

I've found the problems that biologists cause are mostly:

* Not understanding dependencies, public/private, SCM or versioning, making their own code uninstallable after a few months

* Writing completely unreadable code, even to themselves, making it impossible to maintain. This means they always restart from zero, and projects grow into folders of a hundred individual scripts with no order, depending on files that no longer exist

* Foregoing any kind of testing or quality control, making real and nasty bugs rampant.
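Even a single golden-value regression test addresses that last point. A sketch with a hypothetical analysis function standing in for the real one:

```python
def normalize(values: list[float]) -> list[float]:
    """Hypothetical stand-in for a real analysis step: scale values to sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize an all-zero series")
    return [v / total for v in values]

def test_normalize():
    # A known-input/known-output pair: if a later refactor changes the
    # numbers, this fails immediately instead of silently shifting results.
    assert normalize([2.0, 2.0]) == [0.5, 0.5]
    assert abs(sum(normalize([1.0, 2.0, 3.0])) - 1.0) < 1e-12
```

Nothing fancy, but it is the difference between "I know what the output should look like" living in one person's head versus in the repo.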

IMO the main issue with the software people in our field (of which I am one, even though I'm formally trained in biology) is that they are less interested in biology than in programming, so they are bad at choosing which scientific problems to solve. They are also less productive when coding than the scientists because they care too much about the quality of their work and not enough about getting shit done.

frostix · 2 years ago
>They are also less productive when coding than the scientists because they care too much about the quality of their work and not enough about getting shit done.

Ultimately I’d say the core issue here is that research is complex and those environments are often resource strapped relative to other environments. As such this idea of “getting shit done” takes priority over everything. To some degree it’s not that much different than startup business environments that favor shipping features over writing maintainable and well (or even partially) documented code.

The difference in research that many fail to grasp is that the code is often as ephemeral as the specific exploratory path of research it’s tied to. Sometimes software in research is more general-purpose, but more often it’s tightly coupled to a new idea deep-seated in some theory. Just as exploration paths into the unknown are rapidly explored and often discarded, much of the work around them is as well, including software.

When you combine that understanding with an already resource-strapped environment, it shouldn’t be surprising at all that much of the work done around the science, be it some physical apparatus or something virtual like code, is duct-taped together and barely functional. To some degree that’s by design: it’s choosing where to focus your limited resources, which is on exploring and testing an idea.

Software is very rarely the end goal, just like in business. The difference in business is that if the software is viewed as a long-term asset, more time is spent trying to reduce long-term costs. In research and science, if something is very successful and becomes mature enough that it’s expected to remain around for a while, more mature code bases often emerge. Even then there’s not a lot of money out there to create that stuff, but it does happen, though only after it’s proven to be worth the time investment.

EMCymatics · 2 years ago
>Ultimately I’d say the core issue here is that research is complex and those environments are often resource strapped relative to other environments. As such this idea of “getting shit done” takes priority over everything.

That conforms to my experience

hyperthesis · 2 years ago
maintainable prototypes are overengineered

coldtea · 2 years ago
>I've found the problems that biologists cause are mostly 1. Not understanding dependencies, public/private, SCM or versioning, making their own code uninstallable after a few months

That's not on them though. That's on the state of the tooling in the industry.

Most of the time, dependencies could just be a folder you delete, and that's that (node_modules isn't very far from that). Instead it's a nightmare - and not for any good reason, except historical baggage.

The biologists writing scientific programs don't want "shared libraries" and other such BS. But the tooling often doesn't give them the option.

And the higher level abstractions like conda and pip and poetry and whatever, are just patches on top of a broken low level model.

None of those should be needed for isolated environments, only for dependency installation and update. Isolated environments should just come for free based on lower level implementation.

Master_Odin · 2 years ago
While I agree tooling could be better, in grad school I found that a lot of academics / grad students don't know that any of the tooling even exists, and never bothered to find out whether tooling existed that could improve their lives. Ditto with updating their language runtimes. It really seemed like they viewed code as a necessary evil they had to write to achieve their research goal.
marmalade2413 · 2 years ago
I was going to write a response but you've put what I would have said perfectly. The problem, at least in academia, is the pressure to publish. There is very little incentive to write maintainable code and finalise a project into something accessible to an end user. The goal is to come up with something new, publish, and move on or develop the idea further. This alone is not reason enough to skip practices such as unit tests, containerisation and versatile code, but on top of it, most academic code is written by temporary "employees": PhDs are in a department for 3-4 years, and postdocs are there about the same amount of time.

For someone to shake these bad practices, they need to fight an uphill battle and ultimately sacrifice their research time so that others will have an easier time understanding and using their codes. Another battle that people trying to write "good" code would need to fight is that a lot of academics aren't interested in programming and see coding as simply as means to an end to solve a specific problem.

Also, a few more bad practices to add to the list:

* Not writing documentation.

* Copying, cutting, pasting and commenting out lines of code in lieu of version control.

* Not understanding the programming language they're using, and spending time solving problems that the language has a built-in solution for.

This is at least based on my own experience as a PhD student in numerical methods working with Engineers, Physicists, Biologists and Mathematicians.

jwagenet · 2 years ago
Sometimes I don’t blame people for committing the ‘sin’ of leaving commented-out code; unless you know that code used to exist in a previous version, it may as well have never existed.
civilized · 2 years ago
These patterns appear in many fields. I take it as a sign that the tooling in the field is underdeveloped.

This leads to a split between domain problem solvers, who are driven to solve the field's actual problems at all costs (including unreliable code that produces false results) and software engineers, who keep things tidy but are too risk-averse to attempt any real problems.

I encourage folks with interests in both software and an area of application to look at what Hadley Wickham did for tabular data analysis and think about what it would look like to do that for your field.

mannykannot · 2 years ago
Unreliable code that produces false results does not solve the field's actual problems, and is likely to contribute to the reproducibility problem. It might solve the author's immediate problem of needing to publish something.

Update: I guess I misinterpreted OP's intent here, with "unreliable code that produces false results" being part of the field's actual problems rather than one of the costs to be borne.

noobermin · 2 years ago
Maybe it's biology (or really, maybe not), but honestly it's just the nature of the beast. Fortran is literally the oldest language; the attitude and spirit are simply different from those of software development.
pas · 2 years ago
Journals, research universities/institutions, and grant orgs have the resources and the gatekeeping role to encourage and enforce standards, and to train and support investigators in conducting real science rather than pseudoscience. But these entities are actively disowning their responsibility in the name of empty "empowerment" (of course, since rationally no one has a real chance of successfully pushing through a reform, the smart choice is to just not rock the boat).
fifilura · 2 years ago
Can you elaborate on your thoughts regarding Wickham?
movpasd · 2 years ago
I work in an R&D environment with a lot of people from scientific backgrounds who have picked up some programming but aren't software people at heart. I couldn't agree more with your assessment, and I say that without any disrespect to their competence. (Though, perhaps with some frustration for having to deal with bad code!)

As ever, the best work comes when you're able to have a tight collaboration between a domain expert and a maintainability-minded person. This requires humility from both: the expert must see that writing good software is valuable and not an afterthought, and the developer must appreciate that the expert knows more about what's relevant or important than them.

aleph_minus_one · 2 years ago
> As ever, the best work comes when you're able to have a tight collaboration between a domain expert and a maintainability-minded person. This requires humility from both: the expert must see that writing good software is valuable and not an afterthought, and the developer must appreciate that the expert knows more about what's relevant or important than them.

I do work in such an environment (though in some industry, and not in academia).

An important problem, in my opinion, is that many "software-minded people" have a very different way of using a computer than typical users, and are always learning and thinking about new things, while the typical user is much less willing to be permanently learning (both in their subject-matter area and with computers).

So the differences in mindset and computer usage are, in my opinion, much larger than your post suggests. What you list are, in my experience, differences that are much easier to resolve and, if both sides are open, not really a problem in practice.

ameminator · 2 years ago
> They are also less productive when coding than the scientists because they care too much about the quality of their work and not enough about getting shit done.

You can't solve the first 3 issues without having people who care about software quality. People not caring about the quality of the software is what caused those initial 3 problems in the first place.

jampekka · 2 years ago
And you can't fix any of this as long as "software quality" (the "best practices") means byzantine enterprise architecture mammoths that don't even actually fix any of the quality issues.
hkon · 2 years ago
Yeah, if only scientists would put the same care into the quality of their science...
MrJohz · 2 years ago
I only worked briefly in software for research, and what you described matched my experience, but with a couple of caveats.

Firstly, a lot of the programs people were writing were messy, but didn't need to last longer than their current research project. They didn't necessarily need to be maintained long-term, and therefore the mess was often a reasonable trade-off for speed.

Secondly, almost none of the software people had any experience writing code in any industry outside of research. Many of them were quite good programmers, and there were a lot of "hacker" types who would fiddle with stuff in their spare time, but in terms of actual engineering, they had almost no experience. There were a lot of people who were just reciting the best practice rules they'd learned from blog posts, without really having the experience to know where the advice was coming from, or how best to apply it.

The result was often too much focus on easy-to-fix, visible, but ultimately low-impact changes, and a lot of difficulty in looking at the bigger picture issues.

Regic · 2 years ago
> There were a lot of people who were just reciting the best practice rules they'd learned from blog posts, without really having the experience to know where the advice was coming from, or how best to apply it

This is exactly my experience too. Also, the problem with learning things from youtube and blogs is that whatever the author decides to cover is what we end up knowing, but they never intended to give a comprehensive lecture about these topics. The result is people who dogmatically apply some principles and entirely ignore others - neither of those really work. (I'm also guilty of this in ML topics.)

antisthenes · 2 years ago
> Not understanding dependencies, public/private, SCM or versioning, making their own code uninstallable after a few months

I'm not sure what "uninstallable" code is, but why does it matter? Do scientists really need to know about dependencies when they need the same 3 libraries over and over? Pandas, numpy, Apache arrow, maybe OpenCV. Install them and keep them updated. Maybe let the IT guys worry about dependencies if it needs more complexity than that.

> Writing completely unreadable code, even to themselves, making it impossible to maintain. This means they always restart from zero, and projects grow into folders of a hundred individual scripts with no order, depending on files that no longer exists

This is actually kind of a benefit. Instead of following sunk cost and trying to address tech debt on years-old code, you can just toss a 200-liner script out of the window along with its tech debt, presumably because the research it was written for is already complete.

> Foregoing any kind of testing or quality control, making real and nasty bugs rampant.

Scientific code only needs to transform data. If it's written in a way that does that (e.g. uses the right function calls and returns a sensible data array) then it succeeded in its goal.

> They are also less productive when coding than the scientists because they care too much about the quality of their work and not enough about getting shit done.

Sooo...another argument in favor of the way scientists write code then? Isn't "getting shit done" kind of the point?

cbolton · 2 years ago
Yeah, these problems with "engineer code" the author describes are real, but they're a well-known thing in software engineering. It's exactly what you can expect from junior developers trying to do their best. More experienced programmers have gone through the suffering of having to work on such code, like the author himself, and don't make these mistakes. Meanwhile, experienced scientists still write terrible code...
mazelife · 2 years ago
I'm a software engineer working with scientist-turned-programmers, and what I've experienced is also exactly the opposite of the author. The code written by the physicists, geoscientists and data scientists I work with often suffers from the following issues:

* "Big ball of mud" design [0]: No thought given to how the software should be architected or what the entities that comprise the design space of the problem are and how they fit together. The symptoms of this lack of thinking are obvious: multi-thousand-line swiss-army-knife functions, blocks of code repeated in dozens of places with minor variations, and a total lack of composability of any components. This kind of software design (or lack of design, really) ends up causing a serious hit to productivity because it's often useless outside of the narrow problem it was written to solve and because it's exceedingly hard to maintain or add new features to.

* Lack of tests: some of this is that the scientist-turned-programmer doesn't want to "waste time" writing tests, but more often it's that they don't know _how_ to write good tests. Or they have designed the code in such a way (see above) that it's really hard to test. In any case--unsurprisingly--their code tends to be buggy.

* Lack of familiarity with common data structures and algorithms: this often results in overly-complicated brute-force solutions to problems being used when they needn't have and in sub-par performance.

This quote from the author stood out to me:

> I claim to have repented, mostly. I try rather hard to keep things boringly simple.

...because it's really odd to me. Writing code that is as simple as it can be is precisely what good programmers do! But in order to get to the simplest possible solution to a non-trivial problem you need to think hard about the design of the code and ensure that the abstractions you implement are the right ones for the problem space. Following the "unix philosophy" of building small, simple components that each do one thing well but are highly composable is undoubtedly the more "boringly simple" approach in terms of the final result, but it's harder to do (in the sense that it may take more thought and more experience) than diving into the problem without thinking and cranking out a big ball of mud. Similarly, reaching for the correct data structure or algorithm often results in a massively simpler solution to your problem, but you have to know about it or be willing to research the problem a bit to find it.

The author did at least try to support his thesis with examples of "bad things software engineers do", but a lot of them seem like things that--in almost every organization I've worked at in the last ten years--would definitely be looked down on/would not pass code review. Or are things ("A forest of near-identical names along the lines of DriverController, ControllerManager, DriverManager, ManagerController, controlDriver") that are narrowly tailored to a specific language at a specific window in time.

> they care too much about the quality of their work and not enough about getting shit done.

I think the appearance of "I'm just getting shit done" is often a superficial one, because it doesn't factor in the real costs: other scientists and engineers can't use their solutions because they're not designed in a way that makes them work in any other setting than the narrow one they were solving for. Or other scientists and engineers have trouble using the person's solutions because they are hard to understand and badly-documented. Or other scientists and engineers spend time going back and fixing the person's solutions later because they are buggy or slow. The mindset of "let's just get shit done and crank this out as fast as we can" might be fine in a research setting where, once you've solved the problem, you can abandon it and move on to the next thing. But in a commercial setting (i.e. at a company that builds and maintains software critical for the organization to function) this mindset often starts to impose greater and greater maintenance costs over time.

[0] https://en.wikipedia.org/wiki/Anti-pattern#Big_ball_of_mud

LeonardoTolstoy · 2 years ago
> Lack of familiarity with common data structures and algorithms

This part I 100% agree with. I adapt a lot of scientific code as my day-to-day, and most of the issues tend to be making things 100x slower than they need to be, and then implementing insane approximations to "fix" the speed issue instead of actually fixing it.
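A classic instance of this (my example, not one from the thread) is deduplicating with a list instead of a set, which turns a linear pass into a quadratic one; the fix is a one-line data-structure change:

```python
def dedupe_slow(items):
    """O(n^2): membership test scans the whole `seen` list every time."""
    seen, out = [], []
    for x in items:
        if x not in seen:   # O(n) linear scan
            seen.append(x)
            out.append(x)
    return out

def dedupe_fast(items):
    """O(n): same logic, but `seen` is a set with O(1) lookups."""
    seen, out = set(), []
    for x in items:
        if x not in seen:   # O(1) hash lookup
            seen.add(x)
            out.append(x)
    return out
```

Both return the same result in the same order; only the running time differs, and the gap grows with input size.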

>"Big ball of mud" design

Funny enough, this was explicitly how my PI at my current job wants to implement software. In his opinion the biggest roadblock in scientific software is actually convincing scientists to use the software. And what scientists want is a big ball of mud which they can iterate on easily and which basically requires no installation. In his opinion a giant Python file with a requirements.txt file and a Python version is all you need. I find the attitude interesting. For the record he is a software engineer turned scientist, not the other way around, but our mutual hatred for Conda makes me wonder if he is onto something ...
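For what it's worth, that philosophy is cheap to follow; a pinned requirements.txt (the package names below are just examples, not from the PI's project) is a couple of lines and keeps the script installable months later:

```
# requirements.txt -- pin exact versions so `pip install -r requirements.txt`
# reproduces the same environment later
numpy==1.26.4
pandas==2.2.2
```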

>I think the appearance of "I'm just getting shit done" is often a superficial one, because it doesn't factor in the real costs: other scientists and engineers can't use their solutions because they're not designed in a way that makes them work in any other setting than the narrow one they were solving for.

For the record my experience is the exact opposite. The crazy trash software probably written in Python that is produced by scientists are often the ones more easily iterated on and used by other scientists. The software scientists and researchers can't use are the over-engineered stuff written in a language they don't know (e.g. Scala or Rust) that requires them to install a hundred things before they are able to use it.

karmelapple · 2 years ago
> The mindset … might be fine in a research setting

A vast amount of software is written for research papers that would be useful to people other than the paper’s authors. A lot of software that is in common use by commercial teams started off in academia.

One of the major issues I see is the lack of maintenance of this software, especially given all the problems written in your post and the one above. If the software is a big ball of mud, good luck to anyone trying to come in and make a modification for their similar research paper, or commercial application.

I don’t know the answer to this, but I think additional funding to biology labs to have something like a software developer who is devoted to making sure their lab’s software follows reasonably close to software development best practices would be a great start. If it’s a full time position where they’d likely stick around for many years, some of the maintenance issues would resolve themselves, too. This software-minded person at a lab would still be there even after the biology researchers have moved on elsewhere, and this software developer could answer questions from other people interested about code written years ago.

op00to · 2 years ago
This was my exact experience working in biomedical hpc.
chaxor · 2 years ago

    >    * Not understanding dependencies, public/private, SCM or versioning, making their own code uninstallable after a few months
This is definitely true, but I've searched *far and wide*, and unfortunately it's not a simple task to get this right.

Ultimately, if there were a simple way to get data into the correct state in an OS-independent, machine-independent (from a Raspberry Pi to HPC, the code should always work), concise, and idempotent way, people would use it. There isn't. But there certainly could be.

The solution we desperately need is basically a pull request to a simple build tool (make, Snakemake, just, task, etc.) that makes this idempotent, OS-independent setup simple. Snakemake works on Windows and Unix, so that's a decent start.
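For readers unfamiliar with it, a minimal Snakemake rule already covers the idempotent half: outputs are only rebuilt when missing or older than their inputs (file and script names below are placeholders):

```python
rule summarize:
    input: "data/raw.csv"
    output: "results/summary.csv"
    shell: "python summarize.py {input} {output}"
```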

One big point is matching data outputs to source code and input state. *Allowing IPFS or torrent backends in Snakemake could solve this problem.*

The idea would be to simply wrap `input/output: "/my/file/here"` in `ipfs()`, wherein this would silently check whether the file is locally cached and return it; if not, go to IPFS as a secondary location to check for the file; and if the file isn't in either place, calculate it with the run command specified in Snakemake. It's useful to have this type of decentralized cache because it's extremely common to run commands that may take several months on a supercomputer and produce files that may be only a few MBs (an exchange-correlation functional) or a few GBs (NN weights), so downloading the file is *immensely* cheaper than re-running the code, and the output is specified by the input source code (hence a git commit hash maps to a data hash).

The reason IPFS or torrent is the answer here is severalfold:

1) The data location is specified by the hash of the content, which can be used to build a map from git commit hashes of the source code state to data outputs (the code uniquely specifies the data in almost all cases, and input data can be included for the rare cases where it doesn't).

2) The availability and speed of download scale with popularity. Right now, we're at the mercy of centralized storage systems, where the download rate can be as low as they want it to be. By contrast, LLM NN weights on IPFS can be downloaded very fast when millions of people *and* many centralized storage providers host the file.

3) The data is far more robust to disappearing. Almost all scientific data output links point to nothing (MAG, sra/geomdb; the examples are endless). This happens for many reasons: academics move and the storage location is no longer funded, accounts get moved, or they simply don't have enough storage space for emails on their personal Google Drive and delete the database files from their research. However, these files are often downloaded many times by others in the field, so the data exists somewhere; it just needs to be made accessible by decentralizing it and letting anyone download the file from the entire community which holds it.
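The content-addressing idea in point (1) is just a cryptographic hash of the bytes, so any holder of the file can serve it and any downloader can verify it (a minimal sketch, not actual IPFS CID encoding, which adds a multihash prefix and base encoding):

```python
import hashlib

def content_id(data: bytes) -> str:
    """Content addressing: the identifier is derived from the bytes
    themselves, so the same file has the same ID no matter who hosts it,
    and a downloader can verify integrity by re-hashing."""
    return hashlib.sha256(data).hexdigest()
```

A build tool could then keep a small table mapping each git commit hash to the `content_id` of the outputs that commit produced.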

One of the important aspects to include in this build tool would be ensuring that every time someone downloads a certain file (specified by the git-commit-hash-to-data-hash map), or uploads a file after computing it, they host the file as well. This way the community grows automatically, with a very low-resource and extremely secure IPFS daemon hosting all of the important data files for different projects.

Having all this achieved by the addition of just six characters in a Snakemake file might actually solve this problem for the scientific / data science community, as it would become the standard and be hard to mess up.

The next issue to solve would be popularizing a standard way to get a package to use all available cores/GPUs/resources, from a Raspberry Pi to HPC, without any changes or special considerations. PySpark almost does this, but there's still more config than desirable for the community, and the requirement of installing OS-level dependencies (Java stuff) for it to work with Python can often halt its use completely (if the package using PySpark is a dependency of a dependency of a dependency, wet-lab biologists [the real target users] *will not* figure out how to fix that problem if it doesn't "just work"[TM]).

dijksterhuis · 2 years ago
What you’re describing sounds like DVC (at a higher-ish, 80%-solution level, although my brain switched off at the mention of IPFS).

https://dvc.org/

See pachyderm too.

mglz · 2 years ago
I just handed in my PhD in computer science. Our department teaches "best practices" but adherence to them is hardly possible in research:

1) Requirements change constantly, since... it's research. We don't know where exactly we're going and what problems we encounter.

2) Buying faster hardware is usually an option.

3) Time spent on documentation, optimization or anything else that does not directly lead to results is directly detrimental to your progress. The published paper counts, nothing else. If a reviewer asks about reproducibility, just add a git repository link.

4) Most PhD students never worked in industry, and directly come from the Master's to the PhD. Hence there is no place where they'd encounter the need to create scalable systems.

I guess Nr. 3 has the worst impact. I would love to improve my project w.r.t. stability and reusability, but I would be shooting myself in the foot: it's not publishable, I can't mention it much in my thesis, and the professorship doesn't check.

pfisherman · 2 years ago
Putting some effort into (3) can increase your citations (h-index). If people can’t use your software then they will just find some other method to benchmark against or build on.

Here you are not improving your own time to get out an article, but reducing it for others, which will make your work more influential.

alexmolas · 2 years ago
> 3) Time spent on documentation, optimization or anything else that does not directly lead to results is directly detrimental to your progress.

Here is where I disagree. It's detrimental in the short term, but to ensure reproducibility and development speed in the future you need to follow best practices. Good science requires good engineering practices.

jacobolus · 2 years ago
The point is, it's not prioritized since it's not rewarded. Grad students are incentivized to get their publications in and move on, not generate long-term stable engineering platforms for future generations.
bad_alloc · 2 years ago
Never had a paper rejected for lack of reproducibility though. And as long as I am working for the PhD and not the long term career, it's still better to focus on the short term. I don't like it, but I feel that's where I ended up :(
quickthrower2 · 2 years ago
I agree. Been doing devops recently but back at some coding at work and I wrote the function as simple as I could, adding complexity but only as needed.

So it started as an MVC controller function that was as long as your arm. Then it got split up into separate functions, and eventually I moved those functions to another file.

I had some genuine need for async, so added some stuff to deal with that, timeouts, error handling etc.

But I hopefully created code that is easy to understand, easy to debug/change.

I think years ago I would have used a design pattern. Definitely a bridge - because that would impress Kent Beck or Martin Fowler! But now I just want to get the job done, and the code to tell a story.

I think I pretend I am a Go programmer even if I am not using Go!

c048 · 2 years ago
Congrats, you used design patterns.
quickthrower2 · 2 years ago
Still have the self made pat on my back