Readit News
lynndotpy · 2 years ago
Scientist and programmer here, and my experiences are the opposite. I value keeping things "boringly simple", but I desperately wish there was any kind of engineering discipline.

First is the reproducibility issue. I think I've spent about as much time simply _trying_ to get the dependencies of research code to run as I have writing code or doing actual research in my PhD. The simple thing is to write a requirements.txt file! (For Python, at least.)
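For anyone unfamiliar: even a minimal pinned requirements.txt goes a long way (package versions here are just illustrative):

```text
# requirements.txt -- pin exact versions so the environment is reproducible
numpy==1.24.3
pandas==2.0.1
matplotlib==3.7.1
```

`pip freeze > requirements.txt` captures the environment that worked; `pip install -r requirements.txt` restores it later.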

Second, two anecdotes where not following best practices ruined the correctness of research code:

- Years ago, I was working on research code which simulated a power-grid. We needed to generate randomized load profiles. I noticed that each time it ran, we got the same results. As a software engineer, I figured I had to re-set the `random` seed, but that didn't work. I dug into the code, talked to the researcher, and found the load-profile algorithm: It was not randomly generated, but a hand-coded string of "1" and "0".
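For contrast, a genuinely randomized but still reproducible load profile is only a few lines. This is a hypothetical sketch of what such code could look like, not the original:

```python
import random

def random_load_profile(n_steps: int, on_probability: float = 0.5,
                        seed: int = 42) -> list[int]:
    """Generate a reproducible randomized on/off load profile.

    The seed makes every run produce the same profile (good for
    reproducible experiments); vary the seed to get new profiles.
    """
    rng = random.Random(seed)  # local generator: doesn't touch the global state
    return [1 if rng.random() < on_probability else 0 for _ in range(n_steps)]
```

Using a local `random.Random` instance instead of re-seeding the global module also avoids the exact confusion described above, where seeding `random` has no visible effect.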

- I later had the pleasure of adapting someone's research code. They had essentially hand-engineered IPC. It worked by calling a bash script from Python, which would open other Python processes and generate a random TCP/IP socket, the value of which was saved to an ENV variable. Assuming the socket was open, the Python scripts would then share their socket names via files for the other processes to read and open. To prevent concurrency issues, sleep calls were sprinkled throughout the Python and bash scripts. This was four Python scripts and two shell scripts, and to this day, I do not understand why it wasn't just one Python script.

cauch · 2 years ago
My problem with this discussion is that a lot of people just say "I'm a scientist (or I'm working with scientists) and I'm observing X so I can say 'scientists blahblahblah'".

Different scientific research fields use widely different software environments and have their own habits and traditions. The way a biologist uses programming has no reason to be similar to the way an astrophysicist does: they have not experienced the same software environment at all. It may even be misleading to talk about "scientists" within a single field, as two labs working in the same field may have very different approaches (though this is less likely when they share frameworks).

So I'm not at all surprised that you observed the opposite. The same way I'm not surprised to see someone report the opposite experience when someone else claims "European people use a lot of 'g' and 'k' in their words" just because they observed what happened in Germany.

pphysch · 2 years ago
I don't think there is much variance in quality of software among (radically different) fields of science.

One of the most poorly engineered products I work with was created by a few academic CS guys. The core algorithms are sophisticated and ostensibly implemented well, but the overall product is a horrible mess.

The incentives of academia make this obvious. You need to write some code that plausibly works just enough to get a manuscript out of it, but not much else. Reproducibility is not taken that seriously, and "productization"/portability/hardening is out of the question.

pennomi · 2 years ago
Absolutely my experience as well. Scientists write code that works, but is a pain to reproduce in any sort of scalable way. However it’s been getting better over time as programming is becoming a less niche skill.
BobbyJo · 2 years ago
The problem I've run into over and over with research code is fragility. We ran it on test A, but when we try test B nothing works and we have no idea why because god forbid there is any error handling, validation, or even just comprehensible function names.
astrobe_ · 2 years ago
This is partly because, in my opinion, some "best practices" are superstitions.

A practice was "best" because of some issue with 80s-era computing but is now completely obsolete; the problem has been solved in better ways or has disappeared entirely thanks to, e.g., better tooling or better, well, practices. Hungarian notation, for example. Yet it is still passed down as a best practice and followed blindly because that's what they teach in schools. But nobody can tell why it is "good", because it is no longer relevant.

Scientific code has no superstitions (as expected, I would say), but not for the best reasons: scientists didn't learn the still-relevant good practices either.

tchalla · 2 years ago
I wish we communicated the intent of the “best practice” instead of the practice itself.
DragonStrength · 2 years ago
Actually, when I’ve followed those guidelines, it’s because the tech lead graduated in the 1980s, almost certainly learned it all on the job, and has always done it that way. Others just do what they’ve done before. School talked about those things, but not in a “this is the right way” sort of way.
quickthrower2 · 2 years ago
There is no best practice. It is good to know the tools. In a dojo, do that crazy design pattern shit, and also do the crazy one long function. Do some C#, Java, JS, Go, TypeScript, Haskell, Ruby, Rust (not necessarily those, but a big variety). I want the next person to understand my code - this is very important, probably more important than time spent or performance. Spending another 10% refactoring to make the code easier to understand, even if that's just adding good comments, is well worth it. Make illegal state impossible, if you can (e.g. don't store the calculated value, and if you do then design it so it can't be wrong!). Make it robust. Pretend it'll page you at 2am if it breaks!
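The "don't store the calculated value" advice can be sketched in a few lines of Python (a toy example, names are made up):

```python
from dataclasses import dataclass

@dataclass
class Order:
    unit_price: float
    quantity: int

    @property
    def total(self) -> float:
        # Derived on demand: there is no stored `total` field that
        # could ever drift out of sync with unit_price * quantity.
        return self.unit_price * self.quantity
```

The illegal state (a `total` that contradicts its inputs) simply cannot be represented, so no code path needs to keep it consistent.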
jayd16 · 2 years ago
Such as what? I don't really know of any such superstitions that are based on nothing.

I see a lot of opinion/taste presented as something more, but I really can't think of superstitions.

thethimble · 2 years ago
OOP madness? XML? Web scale databases?

Perhaps not superstition but certainly fundamentalist/hype-based thinking.

astrobe_ · 2 years ago
I saw an example the other day that annoyingly escapes my mind now, as it has been sort of overwritten by the "why the heck do some people name Makefiles with a capital M!?" pet peeve.

But I'd say a bit of everything listed in TFA. For instance, global variables are the type of thing which makes a little voice say "if you do that, something bad will eventually happen". The voice of experience sometimes says things like that, though.

ndriscoll · 2 years ago
I don't know if I'd call it a superstition exactly, but there's a subset of people who are fine with foo1.plus(foo2) and bar1.plus(bar2) where foo and bar are different types, but for some reason, "foo1 + foo2" and "bar1 + bar2" is "confusing" or somehow evil. It feels a bit like they're superstitious about it. I get a similar vibe from people who have an aversion to static type inference.
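In Python, for instance, the two spellings are literally the same method (a toy sketch):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Vector:
    x: float
    y: float

    def __add__(self, other: "Vector") -> "Vector":
        # `v1 + v2` is just sugar for `v1.__add__(v2)`; nothing mystical.
        return Vector(self.x + other.x, self.y + other.y)

    # An explicit-name alias for readers who prefer `v1.plus(v2)`.
    plus = __add__
```

Whether `a + b` or `a.plus(b)` is "confusing" comes down to taste: the dispatch is identical either way.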
chaxor · 2 years ago
It is important to have popular and powerful tools that can reduce the amount of code needed for things like caching and building.

For example, Snakemake (an OS-independent make) combined with data version control based on torrents (removing the complication of having to pay for AWS, etc.) for caching build steps would be a HUGE win in the field. *No one has done it yet* (some have danced around the idea), but done well and correctly, it could cut the code and pain of reproducing work by thousands of lines in some projects.

It's important for the default of a data version control system to be either IPFS or torrent, because it's prohibitive to make everyone set up all these accounts and pay storage companies just to run some package. IPFS, torrent, or some other decentralized solution is the only real option.

leptons · 2 years ago
Today's "best practice" is tomorrow's worst practice.

jusssi · 2 years ago
Two more to the scientists' tab:

1. No tests of any kind. "I know what the output should look like." Over time people who know what it should look like leave, and then it's untouchable.

2. No regard for the physical limits of hardware. "We can always get more RAM on everyone's laptops, right?" (You wouldn't need to if you just processed the JSONs one at a time, instead of first loading all of them into memory and then processing them one at a time.)

Also the engineers' tab has a strong smell of junior in it. When you have spent some time maintaining such code, you'll learn not to make that same mess yourself. (You'll overcorrect and make another, novel kind of mess; some iterations are required to get it right.)

lozenge · 2 years ago
Yes, the claim that the scientists' hacked-together code is well tested and even uses valgrind gave me pause. It's more likely there are no tests at all. They made a change, they saw that a linear graph became exponential, and they went bug hunting. But there's no way they have spotted every regression caused by every change.
Asraelite · 2 years ago
Agree with those two problems on the scientist side. I would also add that they often don't use version control.

I think a single semester of learning the basics of software development best practices would save a lot of time and effort in the long term if it was included in physics/maths university courses.

squarepizza · 2 years ago
> I would also add that they often don't use version control.

Working for corporate R&D, I once received a repo on a flash drive. The team would merge changes manually by copy-pasting.

I should've just turned around and left.

2devnull · 2 years ago
1 and 2 are features. Re 1, if someone doesn’t know what the output should look like, they shouldn’t be reusing the code. Re 2, just think a bit more about it and you’ll realize that fretting over RAM that isn’t needed until it’s needed is actually just premature optimization.
gregopet · 2 years ago
Sounds like the non-programmers are good at what they are supposed to be good at (solving the actual problem, if perhaps not always in the most elegant manner) while the programmers should be producing a highly maintainable, understandable, testable and reliable code base (and potentially have problems with advanced algorithms that rely on complicated theorems), but they are not. The OP has a case of bad programmers - the techniques listed as bad can be awesome if used with prudence.

A good programmer has a very deep knowledge of the various techniques they can use and the wisdom to actually choose the right ones in a given situation.

The bad programmers learn a few techniques and apply them everywhere, no matter what they're working on or whom they're working with. Good programmers learn from their mistakes and adapt; bad programmers blame others.

I've worked with my share of bad programmers and they really suck. A good programmer's code is a joy to work with.

galaxyLogic · 2 years ago
Right and I think "scientists" simply are more intelligent than average Joe Coder. Intelligent people produce better software.

It is easy to learn some coding, not so easy to become a scientist.

To become a scientist you must write and get your PhD thesis approved, which must already be about scientific discoveries you made while doing that thesis. Only people with above-average IQ can accomplish something like that, I think.

9dev · 2 years ago
Being intelligent in one domain doesn’t automatically make you good in any others. Exceptional biologists can be astoundingly bad at maths, and the other way around. Like most skills, being good at writing software requires not only intelligence, but lots of experience too. Maybe smarter people will pick it up faster, but they aren’t intrinsically better.

It’s a bit surprising you’d have to explain such a basic conclusion here.

_Wintermute · 2 years ago
In my experience getting a PhD doesn't require above average intelligence, it does require a lot of perseverance and a good amount of organisation though.

I honestly think most skilled tradespeople are more intelligent than me and my PhD holding colleagues.

palata · 2 years ago
> Right and I think "scientists" simply are more intelligent than average Joe Coder. Intelligent people produce better software.

The vast majority of papers I read on topics I know are complete bullshit. Maybe making a PhD was more elitist before, but now it surely isn't.

If we define "scientist" as anyone who publishes papers, then they have the same problem as software engineering: it's mostly made by juniors.

laserbeam · 2 years ago
I agree with the feelings of the author, most software is overengineered (including most of my software).

That being said, most scientific code I've encountered doesn't compile/run. It ran once at some point, produced results, worked for the authors, and got a paper published. The goal for that code was satisfied, and then it somehow rotted away (doesn't work with other compilers, how it gets built was never properly documented, unclear what dependencies were used, dependencies were preprocessed at some point and you can't find the preprocessed versions anywhere, hardcoded data files that are not in the published repos, etc.). I wouldn't use THAT as my compass for how to write higher-quality code.

ShamelessC · 2 years ago
Yeah somehow I suspect this author hadn't yet had to deal with colab notebooks.
noobermin · 2 years ago
Yeah, well, GNOME 2 also doesn't compile or run on my machine. It ran once at some point, but only one of the two is considered a "worse" class of software.
jakobnissen · 2 years ago
I'm a scientist-programmer working in a field composed of biologists and computer scientists, and what I've experienced is almost exactly the opposite of the author.

I've found the problems that biologists cause are mostly:

* Not understanding dependencies, public/private, SCM or versioning, making their own code uninstallable after a few months

* Writing completely unreadable code, even to themselves, making it impossible to maintain. This means they always restart from zero, and projects grow into folders of a hundred individual scripts with no order, depending on files that no longer exist

* Foregoing any kind of testing or quality control, making real and nasty bugs rampant.
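Even a single golden-value regression test addresses that last point. A sketch with a hypothetical analysis function standing in for the real one:

```python
def normalize(values: list[float]) -> list[float]:
    """Hypothetical stand-in for a real analysis step: scale values to sum to 1."""
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize an all-zero series")
    return [v / total for v in values]

def test_normalize():
    # A known-input/known-output pair: if a later refactor changes the
    # numbers, this fails immediately instead of silently shifting results.
    assert normalize([2.0, 2.0]) == [0.5, 0.5]
    assert abs(sum(normalize([1.0, 2.0, 3.0])) - 1.0) < 1e-12
```

Nothing fancy, but it is the difference between "I know what the output should look like" living in one person's head versus in the repo.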

IMO the main issue with the software people in our field (of which I am one, even though I'm formally trained in biology) is that they are less interested in biology than in programming, so they are bad at choosing which scientific problems to solve. They are also less productive when coding than the scientists because they care too much about the quality of their work and not enough about getting shit done.

frostix · 2 years ago
>They are also less productive when coding than the scientists because they care too much about the quality of their work and not enough about getting shit done.

Ultimately I’d say the core issue here is that research is complex and those environments are often resource strapped relative to other environments. As such this idea of “getting shit done” takes priority over everything. To some degree it’s not that much different than startup business environments that favor shipping features over writing maintainable and well (or even partially) documented code.

The difference in research that many fail to grasp is that the code is often as ephemeral as the specific exploratory path of research it’s tied to. Sometimes software in research is more general-purpose, but more often it’s tightly coupled to a new idea deep-seated in some theory. Just as exploration paths into the unknown are rapidly explored and often discarded, much of the work around them is as well, including software.

When you combine that understanding with an already resource-strapped environment, it shouldn’t be surprising at all that much of the work done around the science, be it some physical apparatus or something virtual like code, is duct-taped together and barely functional. To some degree that’s by design: it’s choosing where to focus your limited resources, which is on exploring and testing an idea.

Software is very rarely the end goal, just like in business. The difference in business is that if the software is viewed as a long-term asset, more time is spent trying to reduce long-term costs. In research and science, if something is very successful and becomes mature enough that it’s expected to remain around for a while, more mature code bases often emerge. Even then there’s not a lot of money out there to create that stuff, but it does happen, though only after it’s proven to be worth the time investment.

EMCymatics · 2 years ago
>Ultimately I’d say the core issue here is that research is complex and those environments are often resource strapped relative to other environments. As such this idea of “getting shit done” takes priority over everything.

That conforms to my experience

hyperthesis · 2 years ago
maintainable prototypes are overengineered

coldtea · 2 years ago
>I've found the problems that biologists cause are mostly 1. Not understanding dependencies, public/private, SCM or versioning, making their own code uninstallable after a few months

That's not on them though. That's on the state of the tooling in the industry.

Most of the time, dependencies could just be a folder you delete, and that's that (node_modules isn't very far from that). Instead it's a nightmare - and not for any good reason, except historical baggage.

The biologists writing scientific programs don't want "shared libraries" and other such BS. But the tooling often doesn't give them the option.

And the higher level abstractions like conda and pip and poetry and whatever, are just patches on top of a broken low level model.

None of those should be needed for isolated environments, only for dependency installation and update. Isolated environments should just come for free based on lower level implementation.

Master_Odin · 2 years ago
While I agree tooling could be better, in grad school I found that a lot of academics / grad students don't know that any of the tooling even exists, and never bothered to find out whether tooling existed that could improve their lives. Ditto with updating their language runtimes. It really seemed like they viewed code as a necessary evil they had to write to achieve their research goal.
marmalade2413 · 2 years ago
I was going to write a response but you've put what I would have said perfectly. The problem, at least in academia, is the pressure to publish. There is very little incentive to write maintainable code and finalise a project into something accessible to an end user. The goal is to come up with something new, publish, and move on or develop the idea further. This alone is not reason enough to skip practices such as unit tests, containerisation and versatile code, but on top of it, most academic code is written by temporary "employees": PhDs are in a department for 3-4 years, and postdocs are there about the same amount of time.

For someone to shake these bad practices, they need to fight an uphill battle and ultimately sacrifice their research time so that others will have an easier time understanding and using their codes. Another battle that people trying to write "good" code would need to fight is that a lot of academics aren't interested in programming and see coding as simply as means to an end to solve a specific problem.

Also, a few more bad practices to add to the list:

* Not writing documentation.

* Copying, cutting, pasting and commenting out lines of code in lieu of version control.

* Not understanding the programming language they're using, and spending time solving problems that the language has a built-in solution for.

This is at least based on my own experience as a PhD student in numerical methods working with Engineers, Physicists, Biologists and Mathematicians.

jwagenet · 2 years ago
Sometimes I don’t blame people for committing the ‘sin’ of leaving commented-out code; unless you know that code used to exist in a previous version, it may as well have never existed.
civilized · 2 years ago
These patterns appear in many fields. I take it as a sign that the tooling in the field is underdeveloped.

This leads to a split between domain problem solvers, who are driven to solve the field's actual problems at all costs (including unreliable code that produces false results) and software engineers, who keep things tidy but are too risk-averse to attempt any real problems.

I encourage folks with interests in both software and an area of application to look at what Hadley Wickham did for tabular data analysis and think about what it would look like to do that for your field.

mannykannot · 2 years ago
Unreliable code that produces false results does not solve the field's actual problems, and is likely to contribute to the reproducibility problem. It might solve the author's immediate problem of needing to publish something.

Update: I guess I misinterpreted OP's intent here, with "unreliable code that produces false results" being part of the field's actual problems rather than one of the costs to be borne.

noobermin · 2 years ago
Maybe it's biology (or really, maybe not), but honestly it's just the nature of the beast. Fortran is literally the oldest language; the attitude and spirit are simply different from those of software development.
pas · 2 years ago
Journals, research universities/institutions, and grant orgs have the resources and the gatekeeping role to encourage and enforce standards, and to train and support investigators in conducting real science rather than pseudoscience. But these entities are actively disowning their responsibility in the name of empty "empowerment" (of course, since rationally no one has a real chance of successfully pushing through a reform, the smart choice is to just not rock the boat).
fifilura · 2 years ago
Can you elaborate on your thoughts regarding Wickham?
movpasd · 2 years ago
I work in an R&D environment with a lot of people from scientific backgrounds who have picked up some programming but aren't software people at heart. I couldn't agree more with your assessment, and I say that without any disrespect to their competence. (Though, perhaps with some frustration for having to deal with bad code!)

As ever, the best work comes when you're able to have a tight collaboration between a domain expert and a maintainability-minded person. This requires humility from both: the expert must see that writing good software is valuable and not an afterthought, and the developer must appreciate that the expert knows more about what's relevant or important than them.

aleph_minus_one · 2 years ago
> As ever, the best work comes when you're able to have a tight collaboration between a domain expert and a maintainability-minded person. This requires humility from both: the expert must see that writing good software is valuable and not an afterthought, and the developer must appreciate that the expert knows more about what's relevant or important than them.

I do work in such an environment (though in some industry, and not in academia).

An important problem, in my opinion, is that many "software-minded people" have a very different way of using a computer than typical users, and are always learning and thinking about new things, while the typical user is much less willing to be permanently learning (both in their subject-matter area and with computers).

So the differences in mindset and computer usage are, in my opinion, much larger than your post suggests. What you list are, in my experience, differences that are much easier to resolve and, if both sides are open, not really a problem in practice.

ameminator · 2 years ago
> They are also less productive when coding than the scientists because they care too much about the quality of their work and not enough about getting shit done.

You can't solve the first 3 issues without having people who care about software quality. People not caring about the quality of the software is what caused those initial 3 problems in the first place.

jampekka · 2 years ago
And you can't fix any of this as long as "software quality" (the "best practices") means byzantine enterprise architecture mammoths that don't even actually fix any of the quality issues.
hkon · 2 years ago
Yeah, if only scientists would put the same care into the quality of their science...
MrJohz · 2 years ago
I only worked briefly in software for research, and what you described matched my experience, but with a couple of caveats.

Firstly, a lot of the programs people were writing were messy, but didn't need to last longer than their current research project. They didn't necessarily need to be maintained long-term, and therefore the mess was often a reasonable trade-off for speed.

Secondly, almost none of the software people had any experience writing code in any industry outside of research. Many of them were quite good programmers, and there were a lot of "hacker" types who would fiddle with stuff in their spare time, but in terms of actual engineering, they had almost no experience. There were a lot of people who were just reciting the best practice rules they'd learned from blog posts, without really having the experience to know where the advice was coming from, or how best to apply it.

The result was often too much focus on easy-to-fix, visible, but ultimately low-impact changes, and a lot of difficulty in looking at the bigger picture issues.

Regic · 2 years ago
> There were a lot of people who were just reciting the best practice rules they'd learned from blog posts, without really having the experience to know where the advice was coming from, or how best to apply it

This is exactly my experience too. Also, the problem with learning things from youtube and blogs is that whatever the author decides to cover is what we end up knowing, but they never intended to give a comprehensive lecture about these topics. The result is people who dogmatically apply some principles and entirely ignore others - neither of those really work. (I'm also guilty of this in ML topics.)

antisthenes · 2 years ago
> Not understanding dependencies, public/private, SCM or versioning, making their own code uninstallable after a few months

I'm not sure what "uninstallable" code is, but why does it matter? Do scientists really need to know about dependencies when they need the same 3 libraries over and over? Pandas, numpy, Apache arrow, maybe OpenCV. Install them and keep them updated. Maybe let the IT guys worry about dependencies if it needs more complexity than that.

> Writing completely unreadable code, even to themselves, making it impossible to maintain. This means they always restart from zero, and projects grow into folders of a hundred individual scripts with no order, depending on files that no longer exists

This is actually kind of a benefit. Instead of following sunk cost and trying to address tech debt on years-old code, you can just toss a 200-liner script out of the window along with its tech debt, presumably because the research it was written for is already complete.

> Foregoing any kind of testing or quality control, making real and nasty bugs rampant.

Scientific code only needs to transform data. If it's written in a way that does that (e.g. uses the right function calls and returns a sensible data array) then it succeeded in its goal.

> They are also less productive when coding than the scientists because they care too much about the quality of their work and not enough about getting shit done.

Sooo...another argument in favor of the way scientists write code then? Isn't "getting shit done" kind of the point?

cbolton · 2 years ago
Yeah, these problems with "engineer code" the author describes are real, but they're a well-known thing in software engineering. It's exactly what you can expect from junior developers trying to do their best. More experienced programmers have gone through the suffering of having to work on such code, like the author himself, and don't make these mistakes. Meanwhile, experienced scientists still write terrible code...
mazelife · 2 years ago
I'm a software engineer working with scientist-turned-programmers, and what I've experienced is also exactly the opposite of the author. The code written by the physicists, geoscientists and data scientists I work with often suffers from the following issues:

* "Big ball of mud" design [0]: No thought given to how the software should be architected or what the entities that comprise the design space of the problem are and how they fit together. The symptoms of this lack of thinking are obvious: multi-thousand-line swiss-army-knife functions, blocks of code repeated in dozens of places with minor variations, and a total lack of composability of any components. This kind of software design (or lack of design, really) ends up causing a serious hit to productivity because it's often useless outside of the narrow problem it was written to solve and because it's exceedingly hard to maintain or add new features to.

* Lack of tests: some of this is that the scientist-turned-programmer doesn't want to "waste time" writing tests, but more often it's that they don't know _how_ to write good tests. Or they have designed the code in such a way (see above) that it's really hard to test. In any case--unsurprisingly--their code tends to be buggy.

* Lack of familiarity with common data structures and algorithms: this often results in overly-complicated brute-force solutions to problems being used when they needn't have and in sub-par performance.

This quote from the author stood out to me:

> I claim to have repented, mostly. I try rather hard to keep things boringly simple.

...because it's really odd to me. Writing code that is as simple as it can be is precisely what good programmers do! But in order to get to the simplest possible solution to a non-trivial problem you need to think hard about the design of the code and ensure that the abstractions you implement are the right ones for the problem space. Following the "unix philosophy" of building small, simple components that each do one thing well but are highly composable is undoubtedly the more "boringly simple" approach in terms of the final result, but it's harder to do (in the sense that it may take more thought and more experience) than diving into the problem without thinking and cranking out a big ball of mud. Similarly, reaching for the correct data structure or algorithm often results in a massively simpler solution to your problem, but you have to know about it or be willing to research the problem a bit to find it.

The author did at least try to support his thesis with examples of "bad things software engineers do", but a lot of them seem like things that--in almost every organization I've worked at in the last ten years--would definitely be looked down on/would not pass code review. Or are things ("A forest of near-identical names along the lines of DriverController, ControllerManager, DriverManager, ManagerController, controlDriver") that are narrowly tailored to a specific language at a specific window in time.

> they care too much about the quality of their work and not enough about getting shit done.

I think the appearance of "I'm just getting shit done" is often a superficial one, because it doesn't factor in the real costs: other scientists and engineers can't use their solutions because they're not designed in a way that makes them work in any other setting than the narrow one they were solving for. Or other scientists and engineers have trouble using the person's solutions because they are hard to understand and badly-documented. Or other scientists and engineers spend time going back and fixing the person's solutions later because they are buggy or slow. The mindset of "let's just get shit done and crank this out as fast as we can" might be fine in a research setting where, once you've solved the problem, you can abandon it and move on to the next thing. But in a commercial setting (i.e. at a company that builds and maintains software critical for the organization to function) this mindset often starts to impose greater and greater maintenance costs over time.

[0] https://en.wikipedia.org/wiki/Anti-pattern#Big_ball_of_mud

LeonardoTolstoy · 2 years ago
> Lack of familiarity with common data structures and algorithms

This part I 100% agree with. I adapt a lot of scientific code as my day-to-day, and most of the issues tend to be making things 100x slower than they need to be, and then implementing insane approximations to "fix" the speed issue instead of actually fixing it.
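A classic instance of this (my example, not one from the thread) is deduplicating with a list instead of a set, which turns a linear pass into a quadratic one; the fix is a one-line data-structure change:

```python
def dedupe_slow(items):
    """O(n^2): membership test scans the whole `seen` list every time."""
    seen, out = [], []
    for x in items:
        if x not in seen:   # O(n) linear scan
            seen.append(x)
            out.append(x)
    return out

def dedupe_fast(items):
    """O(n): same logic, but `seen` is a set with O(1) lookups."""
    seen, out = set(), []
    for x in items:
        if x not in seen:   # O(1) hash lookup
            seen.add(x)
            out.append(x)
    return out
```

Both return the same result in the same order; only the running time differs, and the gap grows with input size.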

>"Big ball of mud" design

Funny enough, this was explicitly how my PI at my current job wants to implement software. In his opinion the biggest roadblock in scientific software is actually convincing scientists to use the software. And what scientists want is a big ball of mud which they can iterate on easily and which basically requires no installation. In his opinion a giant Python file with a requirements.txt file and a Python version is all you need. I find the attitude interesting. For the record he is a software engineer turned scientist, not the other way around, but our mutual hatred for Conda makes me wonder if he is onto something ...
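For what it's worth, that philosophy is cheap to follow; a pinned requirements.txt (the package names below are just examples, not from the PI's project) is a couple of lines and keeps the script installable months later:

```
# requirements.txt -- pin exact versions so `pip install -r requirements.txt`
# reproduces the same environment later
numpy==1.26.4
pandas==2.2.2
```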

>I think the appearance of "I'm just getting shit done" is often a superficial one, because it doesn't factor in the real costs: other scientists and engineers can't use their solutions because they're not designed in a way that makes them work in any other setting than the narrow one they were solving for.

For the record my experience is the exact opposite. The crazy trash software probably written in Python that is produced by scientists are often the ones more easily iterated on and used by other scientists. The software scientists and researchers can't use are the over-engineered stuff written in a language they don't know (e.g. Scala or Rust) that requires them to install a hundred things before they are able to use it.

karmelapple · 2 years ago
> The mindset … might be fine in a research setting

A vast amount of software is written for research papers that would be useful to people other than the paper’s authors. A lot of software that is in common use by commercial teams started off in academia.

One of the major issues I see is the lack of maintenance of this software, especially given all the problems written in your post and the one above. If the software is a big ball of mud, good luck to anyone trying to come in and make a modification for their similar research paper, or commercial application.

I don’t know the answer to this, but I think additional funding to biology labs to have something like a software developer who is devoted to making sure their lab’s software follows reasonably close to software development best practices would be a great start. If it’s a full time position where they’d likely stick around for many years, some of the maintenance issues would resolve themselves, too. This software-minded person at a lab would still be there even after the biology researchers have moved on elsewhere, and this software developer could answer questions from other people interested about code written years ago.

op00to · 2 years ago
This was my exact experience working in biomedical hpc.
chaxor · 2 years ago

    >    * Not understanding dependencies, public/private, SCM or versioning, making their own code uninstallable after a few months
This is definitely true, but I've searched *far and wide*, and unfortunately it's not a simple task to get this right.

Ultimately, if there were a simple way to get data into the correct state in an OS-independent, machine-independent (from a Raspberry Pi to HPC, the code should always work), concise, and idempotent way, people would use it. There isn't. But there certainly could be.

The solution we desperately need is basically a pull request to a simple build tool (make, Snakemake, just, task, etc.) that makes this idempotent, OS-independent setup simple. Snakemake works on Windows and Unix, so that's a decent start.
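For readers unfamiliar with it, a minimal Snakemake rule already covers the idempotent half: outputs are only rebuilt when missing or older than their inputs (file and script names below are placeholders):

```python
rule summarize:
    input: "data/raw.csv"
    output: "results/summary.csv"
    shell: "python summarize.py {input} {output}"
```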

One big point is matching data outputs to source code and input state. *Allowing IPFS or torrent backends in Snakemake could solve this problem.*

The idea would be to simply wrap `input/output: "/my/file/here"` in `ipfs()`, wherein this would silently check whether the file is locally cached and return it; if not, go to IPFS as a secondary location to check for the file; and if the file isn't in either place, calculate it with the run command specified in Snakemake. It's useful to have this type of decentralized cache because it's extremely common to run commands that may take several months on a supercomputer and produce files that may be only a few MBs (an exchange-correlation functional) or a few GBs (NN weights), so downloading the file is *immensely* cheaper than re-running the code, and the output is specified by the input source code (hence a git commit hash maps to a data hash).

The reason IPFS or torrent is the answer here is severalfold:

1) The data location is specified by the hash of the content, which can be used to build a map from git commit hashes of the source code state to data outputs (the code uniquely specifies the data in almost all cases, and input data can be included for the rare cases where it doesn't).

2) The availability and speed of download scale with popularity. Right now, we're at the mercy of centralized storage systems, where the download rate can be as low as they want it to be. By contrast, LLM NN weights on IPFS can be downloaded very fast when millions of people *and* many centralized storage providers host the file.

3) The data is far more robust to disappearing. Almost all scientific data output links point to nothing (MAG, sra/geomdb; the examples are endless). This happens for many reasons: academics move and the storage location is no longer funded, accounts get moved, or they simply don't have enough storage space for emails on their personal Google Drive and delete the database files from their research. However, these files are often downloaded many times by others in the field, so the data exists somewhere; it just needs to be made accessible by decentralizing it and letting anyone download the file from the entire community which holds it.
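The content-addressing idea in point (1) is just a cryptographic hash of the bytes, so any holder of the file can serve it and any downloader can verify it (a minimal sketch, not actual IPFS CID encoding, which adds a multihash prefix and base encoding):

```python
import hashlib

def content_id(data: bytes) -> str:
    """Content addressing: the identifier is derived from the bytes
    themselves, so the same file has the same ID no matter who hosts it,
    and a downloader can verify integrity by re-hashing."""
    return hashlib.sha256(data).hexdigest()
```

A build tool could then keep a small table mapping each git commit hash to the `content_id` of the outputs that commit produced.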

One of the important aspects to include in this build tool would be ensuring that every time someone downloads a certain file (specified by the git-commit-hash-to-data-hash map), or uploads a file after computing it, they host the file as well. This way the community grows automatically, with a very low-resource and extremely secure IPFS daemon hosting all of the important data files for different projects.

Having all this achieved by the addition of just six characters in a Snakemake file might actually solve this problem for the scientific / data science community, as it would become the standard and be hard to mess up.

The next issue to solve would be popularizing a standard way to get a package to use all available cores/GPUs/resources, from a Raspberry Pi to HPC, without any changes or special considerations. PySpark almost does this, but there's still more config than desirable for the community, and the requirement of installing OS-level dependencies (Java stuff) for it to work with Python can often halt its use completely (if the package using PySpark is a dependency of a dependency of a dependency, wet-lab biologists [the real target users] *will not* figure out how to fix that problem if it doesn't "just work"[TM]).

dijksterhuis · 2 years ago
What you’re describing sounds like DVC (at a higher-ish, 80%-solution level, although my brain switched off at the mention of IPFS).

https://dvc.org/

See pachyderm too.

mglz · 2 years ago
I just handed in my PhD in computer science. Our department teaches "best practices" but adherence to them is hardly possible in research:

1) Requirements change constantly, since... it's research. We don't know where exactly we're going and what problems we encounter.

2) Buying faster hardware is usually an option.

3) Time spent on documentation, optimization or anything else that does not directly lead to results is directly detrimental to your progress. The published paper counts, nothing else. If a reviewer asks about reproducibility, just add a git repository link.

4) Most PhD students never worked in industry, and directly come from the Master's to the PhD. Hence there is no place where they'd encounter the need to create scalable systems.

I guess Nr. 3 has the worst impact. I would love to improve my project w.r.t. stability and reusability, but I would be shooting myself in the foot: it's not publishable, I can't mention it much in my thesis, and the professorship doesn't check.

pfisherman · 2 years ago
Putting some effort into (3) can increase your citations (h-index). If people can’t use your software then they will just find some other method to benchmark against or build on.

Here you are not improving your own time to get out an article, but reducing it for others, which will make your work more influential.

alexmolas · 2 years ago
> 3) Time spent on documentation, optimization or anything else that does not directly lead to results is directly detrimental to your progress.

Here is where I disagree. It's detrimental in the short term, but to ensure reproducibility and development speed in the future you need to follow best practices. Good science requires good engineering practices.

jacobolus · 2 years ago
The point is, it's not prioritized since it's not rewarded. Grad students are incentivized to get their publications in and move on, not generate long-term stable engineering platforms for future generations.
bad_alloc · 2 years ago
Never had a paper rejected for lack of reproducibility though. And as long as I am working for the PhD and not the long term career, it's still better to focus on the short term. I don't like it, but I feel that's where I ended up :(
quickthrower2 · 2 years ago
I agree. Been doing devops recently but back at some coding at work and I wrote the function as simple as I could, adding complexity but only as needed.

So it started as an MVC controller function that was as long as your arm. Then it got split up into separate functions, and eventually I moved those functions to another file.

I had some genuine need for async, so added some stuff to deal with that, timeouts, error handling etc.

But I hopefully created code that is easy to understand, easy to debug/change.

I think years ago I would have used a design pattern. Definitely a bridge - because that would impress Kent Beck or Martin Fowler! But now I just want to get the job done, and the code to tell a story.

I think I pretend I am a Go programmer even if I am not using Go!

c048 · 2 years ago
Congrats, you used design patterns.
quickthrower2 · 2 years ago
Still have the self made pat on my back