This article brings up scientific code from 10 years ago, but how about code from... right now? Scientists really need to publish their code artifacts, and we can no longer just say "Well they're scientists or mathematicians" and allow that as an excuse for terrible code with no testing specs. Take this for example:
https://github.com/mrc-ide/covid-sim/blob/e8f7864ad150f40022...
This was used by the Imperial College for COVID-19 predictions. It has race conditions, seeds the model multiple times, and therefore has totally non-deterministic results[0]. Also, this is the cleaned up repo. The original is not available[1].
A lot of my homework from over 10 years ago still runs (some of it requires the right Docker container: https://github.com/sumdog/assignments/). If journals really care about the reproducibility crisis, artifact reviews need to be part of the editorial process. Scientific code needs to have tests, a minimal amount of test coverage, and code/data used really need to be published and run by volunteers/editors in the same way papers are reviewed, even for non-computer science journals.
[0] https://lockdownsceptics.org/code-review-of-fergusons-model/
[1] https://github.com/mrc-ide/covid-sim/issues/179
I am all for open science, but you understand that the links in your post are the exact worry people have when it comes to releasing code: people claiming that their non-software engineering grade code invalidates the results of their study.
I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points. I'm here to turn out science, not production ready code.
As an example, you seem to be complaining that their Monte Carlo code has non-deterministic output when that is the entire point of Monte Carlo methods and doesn't change their result.
By the way, yes I tested my ten year old code and it does still work. What I'm saying is that scientific code doesn't need to handle every special case or be easily usable by non-experts. In fact the time spent making it that way is time that a scientist spends doing software engineering instead of science, which isn't very efficient.
Let's be clear: scientific-grade code is held to a lower standard than production-grade code. But it is still a real standard.
Does scientific-grade code need to handle a large number of users running it at the same time? Probably not a genuine concern, since those users will run their own copies of the code on their own hardware, and it's not necessary or relevant for users to see the same networked results from the same instance of the program running on a central machine.
Does scientific-grade code need to publish telemetry? Eh, usually no. Set up alerting so that on-call engineers can be paged when (not if) it falls over? Nope.
Does scientific-grade code need to handle the authorization and authentication of users? Nope.
Does scientific-grade code need to be reproducible? Yes. Fundamentally yes. The reproducibility of results is core to the scientific method. Yes, that includes Monte Carlo code, since there is no such thing as truly random number generation on contemporary computers, only pseudorandom number generation; what matters for cryptographic purposes is that the seeds for the pseudorandom generation are sufficiently hidden / unknown. For scientific purposes, the seeds should be published on purpose, so that a) the exact results you found, sufficiently random as they are for the purpose of your experiment, can still be independently verified by a peer reviewer, and b) a peer reviewer can deliberately pick a different seed value, which will lead to different results but should still lead to the same conclusion if your decision to reject / fail to reject the null hypothesis was correct.
Monte Carlo can and should be deterministic and repeatable. It's a matter of correctly initializing your random number generators and providing a known, fixed random seed from run to run. If you aren't doing that, you aren't running your Monte Carlo correctly. That's a huge red flag.
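For what it's worth, here is a minimal sketch in Python with NumPy (my own toy, nothing to do with the covid-sim code) of what that looks like in practice: the estimate is a pure function of the sample count and the seed, so the same seed gives identical output run after run.

    import numpy as np

    def estimate_pi(n_samples, seed):
        """Monte Carlo estimate of pi, fully determined by (n_samples, seed)."""
        rng = np.random.default_rng(seed)       # one explicit generator, no hidden global state
        xy = rng.random((n_samples, 2))         # uniform points in the unit square
        inside = (xy ** 2).sum(axis=1) <= 1.0   # which points fall inside the quarter circle
        return 4.0 * inside.mean()

    # Same seed, same answer -- every run.
    assert estimate_pi(1_000_000, seed=42) == estimate_pi(1_000_000, seed=42)
    print(estimate_pi(1_000_000, seed=42))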
Scientists need to get over this fear about their code. They need to produce better code and need to actually start educating their students on how to write and produce code. For too long many in the physics community have trivialized programming and seen it as assumed knowledge.
Having open code will allow you to become better and you’ll produce better results.
Side note: 25 years ago I worked in accelerator science too.
Doesn't it concern you that it would be possible for critics to look at your scientific software and find mistakes (some of which the OP mentioned are not "minor") so easily?
Given that such software forms the very foundation of the results of such papers, why shouldn't it fall under scrutiny, even for "minor" points? If you are unable to produce good technical content, why are you qualified to declare what is or isn't minor? Isn't the whole point that scrutiny is best left to technical experts (and not subject experts)?
> exact worry people have when it comes to releasing code: people claiming that their non-software engineering grade code invalidates the results of their study.
If code is what is substantiating a scientific claim, then code needs to stand up to scientific scrutiny. This is how science is done.
I came from physics, but systems and computer engineering was always an interest of mine, even before physics. I always thought it was kooky-dooks that CS people can release papers without code; fine if the paper contains all the proofs, but otherwise it shouldn't even be looked at. PoS (proof-of-science) or GTFO.
We are at the point in human and scientific civilization where knowledge needs to prove itself correct. Papers should be self-contained execution environments that generate PDFs and resulting datasets. The code doesn't need to be pretty, or robust, but it needs to be sealed inside a container so that it can be re-run, re-validated, and someone else can confirm the result X years from now. And it isn't about trusting or not trusting the researcher; we need to fundamentally trust the results.
> I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points. I'm here to turn out science, not production ready code.
Specifically, to that point, I want to cite the saying:
"The dogs bark, but the caravan passes."
(There is a more colorful German variant which, translated, goes: "What does it matter to the mighty old oak tree if a dog takes a piss on it...").
Of course, if you publish your code, you expose it to critics. Some of the criticism will be unqualified. And as we have seen in the case of, e.g., climate scientists, some might even be nasty. But who cares? What matters is open discussion, which is a core value of science.
That's not how the game is played. If you cannot release the code because the code is too ugly or untested or has bugs, how do you expect anyone with the right expertise to assess your findings?
It reminds me of Kerckhoffs's principle in cryptography, which states: A cryptosystem should be secure even if everything about the system, except the key, is public knowledge.
Nit: implementations of Monte Carlo methods are not necessarily nondeterministic. Whenever I implement one, I always aim for a deterministic function of (input data, RNG seed, parallelism, workspace size).
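One way to get that property, sketched here with NumPy's SeedSequence machinery (not necessarily how the parent commenter does it): give each worker its own child stream derived from the master seed, so the result does not depend on thread or process scheduling. Worker count and chunk sizes below are arbitrary.

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    def count_hits(args):
        """One worker's share of the samples, driven by its own child seed."""
        child_seed, n = args
        rng = np.random.default_rng(child_seed)
        xy = rng.random((n, 2))
        return int(((xy ** 2).sum(axis=1) <= 1.0).sum())

    def estimate_pi(n_total, seed, n_workers=4):
        # One independent, reproducible stream per worker, all derived from the master seed.
        children = np.random.SeedSequence(seed).spawn(n_workers)
        n_per_worker = n_total // n_workers
        with ProcessPoolExecutor(n_workers) as pool:
            hits = list(pool.map(count_hits, [(c, n_per_worker) for c in children]))
        return 4.0 * sum(hits) / (n_per_worker * n_workers)

    if __name__ == "__main__":
        # Same (input, seed, worker count) -> same estimate, regardless of scheduling.
        print(estimate_pi(4_000_000, seed=2020))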
I have done research on Evolutionary Algorithms and numerical optimization. It was nigh impossible to reproduce poorly described algorithms from state-of-the-art research at the time, and researchers would very often not bother to reply to inquiries for their code. Even if you did get the code, it would be some arcane C only compatible with a GCC from 1996.
Code belongs with the paper. Otherwise we can just continue to make up numbers and pretend we found something significant.
Our first job as scientists is to make sure we're not fooling ourselves. I wouldn't just use any old scale to take a measurement. I want a calibrated scale, adjusted to meet a specific standard of accuracy. Such standards and calibrations ensure we can all get "the same" result doing "the same" thing, even if we use different equipment from different vendors. The concerns about code are exactly the same. It's even scarier to me because I realize that unlike a scale, most scientists have no idea how to calibrate their code to ensure accurate, reproducible results. Of course with the scales, the calibration is done by a specialized professional who's been trained to calibrate scales. Not sure how we solve this issue with the code.
I'm very puzzled by this attitude. As an accelerator physicist, would you want your accelerator to be held together by duct tape and producing inconsistent results? Would you complain that you're not a professional machinist when somebody pointed it out? Why is software any different from hardware in this respect?
> I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points. I'm here to turn out science, not production ready code.
In what way do idiots making idiotic comments about your correct code invalidate your scientific production? You can still turn out science and let people read and comment freely on it.
> As an example, you seem to be complaining that their Monte Carlo code has non-deterministic output when that is the entire point of Monte Carlo methods and doesn't change their result.
I guess you would not need to engage personally with the idiots at "acceleratorskeptics.com", but most of their critique would likely be shut down by a simple sentence such as this one. Since most of your readers would not be idiots, they could scrutinize your code and even provide that reply on your behalf. This is called the scientific method.
I agree that you produce science, not merely code. Yet, the code is part of the science and you are not really publishing anything if you hide that part. Criticizing scientific code because it is bad software engineering is like criticizing it because it uses bad typography. You should not feel attacked by that.
Race conditions and certain forms of non-determinism could invalidate the results of a given study. Code is essentially a better-specified methods section, it just says what they did. Scientists are expected to include a methods section for exactly this reason, and any scientist worried about including a methods section in their paper would be rightly rejected.
However, a methods section is always under-specified. Code provides the unique opportunity to actually see the full methods on display and properly review their work. It should be mandated by all reputable journals and worked into the peer review process.
While you're running experiments, it doesn't matter; but code behind any sort of published result, or code reused as part of other publishable code, IS production code, and you should treat it as such.
> people claiming that their non-software engineering grade code invalidates the results of their study.
But that's exactly the problem.
Are you familiar with that bug in early Civ games where an overflow was making Gandhi nuke the crap out of everyone? What if your code has a similar issue?
What if you have a random value right smack in the middle of your calculations and you just happened to be lucky when you run your code?
I'm not that familiar with Monte Carlo; my understanding is that it is just a way to sample the data. And I won't be testing your data sampling, but I will expect that, given the same data to the calculation part (e.g., after the sampling happens), I get exactly the same results every time I run the code and on any computer. And if there are differences, I expect you to be able to explain why they don't matter, which will show you were aware of the differences in the first place and were not just lucky.
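One cheap way to encode exactly that expectation is a regression test pinned to a stored reference output. Everything below (the file names, the toy analysis() function, the tolerances) is made up for illustration, not anyone's actual pipeline:

    import numpy as np

    def analysis(samples):
        """Stand-in for the deterministic post-sampling part of the pipeline."""
        return {"mean": samples.mean(), "p95": np.percentile(samples, 95)}

    def test_analysis_matches_published_reference():
        samples = np.load("fixed_input_samples.npy")    # frozen input shipped with the code
        expected = np.load("reference_output.npz")      # numbers recorded when the paper was written
        result = analysis(samples)
        # Explicit tolerances make the cross-platform floating-point story
        # part of the method instead of a matter of luck.
        np.testing.assert_allclose(result["mean"], expected["mean"], rtol=1e-12)
        np.testing.assert_allclose(result["p95"], expected["p95"], rtol=1e-12)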
And then there is the matter of the magic values plastered all over research code.
Researchers should understand that the rules for "software engineering grade code" are not there just because we want to complicate things, but because we want to make sure the code is correct and does what we expect it to do.
/edit: The real problem is not getting good results from faulty code; it is ignoring good solutions because of faulty code.
> What I'm saying is that scientific code doesn't need to handle every special case or be easily usable by non-experts. In fact the time spent making it that way is time that a scientist spends doing software engineering instead of science, which isn't very efficient.
If the proof on which the paper is based is in the code that produced the evidence, you absolutely need to let an average user run it without specific knowledge, to abide by the reproducibility principle. Asking a reviewer to fiddle about like an IT professional to get something working is bound to promote lazy reviewing, and will result either in dismissal of the result or in approval without real review.
And by the way, one could argue that producing the paper isn't really science either, but if you are working with MSFT Office, you know a fair number of non-science work hours have been put into that as well.
> As an example, you seem to be complaining that their Monte Carlo code has non-deterministic output when that is the entire point of Monte Carlo methods and doesn't change their result.
Not so fast. Monte Carlo code turns arbitrary RNG seeds into outputs. That process can, and arguably should be, deterministic.
To do your study, you feed your Monte Carlo code 'random enough' seeds. Coming up with the seeds does not need to be deterministic. But once the seeds are fixed, the rest can be deterministic. Your paper should probably also publish the seeds used, so that people can reproduce everything. (And so they can check whether your seeds are carefully chosen, or really produce typical outcomes.)
> I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points. I'm here to turn out science, not production ready code.
Sure, and that rationale works OK when your code operates in a limited, specialized domain.
But if you're modeling climate change or infectious diseases, and you expect your work to affect millions of human lives and trillions of dollars in spending, then you owe us a full accounting of it.
>when that is the entire point of Monte Carlo methods and doesn't change their result.
Two nitpicks: a) it shouldn't change the conclusions, but MC calculations will get different results depending on the seed, and b) it is considered good practice in reproducible science to fix the seed so that subsequent runs give exactly the same results.
Ultimately, I think there is a balance: really poor code can lead to incorrect conclusions... but you don't need production ready code for scientific exploration.
Sorry to be pedantic, but although Monte Carlo simulations are based on pseudo-randomness, I still think it is good practice that they have deterministic results (i.e., use a given seed) so that the exact results can be replicated. If the precise numbers can be reproduced then a) it helps me as a reviewer see that everything is kosher with their code and b) it means that if I tweak the code to try something out my results will be fully compatible with theirs.
Why is "doing software engineering" not "doing science"?
Anybody who has conducted experimental research will say they spent 80% of the time using a hammer or a spanner. Repairing faulty lasers or power supplies. This process of reliable and repeatable experimentation is the basis of science itself.
Computational experiments must be held to the same standards as physical experiments. They must be reproducible and they should be publicly available (if publicly funded).
What are the frameworks used in scientific endeavours? Given that scaling is not an issue, something like Rails for science seems like it could potentially return many $(B/M)illions of dollars for humanity.
edit: please read the grandchild comment before going off on the idea that some random programmer on the Internet dares to criticize scientific code he does not understand. What is crucial in the argument here is indeed the distinction between methods employing pseudo-randomness, like Monte Carlo simulation, and non-determinism caused by undefined behavior.
> I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points.
The person who wrote the linked blog post claims to be a software engineer at Google. Unfortunately, that claim is not falsifiable, as the person decided to remain anonymous.
> As an example, you seem to be complaining that their Monte Carlo code has non-deterministic output when that is the entire point of Monte Carlo methods and doesn't change their result.
The claim is that even with the same seed for the random generator, the program produces different results, and this is explained by the allegation that it runs non-deterministically (in the sense of undefined behavior) across multiple threads. It also claims that the program produces significantly different results depending on which output file format is chosen.
If this is true, the code would have race conditions, and as being impacted by race conditions is a form of undefined behavior, this would make any result of the program questionable, as the program would not be well-defined.
Personally, I am very doubtful whether this is true; it would be incredibly sloppy of the Imperial College scientists. Some more careful analysis by a recognized programmer might be warranted.
However, it underlines the importance of the main topic: scientific code should be open to analysis.
> What I'm saying is that scientific code doesn't need to handle every special case or be easily usable by non-experts.
Fully agree with this. But it should try to document its limitations.
I want science to be held to a very high standard. Maybe even higher than "software engineering grade". Especially if it's being used as a justification for public policy.
At the risk of just mirroring points which have already been made:
> you understand that the links in your post are the exact worry people have when it comes to releasing code: people claiming that their non-software engineering grade code invalidates the results of their study.
It's profoundly unscientific to suggest that researchers should be given the choice to withhold details of their experiments that they fear will not withstand peer review. That's much of the point of scientific publication.
Researchers who are too ashamed of their code to submit it for publication, should be denied the opportunity to publish. If that's the state of their code, their results aren't publishable. Unpublishable garbage in, unpublishable garbage out. Simple enough. Journals just shouldn't permit that kind of sloppiness. Neither should scientists be permitted to take steps to artificially make it difficult to reproduce (in some weak sense) an experiment. (Independently re-running code whose correctness is suspect, obviously isn't as good as comparing against a fully independent reimplementation, but it still counts for something.)
If a mathematician tried to publish the conclusion of a proof but refused to show the derivation, they'd be laughed out of the room. Why should we hold software-based experiments to such a pitifully low standard by comparison?
It's not as if this is a minor problem. Software bugs really can result in incorrect figures being published. In the case of C and C++ code in particular, a seemingly minor issue can result in undefined behaviour, meaning the output of the program is entirely unconstrained, with no assurance that the output will resemble what the programmer expects. This isn't just theoretical. Bizarre behaviour really can happen on modern systems, when undefined behaviour is present.
A computer scientist once told me a story of some students he was supervising. The students had built some kind of physics simulation engine. They seemed pretty confident in its correctness, but in truth it hadn't been given any kind of proper testing, it merely looked about right to them. The supervisor had a suggestion: Rotate the simulated world by 19 degrees about the Y axis, run the simulation again, and compare the results. They did so. Their program showed totally different results. Oh dear.
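That rotation trick is a metamorphic test, and it is cheap to write. Here is a sketch against a toy integrator of my own (not the students' engine): gravity points along -Y, so rotating the world about Y and then simulating must agree with simulating first and rotating the result.

    import numpy as np

    def rot_y(deg):
        """Rotation matrix about the Y axis."""
        t = np.radians(deg)
        c, s = np.cos(t), np.sin(t)
        return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

    def simulate(pos, vel, steps=1000, dt=1e-3, g=np.array([0.0, -9.81, 0.0])):
        """Toy integrator: a point mass under uniform gravity along -Y."""
        for _ in range(steps):
            vel = vel + dt * g
            pos = pos + dt * vel
        return pos

    R = rot_y(19.0)
    p0, v0 = np.array([1.0, 2.0, 3.0]), np.array([0.5, 0.0, -0.2])
    # Rotate-then-simulate must equal simulate-then-rotate
    # (gravity is Y-aligned, so a rotation about Y leaves it unchanged).
    np.testing.assert_allclose(simulate(R @ p0, R @ v0), R @ simulate(p0, v0), rtol=1e-10)
    print("rotation symmetry holds")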
Needless to say, not all scientific code can so easily be shown to be incorrect. All the more reason to subject it to peer review.
> I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points.
Why would you care? Science is about advancing the frontier of knowledge, not about avoiding invalid criticism from online communities of unqualified fools.
I sincerely hope vaccine researchers don't make publication decisions based on this sort of fear.
> people claiming that their non-software engineering grade code invalidates the results of their study.
How exactly is this a bad thing?
> I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points. I'm here to turn out science, not production ready code.
But it should be noted that what you didn't say is that you're here to turn out accurate science.
This is the software version of statistics. Imagine if someone took a random sampling of people at a Trump rally and then claimed that "98% of Americans are voting for Trump". And now imagine someone else points out that the sample is biased and therefore the conclusion is flawed, and the response was "Hey, I'm just here to do statistics".
---
Do you see the problem now? The poster above you pointed out that the conclusions of the software can't be trusted, not that the coding style was ugly. Most developers would be more than willing to say "the code is ugly, but it's accurate". What we don't want is to hear "the conclusions can't be trusted and 100 people have spent 10+ years working from those unreliable conclusions".
As a theoretical physicist doing computer simulations, I am trying to publish all my code whenever possible. However all my coauthors are against that. They say things like "Someone will take this code and use it without citing us", "Someone will break the code, obtain wrong results and blame us", "Someone will demand support and we do not have time for that", "No one is giving away their tools which make their competitive advantage". This is of course all nonsense, but my arguments are ignored.
If you want to help me (and others who agree with me), please sign this petition: https://publiccode.eu. It demands that all publicly funded code must be public.
>"Someone will demand support and we do not have time for that",
Well ... that part isn't nonsense, though I agree it shouldn't be a dealbreaker. And it means we should work towards making such support demands minimal or non-existent via easy containerization.
I note with frustration that even the Docker people, whose entire job is containerization, can get this part wrong. I remember when we containerized our startup's app c. 2015, to the point that you should have been able to run it locally just by installing Docker and running `docker-compose up`, and it still stopped working within a few weeks (which we discovered when onboarding new employees) and required a knowledgeable person to debug and rewrite it.
(They changed the spec for docker-compose so that the new version you'd get when downloading Docker would interpret the yaml to mean something else.)
As a theoretical physicist your results should be reproducible based on the content of your papers, where you should detail/state the methods you use. I would make the argument that releasing code in your position has the potential to be scientifically damaging; if another researcher interested in reproducing your results reads your code, then it is possible their reproduction will not be independent. However they will likely still publish it as such.
> "No one is giving away their tools which make their competitive advantage"
This hits close to home. Back in college, I developed software, for a lab, for a project-based class. I put the code up on GitHub under the GPL license (some code I used was licensed under GPL as well), and when the people from the lab found out, they lost their minds. A while later, they submitted a paper and the journal ended up demanding the code they used for analysis. Their solution? They copied and pasted pieces of my project they used for that paper and submitted it as their own work. Of course, they also completely ignored the license.
> Scientists really need to publish their code artifacts, and we can no longer just say "Well they're scientists or mathematicians" and allow that as an excuse for terrible code with no testing specs.
You are blaming scientists, but speaking from my personal experience as a computational scientist, this exists because there are few structures in place that incentivize strong programming practices.
* Funding agencies do not provide support for verification and validation of scientific software (typically)
* Few journals assess code reproducibility, and few require public code (few require even public data)
* There are few funded studies to reproduce major existing studies
Until these structural challenges are addressed, scientists will not have sufficient incentive to change their behavior.
> Scientific code needs to have tests, a minimal amount of test coverage, and code/data used really need to be published and run by volunteers/editors in the same way papers are reviewed, even for non-computer science journals.
Second this. Research code is already hard, and with misaligned incentives from the funding agencies and grad school pipelines, it's an uphill battle. Not to mention that professors with an outdated mindset might discourage graduate students from committing too much time to work on scientific code. "We are scientists, not programmers. Coding doesn't advance your career" is often an excuse for that.
In my opinion, enforcing standards without addressing this root cause is not gonna fix the problem. Worse, students and early-career researchers will bear the brunt of increased workload and code compliance requirements from journals. Big, well-funded labs that can afford a research engineer position are gonna have an edge over small labs that cannot.
After a paper has been accepted, authors can submit a repository containing a script which automatically replicates results shown in the paper. After a reviewer confirms that the results were indeed replicable, the paper gets a small badge next to its title.
While there could certainly be improvements, I think it's a step in the right direction.
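For anyone wondering what such a submission might contain, it can be as dumb as a single entry-point script that rebuilds every artifact in the paper. This is a made-up sketch (the directory layout, script names, and seed are placeholders), not any particular badge program's required format:

    # reproduce.py -- one command that regenerates every figure and table in the paper.
    import subprocess
    import sys

    STEPS = [
        ["python", "analysis/clean_data.py", "--in", "data/raw.csv", "--out", "data/clean.csv"],
        ["python", "analysis/fit_model.py", "--in", "data/clean.csv", "--seed", "12345",
         "--out", "tables/table2.csv"],
        ["python", "analysis/make_figures.py", "--in", "tables/table2.csv", "--out", "figures/"],
    ]

    for step in STEPS:
        print(">>", " ".join(step))
        if subprocess.run(step).returncode != 0:
            sys.exit(f"step failed: {' '.join(step)}")
    print("Done; diff figures/ and tables/ against the published versions.")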
> If journals really care about the reproducibility crisis
All is well and good then, because journals absolutely don't care about science. They care about money and prestige. From personal experience, I'd say this intersects with the interests of most high-ranking academics. So the only unhappy people are idealistic youngsters and science "users".
I am in 100% agreement and would like to point out that many papers based on code don't even come with code bases, and if they do those code bases are not going to contain or be accompanied by any documentation whatsoever. This is frequently by design as many labs consider code to be IP and they don't want to share it because it gives them a leg up on producing more papers and the shared code won't yield an authorship.
There are some efforts in this vein within academia, but they are very weak in the United States. The U.S. Research Software Engineer Association (https://us-rse.org/) represents one such attempt at increasing awareness about the need for dedicated software engineers in scientific research and advocates for a formal recognition that software engineers are essential to the scientific process.
Realistically though even if the necessity of research software engineering were acknowledged at the institutional level at the bulk of universities, there would still be the problem of universities paying way below market rate for software engineering talent...
To some degree, universities alone cannot effect the change needed to establish a professional class of software engineers that collaborate with researchers. Funding agencies such as the NIH and NSF are also responsible, and need to lead in this regard.
No one expects them to be software engineers, but we do expect them to be _scientists_ - to publish results that are reproducible and verifiable. And that has to hold for code as well.
John Carmack, who did some small amount of work on the code, had a short rebuttal of the "Lockdown Skeptics" attack on the Imperial College code that probably mirrors the feelings of some of us here.
Can you describe a bit more about what is going on in the project? The file you linked is over 2.5k lines of C++ code, and that is just the "setup" file. As you say, this is supposed to be a statistical model; I expected it to be in R, Python, or one of the standard statistical packages.
It is essentially a detailed simulation of viral spread, not just a programmed distribution or anything. It's all in C++ because it's pretty performance-critical.
Because much of this code was written in the 80's, I suspect. In general, there's a bunch of really old scientific codebases in particular disciplines because people have been working on these problems for a looooonnngg time.
In computer science a lot of researchers already publish their code (at least in the domain of software engineering), but my biggest problem is not the absence of tests, it is the absence of any documentation on how to run it. In the best case you can open it in an IDE and it will figure out how to run it, but I rarely see any indication of what the dependencies are. So once you figure out how to run the code, you run it until you hit the first import exception, install that dependency, run until the next import exception, and so on. I have spent way too much time on that instead of doing real research.
The criticisms of the code from Imperial College are strange to me. Non-deterministic code is the least of your problems when it comes to modeling the spread of a brand new disease. Whatever error is introduced by race conditions or multiple seeds is completely dwarfed by the error in the input parameters. Like, it's hard to overstate how irrelevant that is to the practical conclusions drawn from the results.
Skeptics could have a field day tearing apart the estimates for the large number of input parameters to models like that, but they choose not to? I don't get it.
I do research for a private company, and open-source as much of my work as I can. It's always a fight. So I'll take their side for the moment.
Many years ago, a paper on the PageRank algorithm was written, and the code behind that paper was monetized to unprecedented levels. Should computer science journals also require working proof of concept code, even if that discourages companies from sharing their results; even if it prevents students from monetizing the fruits of their research?
For a seasoned software developer, encountering scientific code can be a jarring experience. So many code smells. Yet most of those code smells are really only code smells in application development. Most scientific code only ever runs once, so most of the axioms of software engineering are inapplicable or a distraction from the business at hand.
Scientists, not programmers, should be the ones spear-heading the development of standards and rules of thumb.
Still, there are real problematic practices that an emphasis on sharing scientific code would discourage. One classic one is the use of a single script that you edit each time you want to re-parameterize a model. Unless you copy the script into the output, you lose the informational channel between your code and its output. This can have real consequences. Several years ago I started up a project with a collaborator to follow up on their unpublished results from a year prior. Our first task was to take that data and reproduce the results they obtained before, because the person no longer had access to the exact copy of the script that they ran. We eventually determined that the original result was due to a software error (which we eventually identified). My colleague took it well, but the motivation to continue the project was much diminished.
You can blame all the scientists, but shouldn't we blame the CS folks for not coming up with suitable languages and software engineering methods that will prevent software from rotting in the first place?
Why isn't there a common language that all other languages compile to, and that will be supported on all possible platforms, for the rest of time?
(Perhaps WASM could be such a language, but the point is that this would be merely coincidental and not a planned effort to preserve software)
And why aren't package managers structured such that packages will live forever (e.g. in IPFS) regardless of whether the package management system is online? Why is Github still a single point of failure in many cases?
It's hard for me to publish my code in healthcare services research because most of it is under lock and key due to HIPAA concerns. I can't release the data, and so 90% of the work of munging and validating the data is un-releasable. So, should I release my last 10% of code where I do basic descriptive stats, make tables, make visualizations, or do some regression modeling? Certainly, I can make that available in de-identified ways, but without data, how can anyone ever verify its usefulness? And does anyone want to see how I calculated the mean, median, SD, IQR? It's base R or tidyverse; that's not exactly revolutionary code.
One of the things I come across is scientists who believe they're capable of learning to code quickly because they're capable in another field.
After they embark on solving problems, it becomes an eye-opening experience, and one that soon becomes about just keeping things running.
For those who have a STEM discipline in addition to a software development background of more than five years: would you agree with the above?
I would have thought the scientists among us would approach someone with software development expertise (something abstract and requiring a different set of muscles).
One positive development is the variety of low/no-code tooling that can replace a lot of this hornets'-nest coding.
It's generally not plausible to "approach someone with software development expertise" for organizational and budget reasons. Employing dedicated software developers is simply not a thing that happens; research labs overwhelmingly have the coding done by researchers and involved students without any dedicated positions for software development.
In any case you'd need to teach them the problem domain, and it's considered cheaper (and simpler from organizational perspective) to get some phd students or postdocs from your domain to spend half a year getting up to speed on coding (and they likely had a few courses in programming and statistics anyway) than to hire an experienced software developer and have them learn the basics of your domain (which may well take a third or half of the appropriate undergraduate bachelor's program).
> I would have thought the scientists among us would approach someone with software development expertise.
Is there a pool of skilled software architects willing to provide consultations at well-below market wages? Or a Q&A forum full of people interested in giving this kind of advice? (StackOverflow isn't useful for this; the allowed question scope is too narrow.) I guess one incentive to publish one's code is to get it criticized on places like Hacker News. The best way to get the right answer on the internet is to post the wrong answer, after all.
My work position was created because scientists are not engineers. I had to explain -to my disappointment- why non-deterministic algorithms are bad, how to write tests, and how to write SQL queries, more than once.
However, when working as equals, scientists and engineers can create truly transformative projects. The algorithm accounts for 10% of the solution. The code, infrastructure, and system design account for 20% of the final result. The remaining 70% of the value comes directly from its impact. A project that nobody uses is a failure. Something that perfectly solves a problem that nobody cares about is useless.
> This was used by the Imperial College for COVID-19 predictions. It has race conditions, seeds the model multiple times, and therefore has totally non-deterministic results[0].
This does not look like a good example at all, as it appears the blog author there is just trying to discredit the program because he does not like the results. He also writes that all epidemiological research should be defunded.
There is a fundamental reason not to publish scientific code.
If someone is trying to reproduce someone else's results, the data and methods are the only ingredients they need. If you add code into this mix, all you do is introduce new sources of bias.
This is an easy argument to make because it was already made for you in the popular press months ago.
Show me the grant announcements that identify reproducible long term code as a key deliverable, and I’ll show you 19 out of 20 scientists who start worrying about it.
Short answer: Yes, my 30 year old Fortran code runs (with a few minor edits between f77 and modern fortran), as did my ancient Perl codes.
Watching the density functional theory based molecular dynamics zip along at ~2 seconds per time step on my 2-year-old laptop, versus roughly 6k seconds per time step on an old Sun machine back in 1991, is quite something. I remember the same code getting down to 60 seconds per time step on my desktop R8k machine in the late 90s.
What's been really awesome about that is that I wrote some binary data files on big-endian machines in the early 90s and re-read them on the laptop (little-endian) by adding a single compiler switch.
Perl code that worked with big XML file input in the mid 2000s continues to work, though I've largely abandoned using XML for data interchange.
C code I wrote in the mid 90s compiled, albeit with errors that needed to be corrected. C++ code was less forgiving.
Over the past 4 months, I had to forward port a code from Boost 1.41 to Boost 1.65. Enough changes over 9 years (code was from 2011) that it presented a problem. So I had to follow the changes in the API and fix it.
I am quite thankful I've avoided the various fads in platforms and languages over the years. Keep inputs in simple textual format that can be trivially parsed.
> What's been really awesome about that is that I wrote some binary data files on big-endian machines in the early 90s and re-read them on the laptop (little-endian) by adding a single compiler switch.
I want to second the idea of just dumping your floating point data as binary. It's basically the CSV of HPC data. It doesn't require any libraries, which could break or change, and even if the endianness changes you can still read it decades later. I've been writing a computational fluid dynamics code recently and decided to only write binary output for those reasons. I'm not convinced of the long-term stability of other formats. I've seen colleagues struggle to read data in proprietary formats even a few years after creating it. Binary is just simple and avoids all of that. Anybody can read it if needed.
Counter-argument: Binary dumps are horrible because usually the documentation that allows you to read the data is missing. Using a self-documenting format such as HDF5 is far superior. It will tell you whether the bits are floating point numbers in single or double precision, what the endianness is, and what the layout of the 3D array was. (No surprise that HDF was invented for the Voyager mission, where they had to ensure readability of the data for half a century.)
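For what it's worth, the self-describing route is only a few more lines, e.g. with h5py (the dataset names, attributes, and file name below are just illustrative):

    import numpy as np
    import h5py

    field = np.random.default_rng(0).random((64, 64, 64))   # some 3D result

    with h5py.File("run_042.h5", "w") as f:
        dset = f.create_dataset("velocity_x", data=field, dtype="f8")
        dset.attrs["units"] = "m/s"
        dset.attrs["grid_spacing_m"] = 0.01
        f.attrs["code_version"] = "v1.3.2"
        f.attrs["rng_seed"] = 0

    # Years later: shape, precision, byte order and the metadata all come back with the data.
    with h5py.File("run_042.h5", "r") as f:
        d = f["velocity_x"]
        print(d.shape, d.dtype, d.attrs["units"], f.attrs["code_version"])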
Yes, I know a couple of Fortran 77 apps and libraries which were developed more than 25 years ago and which are still in use today.
My C++ Qt GUI application for NMR spectrum analysis (https://github.com/rochus-keller/CARA) has been running for 20 years now, with continuing high download and citation rates.
So obviously C++/Qt and Fortran 77 are very well suited to standing the test of time.
As someone who has worked with bits of scientific code: "does the code you write right now work on another machine" might be the more appropriate challenge. I've seen a lot of hardcoded paths, unmentioned dependencies, and monkey-patched libraries downloaded from somewhere; just getting new code to work is hard enough. And let's not even begin to talk about versioning or magic numbers.
Similar to other comments I don't mean to fault scientists for that - their job is not coding and some of the dependencies come from earlier papers or proprietary cluster setups and are therefore hard to avoid - but the situation is not good.
To me, that's like a theoretical physicist saying "My job is not to do mathematics" when asked for a derivation of a formula he put in the paper.
Or an experimental physicist saying "My job is not mechanical engineering" when asked for details of their lab equipment (almost all of which is typically custom built for the experiment).
On one hand, yes. But on the other hand, reusable code, dependency management, linting, portability, etc. are not easy problems, and they are something junior developers tend to struggle with (and it's not like the problem never pops up for seniors, either). I really can't fault non-compsci scientists for not handling that problem well. Of course, part of it (like publishing the relevant code) is far easier and should be done, but some aspects are really hard.
IMO the incentive problem in science (basically number of papers and new results is what counts) also plays into this, as investing tons of time in your code gives you hardly any reward.
The point is that as a scientist your code is a tool to get the job done and not the product. I can't spend 48 hours writing unit tests for my library (even though I want to) if it's not going to give me results. It's literally not my job and is not an efficient use of my time
>Yeah, we built it with duct tape and there's hot glue holding the important bits that kept falling off. Don't put anything metal in that; we use it as a tea heater, but there's 1000A running through it, so it shoots spoons out when we turn the main machine on.
Lots of people saying, it is the scientist's job to produce reproducible code. It is, and the benefits of reproducible code are many. I have been a big proponent of it in my own work.
But not with the current mess of software frameworks. If I am to produce reproducible scientific code, I need an idiot-proof method of doing it. Yes, I can put in the 50-100 hours to learn how to do it [1], but guess what, in about 3-5 years a lot of that knowledge will be outdated. People compare it with math, but the math proofs I produce will still be readable and understandable a century from now.
Regularly used scientific computing frameworks like MATLAB, R, the Python ecosystem, and Mathematica need a dumb, guided method of producing releasable and reproducible code. I want to click through a bunch of "next" buttons that help me fix the problems you indicate, and finally release a final version that has all the information necessary for someone else to reproduce the results.
[1] I have. I would put myself in the 90th percentile of physicists familiar with best practices for coding. I speak for the 50th percentile.
(1) Use a package manager which stores hash sums in a lock file.
(2) Install your dependencies from the lock file as the spec.
(3) Do not trust version numbers. Trust hash sums. Do not believe in "But I set the version number!".
(4) Do not rely on downloads. Again, trust hash sums, not URLs.
(5) Hash sums!!!
(6) Wherever there is randomness, as in random number generators, use a seed. If the interface does not allow specifying the seed, throw the trash away and use another generator. Be careful when concurrency is involved; it might destroy reproducibility. For example, this was the case with TensorFlow. Not sure whether it still is.
(7) Use a version control system.
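A minimal illustration of points (4) and (5) in Python: check every external input against a recorded SHA-256 before running anything. The file name and hash below are placeholders; a real lock file would list every input.

    import hashlib
    import sys

    # path -> SHA-256 recorded when the published results were produced (placeholder values).
    PINNED_INPUTS = {
        "inputs/survey_2019.csv":
            "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
    }

    def sha256(path, chunk_size=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while block := f.read(chunk_size):
                h.update(block)
        return h.hexdigest()

    for path, expected in PINNED_INPUTS.items():
        actual = sha256(path)
        if actual != expected:
            sys.exit(f"{path}: hash mismatch ({actual}); refusing to run")
    print("all inputs verified")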
Definitely makes you question it more. Does the paper not explain the contents of the MATLAB code? That's all that is usually needed for reproducibility. You should be able to get the same results no matter who writes the code to do what is explained in their methods.
Of course, I have no idea about the paper you're talking about and just want to say that reproducibility isn't dependent on releasing code. There could even be a case where it's better if someone reproduces a result without having been biased by someone else's code.
I think the idea that scientific code should be judged by the same standards as production code is a bit unfair. The point when the code works the first time is when an industry programmer starts to refactor it -- because he expects to use and work on it in the future. The point when the code works the first time is when a scientist abandons it -- because it has fulfilled its purpose. This is why the quality is lower: lots of scientific code is the first iteration that never got a second.
(Of course, not all scientific code is discardable; large quantities of reusable code are reused every day. We have many frameworks, and the code quality of those is completely different.)
But it often is. For most non-CS papers (mostly biosciences) I've read, there are specific authors whose contribution was largely "coding".
The gold standard for a scientific finding is not whether a particular experiment can be repeated; it is whether a different experiment can confirm the finding.
The idea is that you have learned something about how the universe works. Which means that the details of your experiment should not change what you find... assuming it's a true finding.
Concerns about software quality in science are primarily about avoiding experimental error at the time of publication, not the durability of the results. If you did the experiment correctly, it doesn't matter if your code can run 10 years later. Someone else can run their own experiment, write their own code, and find the same thing you did.
And if you did the experiment incorrectly, it also doesn't matter if you can run your code 10 years later; running wrong code a decade later does not tell you what the right answer is. Again--conducting new research to explore the same phenomenon would be better.
When it comes to hardware, we get this. Could you pick up a PCR machine that's been sitting in a basement for 10 years and get it running to confirm a finding from a decade ago? The real question is, why would you bother? There are plenty of new PCR machines available today, that work even better.
And it's the same for custom hardware. We use all sorts of different telescopes to look at Jupiter. Unless the telescope is broken, it looks the same in all of them. Software is also a tool for scientific observation and experimentation. Like a telescope, the thing that really matters is whether it gives a clear view of nature at the time we look through it.
Reproducibility is about understanding the result. It is the modern version of "showing your work".
One of the unsung and wonderful properties of reproducible workflows is the fact that it can allow science to be salvaged from an analysis that contains an error. If I had made an error in my thesis data analysis (and I did, pre-graduation), the error can be corrected and the analysis re-run. This works even if the authors are dead (which I am not :) ).
Reproducibility abstracts the analysis from data in a rigorous (and hopefully in the future, sustainable) fashion.
>Reproducibility is about understanding the result. It is the modern version of "showing your work".
That is something no one outside of high school cares about. The idea that you can show your work in general is ridiculous. Do I need to write a few hundred pages of set theory to start using addition in a physics paper? No. The work you need to show is the work a specialist in the field would find new, which is completely different from what a layman would find new.
Every large lab, the ones that can actually reproduce results, has decades of specialist code that does not interface with anything outside the lab. Providing the source code is then as useful as giving a binary print out of an executable for an OS you've never seen before.
> running wrong code a decade later does not tell you what the right answer is.
It can tell, however, exactly where the error lies (if the error is in software at all). Like a math teacher that can circle where the student made a mistake in an exam.
Yes, this argument, along with the practices of cross checking within one project, is what saves science from the total doom its software practices would otherwise deliver.
However, reproducibility is a precondition to automation, and automation is a real nice thing to have.
Yes. 110% attributed to learning about unit-tests and gems/CPAN in grad school.
IMO there is a big fallacy in the "just get it to work" approach. Most serious scientific code, i.e. code supporting months to years of research, is used and modified a lot. It's also not really one-off; it's a core part of a dissertation or research program, and if it fails, you do too. I'd argue (and I found) that using unit tests, a deployment strategy, etc. ultimately allowed me to do more and better science, because in the long run I didn't spend as much time figuring out why my code didn't run when I tweaked stuff. This is really liberating stuff. I suspect this is all obvious to those who have gone down that path.
Frankly, every reasonably tricky problem benefits from unit tests for another reason as well. Don't know how to code it, but know the answer? Assert lots of stuff, not just one thing at a time red-green style. Then code, and see what happens. So powerful for scientific approaches.
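A sketch of that style on a toy problem (my own example, not the parent's code): the integrator is the thing being developed, and the asserts are answers known analytically before a line of the integrator exists.

    import numpy as np

    def integrate_sho(x0, v0, omega, t_end, dt=1e-4):
        """Leapfrog integration of x'' = -omega^2 x -- the code under development."""
        x = x0
        v = v0 + 0.5 * dt * (-omega**2 * x0)     # half-step kick to stagger velocity
        for _ in range(round(t_end / dt)):
            x += dt * v
            v += dt * (-omega**2 * x)
        return x

    # Assert lots of known answers up front, then make the code satisfy them.
    w = 2 * np.pi                                                      # period of exactly 1
    assert abs(integrate_sho(1.0, 0.0, w, t_end=1.0) - 1.0) < 1e-3     # back to start after one period
    assert abs(integrate_sho(1.0, 0.0, w, t_end=0.5) + 1.0) < 1e-3     # mirrored after half a period
    assert abs(integrate_sho(0.0, 0.0, w, t_end=3.7)) < 1e-12          # rest stays at rest
    print("all sanity checks pass")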
The longest-running code I wrote as a scientist was a sandwich ordering system. I worked for a computer graphics group at UCSF while taking a year off from grad school while my simulations ran on a supercomputer, and we had a weekly group meeting where everybody ordered sandwiches from a local deli.
It was 2000, so I wrote a cgi-bin in Python (2?) with a MySQL backend. The menu was stored in MySQL, as were the orders. I occasionally check back to see if it's still running, and it is: a few code changes to port to Python 3, a data update since they changed vendors, and a MySQL update or two as well.
An interesting concern is that there often is no single piece of code that has produced the results of a given paper. Often it is a mixture of different (and evolving) versions of different scripts and programs, with manual steps in between. Often one starts the calculation with one version of the code, identifies edge cases where it is slow or inaccurate, develops it further while the calculations are running, does the next step (or re-does a previous one) with the new version, possibly modifying intermediate results manually to fit the structure of the new code, and so on -- the process is interactive, and not trivially repeatable.
So the set of code one has at the end is not the code the results were obtained with: it is just the code with the latest edge case fixed. Is it able to reproduce the parts of the results that were obtained before it was written? One hopes so, but given that advanced research may take months of computer time and machines with high memory/disk/CPU/GPU/network speed requirements only available in a given lab -- it is not at all easy to verify.
>the process is interactive, and not trivially repeatable.
The kind of interaction you're describing should be frowned upon. It requires the audience to trust the manual data edits are no different than rerunning the analysis. But the researcher should just rerun the analysis.
Also, mixing old and new results is a common problem in manually updated papers. It can be avoided by using reproducible research tools like R Markdown.
If it can't be trivially repeated, then you should publish what you have with an explanation of how you got it. Saying that "the researcher should just rerun the analysis" is not taking into account the fact that this could be very expensive and that you can learn a lot from observations that come from messy systems. Science is about more than just perfect experiments.
No, you should publish this research and be clear with how it all worked out and someone will reproduce it in their own way.
Reproducibility isn't usually about having a button to press that magically gives you the researchers' results. It's also not always a set of perfect instructions. More often it is documentation of what happened and what was observed, as the researchers believe is important to the understanding of the research questions. Sometimes we don't know what's important to document, so we try to document as much as possible. This isn't always practical, and sometimes it is obviously unnecessary.
Back in the 80s/90s I was heavily into TeX/LaTeX—I was responsible for a major FTP archive that predated CTAN, wrote ports for some of the utilities to VM/CMS and VAX/VMS and taught classes in LaTeX for the TeX Users Group. I wrote most of a book on LaTeX based on those classes that a few years back I thought I'd resurrect. Even something as stable as LaTeX has evolved enough that just getting the book to recompile with a contemporary TeX distribution was a challenge. (On the other hand, I've also found that a lot of what I knew from 20+ years ago is still valid and I'm able to still be helpful on the TeX stack exchange site).
I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points. I'm here to turn out science, not production ready code.
As an example, you seem to be complaining that their Monte Carlo code has non-deterministic output when that is the entire point of Monte Carlo methods and doesn't change their result.
By the way, yes I tested my ten year old code and it does still work. What I'm saying is that scientific code doesn't need to handle every special case or be easily usable by non-experts. In fact the time spent making it that way is time that a scientist spends doing software engineering instead of science, which isn't very efficient.
Does scientific-grade code need to handle a large number of users running it at the same time? Probably not a genuine concern, since those users will run their own copies of the code on their own hardware, and it's not necessary or relevant for users to see the same networked results from the same instance of the program running on a central machine.
Does scientific-grade code need to publish telemetry? Eh, usually no. Set up alerting so that on-call engineers can be paged when (not if) it falls over? Nope.
Does scientific-grade code need to handle the authorization and authentication of users? Nope.
Does scientific-grade code need to be reproducible? Yes. Fundamentally yes. The reproducibility of results is core to the scientific method. Yes, that includes Monte Carlo code, when there is no such thing as truly random number generation on contemporary computers, only pseudorandom number generation, and what matters for cryptographic purposes is that the seed numbers for the pseudorandom generation are sufficiently hidden / unknown. For scientific purposes, the seed numbers should be published on purpose, so that a) the exact results you found, sufficiently random as they are for the purpose of your experiment, can still be independently verified by a peer reviewer, b) a peer reviewer can intentionally decide to pick a different seed value, which will lead to different results but should still lead to the same conclusion if your decision to reject / refuse to reject the null hypothesis was correct.
Scientists need to get over this fear about their code. They need to produce better code and need to actually start educating their students on how to write and produce code. For too long many in the physics community have trivialized programming and seen it as assumed knowledge.
Having open code will allow you to become better and you’ll produce better results.
Side note: 25 years ago I worked in accelerator science too.
Given that such software forms the very foundation of the results of such papers, why shouldn't it fall under scrutiny, even for "minor" points? If you are unable to produce good technical content, why are you qualified to declare what is or isn't minor? Isn't the whole point that scrutiny is best left to technical experts (and not subject experts)?
If code is what is substantiating a scientific claim, then code needs to stand up to scientific scrutiny. This is how science is done.
I came from physics, but systems and computer engineering was always an interest of mine, even before physics. I always thought it was kooky that CS people can release papers without code; fine if the paper contains all the proofs, but otherwise it shouldn't even be looked at. PoS (proof-of-science) or GTFO.
We are at the point in human and scientific civilization where knowledge needs to prove itself correct. Papers should be self-contained execution environments that generate PDFs and resulting datasets. The code doesn't need to be pretty, or robust, but it needs to be sealed inside a container so that it can be re-run, re-validated, and confirmed by someone else X years from now. And it isn't about trusting or not trusting the researcher; we need to fundamentally trust the results.
Specifically, to that point, I want to cite the saying:
"The dogs bark, but the caravan passes."
(There is a more colorful German variant which is, translated: "What does it bother the mighty old oak tree if a dog takes a piss...").
Of course, if you publish your code, you expose it to critics. Some of that criticism will be unqualified, and as we have seen in the case of, e.g., climate scientists, some might even be nasty. But who cares? What matters is open discussion, which is a core value of science.
It reminds me of Kerckhoffs's principle in cryptography, which states: A cryptosystem should be secure even if everything about the system, except the key, is public knowledge.
I do not think "non-experts" should be able to use your code, but I do think an expert who was not involved in writing it should be.
Code belongs with the paper. Otherwise we can just continue to make up numbers and pretend we found something significant.
In what way do idiots making idiotic comments about your correct code invalidate your scientific production? You can still turn out science and let people read and comment freely on it.
> As an example, you seem to be complaining that their Monte Carlo code has non-deterministic output when that is the entire point of Monte Carlo methods and doesn't change their result.
I guess you would not need to engage personally with the idiots at "acceleratorskeptics.com", but most of their critique would likely be shut down by a simple sentence such as this one. Since most of your readers would not be idiots, they could scrutinize your code and even provide that reply on your behalf. This is called the scientific method.
I agree that you produce science, not merely code. Yet, the code is part of the science and you are not really publishing anything if you hide that part. Criticizing scientific code because it is bad software engineering is like criticizing it because it uses bad typography. You should not feel attacked by that.
However, a methods section is always under-specified. Code provides the unique opportunity to actually see the full methods on display and properly review their work. It should be mandated by all reputable journals and worked into the peer review process.
But that's exactly the problem.
Are you familiar with that bug in early Civ games where an overflow made Gandhi nuke the crap out of everyone? What if your code has a similar issue?
What if you have a random value right smack in the middle of your calculations and you just happened to get lucky when you ran your code?
I'm not that familiar with Monte Carlo; my understanding is that it is just a way to sample the data. And I won't be testing your data sampling, but I will expect that, given the same data to your calculation part (e.g., after the sampling happens), I get exactly the same results every time I run the code and on any computer. And if there are differences, I expect you to be able to explain why they don't matter, which will show you were aware of the differences in the first place and were not just lucky.
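To make that expectation concrete, here is a minimal sketch of the kind of determinism check I mean, with a hypothetical run_model(input_data, seed) standing in for the calculation stage (not any particular paper's code):

```python
import numpy as np

def test_same_input_same_output(run_model, input_data):
    # run_model is hypothetical: the deterministic "calculation" stage under test.
    first = run_model(input_data, seed=12345)
    second = run_model(input_data, seed=12345)
    # On one machine this should be bit-identical; across machines or compilers,
    # compare within a documented tolerance and explain where it comes from.
    assert np.array_equal(first, second)
```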
And then there is the matter of magic values that plaster research code.
Researchers should understand that the rules for "software engineering grade code" are not there just because we want to complicate things, but because we want to make sure the code is correct and does what we expect it to do.
/edit: The real problem is not getting good results with faulty code; it is ignoring good solutions because of faulty code.
If the proof on which the paper is based is in the code that produced the evidence, you absolutely need to let an average user run it without specific knowledge, to abide by the reproducibility principle. Asking a reviewer to fiddle about like an IT professional to get something working is bound to promote lazy reviewing, and will result in either dismissal of the result or approval without real review.
And by the way, it could be argued that producing a paper isn't really science either; but if you are working with MSFT Office, you know there is a fair amount of non-science work hours that has been put into that as well.
Not so fast. Monte Carlo code turns arbitrary RNG seeds into outputs. That process can be, and arguably should be, deterministic.
To do your study, you feed your Monte Carlo code 'random enough' seeds. Coming up with the seeds does not need to be deterministic. But once the seeds are fixed, the rest can be deterministic. Your paper should probably also publish the seeds used, so that people can reproduce everything. (And so they can check whether your seeds are carefully chosen, or really produce typical outcomes.)
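A toy sketch of that workflow (my own example, not from any paper): the seed is an explicit, publishable input, so the run is repeatable, and a different seed should change the digits but not the conclusion.

```python
import numpy as np

def estimate_pi(n_samples, seed):
    # The seed is part of the method: report it alongside the result.
    rng = np.random.default_rng(seed)
    x, y = rng.random(n_samples), rng.random(n_samples)
    return 4.0 * np.mean(x * x + y * y <= 1.0)

print(estimate_pi(1_000_000, seed=20200517))  # rerunning with this seed reproduces the exact number
print(estimate_pi(1_000_000, seed=31337))     # different seed: different digits, same conclusion
```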
Sure, and that rationale works OK when your code operates in a limited, specialized domain.
But if you're modeling climate change or infectious diseases, and you expect your work to affect millions of human lives and trillions of dollars in spending, then you owe us a full accounting of it.
Two nitpicks: a) it shouldn't change the conclusions, but MC calculations will get different results depending on the seed; and b) it is considered good practice in reproducible science to fix the seed so that subsequent runs give exactly the same results.
Ultimately, I think there is a balance: really poor code can lead to incorrect conclusions... but you don't need production ready code for scientific exploration.
Anybody who has conducted experimental research will say they spent 80% of the time using a hammer or a spanner. Repairing faulty lasers or power supplies. This process of reliable and repeatable experimentation is the basis of science itself.
Computational experiments must be held to the same standards as physical experiments. They must be reproducible and they should be publicly available (if publicly funded).
Sounds like I should just become a scientist then.
Do you guys write unit tests or is that beneath you too?
> I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points.
The person who wrote the linked blog post claims to have been a software engineer at Google. Unfortunately, that claim is not falsifiable, as the person decided to remain anonymous.
> As an example, you seem to be complaining that their Monte Carlo code has non-deterministic output when that is the entire point of Monte Carlo methods and doesn't change their result.
The claim is that even with the same seed for the random generator, the program produces different results, and this is explained by the allegation that it runs non-deterministically (in the sense of undefined behavior) across multiple threads. It also claims that the program produces significantly different results depending on which output file format is chosen.
If this is true, the code would have race conditions, and since a data race is a form of undefined behavior, this would make any result of the program questionable, as its behavior would not be well defined.
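For what it's worth, even without undefined behavior, an unordered parallel reduction alone can make runs with the same seed differ, because floating-point addition is not associative. A tiny, generic Python illustration of that narrower point (nothing to do with the covid-sim code itself):

```python
import numpy as np

# Floating-point addition is not associative:
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))            # False

# So identical draws (same seed) summed in different orders, as threads finishing
# in varying order would produce, can give totals that differ in the last bits.
rng = np.random.default_rng(1234)
draws = rng.standard_normal(1_000_000)
print(np.sum(draws) - np.sum(draws[::-1]))   # typically tiny, but nonzero
```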
Personally, I am very doubtful whether this is true; it would be incredibly sloppy of the Imperial College scientists. Some more careful analysis by a recognized programmer might be warranted.
However, it underlines the importance of the main topic: scientific code should be open to analysis.
> What I'm saying is that scientific code doesn't need to handle every special case or be easily usable by non-experts.
Fully agree with this. But it should try to document its limitations.
GPT-3 FTW!
> you understand that the links in your post are the exact worry people have when it comes to releasing code: people claiming that their non-software engineering grade code invalidates the results of their study.
It's profoundly unscientific to suggest that researchers should be given the choice to withhold details of their experiments that they fear will not withstand peer review. That's much of the point of scientific publication.
Researchers who are too ashamed of their code to submit it for publication should be denied the opportunity to publish. If that's the state of their code, their results aren't publishable. Unpublishable garbage in, unpublishable garbage out. Simple enough. Journals just shouldn't permit that kind of sloppiness. Neither should scientists be permitted to take steps that artificially make it difficult to reproduce (in some weak sense) an experiment. (Independently re-running code whose correctness is suspect obviously isn't as good as comparing against a fully independent reimplementation, but it still counts for something.)
If a mathematician tried to publish the conclusion of a proof but refused to show the derivation, they'd be laughed out of the room. Why should we hold software-based experiments to such a pitifully low standard by comparison?
It's not as if this is a minor problem. Software bugs really can result in incorrect figures being published. In the case of C and C++ code in particular, a seemingly minor issue can result in undefined behaviour, meaning the output of the program is entirely unconstrained, with no assurance that the output will resemble what the programmer expects. This isn't just theoretical. Bizarre behaviour really can happen on modern systems, when undefined behaviour is present.
A computer scientist once told me a story of some students he was supervising. The students had built some kind of physics simulation engine. They seemed pretty confident in its correctness, but in truth it hadn't been given any kind of proper testing, it merely looked about right to them. The supervisor had a suggestion: Rotate the simulated world by 19 degrees about the Y axis, run the simulation again, and compare the results. They did so. Their program showed totally different results. Oh dear.
Needless to say, not all scientific code can so easily be shown to be incorrect. All the more reason to subject it to peer review.
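For illustration, a hedged sketch of that kind of symmetry check, assuming a hypothetical simulate(positions, velocities, t_end) that takes and returns (N, 3) arrays:

```python
import numpy as np

def rot_y(deg):
    t = np.radians(deg)
    return np.array([[ np.cos(t), 0.0, np.sin(t)],
                     [ 0.0,       1.0, 0.0      ],
                     [-np.sin(t), 0.0, np.cos(t)]])

def test_rotation_invariance(simulate, pos, vel, t_end, deg=19.0):
    R = rot_y(deg)
    base = simulate(pos, vel, t_end)
    rotated = simulate(pos @ R.T, vel @ R.T, t_end)
    # Rotating the inputs should rotate the outputs by the same matrix,
    # up to floating-point tolerance; a large deviation points to a bug.
    np.testing.assert_allclose(rotated, base @ R.T, rtol=1e-6, atol=1e-9)
```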
> I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points.
Why would you care? Science is about advancing the frontier of knowledge, not about avoiding invalid criticism from online communities of unqualified fools.
I sincerely hope vaccine researchers don't make publication decisions based on this sort of fear.
How exactly is this a bad thing?
> I'm an accelerator physicist and I wouldn't want my code to end up on acceleratorskeptics.com with people that don't understand the material making low effort critiques of minor technical points. I'm here to turn out science, not production ready code.
But it should be noted that what you didn't say is that you're here to turn out accurate science.
This is the software equivalent of bad statistics. Imagine if someone took a random sampling of people at a Trump rally and then claimed that "98% of Americans are voting for Trump". And now imagine someone else points out that the sample is biased and therefore the conclusion is flawed, and the response was "Hey, I'm just here to do statistics".
---
Do you see the problem now? The poster above you pointed out that the conclusions of the software can't be trusted, not that the coding style was ugly. Most developers would be more than willing to say "the code is ugly, but it's accurate". What we don't want is to hear "the conclusions can't be trusted and 100 people have spent 10+ years working from those unreliable conclusions".
If you want to help me (and others who agree with me), please sign this petition: https://publiccode.eu. It demands that all publicly funded code must be public.
P.S. Yes, my 10-year-old code is working.
Well ... that part isn't nonsense, though I agree it shouldn't be a dealbreaker. And it means we should work towards making such support demands minimal or non-existent via easy containerization.
I note with frustration that even the Docker people, whose entire job is containerization, can get this part wrong. I remember when we containerized our startup's app c. 2015, to the point that you could run it locally just by installing Docker and running `docker-compose up`, and it still stopped working within a few weeks (which we discovered when onboarding new employees) and required a knowledgeable person to debug and rewrite.
(They changed the spec for docker-compose so that the new version you'd get when downloading Docker would interpret the yaml to mean something else.)
This hits close to home. Back in college, I developed software, for a lab, for a project-based class. I put the code up on GitHub under the GPL license (some code I used was licensed under GPL as well), and when the people from the lab found out, they lost their minds. A while later, they submitted a paper and the journal ended up demanding the code they used for analysis. Their solution? They copied and pasted pieces of my project they used for that paper and submitted it as their own work. Of course, they also completely ignored the license.
You are blaming scientists, but speaking from my personal experience as a computational scientist, this exists because there are few structures in place that incentivize strong programming practices:
* Funding agencies do not provide support for verification and validation of scientific software (typically)
* Few journals assess code reproducibility, and few require public code (few even require public data)
* There are few funded studies to reproduce major existing studies
Until these structural challenges are addressed, scientists will not have sufficient incentive to change their behavior.
> Scientific code needs to have tests, a minimal amount of test coverage, and code/data used really need to be published and run by volunteers/editors in the same way papers are reviewed, even for non-computer science journals.
I completely agree.
In my opinion, enforcing standards without addressing this root cause is not going to fix the problem. Worse, students and early-career researchers will bear the brunt of increased workload and code-compliance requirements from journals. Big, well-funded labs that can afford a research engineer position will have an edge over small labs that cannot.
After a paper has been accepted, authors can submit a repository containing a script which automatically replicates results shown in the paper. After a reviewer confirms that the results were indeed replicable, the paper gets a small badge next to its title.
While there could certainly be improvements, I think it's a step in the right direction.
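For concreteness, a hedged sketch of the kind of one-command entry point such a process might expect. Every file name and the analysis module below are placeholders, not anything from a real submission:

```python
"""reproduce.py: rerun the full analysis and regenerate every figure and table."""
from pathlib import Path

import analysis  # placeholder for the package holding the paper's actual computations

def main():
    out = Path("results")
    out.mkdir(exist_ok=True)
    data = analysis.load_raw_data(Path("data/raw"))   # archived inputs shipped with the repo
    results = analysis.run_all(data, seed=20200517)   # fixed, published seed
    analysis.write_tables(results, out / "tables")
    analysis.plot_figures(results, out / "figures")   # regenerates the figures in the paper

if __name__ == "__main__":
    main()
```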
All is well and good then, because journals absolutely don't care about science. They care about money and prestige. From personal experience, I'd say this intersects with the interests of most high-ranking academics. So the only unhappy people are idealistic youngsters and science "users".
Let's get back to non-profit journals.
In terms of tangible results, Princeton at least has created a dedicated team of software engineers as part of their research computing unit (https://researchcomputing.princeton.edu/software-engineering).
Realistically, though, even if the necessity of research software engineering were acknowledged at the institutional level at the bulk of universities, there would still be the problem of universities paying way below market rate for software engineering talent...
To some degree, universities alone cannot effect the change needed to establish a professional class of software engineers that collaborate with researchers. Funding agencies such as the NIH and NSF are also responsible, and need to lead in this regard.
https://mobile.twitter.com/id_aa_carmack/status/125819213475...
Why is there so much C++ code?
Skeptics could have a field day tearing apart the estimates for the large number of input parameters to models like that, but they choose not to? I don't get it.
Many years ago, a paper on the PageRank algorithm was written, and the code behind that paper was monetized to unprecedented levels. Should computer science journals also require working proof of concept code, even if that discourages companies from sharing their results; even if it prevents students from monetizing the fruits of their research?
Scientists, not programmers, should be the ones spear-heading the development of standards and rules of thumb.
Still, there are real problematic practices that an emphasis on sharing scientific code would discourage. A classic one is the use of a single script that you edit each time you want to re-parameterize a model. Unless you copy the script into the output, you lose the informational channel between your code and its output. This can have real consequences. Several years ago I started a project with a collaborator to follow up on their unpublished results from a year prior. Our first task was to take that data and reproduce the results they had obtained before, because the person no longer had access to the exact copy of the script that they ran. We eventually determined that the original result was due to a software error, which we did identify. My colleague took it well, but the motivation to continue the project was much diminished.
Why isn't there a common language that all other languages compile to, and that will be supported on all possible platforms, for the rest of time?
(Perhaps WASM could be such a language, but the point is that this would be merely coincidental and not a planned effort to preserve software.)
And why aren't package managers structured such that packages will live forever (e.g. in IPFS) regardless of whether the package management system is online? Why is Github still a single point of failure in many cases?
After they embark on solving problems, it becomes an eye-opening experience, and one that quickly turns into keeping things running.
For those who have a STEM discipline in addition to a software development background of more than five years: would you agree with the above?
I would have thought the scientists among us would approach someone with software development expertise (something abstract and requiring a different set of muscles).
One positive that is emerging is the variety of low/no-code tooling that can replace a lot of this hornet's-nest coding.
In any case you'd need to teach them the problem domain, and it's considered cheaper (and simpler from organizational perspective) to get some phd students or postdocs from your domain to spend half a year getting up to speed on coding (and they likely had a few courses in programming and statistics anyway) than to hire an experienced software developer and have them learn the basics of your domain (which may well take a third or half of the appropriate undergraduate bachelor's program).
Is there a pool of skilled software architects willing to provide consultations at well-below market wages? Or a Q&A forum full of people interested in giving this kind of advice? (StackOverflow isn't useful for this; the allowed question scope is too narrow.) I guess one incentive to publish one's code is to get it criticized on places like Hacker News. The best way to get the right answer on the internet is to post the wrong answer, after all.
However, when working as equals, scientists and engineers can create truly transformative projects. Algorithms account for 10% of the solution. The code, infrastructure, and system design account for 20% of the final result. The remaining 70% of the value comes directly from its impact. A project that nobody uses is a failure. Something that perfectly solves a problem that nobody cares about is useless.
>
> [0] https://lockdownsceptics.org/code-review-of-fergusons-model/
This does not look like a good example at all, as it appears that the blog author there just tries to discredit the program because he does not like the results. He also writes that all epidemiological research should be defunded.
If someone is trying to reproduce someone else's results, the data and methods are the only ingredients they need. If you add code into this mix, all you do is introduce new sources of bias.
(Ideally the results would be blinded too.)
Show me the grant announcements that identify reproducible long term code as a key deliverable, and I’ll show you 19 out of 20 scientists who start worrying about it.
Watching the density functional theory based molecular dynamics zip along at ~2 seconds per time step on my 2 year old laptop, versus the roughly 6k seconds per time step on an old Sun machine back in 1991. I remember the same code getting down to 60 seconds per time step on my desktop R8k machine in the late 90s.
What's been really awesome about that is the fact that I wrote some binary data files on big-endian machines in the early 90s and re-read them on the laptop (little-endian) by adding a single compiler switch.
Perl code that worked with big XML file input in the mid 2000s continues to work, though I've largely abandoned using XML for data interchange.
C code I wrote in the mid 90s compiled, albeit with errors that needed to be corrected. C++ code was less forgiving.
Over the past 4 months, I had to forward port a code from Boost 1.41 to Boost 1.65. Enough changes over 9 years (code was from 2011) that it presented a problem. So I had to follow the changes in the API and fix it.
I am quite thankful I've avoided the various fads in platforms and languages over the years. Keep inputs in simple textual format that can be trivially parsed.
I want to second the idea of just dumping your floating point data as binary. It's basically the CSV of HPC data. It doesn't require any libraries, which could break or change, and even if the endianness changes you can still read it decades later. I've been writing a computational fluid dynamics code recently and decided to only write binary output for those reasons. I'm not convinced of the long-term stability of other formats. I've seen colleagues struggle to read data in proprietary formats even a few years after creating it. Binary is just simple and avoids all of that. Anybody can read it if needed.
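A small sketch of that approach (file names invented): the one thing raw binary cannot tell you is its own dtype, byte order, and shape, so record those next to the data.

```python
import numpy as np

field = np.random.default_rng(0).random((64, 64, 3))   # stand-in for a CFD snapshot
field.astype("<f8").tofile("snapshot_0001.bin")        # force little-endian float64, C order

with open("snapshot_0001.meta", "w") as f:
    f.write("dtype=<f8 order=C shape=64,64,3\n")       # the part the bytes alone can't tell you

# Years later (or on a big-endian machine), the explicit dtype makes it unambiguous:
back = np.fromfile("snapshot_0001.bin", dtype="<f8").reshape(64, 64, 3)
assert np.array_equal(back, field)
```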
My C++ Qt GUI application for NMR spectrum analysis (https://github.com/rochus-keller/CARA) has been running for 20 years now, with continuing high download and citation rates.
So obviously C++/Qt and Fortran 77 are very well suited to standing the test of time.
Similar to other comments I don't mean to fault scientists for that - their job is not coding and some of the dependencies come from earlier papers or proprietary cluster setups and are therefore hard to avoid - but the situation is not good.
To me, that's like a theoretical physicist saying "My job is not to do mathematics" when asked for a derivation of a formula he put in the paper.
Or an experimental physicist saying "My job is not mechanical engineering" when asked for details of their lab equipment (almost all of which is typically custom built for the experiment).
IMO the incentive problem in science (basically number of papers and new results is what counts) also plays into this, as investing tons of time in your code gives you hardly any reward.
Theoretical Physicists (literal conversation I had):
>Yeah, this looked like it simplifies to 1-ish and Smart John said it's probably right.
Experimental physicists (another literal conversation):
>Yeah, we built it with duct tape, and there's hot glue holding the important bits that kept falling off. Don't put anything metal in that; we use it as a tea heater, but there's 1000A running through it, so it shoots spoons out when we turn the main machine on.
But not with the current mess of software frameworks. If I am to produce reproducible scientific code, I need an idiot-proof method of doing it. Yes, I can put in the 50-100 hours to learn how to do it [1], but guess what: in about 3-5 years a lot of that knowledge will be outdated. People keep comparing it with math, but the math proofs I produce will still be readable and understandable a century from now.
Regularly used scientific computing frameworks like MATLAB, R, the Python ecosystem, and Mathematica need a dumb, guided method of producing releasable and reproducible code. I want to go through a bunch of "next" buttons that help me fix the problems you indicate, and finally release a final version that has all the information necessary for someone else to reproduce the results.
[1] I have. I would put myself in the 90th percentile of physicists familiar with best practices for coding. I speak for the 50th percentile.
(1) Use a package manager that stores hash sums in a lock file.
(2) Install your dependencies from the lock file as the spec.
(3) Do not trust version numbers; trust hash sums. Do not believe in "But I set the version number!"
(4) Do not rely on downloads. Again, trust hash sums, not URLs.
(5) Hash sums!!!
(6) Wherever there is randomness, as with random number generators, use a seed. If the interface does not let you specify the seed, throw the thing away and use another generator. Be careful when concurrency is involved; it might destroy reproducibility. For example, this was the case with TensorFlow; not sure whether it still is.
(7) Use a version control system.
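A minimal sketch of the "trust hash sums, not URLs" point in plain Python, with a made-up file name and a truncated placeholder digest:

```python
import hashlib

def sha256(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# Pin every downloaded dependency or input file to a recorded digest and refuse to run otherwise.
EXPECTED = {
    "inputs/population_grid.csv": "9f2c0f7f0e3b...",  # placeholder, not a real checksum
}

for path, digest in EXPECTED.items():
    assert sha256(path) == digest, f"{path} does not match its recorded hash"
```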
Of course, I have no idea about the paper you're talking about and just want to say that reproducibility isn't dependent on releasing code. There could even be a case where it's better if someone reproduces a result without having been biased by someone else's code.
(Of course, not all scientific code is discardable; large quantities of reusable code are reused every day. We have many frameworks, and the code quality of those is completely different.)
But it often is. For most non-CS papers (mostly biosciences) I've read, there are specific authors whose contribution to a large degree was mainly "coding".
The idea is that you have learned something about how the universe works. Which means that the details of your experiment should not change what you find... assuming it's a true finding.
Concerns about software quality in science are primarily about avoiding experimental error at the time of publication, not the durability of the results. If you did the experiment correctly, it doesn't matter if your code can run 10 years later. Someone else can run their own experiment, write their own code, and find the same thing you did.
And if you did the experiment incorrectly, it also doesn't matter if you can run your code 10 years later; running wrong code a decade later does not tell you what the right answer is. Again--conducting new research to explore the same phenomenon would be better.
When it comes to hardware, we get this. Could you pick up a PCR machine that's been sitting in a basement for 10 years and get it running to confirm a finding from a decade ago? The real question is, why would you bother? There are plenty of new PCR machines available today, that work even better.
And it's the same for custom hardware. We use all sorts of different telescopes to look at Jupiter. Unless the telescope is broken, it looks the same in all of them. Software is also a tool for scientific observation and experimentation. Like a telescope, the thing that really matters is whether it gives a clear view of nature at the time we look through it.
One of the unsung and wonderful properties of reproducible workflows is the fact that it can allow science to be salvaged from an analysis that contains an error. If I had made an error in my thesis data analysis (and I did, pre-graduation), the error can be corrected and the analysis re-run. This works even if the authors are dead (which I am not :) ).
Reproducibility abstracts the analysis from data in a rigorous (and hopefully in the future, sustainable) fashion.
That is something no one outside of high school cares about. The idea that you can show your work in general is ridiculous. Do I need to write a few hundred pages of set theory to start using addition in a physics paper? No. The work you need to show is the work a specialist in the field would find new, which is completely different from what a layman would find new.
Every large lab, the ones that can actually reproduce results, has decades of specialist code that does not interface with anything outside the lab. Providing the source code is then about as useful as handing over a binary printout of an executable for an OS you've never seen before.
It can tell, however, exactly where the error lies (if the error is in software at all). Like a math teacher that can circle where the student made a mistake in an exam.
However, reproducibility is a precondition to automation, and automation is a real nice thing to have.
IMO there is a big fallacy in the "just get it to work" approach. Most serious scientific code, i.e. code supporting months to years of research, is used and modified a lot. It's also not really one-off; it's a core part of a dissertation or research program, and if it fails, you do. I'd argue (and I found) that using unit tests, a deployment strategy, etc. ultimately allowed me to do more, and better, science, because in the long run I didn't spend as much time figuring out why my code didn't run when I tweaked stuff. This is really liberating stuff. I suspect this is all obvious to those who have gone down that path.
Frankly, every reasonably tricky problem benefits from unit-tests as well for another reason. Don't know how to code it, but know the answer? Assert lots of stuff, not just one at a time red-green style. Then code, and see what happens. So powerful for scientific approaches.
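A hedged sketch of that style, with a hypothetical estimate_msd(n_steps, n_particles, seed) standing in for whatever you are actually computing (here, a toy mean-squared-displacement curve for a diffusion run with D = 1 and unit time step):

```python
import numpy as np

def test_known_properties(estimate_msd):
    # Pin down everything you already know about the answer before writing the clever version.
    msd = estimate_msd(n_steps=1000, n_particles=10_000, seed=7)
    assert msd.shape == (1000,)            # right size
    assert np.all(np.isfinite(msd))        # no NaNs or infs sneaking in
    assert np.all(msd >= 0.0)              # squared displacements cannot be negative
    assert msd[0] < msd[-1]                # spread grows with time
    assert 1000.0 < msd[-1] < 4000.0       # loose band around the analytic 2*D*t = 2000
```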
https://smw.ch/article/doi/smw.2020.20336
It was 2000, so I wrote a cgi-bin in Python (2?) with a MySQL backend. The menu was stored in MySQL, as were the orders. I occasionally check back to see if it's still running, and it is: a few code changes to port to Python 3, a data update since they changed vendors, and a MySQL update or two as well.
It's not much but at least it was honest work.
Often it is a mixture of different (and evolving) versions of different scripts and programs, with manual steps in between. Often one starts the calculation with one version of the code, identifies edge cases where it is slow or inaccurate, develops it further while the calculations are running, does the next step (or re-does a previous one) with the new version, possibly modifying intermediate results manually to fit the structure of the new code, and so on -- the process is interactive, and not trivially repeatable.
So the set of code one has at the end is not the code the results were obtained with: it is just the code with the latest edge case fixed. Is it able to reproduce the parts of the results that were obtained before it was written? One hopes so, but given that advanced research may take months of computer time and machines with high memory/disk/CPU/GPU/network speed requirements only available in a given lab -- it is not at all easy to verify.
The kind of interaction you're describing should be frowned upon. It requires the audience to trust that the manual data edits are no different from rerunning the analysis. But the researcher should just rerun the analysis.
Also, mixing old and new results is a common problem in manually updated papers. It can be avoided by using reproducible research tools like R Markdown.
Reproducibility isn't usually about having a button to press that magically gives you the researchers' results. It's also not always a set of perfect instructions. More often it is documentation of what happened and what was observed, insofar as the researcher believes it is important to the understanding of the research questions. Sometimes we don't know what's important to document, so we try to document as much as possible. This isn't always practical, and sometimes it is obviously unnecessary.