tgamblin · 10 years ago
Another place this use case comes up is supercomputing. Not because of the unusual levels of radioactivity, but because of unusual numbers of processors. When you have more components, a single bit flip somewhere is increasingly likely. Resilience has been a research area in HPC for a while now, and people have looked at fault tolerant algorithms, redundancy schemes, faster checkpointing, and other ways to make sure your HPC application running on a million cores won't die because one core gets a bit flip.

So far this has luckily remained a research area in practice, because the vendors tend to do a good job of hardening their machines against errors as they get larger (most of the gloom and doom predictions take the error rates of current hardware and extrapolate). It will be interesting to see if it remains that way.

Relevant article: http://superfri.org/superfri/article/view/14

ams6110 · 10 years ago
Same reason RAID-5 doesn't work for large disks. If you have a dozen 3TB drives in an array and one dies, the probability that you'll get a read error on another drive during the rebuild is quite high.

Given enough bits, random errors that individually are almost impossible become almost a certainty.

brokenmachine · 10 years ago
What level of RAID would you use for an array involving a dozen 3TB drives?

If the drives can't be trusted, then you'd need some kind of error correction in software, I'd presume. I thought the drives should auto-correct any read errors.

I'm not talking about in a radioactive environment, just normal usage.

benjaminRRR · 10 years ago
IEEE spectrum had a nice article very recently http://spectrum.ieee.org/computing/hardware/how-to-kill-a-su...
exabrial · 10 years ago
I remember learning in my embedded programming class about state machines and how certain states cannot be reached.

In the lab, I accounted for every possible valid state with a transition out to another state.

I left my microcontroller running when I went to get lunch, came back and it was frozen. I did a dump, and discovered it was idling in a state that has no entry into it, and because I had no exit, it would just sit there.

I 'fixed' the problem by adding a transition to reset the controller from every state that shouldn't be reached.

peterwwillis · 10 years ago
This is considered a minor security feature. When a program unexpectedly finds itself in a routine, or a routine ends abruptly, it may be the result of a forced code jump, or the after-effects of an exploit payload... neither of which is a good thing.
TazeTSchnitzel · 10 years ago
If you use a switch(), have a default:, even if it's seemingly impossible.
StringyBob · 10 years ago
Yes - in hardware where faults are expected to be rare it's often a good idea to just detect that something went wrong. If you know something bad happened you can choose to re-run. (E.g. restart using a watchdog timer)

Silently jumping back to a default state is sometimes useful, but could result in unpredictability or perhaps silent data corruption for some critical applications.

Normal_gaussian · 10 years ago
So the use case here is particularly interesting.

Most of the answers are talking about spaceflight, because it is one of the few radioactive environments with a particle distribution that can be effectively fought.

Most earthbound environments have particle distributions that are practically unsolvable in software. They should instead be addressed with mechanical, noise-tolerant analogue methods of sensing and actuation, combined with heavy shielding of your computation (normally by controlling from a different building).

*Note. By particle distribution I am referring to both rate and charge

phkahler · 10 years ago
>> Most of the answers are talking about spaceflight...

Another place to look is in safety critical systems: IEC 61508 in general and its automotive variant ISO 26262. There are several dual-core processors available now that run two copies of the code in lockstep and check for any divergence. That doesn't tell you which core failed, but it does catch the error. There are methods defined for building fully fault tolerant systems too.

We've been doing this shit for years - your electric steering system, brakes, even throttle control (with certain exceptions of course ;-)

The funny thing in cars is that most of those systems have mechanical backups or consider "shut down" an undesirable but reasonable failure (vs steering left or dumping brake fluid). Your car can coast to the side of the road under driver control without any power. All this self driving stuff will require fully redundant components for some of these systems (including LIDAR or cameras) to really be safe.

explanibrag · 10 years ago
I once read that NASA control engineers have three independent teams code up three versions of their guidance systems. If the systems disagree, they go with the majority vote.
dkopi · 10 years ago
These types of posts are what bring me to Hacker News.
iliis · 10 years ago
Another interesting case apart from spaceflight where radiation hardness is important: Particle accelerators.

CERN for example has a lot of material, a quick search leads to e.g. https://lhcb-elec.web.cern.ch/lhcb-elec/html/rad_hard_links.... or the Radiation to Electronics workgroup https://r2e.web.cern.ch/R2E/

derekp7 · 10 years ago
I'm reminded of an old game called Core War, where the object is to write code that corrupts your opponent's code while simultaneously protecting your own code. I wonder if people who are good at that game would be the best candidates for this type of fault tolerant programming?
dkopi · 10 years ago
Israel has had its own version called "codeguru extreme" that's been going on for the last 11 years. The rules are significantly different, but the idea and inspiration is from core wars. (Sorry, hebrew only: http://www.codeguru.co.il/xtreme/about.htm)

It's sponsored by IBM, a large Israeli college, a technological high-school chain, and several Israeli military technology units (Air Force, Communications, Cyber). So yeah, that's probably a really good indicator you might be on to something.

jacobsladder · 10 years ago
Modifying the source code to prevent this is not good design. C/C++ is not designed for this use case, so trying to work around it will create a mess, much like the problems premature optimization creates in code. Instead the solution should be at the hardware level, as in airplanes. Put three computers in the radiation environment, then add a fourth computer that analyzes the results from the three and acts only when 3 out of 3 or 2 out of 3 agree. Alternatively an average or median can be applied, depending on the task. Ideally the fourth computer would sit outside the radioactive environment; even if that's not possible, the scheme still simplifies the problem, because only the final computer that collects the results and judges their trustworthiness needs to be bug-free, not the rest of the code.
mturmon · 10 years ago
Not so fast: it's possible to have library-level fault tolerance, or OS-level fault tolerance, that is implemented in software, without having to change the source code much or at all.

My perspective on this is informed by work on ABFT (for more on that, see https://www.computer.org/csdl/trans/tc/1984/06/01676475.pdf, and the literally 1000 later papers citing it) -- you can design a version of the basic linear algebra subroutines that have fault-tolerance built in, and use them without changing program source code.

For codes that spend a lot of time doing numerical computations (e.g., preliminary data reduction in a spacecraft on-board computer), ABFT is an interesting option.

jacobsladder · 10 years ago
To clarify, each piece of the puzzle must do what it is expected to do; otherwise the result is hacky. A C++ program is not expected to detect its underlying hardware's problems, and there are no tools for that, so any such solution will look and feel like a hack, a temporary workaround. The computer as a whole, however, is expected to occasionally produce buggy results; that's normal, and many existing design solutions take it into account and work around it. That's why you take three computers and judge their output: each piece of the puzzle then does what it is intended and expected to do, and nothing outside its responsibility.
jacobsladder · 10 years ago
Sorry if this looks like flooding, but to add to that: having the program self-check for radiation faults also unnecessarily ties it to its use case in a radiation environment. It creates an unnecessarily strong dependency on its execution environment, and that goes against all the good principles of design. It's like having the clock software in a microwave oven be aware that it is placed inside a microwave oven.
joelthelion · 10 years ago
If you want to address the problem at the software level, you could probably write a compiler for this. But that would be a massive undertaking...
zzzcpan · 10 years ago
Maybe not as massive as it seems. You could compile your code into llvm IR, work on that and compile to native code with llvm.
beagle3 · 10 years ago
As far as I can see, no one has mentioned fluidic[0] computation so far. I am not aware of a fluidic processor that can run compiled C++ code, but if you want hardware that can withstand 1000 degrees Celsius, there's pretty much no alternative that I'm aware of.

[0] https://en.wikipedia.org/wiki/Fluidics