Ooof. The core collections in Java are well understood to not be thread-safe by design, and this should have been noticed.
OP should go through the rest of the code and see if there are other places where collections are potentially operated by multiple threads.
>The easiest way to fix this was to wrap the TreeMap with Collections.synchronizedMap or switch to ConcurrentHashMap and sort on demand.
That will make the individual map operations thread-safe, but given that nobody thought of concurrency, are you sure series of operations are thread-safe? That is, are you sure the object that owns the tree-map is thread-safe? I wouldn't bet on it.
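To make that concrete, a minimal sketch (illustrative names, not the article's code): even with Collections.synchronizedMap, anything that spans more than one call, notably iteration, still needs an explicit lock on the wrapper, as the synchronizedMap documentation itself requires.

    import java.util.Collections;
    import java.util.Map;
    import java.util.TreeMap;

    class SynchronizedTreeMapOwner {
        // Each individual call on this wrapper is thread-safe on its own...
        private final Map<String, Integer> byName =
                Collections.synchronizedMap(new TreeMap<>());

        void put(String name, int value) {
            byName.put(name, value);
        }

        // ...but anything spanning several calls is not. Iterating the view
        // collections requires manually locking the wrapper (per its Javadoc),
        // and check-then-act sequences need the same treatment.
        int sum() {
            synchronized (byName) {
                int total = 0;
                for (int v : byName.values()) {
                    total += v;
                }
                return total;
            }
        }
    }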
>Controversial Fix: Track visited nodes
Don't do that! The collection will still not be thread-safe and you'll just fail in some other more subtle way .. either now, or in the future (if/when the implementation changes in the next Java release).
>Sometimes, a detail oriented developer will notice the combination of threads and TreeMap, or even suggest to not use a TreeMap if ordered elements are not needed. Unfortunately, that didn’t happen in this case.
That's not a good take-away! OP, the problem is you violating the contract of the collection, which is clear that it isn't thread-safe. The problem ISN'T the side-effect of what happens when you violate the contract. If you change TreeMap to HashMap, it's still wrong (!!!), even though you may not get the side-effect of a high CPU utilization.
---------
When working with code that is operated on by multiple threads, the only surefire strategy I found was to make every possible object immutable and to limit any object that could not be made immutable to small, self-contained and highly controlled sections. We rewrote one of the core modules following these principles, and it went from being a constant source of issues to one of the most resilient sections of our codebase. Having these guidelines in place also made code reviews much easier.
> That will make the individual map operations thread-safe, but given that nobody thought of concurrency, are you sure series of operations are thread-safe? That is, are you sure the object that owns the tree-map is thread-safe? I wouldn't bet on it.
This is a very important point! And, in my experience, a common mistake by programmers who aren't great at concurrency.
Let's say you have a concurrent dynamic array. The array class is designed to be thread-safe for the explicit purpose that you want to share it between threads. You want to access element 10, but you want to be sure that you're not out of bounds. So you do this:
    if (array.size() > 10) {
        array.element(10).doSomething();
    }
It doesn't matter how thread-safe this array class is: these two operations combined are NOT thread-safe, because thread-safety of the class only means that `.size()` and `.element()` are on their own not going to cause any races. But it's entirely possible another thread removes elements in between you checking the size and accessing the element, at which point you'll (at best) get an out-of-bounds crash.
The way to fix it is to either use atomic methods on the class which do both (something like `.element_or_null()` or whatever), or to not bother with a concurrent dynamic array at all and instead just use a regular one that you guard with a mutex (so the mutex is held during both operations and whatever other operations other threads perform on the array).
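In Java terms, a rough sketch of that trap and of the two fixes (the class, field and method names here are made up for illustration):

    import java.util.ArrayList;
    import java.util.List;

    class ElementAccess {
        private final List<Runnable> tasks = new ArrayList<>();

        // Broken even if 'tasks' were a thread-safe list: another thread can
        // shrink it between the size() check and the get().
        void runTenthUnsafe() {
            if (tasks.size() > 10) {
                tasks.get(10).run();
            }
        }

        // Fix 1: hold one lock across both operations (every other reader and
        // writer of 'tasks' must use the same lock).
        synchronized void runTenthLocked() {
            if (tasks.size() > 10) {
                tasks.get(10).run();
            }
        }

        // Fix 2: expose a single "atomic" accessor that does the bounds check
        // itself, so callers never split the check from the access.
        synchronized Runnable tenthOrNull() {
            return tasks.size() > 10 ? tasks.get(10) : null;
        }
    }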
A similar pattern (and overlooked very often) is with DB updates where you want to increment a number by 1, but split it into 2 roundtrips to the DB to get the existing value and set it to `value + 1`.
Concurrent transactions can screw it up, and you wouldn't notice until an escalation happens.
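For example, a sketch against a hypothetical counters table (plain JDBC, names made up): the two-round-trip version can lose increments when transactions interleave, while a single UPDATE lets the database apply the increment atomically.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    class CounterDao {
        // Racy: two transactions can both read 41 and both write 42,
        // silently losing one of the increments.
        void incrementWithTwoRoundTrips(Connection conn, long id) throws SQLException {
            int current;
            try (PreparedStatement read =
                         conn.prepareStatement("SELECT value FROM counters WHERE id = ?")) {
                read.setLong(1, id);
                try (ResultSet rs = read.executeQuery()) {
                    if (!rs.next()) {
                        throw new SQLException("no counter with id " + id);
                    }
                    current = rs.getInt(1);
                }
            }
            try (PreparedStatement write =
                         conn.prepareStatement("UPDATE counters SET value = ? WHERE id = ?")) {
                write.setInt(1, current + 1);
                write.setLong(2, id);
                write.executeUpdate();
            }
        }

        // Atomic: the database applies the increment itself in one statement.
        void incrementInOneStatement(Connection conn, long id) throws SQLException {
            try (PreparedStatement update =
                         conn.prepareStatement("UPDATE counters SET value = value + 1 WHERE id = ?")) {
                update.setLong(1, id);
                update.executeUpdate();
            }
        }
    }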
This kind of stuff is exactly why the original "thread-safe" collections in Java were deprecated in favor of the current ones: locking on every operation means that a lot of trivial code "just works", but then things break without warning once you need consistency between consecutive operations. And, at the same time, everybody is paying the locking penalty, even if they don't actually do concurrent access, or if they do their own locking.
Ironically, .NET repeated the same mistake with the same outcome - it's just less obvious because for .NET the switch from synchronized-by-default to unsync-by-default happened when generics were introduced, so type safety of the new collections was the part that devs generally paid the most attention to.
This is basically what I thought: "Wait a second, is that data structure even thread-safe??" and then "Seems like you are just using the wrong data structure for your purpose.". Shoehorning with mutexes will usually lead to problems elsewhere and bottlenecks.
What you suggest, to make objects immutable: yes, and then one actually uses different data structures, ones made to be used as immutable data structures, making use of structural sharing to avoid or limit an explosion in memory usage. So again it's a case of "use the right data structures for the job". Using purely functional data structures makes one side of the problem simply disappear. The other side might not even come up in many cases: what if the threads depend on other versions of the data structure, which the other threads create (i.e. they need intermediate results of other threads)? Then one has a real headache coming one's way, and probably again needs other data structures.
One additional idea I had, if one really wants to shoehorn it with the already-used mutable data structure, is that one could write something that serializes the access attempts before they hit the data structure. That could also include grouping accesses into transactions and only running complete transactions. Then it makes me think whether it is close to implementing a database ...
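A rough sketch of that idea (hypothetical names, and only one of several ways to do it): a single-threaded executor owns the map, and every access, including multi-step "transactions", is funneled through it one at a time.

    import java.util.TreeMap;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.function.Function;

    class SerializedTreeMap<K extends Comparable<K>, V> {
        // Only the single executor thread ever touches this map.
        private final TreeMap<K, V> map = new TreeMap<>();
        private final ExecutorService owner = Executors.newSingleThreadExecutor();

        // Each caller-supplied "transaction" runs to completion before the next
        // one starts. Callers must not let the map reference escape the lambda.
        <R> CompletableFuture<R> transact(Function<TreeMap<K, V>, R> work) {
            return CompletableFuture.supplyAsync(() -> work.apply(map), owner);
        }

        void shutdown() {
            owner.shutdown();
        }
    }

Callers would then write something like serialized.transact(m -> m.put(key, value)) and keep any multi-step read-modify-write inside one lambda; at that point it does start to look like a tiny single-writer database.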
So your solution to the insanity that is the threading model was to reinvent the process model in it...
I don't intend to be mean, I just think the idea of independent execution in a shared memory environment is fundamentally flawed. And the effort in trying to get such follies working should have been put into better process models.
>So your solution to the insanity that is the threading model was to reinvent the process model in it...
Where did you get that?
Immutability does not mean replicating the 'process model'. It means removing (limiting) an entire class of bugs from your multithreaded code. There will still be shared areas - and those should be controlled.
    public void someFunction(SomeType relatedObject,
                             List<SomeOtherType> unrelatedObjects) {
        ...
        treeMap.put(relatedObject.a(), relatedObject.b());
        ...
        // unrelatedObjects is used later on in the function so the
        // parameter cannot be removed
    }
That’s not true. The original code only does the treeMap.put if unrelatedObjects is nonempty. That may or may not be a bug.
You also would have to check that a and b return the same value every time, and that treeMap behaves like a map. If, for example, it logs updates, you’d have to check that changing that to log only once is acceptable.
Whether it occurs or not can depend on the specific data being processed, and the order in which it is being processed. So this can happen in production after seemingly working fine for a long time.
I recently wrote a blog post which (while not the main focus) includes a discussion on randomized testing of sort algorithms in the face of inconsistent comparators: https://sunshowers.io/posts/monads-through-pbt/ section 2.
I have seen a lot of incorrect Comparators and Comparable implementations in existing code, but haven’t personally come across the infinite-loop situation yet.
To give one example, a common error is to compare two int values via subtraction, which can give incorrect results due to overflow and modulo semantics, instead of using Integer::compare (or some equivalent of its implementation).
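A minimal illustration of that overflow (standalone demo, not from any particular codebase):

    import java.util.Comparator;

    class SubtractionComparatorDemo {
        public static void main(String[] args) {
            // Looks reasonable, but a - b overflows: Integer.MIN_VALUE - 1 wraps
            // around to Integer.MAX_VALUE, so the smaller value compares as larger.
            Comparator<Integer> bySubtraction = (a, b) -> a - b;
            Comparator<Integer> correct = Integer::compare;

            System.out.println(bySubtraction.compare(Integer.MIN_VALUE, 1)); // 2147483647 (wrong sign)
            System.out.println(correct.compare(Integer.MIN_VALUE, 1));       // -1 (correct)
        }
    }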
> I always thought of race conditions as corrupting the data or deadlocking. I never thought it could cause performance issues. But it makes sense, you could corrupt the data in a way that creates an infinite loop.
Food for thought. I often think to myself that any error or strange behavior or even warnings in a project should be fixed as a matter of principle, as they could cause seemingly unrelated problems. Rarely is this accepted by whoever chooses what we should work on.
It's a decent rule of thumb, but it definitely needs some pragmatism. Squashing any error, strangeness and warning can be very expensive in some projects, much more than paying the occasional seemingly-unrelated problem.
But of course it's quasi-impossible to know in advance the likelihood of a given error participating in a future problem, and whether it's cheaper to fix this error ahead or let the problem happen. So it becomes an art more than a science to decide what to focus on.
"fix nothing" is certainly a horrible approach, "fix everything" is often impractical. So you either need some sort of decision framework, or a combination of decent "instinct" (experience, really) and trust from your stakeholder (which comes from many places, including good communication and track record of being pragmatic over dogmatic)
> Squashing any error, strangeness and warning can be very expensive in some projects
Strongly disagreed. Strange, unexpected behaviour of code is a warning sign that you have fallen short in defensive programming and you no longer have a mental model of your code that corresponds with reality. That is a very dangerous place to be in. You can very quickly find yourself stuck in quicksand not too far afterwards.
I remember a project to get us down from like 1500 build warnings to sub 100. It took a long time, generated plenty of bikeshedding, and was impossible to demonstrate value.
I, personally, was mostly just pissed we didn't get it to zero. Unsurprisingly the number has climbed back up since
Fixing everything is impractical, but I'd say a safer rule of thumb would be to at least understand small strangenesses/errors. In the case of things that are hard to fix - e.g. design/architectural decisions that lead to certain performance issues or what have you - it's still usually not too time consuming to get a basic understanding of why something is happening.
Still better to quash small bugs and errors where possible, but at least if you know why they happen, you can prevent unforeseen issues.
At my job we treat all warnings as errors and you can't merge your pull requests unless all automatically triggered CI pipelines pass. It requires discipline, but once you get it into that state it's a lot easier to keep it there.
It then creates immense value by avoiding a lot of risk and uncertainty for little effort.
Getting from "thousands of warnings" to zero isn't a good ROI in many cases, certainly not on a shortish term. But staying at zero is nearly free.
This is even more so with those "fifteen flickering tests", the 23 tests that have been failing and been ignored or skipped for years.
It's also why I commonly set up a CI, testing systems, linters, continuous deployment before anything else. I'll most often have an entire CI and guidelines and build automation to deploy something that will only say "hello world".
Because it's much easier to keep it green, clean and automated than to move there later on
> At my job we treat all warnings as errors and you can't merge your pull requests unless all automatically triggered CI pipelines pass. It requires discipline, but once you get it into that state it's a lot easier to keep it there.
Sounds like what we used to call "professionalism." That was before "move fast, break things and blame the user" became the norm.
Pretty sure what you actually mean is that you treat some warnings as errors, and disable the others. I would find it hard to believe you're building with literally all the compiler warnings enabled all the time.
The other thing is: don't catch and ignore exceptions. Even "catch and log" is a bad idea unless you specifically know that program execution is safe to continue. Just let the exception propagate up to where something useful can be done, like returning a 500 or displaying an error dialog.
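A sketch of the difference, with made-up method names (the point is where the failure surfaces, not the business logic):

    class OrderHandler {
        // Anti-pattern: "catch and log" lets execution continue past a failed
        // step, so the charge below runs even though nothing was reserved.
        void processSwallowing(String orderId) {
            try {
                reserveInventory(orderId);
            } catch (RuntimeException e) {
                System.err.println("reservation failed: " + e); // logged, then ignored
            }
            chargeCustomer(orderId);
        }

        // Better: let it propagate to a boundary that can do something useful,
        // e.g. return a 500, show an error dialog, or fail and retry the job.
        void process(String orderId) {
            reserveInventory(orderId);
            chargeCustomer(orderId);
        }

        private void reserveInventory(String orderId) { /* may throw */ }

        private void chargeCustomer(String orderId) { /* must not run after a failed reservation */ }
    }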
I agree, and I think the OP does as well. Really, the Executor framework used here was to blame by not setting the uncaught exception handler to catch and propagate those exceptions by default.
But having shared, mutable state without locks or other protection should never have passed code review in the first place. I would look at that commit to try to determine how it happened, then try to change the team's processes to prevent that.
Yes, but…I suppose you have to pick your battles. There was recently a problem vexing me about a Rails project I maintain where the logs were filled with complaints about “unsupported parameters”, even though we painstakingly went through all the controllers and allowed them. It’s /probably/ benign, but it adds a lot of noise to the logs. Several of us took a stab at resolving it, but in the end we always had much higher priorities to address. Also it’s hard to justify spending so many hours on something that has little “business value”, especially when there is currently no functional impact.
It’s a nuisance issue sorta like hemorrhoids. Do you get the surgery and suffer weeks of extreme pain, or do you just deal with it? Hemorrhoids are mostly benign, but certainly have the potential to become more severe and very problematic. Maybe this phenomenon should be called digital hemorrhoids?
As someone with pretty bad hemorrhoids, I’m hesitant to ask my doctor about surgery because I’ve been told the hemorrhoids will come back, without question. So it’s even still just a temporary fix…
> > race conditions […] I never thought it could cause performance issues. But it makes sense, you could corrupt the data in a way that creates an infinite loop.
Even without corruption a race condition can cause significant performance issues by causing the same work to be done many times with only one result being kept.
> Food for thought. I often think to myself that any error or strange behavior or even warnings in a project should be fixed as a matter of principle
For warnings: at least explain in a comment where it has been decided the warning is irrelevant (preferably with a pragma to turn off the warning as locally as possible).
Strange behaviour I prefer to get rid of. I've found code marked (by me at least once!) "not sure why this works, but" which very much no longer works, and it had to be rewritten in a rush, where if it had been addressed earlier there would have been more time to be careful.
> Rarely is this accepted by whoever chooses what we should work on.
You need to find more disciplined teams.
There are still people out there who care about correctness and understand how to achieve it without it being an expensive distraction. It's a team culture factor that mostly just involves staying on top of these concerns as soon as they're encountered, so there's not some insurmountable and inscrutable backlog that makes it feel daunting and hopeless or that makes prioritization difficult.
Most teams are less disciplined than they should be. Also, job/team mobility is very low right now. So the question becomes, how do you increase discipline on the team you're on?
> Food for thought. I often think to myself that any error or strange behavior or even warnings in a project should be fixed as a matter of principle, as they could cause seemingly unrelated problems. Rarely is this accepted by whoever chooses what we should work on.
I agree. I hate lingering warnings. Unfortunately, at the time of this bug I did not have static analysis tools to detect these code smells.
> Rarely is this accepted by whoever chooses what we should work on.
I get that YMMV based on the org, but I find that more often than not, it’s expected that you are properly maintaining the software that you build, including the things like fixing warnings as you notice them. I can already feel the pushback coming, which is “no but really, we are NOT allowed to do ANYTHING but feature work!!!!” and… okay, I’m sorry, that really sucks… but maybe, MAYBE, some of that is you being defeatist, which is only adding to this culture problem. Oh well, now’s not the time to get fired for something silly like going against the status quo, so… I get it either way.
And from a security perspective, the "might cause a problem 0.000001% of the time" flaws can often be manipulated into becoming a problem 100% of the time.
>Rarely is this accepted by whoever chooses what we should work on.
This is the kind of thing you should probably just do, not seek buy-in for from whoever chooses what you should work on. Unless it is going to take some extreme amount of time.
The mention of "could barely ssh in" reminds me of a situation in grad school where our group had a Sun UltraSparc 170 (IIRC) with 1GB HD and 128 or 256 MB of RAM, shared by maybe 8 people in a small research group relating to parallel and distributed processing. Keep in mind, Sun machines were rarely rebooted, ever.
So I guess the new user / student was trying to do things in parallel to speed things up when they chopped up their large text file into N (32 or 64) sections based on line number (not separate files), and then ran N copies of perl in parallel, each processing its own set of lines from that one file.
Not only did you have a large amount (for back then) of RAM used by N copies of the perl interpreter (separate processes, not threads, mind you!) each processing its own data; on top of that, any attempt to swap was interleaved with frantic seeking to a different section of the same file to read a few more lines for one of the N processes stalled on IO. Also, the Jth process probably had to read J/N of the entire file to get to its section. So the first section of the file was read N times, the next N-1, then N-2, etc.
We (me and the fellow-student-trusted-as-sysadmin who had the root password) couldn't even get a login prompt on the console. Luckily, I had a logged-in session (ssh from an actual x-terminal - a large-screen "dumb" terminal), and "su" prompted for a password after 20-30 minutes of running it. After another 5-10 minutes, we had a root session and were able to run top and figure out what was going on. Killing the offending processes (after trying to contact the user) restored the system back to normal.
Edit: forgot to say: had the right idea, but totally didn't understand the system's limitations. It was SEVERELY I/O limited with that hard drive and relatively low RAM, so just processing the data linearly would have easily been the best approach unless the amount of data to be kept would have gotten too large.
Yep, ran into this way too many times. Performing concurrent operations on non-thread-safe objects in Java, or generally in any language, produces the most interesting bugs in the world.
Which is why you manage atomic access to non-thread-safe objects yourself, or use a thread-safe version of them when using them across threads.
Multithreading errors are the worst to debug. In this case it's dead simple to identify at design time and warning flags should have gone up as soon as he started thinking about using any of the normal containers in a multithreaded environment.
Every time I think I'm sorta getting somewhere in my understanding of how to write code I see a comment like this that reminds me that the rabbithole is functionally infinite in both breadth and depth.
There's simply no straightforward default approach that won't have you running into and thinking through the most esoteric sounding problems. I guess that's half the fun!
The usual issue is code evolution over time, not the initial version which tends to be okay. You really want to have tooling strictly enforce invariants, and do so in a way that fails closed rather than open.
I ran into my share of concurrency bugs, but one thing I could never intentionally trigger was any kind of inconsistency stemming from removing a "volatile" modifier from a mutable field in Java. Maybe the JVM I tried this with was just too awesome.
I've universally found that even when I am convinced that I am OK with the consequences of sharing something that isn't synchronized, the actual outcome is something I wasn't expecting.
The only things that should be shared without synchronization are readonly objects where the initialization is somehow externally serialized with accessors, and atomic scalars -- C++ std::atomic, Java has something similar, etc.
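In Java terms, a small sketch of those two categories (illustrative names): an atomic scalar, and an immutable object whose fields are all final and which is safely published.

    import java.util.concurrent.atomic.AtomicLong;

    class RequestStats {
        // Atomic scalar: safe to share and update without external locking.
        private final AtomicLong requestCount = new AtomicLong();

        // Immutable object: all fields final, no mutation after construction,
        // so it can be read from any thread once safely published.
        static final class Config {
            final int maxRetries;
            final long timeoutMillis;

            Config(int maxRetries, long timeoutMillis) {
                this.maxRetries = maxRetries;
                this.timeoutMillis = timeoutMillis;
            }
        }

        void onRequest() {
            requestCount.incrementAndGet();
        }

        long total() {
            return requestCount.get();
        }
    }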
This is kind of a hot take but I actually prefer debugging races in C/C++ for this reason. Yes, the language prescribes insane semantics (basically none) when it happens, but in practice you’ll get memory corruption or other noisy issues pretty often, and the fact that races are mostly illegal means you can write something like thread sanitizer without needing source code changes to indicate semantics. Meanwhile in Java you’ll never have UB but often you’ll have two fields be subtly out of sync and it’s a lot harder to track this kind of thing down.
Some (maybe most?) operations on Java collections perform integrity checks to warn about such issues, for example a map throwing ConcurrentModificationException.
ConcurrentModificationException does not check threads; it triggers when it is already too late. It also triggers on the same thread if you remove elements from a collection while iterating over it.
ConcurrentModificationException is typically thrown from an iterator when it detects that it’s been invalidated by a modification to the underlying collection. It’s harder to check for the case described in this article, which is about multiple threads calling put() concurrently on a non thread safe object.
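A minimal single-threaded demonstration of that, using an ArrayList for brevity (maps behave the same way through their view collections):

    import java.util.ArrayList;
    import java.util.ConcurrentModificationException;
    import java.util.Iterator;
    import java.util.List;

    class CmeDemo {
        public static void main(String[] args) {
            // One thread, yet this throws ConcurrentModificationException: the
            // list is structurally modified behind the live iterator's back.
            List<String> names = new ArrayList<>(List.of("a", "b", "c"));
            try {
                for (String n : names) {
                    if (n.equals("a")) {
                        names.remove(n); // modCount changes; the next iteration step fails
                    }
                }
            } catch (ConcurrentModificationException e) {
                System.out.println("CME on a single thread");
            }

            // The supported way: remove through the iterator itself.
            List<String> fresh = new ArrayList<>(List.of("a", "b", "c"));
            for (Iterator<String> it = fresh.iterator(); it.hasNext(); ) {
                if (it.next().equals("a")) {
                    it.remove();
                }
            }
            System.out.println(fresh); // [b, c]
        }
    }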
I've seen this in production C#. Same symptoms, process dump showed a corrupt dictionary. We thought we were using ConcurrentDictionary everywhere but this one came in from a library. We were using .net framework, IIRC .net core has code to detect concurrent modification. I don't know how they implement it but it could be as simple as a version counter.
It's odd he gets so hung up on NPE being a critical ingredient when that doesn't appear to be present in the original manifestation. There's no reason to think you can't have a bug like this in C just because it doesn't support exceptions.
To me it's all about class invariants. In general, they don't hold while a mutator is executing and will only be restored at the end. Executing another mutator before the invariants are reestablished is how you corrupt your data structure. If you're not in a valid state when you begin you probably won't be in a valid state when you finish.
It came down to poor logic. I got hung up on it because I got unlucky and couldn't reproduce it with uncaught NPE so I incorrectly concluded that uncaught NPE was a necessary condition.
> Could an unguarded TreeMap cause 3,200% utilization?
I've seen the same thing with an undersynchronized java.util.HashMap. This would have been in like 2009, but afaik it can still happen today. iirc HashMap uses chaining to resolve collisions; my guess was it introduced a cycle in the chain somehow, but I just got to work nuking the bad code from orbit [1] rather than digging in to verify.
I often interview folks on concurrency knowledge. If they think a data race is only slightly bad, I'm unimpressed, and this is an example of why.
[1] This undersynchronization was far from the only problem with that codebase.
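For anyone who wants to see how quickly it goes wrong, a deliberately broken sketch of that kind of undersynchronized access (data races are nondeterministic, so any given run may look fine, lose entries, report a wrong size, or on old JDKs even loop during a resize):

    import java.util.HashMap;
    import java.util.Map;

    class HashMapRaceSketch {
        public static void main(String[] args) throws InterruptedException {
            Map<Integer, Integer> map = new HashMap<>(); // not thread-safe, shared anyway

            Runnable writer = () -> {
                for (int i = 0; i < 100_000; i++) {
                    map.put(i, i);
                }
            };

            Thread t1 = new Thread(writer);
            Thread t2 = new Thread(writer);
            t1.start();
            t2.start();
            t1.join();
            t2.join();

            // With no synchronization there is no guaranteed outcome: entries can
            // be lost and size() can disagree with the actual contents.
            System.out.println("size = " + map.size() + " (expected 100000)");
        }
    }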
There are basically three ways to resolve this kind of shared-mutable-state race:
* Add synchronization with locks, etc.
* Don't share mutable access, e.g. single-owner model with channels.
* Make data immutable: an insight originally from purely functional languages (see the sketch below). I believe Google invested quite heavily in this model with Guava.
Rust lets you choose which one of the three to give up, and it also statically prevents all three from happening at the same time.
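Illustrating the third option, a small copy-on-write sketch (hypothetical names): readers always see a complete, immutable snapshot, and writers swap in a new snapshot atomically.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.atomic.AtomicReference;

    class ImmutableSnapshotRegistry {
        private final AtomicReference<Map<String, Integer>> snapshot =
                new AtomicReference<>(Map.<String, Integer>of());

        // Readers get a consistent, unmodifiable view with no locking.
        Map<String, Integer> current() {
            return snapshot.get();
        }

        // Writers build a new immutable map and publish it atomically;
        // updateAndGet retries the whole function on contention.
        void put(String key, int value) {
            snapshot.updateAndGet(old -> {
                Map<String, Integer> next = new HashMap<>(old);
                next.put(key, value);
                return Map.copyOf(next);
            });
        }
    }

Guava's ImmutableMap, or a persistent-collections library with structural sharing, avoids copying the whole map on every write.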
That can still break if a reader makes multiple calls and implicitly assumes that the data structure at large hasn't been mutated between them.
I haven't personally encountered a buggy comparator without a total order.
There's nothing pragmatic about ignoring some warnings: once I get into the habit of ignoring a few, that effectively means all warnings will be ignored.
The only time to fix these things is just after they're discovered - or mostly never, because it becomes expensive to build up the context in your mind again.
But for some solace https://en.wikipedia.org/wiki/The_Power_of_10:_Rules_for_Dev...
Hyperbole sure, but made me chuckle and is a nice reminder to check all assumptions.
I myself do try to fix most warnings -- some are just false positives.
In other words, use Rust.
...which I don't think I've ever seen before, but nicely explains the problem I saw, and coincidentally is from the same year!