Ooof. The core collections in Java are well understood to not be thread-safe by design, and this should have been noticed.
OP should go through the rest of the code and see if there are other places where collections are potentially operated by multiple threads.
>The easiest way to fix this was to wrap the TreeMap with Collections.synchronizedMap or switch to ConcurrentHashMap and sort on demand.
That will make the individual map operations thread-safe, but given that nobody thought of concurrency, are you sure series of operations are thread-safe? That is, are you sure the object that owns the tree-map is thread-safe? I wouldn't bet on it.
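To make that concrete, a minimal sketch (illustrative names, not the article's code): even with Collections.synchronizedMap, anything that spans more than one call, notably iteration, still needs an explicit lock on the wrapper, as the synchronizedMap documentation itself requires.

    import java.util.Collections;
    import java.util.Map;
    import java.util.TreeMap;

    class SynchronizedTreeMapOwner {
        // Each individual call on this wrapper is thread-safe on its own...
        private final Map<String, Integer> byName =
                Collections.synchronizedMap(new TreeMap<>());

        void put(String name, int value) {
            byName.put(name, value);
        }

        // ...but anything spanning several calls is not. Iterating the view
        // collections requires manually locking the wrapper (per its Javadoc),
        // and check-then-act sequences need the same treatment.
        int sum() {
            synchronized (byName) {
                int total = 0;
                for (int v : byName.values()) {
                    total += v;
                }
                return total;
            }
        }
    }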
>Controversial Fix: Track visited nodes
Don't do that! The collection will still not be thread-safe and you'll just fail in some other more subtle way .. either now, or in the future (if/when the implementation changes in the next Java release).
>Sometimes, a detail oriented developer will notice the combination of threads and TreeMap, or even suggest to not use a TreeMap if ordered elements are not needed. Unfortunately, that didn’t happen in this case.
That's not a good take-away! OP, the problem is you violating the contract of the collection, which is clear that it isn't thread-safe. The problem ISN'T the side-effect of what happens when you violate the contract. If you change TreeMap to HashMap, it's still wrong (!!!), even though you may not get the side-effect of a high CPU utilization.
---------
When working with code that is operated on by multiple threads, the only surefire strategy I found was to make every possible object immutable and to limit any object that could not be made immutable to small, self-contained and highly controlled sections. We rewrote one of the core modules following these principles, and it went from being a constant source of issues to one of the most resilient sections of our codebase. Having these guidelines in place also made code reviews much easier.
> That will make the individual map operations thread-safe, but given that nobody thought of concurrency, are you sure series of operations are thread-safe? That is, are you sure the object that owns the tree-map is thread-safe? I wouldn't bet on it.
This is a very important point! And, in my experience, a common mistake by programmers who aren't great at concurrency.
Let's say you have a concurrent dynamic array. The array class is designed to be thread-safe for the explicit purpose that you want to share it between threads. You want to access element 10, but you want to be sure that you're not out of bounds. So you do this:
    if (array.size() > 10) {
        array.element(10).doSomething();
    }
It doesn't matter how thread-safe this array class is: these two operations combined are NOT thread-safe, because thread-safety of the class only means that `.size()` and `.element()` are on their own not going to cause any races. But it's entirely possible another thread removes elements in between you checking the size and accessing the element, at which point you'll (at best) get an out-of-bounds crash.
The way to fix it is to either use atomic methods on the class which do both (something like `.element_or_null()` or whatever), or to not bother with a concurrent dynamic array at all and instead just use a regular one that you guard with a mutex (so the mutex is held during both operations and whatever other operations other threads perform on the array).
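In Java terms, a rough sketch of that trap and of the two fixes (the class, field and method names here are made up for illustration):

    import java.util.ArrayList;
    import java.util.List;

    class ElementAccess {
        private final List<Runnable> tasks = new ArrayList<>();

        // Broken even if 'tasks' were a thread-safe list: another thread can
        // shrink it between the size() check and the get().
        void runTenthUnsafe() {
            if (tasks.size() > 10) {
                tasks.get(10).run();
            }
        }

        // Fix 1: hold one lock across both operations (every other reader and
        // writer of 'tasks' must use the same lock).
        synchronized void runTenthLocked() {
            if (tasks.size() > 10) {
                tasks.get(10).run();
            }
        }

        // Fix 2: expose a single "atomic" accessor that does the bounds check
        // itself, so callers never split the check from the access.
        synchronized Runnable tenthOrNull() {
            return tasks.size() > 10 ? tasks.get(10) : null;
        }
    }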
A similar pattern (and overlooked very often) is with DB updates where you want to increment a number by 1, but split it into 2 roundtrips to the DB to get the existing value and set it to `value + 1`.
Concurrent transactions can screw it up, and you wouldn't notice until an escalation happens.
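For example, a sketch against a hypothetical counters table (plain JDBC, names made up): the two-round-trip version can lose increments when transactions interleave, while a single UPDATE lets the database apply the increment atomically.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    class CounterDao {
        // Racy: two transactions can both read 41 and both write 42,
        // silently losing one of the increments.
        void incrementWithTwoRoundTrips(Connection conn, long id) throws SQLException {
            int current;
            try (PreparedStatement read =
                         conn.prepareStatement("SELECT value FROM counters WHERE id = ?")) {
                read.setLong(1, id);
                try (ResultSet rs = read.executeQuery()) {
                    if (!rs.next()) {
                        throw new SQLException("no counter with id " + id);
                    }
                    current = rs.getInt(1);
                }
            }
            try (PreparedStatement write =
                         conn.prepareStatement("UPDATE counters SET value = ? WHERE id = ?")) {
                write.setInt(1, current + 1);
                write.setLong(2, id);
                write.executeUpdate();
            }
        }

        // Atomic: the database applies the increment itself in one statement.
        void incrementInOneStatement(Connection conn, long id) throws SQLException {
            try (PreparedStatement update =
                         conn.prepareStatement("UPDATE counters SET value = value + 1 WHERE id = ?")) {
                update.setLong(1, id);
                update.executeUpdate();
            }
        }
    }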
This kind of stuff is exactly why the original "thread-safe" collections in Java were deprecated in favor of the current ones: locking on every operation means that a lot of trivial code "just works", but then things break without warning once you need consistency between consecutive operations. And, at the same time, everybody is paying the locking penalty, even if they don't actually do concurrent access, or if they do their own locking.
Ironically, .NET repeated the same mistake with the same outcome - it's just less obvious because for .NET the switch from synchronized-by-default to unsync-by-default happened when generics were introduced, so type safety of the new collections was the part that devs generally paid the most attention to.
This is basically what I thought: "Wait a second, is that data structure even thread-safe??" and then "Seems like you are just using the wrong data structure for your purpose.". Shoehorning with mutexes will usually lead to problems elsewhere and bottlenecks.
What you suggest, to make objects immutable: yes, and then one actually uses different data structures, ones made to be used as immutable data structures, making use of structural sharing to avoid or limit an explosion in memory usage. So again it's a case of "use the right data structures for the job". Using purely functional data structures makes one side of the problem simply disappear. The other side might not even come up in many cases: what if the threads depend on other versions of the data structure, which the other threads create (i.e. they need intermediate results of other threads)? Then one has a real headache coming one's way, and probably again needs other data structures.
One additional idea I had, if one really wants to shoehorn it with the already-used mutable data structure, is that one could write something that serializes the access attempts before they hit the data structure. That could also include grouping accesses into transactions and only running complete transactions. Then it makes me think whether it is close to implementing a database ...
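A rough sketch of that idea (hypothetical names, and only one of several ways to do it): a single-threaded executor owns the map, and every access, including multi-step "transactions", is funneled through it one at a time.

    import java.util.TreeMap;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.function.Function;

    class SerializedTreeMap<K extends Comparable<K>, V> {
        // Only the single executor thread ever touches this map.
        private final TreeMap<K, V> map = new TreeMap<>();
        private final ExecutorService owner = Executors.newSingleThreadExecutor();

        // Each caller-supplied "transaction" runs to completion before the next
        // one starts. Callers must not let the map reference escape the lambda.
        <R> CompletableFuture<R> transact(Function<TreeMap<K, V>, R> work) {
            return CompletableFuture.supplyAsync(() -> work.apply(map), owner);
        }

        void shutdown() {
            owner.shutdown();
        }
    }

Callers would then write something like serialized.transact(m -> m.put(key, value)) and keep any multi-step read-modify-write inside one lambda; at that point it does start to look like a tiny single-writer database.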
So your solution to the insanity that is the threading model was to reinvent the process model in it...
I don't intend to be mean, I just think the idea of independent execution in a shared memory environment is fundamentally flawed. And the effort in trying to get such follies working should have been put into better process models.
>So your solution to the insanity that is the threading model was to reinvent the process model in it...
Where did you get that?
Immutability does not mean replicating the 'process model'. It means removing (limiting) an entire class of bugs from your multithreaded code. There will still be shared areas - and those should be controlled.
    public void someFunction(SomeType relatedObject,
                             List<SomeOtherType> unrelatedObjects) {
        ...
        treeMap.put(relatedObject.a(), relatedObject.b());
        ...
        // unrelatedObjects is used later on in the function so the
        // parameter cannot be removed
    }
That’s not true. The original code only does the treeMap.put if unrelatedObjects is nonempty. That may or may not be a bug.
You also would have to check that a and b return the same value every time, and that treeMap behaves like a map. If, for example, it logs updates, you’d have to check that changing that to log only once is acceptable.
Whether it occurs or not can depend on the specific data being processed, and the order in which it is being processed. So this can happen in production after seemingly working fine for a long time.
I recently wrote a blog post which (while not the main focus) includes a discussion on randomized testing of sort algorithms in the face of inconsistent comparators: https://sunshowers.io/posts/monads-through-pbt/ section 2.
I have seen a lot of incorrect Comparators and Comparable implementations in existing code, but haven’t personally come across the infinite-loop situation yet.
To give one example, a common error is to compare two int values via subtraction, which can give incorrect results due to overflow and modulo semantics, instead of using Integer::compare (or some equivalent of its implementation).
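A minimal illustration of that overflow (standalone demo, not from any particular codebase):

    import java.util.Comparator;

    class SubtractionComparatorDemo {
        public static void main(String[] args) {
            // Looks reasonable, but a - b overflows: Integer.MIN_VALUE - 1 wraps
            // around to Integer.MAX_VALUE, so the smaller value compares as larger.
            Comparator<Integer> bySubtraction = (a, b) -> a - b;
            Comparator<Integer> correct = Integer::compare;

            System.out.println(bySubtraction.compare(Integer.MIN_VALUE, 1)); // 2147483647 (wrong sign)
            System.out.println(correct.compare(Integer.MIN_VALUE, 1));       // -1 (correct)
        }
    }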
> I always thought of race conditions as corrupting the data or deadlocking. I never thought it could cause performance issues. But it makes sense, you could corrupt the data in a way that creates an infinite loop.
Food for thought. I often think to myself that any error or strange behavior or even warnings in a project should be fixed as a matter of principle, as they could cause seemingly unrelated problems. Rarely is this accepted by whoever chooses what we should work on.
It's a decent rule of thumb, but it definitely needs some pragmatism. Squashing any error, strangeness and warning can be very expensive in some projects, much more than paying the occasional seemingly-unrelated problem.
But of course it's quasi-impossible to know in advance the likelihood of a given error participating in a future problem, and whether it's cheaper to fix this error ahead or let the problem happen. So it becomes an art more than a science to decide what to focus on.
"fix nothing" is certainly a horrible approach, "fix everything" is often impractical. So you either need some sort of decision framework, or a combination of decent "instinct" (experience, really) and trust from your stakeholder (which comes from many places, including good communication and track record of being pragmatic over dogmatic)
> Squashing any error, strangeness and warning can be very expensive in some projects
Strongly disagreed. Strange, unexpected behaviour of code is a warning sign that you have fallen short in defensive programming and you no longer have a mental model of your code that corresponds with reality. That is a very dangerous place to be in. You can very quickly find yourself stuck in quicksand not too far afterwards.
I remember a project to get us down from like 1500 build warnings to sub 100. It took a long time, generated plenty of bikeshedding, and was impossible to demonstrate value.
I, personally, was mostly just pissed we didn't get it to zero. Unsurprisingly the number has climbed back up since
Fixing everything is impractical, but I'd say a safer rule of thumb would be to at least understand small strangenesses/errors. In the case of things that are hard to fix - e.g. design/architectural decisions that lead to certain performance issues or what have you - it's still usually not too time consuming to get a basic understanding of why something is happening.
Still better to quash small bugs and errors where possible, but at least if you know why they happen, you can prevent unforeseen issues.
At my job we treat all warnings as errors and you can't merge your pull requests unless all automatically triggered CI pipelines pass. It requires discipline, but once you get it into that state it's a lot easier to keep it there.
It then creates immense value by avoiding a lot of risk and uncertainty for little effort.
Getting from "thousands of warnings" to zero isn't a good ROI in many cases, certainly not on a shortish term. But staying at zero is nearly free.
This is even more so with those "fifteen flickering tests", the 23 tests that have been failing and been ignored or skipped for years.
It's also why I commonly set up a CI, testing systems, linters, continuous deployment before anything else. I'll most often have an entire CI and guidelines and build automation to deploy something that will only say "hello world".
Because it's much easier to keep it green, clean and automated than to move there later on
> At my job we treat all warnings as errors and you can't merge your pull requests unless all automatically triggered CI pipelines pass. It requires discipline, but once you get it into that state it's a lot easier to keep it there.
Sounds like what we used to call "professionalism." That was before "move fast, break things and blame the user" became the norm.
Pretty sure what you actually mean is that you treat some warnings as errors, and disable the others. I would find it hard to believe you're building with literally all the compiler warnings enabled all the time.
The other thing is: don't catch and ignore exceptions. Even "catch and log" is a bad idea unless you specifically know that program execution is safe to continue. Just let the exception propagate up to where something useful can be done, like returning a 500 or displaying an error dialog.
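A sketch of the difference, with made-up method names (the point is where the failure surfaces, not the business logic):

    class OrderHandler {
        // Anti-pattern: "catch and log" lets execution continue past a failed
        // step, so the charge below runs even though nothing was reserved.
        void processSwallowing(String orderId) {
            try {
                reserveInventory(orderId);
            } catch (RuntimeException e) {
                System.err.println("reservation failed: " + e); // logged, then ignored
            }
            chargeCustomer(orderId);
        }

        // Better: let it propagate to a boundary that can do something useful,
        // e.g. return a 500, show an error dialog, or fail and retry the job.
        void process(String orderId) {
            reserveInventory(orderId);
            chargeCustomer(orderId);
        }

        private void reserveInventory(String orderId) { /* may throw */ }

        private void chargeCustomer(String orderId) { /* must not run after a failed reservation */ }
    }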
I agree, and I think the OP does as well. Really, the Executor framework used here was to blame by not setting the uncaught exception handler to catch and propagate those exceptions by default.
But having shared, mutable state without locks or other protection should never have passed code review in the first place. I would look at that commit to try to determine how it happened, then try to change the team's processes to prevent that.
Yes, but…I suppose you have to pick your battles. There was recently a problem vexing me about a Rails project I maintain where the logs were filled with complaints about “unsupported parameters”, even though we painstakingly went through all the controllers and allowed them. It’s /probably/ benign, but it adds a lot of noise to the logs. Several of us took a stab at resolving it, but in the end we always had much higher priorities to address. Also it’s hard to justify spending so many hours on something that has little “business value”, especially when there is currently no functional impact.
It’s a nuisance issue sorta like hemorrhoids. Do you get the surgery and suffer weeks of extreme pain, or do you just deal with it? Hemorrhoids are mostly benign, but certainly have the potential to become more severe and very problematic. Maybe this phenomenon should be called digital hemorrhoids?
As someone with pretty bad hemorrhoids, I’m hesitant to ask my doctor about surgery because I’ve been told the hemorrhoids will come back, without question. So it’s even still just a temporary fix…
> > race conditions […] I never thought it could cause performance issues. But it makes sense, you could corrupt the data in a way that creates an infinite loop.
Even without corruption a race condition can cause significant performance issues by causing the same work to be done many times with only one result being kept.
> Food for thought. I often think to myself that any error or strange behavior or even warnings in a project should be fixed as a matter of principle
For warnings: at least explain in a comment where it has been decided the warning is irrelevant (preferably with a pragma to turn off the warning as locally as possible).
Strange behaviour I prefer to get rid of. I've found code marked (by me at least once!) "not sure why this works, but" which very much no longer works, and it had to be rewritten in a rush, where if it had been addressed earlier there would have been more time to be careful.
> Rarely is this accepted by whoever chooses what we should work on.
You need to find more disciplined teams.
There are still people out there who care about correctness and understand how to achieve it without it being an expensive distraction. It's a team culture factor that mostly just involves staying on top of these concerns as soon as they're encountered, so there's not some insurmountable and inscrutable backlog that makes it feel daunting and hopeless or that makes prioritization difficult.
Most teams are less disciplined than they should be. Also, job/team mobility is very low right now. So the question becomes, how do you increase discipline on the team you're on?
> Food for thought. I often think to myself that any error or strange behavior or even warnings in a project should be fixed as a matter of principle, as they could cause seemingly unrelated problems. Rarely is this accepted by whoever chooses what we should work on.
I agree. I hate lingering warnings. Unfortunately, at the time of this bug I did not have static analysis tools to detect these code smells.
> Rarely is this accepted by whoever chooses what we should work on.
I get that YMMV based on the org, but I find that more often than not, it’s expected that you are properly maintaining the software that you build, including the things like fixing warnings as you notice them. I can already feel the pushback coming, which is “no but really, we are NOT allowed to do ANYTHING but feature work!!!!” and… okay, I’m sorry, that really sucks… but maybe, MAYBE, some of that is you being defeatist, which is only adding to this culture problem. Oh well, now’s not the time to get fired for something silly like going against the status quo, so… I get it either way.
And from a security perspective, the "might cause a problem 0.000001% of the time" flaws can often be manipulated into becoming a problem 100% of the time.
>Rarely is this accepted by whoever chooses what we should work on.
This is the kind of thing you should probably just do, not seek buy-in for from whoever chooses what you should work on. Unless it is going to take some extreme amount of time.
The mention of "could barely ssh in" reminds me of a situation in grad school where our group had a Sun UltraSparc 170 (IIRC) with 1GB HD and 128 or 256 MB of RAM, shared by maybe 8 people in a small research group relating to parallel and distributed processing. Keep in mind, Sun machines were rarely rebooted, ever.
So I guess the new user / student was trying to do things in parallel to speed things up when they chopped up their large text file into N (32 or 64) sections based on line number (not separate files), and then ran N copies of perl in parallel, each processing its own set of lines from that one file.
Not only did you have a large amount (for back then) of RAM used by N copies of the perl interpreter (separate processes, not threads, mind you!) each processing its own data; on top of that, any attempt to swap was interleaved with frantic seeking to a different section of the same file to read a few more lines for one of the N processes stalled on IO. Also, the Jth process probably had to read J/N of the entire file to get to its section. So the first section of the file was read N times, the next N-1, then N-2, etc.
We (me and the fellow-student-trusted-as-sysadmin who had the root password) couldn't even get a login prompt on the console. Luckily, I had a logged-in session (ssh from an actual x-terminal - a large-screen "dumb" terminal), and "su" prompted for a password after 20-30 minutes of running it. After another 5-10 minutes, we had a root session and were able to run top and figure out what was going on. Killing the offending processes (after trying to contact the user) restored the system back to normal.
Edit: forgot to say: had the right idea, but totally didn't understand the system's limitations. It was SEVERELY I/O limited with that hard drive and relatively low RAM, so just processing the data linearly would have easily been the best approach unless the amount of data to be kept would have gotten too large.
Yep, ran into this way too many times. Performing concurrent operations on non-thread-safe objects in Java, or generally in any language, produces the most interesting bugs in the world.
Which is why you manage atomic access to non-thread-safe objects yourself, or use a thread-safe version of them when using them across threads.
Multithreading errors are the worst to debug. In this case it's dead simple to identify at design time and warning flags should have gone up as soon as he started thinking about using any of the normal containers in a multithreaded environment.
Every time I think I'm sorta getting somewhere in my understanding of how to write code I see a comment like this that reminds me that the rabbithole is functionally infinite in both breadth and depth.
There's simply no straightforward default approach that won't have you running into and thinking through the most esoteric sounding problems. I guess that's half the fun!
The usual issue is code evolution over time, not the initial version which tends to be okay. You really want to have tooling strictly enforce invariants, and do so in a way that fails closed rather than open.
I ran into my share of concurrency bugs, but one thing I could never intentionally trigger was any kind of inconsistency stemming from removing a "volatile" modifier from a mutable field in Java. Maybe the JVM I tried this with was just too awesome.
I've universally found that even when I am convinced that I am OK with the consequences of sharing something that isn't synchronized, the actual outcome is something I wasn't expecting.
The only things that should be shared without synchronization are readonly objects where the initialization is somehow externally serialized with accessors, and atomic scalars -- C++ std::atomic, Java has something similar, etc.
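In Java terms, a small sketch of those two categories (illustrative names): an atomic scalar, and an immutable object whose fields are all final and which is safely published.

    import java.util.concurrent.atomic.AtomicLong;

    class RequestStats {
        // Atomic scalar: safe to share and update without external locking.
        private final AtomicLong requestCount = new AtomicLong();

        // Immutable object: all fields final, no mutation after construction,
        // so it can be read from any thread once safely published.
        static final class Config {
            final int maxRetries;
            final long timeoutMillis;

            Config(int maxRetries, long timeoutMillis) {
                this.maxRetries = maxRetries;
                this.timeoutMillis = timeoutMillis;
            }
        }

        void onRequest() {
            requestCount.incrementAndGet();
        }

        long total() {
            return requestCount.get();
        }
    }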
This is kind of a hot take but I actually prefer debugging races in C/C++ for this reason. Yes, the language prescribes insane semantics (basically none) when it happens, but in practice you’ll get memory corruption or other noisy issues pretty often, and the fact that races are mostly illegal means you can write something like thread sanitizer without needing source code changes to indicate semantics. Meanwhile in Java you’ll never have UB but often you’ll have two fields be subtly out of sync and it’s a lot harder to track this kind of thing down.
Some (maybe most?) operations on Java collections perform integrity checks to warn about such issues, for example a map throwing ConcurrentModificationException.
ConcurrentModificationException does not check threads; it triggers when it is already too late. It also triggers on the same thread if you remove elements from a collection while iterating over it.
ConcurrentModificationException is typically thrown from an iterator when it detects that it’s been invalidated by a modification to the underlying collection. It’s harder to check for the case described in this article, which is about multiple threads calling put() concurrently on a non thread safe object.
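A minimal single-threaded demonstration of that, using an ArrayList for brevity (maps behave the same way through their view collections):

    import java.util.ArrayList;
    import java.util.ConcurrentModificationException;
    import java.util.Iterator;
    import java.util.List;

    class CmeDemo {
        public static void main(String[] args) {
            // One thread, yet this throws ConcurrentModificationException: the
            // list is structurally modified behind the live iterator's back.
            List<String> names = new ArrayList<>(List.of("a", "b", "c"));
            try {
                for (String n : names) {
                    if (n.equals("a")) {
                        names.remove(n); // modCount changes; the next iteration step fails
                    }
                }
            } catch (ConcurrentModificationException e) {
                System.out.println("CME on a single thread");
            }

            // The supported way: remove through the iterator itself.
            List<String> fresh = new ArrayList<>(List.of("a", "b", "c"));
            for (Iterator<String> it = fresh.iterator(); it.hasNext(); ) {
                if (it.next().equals("a")) {
                    it.remove();
                }
            }
            System.out.println(fresh); // [b, c]
        }
    }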
I've seen this in production C#. Same symptoms, process dump showed a corrupt dictionary. We thought we were using ConcurrentDictionary everywhere but this one came in from a library. We were using .net framework, IIRC .net core has code to detect concurrent modification. I don't know how they implement it but it could be as simple as a version counter.
It's odd he gets so hung up on NPE being a critical ingredient when that doesn't appear to be present in the original manifestation. There's no reason to think you can't have a bug like this in C just because it doesn't support exceptions.
To me it's all about class invariants. In general, they don't hold while a mutator is executing and will only be restored at the end. Executing another mutator before the invariants are reestablished is how you corrupt your data structure. If you're not in a valid state when you begin you probably won't be in a valid state when you finish.
It came down to poor logic. I got hung up on it because I got unlucky and couldn't reproduce it with uncaught NPE so I incorrectly concluded that uncaught NPE was a necessary condition.
> Could an unguarded TreeMap cause 3,200% utilization?
I've seen the same thing with an undersynchronized java.util.HashMap. This would have been in like 2009, but afaik it can still happen today. iirc HashMap uses chaining to resolve collisions; my guess was it introduced a cycle in the chain somehow, but I just got to work nuking the bad code from orbit [1] rather than digging in to verify.
I often interview folks on concurrency knowledge. If they think a data race is only slightly bad, I'm unimpressed, and this is an example of why.
[1] This undersynchronization was far from the only problem with that codebase.
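For anyone who wants to see how quickly it goes wrong, a deliberately broken sketch of that kind of undersynchronized access (data races are nondeterministic, so any given run may look fine, lose entries, report a wrong size, or on old JDKs even loop during a resize):

    import java.util.HashMap;
    import java.util.Map;

    class HashMapRaceSketch {
        public static void main(String[] args) throws InterruptedException {
            Map<Integer, Integer> map = new HashMap<>(); // not thread-safe, shared anyway

            Runnable writer = () -> {
                for (int i = 0; i < 100_000; i++) {
                    map.put(i, i);
                }
            };

            Thread t1 = new Thread(writer);
            Thread t2 = new Thread(writer);
            t1.start();
            t2.start();
            t1.join();
            t2.join();

            // With no synchronization there is no guaranteed outcome: entries can
            // be lost and size() can disagree with the actual contents.
            System.out.println("size = " + map.size() + " (expected 100000)");
        }
    }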
There are basically three ways to resolve this kind of shared-mutable-state race:
* Add synchronization with locks, etc.
* Don't share mutable access, e.g. single-owner model with channels.
* Make data immutable: an insight originally from purely functional languages (see the sketch below). I believe Google invested quite heavily in this model with Guava.
Rust lets you choose which one of the three to give up, and it also statically prevents all three from happening at the same time.
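Illustrating the third option, a small copy-on-write sketch (hypothetical names): readers always see a complete, immutable snapshot, and writers swap in a new snapshot atomically.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.atomic.AtomicReference;

    class ImmutableSnapshotRegistry {
        private final AtomicReference<Map<String, Integer>> snapshot =
                new AtomicReference<>(Map.<String, Integer>of());

        // Readers get a consistent, unmodifiable view with no locking.
        Map<String, Integer> current() {
            return snapshot.get();
        }

        // Writers build a new immutable map and publish it atomically;
        // updateAndGet retries the whole function on contention.
        void put(String key, int value) {
            snapshot.updateAndGet(old -> {
                Map<String, Integer> next = new HashMap<>(old);
                next.put(key, value);
                return Map.copyOf(next);
            });
        }
    }

Guava's ImmutableMap, or a persistent-collections library with structural sharing, avoids copying the whole map on every write.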
That can still break if a reader makes multiple calls and implicitly assumes that the data structure at large hasn't been mutated between them.
I haven't personally encountered a buggy comparator without a total order.
There's nothing pragmatic about ignoring some warnings: once I get into the habit of ignoring a few, that effectively means all warnings will be ignored.
The only time to fix these things is just after they're discovered - or mostly never, because it becomes expensive to build up the context in your mind again.
But for some solace https://en.wikipedia.org/wiki/The_Power_of_10:_Rules_for_Dev...
Hyperbole sure, but made me chuckle and is a nice reminder to check all assumptions.
I myself do try to fix most warnings -- some are just false positives.
In other words, use Rust.
...which I don't think I've ever seen before, but nicely explains the problem I saw, and coincidentally is from the same year!