For processing strings, streams in C++ can be slow

> When performing string concatenation operations, it is more advantageous in terms of performance to use std::ostringstream rather than std::string. This approach is also used elsewhere, such as debug_utils and node_errors.

(From the Node.js GitHub issue.) Sounds like this guy is mixing up his Java knowledge with C++ knowledge.

C++ streams are frankly insane: Loads of implicit state, needing to set about a half dozen flags to do any nontrivial formatting, running the risk of accidentally "poisoning" all downstream operations if you forget to reset any of that state, the useless callbacks API [1], obfuscated function names (xsgetn, epptr, egptr), a ridiculously convoluted inheritance hierarchy that includes virtual/diamond inheritance [2], and use of virtual functions for simple buffer manipulation. These were all bad decisions even at the time.

[1] https://en.cppreference.com/w/cpp/io/ios_base/register_callb...

[2] https://i.stack.imgur.com/dXhXP.png

thefourthchime · 2 years ago

C++ dev for 20+ years. I refused to use them; they encapsulated everything about C++ I hated. A ton of implicit actions and gotchas. It's a gun in hand, foot in target.

jnwatson · 2 years ago

Streams are the best evidence that C++ was an experiment. It was a sandbox to try a bunch of different language ideas.

Overloading the shift operators for this purpose is prima facie insane, and anyone who has single-stepped through a C++ "hello world" program can figure out it isn't remotely efficient, but it was certainly creative.

pjmlp · 2 years ago

Using C++ since 1993, I love them.

juunpp · 2 years ago

> (From the Node.js GitHub issue.) Sounds like this guy is mixing up his Java knowledge with C++ knowledge.

That is exactly it. C++ string streams have had atrocious performance since forever. Good abstraction, not very useful in practice.

In Java, if I remember correctly, strings are immutable, so the StringBuilder or whatever ridiculous name it had was the faster way to build a string.

> "I recently learned that some Node.js engineers prefer stream classes when building strings, for performance reasons."

Pretty much tells you everything you need to know about node js, I guess.

paulddraper · 2 years ago

> "I recently learned that some Node.js engineers prefer stream classes when building strings, for performance reasons." Pretty much tells you everything you need to know about node js, I guess.

Google Closure Library includes a StringBuffer class. [1]

I recall it having explanatory notes, but I don't see them in the code now. JavaScript engines can optimize string concatenation into an in-place edit if there is only one reference to the first string. The StringBuffer class keeps the reference count at one, guaranteeing this optimization is available even if the StringBuffer itself is shared.

[1] https://github.com/google/closure-library/blob/master/closur...

Rebelgecko · 2 years ago

For what it's worth, even in Java the compiler is often smart enough to replace naive string concatenation with equivalent StringBuilder usage (although I don't know if it is smart enough to do that in a for loop like this)

kristianp · 2 years ago

The github issue:

https://github.com/nodejs/node/pull/50253

Note that the person mixing java knowledge and C++ isn't Daniel Lemire.

lifthrasiir · 2 years ago

A bigger problem is that iostream is still the only C++ way to read and write files. Yeah, you can still use `std::fopen` and so on, but modern C++ strives to minimize type-ignorant C functions, right? The introduction of `std::format` made the formatting aspect of iostream obsolete, but iostream still has no standard alternative for its other aspects.

nly · 2 years ago

std::filebuf can be used directly without std::fstream if you just want to read and write bytes without doing formatted operations.
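
As a rough illustration of the point above, here is a minimal sketch of unformatted byte I/O that goes through `std::filebuf` alone, with no `std::fstream` involved. The helper names `write_bytes` and `read_bytes` are made up for this example, and error handling is reduced to a boolean or an empty result.

```cpp
#include <cstddef>
#include <fstream>  // std::filebuf lives here
#include <ios>
#include <string>

// Hypothetical helper: writes `data` to `path` as raw bytes.
// Returns true only if every byte was accepted and the file closed cleanly.
bool write_bytes(const std::string& path, const std::string& data) {
    std::filebuf buf;
    if (!buf.open(path, std::ios_base::out | std::ios_base::binary | std::ios_base::trunc)) {
        return false;
    }
    const std::streamsize wanted = static_cast<std::streamsize>(data.size());
    const std::streamsize written = buf.sputn(data.data(), wanted);
    // close() flushes the put area and returns nullptr on failure.
    const bool closed = buf.close() != nullptr;
    return closed && written == wanted;
}

// Hypothetical helper: reads the whole file at `path`.
// Returns an empty string if the file cannot be opened.
std::string read_bytes(const std::string& path) {
    std::filebuf buf;
    std::string out;
    if (!buf.open(path, std::ios_base::in | std::ios_base::binary)) {
        return out;
    }
    char chunk[4096];
    std::streamsize got = 0;
    // sgetn() returns the number of characters read, 0 at end of file.
    while ((got = buf.sgetn(chunk, static_cast<std::streamsize>(sizeof chunk))) > 0) {
        out.append(chunk, static_cast<std::size_t>(got));
    }
    return out;
}
```

Calling `write_bytes("data.bin", payload)` followed by `read_bytes("data.bin")` should round-trip the payload, embedded NUL bytes included, since nothing here interprets the data.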

Calavar · 2 years ago

std::print is coming with C++23. In the meantime, there's std::format_to. You still have to dump the output into an std::ostream, but at least you don't have to use the disgusting ostream interface directly.

AndrewStephens · 2 years ago

iostreams are such a messy design, for all the reasons you mention and more. They are rightly avoided in all code that cares about performance.

raverbashing · 2 years ago

Really

C++ the base language has issues, for sure

But iostream is like what you get from someone who is crazy enough to use every single language feature and who thinks there's some ulterior motive to create these crazy inheritance levels, etc.

Maybe we need C+=2 to make things less crazy

kevin_thibedeau · 2 years ago

Careful, the committee may decide to deprecate compound assignment again.

uxp8u61q · 2 years ago

What's the alternative to string streams for building strings piece by piece in C++? Plain old string concatenation? Asking for a friend... I should run benchmarks I guess.

im3w1l · 2 years ago

Are you doing something in a hot loop or not? The answer is likely that it doesn't really matter.

If you really need speed, then estimate how large a string you need in advance and preallocate it (either with new char[], or string::reserve I guess).
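
A minimal sketch of the two options discussed here: plain `std::string` appends with the capacity reserved up front, and the `std::ostringstream` equivalent for comparison. The function names are invented for this example, and no performance numbers are claimed; which one is faster on a given toolchain is something to measure.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Joins `parts` with `sep` by appending to a std::string whose capacity is
// reserved up front, so the loop itself should not need to reallocate.
std::string join_with_append(const std::vector<std::string>& parts, const std::string& sep) {
    std::size_t total = 0;
    for (const std::string& p : parts) total += p.size();
    if (!parts.empty()) total += sep.size() * (parts.size() - 1);

    std::string out;
    out.reserve(total);
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i != 0) out += sep;
        out += parts[i];
    }
    return out;
}

// Same result built through std::ostringstream.
std::string join_with_stream(const std::vector<std::string>& parts, const std::string& sep) {
    std::ostringstream os;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i != 0) os << sep;
        os << parts[i];
    }
    return os.str();
}
```

Both functions return the same string; the difference is only in how the intermediate buffer is managed. If the final size cannot be estimated, dropping the `reserve` call still leaves a working version.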

calamari4065 · 2 years ago

Huh. I don't know much at all about C++ streams, but this sounds only marginally worse than C#'s streams

colonwqbang · 2 years ago

Maybe not. The problem is that C++ (before C++20) has no normal print and format function. You are supposed to do everything with streams. To switch to two decimal places you would first output some magic value that sets an internal flag in the stream. Then you need to remember to restore it again.

Of course you could just use good old C printf to get some work done. But if you did that the "real" C++ programmers would sneer at you.
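
To make the "set a flag, then remember to restore it" problem concrete, here is a small sketch. The function names are invented for this example; the manipulators (`std::fixed`, `std::setprecision`) and the `flags()`/`precision()` accessors are standard.

```cpp
#include <iomanip>
#include <ios>
#include <ostream>
#include <sstream>
#include <string>

// Writes `value` with two decimals and leaves the stream's state changed.
void write_two_decimals_leaky(std::ostream& os, double value) {
    os << std::fixed << std::setprecision(2) << value;
}

// Same output, but restores the previous flags and precision before returning.
void write_two_decimals(std::ostream& os, double value) {
    const std::ios_base::fmtflags old_flags = os.flags();
    const std::streamsize old_precision = os.precision();
    os << std::fixed << std::setprecision(2) << value;
    os.flags(old_flags);
    os.precision(old_precision);
}

// Writes 3.14159 with two decimals, then 2.5 with whatever state is left over.
std::string two_decimals_then_default(bool restore) {
    std::ostringstream os;
    if (restore) {
        write_two_decimals(os, 3.14159);
    } else {
        write_two_decimals_leaky(os, 3.14159);
    }
    os << ' ' << 2.5;  // formatted with the state the previous call left behind
    return os.str();
}
```

With `restore` set to true the result is `3.14 2.5`; with it set to false the second number inherits the fixed two-decimal formatting and the result is `3.14 2.50`.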

<iostream> is currently a good example of a pure product of the 1990s/2000s, when the hype about object orientation was around its peak.

Almost everything related to C++ iostreams has this code smell of OOP pushed too far:

- Use of runtime virtual dispatch where it was not necessary, causing an unavoidable negative impact on performance.

- Heavy usage of function overloading with the "<<" operator. Leading to pages long compilation errors when an overload fails.

- Hidden states everywhere with the usage of state formatters and globals in the background.

- Unnecessary complexity with std::locale which is almost entirely useless for proper internationalisation.

- Bloat. Any statically linked binary will inherit roughly 100k of binary bloat when using iostream.

- Useless encapsulation with error reports done as abstracted bit flags. Which is absolutely horrendous when dealing with file I/O: It hides away the underlying error with no proper way to access it.

- Deep class hierarchy making the entire thing look like spaghetti.

- Useless abstraction with stringstream that hides the underlying buffer away, making it close to unusable on embedded safety critical systems where memory allocations are forbidden.

All of that made <iostream> age pretty badly, and for good reasons.

Fortunately there is an incoming way out of that with work of Victor Zverovich on std::format and libfmt [1].

[1]: https://github.com/fmtlib/fmt
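
On the bullet about error reports being reduced to bit flags, a small sketch of what that looks like when opening a file fails. The function names are invented for this example. One assumption to flag: `fopen` setting `errno` on failure is a POSIX guarantee rather than an ISO C one, so the second function is only meaningful on platforms that provide it. Some standard library implementations also happen to leave `errno` set after a failed `std::ifstream` open, but the stream interface itself does not expose it.

```cpp
#include <cerrno>
#include <cstdio>
#include <fstream>
#include <string>

// Stream view: all the interface reports is that failbit is set.
bool stream_open_failed(const std::string& path) {
    std::ifstream in(path);
    return in.fail();
}

// C stdio view: returns 0 on success, otherwise the errno left behind by fopen
// (assumes a POSIX-like platform where fopen sets errno on failure).
int stdio_open_errno(const std::string& path) {
    errno = 0;
    std::FILE* f = std::fopen(path.c_str(), "rb");
    if (f != nullptr) {
        std::fclose(f);
        return 0;
    }
    return errno;
}
```

For a path that does not exist, the first function can only answer "it failed", while the second should return `ENOENT` on a POSIX system, which can be turned into a message with `std::strerror`.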

planede · 2 years ago

I hear you, however something like streambuf is kinda necessary for a type-erased interface for input/output of trivial objects. The C alternative is FILE*, which isn't much better and isn't as customizable either.

I agree that the formatting could have been done better, and that part is indeed handled much better in fmt, although personally I dislike format strings. It's much better than printf, granted.
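
For readers who have not written one, this is roughly what customizing `streambuf` looks like: a sketch of a sink that discards its input and only counts characters. `CountingBuf` and `formatted_length` are names made up for this example; the overridden members (`overflow`, `xsputn`) are the standard extension points.

```cpp
#include <cstddef>
#include <ostream>
#include <streambuf>

// A sink that discards characters and only counts them. It has no put area,
// so every character reaches either xsputn() or overflow().
class CountingBuf : public std::streambuf {
public:
    std::size_t count() const { return count_; }

protected:
    // Called for single characters, since there is no buffer space to put them in.
    int_type overflow(int_type ch) override {
        if (!traits_type::eq_int_type(ch, traits_type::eof())) ++count_;
        return traits_type::not_eof(ch);
    }

    // Called for bulk writes.
    std::streamsize xsputn(const char* s, std::streamsize n) override {
        (void)s;  // contents are ignored in this sketch
        count_ += static_cast<std::size_t>(n);
        return n;
    }

private:
    std::size_t count_ = 0;
};

// Hypothetical helper: number of characters `value` occupies when streamed
// with default formatting.
template <typename T>
std::size_t formatted_length(const T& value) {
    CountingBuf buf;
    std::ostream os(&buf);
    os << value;
    return buf.count();
}
```

Any code that accepts a `std::ostream&` can write into this sink without knowing what is behind it, which is the type-erasure benefit mentioned above. The cost is the virtual calls that other comments in this thread complain about.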

vitaut · 2 years ago

{fmt} has internal buffering but it's not yet exposed to users. There is a feature request for it: https://github.com/fmtlib/fmt/issues/2354. FILE buffering is not too bad but it can be easily optimized: https://www.zverovich.net/2020/08/04/optimal-file-buffer-siz....

shiroiuma · 2 years ago

>Fortunately there is an incoming way out of that with work of Victor Zverovich on std::format and libfmt

Those are great, but iostreams hasn't been necessary in a very long time thanks to other libraries like Qt and Boost.

GuB-42 · 2 years ago

Avoiding dependencies is a good thing, especially for C++, which doesn't have a widely used centralized repository and dependency manager like npm, cargo or cpan. For better or for worse.

And pulling Boost, let alone Qt just to avoid the occasional use of iostreams (or printf) is a bit much IMHO. I usually try to avoid Boost, as I feel it is more of a sort of beta/preview for the standard library. Don't get me wrong, it is production-worthy, but it can lead to awkward things when some boost feature ends up in the standard libraries and the project ends up with bits of both.

std::format is great because at last, we can use it without dependencies.

adev_ · 2 years ago

Yes. Several alternatives have been available for a while.

The success of Victor has been to make the C++ committee accept the idea that a new formatter was necessary and to bring <format> into the STL.

This was not a small task: The committee has its fair amount of dinosaur gatekeepers and windmills [1]. For the best and the worst.

We now at least have a way to move on from <iostream> if we want to, with the hope of one day getting something that can replace iostream entirely.

[1]: Windmills: Person displacing air around but not much more than air.

JohnFen · 2 years ago

Indeed. I heavily dislike iostreams and think it does more harm than good for most use cases.

    struct ZeroCopyBuf : public std::streambuf {
        ZeroCopyBuf(const std::string &s) : ZeroCopyBuf(s.c_str(), s.length()) {}
        ZeroCopyBuf(const char *c, std::size_t l) : ZeroCopyBuf(const_cast<char*>(c), l) {}
        ZeroCopyBuf(char *c, std::size_t l) { setg(c, c, c + l); }
    };
    ...
    std::string s ...;
    ZeroCopyBuf buf(s);
    std::istream is(&buf);

ewalk153 · 2 years ago

Daniel’s PR to implement this in node.js[1] is a case study in:

- crafting a high context yet succinct description

- addressing PR feedback well

- giving respect to a pedantic commenter who understands the inner workings far less than Daniel, while not conceding to a destructive change.

I will share this PR widely as a role model for open source contributions.

[1] https://github.com/nodejs/node/pull/50288

rossjudson · 2 years ago

The same thing struck me as well. This is one of the best optimization professionals on the planet, showing up with a huge improvement, and receiving some misplaced arrogance.

The lesson here is to always, always watch your own review tone, and not make this mistake.

The other lesson is that when a PR shows up with this kind of technical information attached to it, spend the 60 seconds it takes to Google for "lemire".

pitaj · 2 years ago

I'm surprised that the reviewer was so ignorant of amortized constant time insertion.

If I'm being super pedantic, I would argue that while `string::push_back` should take amortized constant time, `string::append` has no such guarantee [1]. So it is technically possible for `my_string += "a";` (equivalent to `string::append`) to reallocate every time. Very pedantic indeed, but I have seen some C++ implementations where `std::vector<T>` is an alias for `std::deque<T>`, so...

One thing I don't like about lemire's phrasing is that he only looks at the current, often just the most widely available, implementations and doesn't make this point explicit in most cases.

EDIT: Thankfully he does acknowledge that in a later post [2].

[1] https://timsong-cpp.github.io/cppwp/n4861/strings#string.app...

[2] https://lemire.me/blog/2023/10/23/appending-to-an-stdstring-...

I am not at all surprised. Kids these days have no idea what CPUs can do. ;)

I periodically have interview candidates work through problems involving binary search, then switch to bounded and ask them how to make it go faster over N elements, where N is < 1e3. The answer is "just linear search, because CPUs really like to do that".
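
A sketch of the two searches being compared, assuming a sorted `std::vector<int>`. The function names are invented for this example. Whether the linear scan actually wins for a given size depends on the hardware and the data, so treat this as something to benchmark rather than a rule.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Index of the first element >= key in a sorted vector, by linear scan.
// The access pattern is sequential and the branch is easy to predict.
std::size_t linear_lower_bound(const std::vector<int>& sorted, int key) {
    std::size_t i = 0;
    while (i < sorted.size() && sorted[i] < key) ++i;
    return i;
}

// Same answer via std::lower_bound (binary search).
std::size_t binary_lower_bound(const std::vector<int>& sorted, int key) {
    return static_cast<std::size_t>(
        std::lower_bound(sorted.begin(), sorted.end(), key) - sorted.begin());
}
```

Both return the same index for any sorted input, so they can be swapped freely while measuring.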

euiq · 2 years ago

This feels like a conversation where it would have been useful for the participants to be very explicit about the points they were trying to convey: the reviewer could have said "Isn't this a quadratic algorithm, because each call to `+=` reallocates `escaped_file_path`?" (or whatever their specific concern was; I may have misunderstood), and the author's initial response could have been "No, because the capacity of the string is doubled when necessary."
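
The "capacity grows geometrically" answer can be checked empirically. The sketch below appends one character at a time and counts how often `capacity()` changes, as a proxy for reallocations. The function name is invented for this example, and, as noted elsewhere in this thread, the growth policy is an implementation detail, so the exact count will differ between standard libraries.

```cpp
#include <cstddef>
#include <string>

// Appends `n` characters one at a time and counts how often capacity() changes,
// which is a proxy for how many reallocations took place.
std::size_t count_capacity_changes(std::size_t n) {
    std::string s;
    std::size_t changes = 0;
    std::size_t last = s.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        s += 'a';
        if (s.capacity() != last) {
            ++changes;
            last = s.capacity();
        }
    }
    return changes;
}
```

On an implementation that grows the buffer by a constant factor, `count_capacity_changes(100000)` should come out in the low tens rather than anywhere near 100000, which is what makes the loop linear overall instead of quadratic.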

gumby · 2 years ago

My impression is that C++ streams are on their way out -- unlikely to be deprecated (too much existing code) but also unlikely to receive any more attention. They are old enough to likely not have any actual implementation bugs at this point, but in retrospect the design bugs from the 1980s are pretty serious.

The rapid incorporation of the excellent `format` package for printing points to a future falling back at least to ANSI buffered IO and possibly raw POSIX IO.

synergy20 · 2 years ago

you mean format + printf/scanf from C?

corysama · 2 years ago

https://en.cppreference.com/w/cpp/header/format

Which is based on

https://fmt.dev/latest/index.html

As others have pointed out I mean `std::format` et al in <format>.

There's also a proposal for a type safe scanf: scnlib, sort of format in reverse: https://scnlib.dev/en/master/

signa11 · 2 years ago

this: https://en.cppreference.com/w/cpp/utility/format/format most likely.

jujube3 · 2 years ago

Is there any reason to use the format package rather than printf? Aside from C++ people who claim to hate C needing to save face?

flohofwoe · 2 years ago

I like C much more than C++, but even with that must say that https://github.com/fmtlib/fmt is pretty nice (which is the base for std::format). Together with pystring (https://github.com/imageworks/pystring) it makes string processing in C++ somewhat bearable (pystring is slow though because it still uses the std::string type which excessively allocates, but at least it's convenient compared to 'raw' C++ string functionality).

IcyWindows · 2 years ago

The library overview goes over the reasons fairly succinctly: https://fmt.dev/latest/index.html#overview

Other than caring about type safety, and possible CVEs due to misuse of format strings?

amelius · 2 years ago

The problem with printf and C++ is that you can't do this in general:

    printf("%d %s", i, obj.convert_to_string())

because the const char* you get from obj.convert_to_string() might be deleted from the heap before printf() is called. And similar issues.
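
One way to read this concern, sketched under assumptions: the printf family only sees a `const char*`, so the string that owns the buffer has to stay alive for as long as that pointer is used. `Widget` is a hypothetical stand-in for `obj`, and here `convert_to_string()` is assumed to return a `std::string` by value. In that case passing `obj.convert_to_string().c_str()` directly as an argument is fine, because the temporary lives until the end of the full expression; the pointer dangles only if it is stored first and used in a later statement. Keeping the string in a named variable avoids having to think about it.

```cpp
#include <cstdio>
#include <string>

// Hypothetical type standing in for `obj` in the comment above.
struct Widget {
    std::string name;
    std::string convert_to_string() const { return "Widget(" + name + ")"; }
};

// Formats "<i> <obj>" with snprintf and returns the result as a std::string.
// Output longer than the local buffer is truncated in this sketch.
std::string describe(int i, const Widget& obj) {
    // The named variable keeps the character data alive for as long as the
    // pointer from c_str() is in use.
    const std::string text = obj.convert_to_string();
    char buf[128];
    std::snprintf(buf, sizeof buf, "%d %s", i, text.c_str());
    return buf;
}
```

If `convert_to_string()` instead returned a raw `const char*` into storage that the object may free or overwrite, no calling convention on the printf side would help, which is part of the type-safety argument made for `std::format` elsewhere in this thread.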

linuxhansl · 2 years ago

Yep. stringstreams make copies of passed strings too. I had to hack around that once like this:

Terrible. C++20 makes it better by adding move semantics to most methods... Still.

Just use the Boost Iostreams library

The array_source device can be used with any buffer (char*, size_t)

taspeotis · 2 years ago

When I did competitive programming in C++ nobody used fstream for reading input because it was much slower compared to fscanf.

IAmLiterallyAB · 2 years ago

It always surprises me how slow streams are. fscanf should be relatively slow because it has to parse the format string at runtime. So the new C++ format should be (and I believe is) much faster

fooker · 2 years ago

Parsing a string is significantly faster than disk io, so this is not an issue unless you are reading minuscule amounts of data.

kccqzy · 2 years ago

I believe the Abseil library has formatting functions that are capable of parsing the format string at compile time.

dezgeg · 2 years ago

fscanf() is also pretty slow because of thread safety, so each call involves a mutex (which goes unused 99% of the time). I wonder whether the new C++ libraries have faster non-thread-safe options?

Does it matter?

In the little competitive programming I did, input parsing time was negligible compared to the allowed runtime for solving the problem. Inputs were designed so that if you had the right algorithm, you could do it easily even with terrible optimization. Fast code could be an advantage in the algorithm (but not in parsing), as it could help you "cheat" and, for example, do a problem designed for N² in N³. Personally, I used iostreams, just because I found it a bit easier to type.

But then, different competitions have different rules, and maybe there are some where fscanf really is an advantage.

assbuttbuttass · 2 years ago

The best part of C++ is you can ignore all the iostream nonsense and use printf

neverartful · 2 years ago

Agreed! And not just printf -- any or all of the IO functions.

ho_schi · 2 years ago

Question

     std::ios_base::sync_with_stdio(false);

What is the effect of turning off synchronization with legacy functions from C? When C++ is used for I/O and no C is used this should be a habit. I’ve the impression that most C++ books don’t mention it (e.g. Primer) or only late.

It is similar to String and StringBuilder from Java. You need to know it, remember it and use it by habit. And again, books often mention it only late (e.g. Head First).

By the way. I like the plain things from <iostream>, especially the shift << and >> operators and ease of concatenating and handling strings. But as others mentioned, the implementation (e.g. inheritance) looks complicated.

Source https://en.cppreference.com/w/cpp/io/ios_base/sync_with_stdi...
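
On the question of what the call does: by default the C++ standard streams are synchronized with their C stdio counterparts, so output from `printf` and `std::cout` can be mixed freely and appears in order. Turning synchronization off lets the C++ streams buffer independently, which is often faster, but mixing the two families afterwards may produce output in an unexpected order. A minimal sketch, with a made-up wrapper name:

```cpp
#include <ios>
#include <iostream>

// Turns off synchronization between the C++ standard streams and C stdio.
// Returns the synchronization state that was in effect before the call.
// Intended to be called once, before any I/O has been performed.
bool disable_stdio_sync() {
    const bool was_synced = std::ios_base::sync_with_stdio(false);
    // Optional and separate: stop std::cin from flushing std::cout before each read.
    std::cin.tie(nullptr);
    return was_synced;
}
```

The usual place to call this is the first line of `main`. Calling it after I/O has already happened has implementation-defined effects, so it is not something to toggle in the middle of a program.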