DSMan195276 · 3 years ago
I like it, but the array details are a little bit off. An actual array does have a known size; that's why, when given a real array, `sizeof` can give the size of the array itself rather than the size of a pointer. There's no particular reason why C doesn't allow you to assign one array to another of the same length; it's largely just an arbitrary restriction. As you noted, it already has to be able to do this when assigning `struct`s.

Additionally a declared array such as `int arr[5]` does actually have the type `int [5]`, that is the array type. In most situations that decays to a pointer to the first element, but not always, such as with `sizeof`. This becomes a bit more relevant if you take the address of an array as you get a pointer to an array, Ex. `int (*ptr)[5] = &arr;`. As you can see the size is still there in the type, and if you do `sizeof *ptr` you'll get the size of the array.
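
For example, a minimal sketch of the difference (the sizes assume 4-byte int and 8-byte pointers):

    #include <stdio.h>

    int main(void) {
        int arr[5];
        int (*ptr)[5] = &arr;          /* pointer to the whole array */
        int *p = arr;                  /* decays to pointer to first element */
        printf("%zu\n", sizeof arr);   /* 20: the array type int[5] */
        printf("%zu\n", sizeof *ptr);  /* 20: same array type */
        printf("%zu\n", sizeof p);     /* 8: just a pointer */
        return 0;
    }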

Jasper_ · 3 years ago
I really wish that int arr[5] adopted the semantics of struct { int arr[5]; } -- that is, you can copy it, and you can pass it through a function without it decaying to a pointer. Right now in C:

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t t1[5];
    typedef struct { uint32_t arr[5]; } t2;
    void test(t1 a, t2 b) {
        t1 c;
        t2 d;
        printf("%zu %zu %zu %zu\n", sizeof(a), sizeof(b), sizeof(c), sizeof(d));
    }
will print 4 20 20 20 with 4-byte pointers (8 20 20 20 on a typical 64-bit target), because the parameter `a` has already decayed to a pointer. I understand that array types having their sizes in their types was one of Kernighan's gripes with Pascal [0], which likely explains why arrays decay to pointers, but for those cases, I'd say you should still be able to decay to a pointer explicitly when you really want to, with an explicit length parameter.

[0] http://www.lysator.liu.se/c/bwk-on-pascal.html

WalterBright · 3 years ago
> I really wish that int arr[5] adopted the semantics of struct { int arr[5]; }

You and me both. In fact, D does this. `int arr[5]` can be passed as a value argument to a function, and returned by value, just as if it were wrapped in a struct.

It's sad that C (and C++) take every opportunity to instantly decay the array to a pointer, which I've dubbed "C's Biggest Mistake":

https://www.digitalmars.com/articles/C-biggest-mistake.html

danoman · 3 years ago
Beware of struct padding.

sizeof(b.arr) is not necessarily equal to sizeof(b)

Consider:

  #include <stddef.h>
  #include <inttypes.h>

  typedef struct Array Array;

  struct Array {
      int32_t data[8];
  };

  void foo(Array const* arr) {
      size_t sz = sizeof(arr->data);  /* 32: the size of the array member;
                                         sizeof(Array) may be larger if the
                                         struct has padding */
  }

a1369209993 · 3 years ago
> There's no particular reason why C doesn't allow you to assign one array to another of the same length

Actually, there is a particular (though not necessarily good) reason, since that would require the compiler to either generate a loop (with a conditional branch) for an (unconditional) assignment, or generate unboundedly many assembly instructions (essentially an unrolled loop) for a single source operation.

Of course, that stopped being relevant when they added proper (assign, return, etc) support for structs, which can embed arrays anyway, but that wasn't part of the language initially.

pjmlp · 3 years ago
It was initially available in 1982, so plenty of time to add the other features.

https://www.bell-labs.com/usr/dmr/www/chist.html

dahfizz · 3 years ago
Another weird property about C arrays is that &arr == arr. The reference of an array is the pointer to the first element, which is what `arr` itself decays to. If arr was a pointer, &arr != arr.
WalterBright · 3 years ago
Is today international speak like a pirate day? arr arr arr
chasil · 3 years ago
I think it is clearer to say that arr == &arr[0] but your mileage may vary.
naniwaduni · 3 years ago
&arr is a pointer to the array. It will happen to point to the same place as the first element, but in fact they have different types, and e.g. (&arr)[0] == arr != arr[0].
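
To spell that out with a small sketch (assuming 4-byte int):

    int arr[5];
    int *first = arr;        /* type int *: points to arr[0] */
    int (*whole)[5] = &arr;  /* type int (*)[5]: points to the whole array */
    /* first and (int *)whole compare equal, but first + 1 advances by
       4 bytes while whole + 1 advances by sizeof(int[5]) == 20 bytes */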
derefr · 3 years ago
> There's no particular reason why C doesn't allow you to assign one array to another of the same length, it's largely just an arbitrary restriction.

IIRC C has an informal guarantee that no primitive syntax will ever cause the CPU to do more than O(1) work at runtime. Assignment is always O(1), and therefore assignment is limited to scalars. If you need assignment that might do O(N) work, you need to call a stdlib function (memcpy/memmove) instead. If you need an allocation that might do O(N) work, you either need a function (malloc) or you need to do your allocation not-at-runtime, by structuring the data in the program's [writable] data segment, such that it gets "allocated" at exec(2) time.

This is really one of the biggest formal changes between C and C++: C++ assignment, and the keywords `new` and `delete`, can all do O(N) work.

(Before anyone asks: a declaration `int foo[5];` in your code doesn't do O(N) work — it just moves the stack pointer, which is O(1).)

DSMan195276 · 3 years ago
> Assignment is always O(1)

This depends on what you consider to be O(1): since the size of the array is fixed, copying it is by definition O(1), but I get your point. I don't think the point holds in general, though: C often supports integer types that are too large to be copied in a single instruction on the target CPU, so the copy becomes a multi-instruction affair. If you consider that to still be O(1), then I think it's splitting hairs to say a fixed-size array copy would be O(N) when it's still just a fixed number of instructions or loop iterations.

Beyond that, struct assignments can already generate loops of as large a size as you want, Ex: https://godbolt.org/z/8Td7PT4af

jcelerier · 3 years ago
> IIRC C has an informal guarantee that no primitive syntax will ever cause the CPU to do more than O(1) work at runtime. Assignment is always O(1), and therefore assignment is limited to scalars.

this is absolutely and entirely wrong. You can assign a struct in C and the compiler will call memcpy when you do.

Enjoy: https://godbolt.org/z/98PnhYoev

leni536 · 3 years ago
This reasoning falls apart for structs with array members.
pratk · 3 years ago
Well, C does allow "copying" an array if it's wrapped inside a struct, and that copy is not O(1): gcc generates calls to memcpy in the assembly when copying the array.
xigoi · 3 years ago
> IIRC C has an informal guarantee that no primitive syntax will ever cause the CPU to do more than O(1) work at runtime.

How about CPUs that have no, say, division instruction, so it has to be emulated with a loop?

tmewett · 3 years ago
Interesting, I didn't fully realise that. That it's arbitrary is annoying; I clearly had tried to rationalise it to myself! Thanks for the comments, I will get around to amending the article.
emmelaich · 3 years ago
Hi, great article. Regarding char, I'd remark that getchar() etc. return int so that they can return -1 (EOF) on end-of-file or error.

I'm pretty sure this implies int as a declaration is always signed, but tbh I'm not completely sure!

Joker_vD · 3 years ago
Another corner of the language where arrays actually being arrays is important is multidimensional array access:

    int arr[5][7];
    arr[3][5] = 4; // equivalent to *(*(arr + 3) + 5) = 4;
This works because (arr + 3) has type "pointer to int[7]", not "pointer to int". The resulting address computation is

    (char*)arr + 3 * sizeof(int[7]) + 5 * sizeof(int) ==
    (char*)arr + 26 * sizeof(int)
That's also another reason why types like "int [][5][7]" are legal but "int [][][7]" are not: only the outermost dimension may be left unknown.

ErikCorry · 3 years ago
Really, are multidimensional arrays an important part of the language?

The above code looks like it's indexing into an array of pointers. If you want a flat array, make a few inlined helper functions that do the multiplying and adding. Your code will be much cleaner and easier to understand.

danoman · 3 years ago
Return a pointer to an array of 4 integers:

  int32_t (* bar(void))[4] {
      static int32_t u[4] = {1, 0, 1, 0};
      return &u;
  }
Return a pointer to a function taking a char:

  void f(char a) {
      // ...
  }

  void (* baz(void))(char) {
      return f;
  }

wnoise · 3 years ago
This is where you really want to start using typedefs.
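
For instance, a sketch of the parent's two functions with hypothetical typedef names:

    typedef int32_t quad[4];     /* array of 4 integers */
    typedef void char_fn(char);  /* function taking a char */

    quad *bar(void) {
        static quad u = {1, 0, 1, 0};
        return &u;
    }

    char_fn *baz(void) {
        return f;  /* the f(char) defined in the parent comment */
    }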
retrac · 3 years ago
> it's largely just an arbitrary restriction

Kind of. But the restriction is in keeping with the C philosophy of no hidden implementation magic. Early C had the same restriction on structs, and it's the same question: an array of bytes whose size is known to the compiler, which it could easily copy. But assignment is always a very cheap operation in C. If we allow assignment to stand for a memcpy(), that property is no longer true.

Same reason why Rust requires you to .clone() so much. It could do many of the explicit copies transparently, but you might accidentally pass around a 4 terabyte array by value and not notice.

DSMan195276 · 3 years ago
> But assignment is always a very cheap operation in C.

That's just not true though: you can assign structs of an arbitrarily large size to each other, and compilers will emit the equivalent of `memcpy()` to do the assignment. They might actually call `memcpy()` automatically depending on the particular compiler.

The fact that if you wrap the array in a struct then you're free to copy it via assignment makes it arbitrary IMO.

mytherin · 3 years ago
Perhaps I am missing something in the spec - but trying this in various compilers, it seems that you *can* assign structs holding arrays to one another, but you *cannot* assign arrays themselves.

This compiles:

  struct BigStruct {
    int my_array[4];
  };
  int main() {
    struct BigStruct a;
    struct BigStruct b;
    b = a;
  }
But this does not:

  int main() {
    int a[4];
    int b[4];
    b = a;
  }
That seems like an arbitrary restriction to me.

brundolf · 3 years ago
Ironically, Rust does allow you to implicitly copy an array as long as it reduces to a memcpy
nayuki · 3 years ago
> Everything I wish I knew when learning C

By far my biggest regret is that the learning materials I was exposed to (web pages, textbooks, lectures, professors, etc.) did not mention or emphasize how insidious undefined behavior is.

Two of the worst C and C++ debugging experiences I had followed this template: a coworker asked me why their function was crashing; I edited their function and it sometimes crashed or didn't, depending on how I rearranged lines of code; later I figured out that some statement near the top of the function had corrupted the stack, and that the crashes had nothing to do with my edits.

Undefined behavior is deceptive because the point at which the program state is corrupted can be arbitrarily far away from the point at which you visibly notice a crash or wrong data. UB can also be non-deterministic depending on OS/compiler/code/moonphase. Moreover, "behaving correctly" is one legal behavior of UB, which can fool you into believing your program is correct when it has a hidden bug.

A related post on the HN front page: https://predr.ag/blog/falsehoods-programmers-believe-about-u... , https://news.ycombinator.com/item?id=33771922

My own write-up: https://www.nayuki.io/page/undefined-behavior-in-c-and-cplus...

The take-home lesson about UB is to only rely on following the language rules strictly (e.g. don't dereference null pointer, don't overflow signed integer, don't go past end of array). Don't just assume that your program is correct because there were no compiler warnings and the runtime behavior passed your tests.
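
A classic sketch of why passing tests proves little: the compiler may assume signed overflow never happens.

    int plus_one_is_bigger(int x) {
        /* Looks like a runtime check, but the compiler may fold this
           to 'return 1': if x + 1 overflows, the behavior is undefined,
           so it may assume the overflow never occurs. */
        return x + 1 > x;
    }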

titzer · 3 years ago
> how insidious undefined behavior is.

Indeed. UB in C doesn't mean "and then the program goes off the rails", it means that the entire program execution was meaningless, and no part of the toolchain is obligated to give any guarantees whatsoever if the program is ever executed, from the very first instruction. A UB-having program could time-travel back to the start of the universe, delete it, and replace the entire universe with a version that did not give rise to humans and thus did not give rise to computers or C, and thus never exist.

It's so insidiously defined because compilers optimize based on UB; they assume it never happens and will make transformations to the program whose effects could manifest before the UB-having code executes. That effectively makes UB impossible to debug. It's monumentally rude to us poor programmers who have bugs in our programs.

mattkrause · 3 years ago
I'm not sure that's a productive way to think about UB.

The "weirdness" happens because the compiler is deducing things from false premises. For example,

1. Null pointers must never be dereferenced.

2. This pointer is dereferenced.

3. Therefore, it is not null.

4. If a pointer is provably non-null, the result of `if(p)` is true.

5. Therefore, the conditional can be removed.

There are definitely situations where many interacting rules and assumptions produce deeply weird, emergent behavior, but deep down, there is some kind of logic to it. It's not as if the compiler writers are doing

   if(find_undefined_behv(AST))
      emit_nasal_demons()
   else
      do_what_they_mean(AST)
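
In code, that deduction chain might look like this (a generic sketch, not any particular compiler's behavior):

    int deref_then_check(int *p) {
        int v = *p;   /* (2) p is dereferenced, so (3) p must be non-null */
        if (p) {      /* (4) provably true, so (5) the check is removed */
            return v;
        }
        return -1;    /* may be dropped as unreachable */
    }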

nayuki · 3 years ago
I agree with the factual things that you said (e.g. "entire program execution was meaningless"). Some stuff was hyperbolic ("time-travel back to the start of the universe, delete it").

> [compilers] will make transformations to the program whose effects could manifest before the UB-having code executes [...] It's monumentally rude to us poor programmers who have bugs in our programs.

The first statement is factually true, but I can provide a justification for the second statement which is an opinion.

Consider this code:

    void foo(int x, int y) {
        printf("sum %d", x + y);
        printf("quotient %d", x / y);
    }
We know that foo(0, 0) will cause undefined behavior because it performs division by zero. Integer division is a slow operation, and under the rules of C, it has no side effects. An optimizing compiler may choose to move the division operation earlier so that the processor can do other useful work while the division is running in the background. For example, the compiler can move the expression x / y above the first printf(), which would totally be legal. But then, the behavior is that the program would appear to crash before the sum and first printf() were executed. UB time travel is real, and that's why it's important to follow the rules, not just make conclusions based on observed behavior.

https://blog.regehr.org/archives/232

hzhou321 · 3 years ago
> Indeed. UB in C doesn't mean "and then the program goes off the rails", it means that the entire program execution was meaningless, and no part of the toolchain is obligated to give any guarantees whatsoever if the program is ever executed, from the very first instruction.

This is the greatest sin the modern compiler folks have committed in their treatment of C. The C language never says the compiler can change the code arbitrarily because of a UB statement; the behavior is merely undefined. Most UB code in C, while not fully defined, has an obvious core of semantics that everyone understands. For example, an integer overflow, while not defining what the final value should be, is still understood to be an operation that updates a value. It is definitely not, e.g., an assertion about the operands on the grounds that UB "can't happen".

Think about our natural language, which is full of undefined sentences. For example, "I'll lasso the moon for you". A compiler, which is a listener's brain, may not fully understand the sentence and it is perfectly fine to ignore the sentence. But if we interpret an undefined sentence as a license to misinterpret the entire conversation, then no one would dare to speak.

As computing goes beyond arithmetic and programs grow in complexity, I personally believe some amount of fuzziness is the key. This current narrow view from the compiler folks (which has somehow been accepted at large) is really, IMO, a setback in the evolution of computing.

LegionMammal978 · 3 years ago
> Indeed. UB in C doesn't mean "and then the program goes off the rails", it means that the entire program execution was meaningless, and no part of the toolchain is obligated to give any guarantees whatsoever if the program is ever executed, from the very first instruction.

I don't think this is exactly accurate: a program can result in UB given some input, but not result in UB given some other input. The time travel couldn't extend before the first input that makes UB inevitable.

andrewmcwatters · 3 years ago
Personally, I've found that some of the optimizations can cause undefined behavior, which is so much worse. You can write perfectly good, strict C that does not cause undefined behavior, and then one pass of optimization combined with another can CAUSE undefined behavior.

When I learned this (if it was, and still is, correct), I felt that one could be betrayed by the compiler.

saghm · 3 years ago
I recently dealt with a bit of undefined behavior (in unsafe Rust code, although the behavior here could similarly happen in C/C++) where attempting to print a value caused it to change. It's hard to overstate how jarring it is to see code that says "assert that this value isn't an error, print it, and then try to use it", have the assertion pass, but then see the value print as an error and panic when used. There's absolutely no reason why this can't happen, since "flipping bits of the value you tried to print" counts as potential UB no less than a segfault does, but it can be hard to turn off the part of your brain that is used to assuming that values can't just arbitrarily change at any point in time.

"Ignore the rest of the program and do whatever you want after a single mistake" is not a good failure mode, and it's kind of astonishing to me that people are mostly just fine with it, because they think they'll be careful enough to never make a mistake, or because the times it happened they were lucky that it didn't completely screw them over.

The only reason we use unsafe code on my team's project is because we're interfacing with C code, so it was hard not to come away from that experience thinking that it would be incredibly valuable to shrink the amount of interfacing with C as much as possible, ideally to the point where we don't need it at all.

bluecalm · 3 years ago
It's not insidious at all. The C compiler offers you a deal: "Hey, my dear programmer, we are trying to make an efficient program here. Sadly, I am not sophisticated enough to deduce a lot of things, but you can help me! Here are some of the rules: don't overflow integers, don't dereference null pointers, don't go outside of array bounds. You follow those and I will fulfill my part of making your code execute quickly".

The deal is known and fair. Just be a responsible adult about it: accept it, live with the consequences, and enjoy the efficiency gains. You can reject it, but then don't use arrays without a bounds check (a lot of libraries out there offer that), check your integer bounds or use a sanitizer, and check your pointers for null before dereferencing them; there are many tools out there to help you. Or just use another language that does all that for you.

nayuki · 3 years ago
UB was insidious to me because I was not taught the rules (this was back in years 2005 to 2012; maybe it got more attention now), it seemed my coworkers didn't know the rules and they handed me codebases with lots of existing hidden UB, and UB blew up in my face in very nasty ways that cost me a lot of debugging time and anguish.

Also, the UB instances that blew up were already tested to work correctly... on some other platform (e.g. Windows vs. Linux) or on some other compiler version. There are many things in life and computing where, when you make a mistake, you find out quickly. If you touch a hot pan, you get a burn and quickly pull away. But if you miswire an electrical connection, it could slowly come loose over a decade and start a fire behind the wall. Likewise, a wrong piece of code that seems to behave correctly at first lulls the author into a false sense of security. By the time a problem appears, the author could be gone, or unable to recall which line out of thousands written years ago caused the issue.

Three dictionary definitions for insidious, which I think are all appropriate: 1) intended to entrap or beguile 2) stealthily treacherous or deceitful 3) operating or proceeding in an inconspicuous or seemingly harmless way but actually with grave effect.

I'm neutral now with respect to UB and compilers; I understand the pros and cons of doing things this way. My current stance is to know the rules clearly and always stay within their bounds, to write code that never triggers UB to the best of my knowledge. I know that testing compiled binaries produces good evidence of correct behavior but cannot prove the nonexistence of UB.

patrick451 · 3 years ago
I don't think this is the whole story. There are certain classes of undefined behavior that some compilers actually guarantee to treat as valid code. Type punning through unions in C++ comes to mind: GCC says go ahead, the standard says UB. In cases like these, it really just seems like the standard is lazy.
lmm · 3 years ago
> The deal is known and fair.

It often isn't. C is often falsely advertised as a cross-platform assembly language that will compile to the assembly the author would expect. Some writers may be used to pre-standardization compilers that are much less hostile than modern GCC/Clang.

photochemsyn · 3 years ago
This article on undefined behavior looks pretty good (2011?)

https://blog.regehr.org/archives/213

A main point in the article is function classification, i.e. 'Type 1 Functions' are outward-facing, and subject to bad or malicious input, so require lots of input checking and verification that preconditions are met:

> "These have no restrictions on their inputs: they behave well for all possible inputs (of course, “behaving well” may include returning an error code). Generally, API-level functions and functions that deal with unsanitized data should be Type 1."

Internal utility functions that only use data already filtered through Type 1 functions are called "Type 3 Functions", i.e. they can result in UB if given bad inputs:

> "Is it OK to write functions like this, that have non-trivial preconditions? In general, for internal utility functions this is perfectly OK as long as the precondition is clearly documented."

Incidentally I found that article from the top link in this Chris Lattner post on the LLVM Project Blog, "What Every C Programmer Should Know About Undefined Behavior":

http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

In particular this bit on why internal functions (Type 3, above) shouldn't have to implement extensive preconditions (pointer dereferencing in this case):

> "To eliminate this source of undefined behavior, array accesses would have to each be range checked, and the ABI would have to be changed to make sure that range information follows around any pointers that could be subject to pointer arithmetic. This would have an extremely high cost for many numerical and other applications, as well as breaking binary compatibility with every existing C library."

Basically, the conclusion appears to be that any data input to a C program by a user, socket, file, etc. needs to go through a filtering and verification process of some kind before being handed over to internal functions (not accessible to users etc.) that don't bother with precondition testing and are designed to maximize performance.

In C++ I suppose, this is formalized with public/private/protected class members.

Waterluvian · 3 years ago
I haven’t used C or C++ for anything, but in writing a Game Boy emulator I ran into exactly that kind of memory corruption pain. An opcode I implemented wrong causes memory to corrupt, which goes unnoticed for millions of cycles or sometimes forever depending on the game. Good luck debugging that!

My lesson was: here’s a really really good case for careful unit testing.

Gigachad · 3 years ago
Yeah for that kind of stuff you want tests on every single op checking they make exactly the change you expect.
dockd · 3 years ago
I would go one step farther: the documentation will say it is undefined behavior, but the compiler doesn't have to tell you. Here's an example from the man page for sprintf:

  sprintf(buf, "%s some further text", buf);
If you miss that section of the manual, your code may work, leading you to think the behavior is defined.

Then you will have interesting arguments with other programmers about what exactly is undefined behavior, e.g. what happens for

  sprintf(buf, "%d %d", f(i), i++);

krylon · 3 years ago
I remember reading a blog post a couple of years back on undefined behavior from the perspective of someone building a compiler. The way the standard defines undefined behavior (pun not intended), a compiler writer can basically assume undefined behavior never occurs and stay compliant with the standard.

This opens the door to some optimizations, but also allows compiler writers to reduce the complexity of the compiler itself in some places.

I'm being very vague here, because I have no actual experience with compiler internals, nor that level of language-lawyer pedantry. The blog's name was "Embedded in academia", I think; you can probably still find the blog and the particular post if it sounds interesting.

Sohcahtoa82 · 3 years ago
I'll tell you what happens when someone writes:

      sprintf(buf, "%d %d", f(i), i++);
They get told to rewrite it.

halpmeh · 3 years ago
Was overwriting the stack due to undefined behavior, or was it due to a logic error, e.g. an improper bounds calculation?
ghostpepper · 3 years ago
Isn’t all UB a result of logic errors?

Writing beyond the end of allocated memory (due to incorrect bounds calculation) is an example of undefined behaviour

lolptdr · 3 years ago
As a curious FE developer with no C experience, this was very interesting. Thanks for writing the article!
unwind · 3 years ago
This looks decent, but I'm (highly) opposed to recommending `strncpy()` as a fix for `strcpy()` lacking bounds-checking. That's not what it's for; it's weird and should be considered as obsolete as `gets()`, in my opinion.

If available, it's much better to do it the `snprintf()` way, as I mentioned in a comment last week, i.e. replace `strcpy(dst, src)` with `snprintf(dst, sizeof dst, "%s", src)` and always remember that "%s" part. Never put src there, of course.

There's also `strlcpy()` on some systems, but it's not standard.
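
Sketched out as a bounds-checked copy (the return-value convention is standard snprintf):

    #include <stdio.h>

    void copy_string(const char *src) {
        char dst[16];
        int n = snprintf(dst, sizeof dst, "%s", src);
        if (n < 0 || (size_t)n >= sizeof dst) {
            /* n < 0 means an encoding error; n >= sizeof dst means src
               didn't fit (dst is still NUL-terminated on truncation) */
        }
    }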

masklinn · 3 years ago
strncpy does have its odd and rare use-case, but 100% agree that it is not at all a "fix" for strcpy; it's not designed for that purpose and is unsuited to it, being both unsafe (it does not guarantee NUL-termination) and unnecessarily costly (it fills the rest of the destination with NULs).

The strn* category was generally designed for fixed-size NUL-padded content (though not all of them, because why be coherent?), so that item in the article is incorrect, and it really makes the entire thing suspect.

AstralStorm · 3 years ago
Then there are strn*_s since C11 (and available before that on many platforms) which do exactly what you want.
mtlmtlmtlmtl · 3 years ago
>sizeof dst

Note that this only works if dst is a stack allocated (in the same function) array and not a char *
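
A quick sketch of the pitfall:

    #include <stdio.h>

    void show(char *heap_buf) {
        char local[32];
        printf("%zu\n", sizeof local);    /* 32: local really is an array */
        printf("%zu\n", sizeof heap_buf); /* the size of a char * (e.g. 8),
                                             not the allocation's size */
    }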

sedatk · 3 years ago
> and always remember that "%s" part. Never put src there, of course

> Note that this only works if dst is a stack allocated array

Even this "ideal" solution is full of pitfalls. The state of memory safety is so sad in the world of C.

unwind · 3 years ago
Yes it should be read as a placeholder for whatever you need to do.

Could be an array inside a struct too for instance, that is quite common.

tmewett · 3 years ago
Ah, that was one of my less considered additions - thank you for the feedback!
0xbadcafebee · 3 years ago
Would it be a sin to use memcpy() and leave things like input validation to a separate function? I'm nervous any time somebody takes a function with purpose X and uses it for purpose Y.
unwind · 3 years ago
Uh, isn't using `memcpy()` to copy strings doing exactly that?

The problem is that `memcpy()` doesn't know about (of course) string terminators, so you have to do a separate call to `strlen()` to figure out the length, thus visiting every character twice which of course makes no sense at all (spoken like a C programmer I guess ... since I am one).

If you already know the length due to other means, then of course it's fine to use `memcpy()` as long as you remember to include the terminator. :)

masklinn · 3 years ago
If you really need a fast strcpy then probably not, but in most situations snprintf will do the job just fine. And will prevent heartache.
saagarjha · 3 years ago
snprintf is pretty slow, partly because it returns things people typically don't want.
krylon · 3 years ago
When I first learned C - which also was my first contact with programming at all - I did not understand how pointers work, and the book I was using was not helpful at all in this department. I only "got" pointers like three or four years later; fortunately, programming was still a hobby at that point.

Funnily, when I felt confident enough to tell other people about this, several immediately started laughing and told me what a relief it was to hear they weren't the only ones with that experience.

Ah, fun times.

EDIT: One book I found invaluable when getting serious about C was "The New C Standard: A Cultural and Economic Commentary" by Derek Jones (http://knosof.co.uk/cbook). You can read it for free because the book ended up being too long for the publisher's printing presses, or something like that. It's basically a sentence-by-sentence annotated version of the C standard (C99 only, though) that tries to explain what the respective sentence means to C programmers and compiler writers, how other languages (mostly C++) deal with the issue at hand, and also how this impacts the work of someone developing coding guidelines for large teams of programmers (which was how the author made a living at the time, possibly still is). It's more than 1500 pages and a very dense read, but it is incredibly fine-grained and in-depth. Definitely not suitable for people who are just learning C, but if you have read "Expert C Programming: Deep C Secrets" and found it too shallow and whimsical, this book was written for you.

chasil · 3 years ago
Having basic experience in any assembly language makes pointers far more clear.

"Addressing modes," where a register and some constant are used to calculate the source or target of a memory operation, make the equivalence of a[b]==*(a+b) much more obvious.

I also wonder about the author's claim that a char is almost always 8 bits. The first SMP machine that ran Research UNIX was a 36-bit UNIVAC. I think it was ASCII, but the OS2200/EXEC8 SMP matured in 1964, so this was an old architecture at the time of the port.

"Any configuration supplied by Sperry, including multiprocessor ones, can run the UNIX system."

https://www.bell-labs.com/usr/dmr/www/otherports/newp.pdf

jjav · 3 years ago
> Having basic experience in any assembly language makes pointers far more clear.

That's a key point. I came to C after several years of programming in assembly and a pointer was an obvious thing. But I can see that for someone coming to C from higher level languages it might be an odd thing.

kjs3 · 3 years ago
There was an "official" C compiler for NOS running on the CDC Cyber. As I recall, 18-bit address, 60-bit words, more than one definition of a 'char' (12-bit or 5-bit, I think). It was interesting. There were a lot of strange architectures with a C compiler.

I would also point out architectures like the 8051 and 8086 made (make...they are still around) pointer arithmetic interesting.

krylon · 3 years ago
The C standard, as I recall, effectively defines a byte as at least 8 bits. I've read that some DSP platforms use a byte (and thus a char) that is 24 bits wide, because that's what their audio samples use, but supposedly those platforms rarely, if ever, handle any actual text. The header <limits.h> provides a macro CHAR_BIT that tells you how many bits a char has.

I think I remember reading about a C compiler for the PDP-10 (or Lisp Machine?), also a 36-bit machine, that used a 9-bit byte. There even exists a semi-jocular RFC for UTF-9 and UTF-18.

andrewla · 3 years ago
Pointers are by far the most insidious thing about C. The problem is that nobody who groks pointers can understand why they had trouble understanding them in the first place.

Once you understand, it seems so obvious that you cannot imagine not understanding what a pointer is, but at the beginning, trying to figure out why the compiler won't let you assign a pointer to an array, like `char str[256]; str = "asdf"`, is maddening.

One thing I think would benefit many is if we considered "arrays" in C to be an advanced topic, and focused on pointers only; treating "malloc" as a magical function until the understanding of pointers and indexing is so firmly internalized that you can just add on arrays to that knowledge. Learning arrays first and pointers second is going to break your brain because they share so much syntax, but arrays are fundamentally a very limited type.

chasil · 3 years ago
When I've had to explain it, I describe memory as a street with house numbers (which are memory addresses).

A house can store either people, or another house number (for some other address).

If you use a person as a house number, it will inflict grievous harm upon that person. If you use a house number as a person, it will blow up some random houses. Very little in the language stops you from doing this, so you have to be careful not to confuse them.

Then I describe what an MMU does with a TLB, at which point the eyes glaze over.

patrick451 · 3 years ago
From my memory, the syntax of pointers really tripped me up. E.g., the difference between * and & in declaration vs dereferencing. I think this is especially confusing for beginners when you add array declarations to the mix.
Koshkin · 3 years ago
What is so difficult about the concept of a memory address? Is it the C syntax? Asking because I personally have never struggled with this.
int_19h · 3 years ago
For this example:

   char str[256]; str = "asdf"
both str and "asdf" are not pointer-type expressions; they're both arrays (which is exposed by sizeof). The reason why this doesn't work is because C refuses to treat arrays as first-class value types - which is not an obvious thing to do regardless of how well you understand pointers or not. Other languages with arrays and pointers generally haven't made this mistake.

Sohcahtoa82 · 3 years ago
One thing that helped me understand pointers was understanding that a pointer is just a memory address.

When I was still a noob programmer, my instructor merely stuck to words like "indirection" and "dereferencing" which are all fine and dandy, but learning that a pointer is just a memory address instantly made it click.

Pointers are a $1000 topic for a $5 concept.

Koshkin · 3 years ago
Well there’s a little bit more to it. There is a type involved, and then there’s pointer arithmetic.
sramsay · 3 years ago
When I’m teaching (a very high-level language), I make a point of saying that a variable is a named memory location. Where is that location? We don’t know. Now, I am absolutely aware that the address isn’t the “real” location, but I have this idea that talking about variables in this way might help them grok the lower-level concept later on.
colonCapitalDee · 3 years ago
My experience with pointers was the inverse of yours. My first programming language was Java, and I spent many hours puzzling out reference types (and how they differed from primitive types). I only managed to understand references after somebody explained them as memory addresses (e.g. the underlying pointer implementation). When I later learned C, I found pointers to be delightfully straightforward. Unlike references in Java, pointers are totally upfront about what they really are!
krylon · 3 years ago
When I got to Java, I experienced the same problem. Much later, I learned C# and found that it apparently had observed and disarmed some of Java's traps, but it also got a little baroque in some places, e.g. with pointers, references, in/out parameters, value types, nullable types, ... A lot of the time one doesn't need it, but it is a bit of a red flag if a language has two similar concepts expressed in two similar ways but with "subtle" differences.

I did like the const vs readonly solution they came up with. I wish Go (my current goto (pun not necessarily unintentional) language) had something similar

nayuki · 3 years ago
From over a decade ago, I really enjoyed this clay animation on C pointers: https://www.youtube.com/watch?v=5VnDaHBi8dM , http://cslibrary.stanford.edu/104/
harry8 · 3 years ago
"The C Puzzle Book" is the thing I recommend to anyone who knows they want to have a good, working understanding of how to use pointers programming in C.

Many years ago I did the exercises on the bus in my head, then checked the answers to see what I got wrong and why, over the space of a week or so. It's a really, really good resource for anyone learning C. It seemed to work for several first-year students who were struggling with C in my tutorials as well, and they did great. Can't recommend it highly enough to students, and the approach to anyone tempted to write an intro C programming text.

mumblemumble · 3 years ago
I would highly recommend the video game Human Resource Machine for getting a really good understanding of how pointers work.

It's more generally about introducing assembly language programming (sort of) in gradual steps, so you'll need to play through a fair chunk of the game before you get to pointers. But by the time you get to them, they will seem like the most obvious thing in the world. You might even have spent the preceding few levels wishing you had them.

quietbritishjim · 3 years ago
> Declaring a variable or parameter of type T as const T means, roughly, that the variable cannot be modified.

I would add "... cannot be modified through that pointer". (Yes, in fairness, they did say "roughly".) For example consider the following:

    void foo(int* x, const int* y)
    {
        printf("y before: %d\n", *y);
        *x = 3;
        printf("y after: %d\n", *y);
    }
This will print two different values if you have `int i = 1` and you call `foo(&i, &i)`. This is the classic C aliasing rule. The C standard guarantees that this works even under aggressive optimisation (in fact certain optimisations are prevented by this rule), whereas the analogous Fortran wouldn't be guaranteed to work.

pafje · 3 years ago
You already know this, but I would add that under strict aliasing rules, this is only valid because x and y point to the same type.

The most common example is when y is float* and someone tries to access its bitwise representation via an int*.

(Please correct me if I'm wrong)

https://gist.github.com/shafik/848ae25ee209f698763cffee272a5...

vardump · 3 years ago
A small detail: you probably meant

  printf("y before: %d\n", *y);

quietbritishjim · 3 years ago
Oops you're right! Fixed now thanks.
pm2222 · 3 years ago
*y in both printf, right?
limaoscarjuliet · 3 years ago
I was born in '74, so I'm from the last generation to start with C and then go to other, higher-level languages like Python or JavaScript. Going in this direction was natural. I was amazed by all the magic the higher-level languages offered.

Going the other direction is a bit more difficult apparently. "What do you mean it does not do that?". Interesting perspective indeed!

user3939382 · 3 years ago
What was nice about C then was that, based on my study of CPUs at the time, you could pretty much get your head around what the CPU was doing. So you could learn the instructions (C) and the machine following them (the CPU).

When I got to modern CPUs it's so complex my eyes glazed over reading the explanation and I gave up trying to understand them.

lp251 · 3 years ago
I was born in the late 80s and C was my first language, in a community college intro to programming class.
Hadarai5 · 3 years ago
I started coding with C and OCaml in 2019. Everything in between these two was so unnatural. With JavaScript as the worst of all
untech · 3 years ago
This was my experience with learning programming as well, however, I am 2x younger :)
cbdumas · 3 years ago
Introductory programming courses at the University of Arizona were still taught in C when I was a freshman in 2008
sbaiddn · 3 years ago
I'm a decade younger and my university taught C for its intro to programming class.

Granted, it was a disaster of a programming class.

hddqsb · 3 years ago
Some constructive feedback:

> Here are the absolute essential flags you may need.

I highly recommend including `-fsanitize=address,undefined` in there (docs: https://gcc.gnu.org/onlinedocs/gcc/Instrumentation-Options.h...).

(Edit: But probably not in release builds, as @rmind points out.)

> The closest thing to a convention I know of is that some people name types like my_type_t since many standard C types are like that

Beware that names beginning with "int"/"uint" and ending with "_t" are reserved in <stdint.h>.

[Edited; I originally missed the part about "beginning with int/uint", and wrote the following incorrectly comment: "That shouldn't be recommended, because names ending with "_t" are reserved. (As of C23 they are only "potentially reserved", which means they are only reserved if an implementation actually uses the name: https://en.cppreference.com/w/c/language/identifier. Previously, defining any typedef name ending with "_t" technically invokes undefined behaviour.)"]

The post never mentions undefined behaviour, which I think is a big omission (especially for programmers coming from languages with array index checking).

> void main() {

As @vmilner mentioned, this is non-standard (reference: https://en.cppreference.com/w/c/language/main_function). The correct declaration is either `int main(void)` or the argc+argv version.

(I must confess that I am guilty of using `int main()`, which is valid in C++ but technically not in C: https://stackoverflow.com/questions/29190986/is-int-main-wit...).

> You can cast T to const T, but not vice versa.

This is inaccurate. You can implicitly convert T* to const T*, but you need to use an explicit cast to convert from const T* to T*.
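
A minimal sketch:

    void qualify(int *p) {
        const int *cp = p;   /* fine: implicit conversion adds const */
        int *q = (int *)cp;  /* going back requires an explicit cast */
        (void)q;
        /* int *r = cp;         constraint violation: discards const */
    }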

hddqsb · 3 years ago
UPDATE regarding "_t" suffix:

POSIX reserves "_t" suffix everywhere (not just for identifiers beginning with "int"/"uint" from <stdint.h>); references: https://www.gnu.org/software/libc/manual/html_node/Reserved-..., https://pubs.opengroup.org/onlinepubs/9699919799/functions/V....

So I actually stand by my original comment that the convention of using "_t" suffix shouldn't be recommended. (It's just that the reasoning is for conformance with POSIX rather than with ISO C.)

Koshkin · 3 years ago
Well, semantically, "size_t" makes sense to me ("the type of a size variable"), while "uint_t" does not ("the type of a uint variable"), because "uint" is already a type, obviously - just like "int".
nwellnhof · 3 years ago
> -fsanitize=address,undefined

In addition, I recommend -fsanitize=integer. This adds checks for unsigned integer overflow which is well-defined but almost never what you want. It also checks for truncation and sign changes in implicit conversions which can be helpful to identify bugs. This doesn't work if you pepper your code base with explicit integer casts, though, which many have considered good practice in the past.
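
For example, a sketch of the kind of well-defined but almost-always-unintended arithmetic it catches at runtime:

    #include <stddef.h>

    size_t bytes_left(size_t used, size_t capacity) {
        /* If used > capacity, this wraps around to a huge value:
           defined behavior per the standard, almost never intended,
           and flagged by -fsanitize=integer when it happens. */
        return capacity - used;
    }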

hddqsb · 3 years ago
Good one, thanks. Note that it requires Clang; GCC 12.2 doesn't have it.
Joker_vD · 3 years ago
Why the hell was "potentially reserved" introduced? How is it different from simply "reserved" in practice, except for the fact that such things can be missing? How do you even use a "potentially reserved" entity reliably? Write your own implementation for platforms where such an entity is not provided, and then conditionally not link it on the platforms where it actually is provided? Is the latter even possible?

Also, apparently, "function names [...] beginning with 'is' or 'to' followed by a lowercase letter" are reserved if <ctype.h> and/or <wctype.h> are included. So apparently I can't have a function named "touch_page()" or "issue_command()" in my code. Just lovely.

hddqsb · 3 years ago
From https://www.open-std.org/JTC1/sc22/wg14/www/docs/n2625.pdf:

> The goal of the future language and library reservations is to alert C programmers of the potential for future standards to use a given identifier as a keyword, macro, or entity with external linkage so that WG14 can add features with less fear of conflict with identifiers in user’s code. However, the mechanism by which this is accomplished is overly restrictive – it introduces unbounded runtime undefined behavior into programs using a future language/library reserved identifier despite there not being any actual conflict between the identifier chosen and the current release of the standard. ...

> Instead of making the future language/library identifiers be reserved identifiers, causing their use to be runtime unbounded undefined behavior per 7.1.3p1, we propose introducing the notion of a potentially reserved identifier to describe the future language and library identifiers (but not the other kind of reservations like __name or _Name). These potentially reserved identifiers would be an informative (rather than normative) mechanism for alerting users to the potential for the committee to use the identifiers in a future release of the standard. Once an identifier is standardized, the identifier stops being potentially reserved and becomes fully reserved (and its use would then be undefined behavior per the existing wording in C17 7.1.3p2). These potentially reserved identifiers could either be listed in Annex A/B (as appropriate), Annex J, or within a new informative annex. Additionally, it may be reasonable to add a recommended practice for implementations to provide a way for users to discover use of a potentially reserved identifier. By using an informative rather than normative restriction, the committee can continue to caution users as to future identifier usage by the standard without adding undue burden for developers targeting a specific version of the standard.

sedeki · 3 years ago
How can a (badly chosen) typedef name trigger _undefined behavior_, and not just, say, a compilation error...?

I find it difficult to imagine what that would even mean.

layer8 · 3 years ago
You can declare a type without (fully) defining it, like in

    typedef struct foo foo_t;
and then have code that (for example) works with pointers to it (foo_t *). If you include a standard header containing such a forward declaration, and also declare foo_t yourself, no compilation error might be triggered, but other translation units might use differing definitions of struct foo, leading to unpredictable behavior in the linked program.

DSMan195276 · 3 years ago
One potential issue would be that the compiler is free to assume any type with the name `foobar_t` is _the_ `foobar_t` from the standard (if one is added), it doesn't matter where that definition comes from. It may then make incorrect assumptions or optimizations based on specific logic about that type which end up breaking your code.
AstralStorm · 3 years ago
The problem being that to trigger a compile error the compiler would have to know all its reserved type names ahead of time.

It is not required to do so, hence undefined behavior. You might get a wrong underlying type under that name.

hddqsb · 3 years ago
I'm not sure, but in general having incompatible definitions for the same name is problematic.
tmewett · 3 years ago
Thank you so much! I will definitely be amending a few things. WRT no section on undefined behaviour - you're so right, how could I forget?
rmind · 3 years ago
Certainly yes, but for debug builds and tests. It can be heavyweight for production.

matthewaveryusa · 3 years ago
C spec:

>That shouldn't be recommended, because names ending with "_t" are reserved.

Also C spec naming new things:

>_Atomic _Bool

I'm glad to see the C folks have a sense of humor.

layer8 · 3 years ago
Not all reserved names are reserved for all purposes. _t is reserved only for type names (typedefs), whereas _Atomic and _Bool are keywords.
hddqsb · 3 years ago
The standard reserves several classes of identifiers, "_t" suffix [edit: with also "int"/"uint" prefix] is just one of several rules. Another rule is "All identifiers that begin with an underscore followed by a capital letter or by another underscore" (and also "All external identifiers that begin with an underscore").
AstralStorm · 3 years ago
That's only because bool was usually an old alias for int. It's defined as an alias for _Bool in stdbool.h, which is highly recommended.
nayuki · 3 years ago
> C has no environment which smooths out platform or OS differences

Not true - C has little environment, not no environment. For example, fopen("/path/file.txt", "r") is the same on Linux and Windows. For example, uint32_t is guaranteed to be 32 bits wide, unlike plain int.

> Each source file is compiled to a .o object file

Is this a convention that compilers follow, or are intermediate object files required by the C standard? Does the standard say much at all about intermediate and final binary code?

> static

This keyword is funny because in a global scope, it reduces the scope of the variable. But in a function scope, it increases the scope of the variable.

> Integers are very cursed in C. Writing correct code takes some care

Yes they very much are. https://www.nayuki.io/page/summary-of-c-cpp-integer-rules

LegionMammal978 · 3 years ago
> Is this a convention that compilers follow, or are intermediate object files required by the C standard? Does the standard say much at all about intermediate and final binary code?

The standard only says that the implementation must preprocess, translate, and link the several "preprocessing translation units" to create the final program. It doesn't say anything about how the translation units are stored on the system.

> This keyword is funny because in a global scope, it reduces the scope of the variable. But in a function scope, it increases the scope of the variable.

Not quite: in a global scope, it gives the variable internal linkage, so that other translation units can use the same name to refer to their own variables. In a block scope, it gives the variable static storage duration, but it doesn't give it any linkage. In particular, it doesn't let the program refer to the variable outside its block.
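
A minimal sketch of the two meanings:

    static int file_counter;    /* internal linkage: the name is not
                                   visible to other translation units */

    int next_id(void) {
        static int fn_counter;  /* static storage duration: the value
                                   persists across calls, but the name
                                   is still block-scoped */
        return ++fn_counter;
    }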

trap_goes_hot · 3 years ago
On Windows you can directly access UNC paths (without mounting) with fopen. You can't do this on POSIX platforms. Also, not all API boundaries are fixed width so you're going to be exposed to the ugliness of variable width types.

I think the article is correct that one must be aware of the platform and the OS when writing C code.

manv1 · 3 years ago
fopen will/should fail on windows with the unix path syntax.

The reason it's indeterminate is because some stdc lib vendors will do path translation on Windows, some won't. I believe cygwin does (because it's by definition a unix-on-windows), but I'm pretty sure the normal stdclib vendors on windows do not.

I'm almost positive that MacOS (before MacOS X) will fail with unix path separators, since path separators are ':' not '/'.

spc476 · 3 years ago
It will work on Windows, since it inherits the behavior from MS-DOS. It's the shell on Windows (or MS-DOS) where it fails, since the shell uses '/' to designate options; when MS-DOS gained subdirectories (2.0) it used '\' as the path separator in the shell. The "kernel" will accept both. There even used to be an undocumented (or under-documented) function in MS-DOS to switch the option character.