Buttons840 · 4 months ago
After many years of programming in other languages, I finally learned C, and came to realize that there aren't actually any compilers that implement all of the C spec. Even GCC and Clang have their grey areas and their bugs.

Before this, I had thought that C was a simple language — an idea propped up by articles like this, as well as the oft-touted fact that nearly every embedded system has a C compiler; no matter what, you'll always have a C compiler.

This point was driven home by part of a blog post that simply states "you can't actually parse a C header"[0]. The blog makes a good supporting case for their claim. They link to a paper that says[1]:

> There exist many commercial and academic tools that can parse C.... Unfortunately, these parsers are often either designed for an older version of the language (such as C89) or plain incorrect. The C11 parsers found in popular compilers, such as GCC and Clang, are very likely correct, but their size is in the tens of thousands of lines.

And sure enough, in the blog post linked in the OP, they state they are only implementing a subset of the language. Of course, it still has value as a teaching tool; this is just a tangential fact about C I wanted to discuss.

[0]: https://faultlore.com/blah/c-isnt-a-language/#you-cant-actua...

[1]: https://hal.science/hal-01633123/document

flohofwoe · 4 months ago
> I finally learned C, and came to realize that there aren't actually any compilers that implement all of the C spec.

I think the main reason for this is that the C spec was always just an attempt to somewhat harmonize the already-existing features of different C compilers: implementations come first, and then, after a decade or two, the C committee tries to standardize the features that have survived. That doesn't mean that all C compiler vendors immediately hop on board.

But then there's of course MSVC, which after a few hopeful years of modernizing their C frontend now seems to have abandoned it again (I wonder, though, if MSVC as a whole has been abandoned, looking at the abundance of red for MSVC in the C++23 and C++26 tables: https://en.cppreference.com/w/cpp/compiler_support.html)

While looking like a weird approach at first, it has definitely worked better than the C++ committee's approach of accepting 'ideas' into the standard without even a proof-of-concept implementation - e.g. the one good thing about C++ is that it acts as a filter for stupid ideas before they can make it into C ;)

pjmlp · 4 months ago
I think it is a side effect of the SFI (Secure Future Initiative) at Microsoft: Azure and Windows development guidelines now say to use managed safe languages or Rust, leaving C and C++ for existing code bases.

Even though Microsoft employees tend to dismiss this in Reddit discussions, the lack of resources is quite visible.

4gotunameagain · 4 months ago
The first link is full of youthful anger (and a bit of cringe), not recognising the immense technical debt C is carrying. It is a 50-year-old language written on a PDP.

For example, the author rages about things like integer sizes, while serious C programmers simply avoid the ambiguously sized types.

Sure, C has issues. A lot. But it is the cornerstone of our technological marvel. Everything around you runs C, everything in space runs C as well.

Do we have much better options for some applications nowadays ? Of course.

Will C go away ? No.

dlcarrier · 4 months ago
C has a lot of feature creep, and C++ is just C with extra feature creep.

The original C compiler ran on a PDP-11, which usually had just kilobytes of RAM. The syntax was written around compiling with such limited resources, hence the need for headers, primitives, semicolons, linkers, and so on.

It has changed a lot over time, but seems to be adding baggage, not removing it.

Joker_vD · 4 months ago
The original C compiler had no need for headers or function prototypes/forward declarations. Of course, it also was not a single-pass compiler: it had two (and a half) passes and generated assembly that would then be assembled by a two-pass assembler.
pjmlp · 4 months ago
Only for C89 versus C++98; these days, with C2y versus C++23, they are worlds apart.
1vuio0pswjnm7 · 4 months ago
"Before this, I had thought that C was a simple language."

It was a simple language. It can still be used that way

As a hobbyist I write simple programs that can be compiled with -std=c89

I use these programs every day. They are faster than their equivalents in python, smaller than their equivalents in go, and require fewer resources and dependencies to compile than their equivalents in rust

It is easy to take something simple and make it complex

Software developers do this consistently; software/language "by committee" facilitates it

Generally developers commenting publicly do not like "simple", they prefer "easy"

C89 is still useful and there are lots of things that rely on it

wredcoll · 4 months ago
C the language is simple until you actually have to do something useful with it; then you have to memorize the APIs of every library you import.
1vuio0pswjnm7 · 4 months ago
I have done something useful (to me) with "C the language"

Generally, I do not use any third party libraries besides the "standard" ones that come with a UNIX-like OS

I do not _have_ to use other third party libraries, it's a choice; I choose not to use them for personal reasons that may or may not apply to anyone else

sylware · 4 months ago
More real-life C compilers are always a good thing. But remember: to be able to build a linux kernel, you will need zillions of gcc extensions... (I have a sweet spot for the alignment attribute of stack variables in printk).

That said, "C" (C99 with benign bits of C11 required for modern hardware programming) is only the "least bad" compromise for a computer language: its syntax is already way too rich (I let you think about the beyond-sanity computer language syntaxes out there, yep... even for currently hyped ones).

For instance: C should have only one loop keyword (loop{}), finally a real hard compile-time constant definition, no integer promotion, no implicit casts (except void* and some generic number literals), probably explicit static/dynamic casts (but certainly not with the C++ syntax); sized types should be the primitives and not the other way around (s32, u32, f64, ...); no typedef/typeof/generic/etc; but it should include inline support for modern hardware architecture programming, i.e. memory barriers, atomics, and explicit unaligned memory access.

The benchmark: a small team of average devs should be able to develop a real-life compiler in a reasonable amount of time.

Ofc, the more the merrier.

1718627440 · 4 months ago
> C should have only one loop keyword loop{}

How do you implement do-while loops? Also, I find the niceties of for loops a good improvement: you are able to limit the scope of variables to the loop while still being able to use them in the condition, and it separates the step expression from the real loop body. I think idiomatic C actually overuses for, with while used really seldomly.

What else do you think is really excessive in syntax alone?

pjmlp · 4 months ago
That is a myth often spread by folks who think the K&R C book is everything there is to know, never opened the ISO C draft PDF, never learned the differences between POSIX and the standard library, never tried to make their code portable outside GCC (or nowadays Clang), and never even checked the extensions chapter of the compiler manual.
anta40 · 4 months ago
Says who? I still hear K&R being recommended as introductory material.

If you want to write portable/production-grade C code, you'll definitely need to study other references.

1718627440 · 4 months ago
> [0]: https://faultlore.com/blah/c-isnt-a-language/#you-cant-actua...

This blog post is full of misconceptions.

It starts by asserting that C defines an ABI, and then complains that everything is so complicated because actually C doesn't define it. C is defined in terms of the behaviour of the abstract C machine. As far as C is concerned, there is no ABI. C only prescribes meaning to the language; it does not prescribe how you actually implement it. The compiler is free to do as it pleases, including choosing the ABI, within some limits.

What defines the ABI is the platform, consisting of the machine and the OS. The OS here includes at least the kernel (or some bare-metal primitives) and the compiler. And the standard C library really IS part of the compiler. That's why GCC vs Clang or glibc vs musl always comes with incompatibilities: these ARE different OSes. They can choose to do things the same, but that is because of some formal (POSIX, the platform vendor) or informal (GCC and Clang) standards.

Yes, a lot of ABIs are defined with C syntax, but this is because C has terms for that and isn't too new. You can specify this in the language of your choice and it will describe the same ABI. Yes, int doesn't have a size independent of the platform. But if the specification didn't use C as a syntax, it would just write "this argument has the same size as described in the 'Appendix X Definitions' under 'Platform's default integer size'". Writing "int" is just a shorter notation for exactly this.

> You Can’t Actually Parse A C Header

I don't know why the choice to use the compiler to implement parsing a C header is framed as a bad thing. Is relying on write(2) from the kernel a bad thing, instead of trying to bypass the kernel? The compiler is what defines the meaning of a header, so why not ask it for the result? If you don't feel like reimplementing the C preprocessor, you can also just parse preprocessed headers. These are self-contained, i.e. they don't require knowing the include path. But of course this approach comes with the caveat that when the user updates the C compiler, your knowledge becomes outdated or wrong. I don't know why it is framed as weird that you need a C parser to parse C code. That is the definition of a C parser. You can't write code that parses C and is somehow not a C parser.

> 176 triples. I was originally going to include them all for the visual gag/impact but it’s literally too many for even that.

No, those are ONLY the 176 target triples (that's the LLVM term; other terms are "GNU tuple" or "GNU type") that your tool supports. There is also no definitive list; it's a syntax for describing the major components of a platform. There are decades of vendors improving their platforms in incompatible ways, so of course the description of this is messy.

See for example: https://news.ycombinator.com/item?id=43698363

And this is the test data for the source of GNU types: https://cgit.git.savannah.gnu.org/cgit/config.git/tree/tests... Note that this contains 1180 types, but of course that's not definitive either.

> pub type intmax_t = i64;

> A lot of code has completely given up on keeping C in the loop and has started hardcoding the definitions of core types. After all, they’re clearly just part of the platform’s ABI! What are they going to do, change the size of intmax_t!? That’s obviously an ABI-breaking change!

There is a reason it is called intMAX_t! It does not HAVE a definite size; it is the MAXimal integer size on that platform. Yes, there are problems nowadays due to ossification, but they come exactly from people like that blog author. When you want your program to have a stable ABI, one that doesn't change when your platform gains larger integer types, you just don't use intmax_t!

> And even then you have the x64 int problem: it’s such a fundamental type, and has been that size for so long, that countless applications may have weird undetectable assumptions about it. This is why int is 32-bit on x64 even though it was “supposed” to be 64-bit: int was 32-bit for so long that it was completely hopeless to update software to the new size even though it was a whole new architecture and target triple!

That is called ossification. When you program C you are not supposed to care about the sizes. When your program does, your program is broken/non-portable. Yes, this limits the compilers, because they don't want programs to be broken. But it is really the same as e.g. MS Windows catering to a specific program's bugs. This is not a design mistake of C:

> sometimes you make a mistake so bad that you just don’t get to undo it.

aw1621107 · 4 months ago
> I don't know why the choice to use the compiler to implement parsing a C header is framed as a bad thing.

Not sure I agree with this interpretation, though maybe I'm focusing on a different part of the article than you are. Where are you getting the negative sense from?

That being said, I don't think it's too hard to imagine why someone might be a bit hesitant to use a C/C++ compiler to parse C/C++ headers - for example, it can be a pretty big dependency to take on, may add friction for devs/users, and integration with your own tool may be awkward and/or an ongoing time sink, especially if you're crossing an FFI boundary or if the API you're using isn't stable (as I believe is the case for LLVM).

> There is a reason it is called intMAX_t! It does not HAVE a definite size, it is the MAXimal size of an integer on that platform.

I think this somewhat misses the point of the bit you quoted. In context, it's basically saying that grabbing "real" C type info for interop is so painful that people will hard-code "reasonable" assumptions instead.

> When you want your program to have a stable ABI, that doesn't change when your platform supports larger integer types, you just don't use intMAX_t!

My impression is that the problem is less intmax_t changing and more that intmax_t can change out of sync. Even if you assume every use of intmax_t in a public API corresponds to an intentional desire for the bit width to evolve over time, you can still run into nasty issues if you can't recompile everything at once (which is a pretty strong constraint on the C/C++ committees if history is any indication).

tomhow · 4 months ago
Previously:

Writing a C compiler in 500 lines of Python - https://news.ycombinator.com/item?id=37383913 - Sept 2023 (165 comments)

MarsIronPI · 4 months ago
I find it surprising that a single-pass compiler is easier to implement than a traditional lexer->parser->AST->emitter. (I'm not a compiler expert, though.) I'd have expected that generating an AST would be at least as simple, if not simpler. Plus by generating an AST, doing some simple optimization is a lot easier: one can pattern-match parts of the AST and replace them with more efficient equivalents. Maybe I'm overthinking this, though. I tend to like extensible program designs, even when they don't necessarily make sense for the scale of the program…

Still a really cool article and an impressive project, though. I especially like the StringPool technique; I'll have to keep it in mind if I ever write a compiler!

marssaxman · 4 months ago
A single-pass compiler is easier to implement in part because you're not going to do any of that optimization. You're writing a single-pass compiler either because you're banging out a quick sketch of an idea, and you don't care about production use, or because you've time-traveled back to the '70s or the '80s, where processors were so painfully slow and memory so eye-wateringly expensive that you might not even be able to read the entire source file into RAM at once, much less convert it all into some intermediate representation before starting to write out the machine code.
dlcarrier · 4 months ago
I used to write software on my HP calculator, when I was bored in classes, before smartphones existed. It used a tokenized language called RPL, and the editor would read and parse just enough tokens to fill up the screen.

I miss how fast it was, compared to modern computers. VS Code is much laggier in comparison.

kragen · 4 months ago
I think this might depend on the language you're writing in.

Historically, at least, it's pretty verbose to define a data type in Python compared to languages that are more designed for writing compilers. Consider these definitions from my prototype Bicicleta interpreter, which is written in ML, specifically OCaml:

    type methods = NoDefs
                   (* name, body, is_positional ... *)
                   | Definition of string * bicexpr * bool * methods
    and bicexpr = Name of string
                  | Call of bicexpr * string
                  | Literal of string option * methods
                  | Derivation of bicexpr * string option * methods
                  | StringConstant of string
                  | Integer of int
                  | Float of float
                  | NativeMethod of (lookup -> bicobj)
Those ten lines of code would be ten classes in Python with an average of 1.6 attributes each. Using dataclasses or attrs, that would be 36 lines of code, and then (if you're doing it the OO way) every function that I defined on one of these OCaml types becomes a method implemented in each class implementing a particular protocol, with a copy of its argument signature in every class. (If you used namedtuple instead, it's no less code, but you write it on fewer lines.) So, for example, this function on bicexprs

    let rec freevars = function
        Name n -> stringset [n]
      | Integer _ | StringConstant _ | Float _ -> stringset ["prog"]
      | NativeMethod _ -> stringset []
      | Literal (Some selfname, methods) -> 
            StringSet.diff (freevars_methods methods) (stringset [selfname])
      | Literal (None, methods) -> freevars_methods methods
      | Derivation(object_, self, methods) ->
            StringSet.union (freevars object_) (freevars (Literal(self, methods)))
      | Call(object_, _) -> freevars object_
becomes six to eight method definitions in the different classes. (You can cut it down to six if you define an abstract base class for the constant classes.) And Literal.freevars needs an if-then-else. So that's another 20 lines of code.

Python does support pattern-matching now, so functions like this might not be any more verbose than the ML version if you program them the same way instead of in the OO fashion. I haven't tried using Python pattern-matching, so I don't really know.

In general, though, Python is more verbose than ML-family languages for this kind of thing by a factor of about 2–4, and that's before you count the test code you need in Python to get the kind of confidence in correctness that ML's type-checking gives you with no extra code. To my knowledge, Mypy doesn't do the kinds of pattern-matching-exhaustiveness checks that ML compilers do.

I've sometimes "cheated" by trying to write code like this in Python using regular tuples rather than named tuples. You can definitely make it work, but it's a real pain to debug.

Quoting Andy Chu from https://andychu.net/projects/:

> Python is not the right language for [implementing] languages. I will use OCaml for subsequent projects like this.

Python does have GC and dynamic dispatch, though, and those count for a lot.

vidarh · 4 months ago
GC doesn't matter much for a simple compiler, as you either don't need to allocate much (single pass, Wirth-style compilers that generate code inline) or most of what you allocate becomes garbage all at once at the end (AST).

In my half-finished Ruby compiler prototype, even before I added type tagging (and so allocated every integer on the heap), I just didn't add a GC for a long time, because it was fine to just leak; the compiler isn't generally long-running.

kragen · 4 months ago
Eli Bendersky's blogpost on the Expression Problem: https://news.ycombinator.com/item?id=45155877
MarsIronPI · 4 months ago
My impression is that in general the traditional approach of methods as members of a class is more verbose and less extensible than the ML/Lisp generic function approach. I know I certainly prefer generic functions when I have to design polymorphic interfaces.
comonoid · 4 months ago
C was designed to be compiled with a single-pass compiler.
Joker_vD · 4 months ago
It wasn't; the original C compiler had two passes, and built expression trees in the first pass which the second pass would turn into assembly (and the original as on UNIX also had two passes).
whobre · 4 months ago
C was not designed at all. It evolved from B, by adding types to it.
arjvik · 4 months ago
Not sure if fewer LoC necessarily implies easier!
markus_zhang · 4 months ago
I think it depends on the language. I heard Turbo Pascal was pretty fast because 1) Pascal’s language features, 2) no optimization in TP 1.0 at least.
MarsIronPI · 4 months ago
Fast to run or fast to compile?
ceronman · 4 months ago
Very cool. I think Wasm is a nice instruction set, but I agree that its structured control flow is a bit weird, as is the lack of instructions for handling the memory stack. But it's much cleaner than something like x86_64.

If you are interested in learning in more detail how to write a C compiler, I highly recommend the book "Writing a C Compiler" by Nora Sandler [0]. It is a super detailed, incremental guide to writing a C compiler. It also uses the traditional architecture of multiple passes, with its own IR called Tacky, and it even includes some optimization passes such as constant folding, copy propagation, dead code elimination, register allocation, etc. The book also implements many more features, including arrays, pointers, structs/unions, static variables, floating point, strings, linking to the stdlib via the System V ABI, and much more.

[0] https://norasandler.com/book/

Liftyee · 4 months ago
This article breaks it down well enough to make me feel like I could write my own C compiler targeting AVR. (I probably could... but it would not be easy.)

Never actually looked into how compilers work before; it's surprisingly similar/related to linguistics.

measurablefunc · 4 months ago
It's b/c when Chomsky invented the theory of formal grammars he was studying natural languages & the universality of abstract grammar¹. Computer scientists realized later that they could use the same theory as a foundation for formalizing the grammatical structures of programming languages.

¹https://en.wikipedia.org/wiki/Chomsky_hierarchy

userbinator · 4 months ago
You should study C4, which is a C (subset) compiler in ~500 lines; more interestingly, it can compile itself: https://news.ycombinator.com/item?id=8558822
dekhn · 4 months ago
Similar experience in DNA/genome analysis. A large part of DNA analysis was based on parser theory.

This paper was my introduction to DNA analysis as well as Chomsky hierarchy: https://www.jstor.org/stable/29774782 (I wasn't able to find a free copy).

IIRC, pseudoknots in RNA require context-sensitive grammars to parse; context-free grammars can't express them.

lukan · 4 months ago
"compilers work before, it's surprisingly similar/related to linguistics."

Since compilers transform languages with a clearly defined grammar ... the connection to linguistics is maybe not so surprising after all.

alienbaby · 4 months ago
I thought I had learned a new word reading this, but instead I just have something that seems like it should be a word given the context it was discovered in. Perhaps that in itself should be considered cremement. A word that looks like it should be a word but isn't.
1718627440 · 4 months ago
Huh? So what is the meaning of the word???
keyle · 4 months ago
I love that graphic, so many nuggets in there, a very cute depiction of a compiler in general.
weregiraffe · 4 months ago
Now write a Python compiler in 500 lines of C.
bluGill · 4 months ago
I could probably do it - but you wouldn't like it. My dictionaries would be linked lists, so looking up a key becomes a linear search... (if you gave me C++ I'd use std::map). I'm assuming you will allow me to use the C standard library; if I have to implement strlen or malloc in that 500 lines of C, I'm not sure I can pull it off. 500 lines is aggressive, but the IOCCC gives me plenty of tricks to get the line count down, and the language isn't that big. I'm also going to assume 100% valid Python code is fed in; if there is a bug or error of any sort, that is undefined behavior.

Note that most of what makes Python great isn't the language, it's the library. I believe large parts of the Python library are themselves written in C (for speed), and thus you won't be able to use my 500-line Python compiler for anything useful, because you won't have any useful libraries.

dekhn · 4 months ago
A hash table in C is about 30 lines of code, so I don't think you have to stick to linked lists for dictionaries.
jlokier · 4 months ago
If you can search (and do other operations on) a linked list, and the lists are long enough to matter, you can trivially speed things up with a fixed-size hash table, with hardly any code changes.

This:

    entry = list_search(key, list);
becomes:

    entry = list_search(key, lists[hash(key) % (sizeof(lists)/sizeof(lists[0]))]);
This:

    list_add(key, entry, list);
becomes:

    list_add(key, entry, lists[hash(key) % (sizeof(lists)/sizeof(lists[0]))]);
etc.

A simple hash(), not high quality but good enough for non-adversarial inputs, can be a one-liner.

A good dictionary does more than this of course, especially dynamic resizing, but this simple change can radically speed up simple linked-list code when the lists are long, you don't have a hash table implementation you can use, and you want to keep the code change very small.

The same principle can be used with other things than lists too.

wyldfire · 4 months ago
A Python VM that consumes bytecode might be doable in not-ludicrous amounts of C. Not 500 lines, I suppose, but something manageable, I think? Especially if you targeted the older releases.
jonjacky · 4 months ago
In the CPython reference interpreter, that VM can be found at

https://github.com/python/cpython/blob/main/Python/ceval.c

It's 3619 lines. It's explained in this 515-line file:

https://github.com/python/cpython/blob/main/InternalDocs/

For comparison, there is a pure Python bytecode interpreter, its VM is here:

https://github.com/nedbat/byterun/blob/master/byterun/pyvm2....

It's 1043 lines.

nickpsecurity · 4 months ago
Maybe 500 lines of Pythonic, macro-heavy C. If the macros' LOC don't count. Maybe.
TZubiri · 4 months ago
Not to be that guy, but Python is an interpreted language.

That said, I guess technically you could make something that compiles Python to an executable? This is Hacker News, after all.

vidarh · 4 months ago
A language is not inherently interpreted or compiled.

Some languages are more or less easy to compile efficiently and without embedding a JIT compiler, but any language can be compiled.

For Python in particular, there are already compilers.

If you want a nightmarish language to compile, look at Ruby. There are compilers even for Ruby.

zahlman · 4 months ago
This isn't even correct for the one specific reference implementation you're presumably thinking of. It is just as much "compiled" as Java or C#, just that the compilation is often done on the fly. (Although IIRC C# does some additional trickery to pretend that its bytecode is "real" executable code, wrapped in a standard .exe format.) Presumably you've noticed the __pycache__ folders containing .pyc files; those are compiled bytecode. When you install Python, typically the standard library will all be precompiled (at least in part, so that those bytecode files can be created by an admin user now, and used by a standard user later). There is an interpretive REPL environment, but that works by doing the same bytecode-compilation each time.