Writing a C compiler in 500 lines of Python

brundolf · 3 years ago

> Instead, we'll be single-pass: code generation happens during parsing

IIRC, C was specifically designed to allow single-pass compilation, right? I.e. in many languages you don't know what needs to be output without parsing the full AST, but in C, syntax directly implies semantics. I think I remember hearing this was because early computers couldn't necessarily fit the AST for an entire code file in memory at once

speps · 3 years ago

Linked from another thread: http://cm.bell-labs.co/who/dmr/chist.html

It explains the memory limits and what happened :)

> After the TMG version of B was working, Thompson rewrote B in itself (a bootstrapping step). During development, he continually struggled against memory limitations: each language addition inflated the compiler so it could barely fit, but each rewrite taking advantage of the feature reduced its size. For example, B introduced generalized assignment operators, using x=+y to add y to x. The notation came from Algol 68 [Wijngaarden 75] via McIlroy, who had incorporated it into his version of TMG. (In B and early C, the operator was spelled =+ instead of += ; this mistake, repaired in 1976, was induced by a seductively easy way of handling the first form in B's lexical analyzer.)

bsder · 3 years ago

Infix parsing chews up a remarkable amount code and memory.

It's scary just how much easier it is to parse languages without infix parsing.

bee_rider · 3 years ago

I wonder why =+ is so obviously a mistake. It does look vaguely wrong for some reason, but I’m prejudiced by current languages.

WalterBright · 3 years ago

You're exactly right. This makes for a small, memory-efficient compiler. But this entails a lot of compromises that we're not willing to put up with anymore, because there's no longer a reason to.

vgel · 3 years ago

I'm not sure, haven't looked at the codebases of old compilers in a long time. Definitely a lot of the language is pretty amenable to it, especially if you have unstructured jumps for e.g. the for advancement statement. I had a distinct feeling while writing the compiler every time I added a new feature that "wow, the semantics work exactly how I'd like them to for ease of implementation."

Compare that to, say, Rust, which would be pretty painful to single-pass compile with all the non-local behavior around traits.

eru · 3 years ago

What you are saying is true for a naive C compiler.

Once you want to optimize or analyse, things become more complicated.

> Compare that to, say, Rust, which would be pretty painful to single-pass compile with all the non-local behavior around traits.

Type inference also spans a whole function, so you can't do it in a single pass through the code. (But it's still tamer than in eg Haskell, where type inference considers your whole program.)

Deleted Comment

kazinator · 3 years ago

In C, nothing you have not parsed yet (what it to the right) is necessary for what you've already parsed (what lies to the left). (Necessary for type checking it or translating it.)

E.g. to call a function later in the file, you need a prior declaration. Or else, an implicit one (possibly wrong) will be assumed from the call itself.

This is not true in some C++ situations, like class declarations. In a class definition there can be functions with bodies. Those are inline functions. The functons can freely refer to each other in either direction. Type checking a class declaration therefore requires all of it to be parsed.

A one pass language is advantageous even if you're building a serious multi-pass compiler with optimization. This is because that exercise doesn't require an AST! Multi-pass doesn't mean AST.

Building an AST doesn't require just more memory, but more code and development work: more complexity in the code to build the abstraction and to traverse it. It's useful if you need to manipulate or analyze the program in ways that are closely related to the source language. If the language cannot be checked in one pass, you might need one; you wouldn't want to be doing the checking on an intermediate representation, where you've lost the relationship to the code. AST building can be reused for other purposes, like code formatting, refactoring, communicating with an IDE for code completion and whatnot.

If the only thing you're going to do with an AST is walk it up and down to do some checks, and then to generate code, and you do all that in an order that could have been done without the AST (like a bottom-up, left to right traversal), then it was kind of a waste to construct it; those checks and generation could have been done as the phrase structure rules were parsed.

frutiger · 3 years ago

I once read that this is why the MSVC compiler didn't support two-pass template instantiation until very recently: the original compiler implemented templates almost like a macro that re-emitted a stream of tokens with the template parameters replaced.

pragma_x · 3 years ago

I can't say if that was a design goal, but it sure looks like it. That's also the way to avoid scaling compiler memory use to program size.

At first I thought that it wasn't possible for C. After I thought about it, as long as you disallow forward references, and rely on a single source file as input, it's possible to compile a complete C program in one pass. Anything else requires a preprocessor (e.g "#include") and/or linker (e.g. "extern" and prototypes) to solve. The implementation in the article dodges all of these and focuses on a very pure subset of C.

brundolf · 3 years ago

I think this goal may also have shifted over time. I remember when I learned C, we used c89, which required declaring all local variables at the top of the block. This seemed like a weird/arbitrary requirement at the time (and is no longer required in later versions), but it makes a lot of sense in a single-pass context! It would allow the stack frame for the current function to be fully sized before any other logic is compiled

Gibbon1 · 3 years ago

I think one thing some early compilers did was read the source serially in one pass and write the output serially in one pass. If you were doing multiple passes you had do that for each pass. That means your compiler speed is IO bound. So one pass is faster.

My cousins ex had a workflow in the late 70's that involved two floppy drives and a little dance to compile and link. Later he got a 5M hard drive which improved things a lot.

zabzonk · 3 years ago

using recursive descent, you don't need to build an ast

jjtheblunt · 3 years ago

the call stack during recursive descent is an ephemeral ast, for the recursive descent parsers I've written.

pjmlp · 3 years ago

Only if the compiler doesn't do anything beyond basic peephole optimizations.

mati365 · 3 years ago

I made similar project in TypeScript[1]. Basically multipass compiler that generates x86 assembly, compiles it to binary and runs it. The worst thing were register allocator, designing IR code and assembler.

[1] https://github.com/Mati365/ts-c-compiler

vgel · 3 years ago

Ooh, this is cool! Using WASM let me avoid writing a register allocator (though I probably would have just used the stack if I had targeted x86/ARM since I wasn't going for speed).

amedvednikov · 3 years ago

Nice project!

Joker_vD · 3 years ago

I am pretty certain the following is a valid "for"-loop translation:

    block
        ;; code for "i = 0"
        loop
            ;; code for "i < 5"
            i32.eqz
            br_if 1
        
            i32.const 1
            loop
                if
                    ;; code for "i = i + 1"
                    br 2
                else
                end
        
                ;; code for "j = j * 2 + 1"

                i32.const 0
            end
        end
    end

It doesn't require cloning the lexer so probably would still fit in 500 lines? But yeah, in normal assembly it's way easier, even in one-pass:

        ;; code for "i = 0"
    .loop_test:
        ;; code for "i < 5"
        jz  .loop_end
        jmp .loop_body
    .loop_incr:
        ;; code for "i = i + 1"
        jmp .loop_test
    .loop_body:
        ;; code for "j = j * 2 + 1"
        jmp .loop_incr
    .loop_end:

Of course, normally you'd want to re-arrange things like so:

        ;; code for "i = 0"
        jmp .loop_test
    .loop_body:
        ;; code for "j = j * 2 + 1"
    .loop_incr:
        ;; code for "i = i + 1"
    .loop_test:
        ;; code for "i < 5"
        jnz .loop_body
    .loop_end:

I propose the better loop syntax for languages with one-pass implementations, then: "for (i = 0) { j = j * 2 + 1; } (i = i + 1; i < 5);" :)

vgel · 3 years ago

Oh, interesting--I remember messing around with flags on the stack but was having issues with the WASM analyzer (it doesn't like possible inconsistencies with the number of parameters left on the stack between blocks). I think your solution might get around that, though!

tptacek · 3 years ago

A time-honored approach!

https://www.blackhat.com/presentations/win-usa-04/bh-win-04-...

(minus directly emitting opcodes, and fitting into 500 lines, of course.)

ak_111 · 3 years ago

Somewhat unrelated question, but I think one of the second most difficult things of learning C for coders who are used to scripting languages is to get your head around how the various scaler data types like short, int, long,... (and the unsigned/hex version of each) are represented and how they relate to each other and how they relate to the platform.

I am wondering if this complexity exists due to historical reasons, in other words if you were to invent C today you would just define int as always being 32, long as 64 and provide much more sane and well-defined rules on how the various datatypes relate to each other, without losing anything of what makes C a popular low-level language?

foldr · 3 years ago

>if you were to invent C today you would just define int as always being 32, long as 64 and provide much more sane and well-defined rules on how the various datatypes relate to each other, without losing anything of what makes C a popular low-level language?

You'd lose something because those decisions would be impractical for 8-bit and 16-bit targets (which still exist in the world of embedded programming).

circuit10 · 3 years ago

If you’re writing code for those why not just use the smaller data types if you don’t need bigger ones? That way it will work efficiently on both and the behaviour will be consistent

ak_111 · 3 years ago

Would it be possible to create two versions of the language: microC for embedded C programs where these decisions matter, and standard C for PC/servers which basically all use the same representation for these scalers?

My main point is that a C programmer today is forced to learn dozens of rules just to cater for many niche platforms that they will probably never target, so if you were to separate those two use cases you can get a much more neat C that targets modern 64 bit architectures with all the power of traditional C but a bit less portability.

mananaysiempre · 3 years ago

In fact, the default integer promotions together with the requirement that unsigned int hold numbers until at least 2^16 is the main reason C code for 8-bitters is somewhat... stylized, to put it mildly.

Joker_vD · 3 years ago

The int was supposed to be the native word size. So 16-bit on 286 and earlier, 32-bit on 386 and later, and 64-bit on x64. Except, of course, int has been 32 bits for so long on x86 (which has been the single most important ISA for just as long), and short was 16-bit for even longer that moving to 64-bit-wide int and 32-bit-wide short (which is what x64 naturally suited for) was just impossible, so it didn't happen, we're stuck with LP64 (on Linux) and LLP64 (on Windows) data models.

rkangel · 3 years ago

The simple version is that there are two use cases - the world where you want the size of types to match the target (e.g. int) and the world where sizes are defined by the coder (uint32_t). You want to handle both of those.

That's a nice theory and is what we've got, but it falls down in a few places.

The first is that the "int" world has got a bit munged - some platforms make some slightly strange choices for long and short and so you can't always rely on it (although int is usually pretty sensible).

The other is that when doing unsigned maths, rollover is silent so generally you really need to know the exact size at coding time so that you can ensure that rollover doesn't happen silently.

Together, these mean that you're generally just better using uint32_t (etc.) all over the place and you get more predictable results.

ricardo81 · 3 years ago

I learnt C about a decade ago (after using scriping languages 10 years prior) and just stuck with using the uint values, no second thoughts about how big a uint32_t is.

mfgs · 3 years ago

Explicit variable size is useful for limited memory space envs like microcontrollers.

PartiallyTyped · 3 years ago

If we were to write C today, we would never have such cases.

kaycebasques · 3 years ago

Is there a C compiler written in Python that aims for maximum readability rather than trying to get as much done under X lines of code?

vgel · 3 years ago

I think the code is fairly readable! It's formatted with Black (and therefore limited to reasonable line lengths) and well-commented.

IMO, being under X lines of code is part of the readability—10,000 lines of code is hard to approach no matter how readable it otherwise is.

muth02446 · 3 years ago

Not quite a C compiler but arguably better:

http://cwerg.org

WalterBright · 3 years ago

This looks a lot like the Tiny Pascal compiler that BYTE published a listing of back in 1978.

http://www.trs-80.org/tiny-pascal/

I figured out the basics of how a compiler works by going through it line by line.

vgel · 3 years ago

Oh, that's neat (funny that they skipped out on similar things to me, like GOTO and structs :-)

I didn't see a link to the source in the article, but this seems to be it: https://sourceforge.net/p/tiny-pascal/code/HEAD/tree/NorthSt...

dugmartin · 3 years ago

I think Borland’s Turbo Pascal was also a single pass compiler that emitted machine code as COM files.

kwhitefoot · 3 years ago

Surely it is a feature of all Pascal compilers that they are single pass. I thought that it was part of the specification of the language that it be possible to compile in a single pass.

bemmu · 3 years ago

It makes development so much more fun when you see the results right away.

Pressing "build" in Turbo Pascal on my 386sx it was already done before you could even perceive any delay. Instant.

andrewmcwatters · 3 years ago

Thanks for sharing this, Walter. I'm always curious where language developers get their experience from.

WalterBright · 3 years ago

Figuring out how recursive descent worked was just magical.

Deleted Comment

marcodiego · 3 years ago

It is interesting to think that 500 lines of code is something one can write in one or two days. But, writing a C compiler in 500 of comprehensible code (even in python) is challenge in itself that may take months after a few years of solid learning.

I wonder if is this a good path to becoming an extremely productive developer. If some one spends time developing projects like this, but for different areas... A kernel, a compressor, renderer, multimedia/network stack, IA/ML... Will that turn a good dev into a 0.1 Bellard?

pitherpather · 3 years ago

> 0.1 Bellard

Off topic, but a log scale might be useful: 0.1 Bellard --> -10 deciBellards. That allows for: 0.001 Bellard --> -30 deciBellards.

Problem: Programmers with negative productivity cannot be represented on the same log scale.

araes · 3 years ago

Sure they can. Abs(). Or if you prefer to not have quite so much micro-nano, you could also use Cumulative Distribution Functions [1] which are basically just sums of Probability Density [2].

Are they a 4σ programmer, 1σ programmer, -0.5σ programmer, -2σ programmer?

Plus, most people are "average" not negative productivity, and CDFs let you use really fun stuff like Beta Distributions (variable, shapable distributions) and Gamma Distributions (exponential distributions). They're super sweet as far as probability statistics.

[1] https://en.wikipedia.org/wiki/Cumulative_distribution_functi...

[2] https://en.wikipedia.org/wiki/Probability_density_function

[3] https://en.wikipedia.org/wiki/Beta_distribution

[4] https://en.wikipedia.org/wiki/Gamma_distribution

analog31 · 3 years ago

>>> Problem: Programmers with negative productivity cannot be represented on the same log scale.

This is similar to the problem of price-to-earnings ratio. The ratio goes asymptotic as earnings goes through zero. It would be better to quote earnings-to-price ratio. Another screwy reciprocal unit is miles per gallon for cars.

Barrin92 · 3 years ago

at the very least it'll remove a lot of 'magic' from programming. Today a lot of people seem to be not so fond of university education but I'm personally very glad it made me go through implementing a shell, a compiler, a little toy kernel and so on.

The feeling that you write code somewhere in the skies and have no idea how something works underneath has always really bugged me when I've used something.

chaxor · 3 years ago

You don't need a university education to do those things, just some curiosity.

The function of the university in the near future will probably just be to have like-minded curious people to discuss ideas with, and to get a better grasp of what problems need to be solved (specifically scientific ideas, rather than just applying engineering).

The prestige element (specifically of certain universities over others, perhaps not university over high school) is dwindling, and hopefully will be abolished with this new generation.

sciolist · 3 years ago

It does remind me of a project [1] Andrej Karpathy did, writing a neural network and training code in ~600 lines (although networks have easier logic to code than a compiler).

[1] https://github.com/karpathy/nanoGPT

pama · 3 years ago

This is an implementation of GPT using the pytorch library. It is not meant to be the shortest implementation of a trainable GPT, however it is very clean code. Pytorch does a lot of the heavy lifting, especially when it comes to training on multiple GPU. This implementation only works with data distributed parallel training, so one could not train models of the size of GPT-4 with it out of the box.

rewmie · 3 years ago

> But, writing a C compiler in 500 of comprehensible code (even in python) is challenge in itself that may take months after a few years of solid learning.

The people behind this project avoided that caveat by simply not implementing C. Apparently they kept a bit of the syntax but then proceeded to cherry-pick features that suited them and not make.an effort to even try to comply with any version of the standard.

fanf2 · 3 years ago

It’s a little bit bigger than Small C https://en.m.wikipedia.org/wiki/Small-C

laurencerowe · 3 years ago

As an experienced developer who did not do a compilers course at university I was able to write a SQL/JSONPath evaluator in TypeScript in a week or so. I don’t expect a minimal C compiler would be that much more complex.

Essentially all you need is a grammar, parser library and a couple of tree walkers to convert the AST first to expand macros and then convert to assembly.

A production compiler with all its optimisation steps is of course far more complex and more modern languages have many more features, but C is really pretty simple (the K&R book is concise and good!) as it was built to work on the computers of half a century ago.

59nadir · 3 years ago

You don't need a parser library; writing the tokenizer + parsing logic isn't a very time consuming endeavor and the code doesn't suffer from it either. The upside is also that you're not taking on some monstrosity of a library for something you can implement in a tenth of the lines yourself. You'll also end up with something you actually fully understand yourself as well, which should be a big plus.

marcodiego · 3 years ago

Using a library is cheating. Any one with enough knowledge can create their own programming language in two weekends using parser combinators.

iudqnolq · 3 years ago

You might like the book 500 Lines or Less, Experienced Programmers solve interesting problems

https://www.amazon.com/500-Lines-Less-Amy-Brown/dp/132987127...

Deleted Comment