I know there are hundreds of C compilers and that I am just reinventing the wheel. But my main goal is learning. I have already written another compiler before, but it was just for a toy language. For this reason, I would like to build a compiler for a "real-life" language, and C is an important yet small language. I want to master the whole process of building a fully working compiler for a "real-life" language and afterwards continue to build on top of this knowledge, since I have been doing research on automatic parallelisation and I am interested in optimising compilers in general.
Even if this project doesn't turn out to be useful for a lot of people, I hope at least to inspire a few people to tackle big problems.
I'm curious to know your reasoning behind using a generated parser instead of writing your own. This might be the first "just for fun" C compiler I've seen on HN or elsewhere that doesn't use some variant of recursive-descent/precedence, which is employed even by the "real" compilers like GCC, Clang, ICC, and MSVC.
The precedence-climbing algorithm is so amazingly simple and concise that I think anyone playing with compilers should implement a parser using it at least once, just to experience its astounding elegance. IMHO seeing an entire programming language's AST parsed almost entirely using a single recursive function with very little code and a table concisely describing its grammar can be quite exhilarating.
> IMHO seeing an entire programming language's AST parsed almost entirely using a single recursive function with very little code and a table concisely describing its grammar can be quite exhilarating.
I recently implemented a toy interpreter for a small language which uses a hand-written, table-driven shift-reduce parser. The parser simply maintains a stack of terminals and non-terminals and applies the corresponding reduction if a suffix of the stack matches a rule in the table. Precedence rules are resolved through an additional "should_shift" function, keeping the grammar very understandable. I think it's a very elegant algorithm: https://github.com/bbu/simple-interpreter
The code base looks nice and easy to navigate. I didn't see any license mentioned - I'd suggest stating one clearly for those considering building something on top of it, or contributing.
Automatic parallelisation is a bad idea because the overhead caused by synchronisation is bigger than the gain. It's also a bad idea to take away control from the programmer. What if every compiler did automatic parallelisation differently? Then you'd have impossible-to-fix performance problems.
> Automatic parallelisation is a bad idea because the overhead caused by synchronisation is bigger than the gain.
A sufficiently intelligent compiler would be able to heuristically determine if the synchronization overhead for a program segment would be greater than the performance gain. Plus, there are lots of little cases where synchronization overhead is trivial, like map().
C isn't the best language for this, but there's plenty of languages (e.g. Haskell) that have data flows that can be thoroughly statically analyzed for dependencies.
> It's also a bad idea to take away control from the programmer.
Ideally you'd have a compiler pragma to disable this on a case-by-case basis, and you'd still have more traditional parallel programming tools available.
> What if every compiler did automatic parallelisation differently? Then you'd have impossible-to-fix performance problems.
If you had a bad enough performance regression, you'd disable automatic parallelization and file a compiler bug (because it shouldn't regress below single-core performance). You could make this same argument against any optimizing compiler. They all apply different optimizations, and do sometimes lead to "impossible-to-fix" performance problems or even contradictory execution behaviors between implementations. But I still think `gcc -O3` is invaluable.
I don't think automatic parallelization is a cure-all, or even that it should be enabled by default on compilers any time soon, but it's better to have it as an option than not.
I will always upvote something like this, but I am wondering: why write it in C++? I didn't really read the whole thing, but I skimmed a bit, and didn't see a lot of advantage taken of the higher-level facilities of C++. It seems like you've picked a route that gives you all the disadvantages of using C, while simultaneously destroying the possibility of it ever being self-hosting.
Not everyone wants to write the most idiomatic C++, so writing variably C-ish code in C++ isn't all that uncommon nowadays. If you're a C programmer but you notice C++ has a convenient feature (or features) you might find useful in a pinch, hell, maybe it's worth biting the bullet (though that's not to say using C++ in lieu of C doesn't have its drawbacks).
I would completely agree with you if this weren't a C compiler project. In this case, the advantage of having a self-hosting compiler is huge! Among other things, you can use the compiler source itself to test the compiler, which is pretty cool.
I have worked on projects that were essentially "C with classes," and it was fine. I'm not saying every C++ project needs to go all out using every feature from C++11, but I didn't even see much beyond use of the string class here. It seems not worth giving up self-hosting to me.
> while simultaneously destroying the possibility of it ever being self-hosting.
...at least until it begins turning into a C++ compiler ;-)
That makes me wonder, given that there seem to be quite a lot of "toy" C/C-subset compilers being written these days, just how much harder C++ would be. For a long time it was commonly thought that even the simplest C compiler would be extremely complex and difficult, but that notion appears to have been somewhat defeated, so the next logical step in that direction is C++ or maybe C++-to-C (like Cfront) compilers. The closest I'm aware of would be the C++-subset used in TempleOS, which is self-hosting.
Horribly. C++ has so many features, and interactions between them, that you really don't want to be dragged into that as a toy project. I'd wager that the C++ language spec is longer than the source of a functional C compiler (both without the standard libraries).
Having a self-hosting compiler gives you a very handy tool for assessing the gains of a speed optimisation: if the speed improvement of the compiler compensates for the more complex compilation algorithm, then the optimisation is justified.
Can you please add a free software license to your project, as it's currently proprietary? I'd recommend GPL, but go with whatever you want.
Contrary to popular opinion, copyright applies automatically to all of your works (unless you are an employee of the US Government). Due to the draconian nature of copyright laws, people have very few practical freedoms with your work unless you use a free software license (GPL, MIT, Apache, etc).
EDIT: Actually, looking through your GitHub profile, it looks like most of your projects don't have free software licenses. Could you please rectify this? It's clear you're doing cool work, but it's not free software (or even "open source").
The computer science / computer programming problem I'd like to see solved is keeping projects "fresh" and open/accessible enough that people like this could feel they were learning in an unencumbered way while contributing something useful to an existing project, all while pushing the capabilities of what available open source projects can provide.
"Reinventing the wheel" projects absolutely litter public source nodes; believe me, I know why people do it; but my dream is the dream of software that most of us have given up on, code reuse, "reentrancy", shared libraries, etc.
Maybe something like a "wikipedia of source code".
I'm not discounting the benefit of doing a project to learn about it; what I'm saying is, too bad it's not code that will be useful for anything else without a lot more work, and too bad the work is going into something that is not reusable.
Any nontrivial project requires lots of time just to understand its design. Even minimal contributions will likely require comparable amounts of work to completing a toy project. And as any professional programmer knows, reading other people's code is a lot less fun than writing your own.
It's like you're saying "it's a problem", and I'm saying "that's the problem I'd like to see solved".
As an example: what they teach us in school, and what large projects like NASA's have to do, is to first agree on a specification for the interfaces, then write code to the interface, then iron out the kinks. Working on a project like that (and the bigger the project, the sooner), we discover that there are many local wins to be had if only we could change the interface we agreed on, because "we didn't know enough when we agreed", etc.
As a thought-experiment example of what I'm saying: if a real live compiler project were written to clean specs (even if the specs came after the code), then there'd be a lexer, a parser, etc., and for a little homebrew project like this one, you could write your own lexer from scratch, testing it all the while against the rest of a functioning compiler. Probably you would not finish it, because you would learn in a series of "aha" moments what "the hard parts" are and how they are solved.
So you could abandon your own piece, but at the same time you would be now equipped to contribute to the real project.
Or you could move on to working on the parser... lather, rinse, repeat.
No need to tell me all "the reasons that doesn't work"... I know the reasons, and it's useful to identify the laundry list of them, but the part I'm interested in is the attitude that "hey, this is worth solving" and "hey, this could be solved..."
Seems like the ETH work on Oberon etc did that. The students kept learning by improving on or porting both the compilers and OS's. A2 Bluebottle is a fast, usable OS as a result. Barely usable given no ecosystem and few features. Usable, though, in a safe, simple, GC-ed language. :)
Far as C compilers go, would this one previously posted fit your requirements?
http://c9x.me/compile/
https://news.ycombinator.com/item?id=11555527
It seems to be quite similar to what you describe. Designed to be compact, easy to understand, maybe easy to extend, and useful in practice. I remember liking it more than most of these submissions for those attributes.
Well that's an interesting idea. What sort of programming language would let us build a package repo that could survive a Wikipedia-style edit war?
Wikipedia works because articles are mostly independent. There are links, but if they're broken it's not fatal. There are also a fair number of stub articles and a bureaucracy around what counts as a notable subject.
In practice, projects have owners, and they're looking for help, but not just anyone's help.
> What sort of programming language would let us build a package repo that could survive a Wikipedia-style edit war?
Really, the programming language isn't important; it's the package management and repository system that matters for that. A repository that keeps history, plus a package/dependency management system that lets you specify the particular version of a dependency, would seem to suffice to address the particular problem you relate.
More detailed question, please? I'm happy to discuss, but I'm not sure whether you're looking for bottom-up nitty-gritty details or more top-down grounded philosophizing.
Where does one go for a fully representative sample of all kinds of valid C code to test a project like this? Or -- given a formal grammar, has someone produced a tool that generates representative code?
I give a +1 to the Csmith recommendation. They used it to tear up all kinds of compilers, with practical bugs found as a result. They even found a few in CompCert, despite its formal proofs, due to specification errors. The middle-end came out flawless, though.
So, such results speak for themselves. I think Csmith has to be the baseline. Not sure if it's easy to tune for the equivalent of basic, acceptance tests, though. If not, then it could be nice to have a long list of tests that each correspond to specific, C-language features to help a compiler writer gradually implement one.
Ha, finally an opportunity to show this off! My pet parser project has a built-in "unparser" to turn a parser (specified as a formal grammar) into a randomized test-case generator: https://github.com/Hardmath123/nearley#the-unparser
> Even if this project doesn't turn out to be useful for a lot of people, I hope at least to inspire a few people to tackle big problems.
If you are doing this for learning, then I'd recommend studying C4's parser (https://news.ycombinator.com/item?id=8558822) and this article:
https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm
Still, a fun project, and you have my upvote. :)
http://akkartik.name/about
I recently wrote this (in a very similar thread to this one) about the tension between real-world and teaching software: https://www.reddit.com/r/Compilers/comments/4jmb88/open_sour...
> a built-in "unparser" to turn a parser (specified as a formal grammar) into a randomized test-case generator
It sounds a lot like a fuzzer, although only for testing success cases.
> Where does one go for a fully representative sample of all kinds of valid C code to test a project like this?
https://github.com/gcc-mirror/gcc/tree/master/gcc/testsuite
Not sure about the latter, but you could probably accomplish something similar without too much work using a fuzzer.
https://github.com/melling/ComputerLanguages/blob/master/com...
They probably deserve their own section.