I know there are hundreds of C compilers and that I am just reinventing the wheel. But my main goal is learning. I have already written another compiler before, but it was just for a toy language. For this reason, I would like to build a compiler for a "real-life" language, and C is an important yet small language. I want to master the whole process of building a fully working compiler for a "real-life" language and afterwards continue to build on top of this knowledge, since I have been doing research on automatic parallelisation and I am interested in optimising compilers in general.
Even if this project doesn't turn out to be useful for a lot of people, I hope at least to inspire a few people to tackle big problems.
I'm curious to know your reasoning behind using a generated parser instead of writing your own. This might be the first "just for fun" C compiler I've seen on HN or elsewhere that doesn't use some variant of recursive-descent/precedence, which is employed even by the "real" compilers like GCC, Clang, ICC, and MSVC.
The precedence-climbing algorithm is so amazingly simple and concise that I think anyone playing with compilers should implement a parser using it at least once, just to experience its astounding elegance. IMHO seeing an entire programming language's AST parsed almost entirely using a single recursive function with very little code and a table concisely describing its grammar can be quite exhilarating.
> IMHO seeing an entire programming language's AST parsed almost entirely using a single recursive function with very little code and a table concisely describing its grammar can be quite exhilarating.
I recently implemented a toy interpreter for a small language which uses a hand-written, table-driven shift-reduce parser. The parser simply maintains a stack of terminals and non-terminals and applies the corresponding reduction if a suffix of the stack matches a rule in the table. Precedence rules are resolved through an additional "should_shift" function, keeping the grammar very understandable. I think it's a very elegant algorithm: https://github.com/bbu/simple-interpreter
The code base looks nice and easy to navigate. I didn't see any license mentioned - I'd suggest stating one clearly for those considering building something on top of it, or contributing.
Automatic parallelisation is a bad idea because the overhead caused by synchronisation is bigger than the gain. It's also a bad idea to take away control from the programmer. What if every compiler did automatic parallelisation differently? Then you'd have impossible-to-fix performance problems.
> Automatic parallelisation is a bad idea because the overhead caused by synchronisation is bigger than the gain.
A sufficiently intelligent compiler would be able to heuristically determine if the synchronization overhead for a program segment would be greater than the performance gain. Plus, there are lots of little cases where synchronization overhead is trivial, like map().
C isn't the best language for this, but there's plenty of languages (e.g. Haskell) that have data flows that can be thoroughly statically analyzed for dependencies.
> It's also a bad idea to take away control from the programmer.
Ideally you'd have a compiler pragma to disable this on a case-by-case basis, and you'd still have more traditional parallel programming tools available.
> What if every compiler did automatic parallelisation differently? Then you'd have impossible-to-fix performance problems.
If you had a bad enough performance regression, you'd disable automatic parallelization and file a compiler bug (because it shouldn't regress below single-core performance). You could make this same argument against any optimizing compiler. They all apply different optimizations, and do sometimes lead to "impossible-to-fix" performance problems or even contradictory execution behaviors between implementations. But I still think `gcc -O3` is invaluable.
I don't think automatic parallelization is a cure-all, or even that it should be enabled by default on compilers any time soon, but it's better to have it as an option than not.
I will always upvote something like this, but I am wondering: why write it in C++? I didn't really read the whole thing, but I skimmed a bit, and didn't see a lot of advantage taken of the higher-level facilities of C++. It seems like you've picked a route that gives you all the disadvantages of using C, while simultaneously destroying the possibility of it ever being self-hosting.
Not everyone wants to write the most idiomatic C++, so writing variably C-ish code in C++ isn't all that uncommon nowadays. If you're a C programmer but you notice C++ has a convenient feature (or features) you might find useful in a pinch, hell, maybe it's worth biting the bullet (though that's not to say using C++ in lieu of C doesn't have its drawbacks).
I would completely agree with you if this weren't a C compiler project. In this case, the advantage of having a self-hosting compiler is huge! Among other things, you can use the compiler source itself to test the compiler, which is pretty cool.
I have worked on projects that were essentially "C with classes," and it was fine. I'm not saying every C++ project needs to go all out using every feature from C++11, but I didn't even see much beyond use of the string class here. It seems not worth giving up self-hosting to me.
> while simultaneously destroying the possibility of it ever being self-hosting.
...at least until it begins turning into a C++ compiler ;-)
That makes me wonder, given that there seem to be quite a lot of "toy" C/C-subset compilers being written these days, just how much harder C++ would be. For a long time it was commonly thought that even the simplest C compiler would be extremely complex and difficult, but that notion appears to have been somewhat defeated, so the next logical step in that direction is C++ or maybe C++-to-C (like Cfront) compilers. The closest I'm aware of would be the C++-subset used in TempleOS, which is self-hosting.
Horribly. C++ has so many features, and interactions between them, that you really don't want to be dragged into that as a toy project. I'd wager that the C++ language spec is longer than the source of a functional C compiler (both without the standard libraries).
Having a self-hosting compiler gives you a very handy tool for assessing the gains of a speed optimisation: if the speed improvement of the compiler compensates for the more complex compilation algorithm, then the optimisation is justified.
Can you please add a free software license to your project, as it's currently proprietary? I'd recommend GPL, but go with whatever you want.
Contrary to popular opinion, copyright applies automatically to all of your works (unless you are an employee of the US Government). Due to the draconian nature of copyright laws, people have very few practical freedoms with your work unless you use a free software license (GPL, MIT, Apache, etc).
EDIT: Actually, looking through your GitHub profile, it looks like most of your projects don't have free software licenses. Could you please rectify this? It's clear you're doing cool work, but it's not free software (or even "open source").
The computer science / computer programming problem I'd like to see solved is keeping projects "fresh" and open/accessible enough that people like this could feel they were learning in an unencumbered way while contributing something useful to an existing project, all while pushing the capabilities of what available open source projects can provide.
"Reinventing the wheel" projects absolutely litter public source nodes; believe me, I know why people do it; but my dream is the dream of software that most of us have given up on, code reuse, "reentrancy", shared libraries, etc.
Maybe something like a "wikipedia of source code".
I'm not discounting the benefit of doing a project to learn about it; what I'm saying is, too bad it's not code that will be useful for anything else without a lot more work, and too bad the work is going into something that is not reusable.
Any nontrivial project requires lots of time just to understand its design. Even minimal contributions will likely require comparable amounts of work to completing a toy project. And as any professional programmer knows, reading other people's code is a lot less fun than writing your own.
It's like you're saying "it's a problem", and I'm saying "that's the problem I'd like to see solved".
As an example: what they teach us in school, and what large projects like NASA's have to do, is to first agree on a specification for the interfaces, then write code to the interface, then iron out the kinks. Working on a project like that (and the bigger the project, the sooner), we discover that there are many local wins to be had if only we could change the interface we agreed on, because "we didn't know enough when we agreed", etc.
As a thought-experiment example of what I'm saying: if a real live compiler project were written to clean specs (even if the specs came after the code), then there'd be a lexer, a parser, etc., and for a little homebrew project like this one, you could write your own lexer from scratch, testing it all the while against the rest of a functioning compiler. Probably you would not finish it, because you would learn in a series of "aha" moments what "the hard parts" are and how they are solved.
So you could abandon your own piece, but at the same time you would be now equipped to contribute to the real project.
Or you could move on to working on the parser... lather, rinse, repeat.
No need to tell me all "the reasons that doesn't work"... I know the reasons, and it's useful to identify the laundry list of them, but the part I'm interested in is the attitude that "hey, this is worth solving" and "hey, this could be solved..."
Seems like the ETH work on Oberon etc did that. The students kept learning by improving on or porting both the compilers and OS's. A2 Bluebottle is a fast, usable OS as a result. Barely usable given no ecosystem and few features. Usable, though, in a safe, simple, GC-ed language. :)
Far as C compilers go, would this one previously posted fit your requirements?
http://c9x.me/compile/
https://news.ycombinator.com/item?id=11555527
It seems to be quite similar to what you describe. Designed to be compact, easy to understand, maybe easy to extend, and useful in practice. I remember liking it more than most of these submissions for those attributes.
Well that's an interesting idea. What sort of programming language would let us build a package repo that could survive a Wikipedia-style edit war?
Wikipedia works because articles are mostly independent. There are links, but if they're broken it's not fatal. There are also a fair number of stub articles and a bureaucracy around what counts as a notable subject.
In practice, projects have owners, and they're looking for help, but not just anyone's help.
> What sort of programming language would let us build a package repo that could survive a Wikipedia-style edit war?
Really, the programming language isn't important; it's the package management and repository system that matters for that. A repository that keeps history, plus a package/dependency management system that lets you specify the particular version of a dependency, would seem to suffice to address the particular problem you relate.
More detailed question, please? I'm happy to discuss, but I'm not sure whether you're looking for bottom-up nitty-gritty details or more top-down grounded philosophizing.
Where does one go for a fully representative sample of all kinds of valid C code to test a project like this? Or -- given a formal grammar, has someone produced a tool that generates representative code?
I give a +1 to the Csmith recommendation. They used it to tear up all kinds of compilers, with practical bugs found as a result. They even found a few in CompCert, despite its formal proofs, due to specification errors. The middle-end came out flawless, though.
So, such results speak for themselves. I think Csmith has to be the baseline. Not sure if it's easy to tune for the equivalent of basic, acceptance tests, though. If not, then it could be nice to have a long list of tests that each correspond to specific, C-language features to help a compiler writer gradually implement one.
Ha, finally an opportunity to show this off! My pet parser project has a built-in "unparser" to turn a parser (specified as a formal grammar) into a randomized test-case generator: https://github.com/Hardmath123/nearley#the-unparser
> Even if this project doesn't turn out to be useful for a lot of people, I hope at least to inspire a few people to tackle big problems.
If you are doing this for learning, then I'd recommend studying C4's parser (https://news.ycombinator.com/item?id=8558822) and this article:
https://www.engr.mun.ca/~theo/Misc/exp_parsing.htm
Still, a fun project, and you have my upvote. :)
http://akkartik.name/about
I recently wrote this (in a very similar thread to this one) about the tension between real-world and teaching software: https://www.reddit.com/r/Compilers/comments/4jmb88/open_sour...
> a built-in "unparser" to turn a parser (specified as a formal grammar) into a randomized test-case generator
It sounds a lot like a fuzzer, although only for testing success cases.
> Where does one go for a fully representative sample of all kinds of valid C code to test a project like this?
https://github.com/gcc-mirror/gcc/tree/master/gcc/testsuite
Not sure about the latter, but you could probably accomplish something similar without too much work using a fuzzer.
https://github.com/melling/ComputerLanguages/blob/master/com...
They probably deserve their own section.