Thursday, September 23, 2010

High-level constructs for low-level C: exception handling, RAII, sum types and pattern matching

There are numerous high-level abstractions available in other languages that simply make programming easier and less error prone. For instance, automatic memory management, pattern matching, exceptions, higher order functions, and so on. Each of these features enable the developer to reason about program behaviour at a higher level, and factor out common behaviour into separate but composable units.

For fun, I've create a few small macro headers that enable some of these patterns in pure C. If anyone sees any portability issues, please let me know!

libex: Exception Handling and RAII


RAII in C is definitely possible via a well-known pattern used everywhere in the Linux kernel. It's a great way to organize code, but the program logic and finalization and error logic are not syntactically apparent. You have to interpret the surrounding context to identify the error conditions, and when and how finalization is triggered.

To address this, I encapsulated this RAII pattern in a macro library called libex, with extensions to support arbitrary local exceptions, and a small set of pre-defined exception types. Currently, this just consists of more readable versions of the error codes in errno.h.

No setjmp/longjmp is used, and libex provides little beyond case checking and finalization, because I wanted to provide a zero overhead exception handling and RAII that can supplant all uses of the undecorated pattern. Replacing all instances of the RAII pattern in Linux with these macro calls would incur little to no additional overhead, as it compiles down to a small number of direct branches.

There are also some convenience macros for performing common checks, like MAYBE which checks for NULL, ERROR which checks for a non-zero value, etc.

Example:
exc_type test(int i) {
THROWS(EUnrecoverable)
TRY(char *foo) {
MAYBE(foo = (char*)malloc(123), errno);
} IN {
// ... no exception was raised, so compute something with foo
// if EUnrecoverable thrown, it will propagate to caller
if (some_condition()) THROW(EUnrecoverable)
} HANDLE CATCH (EOutOfMemory) {
// ... handle error for foo
} CATCHANY {
// ... other errors?
} FINALLY {
// ... finalize any state that has already been allocated
}
DONE
}

There are a few small restrictions required for the full exception semantics to work, so please see the main page of libex for further details.

libsum: Pattern matching and sum types, aka disjoint/tagged/discriminated unions, aka variants


Functional languages have long enjoyed the succinct and natural construction and deconstruction of data structures via sum types and pattern matching. Now you can have some of that power via a few simple macros:
/* declare a sum type and its constructor tags */
SUM(foo) {
foo_one,
foo_two,
};
/* declare each sum case */
CASE(foo, foo_one) { int i; char c; };
CASE(foo, foo_two) { double d; };

void do_bar(foo f) {
MATCH(f) {
AS(foo_one, y) printf("foo_one: %d, %c\n", y->i, y->c);
AS(foo_two, y) printf("foo_two: %d\n", y->d);
MATCHANY
fprintf(stderr, "No such case!");
exit(1);
}
}

int main(int argc, char** argv) {
foo f;
LET(f, foo_one, (3, 'g')); /* (3,'g') is an initializer */
do_bar(f);
}

There are a few small requirements and caveats, eg. LET performs dynamic memory allocation. Please see the main libsum page for further details.

License


My default license is LGPL, but since these are macro libraries that's probably not appropriate choice, given there is no binary that can be replaced at runtime (one of the requirements of the LGPL). I like the freedoms afforded by the LGPL though, so I'm open to alternate suggestions with similar terms. I will also consider the MIT license if there are no viable alternatives.

19 comments:

spy974 said...

Really interesting. Thanks

Guillermo said...

It seems a nice, lightweight exception handling system. And zero overhead... I like it!

namekuseijin said...

what a load of manure. making C superficially look like haskell means nothing if the semantics and ease of use is not there...

let is also supposed to introduce lexical bindings...

Sandro Magi said...

namekuseijin, two points:

1. The semantics of pattern matching are in fact similar to what you'd find in ML and Haskell, but limited to one-level.

2. Do you agree that the macros here enable a clearer, algebraic expression of programs in C? If so, mission accomplished.

Further, every language extension is constrained by the language's extensibility. This is a self-evident. C's limits are sometimes severe, but it's interesting that we can achieve some degree of algebraic reasoning even within these limitations.

Finally, if you can't be civil, take your vitriol elsewhere please.

robohack said...

I too think your macros are rather useless and will get in the way more in _real_ C code than they can ever help.

Calling them "libraries" is really silly too. Call them headers, name them properly, and put them in the public domain as any attempt at licensing is pointless for headers.

W.R.T. your exception handling macros, your claim is that the "program logic and finalization and error logic are not syntactically apparent". However that is only technically true. To an experienced C programmer these patterns still appear just as clearly as if they are concrete syntax, and without the major limitations of macros such as those you call "libex".

The macros you call "ilbsum" are actually obscuring real C syntax that's far more expressive for the purposes you are attempting to achieve.

The use of well known programming idioms is far more flexible for real-world C code, and real C syntax is far cleaner and clearer to an experienced C programmer.

Steve Bourne tried this kind of nonsense back in the original V7 /bin/sh code, and that code has become an exemplar of how abuse of the pre-processor ruins an otherwise decent language.

Sandro Magi said...

Have to disagree robohack. You're essentially arguing that functions impede program understanding because they just obscure the real, more flexible gotos happening under the hood.

More pervasive, principled structure is good. Using the macros ensures a certain structure of programs and data structures, particularly in the case of libsum. The fact that you have to qualify all your arguments with "experienced C developers" already demonstrates the failure of C to provide the necessary structure to begin with.

As for whether libex is actually useful, I'm ambivalent. It's a proof of concept that you can do idiomatic C cleanup by mimicking exceptions with full RAII, and without resorting to expensive setjmp/longjmp, and without having to correctly implement the pattern manually. This focus on manual control is why C programs suffer from so many vulnerabilities and bugs.

robohack said...

Ah, sorry, I didn't mean to give the impression that I was arguing against more powerful constructs.

What I'm arguing for is to use the C language as it is, and to learn to use the idioms that experienced C programmers have come to use, and I'll add that one of those "idioms" of a sort is to use good naming conventions for different subsets of identifiers.

The folly of attempting to invent syntax extensions for a language like C is that it actually obscures your code from experienced programmers _even_ if they are familiar with the constructs you are trying to emulate. This is what I was trying to show by referencing Bourne's attempt to turn C into another language using macros.

I do agree that C has its limitations and that it is more safely exploited by experienced programmers. I just don't agree that one should attempt to work around those limitations with further limiting forms of macros. The better way, IMHO, is to learn more and better idioms in the native C syntax, if indeed you must write in C for some reason.

Sandro Magi said...

I sympathize with your belief that syntactic abstraction obscures the semantics, particularly with C's leaky macros. This isn't always the case though, so the lesson to take from this is to avoid those leaks.

libsum is an example of a good, non-leaky abstraction providing safe variants/tagged unions instead of C's default unsafe unions. libex is much more leaky because it needs the delimiters to be just right for compilation to succeed.

Syntactic abstraction has it's place IMO. Reinventing a language within C is generally not a good idea, although COS is a pretty ambitious and successful attempt.

robohack said...

IMNSHO your "libsum" is the worst example of macro abuse you've given. A good C programmer will never accept that kind of obfuscation of the real C syntax.

Those who want to be coddled by languages which force such abstractions should know where to find such languages and used them instead of C. :-)

I agree most strongly that attempting to extend C syntax through any kind of macros is not a good idea.

Sandro Magi said...

If your opinion isn't humble, then you should be able to provide a compelling argument or counterexample demonstrating the failure of the abstractions provided by libsum.

robohack said...

I thought the example of v7 /bin/sh should suffice for all cases of such macro abuse. Bourne suffered some ridicule for obfuscating what's arguably one of the most important core programs in unix, and worse his code was poorly maintained as a result for a decade afterwards. (There were other problems with its internals too -- it did its own memory allocation without malloc() for one -- though these were partly all due to 64kb restrictions.) Bourne's macros were part of the inspiration for the International Obfuscated C Code Contest.

For the particular use of unions as record variants, well a simple definition of an enum type, used as the type of the tag field, with well named enum identifiers of course, and with a pure switch statement for handling the record variables based on the tag field's value and you've got some highly readable code that any C programmer can read regardless of their experience level, and it's using an easily recognizable idiom which is hard to make mistakes with. You might argue that the tag field should have been forced into C unions the way it is in Pascal, but that would prevent some extremely powerful uses of unions that C supports very well. C gives you the rope -- you choose how to tie the knots and where to sling it.

Every time you modify syntax with macros, especially the syntax used to declare variables, you make your code harder to read for everyone who doesn't know your particular brand of magic. The issues with this are more important in a static-typed language like C too.

Also, every time you hide a malloc() or similar in your macros you make it much harder for everyone else, and perhaps even yourself, to spot memory leaks. In a language which requires manual explicit memory management you want to make such actions explicit and easily visible to any reader.

A huge part of programming is maintenance. You might wish C had some additional features, but when you invent them using things like macros you, be they CPP macros or some other front-end processor, you end up with far less maintainable code, especially if you're not guaranteed to be the only pers to ever have to maintain your not-quite-C code.

Sandro Magi said...

The Bourne example merely demonstrates the problem with overriding built-in language constructs, it does not demonstrate a problem with macros in general, or macros that introduce new abstractions.

Regarding "hiding" allocations, libraries perform allocations all the time. This is standard for C libraries as long as the responsibility for freeing the memory is clearly documented. This same argument applies to macros as it does to functions.

The implementation of the libsum pattern is exactly idiomatic C for tagged unions, and yes, it's not difficult to mimic this pattern manually, but the whole point of macros is to eliminate syntactic redundancy and the associated risk of mistakes. I could eliminate a few of the macros, like LET, SUM and MATCH, and make it look like standard C by forcing the developer to manually check tags via switch, perform casts and introduce locals, but why force that repetition?

We refactor redundancy into libraries, and macros are not excepted from C's code reuse abstractions. As long as the abstractions introduced are general and well documented, and the savings are meaningful, I don't see the problem.

robohack said...

I disagree w.r.t. to Bourne's macros. I believe they are an exemplar of the mess one can make with macros -- regardless of whether one thinks one is just adapting syntax or one thinks one is introducing new abstractions as you are thinking that you are doing.

Hiding allocations in libraries might be "standard" and occur in common libraries, but again such things are excellent and very well known examples of serious design problems with said libraries. (and of course by "hiding" I mean true hiding, not examples of library routines that are designed explicitly to do the equivalent of allocating, constructing, and initializing a new object) The problem with your LET() macro is that it has a name more indicative of assignment rather than creation (yet it does not appear to be either in the way that it is used), and because it is a macro it creates mystery about what it actually does. Hiding an allocation in a library routine is bad enough, but burying it in a macro that doesn't even look superficially like an assignment to a pointer is downright dangerous.

Good clean readable and maintainable C code must avoid the use of the preprocessor as much as possible, and I don't say that as my own invention, but rather both as a parrot repeating what I've heard over and over again over many years; as well as from my own experience with struggling with code using maros such as those you've offered. (That's not to say I avoid macros religiously -- indeed I've written my fair share of them!)

What's interesting is that you defend your libsum's use of hidden dynamic allocation, but at the same time you list this aspect as some kind of caveat in your notes about it.

My point remains that a good C programmer will be required to easily and immediately recognize the common idioms that you are hiding here with macros (because your macros will never be standard or even close to common), and so any good C programmer who is not intimately familiar with your particular brand of macros will be forced to learn them just to read your code.

Don't try to hide C syntax with your attempts to introduce new syntax and abstractions to C and think that you are somehow improving the situation when what you're really doing is just adding another layer of confusion that will be unique to the code which uses your macros -- instead learn to write C idiomatically so that you will be able to more easily read everyone else's C code, and so that everyone else will be able to more easily read your C code!

Sandro Magi said...

LET is by far the most controversial of the macros. It's also the least important. Assuming you eliminate LET or replace it with something more palatable, what are your objections?

and so any good C programmer who is not intimately familiar with your particular brand of macros will be forced to learn them just to read your code.

The same is true of any abstraction and any library. I still don't see why you think this is an objection against using them.

As I said before, if you find that argument convincing, you could replace "macros" with "abstraction" or "library", and convince yourself to never use any libraries or high-level abstractions of any kind.

It's a bit of an absurd argument. What's more important is whether the semantics of an abstraction are sound and unsurprising for a given language and promote code reuse, and not whether they are provided via syntactic means. LET may indeed cross the border into surprising, but that does not discount the whole macro library, merely the one macro.

robohack said...

I don't agree that libraries are adding similar abstractions.

Certainly some classes of libraries are more challenging in this way than others.

The difference is that a library is simply a collection of more code which conceptually at least can be thought of as being written in the same language as the program calling the library, even if it is not -- it's not something that permeates some or all of the code in the entire application with entirely new abstractions at the syntactic level.

Fixing the LET() macro doesn't solve the problem of hiding the switch statement -- something that's _exactly_ on the same level as the Bourne macros. The worst part of your MATCH() and AS() macros is that they restrict the ability to code anything much more complex than a simple single statement into the actions because you've hidden two separate syntactic elements in the same macro even when they are more useful as separate syntax elements.

Calling your "enum" declaring macro SUM() (I'm assuming it declares and enum) means your macros are pretty much infinitely less flexible than the bare C syntax. A better name would help, but hiding the fact that the identifiers are members of enum sets (or whatever they really are underneath) is dangerous.

So, basically you've taken a very simple C idiom for dealing with unions and structs and hidden all the good C syntax which can make code written in that idiom so powerful, thus preventing all the more powerful expressions of that idiom.

Sandro Magi said...

Re: MATCH/AS, that switch and enums are useful individually is irrelevant. They are still available for those who need them. The point is that now we have a new abstraction that is type safe in a way that C's unions are not.

The whole point of disjoint unions is that they cannot be decomposed unsafely without type errors. MATCH provides exactly these safe semantics, and you'd have to do something undefined to break it.

Re: enum/sum and hiding as danger, you have yet to provide any concrete argument for why any of this is dangerous. I have a cohesive abstraction which you cannot break unless you perform manual casting yourself, and which the C compiler type checks for you. The advantages are clear.

The Bourne shell is again inapplicable, because not only is this merely an example of renaming existing language keywords and overriding existing language abstractions, it's also syntactically unsafe unless you get the nesting/scoping exactly right as I do in libsum.

Finally, I disagree that disjoint unions are a simple C idiom. In fact, it's quite a bit of unnecessary boilerplate to use this idiom, and quite easy to get the casting wrong when dealing with pointer-heavy data structures. Most of that complexity is eliminated by using proper disjoint unions with an appropriate scoped pattern matching construct, which is exactly what libsum provides.

As for the naming, disjoint unions are known as sums in formal literature.

Sandro Magi said...

As an addendum, I'm not surprised you find pattern matching and sums to be less powerful than C's unions and switch. You would have to program in a functional language for awhile to really grok how they help structure programs in a much cleaner fashion.

robohack said...

FYI, since it doesn't seem I was clear enough, I have always been talking about the Bourne Shell implementation and its source code, _not_ the shell language it implements.

Nathan Geffen said...

I find the dogmatic "there is only one way of doing" things approach of some of the comments here disappointing. Sandro has coded up something really interesting. Some programmers will like it and use it. Others won't. The dogmatic, "Some guy did this on the bourne shell 20,000 years ago and the programming gods have deemed it beneath us since" is entirely unhelpful.