Efficient Curried Application

I just wanted to jot down some thoughts on compiling eval/apply curried application using an efficient register-based calling convention. This translation is actually quite simple once you get the trick. Leroy's Zinc abstract machine showed the way: evaluate arguments right-to-left instead of the typical left-to-right. The ZAM wrote these arguments to a stack, but in our case we want them to go to the register file first.

So we evaluate arguments right-to-left and write them into registers left-to-right, ie. argN goes in reg0, argN-1 goes to reg 1, ... arg0 goes to regN. At some threshold, we'll run out of registers. If your program is in direct-style with a stack, this threshold is often defined by the callee-save registers of the native calling convention. If your program is in CPS form, then you can use the entire register file.

Once you run out of registers, you have to spill the remaining arguments to the stack. However, the stack is simply a register pointing to some base address from which all other values are accessed by an offset from that base. But so is the closure record! If you lay out your closure record in memory so it has the same layout as the stack frame would, then you can just treat it like a small stack. This can save many memory access operations moving data between closures and registers.

Suppose our register spill threshold is 2 registers (reg0 and reg1), with a third register left for the remaining arguments (reg2), and consider the following C-style function:

int foo(int x, void *y, int z);

Here we have three arguments, one of which will need to be spilled to the stack since we can only allocate two when calling foo. Arguments z and y will go in registers, while argument x will go on the stack. However, the compilation of foo will require a certain layout in memory for optimal efficiency as well (assuming 32-bit pointers):

foo_papp2:
  loadi reg0, reg2 // reg0 = *reg2
  sub1 reg2, 4     // reg2 += sizeof(int)
foo_papp1:
  loadi reg1, reg2 // reg1 = *reg2
  sub1 reg2, 4     // reg2 += sizeof(void*)
foo:
  ...              // foo: reg0=arg2, reg1=arg1, reg2=&struct{arg0}

Hopefully the pseudo-assembly is clear. Function foo gets "prologues", one per argument which is used for partial applications. Each prologue loads an argument from the "stack"/closure record into the registers and decrements the pointer so it points to the next argument. By the time all the registers are used up, you should have exactly the "stack frame" that you need left, and the registers-passed arguments are all in their proper place.

For instance, when you partially apply foo to 1 argument, the function pointer value of the closure record is actually pointing to foo_papp1, and when you partially apply with two arguments, it's pointing to foo_papp2. This pattern is the same for any number of arguments, and the code will all have the same size for each argument, which we will need later when building partially applied closures.

Now consider the following curried applications of foo:

int(void*,int) foo_1 = foo(0);
int(int) foo_2 = foo(0, somePointer);

Hopefully the pseudo C-style closure syntax is clear, ie. int(char) is a closure taking a char and returning an int. The above are two different partial applications, foo_1 which applies only the first argument, and foo_2 which partially applies the first and second arguments. These are what the closures will look like in memory:

foo_1 →

function = &foo_papp1

arity = 2

applied = 1

x = 0

foo_2 →

function = &foo_papp2

arity = 1

applied = 2

y = somePointer

x = 0

Now when you try to apply values foo_1 and foo_2 to the remaining arguments, this actually performs a dynamic check to see how many arguments are required. For instance, fully applying a closure with the signature int(int, int) elaborates to the following pseudo-code:

obj_t returnValue;
RETRY:
switch (closure→arity) {
case 0:
  closure = closure→function();
  goto RETRY;
case 2:
  returnValue = closure→function(arg0, arg1);
  // reg0=arg1, reg1=arg0, reg2=&closure->arity + 1
  break;
case 1:
  closure = closure→function(arg0);
  returnValue = closure→function(arg1);
  break;
}

You have to dynamically check if you're applying all remaining arguments. I handle the case of arity=0 to show that intermediate closures do exist, but you don't typically need that specific case.

When you're performing a partial application, you also have to check the arity and build intermediate closures referencing the proper function address, which changes depending on the arity. Given a closure and performing a partial application, you have to make adjustments to point to the new function pointer. So given a 3 argument closure to which you're applying n < 3 arguments:

closure nclo;
obj_t returnValue;
if (closure→arity == closure→applied + n) {
  returnValue = closure->function(arg0, ..., argn);
} else {
  nclo = malloc(2 * sizeof(void*) + closure→arity - closure→applied + n * sizeof(void*));
  nclo→arity = closure→arity - n;
  nclo→applied = closure→applied + n;
  nclo→function = closure→function - SIZE_OF_FUNCTION_PROLOGUE;  // compute offset of new entry point
  nclo→arg0 = arg0;
  ...
  nclo→argn = argn;
  memcpy(&nclo->argn + sizeof(void*), &closure→applied + sizeof(void*), closure→applied * sizeof(void*));
}

The most general case to handle is when you're in fully polymorphic code for which you cannot know if applying the closure you have is a partial application. This is a combination of the last two elaborations, but still requires only a simple switch statement to handle all cases.

Now the most boring case: when a function is not a closure and it's being fully applied with all arguments at a call site: well in this case you fill the registers as usual, then you push the remaining arguments on the real program stack, and pass the pointer to that activation frame as the "stack" argument. Simple!

This is only a high-level overview for myself, so no doubt I've skipped some important details, or gotten them wrong while jotting these notes down. Please let me know if you spot something!

Summary

Hopefully that didn't ramble so much that the core idea was obscured. The core idea is that all closures are functions taking a fixed number of arguments defined by the number of caller-save registers. These arguments are passed in registers, and a last register is reserved for a "stack frame" holding the remaining arguments. The closure record can act as this stack frame, so you don't have to copy values between records and the stack on closure application.

Furthermore, each function has a "prologue" generated for each argument laid out prior to the actual function being executed in memory. These prologues are used when applying partially applied functions, and they load any previously applied arguments from the closure record into registers, assuming all the reserved registers aren't already used.

This is basically an implementation of the eval/apply strategy, with a few tweaks to reduce the amount of redundant code generated and reuse as much memory as possible. I don't think there's anything particularly novel, but I haven't seen all the details laid out like this (although I may of course have ommitted some details I consider "obvious", even if they aren't to others). If you have any further suggestions, I'd love to hear it!

Limitations

I don't discuss any of the complexities surrounding unboxed types of non-pointer size, which considerably complicates calling conventions. Most functional languages require a universal representation for this reason, usually of pointer size, which is what I assume here. See my series on Garbage Collection Representations to get a feel for how these values translate to machine code, and how this would impact closure representations.

Easy Reverse Mode Automatic Differentiation in C#

Continuing from my last post on implementing forward-mode automatic differentiation (AD) using C# operator overloading , this is just a quick follow-up showing how easy reverse mode is to achieve, and why it's important. Why Reverse Mode Automatic Differentiation? As explained in the last post, the vector representation of forward-mode AD can compute the derivatives of all parameter simultaneously, but it does so with considerable space cost: each operation creates a vector computing the derivative of each parameter. So N parameters with M operations would allocation O(N*M) space. It turns out, this is unnecessary! Reverse mode AD allocates only O(N+M) space to compute the derivatives of N parameters across M operations. In general, forward mode AD is best suited to differentiating functions of type: R → R N That is, functions of 1 parameter that compute multiple outputs. Reverse mode AD is suited to the dual scenario: R N → R That is, functions of many parameters t...

Higher Logics

Search This Blog