[ planet-factor ]

Phil Dawes: Idea for a global interpreter lock optimized for low contention

Slava is thinking about adding a global interpreter lock to Factor, much like the Python GIL, as a step on the path to a full multithreaded VM. This would allow Factor code to run while blocking FFI calls (e.g. C library calls) execute. As part of this, each FFI call would need to release the lock before calling into C code and re-acquire it before returning to Factor.

One of the problems with adding a GIL is that it penalizes the common single-threaded case. This got me thinking about how a mutex lock could be implemented that is optimized for the low-contention case, so as to minimize the performance impact on single-threaded apps. Here's the best approach I've come up with so far:

You need:
  1. a spinlock (implemented inline via cpu primitives)
  2. a boolean to represent the GIL
  3. an int to represent contention on the GIL
  4. an OS semaphore (win32 semaphore or pthread condition)

The goal is that, when there is no contention, FFI calls use only inline assembler primitives to acquire/release the spinlock and flip the GIL; only under contention do they fall back on the additional overhead of library calls to an OS semaphore.

Acquiring the Lock

The idea is that if there's no contention then acquiring the lock is just a case of obtaining the spinlock and flipping the GIL boolean.

(code is half-baked pseudocode, sorry!)

- acquire spinlock
- if GIL == false
   - GIL=true     // acquire GIL
   - release spinlock
   DONE!
- else:
   - increment contention-counter
   - release spinlock
   - goto LOOP

LOOP:                 // contention counter has been incremented by this thread
- acquire spinlock
- if GIL == false
   - GIL=true     // acquire GIL
   - decrement contention-counter
   - release spinlock
   DONE!
- else
   - release spinlock
   - wait on semaphore
   - goto LOOP

Releasing the Lock

- acquire spinlock
- release GIL
- read contention counter
- release spinlock
- if contention counter is non-zero:
     notify semaphore
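
Here's a rough C sketch of the same scheme, just to make the idea concrete. This is illustrative only, not Factor VM code: the inline spinlock is approximated with a pthread mutex, a POSIX semaphore plays the role of the OS semaphore, and all the names are made up.

#include <pthread.h>
#include <semaphore.h>
#include <stdbool.h>

static pthread_mutex_t spin = PTHREAD_MUTEX_INITIALIZER; /* stand-in for the inline spinlock */
static bool gil = false;    /* the GIL flag */
static int contention = 0;  /* number of threads waiting for the GIL */
static sem_t wakeup;        /* the OS semaphore; sem_init(&wakeup, 0, 0) at startup */

void gil_acquire(void)
{
    pthread_mutex_lock(&spin);
    if (!gil) {                       /* fast path: no contention */
        gil = true;
        pthread_mutex_unlock(&spin);
        return;
    }
    contention++;                     /* slow path: register as a waiter */
    pthread_mutex_unlock(&spin);
    for (;;) {
        pthread_mutex_lock(&spin);
        if (!gil) {
            gil = true;
            contention--;
            pthread_mutex_unlock(&spin);
            return;
        }
        pthread_mutex_unlock(&spin);
        sem_wait(&wakeup);            /* sleep until a releasing thread posts */
    }
}

void gil_release(void)
{
    pthread_mutex_lock(&spin);
    gil = false;
    int waiters = contention;
    pthread_mutex_unlock(&spin);
    if (waiters != 0)
        sem_post(&wakeup);            /* wake a waiter */
}

One nice property of using a semaphore rather than a bare condition variable here is that posts are remembered: a waiter that increments the contention counter and then loses the race with a release still returns from sem_wait immediately.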

This is just an idea at this stage. There was no working wifi on my train home today, so I haven't done any decent research on this, and I haven't done any empirical testing yet. Also, I'm not a multithreading expert, so if there's a glaring error in the idea or in the implementation then I'd be very pleased if somebody could point it out - thanks!

Wed, 21 Oct 2009 21:01:00

Joe Groff: Hot-rodding Factor with typed functions, SIMD, and data-map

Factor's gained a number of features and libraries recently to help it deal efficiently with large sets of binary data.

Typed functions

Factor's compiler can take advantage of all the type information at its disposal to eliminate type checks and dynamic dispatch. However, since Factor is a dynamic language, very little type information usually survives past a function call; values can't walk across the function's borders as free integers or floats but must be smuggled through the call site inside hidden compartments of vans labeled object. Factor has some existing features to avoid this: For small functions, you can tear down the wall by inline-ing the function at its call site. However, inlining is inefficient for large functions that get used in multiple places. If you like to play fast and loose, the Factor compiler has some hooks to let you make unchecked assertions about the types of values in your code, but these hooks are unsafe, and no one should use them or know about them. Forget I said anything.

Given these inadequate choices, I wrote a typed library which wraps the aforementioned compiler hooks in a safe and easy-to-use syntax. You can define a TYPED: function like so:

USING: typed math ;
IN: scratchpad

TYPED: add-floats ( a: float b: float -- c: float ) + ;

The add-floats function will check its inputs before executing, converting them to floats if it can and throwing an error otherwise:

( scratchpad ) 1 2+1/2 add-floats .
3.5
( scratchpad ) "lol" "wut" add-floats .
Generic word >float does not define a method for the string class.
Dispatching on object: "lol"

When add-floats is called from another function, its type checks are inlined into the caller, so if the compiler can see that add-floats' inputs are going to be floats, it can eliminate the checks. The actual body of the function is compiled under a hidden name with the unsafe type declarations applied. After the call to the hidden typed function, the output value types are declared in the calling function, allowing the compiler to further optimize the caller's code. We can see the compiler in action using its optimized. diagnostic function:

( scratchpad ) USE: compiler.tree.debugger
( scratchpad ) [ 2.0 3.0 add-floats 5.0 * ] optimized.
[ 2.0 3.0 ( typed add-floats ) 5.0 float* ]

The compiler knows that 2.0 and 3.0 are floats, so it doesn't emit code to check them before calling ( typed add-floats ) (the hidden function that implements add-floats with type specializations). The compiler likewise knows that the result of add-floats will be a float, so multiplying it with 5.0 can be done with the specialized float* function instead of the generic * operator.

One cute thing to note is that typed is a completely self-contained Factor library. Nothing needed to change in the Factor VM or compiler to make it work.

SIMD support

Slava and I worked together to add support for hardware vector types to the language and compiler. Applying Factor's library of vector operations to objects of special fixed-size array types such as float-4 or short-8 now generates code that uses the SSE machine instructions of modern Intel processors. Together with first-class struct and binary array objects and typed functions, it's possible to write reasonably-performing vector code in a high-level style. For example, the new math.matrices.simd library implements fast 4x4 matrix math using code like this:

STRUCT: matrix4
    { columns float-4[4] } ;

: columns ( a -- a1 a2 a3 a4 )
    columns>> 4 firstn ; inline

TYPED:: m4.v ( m: matrix4 v: float-4 -- v': float-4 )
    m columns :> m4 :> m3 :> m2 :> m1
    
    v first  m1 n*v
    v second m2 n*v v+
    v third  m3 n*v v+
    v fourth m4 n*v v+ ;

We can load math.matrices.simd and look at the compiler's code generation for m4.v with the test-mr. function:

( scratchpad ) USING: math.matrices.simd typed.debugger ;
( scratchpad ) \ m4.v typed-test-mr.
=== word: ( typed m4.v ), label: ( typed m4.v )

_label 0 
_prologue T{ stack-frame { total-size 32 } } 
_label 1 
##gc RAX RCX 32 { } { } f 
##peek RAX D 1 
##peek RCX D 0 
##unbox-any-c-ptr RBX RAX RDX 
##alien-vector XMM0 RBX 0 float-4-rep 
##alien-vector XMM1 RBX 16 float-4-rep 
##alien-vector XMM2 RBX 32 float-4-rep 
##alien-vector XMM3 RBX 48 float-4-rep 
##alien-vector XMM4 RCX 10 float-4-rep 
##shuffle-vector-imm XMM5 XMM4 { 0 0 0 0 } float-4-rep 
##mul-vector XMM5 XMM5 XMM0 float-4-rep 
##shuffle-vector-imm XMM0 XMM4 { 1 1 1 1 } float-4-rep 
##mul-vector XMM0 XMM0 XMM1 float-4-rep 
##add-vector XMM5 XMM5 XMM0 float-4-rep 
##shuffle-vector-imm XMM0 XMM4 { 2 2 2 2 } float-4-rep 
##mul-vector XMM0 XMM0 XMM2 float-4-rep 
##add-vector XMM5 XMM5 XMM0 float-4-rep 
##shuffle-vector-imm XMM4 XMM4 { 3 3 3 3 } float-4-rep 
##mul-vector XMM4 XMM4 XMM3 float-4-rep 
##add-vector XMM5 XMM5 XMM4 float-4-rep 
##inc-d -1 
##allot RAX 32 byte-array RCX 
##load-immediate RCX 128 
##set-slot-imm RCX RAX 1 6 
##set-alien-vector RAX 10 XMM5 float-4-rep 
##replace RAX D 0 
_label 2 
_epilogue T{ stack-frame { total-size 32 } } 
##return 
_spill-area-size 0 

(typed-test-mr. is a version of test-mr. that examines the secret function for a TYPED: definition.) The preamble and postamble code are a bit bulkier than they would be for a similar C function, setting up GC roots at the beginning and allocating a new memory block at the end, but the heart of the function is about as good as it gets: load the four matrix columns and the vector into registers (##alien-vector), broadcast each element of the vector in turn (##shuffle-vector-imm), multiply it against the corresponding matrix column (##mul-vector), sum the results (##add-vector), and finally store the result in the newly allotted memory block (##set-alien-vector). The vector values fly around in vector registers until it's time for the result to land in its final memory location.

Doug Coleman has also used Factor's SIMD support to implement the SIMD Fast Mersenne Twister algorithm in the random.sfmt library. He purports to get performance within a fraction of a second of an equivalent C++ implementation.

data-map

4x4 matrices and random number generators are nice, but the bread and butter purpose of SIMD is to accelerate the processing of large amounts of data. Slava plans to add some auto-vectorization support so that operations like v+ on packed binary arrays use SIMD operations. But as a general solution, auto-vectorization gets tough for both the compiler and developer to deal with. Even in mature vectorizing compilers, if you plié when the compiler expects you to jeté, the compiler will throw up its hands and vomit scalar code in disgust.

As an alternative to auto-vectorization in Factor, I've been experimenting with a macro to make it easy to write explicitly vectorized code, which I've given the working title data-map. It can take objects of any of Factor's packed binary array types as inputs and map over their contents from any C type to any other C type. You can also grab input values in groups and map them to groups of output values. This makes it easy to express tricky vector operations that would be tough or impossible to trick a C compiler into auto-vectorizing. Here's an example that packs an array of floating-point pixel values into byte-sized pixel values:

USING: alien.c-types alien.data.map generalizations kernel
math.vectors math.vectors.conversion math.vectors.simd
specialized-arrays typed ;
SIMDS: float int short uchar ;
SPECIALIZED-ARRAYS: float float-4 uchar-16 ;
IN: scratchpad

TYPED: float-pixels>byte-pixels ( floats: float-array -- bytes: byte-array )
    [
        [ 255.0 v*n float-4 int-4 vconvert ] 4 napply
        [ int-4 short-8 vconvert ] 2bi@
        short-8 uchar-16 vconvert
    ] data-map( float-4[4] -- uchar-16 ) ;

The above code grabs four float-4 vectors at a time from the input array and crams them into one uchar-16 vector, first scaling the four inputs from 0.0–1.0 to 0–255 (255.0 v*n float-4 int-4 vconvert), then packing both pairs of int vectors into two short vectors (int-4 short-8 vconvert), and finally packing the two short vectors into a single uchar vector. The machine instruction that the vconvert operation maps to handles saturating values below 0 and above 255 for us.
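
For comparison, here's what the same conversion looks like as a scalar C sketch (illustrative only; the names are made up, and the Factor version above processes sixteen pixels per iteration instead of one):

#include <stddef.h>
#include <stdint.h>

void float_pixels_to_byte_pixels(const float *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int v = (int)(in[i] * 255.0f);  /* scale 0.0-1.0 up to 0-255 */
        if (v < 0)   v = 0;             /* saturate, which the vconvert */
        if (v > 255) v = 255;           /* instruction does for free    */
        out[i] = (uint8_t)v;
    }
}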

data-map can also iterate over multiple input arrays in parallel. Here's another pixel-pushing example that folds planar R, G, B, and A data into a single RGBA image:

USING: alien.c-types alien.data.map kernel locals
math.vectors math.vectors.simd specialized-arrays typed ;
SIMDS: uchar ;
SPECIALIZED-ARRAYS: uchar-16 ;
IN: scratchpad

:: vmerge-transpose ( a b c d -- ac bd ac bd )
    a c (vmerge) b d (vmerge) ; inline

TYPED: RGBA-planes>RGBA (
    r: byte-array
    g: byte-array
    b: byte-array
    a: byte-array
    --
    rgba: byte-array
)
    [ vmerge-transpose vmerge-transpose ]
    data-map( uchar-16 uchar-16 uchar-16 uchar-16 -- uchar-16[4] ) ;

The (vmerge) function maps to SIMD instructions that interleave the contents of two vectors. By applying it twice to our four input vectors we get four interleaved vectors we can store to the destination buffer.
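
For reference, the scalar form of this conversion is just an interleaving loop (again an illustrative C sketch with made-up names); the vectorized version above produces sixty-four interleaved bytes, or sixteen pixels, per iteration using a handful of merge instructions:

#include <stddef.h>
#include <stdint.h>

void rgba_planes_to_rgba(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                         const uint8_t *a, uint8_t *rgba, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        rgba[4*i + 0] = r[i];   /* one pixel from each plane */
        rgba[4*i + 1] = g[i];
        rgba[4*i + 2] = b[i];
        rgba[4*i + 3] = a[i];
    }
}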

data-map still has some shortcomings. All the above examples assume that the input array's length is an exact multiple of the input grouping. data-map does nothing to deal with the tail end; currently, you'd need to find and handle it yourself. data-map also doesn't offer a solution for iterating over potentially misaligned subsequences of packed arrays, where you would also want to handle the unaligned head of the incoming slice separately before passing the aligned part to the vectorized main loop. However, as is, it's been useful enough to accelerate the terrain generation demo from needing a five-second warmup to starting nearly instantly. As I try to apply it to more complex problems, I'll be tweaking the design to make it more general.

✻   ✼   ✻

Of course, Factor is still primarily a dynamic language, and it does nothing to stop you from slipping a type-opaque function call into your inner loop and kicking its performance down from C++ to Python levels. It still takes some arcane knowledge of the compiler's abilities to get it to generate fast code. But having typed function definitions, a rich set of object types that map to CPU types, and control flow constructs that work with typed data makes Factor's fast path a lot easier to access.

Tue, 20 Oct 2009 04:00:44

Joe Groff: Goodbye iPhone app store

Screw the iPhone and its gated-community development environment; working on Factor is more rewarding. My iPhone developer account expires on October 22, and with it goes Footsie. I've put the source code up for posterity on my code page.

Mon, 19 Oct 2009 04:06:23

Elie Chaftari: System-wide Factor

I like the Factor listener and its powerful help system. You also get a debugger, a code walker, and much more with it. It's also a two-way tool if you use it with a supported editor. I personally use TextMate with the latest Factor TextMate bundle.

However, I often use the terminal and like to have Factor handy system-wide. If I want to quickly test an idea, I like to start the terminal with Ctrl-Tab "T" (Quicksilver) and be able to just fire up Factor. Instead of entering the full path to factor or cd'ing to its directory and running ./factor, I prefer calling it just like any other system command, by issuing a factor command in the terminal.

Before - './YourPathToFactor/factor/factor'

This can easily be done by creating a small wrapper script (not a soft link, because factor is not a self-contained executable) named factor that runs the factor executable in your Factor main folder. My Factor main folder is on my Desktop; change the path to Factor's main folder accordingly in the following file. Don't forget to chmod 755 the file to make it executable.

cat ~/bin/factor

#!/bin/sh
exec rlwrap "/Users/elie/Desktop/factor/factor" "$@"

The location of the above file should be somewhere on your PATH. I have a personal bin folder under my home directory (~/bin) for that use. I also added rlwrap to the exec command for readline history and reverse-i-search (ctrl-r). I previously installed rlwrap with MacPorts. Here's my .bashrc file. You can see that I added ~/bin to my $PATH.

cat ~/.bashrc

export PATH="/opt/local/bin:/opt/local/sbin:~/bin:$PATH"

After - it's as easy as 'factor'

You can also turn 'Auto Use' on (also available via the 'Auto Use' button in the listener) with the com-auto-use word. Beyond running Factor scripts, you can also have in-place vocab scaffolding by adding your current "." folder to the vocab roots with:

vocab-roots [ "." suffix ] change

The above line could be permanently added to your .factor-rc file, but it currently disrupts the help system in the listener with a Permission denied (13) error (probably due to searching on the "." path without the appropriate privileges).

Now suppose you change directory to your Desktop (cd Desktop/) and want to start experimenting with a new vocab there before moving it later to the work or extra directory. You only need to enter the following to get started:

"." "my-vocab" scaffold-vocab

Sample session:
Elie:~ elie$ cd Desktop/
Elie:Desktop elie$ factor
Loading /Users/elie/.factor-rc
( scratchpad ) com-auto-use
1: com-auto-use
^
No word named ``com-auto-use'' found in current vocabulary search path

The following restarts are available:

:1 Use the ui.tools.listener vocabulary
:2 Defer word in current vocabulary

Type :help for debugging help.
( scratchpad ) :1
1: Note:
Added "ui.tools.listener" vocabulary to search path
( scratchpad - auto ) vocab-roots [ "." suffix ] change
( scratchpad - auto ) "." "my-vocab" scaffold-vocab
1: Note:
Added "tools.scaffold" vocabulary to search path
Creating scaffolding for P" my-vocab/my-vocab.factor"
Creating scaffolding for P" my-vocab/authors.txt"
Loading my-vocab/my-vocab.factor
( scratchpad - auto )

Fri, 16 Oct 2009 11:27:00

Slava Pestov: Improved write barriers in Factor's garbage collector

The Factor compiler has been evolving very quickly lately; it has been almost completely rewritten several times in the last couple of years. The garbage collector, on the other hand, hasn't seen nearly as much action. The last time I did any work on it was May 2008, and before that, May 2007. Now more than a year later, I've devoted a week or so to working on it. The result is a cleaner implementation, and improved performance.

Code restructuring

I re-organized the garbage collector code to be more extensible and easier to maintain. I did this by splitting off a bunch of the garbage collector methods from the factor_vm class into their own set of classes. I made extensive use of template metaprogramming to help structure code in a natural way. Many people dislike C++, primarily because of templates, but I don't feel that way at all. Templates are my favorite C++ feature, and if it wasn't for templates C++ would just be a shitty object-oriented dialect of C.

First up is the collector template class, defined in collector.hpp:
template<typename TargetGeneration, typename Policy> struct collector
This class has two template parameters:
  • TargetGeneration - this is the generation that the collector will be copying objects to. A generation is any class that implements the allot() method.
  • Policy - this is a class that simulates a higher-order function. It implements a should_copy_p() method that tells the collector if a given object should be promoted to the target generation, or left alone.
On its own, the collector class can't do any garbage collection; it just implements methods which trace GC roots: trace_contexts() (traces active stacks), trace_roots() (traces global roots), and trace_handle() (traces one pointer).

Next up is the copying_collector template class, defined in copying_collector.hpp:
template<typename TargetGeneration, typename Policy> struct copying_collector
This class has the same two template parameters as collector; the target generation must define one additional method, next_object_after(). This is used when scanning newly copied objects. This class implements logic for scanning marked cards, as well as Cheney's algorithm for copying garbage collection.

Then, there are four subclasses, one implementing each type of GC pass. Each subclasses copying_collector and specializes the two template arguments. For example, let's take a look at the declaration of the nursery collector:
struct nursery_collector : copying_collector<aging_space,nursery_policy>
This subclass specializes its superclass to copy objects to aging space, using the following policy class:
struct nursery_policy {
    factor_vm *myvm;

    nursery_policy(factor_vm *myvm_) : myvm(myvm_) {}

    bool should_copy_p(object *untagged)
    {
        return myvm->nursery.contains_p(untagged);
    }
};

That is, any object that is in the nursery will be copied to aging space by the nursery_collector. Other collector subclasses are similar.

This code all uses templates, rather than virtual methods, so every collector pass will have a specialized code path generated for it. This gives higher performance, with cleaner code, than would be possible in C. The old garbage collector was a tangle of conditionals, C functions, and global state.

Partial array card marking

When a pointer to an object in the nursery is stored into a container in aging or tenured space, the container object must be added to a "remembered set" so that on the next minor collection it can be scanned and its elements considered as GC roots.

Old way

Storing a pointer into an object marked the card containing the object's header. On a minor GC, all marked cards would be scanned, and every object whose header was contained in a marked card would be scanned in its entirety.

Problem

Storing a pointer into an array would require the entire array to be scanned on the next minor collection. This is bad if the array is large. Consider an algorithm that stores successive elements into an array on every iteration, and also performs enough work per iteration to trigger a nursery collection. Now every nursery collection -- and hence every iteration of the loop -- scans the entire array. We're doing a quadratic amount of work for what should be a linear-time algorithm.

New way

Storing a pointer into an object now marks the card containing the slot that was mutated. On a minor GC, all marked cards are scanned. Every object in every marked card is inspected, but only the subrange of slots that fall inside the card is scanned. This greatly reduces the burden placed on the GC by mutation of large arrays. The implementation is tricky; I need to spend some time thinking about and simplifying the code, as it stands the card scanning routine has three nested loops and two usages of goto!

Implementation

copying_collector.hpp, trace_cards()

New object promotion policy

When aging space is being collected, objects contained in marked cards in tenured space must be traced.

Old way

These cards would be scanned, but could not be unmarked, since the objects they refer to were copied to the other aging semi-space, and would need to be traced on the next aging collection.

The problem

The old way was bad because these cards would remain marked for a long time, and would be re-scanned on every collection. Furthermore the objects they reference would likely live on for a long time, since they're referenced from a tenured object, and would needlessly bounce back and forth between the two aging semi-spaces.

New way

Instead, an aging collection proceeds in two phases: the first phase promotes aging space objects referenced from tenured space to tenured space, unmarking all marked cards. The second phase copies all reachable objects from aging space to the second aging semi-space. This promotes objects likely to live for a long time all the way to tenured space, and scans fewer cards on an aging collection, since more cards can get unmarked.

Implementation

aging_collector.cpp

Faster code heap remembered set

If a code block references objects in the nursery, the code block needs to be updated after a nursery collection. This is because the machine code of compiled words directly refers to objects; there's no indirection through a literal table at runtime. This improves performance but increases garbage collector complexity.

Old way

When a new code block was allocated, a global flag would be set. A flag would also be set in the code block's header. On the next nursery collection, the entire code heap would be scanned, and any code blocks with this flag set in them would have their literals traced by the garbage collector.

New way

The problem with the old approach is that adding a single code block entails a traversal of the entire code heap on the next minor GC, which is bad for cache. While most code does not allocate in the code heap, the one major exception is the compiler itself. When loading source code, a significant portion of running time was spent scanning the code heap during minor collections. Now, the code blocks containing literals which refer to the nursery and aging spaces are stored in a pair of STL sets. On a nursery or aging collection, these sets are traversed and the code blocks they contain are traced. These sets are typically very small, and in fact empty most of the time.

Implementation

code_heap.cpp, write_barrier()
copying_collector.hpp, trace_code_heap_roots()

Faster card marking and object allocation

The compiler's code generator now generates tighter code for common GC-related operations too. A write barrier looks like this in pseudo-C:
cards[(obj - data_heap_start) >> card_bits] = card_mark_mask;
Writing out the pointer arithmetic by hand, we get:
*(cards + ((obj - data_heap_start) >> card_bits)) = card_mark_mask;
Re-arranging some operations,
*((obj >> card_bits) + (cards - (data_heap_start >> card_bits))) = card_mark_mask;
Now, the entire expression
(cards - (data_heap_start >> card_bits))
is a constant. Factor stores this in a VM-global variable, named cards_offset. The value used to be loaded from the global variable every time a write barrier would execute. Now, its value is inlined directly into machine code. This requires code heap updates if the data heap grows, since then either the data heap start or the card array base pointer might change. However, the upside is that it eliminates several instructions from the write barrier. Here is a sample generated write barrier sequence; only 5 instructions on x86.32:
0x08e3ae84: lea    (%edx,%eax,1),%ebp
0x08e3ae87: shr $0x8,%ebp
0x08e3ae8a: movb $0xc0,0x20a08000(%ebp)
0x08e3ae91: shr $0xa,%ebp
0x08e3ae94: movb $0xc0,0x8de784(%ebp)

Object allocations had a slight inefficiency; the code generated to compute the effective address of the nursery allocation pointer did too much arithmetic. Adding support to the VM for embedding offsets of VM global variables directly into machine code saved one instruction from every object allocation. Here is some generated machine code to box a floating point number; only 6 instructions on x86.32 (of course Factor does float unboxing to make your code even faster):
0x08664a33: mov    $0x802604,%ecx
0x08664a38: mov (%ecx),%eax
0x08664a3a: movl $0x18,(%eax)
0x08664a40: or $0x3,%eax
0x08664a43: addl $0x10,(%ecx)
0x08664a46: movsd %xmm0,0x5(%eax)

Implementation

cpu/x86/x86.factor: %write-barrier, %allot
cpu/ppc/ppc.factor: %write-barrier, %allot

Performance comparison


I compared the performance of a Mac OS X x86-32 build from October 5th with the latest sources as of today.

Bootstrap time saw a nice boost, going from 363 seconds, down to 332 seconds.

The time to load and compile all libraries in the source tree (load-all) was reduced from 537 seconds to 426 seconds.

Here is a microbenchmark demonstrating the faster card marking in a very dramatic way:
: bench ( -- seq ) 10000000 [ >bignum 1 + ] map ;

The best time out of 5 iterations on the old build was 12.9 seconds. Now, it has gone down to 1.9 seconds.

Thu, 15 Oct 2009 11:16:00

Phil Dawes: Adding atomic CAS instruction support to Factor's compiler

I said in the last post that I'd write a bit about adding a new machine instruction to the factor compiler once I'd got round to actually doing it, so here it is:

If you've been following my blog you'll know that I wanted to utilise multiple cpu cores for a personal database project. Unfortunately Factor doesn't have an OS-threaded runtime yet, so in order to work around this I modified the Factor VM code to allow multiple VMs to run simultaneously on separate threads.

I'm now writing a concurrent queue so that messages can be passed internally between the running vms. To implement my queue I wanted some fast atomic primitives so I set about adding a Compare-and-swap (CAS) instruction to Factor's compiler.

To implement CAS on x86 you basically need the CMPXCHG and LOCK instructions, so my first job was to get these added to Factor's x86 assembler DSL. I located the machine-code opcodes in the Intel manuals and added the CMPXCHG and LOCK instructions to the x86 assembler vocabulary thus:

basis/cpu/x86/assembler/assembler.factor:

: CMPXCHG ( dst src -- ) { HEX: 0f HEX: B1 } 2-operand ;

: LOCK ( -- ) HEX: f0 , ;

With the new x86 instructions in place I was all set to add a new low-level IR 'virtual' instruction to factor's compiler. There are basically 3 steps to this:

  1. Declare the new low-level IR instruction along with the number and types of registers it requires
  2. Tell the code generator where to dispatch the calls to generate the 'real' cpu machine code for the instruction
  3. Write methods that emit both X86 and PPC versions of the machine code for the instruction.
I'll explain each step below:

Step one: Declaring the new instruction

Thanks to the recent addition of a low-level instruction DSL, adding a new instruction to the compiler backend is just a case of declaring the instruction name and the argument registers it requires:

basis/compiler/cfg/instructions/instructions.factor:

INSN: ##atomic-compare-exchange
      def: dst/int-rep
      use: ptr/int-rep old/int-rep new/int-rep
      temp: temp/int-rep ;

##atomic-compare-exchange is my new virtual CAS instruction that receives 3 input arguments: a pointer to a word in memory, an expected 'old' value and a new value.

The implementation of ##atomic-compare-exchange will do the following: compare the value pointed to by the ptr register to the value in old and, if it's equal, replace it with the value in new; otherwise leave it as it is. Finally, put the resulting value in the destination register dst. In case you haven't guessed, this is pretty much exactly what CMPXCHG does on x86.
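
In plain C the operation looks roughly like this (an illustrative sketch only: it ignores the atomicity that the real lock-prefixed CMPXCHG provides, and the names are made up):

#include <stdint.h>

intptr_t compare_exchange(intptr_t *ptr, intptr_t old, intptr_t new_value)
{
    intptr_t prior = *ptr;     /* read the word in memory */
    if (prior == old)
        *ptr = new_value;      /* swap only if it held the expected value */
    return prior;              /* this is the value that ends up in dst */
}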

At compile time Factor's register allocator allocates the real (e.g. x86) cpu registers and passes them to our code generator.

Unfortunately, as an added complication in this particular case, the x86 instruction CMPXCHG uses the EAX/RAX register as an implicit argument, and Factor's code generation doesn't support tying particular cpu registers to parameters yet (though Slava assures me it will soon). To work around this I'm asking the compiler to pass me an extra 'temp' register so I can use it if any of the others happens to be EAX/RAX.

With the low-level instruction declared I now want to get some interactive testing going, so I write a high level word called 'compare-swap' and a compiler intrinsic which uses the ##atomic-compare-exchange instruction. (See the previous post for an overview of Factor compiler intrinsics.) The point of this is that we can dump out and check the low level IR instructions emitted by the compiler. Here's the compare-swap word and compiler intrinsic:

: compare-swap ( ptr old new -- ? )
    2drop "compare-swap needs to be compiled" throw ;

: emit-compare-swap ( node -- )
    drop
    3inputs
    ^^atomic-compare-exchange
    ds-push ;

\ compare-swap [ emit-compare-swap ] "intrinsic" set-word-prop

Now we can use the 'test-mr.' debugger word to see the new instruction in action. We'll just pass in a bunch of junk arguments for now so we can see what the low-level MR instructions look like in context:

( scratchpad ) USE: compiler.cfg.debugger
( scratchpad ) [ ALIEN: 20 1 2 compare-swap ] test-mr.
=== word: ( gensym ), label: ( gensym )

_label 0 
_label 1 
##load-reference RAX ALIEN: 20 
##load-immediate RCX 8                            ! 1 << 3
##load-immediate RDX 16                           ! 2 << 3
##atomic-compare-exchange RAX RAX RCX RDX RBX     ! Oops! - RAX allocated as a register
##inc-d 1 
##replace RAX D 0 
_label 2 
##return 
_spill-area-size 0

In this example you can see that the register allocator has allocated RAX as both the destination register and one of the input registers, so our X86 implementation of ##atomic-compare-exchange will need to work around that.

Step two: Wiring up the code generator

Ok, now that we have low level IR working the next step is to tell the compiler how to generate the real cpu machine-code for the new instruction. There's a convention that all machine code emitting words start with a '%' so I'm going to create a generic word %atomic-compare-exchange with method implementations for each CPU. Here's the generic word declaration:

/basis/cpu/architecture/architecture.factor:

HOOK: %atomic-compare-exchange cpu ( dst cptr old new temp -- )

N.B. Factor has a generic dispatch mechanism called 'HOOK:' which dispatches polymorphically based on the value of a variable at compile time. In this case it's the cpu variable, which is set to a singleton representing the target architecture (x86.32, x86.64, ppc), so essentially this generic word is polymorphic based on CPU architecture.

Now I tell the compiler to use this generic word for code generation using the CODEGEN: DSL word (again, see Slava's post on the new compiler DSL words):

/basis/compiler/codegen/codegen.factor:

CODEGEN: ##atomic-compare-exchange %atomic-compare-exchange

Step three: Doing the x86 code generation

All that's left now is to implement %atomic-compare-exchange for our cpu architectures. Below is my method implementation for x86. To make the example more straightforward I've omitted the code that works around the implicit EAX/RAX register, which I abstracted into a 'with-protected-accumulator' combinator.

/basis/cpu/x86/x86.factor:

:: (%atomic-compare-exchange) ( dst cptr old new -- )
    accumulator-reg old MOV
    LOCK
    cptr [] new CMPXCHG
    dst accumulator-reg MOV ;

! CMPXCHG implicitly uses EAX/RAX (accumulator) so need to remove
! EAX from arguments and protect it from being stomped
M: x86 %atomic-compare-exchange ( dst cptr old new temp -- )
    [ (%atomic-compare-exchange) ] with-protected-accumulator ;

The (%atomic-compare-exchange) word contains the actual machine code generation: you can see I simply output 4 lines of assembler using the Factor x86 assembler DSL and the registers passed to me by the compiler. (N.B. 'accumulator-reg' is my helper word that returns EAX or RAX depending on whether the arch is 32 or 64 bit.)

Now that the x86 implementation is written we can check the output machine code with the disassemble word (which uses either the Udis86 library or GDB under the hood to do the disassembling):

( scratchpad ) [ ALIEN: 20 1 2 compare-swap ] disassemble
00007f85690413a0: 48b8fe3ac566857f0000  mov rax, 0x7f8566c53afe
00007f85690413aa: 48b90800000000000000  mov rcx, 0x8
00007f85690413b4: 48ba1000000000000000  mov rdx, 0x10
00007f85690413be: 4889c3                mov rbx, rax
00007f85690413c1: 4889c8                mov rax, rcx
00007f85690413c4: f0480fb113            lock cmpxchg [rbx], rdx
00007f85690413c9: 4889c0                mov rax, rax
00007f85690413cc: 4983c608              add r14, 0x8
00007f85690413d0: 498906                mov [r14], rax
00007f85690413d3: c3                    ret

The disassembler output verifies that the cmpxchg instruction is being compiled correctly. You can also see that I'm doing some juggling with the rax register to manage using it as an implicit argument to cmpxchg.

Hopefully that gives a good overview of how to get new low-level instructions added to Factor's compiler, and also illustrates how machine-code generation works in Factor.

Tue, 13 Oct 2009 19:02:00

Daniel Ehrenberg: Bitfields in Factor structs and the special style

Sometimes, it's useful to pack multiple small numbers into a small amount of memory. Say you have a tuple like

TUPLE: nums a b c ;
where you know that a is always between 0 and 31, b is always between -2 and 1, and c is always either 0 or 1. It should be possible to store this in a single byte, since a can be represented in 5 bits, b in 2 bits, and c in 1 bit.

One way to implement this is to define words to store in a single fixnum what a nums tuple stores. These words would consist of shifts, masks and maybe calls to a sign extension routine. But this code would be boring and annoying to write.
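
To make that concrete, here's roughly what the hand-written packing code looks like as a C sketch (illustrative only; it puts the first field in the least significant bits, which also happens to match the layout convention described further down):

#include <stdint.h>

/* a in bits 0-4, b in bits 5-6, c in bit 7 */
uint8_t pack_nums(unsigned a, int b, unsigned c)
{
    return (uint8_t)((a & 0x1f) | ((b & 0x3) << 5) | ((c & 0x1) << 7));
}

int unpack_b(uint8_t packed)
{
    int b = (packed >> 5) & 0x3;
    return b >= 2 ? b - 4 : b;   /* sign-extend the 2-bit signed field */
}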

I've automated this process by allowing Factor structs to have bit fields, like C structs can. Here's how the implementation of nums would look:
STRUCT: nums
    { a uint bits: 5 }
    { b int bits: 2 }
    { c uint bits: 1 } ;
Thanks to Joe Groff's changes to how structs work, this is a class, with prettyprinting and accessors just like tuple classes. Structs were originally a feature of Factor for the foreign function interface, but they're actually useful in pure Factor programs too: they're an efficient way to manipulate binary data.

Bitfields deviate from the FFI origins of structs because they don't follow any standard convention on bitfield layout. In C, unlike for structs, there is no single standard ABI for bitfields, even on a single OS/architecture platform. Different compilers act differently. So why not make up my own convention? I store everything little-endian, and the first bitfield stores in the least significant bits of the byte. Because bitfield access only reads the underlying memory one byte at a time in the current implementation, this is just as efficient on big-endian hardware as little-endian hardware.

Factor supports efficient homogeneous arrays of structs, allowing lots of data to be packed together efficiently. Because I extended structs for bitfields, rather than creating a new type of class, struct arrays can immediately be used with structs that have bitfields. This worked immediately; I didn't have to modify struct arrays.

The actual code for the setters and getters is implemented in a high-level style. There are no type declarations, and all math is done with generic arithmetic calls. Factor's compiler is smart enough to translate this into fixnum arithmetic with no overflow checks in the case that the bitfields are small enough. If you make a field 100 bits wide, it'll use generic integer arithmetic, to take into account possible overflow into bignums. But this will only be used for certain calculations, the ones that could overflow.

The Factor code generated for accessors is naive, not only in that it doesn't declare types, but also in that it does some things that could be special-cased for more efficient code. For example, in every byte read of the array, a mask is applied to the result with bitwise and, so that the relevant bits can be extracted. Sometimes, that mask will be 255, so it won't actually mask away any bits, and does nothing. Rather than special-casing a mask of 255 and generating code that doesn't call bitand, I extended the compiler to understand the algebraic identity that, in the range of numbers from 0 to 255, 255 bitand is the identity function. (This works for (2^n)-1, for any n.)

Not all code can be compiled as efficiently as the bitfield accessors. There is a special style of code that the compiler can compile more efficiently. As the compiler becomes more advanced, the subset grows. Really, there's a continuum of programs that the compiler can compile more or less efficiently. Originally, the special style consisted of code whose stack effect could be inferred by the compiler, allowing optimizations to take place. Now, all valid Factor code has an inferrable stack effect, but the compiler is advanced enough that it can do further optimizations when more information is available.

Code written in the special style has to have something indicating all of this information about the code. In the case of bitfield accessors, we want the compiler to be able to infer that it can use fixnum arithmetic without overflow checks if the bitfield is small enough that the result won't ever overflow into a bignum. The compiler can figure this out in the case of bitfield getters because it is only doing shifts, bitands and bitors on the result of alien-unsigned-1, a word to read one byte of memory. The shifts are all by constants.

In the sparse conditional constant propagation pass, the compiler tracks the possible interval of values that a virtual register could hold. The compiler understands how bitands, bitors and shifts by a literal relate the interval of their output to the interval of their input. Another pass, the modular arithmetic pass, propagates information backwards about the modulus that earlier calculations could be done in, based on how the data is used later. This also allows overflow checks to be eliminated.

This aspect of the special style allows overflow checks to be eliminated, allowing the use of more direct machine arithmetic. Other aspects of special style allow other efficient code to be generated. On code where types can be inferred, and a float or SIMD type is used, the compiler can allocate float or SIMD registers, and inline efficient code for arithmetic operations. Code using immutable tuples gets the benefit of tuple unboxing, so repeated lookup of slots is cheap, and sometimes allocation can be eliminated. Code that uses mutable datastructures gets unboxed at another point in the compiler, but since it is harder to reason about, unboxing right now has only been implemented within a restricted piece of code. When the class of the input to a generic word is known, the correct method is called directly, and sometimes inlined. There are many other optimizations that have been implemented over the years, too many to describe in this blog post.

Not all code can be written in the special style, and it would be unreasonable to expect most Factor programmers to learn about it. The compiler and the UI framework, for example, would be difficult to translate into a form taking heavy advantage of the style. But a lot of the code in the standard library could be written this way. For example, the code for appending sequences is written in a style that allows the inner loop to have no subroutine calls inside of it and register allocation to be performed, when operating on certain types of sequences. There is only one piece of code that implements append, but using the hints feature lets specialized versions be generated for certain types which are often appended together. Append is used in many places, so code that's written in a dynamic style will benefit from this, even if the dynamic-style code itself isn't optimized better.

Special style is an extremely important feature of Factor, and without it, Factor would be as slow as many scripting languages. Without special style, many libraries would have to be implemented in C, as scripting languages do. Because of special style, Factor can be self-hosting and use many libraries written in Factor without an overwhelming performance penalty. Without special style, to implement bitfields as fast as they are right now, it would have been necessary to generate machine code on each platform.

Update: I guess I forgot to tell you exactly what special style is, and how to write code in it. Well, it's complicated. I'll write about it in the future. Special style grows as the compiler becomes more advanced, but I can't describe how it is right now.

Sat, 10 Oct 2009 17:51:00

Phil Dawes: Hand-coding multi-platform assembler using Factor compiler intrinsics

Disclaimer: I'm not a Factor compiler expert and am just getting to grips with compiler intrinsics so some of this might be a bit iffy.

'Compiler Intrinsics' is a mechanism by which you can insert your low-level implementation of a subroutine into the compiler output. This is useful in a couple of scenarios:

  • if the compiler doesn't support the desired functionality - e.g. it does something hardwarey that Factor can't do yet
  • if the subroutine is performance critical and the compiler isn't generating the most efficient code

The old way of doing compiler intrinsics in Factor was to hand-code some assembler using one of Factor's assembler DSLs (PPC or X86) and then attach it to an existing word as a word-property along with an argument type pattern. When the compiler compiled calls to the word it would compare the input parameters to the pattern and on match would insert the assembler directly into the generated code.

Since my last post about Factor's compiler over a year ago Slava has pretty much re-written the whole thing. It now has two intermediate stages:

The first, frontend stage transforms the Factor code into an intermediate representation called 'high level IR'. This is basically a decomposition of Factor code into primitive word-calls and control nodes through various optimization passes. This is very similar to the dataflow IR in the original Factor compiler that I described in the previous blog post.

The second, backend stage is the new bit. It converts the high-level IR into low-level IR, which is basically a platform-independent assembler language. An optimization stage then runs and cpu registers are allocated, resulting in 'machine IR' (abbreviated to 'MR' in the debug tools). The real machine code generation is then done from this MR.

The new way of doing compiler intrinsics allows you to insert low-level IR code at the beginning of the 'backend' stage. Differences to the old way include:

  • You now code using the platform independent instructions defined in compiler.cfg.instructions
  • Instructions operate on virtual registers. There are an infinite number of those
  • Subroutine arguments don't appear in registers. Instead you manually insert code to get them in and out of the data stack using ds-push, ds-pop
  • You still have to box and unbox values manually (just as before)
  • There's an optimization stage that runs after you've emitted the low level IR instructions from your compiler intrinsic

As a really simple example here's a word which is going to add 35 to the fixnum on the top of the stack and push the result. To make sure that we're executing the intrinsic assembler I'll give it a default implementation that throws an error.

: add-35 ( n -- n' ) 
    drop "shouldn't call this" throw  ;

Incidentally, here are the MR instructions generated from this default implementation:

( scratchpad ) USE: compiler.cfg.debugger
( scratchpad ) \ add-35 test-mr.
=== word: add-35, label: add-35

_label 0 
_prologue T{ stack-frame { total-size 32 } } 
_label 1 
##load-reference RAX "shouldn't call this" 
##replace RAX D 0 
_label 2 
##call M\ object throw 
_label 3 
##no-tco 
_spill-area-size 0

A couple of things to notice:

  • The instructions are prefixed with ##. E.g. ##load-reference, ##replace

  • This MR output is displayed after cpu register allocation has been done: RAX is an x86.64 register. Also D is a pseudo-register that points to the data stack. If you look at the disassembled machine code (just below the callstack juggling) you can see that D actually becomes R14:

( scratchpad ) \ add-35 disassemble
00007f6d98780ce0: 49b8e00c78986d7f0000  mov r8, 0x7f6d98780ce0 (add-35)
00007f6d98780cea: 6820000000            push dword 0x20
00007f6d98780cef: 4150                  push r8
00007f6d98780cf1: 4883ec08              sub rsp, 0x8
00007f6d98780cf5: 48b8e6866ca76d7f0000  mov rax, 0x7f6da76c86e6
00007f6d98780cff: 498906                mov [r14], rax
00007f6d98780d02: e859a385ff            call 0x7f6d97fdb060

Ok, so instead of an implementation that throws an error I want to insert my own instructions into the output. I can do this by attaching some low-level-IR emitting code to the word using the "intrinsic" word property:

: emit-add-35 ( node -- )
    drop              ! don't need to inspect the compiler node
    ds-pop            ! insert instruction to pop value off the stack
    ^^untag-fixnum    ! insert code to untag the value in the register
    35 ^^add-imm      ! insert instruction to add 35 to it (add-imm = add immediate)
    ^^tag-fixnum      ! insert code to tag the result
    ds-push ;         ! insert code to push the result onto the data stack

\ add-35 [ emit-add-35 ] "intrinsic" set-word-prop

The emit-add-35 word just pops a value off the stack, un-tags (unboxes) it, adds 35 to it and tags the result. A few points:

  • 'Hats' - The ^^ forms of instructions are the same as the ## forms, except that after emitting the instruction the ^^ form returns the (new) destination register so that it can be used by the next instruction.

  • 'tag/untag' - Factor aligns all its heap data to the nearest 8-byte boundary, which leaves the bottom 3 bits of each pointer free for runtime type identification (RTTI). These 3 RTTI bits are called the 'tag', and in the case of a fixnum the tag is '000' and the other bits store the actual value rather than a pointer to the value. So instead of unboxing fixnums we simply untag them, which equates to shifting them 3 bits to the right (a small C sketch of this follows the list).

  • node parameter - You'll notice that the emit-add-35 word takes a node parameter. This parameter is a structure passed by the compiler and contains information about the inferred types and value-ranges of the arguments at compile time. This is handy if you're dispatching based on type or you want to decide whether to include overflow logic. In this example I'm doing neither so I discard it.
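
Here is the promised C sketch of the tagging scheme (illustrative only; the real tag and untag are emitted as single shift instructions, as the annotated MR below shows):

#include <stdint.h>

/* The low 3 bits of a value hold the tag; fixnums use tag 000, so
   tagging and untagging are just shifts by 3. */
intptr_t tag_fixnum(intptr_t n)   { return n << 3; }
intptr_t untag_fixnum(intptr_t x) { return x >> 3; }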

Now that the add-35 word has a compiler intrinsic we can see the emitted code by compiling it within a quotation (code-block) and displaying the mr:

( scratchpad ) [ add-35 ] test-mr.
=== word: ( gensym ), label: ( gensym )

_label 0 
_label 1 
##peek RAX D 0                     ! - load value from stack
##sar-imm RAX RAX 3                ! - untag
##add-imm RAX RAX 35               ! - add 35
##shl-imm RAX RAX 3                ! - tag
##replace RAX D 0                  ! - replace top stack elem with result
_label 2 
##return 
_spill-area-size 0

I've annotated this output but you could probably guess what it was doing anyway.

I mentioned earlier that a backend optimizer stage runs after the intrinsic word is called. To illustrate this here's a compilation of the add-35 word with a supplied constant argument:

( scratchpad ) [ 4 add-35 ] test-mr.
=== word: ( gensym ), label: ( gensym )

_label 0 
_label 1 
##load-immediate RAX 312 
##inc-d 1 
##replace RAX D 0 
_label 2 
##return 
_spill-area-size 0

You can see that the Factor compiler dispensed with our hand-coded add instruction and instead just stuck the fixnum-tagged result in the RAX register. It did this because it could perform the evaluation and boxing at compile time. ( 312 = (35 + 4)<<3 ). Here's the resulting X86 assembler:

( scratchpad ) [ 4 add-35 ] disassemble
00007feac680e0c0: 48b83801000000000000  mov rax, 0x138
00007feac680e0ca: 4983c608              add r14, 0x8
00007feac680e0ce: 498906                mov [r14], rax
00007feac680e0d1: c3                    ret

So that leaves the question: How do I code actual X86 assembler into a subroutine?

To do that you need to create a new low-level instruction tuple and emit your X86 assembler from a generate-insn method on that instruction. This is a lot easier than it sounds thanks to the INSN: and CODEGEN: words.

I've got to add some CAS instructions soon so I'll probably write a bit about it then.

Wed, 7 Oct 2009 19:16:00

planet-factor is an Atom/RSS aggregator that collects the contents of Factor-related blogs. It is inspired by Planet Lisp.
