Burger.

it's not working

Written January 19, 2024

Arthur Whitney is an esteemed computer scientist who led the design on a few well-known pieces of software:

I've never even seen a trillion numbers, much less calculated them, but kdb is apparently a standard tool on Wall Street. They probably care about money, so I'll assume kdb does its job well. His languages take significantly after APL, which was a very popular language for similar applications before the invention of (qwerty) keyboards.

But I'm not here to talk about boring things like "using software to make incomprehensible amounts of money in finance" or "human beings and their careers", I'm here to talk about how a guy writes C code weird. For a very simple version of the programming language K, there's a publicly available interpreter he wrote in a few days using about 50 lines of C to show the basics of interpreter writing. This is the C (specifically the January 16, 2024 version #2):

a.h

typedef char*s,c;s Q=(s)128;
#define _(e...) ({e;})
#define x(a,e...) _(s x=a;e)
#define $(a,b) if(a)b;else
#define i(n,e) {int $n=n;int i=0;for(;i<$n;++i){e;}}

#define Q(e) if(Q==(e))return Q;
#define Qs(e,s) if(e)return err(__func__,s);
#define Qr(e) Qs(e,"rank")
#define Qd(e) Qs(e,"domain")
#define Qz(e) Qs(e,"nyi")

#define _s(f,e,x...) s f(x){return _(e);}
#define _i(f,e) _s(f,e,c x)
#define f(f,e)  _s(f,e,s x)
#define F(f,e)  _s(f,e,s a,s x)

#define ax (256>x)
#define ix (c)x
#define nx x[-1]
#define xi x[i]

#define aa x(a,ax)
#define ia x(a,ix)
#define na x(a,nx)

#define oo w("oo\n")

a.c

#include"a.h"//fF[+-!#,@] atom/vector 1byte(int/token) clang-13 -Os -oa a.c -w 
#define r(n,e) _(s r=m(n);i(n,r[i]=e)r)
f(w,write(1,ax?&x:x,ax?1:strlen(x));x)F(err,w(a);w((s)58);w(x);w((s)10);Q)
_i(wi,s b[5];sprintf(b,"%d ",x);w(b);0)
f(W,Q(x)$(ax,wi(ix))i(nx,wi(xi))w(10);x)

f(srt,Qz(1)0)f(uni,Qz(1)0)F(Cut,Qz(1)0)F(Drp,Qz(1)0)_i(m,s a=malloc(1+x);*a++=x;a)
#define A(c) ((s)memchr(a,c,na)?:a+na)-a
#define g(a,v) ax?255&a:r(nx,v)
f(not,g(!ix,!xi))f(sub,g(-ix,-xi))F(At,Qr(aa)g(a[ix],a[xi]))F(_A,Qr(aa)g(A(ix),A(xi)))
f(ind,Qr(!ax)0>ix?r(-ix,-ix-1-i):r(ix,i))F(Ind,Qr(!aa)Qd(1>ia)g(ix%ia,xi%ia))
#define G(f,o) F(f,ax?aa?255&ia o ix:Ltn==f?f(sub(x),sub(a)):f(x,a):r(nx,(aa?ia:a[i])o xi))
G(Ltn,<)G(Eql,==)G(Not,!=)G(Sum,+)G(Prd,*)G(And,&)G(Or,|)
f(cat,Qr(!ax)r(1,ix))F(Cat,a=aa?cat(a):a;x=ax?cat(x):x;s r=m(na+nx);memcpy(r+na,x,nx);memcpy(r,a,na))
f(at,At(x,0))f(rev,Qr(ax)At(x,ind(255&-nx)))f(cnt,Qr(ax)nx)
F(Tak,Qr(!aa||ax)Qd(0>ia||ia>nx)At(x,ind(a)))F(Sub,Sum(a,sub(x)))F(Mtn,Ltn(x,a))f(qz,Qz(1)0)
#define v(e) ((strchr(V,e)?:V)-V)
s U[26],V=" +-*&|<>=~!@?#_^,",
(*f[])()={0,abs,sub,qz ,qz,rev,qz ,qz, qz ,not,ind,at,uni,cnt,qz ,srt,cat},
(*F[])()={0,Sum,Sub,Prd,And,Or,Ltn,Mtn,Eql,Not,Ind,At,_A ,Tak,Drp,Cut,Cat};
_i(n,10u>x-48?x-48:26u>x-97?U[x-97]:0)
f(e,s z=x;c i=*z++;!*z?n(i):v(i)?x(e(z),Q(x)f[v(i)](x)):x(e(z+1),Q(x)58==*z?U[i-97]=x:_(c f=v(*z);Qd(!f)F[f](n(i),x))))
int main(){c b[99];while(1)if(w(32),b[read(0,b,99)-1]=0,*b)58==b[1]?e(b):W(e(b));}

This is the entire interpreter, and this is apparently how he normally writes code. Opinions on his coding style are divided, though general consensus seems to be that it's incomprehensible. As daunting as it is, I figured I should give it a chance for a few reasons.

So I'm going to go line by line and explain my understanding. I tried to use the notes provided in the repo only when I was stuck, which was a few times early on, but by the end I could understand it pretty well.

A reading of a.h

typedef char*s,c;

This already shows some funky C. It defines s as char *, and c as char, because the * attaches to the name, not the type. It's an oddity of C syntax that I've never been a fan of. Otherwise this is pretty straight forward: s is for string, and c is for character.

s Q=(s)128;

Fuck. Shit. He assigned 128 to a string named Q. What does it mean? s is char *. Why is Q a pointer to the address 128? I thought I must have misunderstood, and s was actually a character or something, but it's clearly specified as char *. s is for string! I couldn't figure out the meaning, so I soon gave up and looked at the annotated code. The char * is unsigned long long in other versions, and they explain that the type is used for both integers and pointers. The code operates on vectors of 8-bit integers, either as ASCII or numbers, so it makes some sense to use char * from a memory layout perspective, but I don't use pointers as integers very often.

#define _(e...) ({e;})
#define x(a,e...) _(s x=a;e)
#define $(a,b) if(a)b;else
#define i(n,e) {int $n=n;int i=0;for(;i<$n;++i){e;}}

These are all pretty straight forward, with one subtle caveat I only realized from the annotated code. They're all macros to make common operations more compact: wrapping an expression in a block, defining a variable x and using it, conditional statements, and running an expression n times.

The subtle thing the annotations point out is the first macro, ({e;}). The parentheses around curly brackets make this a statement expression, a non-standard C extension that allows you to treat a block of statements as a single expression, if the last statement is an expression that provides a value. In other words, int x = ({int a = func1(); int b = func2(a); a+b;}); sets x to whatever a+b is. This is used everywhere in the code after this.

#define Q(e) if(Q==(e))return Q;
#define Qs(e,s) if(e)return err(__func__,s);
#define Qr(e) Qs(e,"rank")
#define Qd(e) Qs(e,"domain")
#define Qz(e) Qs(e,"nyi")

These are error macros using that mysterious Q defined earlier. Q seems to have been used to represent errors, possibly short for "Quit". The Qr/d/z functions seem to be types of errors. I have no idea what "nyi" means (I figure it out later).

#define _s(f,e,x...) s f(x){return _(e);}
#define _i(f,e) _s(f,e,c x)
#define f(f,e)  _s(f,e,s x)
#define F(f,e)  _s(f,e,s a,s x)

These replace function declarations, and we can see that _ macro being used to add an implicit return. _s could be used like

_s(my_function, puts("I rock!!!"); x*5+e, s x, int e)

, which would create basically this standard C:

char *my_function(char *x, int e) {
	puts("I rock!!!");
	return x*5+e;
}

All the macros except the base _s also add implicit arguments like a and x and you bet it's hard to tell them apart.

#define ax (256>x)

This was another one that baffled me until I looked at the annotations. Remember how I said s values were either integers or pointers? 256 is the cutoff value for these integers, which the annotations call atoms, so ax means "is x an atom?"

#define ix (c)x
#define nx x[-1]
#define xi x[i]

These aren't too confusing. ix casts x to a char. nx implies x is some sort of fat pointer, meaning there's probably a length at x[-1], but we'll see. xi just indexes x as a normal pointer, using our implicitly defined i from the i(...) macro.

#define aa x(a,ax)
#define ia x(a,ix)
#define na x(a,nx)

These copy ax, ix, and nx respectively to work on the a variable, which is an implicit argument in functions defined using the F(f,e) macro. You remembered the x(name, expression) macro for assigning to a locally-scoped x, right?

#define oo w("oo\n")

It prints oo. It's not used anywhere.

a reading of a.c

I wound up not needing to refer to the annotated code at all to understand this. The C code is mostly using everything in the headers to build the interpreter.

#define r(n,e) _(s r=m(n);i(n,r[i]=e)r)

We create a vector r from m(n) (which is defined later (it's malloc)), fill r with the results of e, and return it out of the statement expression.

f(w,write(1,ax?&x:x,ax?1:strlen(x));x)

This defines s w(s x), which is our print function. If x is an atom (ax?), we print it as a single character by getting its address (&x) and providing a length of 1. If it's a vector, we print it as a string using strlen to calculate how long it is, so now we also know vectors must be null-terminated here. write and strlen are standard functions that we call without including the headers, because fuck headers. Let the linker figure it out.

F(err,w(a);w((s)58);w(x);w((s)10);Q)

Our fancy shmancy error printing function, s err(s a, s x). The confusing thing is that : and the newline are represented by their ASCII numbers 58 and 10, respectively. This just prints a message in the format {a}:{x}\n and returns our special error value Q.

_i(wi,s b[5];sprintf(b,"%d ",x);w(b);0)

Defines s wi(c x), which takes x as a char, formats it as an integer in up to 5*sizeof(char*)/sizeof(char) characters (40 on 64-bit machines), and writes that.

f(W,Q(x)$(ax,wi(ix))i(nx,wi(xi))w(10);x)

Another print function, s W(s x) either writes x as an integer or a list of integers. It also refuses to print the Q vector.

f(srt,Qz(1)0) f(uni,Qz(1)0) F(Cut,Qz(1)0) F(Drp,Qz(1)0)

I figured out what nyi means! It means "Not yet implemented", as we can see from these function definitions.

_i(m,s a=malloc(1+x);*a++=x;a)

And we find our previously-used function s m(c x), which allocates our buffer and returns a fat pointer (with the size at x[-1], hence the 1+x and a++). x is the length we're allocating here, which means our vectors are limited to 255 bytes. The repo suggests upgrading capacity as an exercise to the reader, which could be fun.

#define A(c) ((s)memchr(a,c,na)?:a+na)-a

This macro finds the first occurence of the character c in our vector a as an index into the string (hence the -a, since memchr returns a pointer). If the result is null, it just returns the length of the string (a+na - a). We see another fun bit of non-standard syntax, ?:, which I had to look up. a ?: b is equivalent to a ? a : b without evaluating a twice. Pretty snazzy!

#define g(a,v) ax?255&a:r(nx,v)

Strange little operation, I'll have to see it in action. If x is an atom, it clamps a to be an atom with a simple mask (255&a), otherwise it creates a new vector the same size as x filled with the result from v.

f(not,g(!ix,!xi)) f(sub,g(-ix,-xi)) F(At,Qr(aa)g(a[ix],a[xi])) F(_A,Qr(aa)g(A(ix),A(xi)))

Ah, I see now. g(a, v) lets us define functions that work on both atoms and vectors. If x is an atom, it returns the atom result clamped within the correct bounds. Otherwise it allocates a new vector and computes the other expression. All the above functions work either on x as an integer (ix), or on every element of x (xi).

This is a lot of functionality in such a small bit of code.

f(ind,Qr(!ax)0>ix?r(-ix,-ix-1-i):r(ix,i))
F(Ind,Qr(!aa)Qd(1>ia)g(ix%ia,xi%ia))

These are some atom-only functions.

Honestly, I can buy that this method of coding produces fewer bugs, once you can actually write it, since you work only on small building blocks of the logic and reuse them. Like, where could a bug for ind even be? Maybe an off-by-one in -ix-1-i, but it's hard to miss what's happening once you can see the trees through the forest.

#define G(f,o) F(f,ax?aa?255&ia o ix:Ltn==f?f(sub(x),sub(a)):f(x,a):r(nx,(aa?ia:a[i])o xi))

That's too many trees! I can't understand this many nested ternary operators at the same time because I'm not the alien from Arrival. I process things in linear time. I have to chunk this up.

F(f, ax ?
		aa ?
			255 & ia o ix
			: Ltn==f ? 
				f(sub(x),sub(a))
				: f(x,a)
		: r(nx,(aa?ia:a[i])o xi))

Okay, I see. It's 3 cases from 2 conditions: x is an atom or a vector, and a is an atom or a vector.

G(Ltn,<)G(Eql,==)G(Not,!=)G(Sum,+)G(Prd,*)G(And,&)G(Or,|)

Using our fancy new macro, we quickly define seven new functions for the vectors, where they're all element-wise applications of binary operators.

f(cat,Qr(!ax)r(1,ix)) F(Cat,a=aa?cat(a):a;x=ax?cat(x):x;s r=m(na+nx);memcpy(r+na,x,nx);memcpy(r,a,na))

I was confused by the first function, but I see now these are cat as in "concatenate". For an atom, it creates a vector of length 1 containing that atom. Cat does the more complex work of joining two vectors together, running cat on each value if it's an atom to get a list.

f(at,At(x,0)) f(rev,Qr(ax)At(x,ind(255&-nx))) f(cnt,Qr(ax)nx)

Some more simple functions. Lil at gets the first item of x; rev reverses the list using our ind function to generate the list of indeces in reverse, which is a little overkill but vectors are 256 bytes at max and memory is never freed so who cares; and cnt gets the size of x.

F(Tak,Qr(!aa||ax)Qd(0>ia||ia>nx)At(x,ind(a))) F(Sub,Sum(a,sub(x))) F(Mtn,Ltn(x,a)) f(qz,Qz(1)0)

Some more simple functions. Tak returns the first a characters from the vector x as a new list; Sub subtracts; Mtn is "more than"; and qz returns our "not yet implemented" error.

#define v(e) ((strchr(V,e)?:V)-V)

A shorthand to get the first occurence of a character from a string V, returning an offset into the array or zero if it's not present. This seems ambiguous, since that's also the result if we match with the first character, but we'll see.

s U[26],V=" +-*&|<>=~!@?#_^,",
(*f[])()={0,abs,sub,qz ,qz,rev,qz ,qz, qz ,not,ind,at,uni,cnt,qz ,srt,cat},
(*F[])()={0,Sum,Sub,Prd,And,Or,Ltn,Mtn,Eql,Not,Ind,At,_A ,Tak,Drp,Cut,Cat};

Ah, we have seen. The first character of V is a space, and it looks like the arrays of function pointers match up with the characters of V to give us our functions, with space being null. However, I think abs is from the standard library here, since it's not defined anywhere else? That seems like a bug. It'd work for atoms, but it'd break vectors. This also defines an array of 26 vectors, which I assume will be our variables.

_i(n, 10u>x-48? x-48 : 26u>x-97? U[x-97] : 0)

s n(c x) reads a char and returns one of several things. I'll have to consult an ASCII table.

So it looks like this function is specifically to read values, either numerals or variables, all of which are one character.

f(e,s z=x;c i=*z++;!*z?n(i):v(i)?x(e(z),Q(x)f[v(i)](x)):x(e(z+1),Q(x)58==*z?U[i-97]=x:_(c f=v(*z);Qd(!f)F[f](n(i),x))))

Uh, let's spread this out a little.

f(e, s z=x; c i=*z++; !*z ? n(i)
	: v(i) ? x(e(z), Q(x) f[v(i)](x))
		: x(e(z+1), Q(x) 58==*z ? U[i-97]=x 
			: _(c f=v(*z); Qd(!f) F[f](n(i),x))))

Okay, easy. It's a recursive function that checks each character of vector x, called i. If we're at the end of the string, we check if i is a value and return it. Otherwise, we first check if it's an operator (from our V string). If it is, we evaluate the rest of the string, check for errors, and then apply the operation to the result from the rest of the evaluation. If it wasn't an operation, we evaluate the rest of the string, skipping one character. If the skipped character is a colon (ASCII 58), we assign the result of the evaluation to one of the slots in U (if the character i is not a lowercase ASCII letter, this will corrupt memory, so don't write bugs).

I'm pretty sure spaces are a syntax error in every location, and I don't see code to create array literals. If the skipped character is an operator, we instead apply that binary operator on the evaluation result and n(i), which is either a variable or a number. This means code is executed from right to left, with no operator precedent, which is standard for APL-type languages from what I understand. Because this language has only single-character variable names, numbers, and operators, this is all the tokenizing, parsing, and evaluation we need.

C int main(){c b[99]; while(1) if(w(32),b[read(0,b,99)-1]=0,*b) 58==b[1] ? e(b) : W(e(b));}

And finally, main. In an infinite loop, we read up to 99 characters from a user, and then write the evaluated result of that text.

Et voila. We have a tiny interpreter for a simple array language. It's not exactly production-ready, but it does quite a lot in its tiny footprint.

Conclusions

I think I can say I understand this code pretty well, even more than most code I read. I don't know how much of that is because of the coding style, or because I spent eight hours writing several thousand words about fifty lines of code in a borderline-delusional fugue state brought on by drinking one small Starbucks™ Frapuccino® (Mocha® Flavored*) I bought from a gas station at 10 PM on a Thursday.

I had some fun dreams about macros with one-character names applying operations on scalars and vectors that morning (I went to sleep at 6:40 AM).

All in all, it was a fun exercise. To summarize my thoughts:

Good ideas

Bad ideas

Ideas I'm ambivalent about

This coding style feels most appropriate for "done" code, or code that's been worked out on paper or some other environment and is now being written down in a compact way. It's more like finalized mathematical notation than typical code. I don't see myself being able to effectively write code like this, since I tend to quickly jot down ideas, compile and run to validate, and edit what I've done based on the results. Making a small set of good primitives and building heavily on those requires basically having solved the problem before writing a single line. Otherwise you get bad primitives that cause more confusion than help, and adjusting those primitives would involve rewriting all the code that depends on them, which is all the code.

But I think that's my key takeaway: I tend to work out problems in the code, which can lead to messy results. I write the dumbest possible solution, and then have to try and refactor as I develop a better mental model of the problem. What I like most about this code isn't how few characters are used, or everything fitting in one screen. I like how it seems the code was well-understood before it was written. You can't refactor a 500-line jumble of C into code like this.

The real lesson is that I'm probably too quick to jump into coding things. I could spend more time working out how I want to model a problem in a more free-form way, like writing notation on paper, before jumping into the rigid world of programming syntax. With a clear mental model, I can then write code in terms of how I was thinking about the problem, instead of thinking about the problem in terms of how I wrote the code.

Next time

I think a fun exercise would be to extend this interpreter while maintaining its code style, to see how I fare actually working with it. The repository this came from has various suggested exercises, but my ideas would be adding the following: