Here we go again...
It's been almost 2 years since I last wrote a blurb on this blargh. Welp, I need a new creative programming side project before I go insane. Let's call it Qak!
You might remember me waxing on and on about Wee a long time ago. That eventually ended up becoming Paperbots, a browser-based programming environment geared towards beginners, teachers, as well as programmers who want a simple creative outlet. It worked out pretty well, and were I not as lazy as I am, I'd figure out how to get the certificate for the site going again, so you can enjoy the oevres my pupils created. Alas, I remain lazy, and my nginx-proxy and Docker knowledge is insufficient.
Paperbots was written in TypeScript and intended for browser-use only. In recent months, I returned to native land, writing (terrible) C, C++ and some x86_64 assembly. CLion made this so enjoyable that I started craving work on a C++ based side project to get the creative juices flowing again. There's just something magical about not being tied to a managed runtime and peaking and poking at memory addresses.
What better excuse to start yet another language/VM project! This time around, I don't have any lofty goals. Goals take the fun out of the equation. Instead, I'll just see where the wind carries me.
While I'd love to create something similar to Bob Nystrom's excellent Crafting Interpreters, I neither have the time, talent nor copy editor to come up with something as educationally sound. I do however hope to add on top of Bob's amazing work by building a statically typed scripting language instead of a dynamically typed one. It will be interesting to see how it performs compared to the likes of Python or Bob's Wren.
While Bob did occasionally dive into the real-life issues that one encounters when creating a programming language and runtime, I feel like more documentation of that specific sort of pain can provide additional knowledge on top of Bob's already impecable work. So, this is not me telling you about my successes. This is going to be me detailing my failures and dumbness!
Where the wind's hopefully gonna blow
I said no goals. I lied. Here's what I want to eventually create:
- A statically typed programming language, featuring boring stuff like
- Algol-language family syntax (yes, shhhhh)
- Primitive types
- Records
- Functions/Closures as first class citizens
- Parametric polymorphism
- Modules
- A stack-based virtual machine, featuring
- Single-threaded use only
- A pluggable garbage collector
- A debugger interface
- A bi-directional FFI making it easy to call C code from inside the language and vice versa.
The little fucker is not going to win any language design awards. But hopefully I can keep the implementation as simple as possible to allow easy experimentation. E.g. writing an emitter for x86_64 assembly or LLVM IR, or adding new syntactic sugar should be a mostly self-contained, pluggable affair.
Cmake this, Cmake that
I spent 3 hours last sunday to set-up the basic build system and some code that will come in handy. More on the code in later sections.
Here's the build, expressed as CMakeLists.txt file.
cmake_minimum_required(VERSION 3.15)
project(qak)
if (MSVC)
message("MSCV detected")
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS}")
set(CMAKE_C_FLAGS "${CMAKE_CXX_FLAGS}")
else ()
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -Wall -Wextra -pedantic -std=c89")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -Wnon-virtual-dtor -pedantic -std=c++03 -fno-exceptions -fno-rtti")
endif ()
set(CMAKE_CXX_STANDARD 11)
include_directories(src)
file(GLOB INCLUDES "src/*.h")
file(GLOB SOURCES "src/*.cpp")
add_library(qak-lib ${INCLUDE} ${SOURCES})
set_target_properties(qak-lib PROPERTIES OUTPUT_NAME "qak")
add_executable(qak ${INCLUDES} "src/apps/qak.cpp")
target_link_libraries(qak LINK_PUBLIC qak-lib)
include_directories(src/apps)
add_executable(test_tokenizer ${INCLUDES} "src/apps/test_tokenizer.cpp")
target_link_libraries(test_tokenizer LINK_PUBLIC qak-lib)
include_directories(src/apps)
add_executable(test_memory ${INCLUDES} "src/apps/test_memory.cpp")
target_link_libraries(test_memory LINK_PUBLIC qak-lib)
include_directories(src/apps)
add_executable(test_map ${INCLUDES} "src/apps/test_map.cpp")
target_link_libraries(test_map LINK_PUBLIC qak-lib)
I'm not particularly fond of CMake, but I love CLion so much, that I'm happy to suffer. While CMake veterans may hold their noses at my use of GLOB
, it makes my life easy.
The build generates a few artifacts:
libqak.a
, a static library you can link into your app.qak(.exe)
, an executable you can eventually feed sources to.test_xxx(.exe)
, a set of executable, each testing various aspects of Qak.
When loaded into CLion, I get build targets for each of the above, which lets me iterate right inside CLion at break neck speeds.
It all feels surprisingly light-weight so far, and since I don't plan on bringing in any dependencies, it may actually stay that way. There's hope.
Testing, testing... is this working?
While I could use any of the available C++ unit test frameworks, like Google Test, I'm a simple boy who likes simple things. Here's my unit test framework.
#ifndef QAK_TEST_H
#define QAK_TEST_H
#include <cstdio>
#include <cstdlib>
#include "types.h"
#define QAK_CHECK(expr, ...) { if (!(expr)) { fprintf(stdout, "ERROR: "); fprintf(stdout, __VA_ARGS__); fprintf(stdout, " (%s:%d)\n", __FILE__, __LINE__); fflush(stdout); exit(-1); } }
#endif //QAK_TEST_H
And here's a (smoke) test.
#include <stdio.h>
#include "qak.h"
#include "test.h"
using namespace qak;
int main() {
BumpAllocator allocator(16);
u1* fourBytes = allocator.alloc<u1>(4);
QAK_CHECK(fourBytes != nullptr, "Expected allocated memory.");
QAK_CHECK(allocator.head->next == nullptr, "Expected single block.");
QAK_CHECK(allocator.head->nextFree - allocator.head->base == 4, "Expected 4 allocated bytes.");
u1* eightBytes = allocator.alloc<u1>(8);
QAK_CHECK(eightBytes != nullptr, "Expected allocated memory.");
QAK_CHECK(allocator.head->next == nullptr, "Expected single block.");
QAK_CHECK(allocator.head->nextFree - allocator.head->base == 12, "Expected 12 allocated bytes.");
u1* moreBytes = allocator.alloc<u1>(1000);
QAK_CHECK(moreBytes != nullptr, "Expected allocated memory.");
QAK_CHECK(allocator.head->next != nullptr, "Expected single block.");
QAK_CHECK(allocator.head->nextFree - allocator.head->base == 1000, "Expected 1000 allocated bytes.");
allocator.free();
QAK_CHECK(allocator.head == nullptr, "Expected no block.");
}
I basically built my own assert
called QAK_CHECK
that evaluates an expression. In case the expression ends up being false
, a super helpful error string is printed to the console and the test app kills itself with a non-zero exit code.
I'm happy with this so far. Eventually, a little Bash script will build all the test_xxx
executables, enumerate and execute them, and collect stats on failed vs. succeeded tests based on their exit codes. Simple and quick.
Memory management
Over the Christmas holidays I played around with OpenJDK a little bit. It's a surprisingly readable code base, with many API design decisions I really like. One of them is their use of memory arenas plus RAII to clean-up after yourself in local blocks. Here's a simple example straight from OpenJDK's instanceKlass.cpp
:
if (log_is_enabled(Trace, class, nestmates)) {
ResourceMark rm(THREAD);
log_trace(class, nestmates)("Checking nest membership of %s in %s",
k->external_name(), this->external_name());
}
The piece of code creates a ResourceMark
inside the if
block. Any allocation happening in that block will then be tracked by that mark. The external_name()
method allocates memory in this case. When the block is exited, the ResourceMark
deconstructs itself through RAII, and cleans up any memory that was allocated through it. Nice.
There is of course a bit more to this, as indicated by the THREAD
passed to the constructor. Under the hood, the ResourceMark
uses a thread local to push itself on a stack of ResourceMarks
. Any code that requires allocation can then get the ResourceMark
at the top of that thread-local stack to do its dirty allocation work. This has the nice side effect that one does not have to pass around allocators. It does incure a bit of a performance hit though, as accessing the thread local isn't free.
I've built something similar for Qak called HeapAllocator
. It keeps tracks of (re-)allocations which it cleans up once it's destructed. The Qak code base is inherently single-threaded (yes, shhh), so I don't want to incure the thread-local access hit (for now). This means I have to pass the allocator to anything that needs to allocate memory. A minor inconvenience.
{
HeapAllocator mem;
Buffer file = qak::readFile("data/tokens.qak", mem);
Array tokens(mem);
Array errors(mem);
u1* scratch = mem.allocate(1024, __FILE__, __LINE__);
}
For now, the allocator keeps track of the file location of every allocation. That allows it to act as a poor man's memory leak detector. Eventually, I don't want to bake those long file names into the binary, so that's going to be toggleable via a flag.
I've also implemented a simple BumpAllocator
, the usage of which you can see in the above test code in the previous section. It will come in handy for allocations that are highly sequential, e.g. abstract syntax tree nodes during the parsing phase.
Am I happy with those memory management primitives? No.
The BumpAllocator
has a fragmentation problem. Internally it uses a linked list of fixed sized blocks. The head of that list is the "current" block. If the current block can't accommodate a new allocation, the allocator allocates a new block that can. This new block then becomes the new "current" block. The previous block just sticks around and will no longer be used, even if a subsequent allocation could fit it. There are a bunch of degenerate allocation patterns that can trip this up. But alas, we are in native code land, we know what we're doing!
The HeapAllocator
in its current state also has a few problems. For one, it uses a std::map
to keep track of allocations. Ideally, I can get rid of this dependency by using my own map. It also lacks the nice stack behaviour found in OpenJDK's ResourceMark
, but that one is easy to fix.
For now, both allocators serve their purpose, albeit not as nicely as they could. We'll live.
Collections
Everybody needs their own dynamic array and map implementation. So that's what I did. Feast your eyes on map.h and array.h. As with the allocators, there's a lot that could be better, especially the capacity and bucket management of the map. Alas, ain't nobody got time for that, so I'm moving on and will revisit this sometime in the future.
I/O, not one of Jupiter's moons
To get things off the ground, I also need a way to read files from disk. Behold, io.cpp. That is the code. Please send a PR to fix up all errornous edge cases!
Tokenization and UTF-8
With the scafolding out of the way, I could finally write some compiler related code. As with any compiler project, you start with the tokenizer, also known as scanner, or lexer.
The tokenizer takes a stream of characters (in some encoding, in our case UTF-8), ignores whitespace like new lines or tabs, and returns tokens. Tokens are meaningful text strings, representing special characters like +
, keywords, variable/type/function names, or literals like 1.23
, or "Hello world."
.
The tokenizer thus turns a source file like this:
한자🥴
123 123b 123s 123l 123.2 123.3f 123.4d
'c'
'\n'
true false
null
_Some987Identifier
"Hello world. 한자🥴"
. , ; : + -
Into a list of tokens like this:
Identifier (0:10): 한자🥴
Integer literal (11:14): 123
Byte literal (15:19): 123b
Short literal (20:24): 123s
Long literal (25:29): 123l
Float literal (30:35): 123.2
Float literal (36:42): 123.3f
Double literal (43:49): 123.4d
Character literal (50:53): 'c'
Character literal (54:58): '\n'
Boolean literal (59:63): true
Boolean literal (64:69): false
Null literal (70:74): null
Identifier (75:93): _Some987Identifier
String literal (94:119): "Hello world. 한자🥴"
. (120:121): .
, (122:123): ,
; (124:125): ;
: (126:127): :
+ (128:129): +
- (130:131): -
The returned tokens each have a type, like integer literal, or identifier, as well as a reference to their location in the source, expressed as a start and (exclusive) end offset in bytes. The location of a token is quite important, as it allows subsequent stages like the parser, or type checker to report errors nicely. That said, i should keep track of line numbers as well, to ease the pain when it's time to actually print a nice error.
The implementation of the tokenizer is no more than 320 lines of code, and covers pretty much anything we need for an Algol-type language. It's actually more or less lifted and translated from Java to C from my old basis-template scripting language. Surprisingly, the C++ code is a bit more concise than the Java version.
I've also added (possibly broken) UTF-8 support by shamelessly stealing 20 lines of code from here. The compiler really doesn't need a full-blown understanding of UTF-8. All we need is to be able to advance multi-byte UTF-8 characters. And that's what those 20 lines do.
You can go ahead and play around with the tokenizer via the test_tokenizer test. Just modify the tokens.qak file, then run the executable from the root directory of the project. Try to crash it.
Next up
Next up is the implementation of a run-of-the-mill recursive descent parser for the first couple of language constructs. The parser takes the tokens from the tokenizer stage, validates the syntax of the program and outputs an abstract syntax tree for subsequent processing.
Fun! Please reply to this tweet with your specific bikeshed!
Read the next post in the series.