Building Wee Lang (2) - History lessons

July 21, 2018

We're going to build an eierlegende Wollmilchsau.

This is the second in a series of articles on building a programming language called Wee and its tooling. Wee is an educational prgramming tool for beginners, bridging the gap between (visual) learning tools like Scratch, and professional environments like Java, Python, or C. You can learn more about my motivations on my old blog.

In the last installment, I reminisced about how I learned to program. For brevity's sake, I left out one important aspect: QBasic, the language. In this article, I want to informally dive into QBasic the language and then try to identify the key elements across all the factors that enabled me to learn programming, technical and non-technical. It is these key elements I want to improve upon and derive Wee from.

As a small aside: QBasic is a less powerful derivative of QuickBasic that shipped with MS-DOS 6.x. It lacks QuickBasic's compiler and linker, misses some standard library functions, does not include advanced IDE features like watchpoints and conditional breakpoints, and does not support QuickBasic's module system. Except for the missing module support, the languages QuickBasic and QBasic are identical. I'll hence treat them as the same in the following discussion.

Q(uick)Basic

A lot has been written about the mind-crippling properties of BASIC. The most prominent quote about BASIC comes from Dijkstra

"It is practically impossible to teach good programming to students that have had a prior exposure to BASIC: as potential programmers they are mentally mutilated beyond hope of regeneration."

Having famously made the case against the GOTO statement and having coined the term structured programming, it is unsurprising that Dijkstra would have great distain for a language that is essentially GOTO soup, as this little GW-BASIC snippet illustrates:

10 INPUT "What is your name: "; U$
20 PRINT "Hello "; U$
30 INPUT "How many stars do you want: "; N
40 S$ = ""
50 FOR I = 1 TO N
60 S$ = S$ + "*"
70 NEXT I
80 PRINT S$
90 INPUT "Do you want more stars? "; A$
100 IF LEN(A$) = 0 THEN GOTO 90
110 A$ = LEFT$(A$, 1)
120 IF A$ = "Y" OR A$ = "y" THEN GOTO 30
130 PRINT "Goodbye "; U$
140 END

Early BASIC dialects like GW-BASIC supported only rudimentary forms of control flow statements. Instead, control flow was largely expressed by jumping between (mandatory) line numbers via the GOTO statement. These dialects also did not support functions. These are the BASICs Dijkstra was refering to.

But Q(uick)Basic was a different beast entirely: it was a structured programming language.

Program Structure

A QuickBasic program consists of one or more .BAS files (QBasic only deals with a single .BAS file). Each file defines a module. A module consists of:

Forward declarations of functions, procedures, types and (module-)global variables.
Module-level code.
Function and procedure definitions.

When compiling or running a program made up of multiple modules, one module is chosen as the entry point. Its module-level code becomes the main driver of the program.

To access functions, procedures, types and variables of other modules, they have to be declared in the module using them. This is achieved through a .BI header, usually one per module. This header file can then be included via the $INCLUDE meta-command, in both the module the declarations belong to, as well as other modules that want to access the module's content.

Here's a simple example of a program consisting of a main module, and a second module defining a procedure, a function, a variable shared across modules, and a type. Note that the variable is both accessible to other modules module-level code through COMMON, and their functions and procedures through SHARED.

' --- MAIN.BAS ---
'$INCLUDE: 'MODULE.BI'

DIM p as Point2D

initModule
PRINT "Module version: " + moduleVersion
PRINT add%(1, 2)
PRINT p.x; p.y

' --- MODULE.BI ---
DECLARE SUB initModule ()

DECLARE FUNCTION add%(a AS INTEGER, b AS INTEGER)

COMMON SHARED moduleVersion AS STRING

TYPE Point2D
   x AS INTEGER
   y AS INTEGER
END TYPE

' --- MODULE.BAS ---
'$INCLUDE: 'MODULE.BI'

SUB initModule ()
   moduleVersion = "1.0"
END SUB

FUNCTION add%(a AS INTEGER, b AS INTEGER)
   add% = a + b
END FUNCTION

QuickBasic does not have the notion of name spaces, something sorely missed when mixing many modules. Overuse of module-global and program-global variables was also a "feature" of many QuickBasic programs in the wild, despite language constructs that could have helped avoid them.

QuickBasic modules can be combined into library files with the suffix .QLB. This allows exchanging code without giving away the source code. In addition to packaging up QuickBasic modules, you can also put native code (.OBJ and .LIB) into a QLB file. QuickBasic can call into cdecl and pascal code pretty easily. Declare the function signatures, marshall strings and arrays or pass pointers to records directly, and away you go.

Types, variables, scopes and operators

QBasic supports the following types:

Primitive types
- INTEGER, for 16-bit signed integers.
- LONG, for 32-bit signed integers.
- SINGLE, for 32-bit floating point numbers.
- DOUBLE, for 32-bit floating point numbers.
STRING, immutable, fixed length or dynamic length (e.g. as the result of a concatenation), and made up of ASCII characters.
Arrays, multi-dimensional, with custom index ranges. Can contain elements of all types except other arrays.
Records, that could contain primitive types, fixed length strings, and other record types as fields. Can not contain dynamic strings or fixed/dynamic arrays. Records can not be recursive.

The first surprise in QBasic's type system: everything is a value type (as far as the user is concerned, and with one exception).

Every variable is statically typed, either through an explicit declaration statement or implicitely through first use and an optional type suffix.

a% = 10 ' A 16-bit integer
b& = 20 ' A 32-bit integer
c! = 1.23 ' A 32-bit float
d# = 1.23 ' A 64-bit float

DIM s as STRING * 40 ' Fixed length string of 40 characters
e$ = "Hello world." ' A dynamic string
e$ = e$ + " My name is Mario." ' Concatenation creates a dynamic new string

DIM a(-5 TO 5) AS INTEGER ' Fixed length array of integers
REDIM f(a%) AS INTEGER ' A dynamic array

' Record type declaration
TYPE Point2d
   x AS SINGLE
   y AS SINGLE
END TYPE

DIM p AS Point2d ' A record consisting of two 32-bit floats x and y.

Declaring a variable allocates the required memory to hold a value of the variable's type. The variable will point to that memory location until it goes out of scope and its memory is reclaimed (more on that in the memory management section below). The scope of a variable is either that of the program (module-level variables), or that of the function or procedure it is defined in. Qbasic does not create scopes for control flow structures.

Variables are initialized to their type's default value. All primitive types are initialized to 0. Strings are initialized to an empty string. Record fields are initialized to their respective type's default value. Array elements are also initialized to their type's default value.

QBasic performs implicit narrowing and widening for primitive types, e.g. you can assign an INTEGER to a SINGLE and vice versa. There is no implicit conversion of non-string types to STRING.

For primitive types, the binary arithmetic operators +, -, *, /, \ (integer division), MOD (integer modulo) and ^ (exponentiation) are defined. Unary arithmetic negation is also supported.

The binary logical operators AND, OR, and XOR, as well as the unary logical NOT are only defined on primitive types. QBasic does not have a boolean type, but instead interprets non-zero primitive values as true and zero as false.

The binary relational operators <, <=, >, >=, = (equal), <> (not equal) are supported for primitive types and strings. In the case of strings, the operators compare each string's characters and their length. Records and arrays can not be tested for equality/inequality, altough this would be trivial since everything is a value type.

All operators discussed so far follow the standard operator precedence found in other languages like JavaScript. In addition to this operator precedence, users can use ( and ) to group sub-expressions.

The assignment operator = is defined for all types except arrays. Since everything is a value type, assignment deep copies the right-hand side value to the left-hand side variable, record field or array element.

The array element indexing operator (index) lets us work with array elements. Used on the left-hand side of an assignment, the right-hand side value (which must match the array element type) is deep copied to the array element at the index. Used within an expression, the operator returns a deep copy of the array element at the index.

The field operator . lets us access the invidual fields of a record. Used on the left-hand side of an assignment, the right hand side value (which must match the field type) is deep copied to the field of the record. Used within an expression, the operator returns a deep copy of the field value of the last element in a field operator chain, e.g. a.b.c.

Control flow statements

QBasic supports a basic set of control flow statements.

' Conditions must evaluate to non-zero (true) or zero (false)
IF a < b AND d THEN
   ...
ELSEIF d <> e THEN
   ...
ELSE
   ...
END IF

WHILE a
   ...
WEND

DO
   ...
LOOP WHILE a > b

' STEP is optional
FOR i = 2 to -2 STEP -1
   ...
NEXT

' The test expression can be a primitive or string
SELECT CASE someString$
CASE "Hello", "World"
   ...
CASE IS < "Muh" ' IS is required when using relational operators
   ...
CASE 10 TO 20 ' Can also specify ranges if the text expression is a number
   ...
CASE ELSE
   ...
END SELECT

Nothing out of the ordinary, except that SELECT statements, that allow usage of strings, ranges and relational operators (compare to Java's switch and cry a little inside). Of course, QBasic also supports GOTO, but let's ignore that. All control flow statements are just that: statements. They do not yield values themselves.

Procedures and functions

Lacking a unit or void type, QBasic has to make an explicit distinction between functions and procedures (functions "returning" void).

SUB DoSomething (stuff%, otherStuff%)
...
END SUB

FUNCTION Repeat$(s AS STRING, repetitions as INTEGER)
   FOR i% = 1 to repetitions
      Repeat$ = Repeat$ + s
   NEXT
END FUNCTION

DoSomething 1, 2
a$ = Repeat("Huh", 5)

The return type of a function is declared by suffixing the function name with one of the implicit type suffixes for primitive and string types. This means that in QBasic, functions can not return records or arrays! To return a value, you simply assign it to the function name.

Here is another interesting deviation from the "norm": All arguments are passed by reference by default, including primitive types. This is the exception to "everything is a value type". It is also a solution to the problem of not being able to return records and arrays: you simply pass them in by reference.

SUB foo(i AS INTEGER, a() AS INTEGER, s AS STRING, p AS Point2d)
   i = 10
   a(1) = 20
   s = "Hello"
   p.x = 30
END SUB

DIM i AS INTEGER
DIM a(1 to 10) AS INTEGER
DIM s AS STRING
DIM p AS Point2d

foo(i, a, s, p)

PRINT i; a(10); s; p.x
' Output: 10 20 Hello 30

The BYVAL and BYREF keywords can be used to specify how arguments should be passed explicitely, except for array parameters, which are always passed by-reference. That makes sense, as passing an array by value could be a costly operation.

Speaking of arrays: array parameters are always dimensionless. Bounds and dimensionality checks are performed at runtime.

Error handling

QBasic has a somewhat rustic way of raising and handling errors. The ERROR statement raises an error, identified by a (user-defined) error code. If no error handler is set, the program terminates. To set the global error handler, the ON ERROR GOTO label|lineNumber statement is used. If an error occurs, the runtime jumps to the module-level line specified after GOTO, which can decide how to proceed. The error handler has access to a few globals, most notably ERR, which holds the error code, and ERL, which holds the line at which the error was raised. The error handler can either exit the program, resume at the next statement after the statement that raised the error, or jump to a label/line number. Here's how that looks like:

SUB foo ()
   ERROR 123 ' raise an error
   PRINT "Resumed from error handler."
END SUB

ON ERROR GOTO ErrorHandler

CALL foo()
b% = 0
a% = 1 / b% ' Division by zero raises ERR 11

ErrorHandler:
IF ERR = 123 THEN
   PRINT "Oh no, an error occured, let's ignore it."
   RESUME NEXT
ELSE
   PRINT "Unhandled error"; ERR; "at line "; ERL
   END
END IF

' Output:
' Oh no, an error occured, let's ignore it.
' Resumed from error handler.
' Unhandled error 11 at line 10

Not the most sophisticated of error handling mechanisms, and a bit icky as all errors (including those of modules) need to be handled in a single place. Still better than a segmentation fault!

Memory management

QBasic has automatic memory management. For a user, that's the end of the story (safe for standard library functions like CLEAR that smoke any and all variables for cases where memory gets tight). Under the hood, things are more interesting

Primitive type values and record values work the same: they are allocated in the program's data segment (module-level code) or on the stack (by-value arguments, function/procedure-level variables). Module-level values of these types never get reclaimed. Stack allocated values are cleaned up as soon as a function/procedures frame is popped from the stack. This may partially explain why records can not contain dynamic strings or arrays, as these are handled differently (Records can contain static strings which are just embedded, fixed length arrays of bytes without additional meta-date, and hence easy to reclaim).

Both dynamic strings and dynamic arrays allocate their data on the heap internally. In addition to a pointer to that heap data, they also keep track of the data's length for bounds checking.

An array's heap data is uniquely owned by its variable. Since we can't assign one array to another, the ownership can not be transferred. Since arrays can only be passed by reference, and functions can not return an array, the invariant still holds (think this through, it's fun!). Reclaiming dynamic arrays is thus simple: if the array variable goes out of scope, free the heap memory. The second reclamation point is when an array is dynamically resized via the REDIM statement: the old heap data gets freed, and a new heap block is allocated an assigned to the array variable. The invariant still holds!

Strings require a more complex reclamation strategy, as strings can be assigned to each other. In addition to the heap pointer and length, QBasic also stores a reference count for string heap data. Since strings are immutable, their heap data can be shared by multiple string variables. When one dynamic string is assigned to another dynamic string variable, the heap data reference count of the old string data is decreased, and if no other string variable references it, reclaimed. Conversely, the heap data reference count of the assigned string is increased, and the assigned-to string variable points to that heap data going forward. The two strings now share the same heap data.

As it turns out, QBasic does actually work with internal reference types for dynamically sized types. It also achieves multiple-ownership semantics by reference counting for string data. Without the user noticing.

Note: The information in this section has been pieced together from bits of information scathered around the web and inspecting the assembly output of the QuickBasic compiler. There may be some inaccuracies, but the basic principles should hold.

Design consequences

In my opinion, the biggest consequence of QBasic's design is that there are no aliases, (except when passing arguments by reference). This is a side-effect of not having reference types (again, ignoring arguments passed by reference). It's both a blessing and curse. What's aliasing, you ask?

Aliasing occurs when a single value is referred to by two or more symbols in the program, e.g. two variables, or a field and a variable, or through two array elements, and any other combination you can think of. It's very common in languages that have pointers or references. Consider this (pseudo) Java snippet:

class A { B b = new B(); }
A a = new A();
foo(Arrays.asList(a.b));

The B instance escapes the unique ownership of A by being also put into a list. Worse, that list is passed to a method foo(), and god knows what that does with our poor B (e.g. create more references to our B). Anything in the system can potentially modify B, which makes reasoning about state hard.

Having no aliases is a blessing, because a system without aliasing simplifies reasoning about state and state mutators. This is not only true for humans, but also compilers, which have a much easier time optimizing code without aliases. In case of QBasic, it also makes automatic memory management a lot simpler: since there can not be reference cycles, simplistic reference counting is sufficient, and only required for types that can't be allocated on the stack, e.g. dynamic arrays or strings.

On the other hand, having no aliasing is also a curse. Quite a few fundamental data structures like graphs are cumbersome to implement without aliases. In a system without reference types, we can't have recursive types. Without recursive types, we can't do something like a binary tree as easily as this:

// A binary tree node stores references to its left
// and right children via a reference, with null signaling
// "no reference".
class Node {
   String value;
   Node left;
   Node right;

   public Node (String value) { this.value = value; }
}

// Creating a tree is easy
Node root = new Node("root");
root.left = new Node("root.left");
root.right = new Node("root.right");
root.right.left = new Node("root.right.left");

// Traversing the tree is simple as well
public void traverse (Node node) {
   if (node == null) return;
   System.out.println(node.value);
   traverse(node.left);
   traverse(node.right);
}

Here's the equivalent in QBasic:

' A binary tree node stores references to its left
' and right children via an index into an array, with 0 signaling
' "no reference".
TYPE Node
   value AS STRING * 40
   left AS INTEGER
   right as INTEGER
END TYPE

' Creating a tree is cumbersome
DIM nodes(1 TO 4) AS Node
nodes(1).value = "root"
nodes(1).left = 1
nodes(1).right = 2
nodes(2).value = "root.left"
nodes(3).value = "root.right"
nodes(3).left = 4
nodes(4).value = "root.right.left"

' Traversing a tree is slightly more cumbersome
SUB traverse (nodeIndex AS INTEGER, nodes() AS Node)
   if (nodeIndex = 0) RETURN
   PRINT nodes(nodeIndex).value
   traverse(nodes(nodeIndex).left, nodes)
   traverse(nodes(nodeIndex).right, nodes)
END SUB

We have built our own "reference system" by abusing array indices as aliases. Rust suffers from a similar deficiency, provided you want to stick to idiomatic Rust. We can also see that in a system without reference types, it's impossible to "extract" sub elements of arrays or records into the local scope through an alias.

DIM node AS node
node = nodes(1) ' Let's modify the root!
node.value = "Hi!"
PRINT nodes(1).value ' Oh no, we modified a copy
' Output: root

But it's not all bleak. Another consequence of "everything is a value type" is the absence of the billion dollar mistake. There is no null in QBasic. Sadly, it also lacks a mechanism to signal optionality of a value.

Key elements

If you read this far, ping me on Twitter to receive your price! Time to summarize my findings.

Target audience, prerequisits, motivators

I did not jump head first into programming, but had exposure to basic computing first. From that, I derrive the following basic skill set, which implicitely defines the target audience of Wee. A novice user of Wee:

Has a specific goal that can be achieved through programming.
Knows basic arithmetic and can read and write fluently.
Has a personal computer.
Knows how to write on a keyboard and how to use a mouse or touchpad.
Knows what files and folders are.
Knows that programs exist and serve different purposes (but not necessarily how they work).
Knows how to use a browser.
Extremely basic proficiency in English.

The requirement of owning a personal computer may look a bit weird when phones and tablets are ubiquitous. But I do not think a tiny phone display and touch typing lend themselves well to learning to program. The limited screen space requires excessive mode switching. Worse, on-screen keyboards cover 30-50% of the available screen space, leaving less room to display helpful information. It is simply not comfortable to program on a mobile device (including tablets).

The basic English proficiency requirement stems from the fact, that I do not plan on localizing keywords and sample programs to other languages. My super scientific Twitter poll seems to indicate that localizing the programming language probably does more harm than good. Information in the software world is pre-dominantely transported in English, so we might as well get novices on track early. Learning materials, comments in programs, and possibly community sections should of course be localized.

Of course, some of these requirements can be relaxed if the user is accompanied by a person that can compensate missing skills on the spot.

Learning materials

When I started to program, I sourced my learning materials from various places, which was not very effective. The materials also differed in kind. The rough sketch for Wee:

Remove the language barrier as much as possible, by localizing all learning materials.
Provide a centralized hub for all learning materials, including instructions on where and how to get help.
Provide guided learning materials in form of interactive tutorials, that introduce users to
- A high-level overview of programming and what it can be used for.
- The notional machine model used by Wee, using an interactive visualization.
- How to navigate and use the basic facilities of the programming environment.
- The Wee programming language, using relatable problems to motivate the introduction of a language feature.
Provide explorative learning materials in form of small programs, using descriptive naming and comments, which users can modify and remix.
Provide reference materials integrated in the programming environment, including small, self-contained examples that can be copied, pasted and run immediately.
Allow users to publish their own materials (articles, remix-able programs, comment sections on programs, ...) on the centralized hub.
Pointers to "where to go from here" for advanced users that want to explore professional tools.

Programming environment

I think QBasic is still a pretty good baseline when it comes to introductory programming environments. I want to update that experience with Wee:

A familar UI/UX that avoids information overload, invites exploration through simplicity, and lacks ceremony.
A great on-boarding experience with instant gratification, even for people that do not read the fine manual.
A contextual help system, worded in a beginner friendly way, with references to other learning materials.
Minimal context switching.
Immediate feedback on errors, with error messages for humans.
A capable debugger.
Code navigation, completion and refactoring facilities.
Collaborative editing.
Source control management.
Simple mechanism to share programs and libraries.

Again, much of this is fuzzy. I'll figure out the details once I get there. This includes the choice of platform. Currently, I gravitate towards the browser, for multiple reasons. I can more easily distribute fixes and additions to the platform and learning materials. It is more conductive to community aspects, including sharing. It's (mostly) cross-platform. There are of course also quite a few downsides. Not everyone is connected to the internet all the time. Getting a complex application like this to work across browsers is a bit of a nightmare. Finally, as a user, I might not want to leave my code with a 3rd party.

An alternative to the browser would be a Visual Studio Code plugin. I would have to re-invent less, and the user would work with a professional tool. But I also could not control the experience fully, and community features are harder to integrate.

Programming language

QBasic was far from perfect, but I think some of its features and limitations were actually conductive to me learning to program. I already have a pretty good idea of what Wee's language design will look like. It still needs to pass the Michael smell test, so I post-pone going into details until the next article. I can however present some basic elements:

Statically typed structured programming language.
Support for modules that can be used as libraries.
Namespaces, one per module.
Automatic memory management (but without garbage collection).
(Almost) Everything is a value type.
- unit.
- Primitive types (boolean, int, float, string, functions).
- Lists and dictionaries.
- Non-recursive records.
- Reference type that can not escape into parent scopes (function arguments, local variable references).
Default value initialization.
List, dictionary and record literals.
No implicit narrowing/widening/conversions.
Standard set of arithmetic, logical and relational operators, with standard precedence.
Assignment operator deep copies (including arrays).
Control flow constructs (if, for (each), while, match).
Functions
- Immutable pass-by-reference by default.
- Pass-by-value and mutability argument modifiers.
- Can return any type or unit.
- Named arguments.
- Overloading on argument types.
No aliasing except for reference function arguments.
No null.
Error handling TBD (possibly unchecked exceptions).
Standard library (strings, i/o, graphics, audio).

While you can already start the bike-shedding, you may want to wait for a future article where I'll dive into the details. I have not decided on the runtime yet. I think for V1, I'll write a byte code interpreter for which multiple implementations can be written. Since I'm likely going to target the web first, I might look into WebAssembly as an alternative target. Ideally, Wee could be embedded anywhere.

Community

The final ingredient is a thriving community. In the old days, various scathered websites and forums as well as IRC channels served as community hubs. For Wee, I'd like this to be a bit more centralized, integrating it with the learning materials and the programming environment:

Users can create an account and get access to the below features.
Server-side storage and versioning of programs.
Sharing and remixing of programs.
Collaborative editing.
Forums.
Let users publish articles, dev logs, etc. (could be part of forums).
Comment sections for programs and learning materials.

Funny as it may sound, I feel the least confident in building this part. There is also a massive can of worms to be opened on the legal side of things, especially when kids are involved.

Next up

Now that I have an idea of what I want to achieve, I can start with the actual work. The next step is finishing the design of the language. Once that's complete, I'm going to share it here so we can all have a nice round of bike-shedding.

As always, I'm happy to hear from you on Twitter and appreciate corrections to my terrible writing in form of a pull request. If you are interested in helping out with Wee, or want to give feedback that does not fit into a tweet storm, you can reach me via email.