# Rendering like it's 1996 - Rectangles

December 07, 2022 Cool kids draw their rectangles themselves.

To follow along this blog post with running code, make sure you've installed the prerequisites. Then:

``````git clone https://github.com/badlogic/r96
cd r96
git checkout 02-dont-be-square
code .
``````

Last time, we set up our development environment and explored concepts like pixels, colors, and rasters. Today, we're going to figure out how to draw rectangles. Exciting!

But first, we'll do some housekeeping.

## Putting more stuff into r96.h/r96.c

Let's stuff a few things into `src/r96/r96.h` and `src/r96/r96.h`, where re-usable code used by the demo apps lives.

### Destructuring colors

Creating colors from alpha, red, green, and blue components is done via `R96_ARGB()`. Sometimes we want to do the reverse and extract individual color components:

``````#define R96_A(color) ((uint8_t) (color >> 24))
#define R96_R(color) ((uint8_t) (color >> 16))
#define R96_G(color) ((uint8_t) (color >> 8))
#define R96_B(color) ((uint8_t) (color))``````

Which can be used like this:

``````uint32_t color = R96_ARGB(255, 128, 38, 4);
uint8_t alpha = R96_A(color); // 255
uint8_t red = R96_R(color);   // 128
uint8_t green = R96_G(color); // 38
uint8_t blue = R96_B(color);  // 4``````

### Stellar memory management

Remember this line from the previous demo apps?

``uint32_t *pixels = (uint32_t *) malloc(sizeof(uint32_t) * res_x * res_y);``

Direct call to `malloc()`, `sizeof()`, casting, yuck! Let's wrap allocation in a few simple macros:

``````#define R96_ALLOC(type) (type *) malloc(sizeof(type))
#define R96_ALLOC_ARRAY(type, numElements) (type *) malloc(sizeof(type) * numElements)
#define R96_FREE(ptr) free(ptr)``````

`R96_ALLOC(type)` allocates one instance of `type`. `R96_ALLOC_ARRAY(type, numElements)` allocates `numElements` sequential instances. To clean-up, we call `R96_FREE(ptr)`. E.g.:

``````typedef struct rectangle {
float x, y, width, height;
} rectangle;

rectangle *rect = R9_ALLOC(rectangle);
R96_FREE(rect);
uint32_t *pixels = R9_ALLOC_ARRAY(uint32_t, 320 * 240);
R96_FREE(pixels);``````

Apart from a little less typing, these macros let us replace our allocator, should we want that down the road.

### A struct for (raster) images

Keeping track of the width, height, and pixels of rasters is annoying, as we saw in the previous demo apps. The term "raster" is also a bit too "scientific". Colloquially, what we are dealing with are images.

Note: if we wanted to be super precise, we'd have to say "raster images". Images can also be represented in other ways, e.g. as a list of shapes with color information, aka vector images. In this series however, we'll only deal with raster images. Raster, raster image, and image are all synonymous throughout the series.

Here's a struct to ease the pain:

``````typedef struct r96_image {
int32_t width, height;
uint32_t *pixels;
} r96_image;``````

Many of the algorithms we'll build work using signed integers. We thus store `width` and `height` as signed integers to avoid some casting.

We add an initializer function that sets `width` and `height` and allocates memory for the pixels:

``````void r96_image_init(r96_image *image, int32_t width, int32_t height) {
assert(width > 0);
assert(height > 0);

image->width = width;
image->height = height;
image->pixels = R96_ALLOC_ARRAY(uint32_t, width * height);
}``````

The heap-allocated memory also needs to be freed at some point, so we add a disposal function:

``````void r96_image_dispose(r96_image *image) {
R96_FREE(image->pixels);
}``````

Creating raster images has never been easier:

``````r96_raster raster;
r96_raster_init(&raster, 320, 240);
r96_raster_dispose(&raster);

r96_raster *heapRaster = R96_ALLOC(r96_raster);
r96_raster_init(heapRaster, 64, 64);
r96_raster_dispose(heapRaster);
R96_FREE(heapRaster);``````

As illustrated by the snippet above, we can initialize and dispose both stack and heap allocated `r96_image` instances. Which leads me into a little excursion on memory and resource life-time management.

### A word on memory and resource life-time management

In C, we don't have the luxury of a borrow checker like in Rust, or a garbage collector like in Java. We don't even have RAII, like in C++. Heap allocated memory blocks and other resources, like file handles, must be carefully managed.

C doesn't really give us the tools to build water-tight resources management. Instead, we must come up with rules, that make it less likely that we'll shoot ourselves in the foot. From these rules we derive an API design which should help enforce the rules.

Our 4 rules are:

• There are no naked resources (e.g. a pointer to heap allocated memory)
• A resource is owned by a single instance of a resource-owning type (e.g. heap memory storing pixels is owned by a single `r96_image` instance)
• Prefer stack allocation over heap allocation
• Prefer value types over reference types

While we won't go for a full "handles instead of pointers" system, we'll try to make our code as pointer- and allocation-free as possible.

How does this translate to the `r96` API?

1. Resource-owning types are initialized and disposed with corresponding initializer and disposal functions, e.g. `r96_image`, `r96_image_init()`, and `r96_image_dispose()`
2. Resource-owning types are always passed and returned by reference
3. Non-resource-owning types are initialized with C99 designated initializers
4. Non-resource-owning types are passed to and returned from functions by value, unless performance considerations make it prohibitive

The first two rules make it less likely that two instances of a resource-owning type point to the same resource. The initialization and disposal functions give us a single location where a resource of a specific type is acquired and released, which helps with debugging.

The latter two rules make it more likely that we store much of our data on the stack instead of the heap (or as value types instead of reference types). Stack allocation is faster, and we don't need to remember to free anything if a stack allocated thing goes out of scope.

We'll try to stick with this until everything falls apart.

### Reading and writing pixels, the safe way

Previously, we've calculated pixel addresses manually. Let's wrap that functionality up in functions for setting and getting pixel colors on and from an `r96_image`:

``````void r96_set_pixel(r96_image *image, int32_t x, int32_t y, uint32_t color) {
if (x < 0 || x >= image->width || y < 0 || y >= image->height) return;
image->pixels[x + y * image->width] = color;
}

uint32_t r96_get_pixel(r96_image *image, int32_t x, int32_t y) {
if (x < 0 || x >= image->width || y < 0 || y >= image->height) return 0;
return image->pixels[x + y * image->width];
}``````

Neither `r96_set_pixel()` nor `r96_get_pixel()` can assume that coordinates given to them are within the image bounds. We thus ensure that's the case. This is a form of clipping. Clipping will be a permanent, very annoying companion of ours throughout this journey.

Note: the largest x-coordinate that's still within the bounds of the image is `width - 1`, the largest y-coordinate is `height - 1`. Missing that `-1` is a regular source of errors.

These functions are slower than manually calculating a pixel address, as they do more work. They'll still come in handy when we implement naive versions of more complex rendering algorithms. Once the naive implementation is working, we usually replace calls to these functions with inlined versions, such that we avoid per pixel clipping and possibly the multiplication in the address calculation.

### Demo app: drawing pixels, again

Our new fancy API is put to use in the demo app `02_image.c`.

``````#include <MiniFB.h>
#include <stdio.h>
#include "r96/r96.h"

int main(void) {
r96_image image;
r96_image_init(&image, 320, 240);
struct mfb_window *window = mfb_open("02_image", image.width * 2, image.height * 2);
do {
for (int i = 0; i < 200; i++) {
int32_t x = rand() % image.width;
int32_t y = rand() % image.height;
uint32_t color = R96_ARGB(255, rand() % 255, rand() % 255, rand() % 255);
r96_set_pixel(&image, x, y, color);
}

if (mfb_get_mouse_button_buffer(window)[MOUSE_BTN_1]) {
int32_t mouse_x = mfb_get_mouse_x(window);
int32_t mouse_y = mfb_get_mouse_y(window);
uint32_t color = r96_get_pixel(&image, mouse_x, mouse_y);
printf("(%i, %i) = { alpha: %i, red: %i, green: %i, blue: %i }\n", mouse_x, mouse_y, R96_A(color), R96_R(color), R96_G(color), R96_B(color));
}

mfb_update_ex(window, image.pixels, image.width, image.height);
} while (mfb_wait_sync(window));
r96_image_dispose(&image);
return 0;
}``````

In lines 6-7, we create a new `r96_image` instance and initialize it to have 320x240 pixels. We then create a window with twice the size of our image. MiniFB will scale our 320x240 pixels image up to the 640x480 window size in `mfb_update_ex()`. This way, it's easier to see what's going on.

The pixel rendering in line 14 now uses `r96_set_pixel()` instead of manually calculating the pixel address.

To exercise the new `r96_get_pixel()` function and the color component macros, we print the current color at the mouse position if the left mouse button is pressed (lines 17-22).

Finally, we dispose of the image in line 26.

Have a look at `02_image.html` to see the magic that gets the `printf()` output into the div below the canvas.

### Clearing an image

In real-time graphics, we often clear the "screen" before we render a new frame. So far, we didn't have a need for this age old tradition, but we'll likely need it in the future.

In our case, the "screen" is the `r96_image` we draw our pixels to, which we later pass to `mfb_update_ex()` to have it displayed in a window.

Here are two functions to clear a `r96_image` with a specific color.

``````void r96_clear(r96_image *image) {
memset(image->pixels, 0x0, image->width * image->height * sizeof(uint32_t));
}

void r96_clear_with_color(r96_image *image, uint32_t color) {
uint32_t *pixels = image->pixels;
for (int i = 0; i < image->width * image->height; i++)
pixels[i] = color;
}``````

`r96_clear()` covers the common case of setting all pixels of an image to black. It uses `memset()`, which is heavily optimized and usually beats a hand-rolled loop.

`r96_clear_with_color()` lets us clear with an arbitrary color. As we loop through pixels individually, it will likely be quite a bit slower than the `memset()` based alternative above.

But how much slower?

### Demo app: profiling and optimizing `r96_clear_with_color()`

While you can make some assumption about the performance of your code just by looking at it, you must always measure to ensure your assumptions hold. Profiling code is an art in itself, especially as your code becomes more complex.

Luckily, profiling `r96_clear()` and `r96_clear_with_color()` is comparatively trivial. All we need is a way to measure how much time has passed. MiniFB provides a high precision timer we can use for that purpose.

Note: At least we keep it trivial. Usually, you'd have to factor in things like what apps are running on your benchmarking machine, cold start vs. warm start, averaging timings over `n` runs, and so on. However, we aren't really going for absolute timing values here, but merely compare relative improvements after code changes. This simplistic approach is good enough for that. Also, use a proper profiler, like Superluminal.

The command line demo app called `03_clear_profiling.c` tests how fast `r96_clear()` and `r96_clear_with_color()` are respectively:

``````#include <MiniFB.h>
#include <stdio.h>
#include "r96/r96.h"

int main(void) {
const int num_iterations = 200000;
r96_image image;
r96_image_init(&image, 320, 240);

struct mfb_timer *timer = mfb_timer_create();
mfb_timer_reset(timer);
for(int i = 0; i < num_iterations; i++)
r96_clear(&image);
printf("r96_clear()                 %f secs\n", mfb_timer_delta(timer));

mfb_timer_reset(timer);
for(int i = 0; i < num_iterations; i++)
r96_clear_with_color(&image, 0x0);
printf("r96_clear_with_color()      %f secs\n", mfb_timer_delta(timer));

r96_image_dispose(&image);
return 0;
}``````

We create an image with 320x200 pixels and a timer. As a base-line, we measure how long it takes to clear the image `200000` times using `r96_clear()`.

Then we measure how long it takes to do the same with `r96_clear_with_color()`.

We always profile our code using an optimized release build. Here's what I get on my machine:

``````r96_clear()                 0.573491 secs
r96_clear_with_color()      4.833749 secs``````

That's absolutely not great! We assumed that manually looping through and setting the pixels is a bit slower than `memset()`. But 8.5x times slower? What is going on here?

Just looking at the code of `r96_clear_with_color()` doesn't really give us any hints how to improve it, so let's look at the (manually annotated) assembly code the compiler (in this case Clang targeting x86_64 on Linux) spits out for the function next to the original C code.

Note: while you can get the assembly output straight from your compiler, I tend to use the Godbolt Compiler Explorer, an online C/C++ compiler and disassembler. Enter C/C++ code in the left side editor and you'll see the assembly output generated by the chosen compiler + flags combination on the right side. You can even share your sketches, so others can help improve things easily, without needing to set up a build environment.

``````void r96_clear_with_color(r96_image *image, uint32_t color) {
uint32_t *pixels = image->pixels;
for (int i = 0; i < image->width * image->height; i++)
pixels[i] = color;
}``````
``````r96_clear_with_color:                        # rdi = image, esi = color
mov     eax, dword ptr [rdi + 4]     # eax = image->width
imul    eax, dword ptr [rdi]         # eax = image->height * image->width
test    eax, eax                     # eax == 0?
jle     .LBB1_3                      # yes, exit
mov     r8, qword ptr [rdi + 8]      # r8 = pixels = image->pixels
xor     ecx, ecx                     # ecx = i = 0
.LBB1_2:
mov     dword ptr [r8 + 4*rcx], esi  # pixels[i] = color
inc     rcx                          # i++
movsxd  rdx, dword ptr [rdi]         # rdx = image->width
movsxd  rax, dword ptr [rdi + 4]     # rax = image->height
imul    rax, rdx                     # rax = image->width * image->height
cmp     rcx, rax                     # i == image->width * image->height?
jl      .LBB1_2                      # no, next pixel
.LBB1_3:
ret``````

`image` is passed to the function in `rdi`, the color is passed to the function in `rsi`.

Lines 2-5 multiply `image->height` (`[rdi + 4]`) by `image->width` (`[rdi]`), then check if the result is zero. If that's the case, we jump to `.LBB1_3` and exit the function. This is an early rejection test.

If the initial check passes, then `image->pixels` is loaded into the local variable `pixels` (`r8`) in line 6. In line 7, `i` (`ecx`) is set to zero. We are ready to iterate.

In line 9, we set `pixels[i]` (`[r8 + 4*rcx]`) to the `color` (`esi`). Then we increment `i` (`rcx`) in line 10.

In line 11 we load `image->width` (`[rdi]`) into `rdx`, followed by loading `image->height` (`[rdi + 4]`) into `rax` in line 12, and multiply them in line `13`. The result ends up in `rax`.

`rax` is then compared to `i` (`rcx`). If it is less than `image->width * image->height` (`rax`), we jump back to `.LBB1_2` and write the next pixel, otherwise we fall through and exit the function.

Did you notice the problem?

Our loop condition `i < image->width * image->height` has been compiled to some rather sub-optimal machine code. Instead of calculating `image->width * image->height` once, it is calculated for every loop iteration! Not only does this mean the `width` and `height` fields of the image are fetched from memory every iteration, we also have an integer multiplication per iteration. That's not great! Here's what Stefan of Live++ fame has to say on the bird site on why the compiler is incapable of hoisting the loop invariant out of the loop. Would you like to know more?

Let's fix this by manually precalculating `image->width * image->height`, thereby helping the compiler out a little:

``````void r96_clear_with_color(r96_image *image, uint32_t color) {
uint32_t *pixels = image->pixels;
for (int i = 0, n = image->width * image->height; i < n; i++)
pixels[i] = color;
}``````

Before looking at the assembly code, let's run this and see if we improved things:

``````r96_clear()                 0.633761 secs
r96_clear_with_color()      0.761754 secs``````

Much better. We are now in the same ballpark as the `memset()` based `r96_clear()`. What did the compiler do? Hold on to your butts:

``````r96_clear_with_color:                   # @r96_clear_with_color
mov     ecx, dword ptr [rdi + 4]
imul    ecx, dword ptr [rdi]
test    ecx, ecx
jle     .LBB1_12
mov     rax, qword ptr [rdi + 8]
mov     r9d, ecx
cmp     ecx, 8
jae     .LBB1_3
xor     edx, edx
jmp     .LBB1_11
.LBB1_3:
mov     edx, r9d
and     edx, -8
movd    xmm0, esi
pshufd  xmm0, xmm0, 0                   # xmm0 = xmm0[0,0,0,0]
lea     rcx, [rdx - 8]
mov     rdi, rcx
shr     rdi, 3
inc     rdi
mov     r8d, edi
and     r8d, 3
cmp     rcx, 24
jae     .LBB1_5
xor     ecx, ecx
jmp     .LBB1_7
.LBB1_5:
and     rdi, -4
xor     ecx, ecx
.LBB1_6:                                # =>This Inner Loop Header: Depth=1
movdqu  xmmword ptr [rax + 4*rcx], xmm0
movdqu  xmmword ptr [rax + 4*rcx + 16], xmm0
movdqu  xmmword ptr [rax + 4*rcx + 32], xmm0
movdqu  xmmword ptr [rax + 4*rcx + 48], xmm0
movdqu  xmmword ptr [rax + 4*rcx + 64], xmm0
movdqu  xmmword ptr [rax + 4*rcx + 80], xmm0
movdqu  xmmword ptr [rax + 4*rcx + 96], xmm0
movdqu  xmmword ptr [rax + 4*rcx + 112], xmm0
jne     .LBB1_6
.LBB1_7:
test    r8, r8
je      .LBB1_10
lea     rcx, [rax + 4*rcx]
shl     r8, 5
xor     edi, edi
.LBB1_9:                                # =>This Inner Loop Header: Depth=1
movdqu  xmmword ptr [rcx + rdi - 16], xmm0
movdqu  xmmword ptr [rcx + rdi], xmm0
cmp     r8, rdi
jne     .LBB1_9
.LBB1_10:
cmp     rdx, r9
je      .LBB1_12
.LBB1_11:                               # =>This Inner Loop Header: Depth=1
mov     dword ptr [rax + 4*rdx], esi
inc     rdx
cmp     r9, rdx
jne     .LBB1_11
.LBB1_12:
ret``````

The compiler basically generated a version of a highly optimized `memset()` implementation specifically for our setup at hand. We could spend an hour dissecting it to precisely understand what's going on. But why question the compiler's wisdom? The take away:

Always precalculate as much as you can for your inner loops, including values used in your loop condition. The compiler can not always figure out loop invariants.

## Drawing horizontal lines

With our housekeeping out of the way, it's time to get back to the actual goal of this post: drawing rectangles. But first, we need to learn how to draw horizontal lines. Wax on, wax off.

We define a horizontal line by a start point `(x1, y)` and an end point `(x2, y)`, where `x1 <= x2`. Since it's a horizontal line, the y-coordinate is the same for both points. The line covers all the pixels including and in-between the start and end point.

Note: Translating a "mathematical" representation of something to pixels on a raster is what's actually called "rasterization". An infinitely precise thing is boiled down to and approximated by a bunch of colored squares in a raster.

A few example horizontal lines on a small 16x12 raster.

Some of the lines are "problematic", as they aren't entirely inside the raster. Before we can draw their pixels, we need to clip them!

The two blue lines have y-coordinates `y < 0` and `y > height - 1`. Any horizontal line with a y-coordinate outside the raster can be entirely ignored.

The two red lines share a similar fate. They are outside the raster on the x-axis. Any horizontal line with either `x2 < 0` or `x1 > width - 1` can be discarded.

The two green lines are all good. Both their endpoints are inside the raster.

The pink lines are ... weird. They are partially inside the raster. How do we deal with them? We clip them! Meaning, we calculate new start and end points for the part of the line that is inside the raster.

For the left point, we snap `x1` to `0`. For the right point, we snap `x2` to `width - 1`. Here are all the pixels we actually have to draw.

Once we've clipped a line, we just need to draw all the pixels between and including its start and end points.

### Demo app: horizontal lines, naive version

Let's add a new demo app called `04_hline.c`:

``````#include <MiniFB.h>
#include <stdio.h>
#include "r96/r96.h"

void hline(r96_image *image, int32_t x1, int32_t x2, int32_t y, uint32_t color) {
if (x1 > x2) {
int32_t tmp = x2;
x2 = x1;
x1 = tmp;
}

if (x1 >= image->width) return;
if (x2 < 0) return;
if (y < 0) return;
if (y >= image->height) return;

if (x1 < 0) x1 = 0;
if (x2 >= image->width) x2 = image->width - 1;

for (int32_t x = x1; x <= x2; x++)
r96_set_pixel(image, x, y, color);
}

int main(void) {
r96_image image;
r96_image_init(&image, 320, 240);
struct mfb_window *window = mfb_open("04_hline", image.width * 2, image.height * 2);
struct mfb_timer *timer = mfb_timer_create();
do {
srand(0);
mfb_timer_reset(timer);
for (int i = 0; i < 200000; i++) {
uint32_t color = R96_ARGB(255, rand() % 255, rand() % 255, rand() % 255);
hline(&image, rand() % image.width, rand() % image.width, rand() % image.height, color);
}
printf("Took: %f\n", mfb_timer_delta(timer));

if (mfb_update_ex(window, image.pixels, image.width, image.height) < 0) break;
} while (mfb_wait_sync(window));
r96_image_dispose(&image);
return 0;
}``````

The `hline()` function first ensures that `x1 < x2` in lines 6-10. This is a precondition we need to enforce, otherwise our clipping rules won't work.

In lines 12-15, we reject all lines that are entirely outside the image, applying the rules we established above for the blue and red lines.

Lines that are partially inside the image get clipped in lines 17-18.

With all the clipping out of the way, we know that the start and end point (and the pixels between them) are entirely within the bounds of the image.

We finish the function by drawing all the pixels at and in between the start and end point in lines 20-21 using `r96_set_pixel()`.

In the `main()` function, we draw `200000` lines at random positions and measure how long it takes to do so. The call to `srand()` sets the seed for the function `rand()`. We'll always get the same sequence of "random" numbers that way.

Note: If we measure something to later improve it, we have to ensure that what we measured is the same each time, even if it involves "fake" randomness. If we didn't "fix" the randomness in this case, we may draw a different number of pixels each run, which would skew the results.

Here's the output from a release build on my machine:

``````Took: 0.032682
Took: 0.033261
Took: 0.032935
Took: 0.032707``````

Not a scientific benchmark, but good enough as a ballpark estimate. This time, we have no base-line to compare against. So, let's take this version of `hline()` and see if we can improve upon it!

### Demo app: horizontal lines, optimized

Looking real hard at the `hline()` implementation, the only thing that stands out is the call to `r96_set_pixel()`. What's that do again?

``````void r96_set_pixel(r96_image *image, int32_t x, int32_t y, uint32_t color) {
if (x < 0 || x >= image->width || y < 0 || y >= image->height) return;
image->pixels[x + y * image->width] = color;
}``````

Oh right, it actually does clipping and it calculates the address of the pixel, which adds an addition and multiplication to our inner loop in `hline()`. That's not good!

The `hline()` function already ensures that all pixels we are drawing are inside the raster. The extra clipping in `r96_set_pixel()` is thus unnecessary.

And once we know the address of the start pixel at `x1, y`, the address of each subsequent pixel on the line can be calculated by continuously incrementing the address by `1`. Remember: pre-calculate all loop invariants and do as little work in inner loops as possible!

Here's the optimized `hline()` version from `05_hline_opt.c`:

``````void hline(r96_image *image, int32_t x1, int32_t x2, int32_t y, uint32_t color) {
if (x1 > x2) {
int32_t tmp = x2;
x2 = x1;
x1 = tmp;
}

if (x1 >= image->width) return;
if (x2 < 0) return;
if (y < 0) return;
if (y >= image->height) return;

if (x1 < 0) x1 = 0;
if (x2 >= image->width) x2 = image->width - 1;

uint32_t *pixels = image->pixels + y * image->width + x1;
int32_t num_pixels = x2 - x1 + 1;
while(num_pixels--) {
*pixels++ = color;
}
}``````

We keep all the clipping from before, but are a bit smarter regarding our inner loop.

We precalculate the address of the pixel covering our start point at `(x1, y)` in line 16. We also precalculate the number of pixels we are going draw in line 17.

The inner loop then counts down the number of pixels we've drawn so far, while assigning the color to the current pixel and incrementing the address by one. Pretty tight. On my machine, I get these timings:

``````Took: 0.011275
Took: 0.011216
Took: 0.011437
Took: 0.011545``````

A nice improvement! Let's have some live lines to celebrate (click to start)

Note: `hline()` is as good as it gets, so it's been added as `r96_hline()` to `r96.c`. Going forward, we'll develop other rendering functions in the same way: build a demo app with a naive implementation, improve performance, and finally add it to the `r96` library once it's good enough.

## Drawing rectangles

We define a rectangle by its top left corner point `(x1, y1)` and its bottom right corner point `(x2, y2)`, where `x1 <= x2` and `y1 <= y2`. The rectangle covers all pixels `(x, y)`, where `x1 <= x <= x2` and `y1 <= y <= y2`.

Alternatively, we can define a rectangle by its top left corner point `(x1, y1)` and a positive `width` and `height` given in pixels. We can derive `(x2, y2)` from this information, as `x2 = x1 + width - 1` and `y2 = y1 + height - 1`.

Note: And there it is, again. Our good friend `-1`. Can you figure out why it's needed to calculate `x2` and `y2` from `x1`, `y1`, `width`, and `height`?

As with horizontal lines, it pays off to first check out the clipping cases:

This looks familiar! Much of the clipping we did for horizontal lines applies to rectangles as well. All we need to do is take `y2` into account. Let's go through the cases.

The two blue rectangles are above the top edge of the raster (`y2 < 0`) and below the bottom edge of the raster (`y1 > height - 1`). Any rectangles that fulfill either criteria are outside the raster and don't have to be drawn.

The two red rectangles have the same problem, but relative to the left and right raster edges. Any rectangle with either `x2 < 0` or `x1 > width - 1` can ignored. This is the exact same case as we had for horizontal lines.

The pink rectangles are the baddies again. Just like with their pink horizontal line counter parts, we'll need to clip them. If `x1 < 0` we snap it to `0`. If `y1 < 0` we also snap it to `0`. If `x2 > width - 1`, we snap it to `width - 1`. Similarly, if `y2 > height - 1`, we snap it to `height - 1`.

If a rectangle passes all the above tests, or has been clipped, then it is firmly inside the raster and can be drawn as is, like the green rectangle. Here's what we'll render for the above scene after clipping:

Knowing that a rectangle's corner points `(x1, y1)` and `(x2, y2)` are fully inside the raster after clipping makes rendering the rectangle trivial.

For each y-coordinate `y` between and including `y1` and `y2`, we render a horizontal line from `(x1,y)` to `(x2,y)`. And we already got a function for rendering horizontal lines!

### Demo app: drawing rectangles, naive version

Time for another demo app called `06_rect.c`:

``````#include <MiniFB.h>
#include <stdio.h>
#include "r96/r96.h"

void rect(r96_image *image, int32_t x1, int32_t y1, int32_t width, int32_t height, uint32_t color) {
if (width <= 0) return;
if (height <= 0) return;

int32_t x2 = x1 + width - 1;
int32_t y2 = y1 + height - 1;

if (x1 >= image->width) return;
if (x2 < 0) return;
if (y1 >= image->height) return;
if (y2 < 0) return;

if (x1 < 0) x1 = 0;
if (y1 < 0) y1 = 0;
if (x2 >= image->width) x2 = image->width - 1;
if (y2 >= image->height) y2 = image->height - 1;

for (int y = y1; y <= y2; y++)
r96_hline(image, x1, x2, y, color);
}

int main(void) {
r96_image image;
r96_image_init(&image, 320, 240);
struct mfb_window *window = mfb_open("06_rect", image.width * 2, image.height * 2);
struct mfb_timer *timer = mfb_timer_create();
do {
srand(0);
mfb_timer_reset(timer);
for (int i = 0; i < 200000; i++) {
uint32_t color = R96_ARGB(255, rand() % 255, rand() % 255, rand() % 255);
rect(&image, rand() % image.width, rand() % image.width, rand() % (image.width / 5), rand() % (image.height / 5), color);
}
printf("Took: %f\n", mfb_timer_delta(timer));

if (mfb_update_ex(window, image.pixels, image.width, image.height) < 0) break;
} while (mfb_wait_sync(window));
r96_image_dispose(&image);
return 0;
}``````

It's usually more comfortable to render a rectangle given its top-left corner point `(x1, y1)` and a width and height. That's what the `rect()` function expects.

It does a sanity check on the provided `width` and `height`: if they are `<= 0` we exit early. Drawing a zero width or zero height rectangle makes no sense.

Next, we calculate `x2` and `y2`, as it's easier to do clipping on the corner point representation of a rectangle.

We then run through all the clipping cases we discussed above and end up with our final `(x1, y1)` and `(x2, y2` corner points which we know are within the raster if we made it that far.

The `for` loop then iterates through all the y-coordinates the rectangle covers, and draws a horizontal line for each `(x1, y)` and `(x2, y)` using `r96_hline()`.

The `main()` function is unsurprising. We draw `200000` semi-random rectangles and time how long that takes. Have a demo!

Here are the timings on my machine for a release build:

``````Took: 0.029729
Took: 0.029665
Took: 0.029447
Took: 0.029449
``````

Can we make this faster?

### Inlining and link time optimizations

Just by looking at the code, we see that invoking `r96_hline()` isn't ideal. It performs clipping on the inputs, which we know are within the raster. Superfluous work. But let's ignore that for now, as the bulk of the work is done when setting the pixels in a row.

Another issue: unless the compiler has inlined the call to `r96_hline()` into `rect()`, we also incur a function call for each row of the rectangle we draw. Since `r96_hline()` lives in `r96.c` and we are calling it from `main.c` the compiler will be unable to inline the function, as those are 2 separate compilation units.

We can remedy this by moving the definition of `r96_hline()` from `r96.c` into `r96.h` and prefix the function with the keyword `inline`. Then I get:

``````Took: 0.027828
Took: 0.027713
Took: 0.027813
Took: 0.027851
``````

Indeed, looking at the assembly, the call was eliminated and `r96_hline()` was entirely inlined into `rect()`. But that's not a great solution, as most compilers will manage to inline `r96_hline()` everywhere, which bloats executable size, or may put pressure on the register allocator, thus actually slowing down our code.

As an alternative to gunking up our header files, we can use link time optimizations as supported by Clang, GCC, and MSVC. Let's update the CMakeLists.txt file to turn that on.

``````...

if(CMAKE_BUILD_TYPE MATCHES Release)
include(CheckIPOSupported)
check_ipo_supported(RESULT result)
if(result)
set(CMAKE_INTERPROCEDURAL_OPTIMIZATION TRUE)
endif()
endif()

...``````

This snippet checks if LTO is supported by the current compiler, and if so, enables LTO for all targets that are defined afterwards, if we are building a release build. Here are the new timings with LTO on:

``````Took: 0.025003
Took: 0.025114
Took: 0.025030
Took: 0.024995
``````

Not bad! The reason this performs better than the inlining done above is that basically all the code is inlined into `main()` directly. It was even able to remove some of the superfluous clipping checks of the inlined `r96_hline()`! Let's keep LTO on by default for all targets.

We can't always rely on LTO to do such a good job though. Let's fix this by getting rid of the call to `r96_hline()` ourselves.

### Demo app: rectangles, optimized

Here's what I came up with in the new demo app `07_rect_opt.c`:

``````void rect_opt(r96_image *image, int32_t x1, int32_t y1, int32_t width, int32_t height, uint32_t color) {
if (width <= 0) return;
if (height <= 0) return;

int32_t x2 = x1 + width - 1;
int32_t y2 = y1 + height - 1;

if (x1 >= image->width) return;
if (x2 < 0) return;
if (y1 >= image->height) return;
if (y2 < 0) return;

if (x1 < 0) x1 = 0;
if (y1 < 0) y1 = 0;
if (x2 >= image->width) x2 = image->width - 1;
if (y2 >= image->height) y2 = image->height - 1;

int32_t clipped_width = x2 - x1 + 1;
int32_t next_row = image->width - clipped_width;
uint32_t *pixel = image->pixels + y1 * image->width + x1;
for (int y = y1; y <= y2; y++) {
for (int i = 0; i < clipped_width; i++) {
*pixel++ = color;
}
pixel += next_row;
}
}``````

All the clipping code stays the same, we only replace our call to `r96_hline()` with something that's a bit less work per row.

In line 18, we calculate the clipped width. We need that in the next line to calculate `next_row`, which we'll use to calculate the address of the first pixel in the next row, once we are done setting all the pixels in the current row (line 26).

We also precalculate the address of the first `pixel` we are going to set the color of. It's the pixel of the top-left corner of the rectangle.

Then we start to loop through all rows of the rectangle, from `y1` to `y2`. For each row, we set each of its pixels to the `color`. That's one addition (`i++`), one memory store (`*pixel = color`) and one increment (`pixel++`) per pixel in a row. Pretty OK.

Once we've completed rendering the current row, we calculate the address of the first pixel of the next row by adding `next_row` to `pixel`.

I've omitted the `main()` function above, as it is essentially the same as the last demo app's `main()`.

On my machine, I get the following timings without LTO:

``````Took: 0.023962
Took: 0.023914
Took: 0.023863
Took: 0.023713
``````

With LTO:

``````Took: 0.024675
Took: 0.024555
Took: 0.024480
Took: 0.024596
``````

Pretty much a toss up, though it seems the non-LTO version has a slight edge. Compared to the first naive implementation, we shaved off a bit of overhead by removing the function call and clipping. But since those are per row costs, we don't gain as much, as the bulk of the work is done per pixel.

It is nice, however, that we can match LTO performance with just a simple modification of the code that will alsow work with compilers without LTO support!

Since this is as good as it gets, `rect_opt()` has been promoted `r96.c` and shall henceforth be known as `r96_rect()`.

## Again, Mario, REALLY?

Welp, I guess I can't write short blog posts. But we learned a lot again! I'm undecided what we'll look into next time. Either drawing lines, or loading and drawing images. It's going to be fun none the less.

Discuss this post on Twitter or Mastodon.