Shakyboi - Part 2: Class lookups, tests, and class file parsing

February 24, 2021

Shake it like a polaroid picture.

Shakyboi - Part 2: Builds, Class lookups, tests, and class file parsing

Last time I proposed to write a (partial) replacement for ProGuard for Very Good Reasons (tm). We had a superficial look at what tree shaking is, how it applies to Java classes, and how we could get the data needed to perform the shaking.

All of that was very superficial and didn't contain any meat-y code. This installement will have some meat-y code, but will not yet describe the implementation of the actual tree shaking. Instead, we'll walk through the scafolding needed to get us to a place where we can implement tree shaking. Here's what we'll need.

Build: Everybody loves Gradle, so I'll be using Maven. We'll set it up such that it spits out a fat .jar that can be run from the CLI, and a way to run tests. For good measure, we'll throw in the incantations to be able to publish things to Maven Central via SonaType eventually, i.e. if we create plugins to use shakyboi via Gradle or Maven.
Tests: Running shakyboi via the CLI is the end goal, but for development purposes we need a bit of structure and repeatability. We'll use JUnit 5, like the good JVM citizens we are.
Class lookup: A way to find .class files based on class names like java/lang/Object and read their raw bytes. The files can come from a directory, a .jar file, or a Java Runtime Image, like the one in the JDK/JRE we use to run our shakyboi.
.class file parser: We'll need to figure out all classes a class under inspection references to create our dependency graph. For that, we need to parse raw .class files. While we could use ASM, we'll write our own class file parser based on the class file format specification. Waste of time? Maybe, but a fun exercise and it forces us to understand the JVM a little better.

Building and testing

Not much to say here. Have a look at the full pom.xml. It's a barebone (yes, really...) build descriptor that will allow us to:

mvn test, to execute all unit tests.
mvn package, to package the fat jar, which will be output to shakyboi/target/shakyboi.jar
mvn deploy, to deploy a snapshot build to SonaType (Mario only).
mvn release:prepare && maven release:deploy, to deploy a release version that gets synced with Maven Central (Mario only).

The pom.xml file also lets us import the project into the IDE of our choice. For this little project I'll deviate from my usual choice of Eclipse (fight me) and use IntelliJ IDEA.

Looking up classes

A user of shakyboi is expected to feed it a bunch of .class files that make up their app and that should be tree shaken. Let's call these classes app classes.

The app classes may depend on other classes that should not be shaken, e.g. classes from the Java runtime class library like java/util/List. For tree shaking, we still need to at least verify they exist, so we need a way to look them up as well. Let's call these classes bootstrap classes.

Classes can come from different sources, but we don't particularly care about the details. Instead, we generalize via the ClassLookup interface and create implementations for different source types. The basic interface is simple:

package io.marioslab.shakyboi.lookup;

/**
 * A class lookup provides the raw bytes of a class fetched from
 * some place.
 */
public interface ClassLookup {
    /**
     * Looks up the class with the given name and returns its
     * .class file as a byte array.
     *
     * @param name the binary class name, e.g. "java/lang/Object". @see jvms-4.2.1.
     * @return the classes bytes or null.
     * @throws RuntimeException in case an unrecoverable error happened.
     */
    byte[] findClass(String name);
}

for historical reasons, the class file format requires package names to be delimited by / instead of .. Thus, java.lang.Object turns into java/lang/Object. This is called the binary class name as per the specification.

To avoid conversions from and to the binary class name format, we'll use the binary class name format throughout the shakyboi code. Only user facing code will use the dot notation, e.g. java.lang.Object

.class files of app classes usually come from either a directory or a .jar file. These two sources are covered by DirectoryClassLookup and JarClassLookup.

For unit tests, we may want to load classes at runtime from the JVM classpath, for which we can use ClassLoaderClassLookup. It can be used to load classes that belong to the tests (app classes), as well as classes from the Java runtime (bootstrap classes) and makes writing self-contained tests easy.

Until and including Java 8, a JDK or JRE would ship all JRE classes in a single file called lib/rt.jar. This changed with the introduction of modules in Java 9. The JDK or JRE now ships with file called lib/modules containing modules like java.base, which in turn contain their respective .class files. To access .class files in such a module image, we can use JrtImageClassLookup. The current implementation looks up the modules file in the home directory of the JDK or JRE that is currently executing shakyboi.

Multiple class lookups can be chained via a CombinedClassLookup. That should cover pretty much all of our class lookup needs.

A simple smoke test called ClassLookupTest checks the basic functionality of all the above classes, and illustrates their use.

Understanding `.class` files

We can lookup the raw bytes of a .class file by class name. Next, we have to write a parser that can parse the .class file, so we can find all other classes mentioned in that class file.

To write a parser, we need to understand that Java class file format which is described in the Java Virtual Machine Specification. Go, have a look! It's pretty nifty, but has some historic limitations, like using 16-bit indices, resulting in things like maximum method sizes of 64kb and similar unfortunate consequences.

Here's the top-level structure of a class file:

ClassFile {
    u4             magic;
    u2             minor_version;
    u2             major_version;
    u2             constant_pool_count;
    cp_info        constant_pool[constant_pool_count-1];
    u2             access_flags;
    u2             this_class;
    u2             super_class;
    u2             interfaces_count;
    u2             interfaces[interfaces_count];
    u2             fields_count;
    field_info     fields[fields_count];
    u2             methods_count;
    method_info    methods[methods_count];
    u2             attributes_count;
    attribute_info attributes[attributes_count];
}

The spec uses C-like structs to describe the data in a class file. Instead of short, int, etc., it uses more precise types like u2 (16-bit unsigned integer), or u4 (32-bit unsigned integer). Note that everything in a class file is stored in big endian.

The first three fields are the magic marker 0xcafebabe, and the major and minor class file version.

The constant pool is a lookup for things like class names, method signatures, strings, numeric constants, and other things. This is were our focus will be. More on it later.

The access_flags field stores whether the class is public, private, and so on, as a set of bit flags.

The this_class field is meant to tell us the fully qualified name of the class, but how is that done with just a number (u2)?

The answer is the constant pool, stored in the constant_pool[] array. As its name says, it stores constants, such as strings, numbers, etc. Basically any kind of data that's not code is stored here and referenced (and shared) by the other parts of the class file.

There are different types of constant pool entries, e.g. a type representing a UTF-8 strings, a type for int constants, and so on. The type of an entry is encoded in its first byte according to the spec. The generic representation of an entry is a cp_info:

cp_info {
    u1 tag;
    u1 info[];
}

The first byte is a tag, denoting the type of entry, followed by the payload as a variable length array of bytes. There is a total of 17 constant pool entry types. We have to find the entries that contain class names. Let's have a look at the most probably candidate types.

According to the spec, the this_class field points at an entry in the constant pool that must have the type CONSTANT_Class (7). Here's the corresponding structure:

CONSTANT_Class_info {
    u1 tag;
    u2 name_index;
}

The tag field would be set to the value 7. The info[] part of the generic cp_info becomes a single 16-bit unsigned integer called name_index. It is an index into the constant pool, pointing at another constant pool entry of type CONSTANT_Utf8 (1), which looks like this:

CONSTANT_Utf8_info {
    u1 tag;
    u2 length;
    u1 bytes[length];
}

Now we're getting somewhere! The tag is set to the value 1. The length stores the length of the UTF-8 string in bytes. And finally, the bytes[] array stores the actual UTF-8 data. That's where our class' name hides!

Moving through the class file structure again, the next field is super_class. It too is an index into the constant pool that points at a CONSTANT_Class_info entry, which in turn points at a CONSTANT_Utf8_info that holds the super class' name.

Next in the class file structure is the interfaces[] array. Each of its elements is again an index into the constant pool, pointing at a CONSTANT_Class_info entry representing one of the interfaces the class implements.

Other classes may also be referenced as types of fields of a class, which we can find in the next part of the class file structure, the fields[] array. The structure of a field looks like this:

field_info {
    u2             access_flags;
    u2             name_index;
    u2             descriptor_index;
    u2             attributes_count;
    attribute_info attributes[attributes_count];
}

That looks familiar! The name_index is an index into the constant pool, pointing at a UTF-8 string entry representing the field's name. The descriptor_index points at a UTF-8 constant pool entry that describes the field's type. The field type is encoded according to the field descriptor format.

E.g. a field with type java.utils.List would be described by the descriptor Ljava/utils/List;. A two dimensional array of objects would have the descriptor [[Ljava/lang/Object;, etc. We can extract additional class names from every field descriptor!

Next in the class file structre is the methods[] array, which stores all the methods of the class. Each method is encoded like this:

method_info {
    u2             access_flags;
    u2             name_index;
    u2             descriptor_index;
    u2             attributes_count;
    attribute_info attributes[attributes_count];
}

By now, you should see the pattern. The name_index points to a UTF-8 entry in the constant pool representing the name of the method. The descriptor_index points to a UTF-8 entry that stores the method signature according to the method descriptor format. A method like int foo(String, List) would have the descriptor (Ljava.lang.String;Ljava/util/List;)I. Method descriptors are another place to find class names.

The last part of the class file structure is the attributes[] array. Attributes are a sort of free-form data storage that allow to store arbitrary data. Attributes are not only found in the top level class file structure, but also in fields and methods. Some attributes can even contain attributes themselves. There are a few predefined attributes the JVM expects and can process, e.g. the "Code" attribute that stores the bytecodes of a method.

For the first iteration of shakyboi, we can ignore attributes. While they may reference classes, they will do so by pointing at the constant pool. We can thus get away with ignoring them, as parsing the constant pool does not require us to parse attributes. We'll just parse them in their generic structure form:

attribute_info {
    u2 attribute_name_index;
    u4 attribute_length;
    u1 info[attribute_length];
}

So far it looks like we can find all the classes referenced by a class in that classes constant pool. They are either encoded in CONSTANT_Class_info, or method and field descriptors, which are stored in CONSTANT_Utf8_info entries.

Are there other elements of the class file that may reference another class? Yes! Have a look at this class:

public class Foo {
    public void doFoo() {
        new Zip().doZip();        
    }
}

Clearly, it references class Zip in method doFoo(). But Zip is not part of the method signature, so won't be found in the method's descriptor. Let's javap -c the class. Here's what the disassembled method bytecodes look^ like:

public void doFoo();
Code:
    0: new           #7                  // class Zip
    3: dup
    4: invokespecial #9                  // Method Zip."":()V
    7: invokevirtual #10                 // Method Zip.doZip:()V
    10: return

The new, invokespecial, and invokevirtual instructions all reference Zip via the constant pool in some form, e.g. new #7 references the constant pool entry at index 7. Let's have a look at the constant pool entries at indices 7, 9, 10:

 #7 = Class              #8             // Zip
 #8 = Utf8               Zip
 #9 = Methodref          #7.#3          // Zip."":()V
#10 = Methodref          #7.#11         // Zip.doZip:()V

I threw in entry 8 for good measure. There's a class entry in the constant pool for Zipafter all! And it's referenced directly by the new instruction

We also find a new constant pool entry type: CONSTANT_Fieldref_info, referenced by the invokespecial #9 and invokevirtual #10 instructions.

In fact, whenever bytecode references a field or method, it does so through a field reference, method reference, or interface method reference entry in the class pool. These look as follows:

CONSTANT_Fieldref_info {
    u1 tag;
    u2 class_index;
    u2 name_and_type_index;
}

CONSTANT_Methodref_info {
    u1 tag;
    u2 class_index;
    u2 name_and_type_index;
}

CONSTANT_InterfaceMethodref_info {
    u1 tag;
    u2 class_index;
    u2 name_and_type_index;
}

In each case, the class_index points to a class entry in the constant pool. It tells us which class the referenced field or method stems from. Great, another place to find classes.

Further more, name_and_type_index points to a name and type entry in the constant pool, describing the field's type or the method's signature. It looks like this:

CONSTANT_NameAndType_info {
    u1 tag;
    u2 name_index;
    u2 descriptor_index;
}

Yay, we've found another set of descriptors that could potentially reference classes.

Anything else we missed? Yes, there's one more constant pool entry type that directly references a UTF-8 entry that is a descriptor - CONSTANT_MethodType_info:

CONSTANT_MethodType_info {
    u1 tag;
    u2 descriptor_index;
}

This constant pool entry is used when method handles/invoke dynamic.

Let's summarize all the places we need to check for class names:

All CONSTANT_Class_info constant pool entries, except the one pointed to by this_class.
All CONSTANT_NameAndType_info constant pool entries, pointing to a field or method descriptor via descriptor_index.
All CONSTANT_MethodType_info constant pool entries, pointing to a method descriptor via descriptor_index.
The descriptors of fields and methods, pointed to by their descriptor_index.

That's a lot of words to arrive at such a tiny list. But the list of classes we generate must be exhaustive, otherwise we may end up removing things we shouldn't. Also, you now have a grasp of how .class files work (except for attributes, read the spec).

Let's finish this article off by looking at the .class file parser code.

Parsing `.class` files

Turns out writing about the .class file format takes longer than implementing a parser. In shakyboi, the class for parsing a .class file is called ClassFileReader. It parses a .class file given as a byte[] array into a ClassFile instance.

The complexity of .class files is mostly found in attributes, which we can thankfully ignore (for now). The entire parser is only 167 lines long. Here's its main driver method:

public static ClassFile readClassFile(String name, byte[] data) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInput(data));
    try {
        ClassFile clz = new ClassFile(name, data);

        if (in.readInt() != 0xcafebabe) throw new RuntimeException("Magic 0xcafebabe not found");

        clz.minorVersion = in.readUnsignedShort();
        clz.majorVersion = in.readUnsignedShort();

        readConstantPool(in, clz);

        clz.accessFlags = in.readUnsignedShort();
        clz.thisClass = in.readUnsignedShort();
        clz.superClass = in.readUnsignedShort();

        readInterfaces(in, clz);
        readMembers(in, clz, clz.fields);
        readMembers(in, clz, clz.methods);
        readAttributes(in, clz, clz.attributes);

        return clz;
    } catch (Throwable t) {
        throw new IOException("Error reading class " + name, t);
    }
}

That maps pretty cleanly to the class file structure above. The work horse is the ClassFileReader.readConstantPool() method, responsible for parsing the constant pool and translating it into Java POJOs.

While we won't be rewriting .class files (yet), I've also implemented a ClassFileWriter which takes a ClassFile and serializes it into a .class file in form of a byte[] array.

Up next

You made it to the end! While wordy, your time investment in reading all fo this will be recouped once we start doing more fancy stuff, like obfuscation. For now, we are happy to have all our scaffolding in place and an idea how to identify all classes referenced by a class.

In the next installement, we'll finally write our first tree shaker. That article will be a lot shorter :D

Discuss this post on Twitter.