Shakyboi - Part 2: Builds, Class lookups, tests, and class file parsing
Last time I proposed to write a (partial) replacement for ProGuard for
All of that was very superficial and didn't contain any meat-y code. This installement will have some meat-y code, but will not yet describe the implementation of the actual tree shaking. Instead, we'll walk through the scafolding needed to get us to a place where we can implement tree shaking. Here's what we'll need.
- Build: Everybody loves Gradle, so I'll be using Maven. We'll set it up such that it spits out a fat
.jar
that can be run from the CLI, and a way to run tests. For good measure, we'll throw in the incantations to be able to publish things to Maven Central via SonaType eventually, i.e. if we create plugins to use shakyboi via Gradle or Maven. - Tests: Running shakyboi via the CLI is the end goal, but for development purposes we need a bit of structure and repeatability. We'll use JUnit 5, like the good JVM citizens we are.
- Class lookup: A way to find
.class
files based on class names likejava/lang/Object
and read their raw bytes. The files can come from a directory, a.jar
file, or a Java Runtime Image, like the one in the JDK/JRE we use to run our shakyboi. .class
file parser: We'll need to figure out all classes a class under inspection references to create our dependency graph. For that, we need to parse raw.class
files. While we could use ASM, we'll write our own class file parser based on the class file format specification. Waste of time? Maybe, but a fun exercise and it forces us to understand the JVM a little better.
Building and testing
Not much to say here. Have a look at the full pom.xml
. It's a barebone (yes, really...) build descriptor that will allow us to:
mvn test
, to execute all unit tests.mvn package
, to package the fat jar, which will be output toshakyboi/target/shakyboi.jar
mvn deploy
, to deploy a snapshot build to SonaType (Mario only).mvn release:prepare && maven release:deploy
, to deploy a release version that gets synced with Maven Central (Mario only).
The pom.xml
file also lets us import the project into the IDE of our choice. For this little project I'll deviate from my usual choice of Eclipse (fight me) and use IntelliJ IDEA.
Looking up classes
A user of shakyboi is expected to feed it a bunch of .class
files that make up their app and that should be tree shaken. Let's call these classes app classes.
The app classes may depend on other classes that should not be shaken, e.g. classes from the Java runtime class library like java/util/List
. For tree shaking, we still need to at least verify they exist, so we need a way to look them up as well. Let's call these classes bootstrap classes.
Classes can come from different sources, but we don't particularly care about the details. Instead, we generalize via the ClassLookup
interface and create implementations for different source types. The basic interface is simple:
package io.marioslab.shakyboi.lookup;
/**
* A class lookup provides the raw bytes of a class fetched from
* some place.
*/
public interface ClassLookup {
/**
* Looks up the class with the given name and returns its
* .class
file as a byte array.
*
* @param name the binary class name, e.g. "java/lang/Object". @see jvms-4.2.1.
* @return the classes bytes or null.
* @throws RuntimeException in case an unrecoverable error happened.
*/
byte[] findClass(String name);
}
for historical reasons, the class file format requires package names to be delimited by /
instead of .
. Thus, java.lang.Object
turns into java/lang/Object
. This is called the binary class name as per the specification.
To avoid conversions from and to the binary class name format, we'll use the binary class name format throughout the shakyboi code. Only user facing code will use the dot notation, e.g. java.lang.Object
.class
files of app classes usually come from either a directory or a .jar
file. These two sources are covered by DirectoryClassLookup
and JarClassLookup
.
For unit tests, we may want to load classes at runtime from the JVM classpath, for which we can use ClassLoaderClassLookup
. It can be used to load classes that belong to the tests (app classes), as well as classes from the Java runtime (bootstrap classes) and makes writing self-contained tests easy.
Until and including Java 8, a JDK or JRE would ship all JRE classes in a single file called lib/rt.jar
. This changed with the introduction of modules in Java 9. The JDK or JRE now ships with file called lib/modules
containing modules like java.base
, which in turn contain their respective .class
files. To access .class
files in such a module image, we can use JrtImageClassLookup
. The current implementation looks up the modules file in the home directory of the JDK or JRE that is currently executing shakyboi.
Multiple class lookups can be chained via a CombinedClassLookup
. That should cover pretty much all of our class lookup needs.
A simple smoke test called ClassLookupTest
checks the basic functionality of all the above classes, and illustrates their use.
Understanding .class
files
We can lookup the raw bytes of a .class
file by class name. Next, we have to write a parser that can parse the .class
file, so we can find all other classes mentioned in that class file.
To write a parser, we need to understand that Java class file format which is described in the Java Virtual Machine Specification. Go, have a look! It's pretty nifty, but has some historic limitations, like using 16-bit indices, resulting in things like maximum method sizes of 64kb and similar unfortunate consequences.
Here's the top-level structure of a class file:
ClassFile {
u4 magic;
u2 minor_version;
u2 major_version;
u2 constant_pool_count;
cp_info constant_pool[constant_pool_count-1];
u2 access_flags;
u2 this_class;
u2 super_class;
u2 interfaces_count;
u2 interfaces[interfaces_count];
u2 fields_count;
field_info fields[fields_count];
u2 methods_count;
method_info methods[methods_count];
u2 attributes_count;
attribute_info attributes[attributes_count];
}
The spec uses C-like structs to describe the data in a class file. Instead of short
, int
, etc., it uses more precise types like u2
(16-bit unsigned integer), or u4
(32-bit unsigned integer). Note that everything in a class file is stored in big endian.
The first three fields are the magic marker 0xcafebabe
, and the major and minor class file version.
The constant pool is a lookup for things like class names, method signatures, strings, numeric constants, and other things. This is were our focus will be. More on it later.
The access_flags
field stores whether the class is public, private, and so on, as a set of bit flags.
The this_class
field is meant to tell us the fully qualified name of the class, but how is that done with just a number (u2
)?
The answer is the constant pool, stored in the constant_pool[]
array. As its name says, it stores constants, such as strings, numbers, etc. Basically any kind of data that's not code is stored here and referenced (and shared) by the other parts of the class file.
There are different types of constant pool entries, e.g. a type representing a UTF-8 strings, a type for int constants, and so on. The type of an entry is encoded in its first byte according to the spec. The generic representation of an entry is a cp_info
:
cp_info {
u1 tag;
u1 info[];
}
The first byte is a tag, denoting the type of entry, followed by the payload as a variable length array of bytes. There is a total of 17 constant pool entry types. We have to find the entries that contain class names. Let's have a look at the most probably candidate types.
According to the spec, the this_class
field points at an entry in the constant pool that must have the type CONSTANT_Class (7)
. Here's the corresponding structure:
CONSTANT_Class_info {
u1 tag;
u2 name_index;
}
The tag
field would be set to the value 7
. The info[]
part of the generic cp_info
becomes a single 16-bit unsigned integer called name_index
. It is an index into the constant pool, pointing at another constant pool entry of type CONSTANT_Utf8 (1)
, which looks like this:
CONSTANT_Utf8_info {
u1 tag;
u2 length;
u1 bytes[length];
}
Now we're getting somewhere! The tag
is set to the value 1
. The length
stores the length of the UTF-8 string in bytes. And finally, the bytes[]
array stores the actual UTF-8 data. That's where our class' name hides!
Moving through the class file structure again, the next field is super_class
. It too is an index into the constant pool that points at a CONSTANT_Class_info
entry, which in turn points at a CONSTANT_Utf8_info
that holds the super class' name.
Next in the class file structure is the interfaces[]
array. Each of its elements is again an index into the constant pool, pointing at a CONSTANT_Class_info
entry representing one of the interfaces the class implements.
Other classes may also be referenced as types of fields of a class, which we can find in the next part of the class file structure, the fields[]
array. The structure of a field looks like this:
field_info {
u2 access_flags;
u2 name_index;
u2 descriptor_index;
u2 attributes_count;
attribute_info attributes[attributes_count];
}
That looks familiar! The name_index
is an index into the constant pool, pointing at a UTF-8 string entry representing the field's name. The descriptor_index
points at a UTF-8 constant pool entry that describes the field's type. The field type is encoded according to the field descriptor format.
E.g. a field with type java.utils.List
would be described by the descriptor Ljava/utils/List;
. A two dimensional array of objects would have the descriptor [[Ljava/lang/Object;
, etc. We can extract additional class names from every field descriptor!
Next in the class file structre is the methods[]
array, which stores all the methods of the class. Each method is encoded like this:
method_info {
u2 access_flags;
u2 name_index;
u2 descriptor_index;
u2 attributes_count;
attribute_info attributes[attributes_count];
}
By now, you should see the pattern. The name_index
points to a UTF-8 entry in the constant pool representing the name of the method. The descriptor_index
points to a UTF-8 entry that stores the method signature according to the method descriptor format. A method like int foo(String, List)
would have the descriptor (Ljava.lang.String;Ljava/util/List;)I
. Method descriptors are another place to find class names.
The last part of the class file structure is the attributes[]
array. Attributes are a sort of free-form data storage that allow to store arbitrary data. Attributes are not only found in the top level class file structure, but also in fields and methods. Some attributes can even contain attributes themselves. There are a few predefined attributes the JVM expects and can process, e.g. the "Code" attribute that stores the bytecodes of a method.
For the first iteration of shakyboi, we can ignore attributes. While they may reference classes, they will do so by pointing at the constant pool. We can thus get away with ignoring them, as parsing the constant pool does not require us to parse attributes. We'll just parse them in their generic structure form:
attribute_info {
u2 attribute_name_index;
u4 attribute_length;
u1 info[attribute_length];
}
So far it looks like we can find all the classes referenced by a class in that classes constant pool. They are either encoded in CONSTANT_Class_info
, or method and field descriptors, which are stored in CONSTANT_Utf8_info
entries.
Are there other elements of the class file that may reference another class? Yes! Have a look at this class:
public class Foo {
public void doFoo() {
new Zip().doZip();
}
}
Clearly, it references class Zip
in method doFoo()
. But Zip
is not part of the method signature, so won't be found in the method's descriptor. Let's javap -c
the class. Here's what the disassembled method bytecodes look^ like:
public void doFoo();
Code:
0: new #7 // class Zip
3: dup
4: invokespecial #9 // Method Zip."":()V
7: invokevirtual #10 // Method Zip.doZip:()V
10: return
The new
, invokespecial
, and invokevirtual
instructions all reference Zip
via the constant pool in some form, e.g. new #7
references the constant pool entry at index 7. Let's have a look at the constant pool entries at indices 7, 9, 10
:
#7 = Class #8 // Zip
#8 = Utf8 Zip
#9 = Methodref #7.#3 // Zip."":()V
#10 = Methodref #7.#11 // Zip.doZip:()V
I threw in entry 8
for good measure. There's a class entry in the constant pool for Zip
after all! And it's referenced directly by the new
instruction
We also find a new constant pool entry type: CONSTANT_Fieldref_info
, referenced by the invokespecial #9
and invokevirtual #10
instructions.
In fact, whenever bytecode references a field or method, it does so through a field reference, method reference, or interface method reference entry in the class pool. These look as follows:
CONSTANT_Fieldref_info {
u1 tag;
u2 class_index;
u2 name_and_type_index;
}
CONSTANT_Methodref_info {
u1 tag;
u2 class_index;
u2 name_and_type_index;
}
CONSTANT_InterfaceMethodref_info {
u1 tag;
u2 class_index;
u2 name_and_type_index;
}
In each case, the class_index
points to a class entry in the constant pool. It tells us which class the referenced field or method stems from. Great, another place to find classes.
Further more, name_and_type_index
points to a name and type entry in the constant pool, describing the field's type or the method's signature. It looks like this:
CONSTANT_NameAndType_info {
u1 tag;
u2 name_index;
u2 descriptor_index;
}
Yay, we've found another set of descriptors that could potentially reference classes.
Anything else we missed? Yes, there's one more constant pool entry type that directly references a UTF-8 entry that is a descriptor - CONSTANT_MethodType_info
:
CONSTANT_MethodType_info {
u1 tag;
u2 descriptor_index;
}
This constant pool entry is used when method handles/invoke dynamic.
Let's summarize all the places we need to check for class names:
- All
CONSTANT_Class_info
constant pool entries, except the one pointed to bythis_class
. - All
CONSTANT_NameAndType_info
constant pool entries, pointing to a field or method descriptor viadescriptor_index
. - All
CONSTANT_MethodType_info
constant pool entries, pointing to a method descriptor viadescriptor_index
. - The descriptors of fields and methods, pointed to by their
descriptor_index
.
That's a lot of words to arrive at such a tiny list. But the list of classes we generate must be exhaustive, otherwise we may end up removing things we shouldn't. Also, you now have a grasp of how .class
files work (except for attributes, read the spec).
Let's finish this article off by looking at the .class
file parser code.
Parsing .class
files
Turns out writing about the .class
file format takes longer than implementing a parser. In shakyboi, the class for parsing a .class
file is called ClassFileReader
. It parses a .class
file given as a byte[]
array into a ClassFile
instance.
The complexity of .class
files is mostly found in attributes, which we can thankfully ignore (for now). The entire parser is only 167 lines long. Here's its main driver method:
public static ClassFile readClassFile(String name, byte[] data) throws IOException {
DataInputStream in = new DataInputStream(new ByteArrayInput(data));
try {
ClassFile clz = new ClassFile(name, data);
if (in.readInt() != 0xcafebabe) throw new RuntimeException("Magic 0xcafebabe not found");
clz.minorVersion = in.readUnsignedShort();
clz.majorVersion = in.readUnsignedShort();
readConstantPool(in, clz);
clz.accessFlags = in.readUnsignedShort();
clz.thisClass = in.readUnsignedShort();
clz.superClass = in.readUnsignedShort();
readInterfaces(in, clz);
readMembers(in, clz, clz.fields);
readMembers(in, clz, clz.methods);
readAttributes(in, clz, clz.attributes);
return clz;
} catch (Throwable t) {
throw new IOException("Error reading class " + name, t);
}
}
That maps pretty cleanly to the class file structure above. The work horse is the ClassFileReader.readConstantPool()
method, responsible for parsing the constant pool and translating it into Java POJOs.
While we won't be rewriting .class
files (yet), I've also implemented a ClassFileWriter
which takes a ClassFile
and serializes it into a .class
file in form of a byte[]
array.
Up next
You made it to the end! While wordy, your time investment in reading all fo this will be recouped once we start doing more fancy stuff, like obfuscation. For now, we are happy to have all our scaffolding in place and an idea how to identify all classes referenced by a class.
In the next installement, we'll finally write our first tree shaker. That article will be a lot shorter :D
Discuss this post on Twitter.