Shakyboi - Part 1: Of tree shaking and class files

February 20, 2021

Shake it like a polaroid picture.

Shakyboi - Part 1: Of tree shaking and class files

I don't hate ProGuard. It's a fine piece of software. However, its documentation is a bit lacking, and some of its behaviour is weird. So what better way to spend a few weekends than to learn all the hard lessons Eric, creator of ProGuard, had learned over many years! Thus I've started building Shakyboi, my personal replacement for ProGuard. I'll probably reinvent everything ProGuard does badly. But at least I'll have some fun doing so.

What's tree shaking?

ProGuard takes as input Java .class files and "shrinks", optimizes, and obfuscates them. The "shrinking" part is more commonly known as tree shaking, at least in the JavaScript space.

Tree shaking is a form of dead code removal. Removing dead code generally results in smaller binary sizes, a desirable thing for faster deployments, smaller and thus faster downloads, and potentially faster app start-up times.

In case of Java, we can apply tree shaking on multiple levels. For now, we'll concern ourselves with tree shaking at the class level. The goal of class level tree shaking is to remove any classes that are not used by the app.

Have a look at this contrived example Java app.

// Zipp.java
public class Zipp {
    public void doZipp() {
        System.out.println("I'm a Zipp.");
    }
}
// Zapp.java
public class Zapp {
    public void doZapp() {
        System.out.println("I'm a Zapp.");
    }
}
// Zopp.java
public class Zopp {
    public void doZopp() {
        System.out.println("I'm a Zopp.");
    }
}
// Foo.java
public class Foo {
    public void doFoo() {
        System.out.println("I'm a Foo.");
        new Zipp().doZipp();
    }
}
// Bar.java
public class Bar {
    public void doBar() {
        System.out.println("I'm a Bar.");
        new Zapp().doZapp();
        new Zopp().doZopp();
    }
}
// App.java
public class App {
    public static void main(String[] argv) {
        new Bar().doBar();
    }
}

Let's assume we have compiled the 6 classes via javac and packaged them into a file called app.jar using the build system of our choice. The app.jar file (which is really just a ZIP file) would look like this.

App.class
Bar.class
Foo.class
Zapp.class
Zipp.class
Zopp.class

Running the app on the CLI looks like this:

❯ java -jar app.jar
I'm a Bar.
I'm a Zapp.
I'm a Zopp.

The command starts the Java Virtual Machine (JVM) and tells it to look up the class App in the app.jar file and call its main(String[]) method.

With our mushy brains, we can easily see from both the output and the source code that class App depends on class Bar (because Bar is referenced in App.main(String[])), and that Bar depends on Zapp and Zopp (because they are referenced in Bar.doBar()). Class Foo is not referenced in App or Bar, nor is Zipp. Congratulations, you just manually shook the class dependency tree (well, graph really), which looks like this:

In graph theory terms, we've created a directed graph that encodes the class dependencies, which has 2 connected components. The connected component containing the App class contains all classes necessary for the successful execution of our app. All classes in the other connected components can be removed from our app.

Having performed our dependency analysis manually, we can remove the files Foo.class and Zipp.class from our app.jar file and the app will still work:

❯ zip -d app.jar Foo.class Zipp.class
deleting: Zipp.class
deleting: Foo.class

❯ java -jar app.jar
I'm a Bar.
I'm a Zapp.
I'm a Zopp.

Here's the before-and-after in terms of file size:

❯ ls -lah app.jar
-rw-r--r--  1 badlogic  staff   4.1K Feb 20 14:16 app.jar

❯ zip -d app.jar Foo.class Zipp.class
deleting: Zipp.class
deleting: Foo.class

❯ ls -lah app.jar
-rw-r--r--  1 badlogic  staff   3.3K Feb 20 14:17 app.jar

We saved about 0.8K! 40 years ago that would have been marvelous.

Constructing the class dependency graph from .class files

Obviously, doing manual dependency analysis and removal of "dead" classes is an itsy bit too complex for real-world applications. Being the computer loving folks we are, we can surely devise a way to automate the class dependency graph creation.

For Java, we could do it in multiple different ways. One approach would be to parse Java source files, but that's just nasty, so we ignore that option.

A possibly less complex way to do it is by parsing .class files and extracting the classes a class depends on. For this, we have to understand the Java class file format. It's a mostly trivial and well thought out format, so it should be easy to parse and extract the information we need.

Instead of diving into the format specification, let's first have a look at the output of javap when fed one of our classes. javap comes with your JDK and is able to disassemble .class files. Here's what Main.class contains:

❯ javap -c -v -cp app.jar App
Classfile jar:file:///Users/badlogic/workspaces/marioslab/shakyboi/testapps/foo/target/app.jar!/App.class
    Last modified 20 Feb 2021; size 398 bytes
    SHA-256 checksum ea9637fce3dfdead99a625a3047ff56547e428c5a8bc81549c537eea8663af13
    Compiled from "App.java"
public class App
    minor version: 0
    major version: 59
    flags: (0x0021) ACC_PUBLIC, ACC_SUPER
    this_class: #13                         // App
    super_class: #2                         // java/lang/Object
    interfaces: 0, fields: 0, methods: 2, attributes: 1
Constant pool:
    #1 = Methodref          #2.#3          // java/lang/Object."":()V
    #2 = Class              #4             // java/lang/Object
    #3 = NameAndType        #5:#6          // "":()V
    #4 = Utf8               java/lang/Object
    #5 = Utf8               
    #6 = Utf8               ()V
    #7 = Class              #8             // Bar
    #8 = Utf8               Bar
    #9 = Methodref          #7.#3          // Bar."":()V
    #10 = Methodref          #7.#11         // Bar.doBar:()V
    #11 = NameAndType        #12:#6         // doBar:()V
    #12 = Utf8               doBar
    #13 = Class              #14            // App
    #14 = Utf8               App
    #15 = Utf8               Code
    #16 = Utf8               LineNumberTable
    #17 = Utf8               LocalVariableTable
    #18 = Utf8               this
    #19 = Utf8               LApp;
    #20 = Utf8               main
    #21 = Utf8               ([Ljava/lang/String;)V
    #22 = Utf8               argv
    #23 = Utf8               [Ljava/lang/String;
    #24 = Utf8               SourceFile
    #25 = Utf8               App.java
{
    public App();
    descriptor: ()V
    flags: (0x0001) ACC_PUBLIC
    Code:
        stack=1, locals=1, args_size=1
            0: aload_0
            1: invokespecial #1                  // Method java/lang/Object."":()V
            4: return
        LineNumberTable:
        line 1: 0
        LocalVariableTable:
        Start  Length  Slot  Name   Signature
            0       5     0  this   LApp;

    public static void main(java.lang.String[]);
    descriptor: ([Ljava/lang/String;)V
    flags: (0x0009) ACC_PUBLIC, ACC_STATIC
    Code:
        stack=2, locals=1, args_size=1
            0: new           #7                  // class Bar
            3: dup
            4: invokespecial #9                  // Method Bar."":()V
            7: invokevirtual #10                 // Method Bar.doBar:()V
        10: return
        LineNumberTable:
        line 3: 0
        line 4: 10
        LocalVariableTable:
        Start  Length  Slot  Name   Signature
            0      11     0  argv   [Ljava/lang/String;
}
SourceFile: "App.java"

Looks intimidating at first sight, but it's not so bad! If you CTRL + F for Bar, you can see all the places App references Bar in the .class file.

We see a bunch of Bar in the constant pool section of the javap output. The constant pool is a lookup table referenced by other parts of the class. It contains symbolic references to other classes, fields, methods, constants and so on. Each constant pool entry has an index and a type. A constant pool entry may also reference another constant pool entry. We can see Bar in a handful of constant pool entries. Let's look at the first two.

#7 = Class              #8             // Bar
#8 = Utf8               Bar

The entry at index 7 is of type Class, and encodes, well, a class. It references another entry at index 8, which is of type Utf8, a UTF-8 string encoding the class' name. Great! In general, we can assume that almost all classes a class depends on can be found in the constant pool in form of Class entries.

The actual byte code making up a class' method will reference the constant pool, e.g. the App.main(String[]) method looks like this:

public static void main(java.lang.String[]);
descriptor: ([Ljava/lang/String;)V
flags: (0x0009) ACC_PUBLIC, ACC_STATIC
Code:
    stack=2, locals=1, args_size=1
        0: new           #7                  // class Bar
        3: dup
        4: invokespecial #9                  // Method Bar."":()V
        7: invokevirtual #10                 // Method Bar.doBar:()V
    10: return
    LineNumberTable:
    line 3: 0
    line 4: 10
    LocalVariableTable:
    Start  Length  Slot  Name   Signature
        0      11     0  argv   [Ljava/lang/String;

The instruction at byte code index 0 (new #7) references the constant pool entry at constant pool index 7, which is the Class entry discussed above. The instruction tells the JVM to construct a new instance of class Bar, as helpfully pointed out by javap via the comment // class Bar appended to the instruction.

Let's ignore the dup instruction.

The next instruction at byte code index 4 (invokespecial #9) calls a special method, the constructor of Bar. The constructor is specified as a constant pool entry at index 9. The constant pool entry is of type MethodRef, which encodes a class (Bar in this case), and the method of that class we want to invoke (the constructor of Bar):

#9 = Methodref          #7.#3          // Bar."":()V

The MethodRef references constant pool entries 7 and 3 (javap has again helpfully resolved them for us in the comment):

#3 = NameAndType        #5:#6          // "":()V
#7 = Class              #8             // Bar

Ah, we already know constant pool entry 7, it's the class! Constant pool entry 3 is of type NameAndType, which describes the name and type of either a field or a method. The type is encoded as a descriptor string. You may have seen those before, they may look a bit wonky. Here's the descriptor of App.main(String[])

([Ljava/lang/String;)V

As it turns out, classes directly referenced by a class can all be found in the constant pool, either in form of a Class constant pool entry, or as a descriptor string. The latter is a bit more complex to handle, as we have to go through all constant pool entry types that reference such a descriptor. We can figure this out from the class file format specification. Additionally, we have to consider the super class hierarchy of the class, as well as the interfaces it implements, information that's also part of the .class file.

And if we can do it for one class, we can do it for all other classes it depends on, recursively, which gives us our class dependency graph. Easy!

Up next

I've already created a repository and filled in some class file parsing code, see Shakyboi on GitHub. We won't be using any of the fancy byte code manipulation libraries like ASM, but instead use some duct tape and spit and write our own. In the next installment I'll discuss the implementation of the basic class level tree shaker.

Discuss this post on Twitter.