How to compact a ZIP file

In my previous sneak preview post, I showed how the upcoming TrueZIP 7.3 release can be used to grow a ZIP file by appending archive entries to its end rather than assembling a new ZIP file. While this improves performance, the ZIP file keeps growing over time, so you may want to compact it every now and then. Here’s how to do this.

Prerequisites

First, you need to have set the FsOutputOption.GROW preference and created a ZIP file with it. With this output option preference set, any new or updated ZIP entry is appended to the end of the ZIP file. When your application terminates or calls TFile.umount() (alias TFile.sync()), a new central directory is written to the end of the ZIP file. This is the equivalent of a multi-session disc (CD, DVD etc.) for ZIP files - including EAR, JAR, WAR etc.

The Issue

This mode of operation may save a lot of time if a ZIP file is large compared to the newly added or updated archive entries. However, it may leave a lot of redundant data in the ZIP file: If you overwrite an existing ZIP entry, the old version is still physically present in the ZIP file; it’s just not listed in the central directory at the end of the ZIP file. In addition, for any update a new central directory is appended to the end of the ZIP file. Again, the old central directory is still present in the ZIP file, but it’s not visible to an application. This is like an unreferenced object on the JVM heap which hasn’t been garbage collected yet: it consumes memory, but it is unused.

Suppose this is the layout of a ZIP file after it has been initially created with the entries entry1 and entry2 (presented as pseudo-XML):

<entry name=entry1/>
<entry name=entry2/>
<central-directory/>

Now if the FsOutputOption.GROW output option preference is set and the application overwrites entry2, the layout of the updated archive file will look as follows:

<entry name=entry1/>
<entry name=entry2/> <!-- old entry data -->
<central-directory/> <!-- old directory data -->
<entry name=entry2/> <!-- new entry data -->
<central-directory/> <!-- new directory data -->

As you can see, there are now two entries with the name entry2 and two central directories: one of each holds the old data and the other the new data. It’s easy to imagine that this strategy wastes a lot of space if your application updates the ZIP file frequently.

The Resolution

So what you may need is a way to compact the ZIP file to this layout again:

<entry name=entry1/>
<entry name=entry2/> <!-- new entry data -->
<central-directory/> <!-- new directory data -->

Compacting the ZIP file again is really simple. In TrueZIP 7.3, a new method has been added to the TFile class for this purpose. You can call it like this:

TFile archive = new TFile("archive.zip");
// Create or update the archive file here.
...
archive.compact();

The method TFile.compact() simply makes a structural copy of the archive file to a temporary archive file and then moves the result over the original archive file. It’s the structural copy which effectively removes all redundant archive entry contents and metadata, including central directories.
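To make the idea of a structural copy more tangible, here is a minimal sketch which uses only java.util.zip from the JDK. It is not TrueZIP's implementation: the class StructuralCopySketch and its compact method are hypothetical names made up for this illustration, and unlike a real implementation it carries over only the entry name and modification time. The point is that a reader which starts from the last central directory sees only the current entries, so copying those to a fresh file leaves the redundant data behind.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration only; this is NOT TrueZIP code.
final class StructuralCopySketch {

    /** Rewrites the ZIP file so that it holds only the entries listed in its central directory. */
    static void compact(Path archive) throws IOException {
        // Create the temporary file next to the archive so that the final
        // move stays within the same directory.
        Path dir = archive.toAbsolutePath().getParent();
        Path temp = Files.createTempFile(dir, "compact", ".tmp");
        try {
            try (ZipFile in = new ZipFile(archive.toFile());
                 ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(temp))) {
                // ZipFile reads the central directory at the end of the file,
                // so only the entries listed there are enumerated.
                Enumeration<? extends ZipEntry> entries = in.entries();
                while (entries.hasMoreElements()) {
                    ZipEntry source = entries.nextElement();
                    ZipEntry target = new ZipEntry(source.getName());
                    if (source.getTime() != -1) {
                        target.setTime(source.getTime());
                    }
                    out.putNextEntry(target);
                    if (!source.isDirectory()) {
                        try (InputStream data = in.getInputStream(source)) {
                            // Decompresses the old data and compresses it again.
                            data.transferTo(out);
                        }
                    }
                    out.closeEntry();
                }
            }
            // Replace the original archive with the compact copy.
            Files.move(temp, archive, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // No-op if the move succeeded; cleans up if anything failed.
            Files.deleteIfExists(temp);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: StructuralCopySketch <zip-file>");
            return;
        }
        compact(Path.of(args[0]));
    }
}
```

With TrueZIP you don't need any of this, of course: calling archive.compact() as shown above is all it takes. The sketch requires Java 9 or later because of InputStream.transferTo.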

Mind that this operation has no means to detect if there is actually any redundant data present in a ZIP file. Any invocation will perform exactly the same steps, so if the ZIP file is already compact, then this will just waste time and temporary space in the platform file system.
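Since every call does the full copy, an application may want to decide for itself whether compacting is worth the effort. The following is only a rough heuristic sketched with java.util.zip, not a TrueZIP feature; the class SlackEstimateSketch, its methods and the threshold parameter are hypothetical. It sums up the compressed sizes of the entries listed in the central directory and compares the result with the length of the file. The remainder also includes all headers and the central directory itself, so it is never zero, and for archives with many tiny entries the estimate can be misleading.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper for illustration only; this is NOT TrueZIP code.
final class SlackEstimateSketch {

    /** Returns the number of bytes in the file which are not compressed data of a listed entry. */
    static long unaccountedBytes(Path archive) throws IOException {
        long listed = 0;
        try (ZipFile zip = new ZipFile(archive.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                long size = entries.nextElement().getCompressedSize();
                if (size > 0) { // -1 means unknown
                    listed += size;
                }
            }
        }
        return Files.size(archive) - listed;
    }

    /** Returns true if the unaccounted share of the file exceeds the given threshold (0.0 to 1.0). */
    static boolean worthCompacting(Path archive, double threshold) throws IOException {
        long total = Files.size(archive);
        return total > 0 && (double) unaccountedBytes(archive) / total > threshold;
    }
}
```

An application could then call compact() only if, say, worthCompacting(path, 0.5) returns true. The value 0.5 is an arbitrary example; a sensible threshold depends on how large your entries typically are.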

Compacting TAR Files

Although there is no central directory in a TAR file, there may still be redundant archive entries. Removing them works the same way as with ZIP files. Just call TFile.compact() as shown before.