Appending entries to ZIP files with TrueZIP 7.3

This post is a sneak preview of a new feature in TrueZIP 7.3: the ability to add entries to a ZIP file quickly by appending them to its end rather than performing a full update. This feature is the ZIP file equivalent of a multi-session disc (CD, DVD etc.) and can significantly improve the overall performance of a TrueZIP application.

Motivation

By default, TrueZIP is configured to produce the smallest possible archive files. This is achieved by applying the following strategy:

  1. Select the maximum compression ratio in the archive drivers.
  2. Perform an archive update (see below) upon any of the following events:
    • An existing archive entry is going to get overwritten with new contents, or
    • an existing archive entry is going to get updated with new meta data.

This strategy applies an archive update whenever required to avoid the writing of redundant archive entry contents or meta data to the resulting archive file. An archive update is basically a copy operation where all archive entries which haven’t been written yet get copied from the input archive file to the output archive file.
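To make the cost of such a copy operation concrete, here is a minimal sketch of an archive update written against the plain java.util.zip API. It is not TrueZIP's implementation: the class ArchiveUpdateSketch and its methods create, update and read are hypothetical names made up for this illustration, and the sketch recompresses every copied entry, which TrueZIP may well avoid. The point is only that replacing a single entry forces every other entry to be read from the input and written to the output.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration class; not part of TrueZIP.
class ArchiveUpdateSketch {

    /** Creates an in-memory ZIP file from entry names mapped to text contents. */
    static byte[] create(Map<String, String> entries) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            for (Map.Entry<String, String> e : entries.entrySet()) {
                out.putNextEntry(new ZipEntry(e.getKey()));
                out.write(e.getValue().getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    /**
     * Performs a full update: writes the new or replaced entry first, then
     * copies every entry which hasn't been written yet from input to output.
     */
    static byte[] update(byte[] input, String name, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(input));
             ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry(name));
            out.write(content.getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
            byte[] buffer = new byte[8192];
            for (ZipEntry e; (e = in.getNextEntry()) != null; ) {
                if (e.getName().equals(name)) {
                    continue; // Superseded by the entry written above.
                }
                // A fresh ZipEntry avoids carrying over stale size fields.
                out.putNextEntry(new ZipEntry(e.getName()));
                for (int n; (n = in.read(buffer)) != -1; ) {
                    out.write(buffer, 0, n);
                }
                out.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return bytes.toByteArray();
    }

    /** Reads all entries of an in-memory ZIP file as text. */
    static Map<String, String> read(byte[] archive) {
        Map<String, String> entries = new LinkedHashMap<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(archive))) {
            byte[] buffer = new byte[8192];
            for (ZipEntry e; (e = in.getNextEntry()) != null; ) {
                ByteArrayOutputStream content = new ByteArrayOutputStream();
                for (int n; (n = in.read(buffer)) != -1; ) {
                    content.write(buffer, 0, n);
                }
                entries.put(e.getName(),
                        new String(content.toByteArray(), StandardCharsets.UTF_8));
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return entries;
    }
}
```

Calling update(archive, "b.txt", "TWO") on an archive that holds a.txt and b.txt rewrites a.txt as well, although only b.txt changed. With thousands of large entries, this copying dominates the cost of the operation.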

However, while this strategy produces the smallest possible archive files, it may perform badly if the number and size of the archive entries to create or update are small compared to the total size of the resulting archive file. When an archive update is performed, the overall amount of I/O data is in the order of O(n * se) = O(sa) in big O notation, where n is the total number of archive entries, se is the average size of these archive entries (including content and meta data) and sa is the total size of the resulting archive file.

How To Append Entries To ZIP Files

To avoid this cost, as of TrueZIP 7.3 you can change this strategy by setting the FsOutputOption.GROW output option preference when writing archive entry contents or updating their meta data. When set, this output option preference inhibits archive update operations and instead lets archive files grow by appending new or updated archive entries to their end.

You can set this output option preference in the global configuration as follows:

class MyApplication extends TApplication {

    @Override
    protected void setup() {
        // This should obtain the global configuration.
        TConfig config = TConfig.get();
        // Set FsOutputOption.GROW for appending-to rather than reassembling an
        // archive file.
        config.setOutputPreferences(
                config.getOutputPreferences().set(FsOutputOption.GROW));
    }

    ...
}

Of course, you can also set this output option preference on a case-by-case basis as follows:

// We are going to append entry to archive.zip.
TFile file = new TFile("archive.zip/entry");

// First, push a new current configuration on the inheritable thread local
// stack.
TConfig config = TConfig.push();
try {
    // Set FsOutputOption.GROW for appending-to rather than reassembling an
    // archive file.
    config.setOutputPreferences(
            config.getOutputPreferences().set(FsOutputOption.GROW));

    // Now use the current configuration and append the entry to the archive
    // file even if it's already present.
    TFileOutputStream out = new TFileOutputStream(file);
    try {
        // Do some output here.
        ...
    } finally {
        out.close();
    }
} finally {
    // Pop the current configuration off the inheritable thread local stack.
    config.close();
}

Archive Driver Support

Note that whether this output option preference is supported depends on the archive file system driver. If it’s not supported, it gets silently ignored and the driver falls back to the default strategy of performing an archive update whenever required to avoid writing redundant archive entry data. Currently, the situation is as follows:

  • The drivers of the module TrueZIP Driver ZIP fully support this output option preference, so it’s available for EAR, JAR, WAR etc.
  • The drivers of the module TrueZIP Driver ZIP.RAES only allow redundant archive entry contents and meta data to be written. You cannot append to an existing ZIP.RAES file.
  • The drivers of the module TrueZIP Driver TAR only allow redundant archive entry contents to be written. You cannot append to an existing TAR file.

Performance Considerations

Returning to the performance discussion above, let’s assume that an application updates u of the n existing archive entries with new contents. Again, the average entry size including contents and meta data is se. With FsOutputOption.GROW set, the overall amount of I/O data is in the order of O(u * se + n) rather than O(n * se) = O(sa). The additional term n results from reading the central directory at the end of the ZIP file and appending an updated version of it to the end again. As you can see, the total size of the archive file sa no longer appears in the formula, which is a major performance gain if u is significantly smaller than n.

The following are some corner cases where using the GROW preference may not pay off:

  1. If u = 0, then there is no update to the archive file at all, so using the GROW preference makes no difference.
  2. If u ~ n, e.g. by updating all archive entries, then the result is O(u * se + n) ~ O(n * se + n) = O(sa + n), which is a minor performance decrease because of writing the updated central directory. It also results in about double the size of the archive file because almost every archive entry is now duplicated. The latter may be irrelevant if n is small.
  3. If se is very small, e.g. by writing empty archive entries, then the result is O(u * se + n) ~ O(u + n). The resulting archive file might contain more archive entry meta data than content, especially because of the updated central directory.
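To get a feeling for the orders of magnitude, the two formulas can be put into a small back-of-the-envelope model. This is only a sketch under stated assumptions: the class GrowCostModel, its method names and the parameter cd (bytes written per central directory record) are hypothetical and not part of TrueZIP, the figures are invented, and constant factors which the big O notation hides are ignored.

```java
// Hypothetical illustration class; not part of TrueZIP.
class GrowCostModel {

    /**
     * Estimated I/O in bytes for a full archive update: if anything changes,
     * all n entries of average size se are written, i.e. roughly n * se = sa.
     */
    static long updateCost(long u, long n, long se) {
        return u == 0 ? 0 : n * se;
    }

    /**
     * Estimated I/O in bytes with GROW set: only the u new or updated entries
     * are written, plus one central directory record of cd bytes (an assumed
     * average) for each of the n entries.
     */
    static long growCost(long u, long n, long se, long cd) {
        return u == 0 ? 0 : u * se + n * cd;
    }

    public static void main(String[] args) {
        // Assumed figures: 10,000 entries of 100,000 bytes each on average
        // and 64 bytes per central directory record.
        long n = 10_000, se = 100_000, cd = 64;
        for (long u : new long[] {0, 10, n}) {
            System.out.printf("u=%d update=%d grow=%d%n",
                    u, updateCost(u, n, se), growCost(u, n, se, cd));
        }
    }
}
```

With these assumed figures, updating 10 of the 10,000 entries moves about 1.64 MB with GROW set compared to about 1 GB for a full update. For u = n, GROW is slightly more expensive than the full update because of the extra central directory, which matches the second corner case above.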