Monday, August 4, 2014

URL-Safe Compressed and Enhanced UUID/GUID

Below is a simple method to compress a UUID (128 bits represented by 32 hexadecimal characters with an additional 4 separator characters) into a 22 character string (base64). But, a 22 character base64 string can actually hold 132 bits of data (6 bits per char X 22 chars). As such, this method injects 4 additional random bits of data which increases the potential number of available unique identifiers by a factor of 16.

In addition, all the selected base64 characters are URL-safe.

Example:

This UUID : 7e47c34a-eebc-4387-b5a4-c6b558bdc407

is compressed down to this: 35Hw0ruvEOHbWkxrVYvcQH

public class KeyGen {

    private KeyGen() {
    } // constructor

    // base64url, see:  http://tools.ietf.org/html/rfc4648 section 5
    private static String chars
        = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

    /**
     * Generates a UUID and compresses it into a base 64 character string;  this
     * results in a 22 character string and since each character represents 6 bits
     * of data that means the result can represent up to 132 bits.  However, since
     * a UUID is only 128 bits, 4 additional randomize bits are inserted into the
     * result (if desired); this means that the number of available unique IDs is
     * increased by a factor of 16
     *
     * @param enhanced specifies whether or not to enhance the result with 4
     *                 additional bits of data since a 22 base64 characters
     *                 can hold 132 bits of data and a UUID is only 128 bits
     * @return a 22 character string where each character is from the file and url safe
     * base64 character set [A-Za-z0-9-_]
     */
    public static String getCompressedUuid(boolean enhanced) {
        UUID uuid = UUID.randomUUID();
        return compressLong(uuid.getMostSignificantBits(), enhanced)
               + compressLong(uuid.getLeastSignificantBits(), enhanced);
    } // getCompressedUuid()

    // compress a 64 bit number into 11 6-bit characters
    private static String compressLong(long key, boolean enhance) {
        // randomize 2 bits as a prefix for the leftmost character which would
        // otherwise only have 4 bits of data in the 6 bits
        long prefix = enhance ? (long)(Math.random() * 4) << 62 : 0;

        // extract the first 6-bit character from the key
        String result = "" + chars.charAt((int)(key & 0x3f));

        // shifting in 2 extra random bits since we have the room
        key = ((key >>> 2) | prefix) >>> 4;

        // iterate thru the next 10 characters
        for (int i = 1; i < 11; i++) {
            // strip off the last 6 bits from the key, look up the matching character
            // and prepend that character to the result
            result = chars.charAt((int)(key & 0x3f)) + result;
            // logical bit shift right so we can isolate the next 6 bits
            key = key >>> 6;
        }

        return result;
    } // compressLong()

} // class KeyGen



2 comments:

Unknown said...

Good Job!!
I'm confused. The compression purpose is to reduce the number of bits. However, you have GUID with 128bit, after compression you obtain 132 bits, so 4 more bits. I can see the advantage of increasing the number of GUID but What about the compression? It seems like it doesn't help for compression. Am I right?

Russ Jackson said...

Hi Dominique, that's a great observation and question.

When you compress the character representation of the GUID down using a base 64 encoding you can get it down to 22 characters. You could then expand that 22 chars back into a 128 bit GUID by reversing the process (you have to drop the last 4 bits, which could have been filled in with meaningless data).

But, since a base 64 character requires 6 bits to uniquely represent it, 22 characters can effectively hold 132 bits of numeric data. So, the 22 character base 64 string has 4 extra bits.

If you don't plan to reconstitute those 22 chars into a 128 bit number you can use those 4 extra bits to increase the number of unique 22 character strings by a factor of 16. If you don't use those extra bits you can just fill them in with zeros.

Look at it this way...

32 characters = 256 bits of character data (8 bits per character)

But, only half of those bits are used to represent the 128 bit GUID. Each 4 bits of numeric data in the GUID is expanded into 8 bits of character data in the string.

Likewise, 22 characters = 132 bits of character data. Of that, with a base 64 encoding, 128 bits of that represents the numeric data. In that very last 6 bit base 64 character only 2 of its bits are being used. There are 4 bits available for actual data.

Putting those 4 bits to work by filling them with random data instead of all zeros lets you increase the number of unique 22 character strings by a factor of 16.

I hope I was more articulate with this response. Let me know if I didn't clear that up.

Thanks.

Russ