Monday, August 4, 2014

URL-Safe Compressed and Enhanced UUID/GUID

Below is a simple method to compress a UUID (128 bits represented by 32 hexadecimal characters with an additional 4 separator characters) into a 22 character string (base64). But, a 22 character base64 string can actually hold 132 bits of data (6 bits per char X 22 chars). As such, this method injects 4 additional random bits of data which increases the potential number of available unique identifiers by a factor of 16.

In addition, all the selected base64 characters are URL-safe.


This UUID : 7e47c34a-eebc-4387-b5a4-c6b558bdc407

is compressed down to this: 35Hw0ruvEOHbWkxrVYvcQH

public class KeyGen {

    private KeyGen() {
    } // constructor

    // base64url, see: section 5
    private static String chars
        = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

     * Generates a UUID and compresses it into a base 64 character string;  this
     * results in a 22 character string and since each character represents 6 bits
     * of data that means the result can represent up to 132 bits.  However, since
     * a UUID is only 128 bits, 4 additional randomize bits are inserted into the
     * result (if desired); this means that the number of available unique IDs is
     * increased by a factor of 16
     * @param enhanced specifies whether or not to enhance the result with 4
     *                 additional bits of data since a 22 base64 characters
     *                 can hold 132 bits of data and a UUID is only 128 bits
     * @return a 22 character string where each character is from the file and url safe
     * base64 character set [A-Za-z0-9-_]
    public static String getCompressedUuid(boolean enhanced) {
        UUID uuid = UUID.randomUUID();
        return compressLong(uuid.getMostSignificantBits(), enhanced)
               + compressLong(uuid.getLeastSignificantBits(), enhanced);
    } // getCompressedUuid()

    // compress a 64 bit number into 11 6-bit characters
    private static String compressLong(long key, boolean enhance) {
        // randomize 2 bits as a prefix for the leftmost character which would
        // otherwise only have 4 bits of data in the 6 bits
        long prefix = enhance ? (long)(Math.random() * 4) << 62 : 0;

        // extract the first 6-bit character from the key
        String result = "" + chars.charAt((int)(key & 0x3f));

        // shifting in 2 extra random bits since we have the room
        key = ((key >>> 2) | prefix) >>> 4;

        // iterate thru the next 10 characters
        for (int i = 1; i < 11; i++) {
            // strip off the last 6 bits from the key, look up the matching character
            // and prepend that character to the result
            result = chars.charAt((int)(key & 0x3f)) + result;
            // logical bit shift right so we can isolate the next 6 bits
            key = key >>> 6;

        return result;
    } // compressLong()

} // class KeyGen


Dominique BAMOUNI said...

Good Job!!
I'm confused. The compression purpose is to reduce the number of bits. However, you have GUID with 128bit, after compression you obtain 132 bits, so 4 more bits. I can see the advantage of increasing the number of GUID but What about the compression? It seems like it doesn't help for compression. Am I right?

Russ Jackson said...

Hi Dominique, that's a great observation and question.

When you compress the character representation of the GUID down using a base 64 encoding you can get it down to 22 characters. You could then expand that 22 chars back into a 128 bit GUID by reversing the process (you have to drop the last 4 bits, which could have been filled in with meaningless data).

But, since a base 64 character requires 6 bits to uniquely represent it, 22 characters can effectively hold 132 bits of numeric data. So, the 22 character base 64 string has 4 extra bits.

If you don't plan to reconstitute those 22 chars into a 128 bit number you can use those 4 extra bits to increase the number of unique 22 character strings by a factor of 16. If you don't use those extra bits you can just fill them in with zeros.

Look at it this way...

32 characters = 256 bits of character data (8 bits per character)

But, only half of those bits are used to represent the 128 bit GUID. Each 4 bits of numeric data in the GUID is expanded into 8 bits of character data in the string.

Likewise, 22 characters = 132 bits of character data. Of that, with a base 64 encoding, 128 bits of that represents the numeric data. In that very last 6 bit base 64 character only 2 of its bits are being used. There are 4 bits available for actual data.

Putting those 4 bits to work by filling them with random data instead of all zeros lets you increase the number of unique 22 character strings by a factor of 16.

I hope I was more articulate with this response. Let me know if I didn't clear that up.