|
|
Output Style: Ready to encode. |
About Huffman Encoding: This browser-based utility, written by me in JavaScript, allows you to compress and decompress plaintext using a Huffman Coding, which performs compression on the character level. It reduces the amount of space used by common characters, essentially making the average character take up less space than usual.
Efficacy of the Compressor: The compression ratio (full size divided by compressed size) of any one compression is displayed in the output, and can range from .33 (for a single character) to 8 (for an infinitely-long repetition of a single character). This utility performs optimal character-wise compression - that is, for any set of randomly generated characters (even with different weights for each character), a Huffman Coding will perform maximal compression. However, for non-random strings of characters, a Huffman Coding will not perform maximal compression, and is thus only one part of modern compression algorithms.
Self-Containment: Technically speaking, a Huffman Coding is only a set of rules for compressing data. Likewise, it is possible to have encoded data without the coding itself, which can be useful in cases where all applicable data is relatively similar. This implementation, however, generates a new coding for each input, and the compressed output includes both the coding and the encoded data. This way, the output is entirely self-contained, and can be decompressed using this tool at any time, on any machine.
Encoding Protocol: To achieve the self-containment mentioned above, this utility has a protocol for returning the Huffman Coding and compressed data as a single binary string. This binary string starts with padding zeros (as necessary to fill out the output style), and then a one, which signals that the coding will begin on the next bit. At this point, the root node of the Huffman Tree is created. Note that the root always branches - if the text only contains one character, a superfluous second one will be added to complete the tree.
Next, a traversal is started from the root. If the next bit is a one, the next child becomes a leaf node which contains the next 8 bits (which are then skipped) as data; otherwise, it is a branch, and the traversal moves down to it until it is completed. Once the tree is complete, the code for any byte can be found by taking the path from the tree's root to the data's leaf (i.e. left right right) and making each 'left' a zero and each 'right' a one. From there, a dictionary describing the relationship between bytes and codes is created, and then used to encode the input bytes; this encoding is appended to the coding itself.
Data Safety: One benefit of writing a tool like this in JavaScript is that, while it's easily accessible on the web, it doesn't actually need to communicate with a central server. This reduces server usage, but more importantly, means that compression and decompression do not send any information over the internet, keeping it private. Note, however, that this encoding is in no way cryptographically-secure, and thus isolation to your computer is the only data security method in place here.