
Lempel-Ziv-Welch (LZW) is a universal lossless data compression algorithm created by Abraham Lempel, Jacob Ziv, and Terry Welch. It was published by Welch in 1984 as an improved implementation of the LZ78 algorithm published by Lempel and Ziv in 1978. The algorithm is designed to be fast to implement but is not usually optimal because it performs only limited analysis of the data.
The compressor algorithm builds a string translation table from the text being compressed. The string translation table maps fixed-length codes (usually 12-bit) to strings. The string table is initialized with all single-character strings (256 entries in the case of 8-bit characters). As the compressor character-serially examines the text, it stores every unique two-character string into the table as a code/character concatenation, with the code mapping to the corresponding first character. As each two-character string is stored, the first character is sent to the output. Whenever a previously-encountered string is read from the input, the longest such previously-encountered string is determined, and then the code for this string concatenated with the extension character (the next character in the input) is stored in the table. The code for this longest previously-encountered string is output and the extension character is used as the beginning of the next string.
The decompressor algorithm only requires the compressed text as an input, since it can build an identical string table from the compressed text as it is recreating the original text. However, an abnormal case shows up whenever the sequence character/string/character/string/character (with the same character for each character and string for each string) is encountered in the input and character/string is already stored in the string table. When the decompressor reads the code for character/string/character in the input, it cannot resolve it because it has not yet stored this code in its table. This special case can be dealt with because the decompressor knows that the extension character is the previously-encountered character.[1]
Compressor algorithm:
w = NIL;
while (read a char c) do
if (wc exists in dictionary) then
w = wc;
else
add wc to the dictionary;
output the code for w;
w = c;
endif
done
output the code for w;
Decompressor algorithm:
read a char k;
output k;
w = k;
while (read a char k) do
if (k > 255 && index k exists in dictionary) then
entry = dictionary entry for k;
else if (k > 255 && index k does not exist in dictionary)
entry = w + w[0];
else
entry = k;
endif
output entry;
add w+entry[0] to the dictionary;
w = entry;
done
TOBEORNOTTOBEORTOBEORNOTSuppose that we are using 256 (8-bit) ASCII code as the default dictionary. The length of this input is 24 characters. So this input requires 24 * 8 = 192 bits to store.
|
style="vertical-align: top; text-align: center; font-weight: bold;">c |
style="vertical-align: top; text-align: center; font-weight: bold;">wc |
style="vertical-align: top; text-align: center; font-weight: bold;">dictionary |
||
| T |
<NIL> |
T |
||
| O |
T |
TO |
T |
TO = <256> |
| B |
O |
OB |
O |
OB = <257> |
| E |
B |
BE |
B |
BE = <258> |
| O |
E |
EO |
E |
EO = <259> |
| R |
O |
OR |
O |
OR = <260> |
| N |
R |
RN |
R |
RN = <261> |
| O |
N |
NO |
N |
NO = <262> |
| T |
O |
OT |
O |
OT = <263> |
| T |
T |
TT |
T |
TT = <264> |
| O |
T |
TO |
||
| B |
TO |
TOB |
<256> |
TOB = <265> |
| E |
B |
BE |
||
| O |
BE |
BEO |
<258> |
BEO = <266> |
| R |
O |
OR |
||
| T |
OR |
ORT |
<260> |
ORT = <267> |
| O |
T |
TO |
||
| B |
TO |
TOB |
||
| E |
TOB |
TOBE |
<265> |
TOBE = <268> |
| O |
E |
EO |
||
| R |
EO |
EOR |
<259> |
EOR = <269> |
| N |
R |
RN |
||
| O |
RN |
RNO |
<261> |
RNO = <270> |
| T |
O |
OT |
||
| OT |
<263> |
After compression we have this sequence of 9-bits codes in output:
TOBEORNOT<256><258><260><265><259><261><263>
This sequence requires 16 * 9 = 144 bits to store.
|
style="vertical-align: top; text-align: center; font-weight: bold;">k |
style="vertical-align: top; text-align: center; font-weight: bold;">entry |
w+entry[0] |
style="vertical-align: top; text-align: center; font-weight: bold;">dictionary |
||
| T |
T |
T |
|||
| O |
T |
O |
TO |
O |
TO = <256> |
| B |
O |
B |
OB |
B |
OB = <257> |
| E |
B |
E |
BE |
E |
BE = <258> |
| O |
E |
O |
EO |
O |
EO = <259> |
| R |
O |
R |
OR |
R |
OR = <260> |
| N |
R |
N |
RN |
N |
RN = <261> |
| O |
N |
O |
NO |
O |
NO = <262> |
| T |
O |
T |
OT |
T |
OT = <263> |
| <256> |
T |
TO |
TT |
TO |
TT = <264> |
| <258> |
TO |
BE |
TOB |
BE |
TOB = <265> |
| <260> |
BE |
OR |
BEO |
OR |
BEO = <266> |
| <265> |
OR |
TOB |
ORT |
TOB |
ORT = <267> |
| <259> |
TOB |
EO |
TOBE |
EO |
TOBE = <268> |
| <261> |
EO |
RN |
EOR |
RN |
EOR = <269> |
| <263> |
RN |
OT |
RNO |
OT |
RNO = <270> |
The method became widely used in the program compress, which became a more or less standard utility in Unix systems circa 1986. (It has since disappeared from many for both legal and technical reasons.) Several other popular compression utilities also used the method, or closely related ones.
It became very widely used after it became part of the GIF image format in 1987. It may also (optionally) be used in TIFF files.
LZW compression provided a better compression ratio, in most applications, than any well-known method available up to that time. It became the first widely used universal data compression method on computers. It would typically compress large English texts to about half of their original sizes.
Today, an implementation of the algorithm is contained within the popular Adobe Acrobat software program.
This example shows the LZW algorithm in action, showing the status of the output and the dictionary at every stage, both in encoding and decoding the message. In order to keep things clear, let us assume that we're dealing with a simple alphabet - capital letters only, and no punctuation or spaces. This example has been constructed to give reasonable compression on a very short message; when used on real data, repetition is generally less pronounced, and so the initial parts of a message will see little compression. As the message grows, however, the compression ratio tends asymptotically to the maximum.[2] A message to be sent might then look like the following:
TOBEORNOTTOBEORTOBEORNOT#
The # is a marker used to show that the end of the message has been reached. Clearly, then, we have 27 symbols in our alphabet (the 26 capital letters A through Z, plus the # character). A computer will render these as strings of bits; 5-bit strings are needed to give sufficient combinations to encompass the entire dictionary. As the dictionary grows, the strings will need to grow in length to accommodate the additional entries. A 5-bit string gives 25 = 32 possible combinations of bits, and so when the 33rd dictionary word is created, the algorithm will have to start using 6-bit strings (for all strings, including those which were previously represented by only five bits). Note that since the all-zero string 00000 is used, and is labeled "0", the 33rd dictionary entry will be labeled 32. The initial dictionary, then, will consist of the following:
# = 00000
A = 00001 B = 00010
C = 00011 .
. .
Z = 11010
If we weren't using LZW, and just sent the message as it stands (25 symbols at 5 bits each), it would require 125 bits. We will be able to compare this figure to the LZW output later. We are now in a position to apply LZW to the message.
Symbol: Bit Code: New Dictionary Entry:
(= output)
T 20 = 10100 O 15 = 01111 28: TO <--- Don't forget, we originally had 27 symbols, so the next one is 28th.
B 2 = 00010 29: OB E 5 = 00101 30: BE
O 15 = 01111 31: EO <--- start using 6-bit strings R 18 = 010010 32: OR
N 14 = 001110 33: RN O 15 = 001111 34: NO
T 20 = 010100 35: OT TO 28 = 011100 36: TT
BE 30 = 011110 37: TOB OR 32 = 100000 38: BEO
TOB 37 = 100101 39: ORT EO 31 = 011111 40: TOBE
RN 33 = 100001 41: EOR OT 35 = 100011 42: RNO
# 0 = 000000 43: OT#
This is somewhat clearer:
Current Next Output Value Extended Sequence Char (# of bits) Dictionary
NULL T T O 20 = 5 bits 27: TO <-- This IS the 28th entry, but the initial entries are numbered 0-26 so this is #27.
O B 15 = 5 bits 28: OB B E 2 = 5 bits 29: BE
E O 5 = 5 bits 30: EO O R 15 = 5 bits 31: OR
R N 18 = 6 bits 32: RN <-- Starting at R, 6 bits are used