Specification
Tamp uses a single-pass DEFLATE-like algorithm that is optimized for a fast, simple implementation in micropython. Trade-offs were made between code-complexity, speed, memory-usage, and compression-ratio. This document describes the compression algorithm, design choices, and the data format. Tamp outputs a continuous bit stream, where the most-significant-bit (MSb) comes first.
Stream Header
Tamp has a single header byte described in the following table. The locations are zero index from the beginning of the stream. The bit-location 0 is equivalent to typical MSb position 7 of the first byte.
Bits |
Name |
Description |
|---|---|---|
[7,6,5] |
window |
Number of bits, minus 8, used to represent the size
of the shifting window.
e.g. A 12-bit window is encoded as the number 4, |
[4,3] |
literal_size |
Number of bits, minus 5, in a literal.
For example, |
[2] |
custom_dictionary |
A custom dictionary initialization method was used and must be provided at decompression. |
[1] |
extended |
Enables extended format features (RLE, extended match encoding). Generally improves compression, introduced in tamp v2.0.0. |
[0] |
more_header |
Next byte is header byte 2 (see Header Byte 2). Implies
|
Header Byte 2
Present only when more_header (bit 0 of byte 1) is set.
All 8 bits are reserved for future use and must be 0.
Decompressors must reject non-zero values with TAMP_INVALID_CONF.
Stream Encoding/Decoding
After the header bytes is the data stream. The datastream is written in bits, so all data is packed tightly with no padding. At a high level, Tamp applies LZSS compression, followed by a fixed, pre-defined Huffman coding for representing the match length.
LZSS
Tamp uses a slightly modified LZSS compression algorithm. Modifications are made to make the implementation simpler/faster.
Initialize a ring_buffer of size
1 << windowdefined in the header. See Dictionary Initialization for initialization details.Starting at the beginning of the plaintext, find the longest match existing in the dictionary. If no pattern (<2 character match) is found, output a literal. If a pattern is detected
literal:
0b1 | literal. The first bit (1) represents that the following bits represent a literal. Theliteralisliteral_sizebits long.token:
0b0 | length-huffman-code | offset. The first bit (0) represents that the following bits represent a token. The length of the pattern match is encoded via a pre-defined static Huffman code. Finally, theoffsetiswindowbits long, and points at the offset from the beginning of the dictionary buffer to the pattern. The shortest pattern-length is either going to be 2 or 3 bytes, depending onwindowandliteralparameters. The shortest pattern-length encoding must be shorter than an equivalent stream of literals. In the basic (non-extended) format, the longest pattern-length is the minimum pattern-length plus 13. When theextendedflag is set, longer matches are possible via extended match encoding.
Classically, the offset is from the current position in the buffer. Doing so results
in the offset distribution slightly favoring smaller numbers. Intuitively, it makes
sense that patterns are more likely to occur closer to the current text. A proposed idea
was to include the offset most-significant bit with the length-huffman-code.
Unfortuneately, the probability distribution wasn't biased enough to increase compression.
For this reason, we just encode the simple offset from the beginning of the dictionary.
This is easier to implement and has the potential to execute quicker.
Similarly, attempts to include the is_literal flag in the huffman coding did not
increase compression. An explicit is_literal flag made the code faster and simpler.
Dictionary Initialization
For short messages, having a better initial dictionary can help improve compression ratios.
The amount of improvement would be dependent on the type of data being compressed.
Given that the contents of raw-binary data could be anything, we chose to focus on improving
typical english text. In order to save device space, a whole dictionary is not saved to disk.
Instead, we select 16 common characters appropriate for the literal bit width and
pseudo-randomly fill up the dictionary. We use the XorShift32 pseudo-random number generator
due to it's good randomness characteristics, and simple implementation.
The character table is chosen based on the literal parameter:
literal=7 or 8: 16 common english text and markup characters:
\x000ei>to<ans\nr/. ^there is a "space" there. (or as explicit hex values) 0x20 0x00 0x30 0x65 0x69 0x3E 0x74 0x6F 0x3C 0x61 0x6E 0x73 0x0A 0x72 0x2F 0x2E
literal=5 or 6: The 16 most common english characters (
" etaoinshrdlcumw") downshifted to the target bit width by masking with(1 << literal) - 1. Since ASCII encodes letter position in the low bits, this preserves alphabetic identity.
Warning
When the extended header flag is not set (v1 format), literal is
always treated as 8 for backwards compatibility, regardless of the configured
value.
The chosen seed, 3758097560, was discovered experimentally to give typically good results for 7/8 bit data.
All of this amounts to a few percent compression improvement for short messages.
def _xorshift32(seed):
while True:
seed ^= (seed << 13) & 0xFFFFFFFF
seed ^= (seed >> 17) & 0xFFFFFFFF
seed ^= (seed << 5) & 0xFFFFFFFF
yield seed
def initialize_dictionary(size, literal=8):
if literal <= 5:
chars = bytes([c & 0x1F for c in b" etaoinshrdlcumw"])
elif literal <= 6:
chars = bytes([c & 0x3F for c in b" etaoinshrdlcumw"])
else:
chars = b" \x000ei>to<ans\nr/."
def _gen_stream(xorshift32):
for _ in range(size >> 3):
value = next(xorshift32)
yield chars[value & 0x0F]
yield chars[value >> 4 & 0x0F]
yield chars[value >> 8 & 0x0F]
yield chars[value >> 12 & 0x0F]
yield chars[value >> 16 & 0x0F]
yield chars[value >> 20 & 0x0F]
yield chars[value >> 24 & 0x0F]
yield chars[value >> 28 & 0x0F]
return bytearray(_gen_stream(_xorshift32(3758097560)))
Huffman Coding
Huffman coding encodes high-probability values with less bits, and less-likely values with more bits. In order for huffman coding to work, no encoding is allowed to be a prefix of another encoding. If all values have equal probability, simple binary encoding is more efficient.
The following maps the pattern-size (to be added to the minimum pattern-length) to the bits representing the huffman code.
huffman_coding = {
0: 0b0,
1: 0b11,
2: 0b1000,
3: 0b1011,
4: 0b10100,
5: 0b100100,
6: 0b100110,
7: 0b101011,
8: 0b1001011,
9: 0b1010100,
10: 0b10010100,
11: 0b10010101,
12: 0b10101010,
13: 0b100111,
"FLUSH": 0b10101011,
}
The match-size probabilities that generated this table were generated over the enwik8 dataset. This huffman coding was chosen such that the longest huffman code is 8 bits long, making it easier to store and index into. The maximum match-size is more likely than the second-highest match-size because all match-sizes greater than the maximum size get down-mapped.
For any given huffman coding schema, a equivalent coding can be obtained by inverting all the bits (reflecting the huffman tree). The single-bit, most common code 0b0 representing a pattern-size 2 is intentionally represented as 0b0 instead of 0b1. This makes the MSb of all other codes be 1, simplifying the decoding procedure because the number of bits read doesn't strictly have to be recorded.
Extended Format (v2.0.0+)
When the extended header bit is set, two additional token types are available:
RLE (Run-Length Encoding) and Extended Match. These use Huffman symbols 12 and 13
respectively, which in the basic format would represent match sizes min_pattern_size + 12
and min_pattern_size + 13.
Extended Huffman Encoding
Both RLE and Extended Match use a secondary Huffman encoding to represent their payload values. This encoding combines a Huffman code (without the literal flag) with trailing bits:
Read the Huffman symbol (12 for RLE, 13 for Extended Match) with the literal flag (
0b0).Decode an additional Huffman code (reusing the same table, but without the leading literal flag bit). In this secondary context, symbol 14 (the FLUSH bit pattern
0b10101011) is a valid value, giving Huffman indices 0 through 14.Read trailing bits (4 bits for RLE, 3 bits for Extended Match).
Combine:
value = (huffman_index << trailing_bits) + trailing_bits_value
RLE Token (Symbol 12)
RLE encodes runs of repeated bytes efficiently. The repeated byte is implicitly
the last byte written to the window buffer. If no bytes have been written yet
(i.e., window_pos == 0), the byte at position window_size - 1 of the
initial dictionary is used.
huffman_code[12]=0xAA(9 bits including literal flag)countranges from 2 to 241:(14 << 4) + 15 + 2 = 241
Window update: Only the first 8 bytes are written to the dictionary (no wrap-around). If fewer than 8 bytes remain before the end of the window buffer, only those bytes are written. Writing only 8 bytes preserves useful dictionary content for future pattern matches, since filling the window with a single repeated byte would reduce match opportunities.
RLE Token Structure:
+---+------------+----------------------------+
| 0 | huffman[12]| extended_huffman(count - 2) |
+---+------------+----------------------------+
|1b | 8 bits | (1 ~ 8) + 4 bits |
+---+------------+----------------------------+
Extended Match Token (Symbol 13)
Extended Match allows pattern matches longer than the basic format's maximum of
min_pattern_size + 13. It is used when a match exceeds min_pattern_size + 11.
huffman_code[13]=0x27(7 bits including literal flag)offsetiswindowbits, pointing to the start of the patternMaximum extra size:
(14 << 3) + 7 + 1 = 120Maximum total match size:
min_pattern_size + 11 + 120 = min_pattern_size + 131
The -12 offset ensures extended matches start at min_pattern_size + 12, leaving
symbols 0-11 for basic matches (0-11 maps to min_pattern_size through min_pattern_size + 11).
Window constraints: The source pattern cannot span past the window buffer boundary; the compressor terminates extended matches early if they would cross this boundary. Similarly, destination writes do not wrap-around; only bytes up to the end of the window buffer are written. This simplifies implementation while having minimal impact on compression ratio (approximately 0.02% loss).
Extended Match Token Structure:
+---+------------+-------------------------+--------+
| 0 | huffman[13]| extended_huffman(sz) | offset |
+---+------------+-------------------------+--------+
|1b | 6 bits | (1 ~ 8) + 3 bits | window |
+---+------------+-------------------------+--------+
Where sz = match_size - min_pattern_size - 12
Flush Symbol
A special FLUSH symbol is encoded as the least likely Huffman code.
In many compression algorithms, a flush() can only be called at the end of the
compression stream, and the compressor cannot be used anymore.
In microcontroller applications, the user may want to flush the compressor buffer
while still continuing to compress more data. Examples include:
Flushing a chunk of logs to disk to prepare if power is removed.
Pushing a chunk of collected data to a remote server.
Since the output bit stream must be byte-aligned for writing, there may be up to 7
pending bits that have not yet formed a complete byte.
Invoking the flush method can have one of two results:
If the output is already byte-aligned and
more_headeris not set, no action is performed.If there are pending bits, or if
more_headeris set, the FLUSH Huffman code is written. Nooffsetbits are written following the FLUSH code. The remaining bits are zero-padded to the next byte boundary.
On reading, if a FLUSH is read, the reader discards the remaining bits up to the
next byte boundary.
When more_header is set, a FLUSH is always emitted (even when byte-aligned)
to support append mode (see Dictionary Reset (Double-FLUSH)).
In the worst case (1 pending bit), a FLUSH symbol (9 bits) and 6 padding bits are
written, adding 15 bits of overhead to the output stream.
Dictionary Reset (Double-FLUSH)
Dictionary reset allows appending new compressed data to an existing stream without retaining the previous compressor state. After a reset, both sides start fresh with a default dictionary.
When the more_header (header byte 1, bit 0) is set, two consecutive FLUSH tokens signal a dictionary reset.
Without more_header, double-FLUSHes are treated as two ordinary flushes.
Old decompressors (<2.1.0) reject more_header streams at the header, preventing silent corruption.
On the compressor side, reset_dictionary flushes pending data, writes
exactly two consecutive FLUSHes, then re-initializes the window and all
internal state.
On the decompressor side: 1. Re-initializes the window (see Dictionary Initialization). Custom dictionaries cannot be supplied at this time. 2. Resets window position to zero. 3. Continues decompressing with a fresh dictionary.
Miscellaneous
No terminating character is builtin. Tamp relies on external framing (such as from the filesystem) to know when the data stream is complete. The final byte of a stream is zero-padded. The maximum padding is 7 zero bits.