1.58 bit Algebra


This week I’ve looked at an LLM optimization: using 1.58-bit algebra to speed up training and inference.

The idea is to operate with ternaries: $\{-1, 0, 1\}$. Storing such a number naively requires two bits; however, by using modular (base-3) arithmetic, more values can be packed into a block of bits.

Since $n$ bits can store $2^n$ states, and storing $m$ ternaries requires $3^m$ states, setting $2^n = 3^m$ gives $n = m\,\frac{\ln 3}{\ln 2}$. Since $\frac{\ln 3}{\ln 2} \approx 1.58496250072116 \approx 1.58$, each ternary needs about 1.58 bits.
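To make the packing concrete, here is a minimal Python sketch (the names `pack5` and `unpack5` are mine) that stores five ternaries in one byte using base-3 arithmetic, since $3^5 = 243 \le 256 = 2^8$:

```python
def pack5(ternaries):
    """Pack exactly 5 ternary values (-1, 0 or 1) into one byte."""
    assert len(ternaries) == 5
    value = 0
    for t in reversed(ternaries):
        value = value * 3 + (t + 1)   # shift each trit into {0, 1, 2}
    return value                      # 0 <= value <= 242

def unpack5(value):
    """Recover the 5 ternary values from a packed byte."""
    ternaries = []
    for _ in range(5):
        ternaries.append(value % 3 - 1)  # back to {-1, 0, 1}
        value //= 3
    return ternaries

assert unpack5(pack5([1, -1, 0, 1, -1])) == [1, -1, 0, 1, -1]
```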

Here is a table showing how many tri-state values can be packed in different bit-block sizes:

| Bits | Ternaries per block | Bits per ternary |
|-----:|--------------------:|-----------------:|
|    2 |                   1 |             2.00 |
|    3 |                   1 |             3.00 |
|    4 |                   2 |             2.00 |
|    5 |                   3 |             1.67 |
|    6 |                   3 |             2.00 |
|    7 |                   4 |             1.75 |
|    8 |                   5 |             1.60 |
|   16 |                  10 |             1.60 |
|   32 |                  20 |             1.60 |
|   64 |                  40 |             1.60 |
|  128 |                  80 |             1.60 |
|  256 |                 161 |             1.59 |
|  512 |                 323 |             1.59 |
| 1024 |                 646 |             1.59 |
| 2048 |                1292 |             1.59 |
| 4096 |                2584 |             1.59 |
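The table can be reproduced with a few lines of Python; the key fact is that the number of ternaries that fits in $n$ bits is $\lfloor n / \log_2 3 \rfloor$:

```python
import math

# Largest m with 3^m <= 2^n is m = floor(n / log2(3)).
for bits in (2, 3, 4, 5, 6, 7, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096):
    m = int(bits / math.log2(3))
    assert 3 ** m <= 2 ** bits < 3 ** (m + 1)  # exact integer sanity check
    print(f"{bits:5d} {m:5d} {bits / m:5.2f}")
```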

Note that the 8-, 16-, 32- and 64-bit storage approaches all deliver $1.60$ bits per ternary, which is very close to the ideal ratio of $1.58$. A very small improvement only starts at 256-bit blocks; however, doing multiplication, division and modulo on such wide values to pack and unpack the data is not fun and definitely not computationally optimal.

The next issue we need to pay attention to is the numerical stability of the system.

In the ternary space we define the two operators, addition and multiplication, as follows:

| $a$ | $b$ | $a+b$ | $a \cdot b$ |
|----:|----:|------:|------------:|
|  -1 |  -1 |    -1 |           1 |
|  -1 |   0 |    -1 |           0 |
|  -1 |   1 |     0 |          -1 |
|   0 |  -1 |    -1 |           0 |
|   0 |   0 |     0 |           0 |
|   0 |   1 |     1 |           0 |
|   1 |  -1 |     0 |          -1 |
|   1 |   0 |     1 |           0 |
|   1 |   1 |     1 |           1 |
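One way to read this table: addition is ordinary integer addition saturated to $[-1, 1]$, and multiplication is the ordinary integer product, which already stays in $[-1, 1]$. A minimal Python sketch (the names `tadd` and `tmul` are mine):

```python
def tadd(a, b):
    """Saturating ternary addition: the integer sum clamped to [-1, 1]."""
    return max(-1, min(1, a + b))

def tmul(a, b):
    """Ternary multiplication: the ordinary integer product."""
    return a * b

# Reproduce the truth table above.
for a in (-1, 0, 1):
    for b in (-1, 0, 1):
        print(a, b, tadd(a, b), tmul(a, b))
```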

Given these operations, we can evaluate:

\[1 + 1 - 1 - 1 = ((1+1) - 1) - 1 = (1 - 1) - 1 = 0 - 1 = -1\]

or

\[1 + 1 - 1 - 1 = 1 + (1 + (-1 - 1)) = 1 + (1 - 1) = 1 + 0 = 1\]

This inconsistency is caused by the saturating rules $1+1=1$ and $-1-1=-1$, which make addition non-associative.

A more meaningful way of implementing the order of operations is similar to voting: count how many $-1$s and how many $1$s there are, and establish the result based on which has the majority.

$1 + 1 - 1 - 1 = 0$, because there are two $1$s and two $-1$s.
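A small sketch of this voting-style sum, assuming ties resolve to $0$ (the name `tsum` is mine):

```python
def tsum(values):
    """Order-independent ternary sum: the sign of the vote count.

    Each 1 votes +1, each -1 votes -1, zeros abstain; the majority
    wins, and a tie yields 0.
    """
    votes = sum(values)               # exact integer accumulation
    return (votes > 0) - (votes < 0)  # sign(votes), in {-1, 0, 1}

assert tsum([1, 1, -1, -1]) == 0      # the example above: a 2-2 tie
```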

This lack of associativity of the ternary addition operator requires a special treatment of addition. In a neural network, a sum of products is a common operation:

\[R = b + \sum_{i=1}^{n} A_i \cdot w_i\]

Here $R$ is the result, $b$ is the bias, $n$ is the size of the previous layer, $A$ is the output of the previous layer and $w$ are the weights of the current neuron.

$n$ is generally a large number, much larger than the 40, 20 or 5 ternaries that can be stored in 64, 32 or 8 bits. For this reason, the implementation of the dot product ($\sum_{i=1}^{n} A_i \cdot w_i$) for a fixed block size must return the signed count of winning ternaries in the sum: $1+1+1$ will return $3$, while $-1-1$ will return $-2$. Once the whole sum has been evaluated, the value is converted back into a ternary.
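A sketch of this accumulate-then-collapse evaluation, with hypothetical helper names (`ternary_dot`, `to_ternary`):

```python
def ternary_dot(a, w):
    """Dot product of two ternary vectors, accumulated as an integer.

    Each product a_i * w_i is in {-1, 0, 1}; summing them exactly
    (e.g. 1+1+1 = 3, -1-1 = -2) sidesteps the associativity problem.
    """
    return sum(ai * wi for ai, wi in zip(a, w))

def to_ternary(votes):
    """Collapse the integer vote count back to a ternary via its sign."""
    return (votes > 0) - (votes < 0)

# R = sign(b + sum(A_i * w_i)), keeping the accumulator as a plain integer
A = [1, 1, -1, 0, 1]
w = [1, -1, -1, 1, 1]
b = 0
R = to_ternary(b + ternary_dot(A, w))  # products sum to 2, so R = 1
```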

