
What If Neural Networks Don't Need Floating Point?

Estimated reading time: 14-20 minutes | ~3,600 words

TL;DR

  • I built a ternary ({-1, 0, +1}) inference engine from scratch in C: 1,595 lines, no third-party dependencies (Apple Accelerate used only for the BLAS comparison and FP32 output projection on macOS).
  • Ternary matmul replaces every floating-point multiply with an integer addition or subtraction. A weight of +1 means “add.” A weight of -1 means “subtract.” A weight of 0 means “skip.”
  • My SDOT kernel runs at 95 GOPS on an M3 Pro: 5.5x faster than hand-written NEON FP32 and 1.7x faster than Apple’s Accelerate BLAS. The speedup comes from memory bandwidth, not compute.
  • A synthetic transformer config (12 layers, dim=2048, 472M ternary params, random weights, vocab=1000) runs at 126 tokens/sec on CPU. No GPU. The 2-bit packed representation is 118 MB vs. 1.9 GB at FP32. The SDOT kernel reads INT8-expanded weights (~472 MB); the benchmark binary also keeps the packed bitmasks resident for the lossless V1 path, bringing total live ternary storage to ~590 MB.
  • ARM’s I8MM instruction (vmmlaq) was 4x slower than SDOT for this workload. Instruction width doesn’t matter if data layout doesn’t match.

Start With Bandwidth

LLMs are memory-bound systems. Every generated token means rereading a huge pile of weights from cache or DRAM. Once those weight matrices get bigger than cache, inference speed is mostly a question of how quickly the hardware can move them, not how quickly it can multiply them.

Mark Horowitz’s 2014 ISSCC plenary is useful here because it points in the same direction. At 45nm:

| Operation            | Energy (pJ) |
|----------------------|-------------|
| INT8 Add             | 0.03        |
| INT32 Add            | 0.1         |
| FP32 Multiply        | 3.7         |
| FP32 FMA (mul + add) | 4.6         |
| DRAM Read            | 640         |

A floating-point multiply costs 37x more energy than an integer add. And a DRAM read costs ~140x more than an FP32 FMA. Arithmetic matters. Moving weights matters more. These ratios hold roughly across process nodes; everything scales together.

A modern LLM does billions of multiply-accumulates per token. Every single one is a weight times an activation. If the weight is constrained to {-1, 0, +1}, two things happen. The arithmetic gets simpler. And the representation gets smaller.

No floating-point multiplier on the hot path. But the thing I ended up caring about more wasn’t the arithmetic. It was the data movement. If ternary weights are represented in a way the CPU likes, the real win is that each layer reads much less data from memory.

I wanted to know how far that goes in practice, so I built an inference engine that doesn’t use floating-point multiplication for any linear layer. What I think this project establishes is the systems side of ternary inference: if the weights are natively ternary, a CPU kernel can be very fast. What it does not establish is downstream model quality for this exact engine. For that, I lean on the BitNet papers.


Three Values Are Enough

The idea has a decade of history. BinaryConnect (Courbariaux et al., 2015) trained networks with weights constrained to {-1, +1}. XNOR-Net (Rastegari et al., 2016) binarized both weights and activations, replacing dot products with XNOR + popcount (two single-cycle CPU instructions).

The accuracy gap was brutal. XNOR-Net hit 51% top-1 on ImageNet versus AlexNet’s 57%. Binary networks stayed academic for years.

The breakthrough came from Microsoft Research in 2024. BitNet b1.58 (Ma et al.) uses ternary weights: {-1, 0, +1}. That third value, zero, is the key. It gives the network a way to say “this input doesn’t matter for this output.” Pure binary networks waste capacity using ±1 pairs to cancel out irrelevant connections. Ternary gets explicit sparsity for free.

The “1.58” comes from information theory: log₂(3) ≈ 1.58 bits per weight. Each extra 0.58 bits over pure binary carries meaningful information about which connections matter.

The result: from 3B parameters up, ternary models trained from scratch match FP16 transformers on standard benchmarks. Microsoft’s BitNet b1.58-2B-4T (April 2025, MIT license) matches Qwen2.5 1.5B on GSM8K math reasoning while using 0.4 GB of non-embedding memory and 82% less energy per inference.

This isn’t post-training compression. The model is trained from scratch at ternary precision using a straight-through estimator. Full-precision shadow weights accumulate gradient updates during training, but every forward pass quantizes them to {-1, 0, +1}. After training, the shadow weights are thrown away.

I think the missing intuition here is redundancy. Modern transformers are wildly overparameterized. They have enough spare capacity that training can adapt to a coarse weight alphabet, as long as the optimization process knows about that constraint from the start. Ternary inference sounds absurd if you imagine taking a good FP16 model and crushing it down afterward. It sounds much less surprising if you imagine training a very large model from scratch while telling it up front that weights only get three values and the network needs to route information around that.

That distinction matters for the rest of this post. BitNet’s papers are the evidence that trained ternary models can preserve quality. My implementation is evidence for a different claim: once the weights are ternary, the runtime system can be simple and fast.


What a Ternary Linear Layer Actually Computes

A standard linear layer computes y = x @ W. Each output element is a dot product: y[i] = Σ x[j] * W[i][j].

In BitNet’s BitLinear layer, this becomes:

Input x (FP16)
        ↓
RMSNorm
        ↓
Absmax quantise activations → INT8
    │  scale_a = max(|x|) / 127
    │  x_q = clip(round(x / scale_a), -128, 127)
        ↓
Absmean quantise weights → ternary {-1, 0, +1}
    │  β = mean(|W|)
    │  W_q = clip(round(W / β), -1, +1)
        ↓
Integer matmul:  y_q = x_q @ W_q     ← additions and subtractions only
        ↓
Dequantise:  y = y_q * β * scale_a

The key: x_q @ W_q is an INT8 vector multiplied by a ternary matrix. Since every weight is -1, 0, or +1, each “multiply” is actually an add, subtract, or skip. No FP multiplier touches the hot path.


Building the Kernel

I built three kernel variants, each teaching something about why the next one is faster. The full source is on GitHub: 1,595 lines of C total.

V0: Naive (branch per weight)

The obvious implementation. For each output element, walk the weights and branch on each value:

void ternary_matvec_naive(const float *x, const int8_t *w,
                          float *out, int in_dim, int out_dim, float scale) {
    for (int o = 0; o < out_dim; o++) {
        float sum = 0.0f;
        const int8_t *row = w + o * in_dim;
        for (int i = 0; i < in_dim; i++) {
            if (row[i] == 1)       sum += x[i];
            else if (row[i] == -1) sum -= x[i];
        }
        out[o] = sum * scale;
    }
}

At 2048×2048 on an M3 Pro: 14,300 µs. Terrible. Two branches per weight, no vectorization, no data reuse. But it’s correct, and that’s what matters for a reference.

V1: Bitmask LUT (lossless, FP32 precision)

Store weights as two bitmasks (one for +1 positions, one for -1 positions). Extract 4-bit nibbles from each mask, use them as an index into a 256-entry lookup table that maps to a NEON float32x4_t sign vector ({+1.0, 0.0, -1.0, 0.0}, etc.), then vmlaq_f32:

// 16 branchless vmlaq per 64-bit word, fully unrolled
#define NIBBLE_FMA(nib, acc_var, offset)                           \
{                                                                   \
    uint32_t pn = (uint32_t)(p >> ((nib)*4)) & 0xF;               \
    uint32_t nn = (uint32_t)(n >> ((nib)*4)) & 0xF;               \
    acc_var = vmlaq_f32(acc_var, vld1q_f32(x + i + (offset)),      \
                       vld1q_f32(g_sign_lut[(nn << 4) | pn]));    \
}

At 2048×2048: 277 µs. 16x weight compression (2 bits per weight), no activation quantization, no precision loss. The 4 KB LUT lives permanently in L1 cache.
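For reference, here is how such a 256-entry sign LUT can be built. This is my reconstruction, not the engine's exact code; the indexing matches the `(nn << 4) | pn` scheme in the macro above, and I use a plain `float[4]` so the sketch compiles off-ARM (the real table is loaded as a `float32x4_t`):

```c
#include <stdint.h>

// LUT mapping (neg_nibble << 4) | pos_nibble -> a 4-lane sign vector.
float g_sign_lut[256][4];

void build_sign_lut(void) {
    for (int idx = 0; idx < 256; idx++) {
        uint32_t pos = idx & 0xF;        // low nibble: +1 weight positions
        uint32_t neg = (idx >> 4) & 0xF; // high nibble: -1 weight positions
        for (int bit = 0; bit < 4; bit++) {
            float s = 0.0f;
            if (pos & (1u << bit)) s = 1.0f;
            if (neg & (1u << bit)) s = -1.0f; // disjoint from pos in valid encodings
            g_sign_lut[idx][bit] = s;
        }
    }
}
```

One multiply-add per 4 weights, with the sign decode reduced to a single L1-resident load.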

V2: INT8 SDOT (BitNet-compatible precision)

The real breakthrough. Instead of FP32 activations × ternary signs, quantize activations to INT8 (matching what BitNet actually does), store weights as pre-expanded INT8 {-1, 0, +1}, and use ARM’s vdotq_s32, a single instruction that computes four 4-element INT8 dot products simultaneously. That’s 16 multiply-accumulates per instruction, versus 4 for FP32 vmlaq_f32.

The inner loop processes 4 output rows at a time using an interleaved weight layout so all memory accesses are sequential:

for (int i = 0; i + 31 < padded; i += 32, wi += 128) {
    int8x16_t x0 = vld1q_s8(xq + i);
    int8x16_t x1 = vld1q_s8(xq + i + 16);

    a0 = vdotq_s32(a0, x0, vld1q_s8(wint + wi));
    b0 = vdotq_s32(b0, x0, vld1q_s8(wint + wi + 16));
    c0 = vdotq_s32(c0, x0, vld1q_s8(wint + wi + 32));
    d0 = vdotq_s32(d0, x0, vld1q_s8(wint + wi + 48));

    a1 = vdotq_s32(a1, x1, vld1q_s8(wint + wi + 64));
    b1 = vdotq_s32(b1, x1, vld1q_s8(wint + wi + 80));
    c1 = vdotq_s32(c1, x1, vld1q_s8(wint + wi + 96));
    d1 = vdotq_s32(d1, x1, vld1q_s8(wint + wi + 112));
}

Each iteration loads 32 bytes of activations (shared across 4 rows) and 128 bytes of interleaved weights. Eight vdotq_s32 instructions process 128 INT8 multiply-accumulates. The compiler generates ldp (load-pair) instructions and interleaves them with SDOT for near-perfect pipeline utilization. I verified this by inspecting the assembly output with cc -S.

At 2048×2048: 44 µs. That’s 95 GOPS and 5.5x the hand-written NEON FP32 baseline.
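The interleaved layout is worth making concrete. The following repacking pass is my reconstruction from the `wi` indexing in the loop above, not the engine's exact code: for each group of 4 output rows and each 32-input chunk, the four rows' 16-byte sub-chunks land back to back, so the eight loads per iteration read one sequential 128-byte stream:

```c
#include <stdint.h>
#include <string.h>

// Repack row-major INT8 ternary weights into the 4-row interleaved stream
// the SDOT inner loop consumes. Assumes in_dim % 32 == 0 and out_dim % 4 == 0
// (the engine would pad otherwise).
void interleave_weights_4row(const int8_t *w, int8_t *wint,
                             int in_dim, int out_dim) {
    int8_t *dst = wint;
    for (int o = 0; o < out_dim; o += 4) {
        for (int i = 0; i < in_dim; i += 32) {
            for (int half = 0; half < 2; half++) {   // x0 chunk, then x1 chunk
                for (int r = 0; r < 4; r++) {        // rows a, b, c, d
                    memcpy(dst, w + (o + r) * in_dim + i + half * 16, 16);
                    dst += 16;
                }
            }
        }
    }
}
```

The repack is a one-time cost at load, paid back on every forward pass by perfectly sequential weight reads.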

The short version: V1 is the lossless path. It keeps FP32 activations and compresses weights aggressively, but it still pays lookup and decode overhead in the hot loop. V2 is the BitNet-style path. It quantizes activations to INT8, expands weights into an SDOT-friendly layout, and wins on throughput because it removes decode cost and lines up the representation, the instruction, and the memory layout.


The Optimization Journey: 3,693 µs to 44 µs

I ran 135 experiments in an automated loop. The kernel improved 84x from baseline. Here’s where the time went:

| Step | Technique                                     | µs    | Gain | Insight                                |
|------|-----------------------------------------------|-------|------|----------------------------------------|
| 1    | Naive SIMD (expand bitmask → int8 → NEON FMA) | 3,693 | 1.0x | Bitmask expansion dominates            |
| 3    | Sequential 4-float chunks                     | 1,069 | 3.5x | Sequential access beats gather         |
| 4    | 256-entry sign LUT                            | 960   | 1.1x | L1-resident LUT beats arithmetic       |
| 6    | 4-accumulator unroll                          | 432   | 2.2x | Pipeline saturation                    |
| 8    | Remove all per-nibble branches                | 301   | 1.4x | Zero-multiply is free; branches aren’t |
| 9    | Full unroll (16 macros per 64-bit word)       | 277   | 1.1x | Eliminate loop overhead                |
| 10   | Switch to INT8 SDOT                           | 70    | 4.0x | 16 MACs/instruction vs 4               |
| 11   | 4-row output tiling                           | 52    | 1.3x | Share activation loads                 |
| 12   | Interleaved weight layout                     | 46    | 1.1x | Sequential reads for prefetcher        |

Step 10 was the biggest single jump: switching from FP32 FMLA (4 MACs per instruction) to INT8 SDOT (16 MACs per instruction) gave a clean 4x. Steps 11-12 squeezed out another 35% through data layout changes that make every memory access sequential.

The pattern wasn’t “compute first, memory second” as two clean buckets. The pre-SDOT work still mattered a lot, but the biggest single jump came from switching representations so the hardware could stream weights sequentially and issue wider dot products with no decode overhead. The hot path was always limited more by representation and data movement than by the cost of a multiply.


Why It’s Fast: Memory, Not Compute

When I first optimized the ternary kernel to beat FP32 using the sign LUT approach, I thought the win came from eliminating multiplications. Then I added a proper NEON-optimized FP32 baseline and the “10x speedup” vanished: ternary was actually slower, at 277 µs versus FP32’s 251 µs.

The real speedup appeared when I switched to INT8 SDOT. Why? Two compounding effects:

  1. 4x less weight data. INT8 weights are 1 byte each versus FP32’s 4 bytes. At 2048×2048, that’s 4 MB versus 16 MB of weight data per full matrix traversal.
  2. 4x wider instructions. vdotq_s32 processes 16 INT8 MACs per instruction versus vmlaq_f32’s 4 FP32 FMAs.

Together these should give 16x. I measure 5.5x. The gap is activation quantization overhead (~4% of runtime), loop bookkeeping (~5%), and the fact that activations are still loaded from memory in both cases.

This is the roofline story in miniature. Once the kernel is bandwidth-bound, arithmetic width only helps until weight loads saturate the memory system. Activation loads stay roughly constant between the FP32 and ternary paths, so the 16x theoretical advantage collapses to the ~5.5x I actually see.

The scaling data confirms this is a bandwidth story:

| Matrix Size | FP32 Weights | SDOT µs | FP32 µs | Speedup |
|-------------|--------------|---------|---------|---------|
| 768×768     | 2.25 MB      | ~6      | ~30     | ~5x     |
| 2048×2048   | 16 MB        | 44      | 252     | 5.7x    |
| 4096×4096   | 64 MB        | 222     | 1,534   | 6.9x    |
| 8192×8192   | 256 MB       | 963     | 5,688   | 5.9x    |

At 4096×4096, FP32 weights (64 MB) spill out of the M3 Pro’s L2 cache while INT8 weights (16 MB) still fit. That’s where the speedup peaks at 6.9x. The scaling curve follows the memory hierarchy more closely than it follows instruction throughput.
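As a sanity check on the bandwidth story, the table's times can be turned into effective weight-streaming rates. This is my back-of-envelope arithmetic from the reported numbers, not a new measurement, and it ignores activation traffic:

```c
// Effective weight-streaming bandwidth implied by one full matrix traversal.
double weight_stream_gbps(double dim, double bytes_per_weight, double micros) {
    double bytes = dim * dim * bytes_per_weight;   // weight bytes read per matvec
    return bytes / (micros * 1e-6) / 1e9;          // GB/s
}
// weight_stream_gbps(8192, 1, 963)  ~ 70 GB/s   (INT8 SDOT path)
// weight_stream_gbps(8192, 4, 5688) ~ 47 GB/s   (FP32 NEON path)
```

Read this as: the SDOT path not only moves 4x fewer weight bytes, it also sustains a higher effective stream rate, which is how a 4x size reduction turns into a ~6x speedup at 8192×8192.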


Beating Apple’s BLAS

The strongest possible FP32 baseline isn’t my hand-written NEON. It’s Apple’s Accelerate framework. cblas_sgemv uses the AMX coprocessor and highly optimized NEON code that Apple has tuned for their own silicon.

This comparison needs one caveat. It isn’t numerically apples-to-apples. I’m comparing a kernel specialized for natively ternary weights against Apple’s best dense FP32 matvec. I think that’s still the right practical comparison, because if your model weights are genuinely ternary, FP32 BLAS is the obvious baseline you’d otherwise use. But the conclusion is narrow: this is evidence that a specialized ternary kernel can beat an optimized FP32 dense baseline for ternary workloads, not that my C kernel somehow beats Apple BLAS in general.

| Matrix Size | Accelerate BLAS µs | My SDOT µs | Ternary wins by |
|-------------|--------------------|------------|-----------------|
| 2048×2048   | 74                 | 44         | 1.7x            |
| 8192×8192   | 2,805              | 963        | 2.9x            |
A from-scratch ternary kernel in 589 lines of C beats Apple’s vendor-optimized FP32 implementation for this workload. Not because ternary compute is universally faster, but because the kernel reads 4x less data from memory. The advantage grows at scale because memory bandwidth increasingly dominates.

I think this is the most important result in the entire project. It means ternary inference isn’t just a trick for resource-constrained devices. If the weights are natively ternary, a representation-aware kernel can be genuinely faster than the best FP32 dense baseline even on high-end hardware.


The Full Transformer

An inference engine isn’t a matmul benchmark. I built the full transformer stack: RMSNorm, Rotary Position Embeddings, multi-head attention with KV cache, feed-forward network with ReLU², and argmax sampling. Every linear layer uses the SDOT kernel. Embeddings and the output projection stay FP32 (they’re a tiny fraction of parameters).

Token IDs → Embedding → [N × Transformer Block] → RMSNorm → Output → Argmax

                    ┌─────────┴──────────┐
              Attention Block         FFN Block
           ┌──────────────┐     ┌──────────────┐
           │ RMSNorm      │     │ RMSNorm      │
           │ Q,K,V (SDOT) │     │ Up (SDOT)    │
           │ RoPE         │     │ ReLU²        │
           │ MHA + KV$    │     │ Down (SDOT)  │
           │ O proj (SDOT)│     │              │
           └──────────────┘     └──────────────┘

I benchmark two synthetic configurations with random ternary weights and vocab=1000. This measures throughput, not model quality. In other words: this section demonstrates the systems side of ternary inference, not the quality-retention story from the BitNet papers.

| Model | Layers | Dim  | Ternary Params | ms/token | tok/sec |
|-------|--------|------|----------------|----------|---------|
| Small | 6      | 512  | 14.7M          | 0.24     | ~4,100  |
| Large | 12     | 2048 | 472M           | 7.95     | 126     |

126 tokens per second on 472M ternary parameters, CPU only, no GPU. That’s about 18x human reading speed. At 2 bits per weight, the packed representation is 118 MB versus 1,887 MB for FP32 — a 16x compression ratio. The SDOT kernel reads INT8-expanded weights (~472 MB for this config, still 4x smaller than FP32). The benchmark binary also keeps packed bitmasks resident for the lossless V1 kernel, so total live ternary storage is ~590 MB.
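The footprint numbers all fall out of one formula, shown here using the rounded 472M figure (decimal MB, my arithmetic):

```c
// Weight storage for a given parameter count and bit width, in decimal MB.
double footprint_mb(double params, double bits_per_weight) {
    return params * bits_per_weight / 8.0 / 1e6;
}
// footprint_mb(472e6, 2)  ~ 118 MB    (packed +1/-1 bitmask pair)
// footprint_mb(472e6, 8)  ~ 472 MB    (INT8-expanded for SDOT)
// footprint_mb(472e6, 32) ~ 1,888 MB  (FP32 baseline)
```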

Profiling the large model shows where time goes:

| Component                                       | % of forward pass |
|-------------------------------------------------|-------------------|
| Ternary matmul (QKV + O + FFN up + FFN down)    | 94%               |
| Attention (RoPE + scores + softmax + value sum) | 6%                |
| Other (RMSNorm, residuals)                      | <1%               |

94% of the forward pass is ternary matmul. Optimizing the kernel was the right call. Attention is only 6% because I’m benchmarking autoregressive generation at short sequence lengths (pos ≤ 21). For longer sequences or prefill, attention would grow, but the matmul kernel would still dominate.

The generation is deterministic: same weights, same seed, same token sequence every time:

Generating 10 tokens: [42, 538, 574, 600, 23, 347, 269, 579, 827, 371, 622]

Nonsense with random weights, but the mechanism is correct: forward → argmax → feed token back as next input.
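That loop is only a few lines. Here's a sketch with a placeholder forward function; `forward_fn` stands in for the engine's real transformer forward pass (which takes the model and KV cache), so the names are mine, not the engine's API:

```c
// Placeholder for the engine's forward pass: fills logits[vocab] for `token`
// at position `pos`.
typedef void (*forward_fn)(int token, int pos, float *logits, int vocab);

static int argmax(const float *logits, int vocab) {
    int best = 0;
    for (int i = 1; i < vocab; i++)
        if (logits[i] > logits[best]) best = i;
    return best;
}

// Deterministic greedy decode: forward -> argmax -> feed token back in.
void generate_greedy(forward_fn fwd, int vocab, int start_token,
                     int *out, int n_tokens, float *logits_buf) {
    int tok = start_token;
    for (int t = 0; t < n_tokens; t++) {
        fwd(tok, t, logits_buf, vocab);   // one forward pass per position
        tok = argmax(logits_buf, vocab);  // no sampling, no temperature
        out[t] = tok;
    }
}
```

With fixed weights, the argmax makes the whole sequence a pure function of the start token, which is what makes the run reproducible.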


What Didn’t Work

I tried 25+ dead ends across 135 experiments. A few of the more interesting failures:

ARM I8MM (vmmlaq_s32). The M3 Pro has this instruction. It computes a 2×8 · 8×2 → 2×2 INT8 matrix multiply, doing 32 MACs per instruction versus SDOT’s 16. In theory, 2x faster. In practice, 4x slower: 184.6 µs/row versus 44.5 µs/row.

The problem is data layout. vmmlaq expects its operands as 2×8 tiles, but our weights are stored as separate rows. Each call needs 4 vld1_s8 loads + 2 vcombine_s8 just to construct the tiles: 6 instructions of overhead per 1 instruction of compute. SDOT with 4-row interleaving needs only 1 load per SDOT.

TBL-based 2-bit expansion. I tried using NEON’s vqtbl1q_s8 to expand 2-bit packed weights to INT8 on the fly, keeping the 16x compression while using SDOT. Five instructions per 16 weights for expansion versus one load for pre-expanded INT8. Always slower. The expansion overhead dominates the bandwidth saving.

FP16 pre-expanded weights. Store ternary weights as FP16 {-1.0, 0.0, +1.0} and use vfmaq_f16 for 8 FMAs per instruction. Sounds like 2x over FP32. But FP16 weights are 8x larger than bitmasks. The bandwidth explosion kills any compute gain. 8.7x regression.

Arithmetic sign vectors. Instead of loading from the 256-entry LUT, construct the sign vector arithmetically from nibble bits using vdup, vand, vcgt, vbsl. Eight instructions versus one LUT load. 2.8x slower. The LUT is 4 KB and sits in L1 permanently.

Profile-guided optimization, -O3, __attribute__((hot)), restrict, cache-line alignment, explicit prefetch hints. None of these moved the needle. The hot path is all NEON intrinsics. The compiler already generates near-optimal ldp + sdot scheduling. I confirmed by reading the assembly.


The Roofline

Is 44 µs actually good? I computed a roofline model to check.

M3 Pro P-core peak (estimated):
  2 SDOT/cycle × 16 MACs/SDOT × 4 GHz = 128 INT8 GOPS

My kernel: 95 GOPS = 74.5% of peak

74.5% of theoretical maximum. The remaining 25.5% breaks down roughly as: memory latency for weight loads (~15%), loop bookkeeping and accumulator reduction (~5%), activation quantization (~4%), function call overhead (~1%).

The baseline kernel (experiment #1) was at 4.5% of peak. 135 experiments brought it to 74.5%. There’s no further improvement possible without hand-written assembly or fundamentally different hardware, and the assembly output already shows the compiler doing what I’d write by hand.


What I Learned

Much of FP32 precision appears unnecessary in large transformer linear layers trained under that constraint. I don’t think floating point is universally pointless. Training needs dynamic range and tiny updates. My own engine keeps FP32 for embeddings, output projection, normalization, and softmax-adjacent numerics. But for large transformer linear layers trained around the constraint, sign plus rough magnitude seems to be enough for useful inference. BitNet showed that scaling laws still hold at 1.58 bits. The models just need more parameters to compensate, and above ~3B the gap vanishes.

The speedup is bandwidth, not arithmetic. I spent weeks assuming the win came from replacing multiplies with adds. It doesn’t. At small sizes, FP32 NEON is just as fast because both kernels are compute-bound. The ternary advantage appears when weight data exceeds cache. 4x smaller weights mean 4x less bandwidth, and bandwidth is what matters. This is the same lesson as every matrix algebra optimization since the 1990s.

Data layout > instruction selection. SDOT with 4-row tiling and interleaved weights beats I8MM’s theoretically wider vmmlaq by 4x. The right data layout for a simpler instruction beats the wrong layout for a fancier one. This generalizes well beyond ternary inference: quantized kernels only pay off when layouts match dot-product instructions, columnar databases win when their layout matches the access pattern, and GPU tensor cores only hit headline throughput when data is tiled the way the hardware expects.

Build to understand. Like my HNSW from scratch and LSM storage engine projects, building from zero forces you to justify every piece. I didn’t start knowing that interleaved weight layout would matter or that the LUT approach is optimal for the lossless path. I arrived at those conclusions by measuring 135 experiments and discarding 25+ dead ends. The answer is obvious in retrospect; the path to finding it was not.

The entire engine (three matmul kernels, batch GEMM, full transformer forward pass, KV cache, RoPE, argmax sampling, benchmark harness, Accelerate BLAS comparison, and profiling) is 1,595 lines of C. No third-party dependencies. I think that’s the strongest argument for ternary inference: the algorithm is simple enough to fit in your head, and the hardware does the rest.