ATOMiK started as a set of equations on a whiteboard. XOR-based delta-state algebra: four operations, provably correct, with some interesting properties. But equations on a whiteboard don't ship products. This is the story of how those equations became silicon — three generations of custom hardware, from a first blink test to a custom 64-bit RISC-V CPU with native ATOMiK instructions and HD video output.
Chapter 1: The Math
The core insight is deceptively simple. Instead of storing state and copying it around, you store a reference point and accumulate XOR deltas:
current_state = initial_state XOR accumulator
XOR gives you an Abelian group for free: commutative, associative, self-inverse, with identity element zero. We proved all of this formally with 92 Lean4 theorems. Not "we tested it"; we proved it. The algebra is correct by construction.
The practical consequence: deltas can arrive in any order, from any number of producers, and the result is identical. No locks. No consensus protocol. No ordering constraints. The accumulator is a shared resource by design.
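To make the order-independence claim concrete, here is a minimal C sketch of the accumulate-and-reconstruct idea. The function names `accumulate` and `reconstruct` and the 64-bit width are illustrative assumptions, not the ATOMiK SDK API.

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a list of deltas into one accumulator with XOR.
 * Hypothetical helper for illustration only. */
static inline uint64_t accumulate(const uint64_t *deltas, size_t n)
{
    uint64_t acc = 0;                 /* zero is the XOR identity */
    for (size_t i = 0; i < n; i++)
        acc ^= deltas[i];             /* commutative and associative */
    return acc;
}

/* current_state = initial_state XOR accumulator */
static inline uint64_t reconstruct(uint64_t initial_state, uint64_t acc)
{
    return initial_state ^ acc;
}
```

Feeding the same deltas in a different order yields the same accumulator, and applying a delta twice cancels it, which is the self-inverse property doing the work of an undo.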
Chapter 2: First Silicon — PicoRV32 + ATOMiK (v1)
The first hardware target was the Tang Nano 9K — a $13.50 FPGA board with a Gowin GW1NR-9K chip, 8,640 LUTs, and 26 block RAMs. We paired ATOMiK with a PicoRV32 RISC-V soft core.
The ATOMiK core lives on the memory bus as an MMIO peripheral. The CPU writes to specific addresses to trigger LOAD, ACCUM, READ, and SWAP operations. A toggle-handshake CDC bridge crosses between the CPU clock domain (25.2 MHz) and the ATOMiK domain (81 MHz).
// ATOMiK v1: MMIO-mapped operations
#include <stdint.h>
#define ATOMIK_BASE 0x20000000
#define ATOMIK_LOAD (ATOMIK_BASE + 0x00)
#define ATOMIK_ACCUM (ATOMIK_BASE + 0x04)
#define ATOMIK_READ (ATOMIK_BASE + 0x08)
#define ATOMIK_SWAP (ATOMIK_BASE + 0x0C)
*(volatile uint32_t*)ATOMIK_LOAD = 0xDEADBEEF;
*(volatile uint32_t*)ATOMIK_ACCUM = 0x000000FF;
uint32_t state = *(volatile uint32_t*)ATOMIK_READ;
// state == 0xDEADBE10
Result: single-bank ATOMiK at 81 MHz, 94.5 million operations per second. The entire SoC fits in 44% of the GW1NR-9K. 11/11 hardware tests pass. The core has +23% Fmax margin; it could run faster, but we're limited by the PicoRV32's bus timing.
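For readers without the board, the four MMIO operations above can be approximated by a small software model. This is a sketch under stated assumptions, not the RTL: the type `atomik_model_t` and the `model_*` functions are hypothetical names, LOAD is assumed to clear the accumulator, and SWAP is assumed to return the current state, make it the new reference, and clear the accumulator.

```c
#include <stdint.h>

/* Hypothetical software model of one 32-bit ATOMiK bank. */
typedef struct {
    uint32_t reference;    /* reference state, set by LOAD */
    uint32_t accumulator;  /* XOR of all deltas since the last LOAD or SWAP */
} atomik_model_t;

static inline void model_load(atomik_model_t *m, uint32_t value)
{
    m->reference = value;
    m->accumulator = 0;    /* assumption: LOAD starts a fresh delta history */
}

static inline void model_accum(atomik_model_t *m, uint32_t delta)
{
    m->accumulator ^= delta;
}

static inline uint32_t model_read(const atomik_model_t *m)
{
    return m->reference ^ m->accumulator;
}

static inline uint32_t model_swap(atomik_model_t *m)
{
    uint32_t state = model_read(m);
    m->reference = state;  /* assumption: the read value becomes the new reference */
    m->accumulator = 0;    /* assumption: "reset" means clearing the accumulator */
    return state;
}
```

Running the same sequence as the MMIO listing (LOAD 0xDEADBEEF, ACCUM 0x000000FF, READ) against this model gives 0xDEADBE10, since 0xEF XOR 0xFF is 0x10.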
Chapter 3: Custom CPU — RV64I + ATOMiK ISA (v2/v3)
MMIO works, but it costs bus cycles. Every ATOMiK operation requires a store instruction, a bus transaction, and a load to read the result. What if ATOMiK operations were native CPU instructions?
We built a custom 64-bit RISC-V CPU from scratch. Not a fork — a ground-up implementation with a pipelined FSM (FETCH, DECODE, EXECUTE, WRITEBACK), SPI XIP flash boot, UART, and native ATOMiK custom instructions using the RISC-V custom-0 opcode space:
// ATOMiK v3 — Native ISA extensions (custom-0 opcode 0x0B)
// funct3 encoding:
// 000 = LOAD (set reference state)
// 001 = ACCUM (XOR delta into accumulator)
// 010 = READ (reconstruct current state)
// 011 = SWAP (atomic read-and-reset)
// In assembly:
.insn r 0x0b, 0, 0, x0, a0, x0 # LOAD a0
.insn r 0x0b, 1, 0, x0, a1, x0 # ACCUM a1
.insn r 0x0b, 2, 0, a2, x0, x0 # READ -> a2
.insn r 0x0b, 3, 0, a3, x0, x0 # SWAP -> a3
ATOMiK operations now execute in a single EXECUTE stage cycle, the same cost as an ADD or XOR instruction. No bus overhead. No MMIO latency. Zero extra cycles.
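To see which 32-bit words the assembler emits for those `.insn r` lines, here is a small C sketch of the standard RISC-V R-type encoding applied to the custom-0 opcode and the funct3 values listed above. The helper `encode_rtype` and the enum names are illustrative and not part of any ATOMiK toolchain.

```c
#include <stdint.h>

enum { ATOMIK_OPCODE = 0x0B };  /* custom-0 major opcode */
enum { F3_LOAD = 0, F3_ACCUM = 1, F3_READ = 2, F3_SWAP = 3 };

/* Pack the R-type fields:
 * funct7[31:25] rs2[24:20] rs1[19:15] funct3[14:12] rd[11:7] opcode[6:0] */
static inline uint32_t encode_rtype(uint32_t opcode, uint32_t funct3,
                                    uint32_t funct7, uint32_t rd,
                                    uint32_t rs1, uint32_t rs2)
{
    return ((funct7 & 0x7Fu) << 25) |
           ((rs2    & 0x1Fu) << 20) |
           ((rs1    & 0x1Fu) << 15) |
           ((funct3 & 0x07u) << 12) |
           ((rd     & 0x1Fu) << 7)  |
           (opcode  & 0x7Fu);
}
```

In the standard ABI, a0 through a3 are x10 through x13, so `encode_rtype(ATOMIK_OPCODE, F3_LOAD, 0, 0, 10, 0)` gives 0x0005000B for `LOAD a0`, and `encode_rtype(ATOMIK_OPCODE, F3_SWAP, 0, 13, 0, 0)` gives 0x0000368B for `SWAP -> a3`.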
Chapter 4: HD Video — 1280x720@60Hz (v3.1)
To demonstrate delta-driven display, we added an HDMI output pipeline. On a $13.50 FPGA. At 1280x720@60Hz.
This required 6 pixel pipeline optimizations to hit the 74.25 MHz pixel clock: 3-stage TMDS encoding, pre-registered RNG and cursor flags in svo_tcard, parallel prefix gray-to-binary conversion, split font pipeline, pre-registered BRAM ports, and a register buffer between encoder and TMDS serializer.
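Of the optimizations listed, parallel-prefix Gray-to-binary conversion is the easiest to show in a few lines. The sketch below is the textbook technique in C rather than the actual RTL, and the 16-bit width is an assumption, since the post does not state how wide the converted value is. The serial form chains one XOR per bit; the prefix form needs only log2(width) XOR levels, which is what shortens the critical path in hardware.

```c
#include <stdint.h>

/* Binary to Gray: each output bit is the XOR of two adjacent input bits. */
static inline uint16_t binary_to_gray(uint16_t b)
{
    return (uint16_t)(b ^ (b >> 1));
}

/* Serial Gray to binary: bit i depends on bit i+1, a 15-XOR chain. */
static inline uint16_t gray_to_binary_serial(uint16_t g)
{
    uint16_t b = g;
    for (uint16_t shifted = (uint16_t)(g >> 1); shifted != 0; shifted >>= 1)
        b ^= shifted;
    return b;
}

/* Parallel-prefix Gray to binary: four XOR levels for 16 bits. */
static inline uint16_t gray_to_binary_prefix(uint16_t g)
{
    uint16_t b = g;
    b = (uint16_t)(b ^ (b >> 8));
    b = (uint16_t)(b ^ (b >> 4));
    b = (uint16_t)(b ^ (b >> 2));
    b = (uint16_t)(b ^ (b >> 1));
    return b;
}
```

Both decoders return the same value for every 16-bit input; only the logic depth differs.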
The delta display module sits in the video pipeline between the overlay and the encoder. It maintains a per-scanline buffer and applies LUT-mapped delta colors in real time — the display literally shows state changes as they happen, driven by the ATOMiK accumulator.
Final v3.1.0 resource usage on the GW1NR-9K: 6,287 LUT (73%), 3,783 CLS (88%), 20/26 BSRAM (77%). Pixel Fmax: 74.384 MHz (+0.18% margin). We hit the practical optimization ceiling — CLS at 88% is the binding constraint.
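The timing numbers above follow from simple arithmetic, sketched here in C. The 1650 x 750 total frame size (active pixels plus blanking) is the standard 720p60 timing and is not stated in the post; the function names are illustrative.

```c
#include <stdint.h>

/* Pixel clock = total pixels per line * total lines per frame * frames per second. */
static inline uint64_t pixel_clock_hz(uint32_t h_total, uint32_t v_total,
                                      uint32_t refresh_hz)
{
    return (uint64_t)h_total * v_total * refresh_hz;
}

/* Timing margin as a percentage of the required clock. */
static inline double margin_percent(double fmax_mhz, double target_mhz)
{
    return (fmax_mhz - target_mhz) / target_mhz * 100.0;
}
```

`pixel_clock_hz(1650, 750, 60)` gives 74,250,000 Hz, the 74.25 MHz target, and `margin_percent(74.384, 74.25)` comes out at roughly 0.18, which matches the reported margin.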
Chapter 5: Scaling Up — Zynq Characterization
The Tang Nano 9K proved the architecture. But 8,640 LUTs limit you to a single ATOMiK bank. What happens when you have 53,200 LUTs?
We characterized ATOMiK on the Xilinx Zynq XC7Z020, sweeping from 1 to 512 parallel banks across 4 synthesis strategies (baseline, area, aggressive, maximum):
| Banks | Fmax (MHz) | LUT | LUT % | Gops/s |
|---|---|---|---|---|
| N=1 | 444.4 | 302 | 0.6% | 0.4 |
| N=4 | 347.8 | 543 | 1.0% | 1.4 |
| N=16 | 266.7 | 941 | 1.8% | 4.4 |
| N=64 | 205.1 | 3,498 | 6.6% | 13.4 |
| N=256 | 148.1 | 15,197 | 28.6% | 38.1 |
| N=512 | 135.6 | 23,542 | 44.3% | 69.7 |
Sub-linear LUT scaling: 512 banks cost only 44.3% of the fabric. Each additional bank adds ~34 LUTs beyond the first, because the shared infrastructure (BRAM, merge tree, CDC bridge) amortizes across all banks.
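Why parallel banks can share one merge tree follows directly from the algebra: each bank accumulates its own deltas, and XOR-ing the bank accumulators together gives the same result as one serial accumulator. The C sketch below assumes the merge tree is a pairwise XOR reduction; the post does not describe its structure, and `merge_banks` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* XOR-reduce n bank accumulators in place, pairing banks at doubling
 * strides. Uses ceil(log2(n)) levels and works for any n, not only
 * powers of two. Returns the merged accumulator (0 for n == 0). */
static inline uint64_t merge_banks(uint64_t *banks, size_t n)
{
    if (n == 0)
        return 0;
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            banks[i] ^= banks[i + stride];
    return banks[0];
}
```

Distributing deltas round-robin with `banks[i % n] ^= delta` and then calling `merge_banks` produces the same value as XOR-ing every delta into a single accumulator, regardless of which bank each delta landed in.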
69.7 billion operations per second. On a $99 development board. With 56% of the fabric still available for your application logic.
What we learned
- Start with the math. The 92 Lean4 proofs caught edge cases that testing never would have found. When your algebra is provably correct, debugging hardware becomes purely a plumbing exercise.
- $13.50 is enough. You don't need a $10,000 FPGA board to validate a novel architecture. The Tang Nano 9K proved everything we needed — the Zynq just showed it scales.
- Custom instructions matter. Going from MMIO (v1) to native ISA (v3) eliminated all bus overhead. ATOMiK ops execute at the same cost as ALU operations.
- The ceiling is the fabric, not the design. At N=512, we're using 44% of the XC7Z020. N=1024 needs an XC7Z045 (218K LUT) or UltraScale+. The ATOMiK core itself has no inherent scaling limit.
What's next
The ALINX AX7020 Zynq board is on the bench. Next step: PS+PL block design, Linux driver integration, and live benchmarks with the kernel module talking to hardware-accelerated ATOMiK contexts. After that: ASIC evaluation on Sky130.
The math works. The software works. The hardware works. Now we scale.