The AI matrix processor built for matrix multiplication and nothing else. 3,000+ TOPS. Every transistor devoted to the one operation that matters.
Every design decision — the ISA, the memory hierarchy, the control logic — was made by asking: does this help matrix multiplication? If not, it doesn't exist on APEX-1.
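As a mental model of that matmul-first design, the dataflow of a MAC array can be sketched in a few lines of Python. This is an output-stationary systolic array (weights streaming down, activations streaming right); the array size, register names, and skew details are illustrative, not APEX-1's RTL.

```python
def systolic_matmul(A, B):
    """Toy cycle-level model of an output-stationary systolic array:
    weights enter at the top and step downward, activations enter at
    the left and step rightward, and every cell performs exactly one
    multiply-accumulate per clock. A and B are n x n lists of lists."""
    n = len(A)
    acc  = [[0.0] * n for _ in range(n)]  # partial sums never move
    a_rg = [[0.0] * n for _ in range(n)]  # activation registers (move right)
    b_rg = [[0.0] * n for _ in range(n)]  # weight registers (move down)
    for t in range(3 * n - 2):            # cycles to fill, stream, and drain
        # clock edge: shift activations right and weights down by one cell
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_rg[i][j] = a_rg[i][j - 1]
                b_rg[j][i] = b_rg[j - 1][i]
        # inject skewed operands at the edges (row/column i delayed i cycles)
        for i in range(n):
            k = t - i
            a_rg[i][0] = A[i][k] if 0 <= k < n else 0.0
            b_rg[0][i] = B[k][i] if 0 <= k < n else 0.0
        # every cell MACs its current operand pair
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_rg[i][j] * b_rg[i][j]
    return acc
```

Skewing operand entry by one cycle per row and column is what lets every cell do useful work on every clock once the array fills.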
64 × 128×128 MAC arrays across 4 dies. Weights stream downward, activations rightward. Zero broadcast overhead. Each cell performs one multiply-accumulate per clock, every clock.
1,048,576 MACs

256 MB on-chip scratchpad with no cache hierarchy, no TLB, no eviction logic. The compiler controls every byte movement. Zero cache misses — not by luck, by construction.
20 TB/s internal bandwidth

A flat 18-instruction ISA with no branches, no out-of-order execution, no speculative prefetch. Every operation is scheduled at compile time. Latency is deterministic to the cycle.
18 instructions total

Structured 2:4 sparsity decompressed on-the-fly from a compressed weight buffer plus a 2-bit index mask. Zero-weight multiplications are skipped at the hardware level, doubling effective throughput.
2× sparse throughput

Dual-path convolution: im2col for arbitrary kernels, Winograd F(2,3) for 3×3 stride-1 convolutions. The compiler selects the optimal path. Winograd reduces multiply count by 2.25× with mathematically identical output.
Winograd + im2col

Four dies connected via UCIe 2.0 at 20 TB/s on a TSMC CoWoS-L silicon interposer. The compiler sees a single logical accelerator. Ring all-reduce for gradient sync runs in dedicated hardware.
4-die CoWoS package

No general-purpose overhead. Every watt, every mm², every transistor allocated to AI compute.
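The 2:4 structured-sparsity decompression described above can be sketched as a software encoder/decoder. The layout below (two kept values plus two 2-bit position indices per group of four weights) matches the description, but the exact buffer format is an assumption.

```python
def compress_2of4(row):
    """Encode a 2:4-sparse weight row as (values, indices): each group of
    four weights keeps its (at most two) nonzeros plus their 2-bit
    positions. Layout is illustrative, not the shipped buffer format."""
    values, indices = [], []
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        nz = [(i, w) for i, w in enumerate(group) if w != 0.0]
        assert len(nz) <= 2, "group violates 2:4 structured sparsity"
        nz = [(0, 0.0)] * (2 - len(nz)) + nz   # pad first so real values win
        for i, w in nz:
            values.append(w)                   # compressed weight buffer
            indices.append(i)                  # 2-bit index mask (0..3)
    return values, indices

def decompress_2of4(values, indices):
    """Expand (values, 2-bit indices) back to a dense row — the
    on-the-fly step performed ahead of the MAC array."""
    dense = [0.0] * (len(values) * 2)          # 2 kept out of every 4
    for g in range(len(values) // 2):
        for s in range(2):
            dense[4 * g + indices[2 * g + s]] = values[2 * g + s]
    return dense
```

Only the stored values ever reach the multipliers, so a fully 2:4-sparse layer does half the multiplies of its dense equivalent — the source of the 2× sparse-throughput figure.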
The ISA is the primary security mechanism. There is no way to express a branch, a cache miss, a page fault, or a vertex shader in these 18 opcodes. The chip is architecturally incapable of general-purpose compute.
No hardware caches. No TLBs. No eviction logic. The compiler schedules every byte movement at compile time. Memory latency is fully deterministic — the scheduler has already accounted for it.
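A toy version of that compile-time scheduling shows why latency is deterministic: with fixed transfer and compute latencies (the numbers below are invented, not APEX-1's) and no caches to miss, every start and finish cycle is computable before the program runs.

```python
# Invented latencies for illustration; real numbers would come from the spec.
LOAD_CYCLES    = 128   # DMA one tile from HBM into the scratchpad
COMPUTE_CYCLES = 256   # stream one tile through the MAC array

def static_schedule(n_tiles):
    """Compile-time schedule with double buffering: tile i's scratchpad
    load overlaps tile i-1's compute. With no caches, no branches, and
    fixed latencies, every cycle below is exact, not an estimate."""
    events = []
    load_done = 0
    compute_done = 0
    for i in range(n_tiles):
        load_start = load_done                        # loads issue back-to-back
        load_done = load_start + LOAD_CYCLES
        compute_start = max(load_done, compute_done)  # wait for data and the array
        compute_done = compute_start + COMPUTE_CYCLES
        events.append((i, load_start, compute_start, compute_done))
    return events
```

Once compute dominates the pipeline, the final cycle for n tiles is simply LOAD_CYCLES + n × COMPUTE_CYCLES — a closed-form answer a cache-based design cannot give.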
APEX-1 supports all four workloads from day one. FP32 gradient accumulation, ring all-reduce, and multi-card scale-out are first-class hardware features, not software workarounds.
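The case for FP32 gradient accumulation is easy to reproduce in software. The sketch below rounds a running sum to bfloat16 after every add, as a stand-in for half-precision accumulation hardware, and compares it against a wide accumulator (Python's float, standing in for the FP32 path).

```python
import struct

def to_bf16(x):
    """Round x to bfloat16 precision by keeping the top 16 bits of its
    float32 encoding, with round-to-nearest-even on the dropped bits."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack('<f', struct.pack('<I', bits))[0]

def accumulate(grads, half_precision):
    """Sum gradients, optionally rounding the running total to bfloat16
    after every add — the failure mode a wide accumulator avoids."""
    acc = 0.0
    for g in grads:
        acc += g
        if half_precision:
            acc = to_bf16(acc)
    return acc
```

With one large gradient followed by a thousand small ones, the bfloat16 accumulator never moves off 1.0 — each 1e-3 contribution is smaller than half the bfloat16 spacing near 1.0 (2⁻⁷ ≈ 0.0078) and rounds away — while the wide accumulator reaches ≈ 2.0.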
256-entry FP32 accumulator register file. BF16 gradients are automatically upcast, preventing the rounding loss that silently absorbs small gradient values when they are accumulated at half precision.
FP32 precision

Gradient synchronization across all dies and cards runs in dedicated hardware via the ALL_REDUCE instruction. Ring and tree topologies are supported. The compiler inserts barriers; no runtime negotiation.
Hardware collective

Forward pass in FP8 E4M3 for maximum throughput. Gradient accumulation in FP32. Weight updates in BF16. The compiler manages format conversions at tile boundaries with zero programmer intervention.
FP8 / BF16 / FP32

96 GB HBM supports 8B-parameter models with Adam optimizer state (weights + gradients + first/second moments = 12 bytes/param). FP8 weights extend this to 96B parameters for inference.
8B params w/ Adam

Winograd convolution in the forward pass, im2col in the backward pass where kernel symmetry breaks. Depthwise and grouped convolutions are supported natively via the CONV2D groups parameter.
Winograd forward

Three power operating points, selectable per run: Performance (400 W / 1 GHz), Balanced (260 W / 800 MHz), Efficiency (120 W / 500 MHz). The BMC monitors 8 junction-temperature sensors and throttles automatically.
3 DVFS modes

APEX-1 is available for evaluation partnerships and early access programs. Architecture documentation and RTL available under NDA.
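As a closing illustration, the ring all-reduce that the hardware collective performs can be simulated in a few lines. This follows the textbook reduce-scatter + all-gather formulation (2·(p−1) steps for p nodes); the one-element-per-chunk layout below is a toy, not the hardware's actual schedule.

```python
import copy

def ring_all_reduce(node_bufs):
    """Simulate ring all-reduce: a reduce-scatter phase then an
    all-gather phase, 2*(p-1) neighbor-to-neighbor steps in total.
    node_bufs[i][c] is node i's copy of chunk c (one float per chunk)."""
    p = len(node_bufs)
    assert all(len(b) == p for b in node_bufs), "toy version: p chunks per node"
    bufs = copy.deepcopy(node_bufs)
    for step in range(p - 1):                    # reduce-scatter
        nxt = copy.deepcopy(bufs)
        for i in range(p):
            c = (i - step) % p                   # chunk node i forwards now
            nxt[(i + 1) % p][c] += bufs[i][c]    # neighbor accumulates it
        bufs = nxt                               # synchronous ring step
    # node i now owns the fully reduced chunk (i + 1) % p
    for step in range(p - 1):                    # all-gather
        nxt = copy.deepcopy(bufs)
        for i in range(p):
            c = (i + 1 - step) % p               # reduced chunk to circulate
            nxt[(i + 1) % p][c] = bufs[i][c]     # neighbor overwrites its copy
        bufs = nxt
    return bufs
```

Each node sends only 1/p of the buffer per step, which is why the ring topology keeps every link busy and scales to multi-card gradient sync.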