TSMC N3E · 4-Die CoWoS Package · Rev 1.0

APEX-1

The AI processor purpose-built for matrix multiplication and nothing else. 3,000+ TOPS. Every transistor devoted to the one operation that matters.

3,145T
INT8 OPS/sec
6,291T
Sparse TOPS
6TB/s
HBM Bandwidth
96GB
HBM3e Memory

Built for one thing.
Exceptional at it.

Every design decision — the ISA, the memory hierarchy, the control logic — was made by asking: does this help matrix multiplication? If not, it doesn't exist on APEX-1.

Systolic Array Fabric

64 × 128×128 MAC arrays across 4 dies. Weights stream downward, activations rightward. Zero broadcast overhead. Each cell performs one multiply-accumulate per clock, every clock.

1,048,576 MACs
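The dataflow above can be sketched functionally in a few lines. This is a plain-Python model of a weight-stationary tile, not a cycle-accurate simulation, and the array is shrunk far below 128×128 for readability:

```python
# Functional model of one weight-stationary systolic tile (hypothetical tiny
# sizes; the real arrays are 128x128). Weights stay resident in the cells;
# activation rows stream through, and each (i, j) cell contributes one
# multiply-accumulate per visit.
def systolic_matmul(A, W):
    """C = A @ W with the per-cell multiply-accumulates made explicit."""
    n, k = len(A), len(A[0])
    m = len(W[0])
    C = [[0] * m for _ in range(n)]
    for row in range(n):            # activation rows stream rightward
        for i in range(k):          # weight rows were loaded downward
            for j in range(m):      # one MAC cell per (i, j) position
                C[row][j] += A[row][i] * W[i][j]
    return C

A = [[1, 2], [3, 4]]
W = [[5, 6], [7, 8]]
print(systolic_matmul(A, W))  # [[19, 22], [43, 50]]
```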

Software-Managed SRAM

256 MB on-chip scratchpad with no cache hierarchy, no TLB, no eviction logic. The compiler controls every byte movement. Zero cache misses — not by luck, by construction.

20 TB/s internal BW
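One way a compiler could statically double-buffer tile movement so DMA overlaps compute. The instruction mnemonics come from the APEX-1 ISA; the schedule tuple format and buffer names are invented for illustration:

```python
# Hedged sketch: static double-buffering of scratchpad tiles. While the array
# computes on one buffer, the next tile is prefetched into the other. Every
# step is emitted at compile time; nothing is decided at runtime.
def schedule_tiles(num_tiles):
    prog = [("LOAD_TILE", 0, "buf0")]                  # prologue: fill buffer 0
    for t in range(num_tiles):
        cur, nxt = f"buf{t % 2}", f"buf{(t + 1) % 2}"
        if t + 1 < num_tiles:
            prog.append(("DMA_PREFETCH", t + 1, nxt))  # async load of next tile
        prog.append(("MATMUL", t, cur))                # compute on current tile
        prog.append(("SYNC_BARRIER",))                 # DMA complete before swap
    return prog

for op in schedule_tiles(3):
    print(op)
```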

Static Sequencer

A flat 18-instruction ISA with no branches, no out-of-order execution, no speculative prefetch. Every operation is scheduled at compile time. Latency is deterministic to the cycle.

18 instructions total
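With fixed-latency instructions and no branches, total runtime is just a compile-time sum. A sketch with placeholder cycle costs (the real per-instruction figures are not published here):

```python
# Deterministic latency: no branches, no speculation, so cycle count is a sum.
# The latencies below are invented placeholders, not APEX-1 specifications.
LATENCY = {"LOAD_TILE": 120, "MATMUL": 128, "STORE_TILE": 120, "SYNC_BARRIER": 1}

def cycle_count(program):
    """Exact runtime of a straight-line program, in cycles."""
    return sum(LATENCY[op] for op in program)

prog = ["LOAD_TILE", "MATMUL", "MATMUL", "STORE_TILE"]
print(cycle_count(prog))  # 496
```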

2:4 Sparsity Engine

Structured 2:4 sparsity decompressed on-the-fly from a compressed weight buffer plus a 2-bit index mask. Zero-weight multiplications are skipped at the hardware level, doubling effective throughput.

2× sparse throughput
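The compressed format described above can be sketched as follows. Per group of four weights, two values plus their 2-bit positions are stored; keeping the two largest by magnitude is an assumed pruning heuristic, not a documented APEX-1 behavior:

```python
# 2:4 structured sparsity sketch: compress keeps 2 of every 4 weights plus
# their 2-bit in-group positions; expand reconstructs the dense layout the
# way the hardware does on the fly.
def compress_2of4(w):
    vals, idx = [], []
    for g in range(0, len(w), 4):
        group = w[g:g + 4]
        keep = sorted(range(4), key=lambda i: -abs(group[i]))[:2]
        for i in sorted(keep):
            vals.append(group[i])
            idx.append(i)              # 2-bit position within the group
    return vals, idx

def expand_2of4(vals, idx):
    out = []
    for g in range(0, len(vals), 2):
        group = [0] * 4
        group[idx[g]] = vals[g]
        group[idx[g + 1]] = vals[g + 1]
        out.extend(group)
    return out

w = [0.9, 0.0, -0.4, 0.1, 0.0, 0.7, 0.2, 0.0]
vals, idx = compress_2of4(w)
print(vals, idx)                       # half the values, 2 bits of index each
print(expand_2of4(vals, idx))
```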

Convolution Engine

Dual-path: im2col for arbitrary kernels, Winograd F(2,3) for 3×3 stride-1 convolutions. The compiler selects the optimal path. Winograd reduces multiply count by 2.25× with numerically equivalent output to within floating-point rounding.

Winograd + im2col
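The Winograd transform is easy to check against direct convolution. In 1-D, F(2,3) produces two outputs of a 3-tap filter in 4 multiplies instead of 6; nesting it in 2-D gives F(2×2, 3×3) at 16 multiplies instead of 36, the 2.25× cut cited above:

```python
# Winograd F(2,3): two outputs of a 3-tap convolution in 4 multiplies.
def winograd_f23(d, g):
    d0, d1, d2, d3 = d                       # 4 input samples
    g0, g1, g2 = g                           # 3 filter taps
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]      # [y0, y1]

def direct_conv(d, g):                       # reference path: 6 multiplies
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

d, g = [1.0, 2.0, 3.0, 4.0], [0.5, 1.0, -0.5]
print(winograd_f23(d, g), direct_conv(d, g))
```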

UCIe Scale-Out

Four dies connected via UCIe 2.0 at 20 TB/s on a TSMC CoWoS-L silicon interposer. The compiler sees a single logical accelerator. Ring all-reduce for gradient sync runs in dedicated hardware.

4-die CoWoS package

Numbers that matter.

No general-purpose overhead. Every watt, every mm², every transistor allocated to AI compute.

Compute
Peak throughput (INT8) · 3,145 TOPS
Peak throughput (2:4 sparse) · 6,291 TOPS
Peak throughput (BF16) · 786 TOPS
Peak throughput (FP8) · 3,145 TOPS
Systolic arrays · 64 total (4 dies)
MAC cells · 1,048,576
Clock frequency · 1.0 GHz
Memory
HBM3e capacity · 96 GB
HBM bandwidth · 6 TB/s
On-chip SRAM · 256 MB
Internal SRAM BW · 20 TB/s
DMA channels · 8
Package & Power
Process node · TSMC N3E
Package · CoWoS-L 4-die
Die area (each) · ~300 mm²
TDP · ~400 W
Host interface · PCIe 5.0 x16
Die-to-die · UCIe 2.0 · 20 TB/s
Performance vs. alternatives
APEX-1 INT8 (dense) · 3,145 TOPS
APEX-1 (2:4 sparse) · 6,291 TOPS
GPU competitor A · ~1,979 TOPS
GPU competitor B · ~1,457 TOPS
Previous-gen TPU · ~918 TOPS

TOPS per watt
APEX-1 · 7.86 TOPS/W
GPU competitor A · ~2.51 TOPS/W
GPU competitor B · ~1.86 TOPS/W

Max model size (BF16 inference)
APEX-1 (96 GB HBM) · 48B params
GPU competitor (80 GB) · 40B params
GPU competitor (48 GB) · 24B params
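The capacity figures above follow directly from BF16 weights occupying 2 bytes per parameter:

```python
# Worked version of the capacity comparison: max BF16 model size is simply
# HBM capacity divided by 2 bytes per parameter (weights only, no KV cache
# or activation overhead counted here).
def max_params_bf16(hbm_gb):
    """Billions of parameters that fit as BF16 weights in hbm_gb of HBM."""
    return hbm_gb * 1e9 / 2 / 1e9

for gb in (96, 80, 48):
    print(f"{gb} GB -> {max_params_bf16(gb):.0f}B params")
```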

18 instructions.
Zero bloat.

The ISA is the primary security mechanism. There is no way to express a branch, a cache miss, a page fault, or a vertex shader in these 18 opcodes. The chip is architecturally incapable of general-purpose compute.

apex1_isa.v · instruction set reference · Rev 1.0
0x00 · NOP · No operation
0x01 · LOAD_TILE · HBM → scratchpad
0x02 · STORE_TILE · Scratchpad → HBM
0x03 · MATMUL · Systolic GEMM tile
0x04 · CONV2D · im2col / Winograd
0x05 · SPARSE_MM · 2:4 sparse GEMM
0x06 · ADD · Element-wise add
0x07 · SCALE · Scalar multiply
0x08 · ACTIVATE · GELU/ReLU/SiLU
0x09 · NORMALIZE · LayerNorm/RMSNorm
0x0A · SOFTMAX · Numerically stable
0x0B · REDUCE_SUM · Dim reduction
0x0C · REDUCE_MAX · Max reduction
0x0D · TRANSPOSE · Layout permutation
0x0E · GRAD_ACCUM · BF16→FP32 gradient
0x0F · ALL_REDUCE · Ring / tree collective
0x10 · DMA_PREFETCH · Scheduled async DMA
0x11 · SYNC_BARRIER · Cross-tile barrier
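The full opcode table, transcribed as a lookup map. Operand encoding is not specified in this document, so this covers only the mnemonic-to-opcode mapping:

```python
# The flat 18-entry APEX-1 opcode map, as listed in the ISA reference above.
OPCODES = {
    "NOP": 0x00, "LOAD_TILE": 0x01, "STORE_TILE": 0x02, "MATMUL": 0x03,
    "CONV2D": 0x04, "SPARSE_MM": 0x05, "ADD": 0x06, "SCALE": 0x07,
    "ACTIVATE": 0x08, "NORMALIZE": 0x09, "SOFTMAX": 0x0A, "REDUCE_SUM": 0x0B,
    "REDUCE_MAX": 0x0C, "TRANSPOSE": 0x0D, "GRAD_ACCUM": 0x0E,
    "ALL_REDUCE": 0x0F, "DMA_PREFETCH": 0x10, "SYNC_BARRIER": 0x11,
}

assert len(OPCODES) == 18          # the whole ISA fits on one screen
print(hex(OPCODES["MATMUL"]))      # 0x3
```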

Zero cache misses.
By construction.

No hardware caches. No TLBs. No eviction logic. The compiler schedules every byte movement at compile time. Memory latency is fully deterministic — the scheduler has already accounted for it.

Level · Capacity · Latency · Bandwidth
MAC Register File · 4 KB per array · 0 cycles · —
Tile Local SRAM · 2 MB per cluster · 2 cycles · 8 TB/s
Global Scratchpad SRAM · 256 MB on-chip · 10 cycles · 20 TB/s
HBM3e Off-Chip · 96 GB (2 stacks) · ~120 cycles · 6 TB/s
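A back-of-envelope check derived from the table's own numbers: the arithmetic intensity (INT8 ops per byte moved) a kernel needs at each level to keep the MACs fed at the dense peak. This is standard roofline arithmetic, not an APEX-1 specification:

```python
# Roofline-style check: peak_tops / bandwidth_tbs has units of ops per byte,
# the arithmetic intensity at which a kernel becomes compute-bound.
PEAK_TOPS = 3145  # dense INT8 peak from the spec table

def ops_per_byte_to_saturate(bw_tbs, peak_tops=PEAK_TOPS):
    return peak_tops / bw_tbs

for level, bw in [("HBM3e", 6), ("Global scratchpad", 20)]:
    print(f"{level}: {ops_per_byte_to_saturate(bw):.0f} INT8 ops/byte")
```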

Not just inference.
Full training.

APEX-1 supports all four workloads from day one. FP32 gradient accumulation, ring all-reduce, and multi-card scale-out are first-class hardware features, not software workarounds.

01

FP32 Gradient Accumulation

256-entry FP32 accumulator register file. BF16 gradients are automatically upcast, preventing the silent absorption that rounds small gradient values away when they are accumulated at half precision.

FP32 precision
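Why the wide accumulator matters can be shown numerically. Python's standard library has no BF16, so this sketch uses IEEE half precision as a stand-in (BF16, with only 7 mantissa bits, fares even worse): once the low-precision running sum grows large enough, each small gradient rounds to nothing:

```python
import struct

def to_f16(x):
    """Round to IEEE half precision (a stand-in for BF16 in this sketch)."""
    return struct.unpack('e', struct.pack('e', x))[0]

# Accumulate 10,000 tiny gradients of 1e-4; the true sum is 1.0.
acc16 = 0.0
for _ in range(10_000):
    acc16 = to_f16(acc16 + to_f16(1e-4))    # half-precision accumulator stalls

acc32 = sum(1e-4 for _ in range(10_000))    # wide accumulator stays accurate

print(acc16, acc32)
```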
02

Ring All-Reduce

Gradient synchronization across all dies and cards runs in dedicated hardware via the ALL_REDUCE instruction. Ring and tree topologies are supported. The compiler inserts barriers; no runtime negotiation.

Hardware collective
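A minimal software model of the ring pattern the hardware collective implements: a reduce-scatter pass followed by an all-gather pass around the ring. Die count and chunk layout here are illustrative, not the hardware's wire protocol:

```python
# Ring all-reduce over n simulated dies, each holding n chunks of gradients.
# After 2*(n-1) neighbor exchanges, every die holds the full element-wise sum.
def ring_all_reduce(data):
    n = len(data)
    data = [list(d) for d in data]
    # Reduce-scatter: after n-1 steps each die owns one fully summed chunk.
    for step in range(n - 1):
        snap = [row[:] for row in data]      # model simultaneous sends
        for i in range(n):
            c = (i - step) % n               # chunk passed around the ring
            data[(i + 1) % n][c] += snap[i][c]
    # All-gather: n-1 more steps circulate the reduced chunks to every die.
    for step in range(n - 1):
        snap = [row[:] for row in data]
        for i in range(n):
            c = (i + 1 - step) % n
            data[(i + 1) % n][c] = snap[i][c]
    return data

print(ring_all_reduce([[1, 2], [3, 4]]))     # both dies end with [4, 6]
```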
03

Multi-Format Training

Forward pass in FP8 E4M3 for maximum throughput. Gradient accumulation in FP32. Weight updates in BF16. The compiler manages format conversions at tile boundaries with zero programmer intervention.

FP8 / BF16 / FP32
04

LLM Training Capacity

96 GB HBM supports 8B parameter models with Adam optimizer state (weights + gradients + first/second moments = 12 bytes/param). FP8 weights extend this to 96B parameters for inference.

8B params w/ Adam
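The 12 bytes/param figure admits a plausible decomposition of BF16 weights (2) + BF16 gradients (2) + FP32 Adam first and second moments (4 + 4); the exact split is an assumption, but the total matches the text:

```python
# Adam training footprint: one plausible 12 bytes/param breakdown
# (assumed split; the document only states the 12-byte total).
BYTES_PER_PARAM = 2 + 2 + 4 + 4   # BF16 weights + BF16 grads + FP32 m + FP32 v

def training_capacity_params(hbm_gb, bytes_per_param=BYTES_PER_PARAM):
    """Billions of parameters trainable with Adam state in hbm_gb of HBM."""
    return hbm_gb * 1e9 / bytes_per_param / 1e9

print(training_capacity_params(96))  # 8.0 billion params, as stated above
```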
05

CNN & Vision Training

Winograd convolution in the forward pass, im2col in the backward pass where kernel symmetry breaks. Depthwise and grouped convolutions are supported natively via the CONV2D groups parameter.

Winograd forward
06

DVFS for Training Runs

Three power operating points selectable per run: Performance (400W / 1GHz), Balanced (260W / 800MHz), Efficiency (120W / 500MHz). The BMC monitors 8 junction-temperature sensors and throttles automatically.

3 DVFS modes

Ready to eliminate
the GPU tax?

APEX-1 is available for evaluation partnerships and early access programs. Architecture documentation and RTL available under NDA.