TSMC N3E · 4-Die CoWoS Package · Rev 1.0

APEX-1

The AI processor purpose-built for matrix multiplication and nothing else. 3,000+ TOPS. Every transistor devoted to the one operation that matters.

3,145T
INT8 OPS/sec
6,291T
Sparse TOPS
6TB/s
HBM Bandwidth
96GB
HBM3e Memory

Built for one thing.
Exceptional at it.

Every design decision — the ISA, the memory hierarchy, the control logic — was made by asking: does this help matrix multiplication? If not, it doesn't exist on APEX-1.

Systolic Array Fabric

64 × 128×128 MAC arrays across 4 dies. Weights stream downward, activations rightward. Zero broadcast overhead. Each cell performs one multiply-accumulate per clock, every clock.

1,048,576 MACs
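The dataflow above can be sketched functionally in a few lines. This is a plain-Python model of a weight-stationary tile, not a cycle-accurate simulation, and the array is shrunk far below 128×128 for readability:

```python
# Functional model of one weight-stationary systolic tile (hypothetical tiny
# sizes; the real arrays are 128x128). Weights stay resident in the cells;
# activation rows stream through, and each (i, j) cell contributes one
# multiply-accumulate per visit.
def systolic_matmul(A, W):
    """C = A @ W with the per-cell multiply-accumulates made explicit."""
    n, k = len(A), len(A[0])
    m = len(W[0])
    C = [[0] * m for _ in range(n)]
    for row in range(n):            # activation rows stream rightward
        for i in range(k):          # weight rows were loaded downward
            for j in range(m):      # one MAC cell per (i, j) position
                C[row][j] += A[row][i] * W[i][j]
    return C

A = [[1, 2], [3, 4]]
W = [[5, 6], [7, 8]]
print(systolic_matmul(A, W))  # [[19, 22], [43, 50]]
```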

Software-Managed SRAM

256 MB on-chip scratchpad with no cache hierarchy, no TLB, no eviction logic. The compiler controls every byte movement. Zero cache misses — not by luck, by construction.

20 TB/s internal BW
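One way a compiler could statically double-buffer tile movement so DMA overlaps compute. The instruction mnemonics come from the APEX-1 ISA; the schedule tuple format and buffer names are invented for illustration:

```python
# Hedged sketch: static double-buffering of scratchpad tiles. While the array
# computes on one buffer, the next tile is prefetched into the other. Every
# step is emitted at compile time; nothing is decided at runtime.
def schedule_tiles(num_tiles):
    prog = [("LOAD_TILE", 0, "buf0")]                  # prologue: fill buffer 0
    for t in range(num_tiles):
        cur, nxt = f"buf{t % 2}", f"buf{(t + 1) % 2}"
        if t + 1 < num_tiles:
            prog.append(("DMA_PREFETCH", t + 1, nxt))  # async load of next tile
        prog.append(("MATMUL", t, cur))                # compute on current tile
        prog.append(("SYNC_BARRIER",))                 # DMA complete before swap
    return prog

for op in schedule_tiles(3):
    print(op)
```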

Static Sequencer

A flat 18-instruction ISA with no branches, no out-of-order execution, no speculative prefetch. Every operation is scheduled at compile time. Latency is deterministic to the cycle.

18 instructions total
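With fixed-latency instructions and no branches, total runtime is just a compile-time sum. A sketch with placeholder cycle costs (the real per-instruction figures are not published here):

```python
# Deterministic latency: no branches, no speculation, so cycle count is a sum.
# The latencies below are invented placeholders, not APEX-1 specifications.
LATENCY = {"LOAD_TILE": 120, "MATMUL": 128, "STORE_TILE": 120, "SYNC_BARRIER": 1}

def cycle_count(program):
    """Exact runtime of a straight-line program, in cycles."""
    return sum(LATENCY[op] for op in program)

prog = ["LOAD_TILE", "MATMUL", "MATMUL", "STORE_TILE"]
print(cycle_count(prog))  # 496
```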

2:4 Sparsity Engine

Structured 2:4 sparsity decompressed on-the-fly from a compressed weight buffer plus a 2-bit index mask. Zero-weight multiplications are skipped at the hardware level, doubling effective throughput.

2× sparse throughput
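The compressed format described above can be sketched as follows. Per group of four weights, two values plus their 2-bit positions are stored; keeping the two largest by magnitude is an assumed pruning heuristic, not a documented APEX-1 behavior:

```python
# 2:4 structured sparsity sketch: compress keeps 2 of every 4 weights plus
# their 2-bit in-group positions; expand reconstructs the dense layout the
# way the hardware does on the fly.
def compress_2of4(w):
    vals, idx = [], []
    for g in range(0, len(w), 4):
        group = w[g:g + 4]
        keep = sorted(range(4), key=lambda i: -abs(group[i]))[:2]
        for i in sorted(keep):
            vals.append(group[i])
            idx.append(i)              # 2-bit position within the group
    return vals, idx

def expand_2of4(vals, idx):
    out = []
    for g in range(0, len(vals), 2):
        group = [0] * 4
        group[idx[g]] = vals[g]
        group[idx[g + 1]] = vals[g + 1]
        out.extend(group)
    return out

w = [0.9, 0.0, -0.4, 0.1, 0.0, 0.7, 0.2, 0.0]
vals, idx = compress_2of4(w)
print(vals, idx)                       # half the values, 2 bits of index each
print(expand_2of4(vals, idx))
```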

Convolution Engine

Dual-path: im2col for arbitrary kernels, Winograd F(2,3) for 3×3 stride-1 convolutions. The compiler selects the optimal path. Winograd reduces multiply count by 2.25× with numerically equivalent output to within floating-point rounding.

Winograd + im2col
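The Winograd transform is easy to check against direct convolution. In 1-D, F(2,3) produces two outputs of a 3-tap filter in 4 multiplies instead of 6; nesting it in 2-D gives F(2×2, 3×3) at 16 multiplies instead of 36, the 2.25× cut cited above:

```python
# Winograd F(2,3): two outputs of a 3-tap convolution in 4 multiplies.
def winograd_f23(d, g):
    d0, d1, d2, d3 = d                       # 4 input samples
    g0, g1, g2 = g                           # 3 filter taps
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]      # [y0, y1]

def direct_conv(d, g):                       # reference path: 6 multiplies
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

d, g = [1.0, 2.0, 3.0, 4.0], [0.5, 1.0, -0.5]
print(winograd_f23(d, g), direct_conv(d, g))
```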

UCIe Scale-Out

Four dies connected via UCIe 2.0 at 20 TB/s on a TSMC CoWoS-L silicon interposer. The compiler sees a single logical accelerator. Ring all-reduce for gradient sync runs in dedicated hardware.

4-die CoWoS package

Numbers that matter.

No general-purpose overhead. Every watt, every mm², every transistor allocated to AI compute.

Compute
Peak throughput (INT8) · 3,145 TOPS
Peak throughput (2:4 sparse) · 6,291 TOPS
Peak throughput (BF16) · 786 TOPS
Peak throughput (FP8) · 3,145 TOPS
Systolic arrays · 64 total (4 dies)
MAC cells · 1,048,576
Clock frequency · 1.0 GHz
Memory
HBM3e capacity · 96 GB
HBM bandwidth · 6 TB/s
On-chip SRAM · 256 MB
Internal SRAM BW · 20 TB/s
DMA channels · 8
Package & Power
Process node · TSMC N3E
Package · CoWoS-L 4-die
Die area (each) · ~300 mm²
TDP · ~400 W
Host interface · PCIe 5.0 x16
Die-to-die · UCIe 2.0 · 20 TB/s
Performance vs. alternatives
APEX-1 INT8 (dense) · 3,145 TOPS
APEX-1 (2:4 sparse) · 6,291 TOPS
GPU competitor A · ~1,979 TOPS
GPU competitor B · ~1,457 TOPS
Previous-gen TPU · ~918 TOPS

TOPS per watt
APEX-1 · 7.86 TOPS/W
GPU competitor A · ~2.51 TOPS/W
GPU competitor B · ~1.86 TOPS/W

Max model size (BF16 inference)
APEX-1 (96 GB HBM) · 48B params
GPU competitor (80 GB) · 40B params
GPU competitor (48 GB) · 24B params
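The capacity figures above follow directly from BF16 weights occupying 2 bytes per parameter:

```python
# Worked version of the capacity comparison: max BF16 model size is simply
# HBM capacity divided by 2 bytes per parameter (weights only, no KV cache
# or activation overhead counted here).
def max_params_bf16(hbm_gb):
    """Billions of parameters that fit as BF16 weights in hbm_gb of HBM."""
    return hbm_gb * 1e9 / 2 / 1e9

for gb in (96, 80, 48):
    print(f"{gb} GB -> {max_params_bf16(gb):.0f}B params")
```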

18 instructions.
Zero bloat.

The ISA is the primary security mechanism. There is no way to express a branch, a cache miss, a page fault, or a vertex shader in these 18 opcodes. The chip is architecturally incapable of general-purpose compute.

apex1_isa.v · instruction set reference · Rev 1.0
0x00 · NOP · No operation
0x01 · LOAD_TILE · HBM → scratchpad
0x02 · STORE_TILE · Scratchpad → HBM
0x03 · MATMUL · Systolic GEMM tile
0x04 · CONV2D · im2col / Winograd
0x05 · SPARSE_MM · 2:4 sparse GEMM
0x06 · ADD · Element-wise add
0x07 · SCALE · Scalar multiply
0x08 · ACTIVATE · GELU/ReLU/SiLU
0x09 · NORMALIZE · LayerNorm/RMSNorm
0x0A · SOFTMAX · Numerically stable
0x0B · REDUCE_SUM · Dim reduction
0x0C · REDUCE_MAX · Max reduction
0x0D · TRANSPOSE · Layout permutation
0x0E · GRAD_ACCUM · BF16→FP32 gradient
0x0F · ALL_REDUCE · Ring / tree collective
0x10 · DMA_PREFETCH · Scheduled async DMA
0x11 · SYNC_BARRIER · Cross-tile barrier
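The full opcode table, transcribed as a lookup map. Operand encoding is not specified in this document, so this covers only the mnemonic-to-opcode mapping:

```python
# The flat 18-entry APEX-1 opcode map, as listed in the ISA reference above.
OPCODES = {
    "NOP": 0x00, "LOAD_TILE": 0x01, "STORE_TILE": 0x02, "MATMUL": 0x03,
    "CONV2D": 0x04, "SPARSE_MM": 0x05, "ADD": 0x06, "SCALE": 0x07,
    "ACTIVATE": 0x08, "NORMALIZE": 0x09, "SOFTMAX": 0x0A, "REDUCE_SUM": 0x0B,
    "REDUCE_MAX": 0x0C, "TRANSPOSE": 0x0D, "GRAD_ACCUM": 0x0E,
    "ALL_REDUCE": 0x0F, "DMA_PREFETCH": 0x10, "SYNC_BARRIER": 0x11,
}

assert len(OPCODES) == 18          # the whole ISA fits on one screen
print(hex(OPCODES["MATMUL"]))      # 0x3
```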

Zero cache misses.
By construction.

No hardware caches. No TLBs. No eviction logic. The compiler schedules every byte movement at compile time. Memory latency is fully deterministic — the scheduler has already accounted for it.

Level · Capacity · Latency · Bandwidth
MAC Register File · 4 KB per array · 0 cycles · —
Tile Local SRAM · 2 MB per cluster · 2 cycles · 8 TB/s
Global Scratchpad SRAM · 256 MB on-chip · 10 cycles · 20 TB/s
HBM3e Off-Chip · 96 GB (2 stacks) · ~120 cycles · 6 TB/s
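A back-of-envelope check derived from the table's own numbers: the arithmetic intensity (INT8 ops per byte moved) a kernel needs at each level to keep the MACs fed at the dense peak. This is standard roofline arithmetic, not an APEX-1 specification:

```python
# Roofline-style check: peak_tops / bandwidth_tbs has units of ops per byte,
# the arithmetic intensity at which a kernel becomes compute-bound.
PEAK_TOPS = 3145  # dense INT8 peak from the spec table

def ops_per_byte_to_saturate(bw_tbs, peak_tops=PEAK_TOPS):
    return peak_tops / bw_tbs

for level, bw in [("HBM3e", 6), ("Global scratchpad", 20)]:
    print(f"{level}: {ops_per_byte_to_saturate(bw):.0f} INT8 ops/byte")
```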

Not just inference.
Full training.

APEX-1 supports all four workloads from day one. FP32 gradient accumulation, ring all-reduce, and multi-card scale-out are first-class hardware features, not software workarounds.

01

FP32 Gradient Accumulation

256-entry FP32 accumulator register file. BF16 gradients are automatically upcast, preventing the silent absorption that rounds small gradient values away when they are accumulated at half precision.

FP32 precision
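Why the wide accumulator matters can be shown numerically. Python's standard library has no BF16, so this sketch uses IEEE half precision as a stand-in (BF16, with only 7 mantissa bits, fares even worse): once the low-precision running sum grows large enough, each small gradient rounds to nothing:

```python
import struct

def to_f16(x):
    """Round to IEEE half precision (a stand-in for BF16 in this sketch)."""
    return struct.unpack('e', struct.pack('e', x))[0]

# Accumulate 10,000 tiny gradients of 1e-4; the true sum is 1.0.
acc16 = 0.0
for _ in range(10_000):
    acc16 = to_f16(acc16 + to_f16(1e-4))    # half-precision accumulator stalls

acc32 = sum(1e-4 for _ in range(10_000))    # wide accumulator stays accurate

print(acc16, acc32)
```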
02

Ring All-Reduce

Gradient synchronization across all dies and cards runs in dedicated hardware via the ALL_REDUCE instruction. Ring and tree topologies are supported. The compiler inserts barriers; no runtime negotiation.

Hardware collective
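A minimal software model of the ring pattern the hardware collective implements: a reduce-scatter pass followed by an all-gather pass around the ring. Die count and chunk layout here are illustrative, not the hardware's wire protocol:

```python
# Ring all-reduce over n simulated dies, each holding n chunks of gradients.
# After 2*(n-1) neighbor exchanges, every die holds the full element-wise sum.
def ring_all_reduce(data):
    n = len(data)
    data = [list(d) for d in data]
    # Reduce-scatter: after n-1 steps each die owns one fully summed chunk.
    for step in range(n - 1):
        snap = [row[:] for row in data]      # model simultaneous sends
        for i in range(n):
            c = (i - step) % n               # chunk passed around the ring
            data[(i + 1) % n][c] += snap[i][c]
    # All-gather: n-1 more steps circulate the reduced chunks to every die.
    for step in range(n - 1):
        snap = [row[:] for row in data]
        for i in range(n):
            c = (i + 1 - step) % n
            data[(i + 1) % n][c] = snap[i][c]
    return data

print(ring_all_reduce([[1, 2], [3, 4]]))     # both dies end with [4, 6]
```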
03

Multi-Format Training

Forward pass in FP8 E4M3 for maximum throughput. Gradient accumulation in FP32. Weight updates in BF16. The compiler manages format conversions at tile boundaries with zero programmer intervention.

FP8 / BF16 / FP32
04

LLM Training Capacity

96 GB HBM supports 8B parameter models with Adam optimizer state (weights + gradients + first/second moments = 12 bytes/param). FP8 weights extend this to 96B parameters for inference.

8B params w/ Adam
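The 12 bytes/param figure admits a plausible decomposition of BF16 weights (2) + BF16 gradients (2) + FP32 Adam first and second moments (4 + 4); the exact split is an assumption, but the total matches the text:

```python
# Adam training footprint: one plausible 12 bytes/param breakdown
# (assumed split; the document only states the 12-byte total).
BYTES_PER_PARAM = 2 + 2 + 4 + 4   # BF16 weights + BF16 grads + FP32 m + FP32 v

def training_capacity_params(hbm_gb, bytes_per_param=BYTES_PER_PARAM):
    """Billions of parameters trainable with Adam state in hbm_gb of HBM."""
    return hbm_gb * 1e9 / bytes_per_param / 1e9

print(training_capacity_params(96))  # 8.0 billion params, as stated above
```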
05

CNN & Vision Training

Winograd convolution in the forward pass, im2col in the backward pass where kernel symmetry breaks. Depthwise and grouped convolutions are supported natively via the CONV2D groups parameter.

Winograd forward
06

DVFS for Training Runs

Three power operating points selectable per run: Performance (400W / 1GHz), Balanced (260W / 800MHz), Efficiency (120W / 500MHz). The BMC monitors 8 junction-temperature sensors and throttles automatically.

3 DVFS modes

Ready to eliminate
the GPU tax?

APEX-1 is available for evaluation partnerships and early access programs. Architecture documentation and RTL available under NDA.