2026-04-19 engineering 6 min read

What BEST Routing Actually Saves

Empirical speedups from hybrid EML/EDL/EXL/EAL dispatch. The numbers, honestly reported. Updated 2026-04-20 with SuperBEST FINAL table (neg=2n, mul=3n, sub=3n).

The EML operator family has three members. Each generates all elementary functions, but at different node counts for different primitives. BEST (Binary Expression Select & Transform) routing dispatches each operation to whichever operator computes it most cheaply.
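At its core this is a cost-table lookup. A minimal sketch, assuming per-operator node costs like those in the table below (the dictionary layout and function name here are illustrative, not monogate's actual API):

```python
# Minimal sketch of BEST dispatch: route each primitive to the
# family member with the lowest node count. Costs mirror the
# node count table; the structure itself is a placeholder.
NODE_COSTS = {
    "div": {"EML": 15, "EDL": 1},
    "mul": {"EML": 13, "EXL": 3},
    "pow": {"EML": 15, "EXL": 3},
    "neg": {"EML": 6,  "EXL": 2},
    "add": {"EML": 11, "EAL": 3},  # EAL entry valid only for x > 0
}

def best_route(op: str) -> tuple[str, int]:
    """Return (operator, node_count) minimizing node count for `op`."""
    costs = NODE_COSTS[op]
    backend = min(costs, key=costs.get)
    return backend, costs[backend]
```

For example, `best_route("div")` returns `("EDL", 1)`: division routes to EDL at a single node rather than EML's fifteen.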

Node count table

These are verified minimum node counts — the exact number of EML-family calls required:

Function        Pure EML   BEST   Savings
exp(x)                 1      1        0%
ln(x)                  3      1      −67%
neg(x)                 6      2      −67%
x + y (x>0)           11      3      −73%
x − y (x>0)            5      3      −40%
x × y (x>0)           13      3      −77%
x ÷ y                 15      1      −93%
x^n (x>0)             15      3      −80%
recip(x)               5      2      −60%
sin(x)               245     63      −74%

The big wins are div (93% via EDL), pow (80% via EXL), mul (77% via EXL), and add/neg (67–73% via EAL/EXL). The SuperBEST FINAL table (2026-04-20) achieves 71.2% total savings across 9 standard arithmetic operations: 21 nodes versus 73 naive. All entries are proved optimal by exhaustive search. Note that addition for general x, y ∈ ℝ still costs 11n (EML is the only capable operator); the 3n entry applies only when x > 0.
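The Savings column can be recomputed mechanically from the node counts — a quick sanity check, not part of the library:

```python
# Recompute Savings as 1 - best/pure, rounded to a whole percent.
table = {
    "ln":    (3, 1),
    "neg":   (6, 2),
    "add":   (11, 3),
    "sub":   (5, 3),
    "mul":   (13, 3),
    "div":   (15, 1),
    "pow":   (15, 3),
    "recip": (5, 2),
    "sin":   (245, 63),
}

savings = {op: round(100 * (1 - best / pure))
           for op, (pure, best) in table.items()}
# e.g. savings["div"] == 93 and savings["sin"] == 74,
# matching the table above
```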

Wall-clock speedups

Node count savings translate to wall-clock speedups, but not uniformly. The crossover threshold is about 20% node savings — below that, the dispatch overhead exceeds the compute savings.

Workload                     Node savings   Speedup
sin/cos Taylor (TinyMLP)         74%          2.8×
Polynomial x⁴+x³+x²              54%          2.1×
GELU (Transformer FFN)           18%          0.93×

GELU is a negative result, honestly reported: 18% node savings falls below the crossover threshold, and BEST routing is slower on GELU-heavy workloads. If you're building a transformer FFN replacement, BEST routing doesn't help.
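Given the ~20% crossover, a dispatcher can simply fall back to pure EML when predicted savings are too small. A hedged sketch — the threshold constant comes from the benchmarks above, and the function name is hypothetical:

```python
# Node-savings fraction below which dispatch overhead dominates,
# per the wall-clock benchmarks above.
CROSSOVER = 0.20

def should_use_best(pure_nodes: int, best_nodes: int) -> bool:
    """Enable BEST routing only when node savings clear the crossover."""
    node_savings = 1 - best_nodes / pure_nodes
    return node_savings > CROSSOVER

# sin/cos Taylor (74% savings)  -> route through BEST
# GELU-style    (18% savings)   -> stay on pure EML
```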

Performance kernels

We also ship a Rust-accelerated kernel and a fused Python kernel. These compound with BEST routing:

Backend                  ms/step   Speedup
Standard Python            8.3       1.0×
FusedEMLActivation         2.3       3.6×
Fused + torch.compile      1.9       4.4×
Rust (monogate-core)       1.4       5.9×

Benchmark: EMLLayer depth=2, 256→256, batch=1024, CPU. The Rust kernel uses PyO3 bindings.
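The Speedup column is just the baseline divided by each backend's ms/step, recomputed here as a sanity check:

```python
# Recompute backend speedups relative to standard Python (8.3 ms/step).
ms_per_step = {
    "Standard Python":       8.3,
    "FusedEMLActivation":    2.3,
    "Fused + torch.compile": 1.9,
    "Rust (monogate-core)":  1.4,
}

baseline = ms_per_step["Standard Python"]
speedups = {k: round(baseline / v, 1) for k, v in ms_per_step.items()}
# 3.6x, 4.4x, and 5.9x match the table above
```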

What BEST routing is not

BEST routing is a node-count optimization within the EML framework. It doesn't make EML competitive with purpose-built implementations of individual functions. A native sin(x) in hardware is faster than any EML tree, even a 1-node complex CBEST.

What BEST routing does: reduces the cost of expressing arbitrary combinations of elementary functions as EML trees, when you're already committed to the EML framework for other reasons (symbolic manipulation, gradient-free search, interpretability).

Install: pip install monogate · Rust source: monogate-core/
