2026-04-19 engineering 6 min read

What BEST Routing Actually Saves

Empirical speedups from hybrid EML/EDL/EXL/EAL dispatch. The numbers, honestly reported. Updated 2026-04-20 with SuperBEST FINAL table (neg=2n, mul=3n, sub=3n).

The EML operator family has three members. Each generates all elementary functions, but at different node counts for different primitives. BEST (Binary Expression Select & Transform) routing dispatches each operation to whichever operator computes it most cheaply.
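At its core this is a cost-table lookup. A minimal sketch, assuming per-operator node costs like those in the table below (the dictionary layout and function name here are illustrative, not monogate's actual API):

```python
# Minimal sketch of BEST dispatch: route each primitive to the
# family member with the lowest node count. Costs mirror the
# node count table; the structure itself is a placeholder.
NODE_COSTS = {
    "div": {"EML": 15, "EDL": 1},
    "mul": {"EML": 13, "EXL": 3},
    "pow": {"EML": 15, "EXL": 3},
    "neg": {"EML": 6,  "EXL": 2},
    "add": {"EML": 11, "EAL": 3},  # EAL entry valid only for x > 0
}

def best_route(op: str) -> tuple[str, int]:
    """Return (operator, node_count) minimizing node count for `op`."""
    costs = NODE_COSTS[op]
    backend = min(costs, key=costs.get)
    return backend, costs[backend]
```

For example, `best_route("div")` returns `("EDL", 1)`: division routes to EDL at a single node rather than EML's fifteen.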

Node count table

These are verified minimum node counts — the exact number of EML-family calls required:

Function        Pure EML   BEST   Savings
exp(x)                 1      1        0%
ln(x)                  3      1      −67%
neg(x)                 6      2      −67%
x + y (x>0)           11      3      −73%
x − y (x>0)            5      3      −40%
x × y (x>0)           13      3      −77%
x ÷ y                 15      1      −93%
x^n (x>0)             15      3      −80%
recip(x)               5      2      −60%
sin(x)               245     63      −74%

The big wins are div (93% via EDL), pow (80% via EXL), mul (77% via EXL), and add/neg (67–73% via EAL/EXL). The SuperBEST FINAL table (2026-04-20) achieves 71.2% total savings across 9 standard arithmetic operations: 21 nodes versus 73 naive. All entries are proved optimal by exhaustive search. Note that addition for general x, y ∈ ℝ still costs 11n (EML is the only capable operator); the 3n entry applies only when x > 0.
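The Savings column can be recomputed mechanically from the node counts — a quick sanity check, not part of the library:

```python
# Recompute Savings as 1 - best/pure, rounded to a whole percent.
table = {
    "ln":    (3, 1),
    "neg":   (6, 2),
    "add":   (11, 3),
    "sub":   (5, 3),
    "mul":   (13, 3),
    "div":   (15, 1),
    "pow":   (15, 3),
    "recip": (5, 2),
    "sin":   (245, 63),
}

savings = {op: round(100 * (1 - best / pure))
           for op, (pure, best) in table.items()}
# e.g. savings["div"] == 93 and savings["sin"] == 74,
# matching the table above
```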

Wall-clock speedups

Node count savings translate to wall-clock speedups, but not uniformly. The crossover threshold is about 20% node savings — below that, the dispatch overhead exceeds the compute savings.

Workload                     Node savings   Speedup
sin/cos Taylor (TinyMLP)         74%          2.8×
Polynomial x⁴+x³+x²              54%          2.1×
GELU (Transformer FFN)           18%          0.93×

GELU is a negative result, honestly reported: 18% node savings falls below the crossover threshold, and BEST routing is slower on GELU-heavy workloads. If you're building a transformer FFN replacement, BEST routing doesn't help.
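Given the ~20% crossover, a dispatcher can simply fall back to pure EML when predicted savings are too small. A hedged sketch — the threshold constant comes from the benchmarks above, and the function name is hypothetical:

```python
# Node-savings fraction below which dispatch overhead dominates,
# per the wall-clock benchmarks above.
CROSSOVER = 0.20

def should_use_best(pure_nodes: int, best_nodes: int) -> bool:
    """Enable BEST routing only when node savings clear the crossover."""
    node_savings = 1 - best_nodes / pure_nodes
    return node_savings > CROSSOVER

# sin/cos Taylor (74% savings)  -> route through BEST
# GELU-style    (18% savings)   -> stay on pure EML
```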

Performance kernels

We also ship a Rust-accelerated kernel and a fused Python kernel. These compound with BEST routing:

Backend                  ms/step   Speedup
Standard Python            8.3       1.0×
FusedEMLActivation         2.3       3.6×
Fused + torch.compile      1.9       4.4×
Rust (monogate-core)       1.4       5.9×

Benchmark: EMLLayer depth=2, 256→256, batch=1024, CPU. The Rust kernel uses PyO3 bindings.
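The Speedup column is just the baseline divided by each backend's ms/step, recomputed here as a sanity check:

```python
# Recompute backend speedups relative to standard Python (8.3 ms/step).
ms_per_step = {
    "Standard Python":       8.3,
    "FusedEMLActivation":    2.3,
    "Fused + torch.compile": 1.9,
    "Rust (monogate-core)":  1.4,
}

baseline = ms_per_step["Standard Python"]
speedups = {k: round(baseline / v, 1) for k, v in ms_per_step.items()}
# 3.6x, 4.4x, and 5.9x match the table above
```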

What BEST routing is not

BEST routing is a node-count optimization within the EML framework. It doesn't make EML competitive with purpose-built implementations of individual functions. A native sin(x) in hardware is faster than any EML tree, even a 1-node complex CBEST.

What BEST routing does: reduces the cost of expressing arbitrary combinations of elementary functions as EML trees, when you're already committed to the EML framework for other reasons (symbolic manipulation, gradient-free search, interpretability).

Install: pip install monogate · Rust source: monogate-core/
