What BEST Routing Actually Saves
Empirical speedups from hybrid EML/EDL/EXL/EAL dispatch. The numbers, honestly reported.
Updated 2026-04-20 with SuperBEST FINAL table (neg=2n, mul=3n, sub=3n).
The EML operator family has three members:

- EML: exp(x) − ln(y)
- EDL: exp(x) / ln(y)
- EXL: exp(x) · ln(y)
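To make the definitions concrete, here is a quick numeric sanity check (my own illustration, not part of the benchmark suite). Wrapping the inputs as ln(a) and exp(b), one node each, turns a single operator node into ordinary arithmetic at three nodes total, consistent with the 3-node sub and mul entries in the table below; the table's 1-node div entry implies an even cheaper route than this generic one.

```python
import math

# The three EML-family operators, as defined above.
def EML(x, y):  # exp(x) - ln(y)
    return math.exp(x) - math.log(y)

def EDL(x, y):  # exp(x) / ln(y)
    return math.exp(x) / math.log(y)

def EXL(x, y):  # exp(x) * ln(y)
    return math.exp(x) * math.log(y)

a, b = 2.5, 1.75

# ln(a) and exp(b) wrappers (one node each) + one operator node = 3 nodes:
assert math.isclose(EML(math.log(a), math.exp(b)), a - b)  # sub: 3 nodes
assert math.isclose(EXL(math.log(a), math.exp(b)), a * b)  # mul: 3 nodes
assert math.isclose(EDL(math.log(a), math.exp(b)), a / b)  # div, generic route
```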
Each operator can generate all elementary functions, but at different node counts per primitive. BEST (Binary Expression Select & Transform) routing dispatches each operation to whichever operator computes it cheapest.
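The dispatch idea itself is a table lookup plus an argmin. A minimal sketch, with a hypothetical cost table whose numbers follow the benchmark table below and whose operator attributions follow the prose (div via EDL, mul/pow via EXL, add via EAL); this is illustrative, not the monogate API:

```python
# Hypothetical per-primitive cost table: {primitive: {operator: node_count}}.
COSTS = {
    "div": {"EML": 15, "EDL": 1},
    "mul": {"EML": 13, "EXL": 3},
    "pow": {"EML": 15, "EXL": 3},
    "add": {"EML": 11, "EAL": 3},
}

def best_route(primitive):
    """Dispatch a primitive to whichever operator computes it cheapest."""
    table = COSTS[primitive]
    op = min(table, key=table.get)
    return op, table[op]

assert best_route("div") == ("EDL", 1)
assert best_route("mul") == ("EXL", 3)
```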
Node count table
These are verified minimum node counts — the exact number of EML-family calls required:
| Function | Pure EML | BEST | Savings |
|---|---|---|---|
| exp(x) | 1 | 1 | — |
| ln(x) | 3 | 1 | −67% |
| neg(x) | 6 | 2 | −67% |
| x + y (x>0) | 11 | 3 | −73% |
| x − y (x>0) | 5 | 3 | −40% |
| x × y (x>0) | 13 | 3 | −77% |
| x ÷ y | 15 | 1 | −93% |
| x^n (x>0) | 15 | 3 | −80% |
| recip(x) | 5 | 2 | −60% |
| sin(x) | 245 | 63 | −74% |
The big wins are div (93% via EDL), mul (77% via EXL), pow (80% via EXL), and add/neg (67–73% via EAL/EXL). The SuperBEST FINAL table (2026-04-20) achieves 71.2% total savings across 9 standard arithmetic operations — 21 nodes vs 73 naive. All entries proved optimal by exhaustive search. Addition for general x,y∈ℝ still costs 11n (EML is the only capable operator); the 3n entry applies when x>0.
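The savings column is straightforward to recompute from the node counts. A quick verification of the headline figures (my own check script, using only numbers from the table above):

```python
def savings(pure, best):
    """Percent node-count reduction, rounded to the nearest integer."""
    return round((pure - best) / pure * 100)

assert savings(15, 1) == 93    # div via EDL
assert savings(13, 3) == 77    # mul via EXL
assert savings(15, 3) == 80    # pow via EXL
assert savings(245, 63) == 74  # sin Taylor tree

# Aggregate SuperBEST FINAL figure: 21 nodes vs 73 naive.
assert round((73 - 21) / 73 * 100, 1) == 71.2
```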
Wall-clock speedups
Node count savings translate to wall-clock speedups, but not uniformly. The crossover threshold is about 20% node savings — below that, the dispatch overhead exceeds the compute savings.
| Workload | Node savings | Speedup |
|---|---|---|
| sin/cos Taylor (TinyMLP) | 74% | 2.8× |
| Polynomial x⁴+x³+x² | 54% | 2.1× |
| GELU (Transformer FFN) | 18% | 0.93× |
GELU is a negative result, honestly reported: 18% node savings falls below the crossover threshold, and BEST routing is slower on GELU-heavy workloads. If you're building a transformer FFN replacement, BEST routing doesn't help.
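Deciding per workload, the crossover rule above reduces to a one-line guard. The 20% threshold is the empirical figure quoted earlier; the function name is my own:

```python
CROSSOVER = 0.20  # empirical dispatch-overhead crossover from the benchmarks

def use_best_routing(node_savings):
    """Enable BEST routing only when node savings exceed the crossover."""
    return node_savings > CROSSOVER

assert use_best_routing(0.74)      # sin/cos Taylor: 2.8x faster
assert not use_best_routing(0.18)  # GELU: 0.93x, i.e. a slowdown
```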
Performance kernels
We also have a Rust-accelerated kernel and a fused Python kernel. Their speedups compound with BEST routing:
| Backend | ms/step | Speedup |
|---|---|---|
| Standard Python | 8.3 | 1× |
| FusedEMLActivation | 2.3 | 3.6× |
| Fused + torch.compile | 1.9 | 4.4× |
| Rust (monogate-core) | 1.4 | 5.9× |
Benchmark: EMLLayer depth=2, 256→256, batch=1024, CPU. The Rust kernel uses PyO3 bindings.
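The ms/step figures come from a plain wall-clock loop. A generic harness like the following (illustrative, not the suite's actual code) reproduces the shape of the measurement:

```python
import time

def ms_per_step(fn, *args, warmup=10, iters=100):
    """Average wall-clock milliseconds per call, after a warmup."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters * 1e3

# Speedup is the baseline/backend ratio, e.g. 8.3 / 1.4 ~= 5.9x for Rust.
baseline = ms_per_step(lambda: sum(range(1000)))
assert baseline > 0
```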
What BEST routing is not
BEST routing is a node-count optimization within the EML framework. It doesn't make EML competitive with purpose-built implementations of individual functions. A native sin(x) in hardware is faster than any EML tree, even a 1-node complex CBEST.
What BEST routing does: reduces the cost of expressing arbitrary combinations of elementary functions as EML trees, when you're already committed to the EML framework for other reasons (symbolic manipulation, gradient-free search, interpretability).
Install: pip install monogate · Rust source: monogate-core/