EML Meets Neural Networks

What happens when you put a 1-operator symbolic head on a DNN


The standard story for neural networks and symbolic regression goes like this: first you train a DNN until it fits the data, then you run something like PySR or gplearn over the residuals to find a closed-form approximation. The DNN is a black box. The symbolic part happens after the fact, separately, with no shared structure.

This post explores a different idea. What if the DNN trunk and the symbolic head share the same mathematical vocabulary? Specifically: what if every activation function in the network is itself a cheap EML operator, and the symbolic head tries to distil the whole thing into one small expression?

The answer turns out to depend almost entirely on what the target function is built from.


The Architecture

The hybrid setup is straightforward. A DNN trunk — a few dense layers — extracts features from raw inputs. The final layer feeds into an EML symbolic head: a small tree of operators from the monogate family (eml(x,y) = exp(x) − ln(y) and its cousins). The head tries to compress what the trunk learned into a closed-form expression.

Training has two phases. First, standard backpropagation trains the full network end-to-end. Second, a distillation pass searches for the smallest EML tree that matches the trunk’s output on the training distribution.
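To make the wiring concrete, here is a minimal sketch of phase one, assuming PyTorch; the layer sizes, the toy eml target, and all variable names are mine, not the project's:

    import torch
    import torch.nn as nn

    # Trunk: a small dense network. Every activation is softplus, which
    # (a point this post makes below) is itself the 1n EML operator LEAd(x, 1).
    trunk = nn.Sequential(
        nn.Linear(2, 32), nn.Softplus(),
        nn.Linear(32, 32), nn.Softplus(),
        nn.Linear(32, 1),
    )

    # Phase 1: standard end-to-end backpropagation on a toy eml target.
    opt = torch.optim.Adam(trunk.parameters(), lr=1e-3)
    for _ in range(2000):
        xy = torch.rand(256, 2) * 4 + 0.5                      # inputs in [0.5, 4.5]
        target = torch.exp(xy[:, :1]) - torch.log(xy[:, 1:])   # eml(x, y)
        loss = nn.functional.mse_loss(trunk(xy), target)
        opt.zero_grad(); loss.backward(); opt.step()

    # Phase 2 searches for the smallest EML tree matching trunk's output
    # on this distribution; a stripped-down version appears later in the post.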

The SuperBEST v4 cost system assigns each operator a cost in n-units:

Operator                     Cost
exp, ln, recip               1n each
neg, mul, sub, div, sqrt     2n each
pow                          3n
add_pos                      3n
add_gen                      11n

The cost of an expression is the sum of its operator costs. The goal of distillation is to find the lowest-cost EML tree that represents the function the DNN learned.
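As a worked example of the accounting, here is a minimal cost function in Python, under one assumption of mine: expressions are nested (operator, children) tuples, and leaves (variables, constants) are free.

    # SuperBEST v4 operator costs, in n-units (from the table above).
    COST = {
        "exp": 1, "ln": 1, "recip": 1,
        "neg": 2, "mul": 2, "sub": 2, "div": 2, "sqrt": 2,
        "pow": 3, "add_pos": 3,
        "add_gen": 11,
    }

    def expr_cost(expr):
        """Sum operator costs over a tree of (op, *children); leaves cost 0n."""
        if not isinstance(expr, tuple):
            return 0          # a variable or constant leaf
        op, *children = expr
        return COST[op] + sum(expr_cost(c) for c in children)

    # eml(x, y) = exp(x) - ln(y): sub 2n + exp 1n + ln 1n = 4n by this accounting
    assert expr_cost(("sub", ("exp", "x"), ("ln", "y"))) == 4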


The D=4 Gradient Wall

There is a problem with pure EML trees as the architecture for every layer: they become untrainable past depth 3 or 4.

The issue is exponential gradient growth. Consider a 4-layer EML tree where each layer applies exp(·). The gradient of the output with respect to the first-layer input involves a product of four exp values. For inputs with magnitude greater than ~2, this product explodes. Backpropagation diverges.

DEML fixes this. The operator DEML(x, r) = exp(−x) − ln(r) uses exp(−x) instead of exp(x). The gradient of exp(−x) with respect to x is −exp(−x), which stays in (−1, 0) for all x > 0. The dampening is structural: adding the minus sign in the exponent bounds the gradient magnitude at each layer.

The LEAd operator (= softplus) fixes it in a different way. LEAd(x, r) = ln(exp(x) + r). Its gradient with respect to x is exp(x)/(exp(x) + r), which for r = 1 is σ(x) = 1/(1 + exp(−x)), the sigmoid function. For any r > 0 this gradient stays in (0, 1) for all real x. No explosion, no vanishing. This is why LEAd is the recommended activation for EML hybrid networks.
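The three gradient behaviours are easy to verify numerically. A sketch in plain Python using the analytic derivatives above; the depth-4 exp chain overflows a float, which is the gradient wall made literal:

    import math

    def grad_exp_chain(x, depth):
        """|dy/dx| for y = exp applied `depth` times: a product of exp values."""
        g, z = 1.0, x
        for _ in range(depth):
            g *= math.exp(z)        # chain rule: d/dz exp(z) = exp(z)
            z = math.exp(z)
        return g

    def grad_deml(x):               # d/dx exp(-x) = -exp(-x), in (-1, 0) for x > 0
        return -math.exp(-x)

    def grad_lead(x, r=1.0):        # d/dx ln(exp(x) + r) = exp(x)/(exp(x) + r)
        return math.exp(x) / (math.exp(x) + r)

    print(grad_deml(3.0))           # -0.0498..., bounded
    print(grad_lead(3.0))           # 0.9525..., in (0, 1)
    try:
        print(grad_exp_chain(1.0, depth=4))
    except OverflowError:
        print("exp chain gradient overflowed at depth 4")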


Activation Function Costs: The Headline Number

The headline result from NN-7 is a single equation:

softplus(x) = ln(1 + exp(x)) = LEAd(x, 1), at a total cost of 1n

Softplus is not just EML-friendly — it is an EML primitive. When the DNN uses softplus activations throughout, every layer of the network is a 1n EML operation. The symbolic head can, in principle, read the whole network as one big EML expression and compress it.
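A quick numerical check of the identity (the helper names are mine):

    import math

    def lead(x, r):                             # LEAd(x, r) = ln(exp(x) + r)
        return math.log(math.exp(x) + r)

    for x in (-3.0, 0.0, 1.5, 4.0):
        softplus = math.log1p(math.exp(x))      # ln(1 + exp(x))
        assert abs(lead(x, 1.0) - softplus) < 1e-12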

The full activation shootout:

Activation   Formula                                 SuperBEST cost         Gradient          PINN-suitable
softplus     ln(1 + exp(x))                          1n                     σ(x) ∈ (0, 1)     Yes
sigmoid      1/(1 + exp(−x))                         7n                     σ(x)(1 − σ(x))    No
tanh         (exp(x) − exp(−x))/(exp(x) + exp(−x))   9n                     1 − tanh²(x)      No
Swish        x·σ(x)                                  9n                     complex           No
ReLU         max(0, x)                               N/A                    0 or 1            No
GELU         x·Φ(x)                                  N/A (erf not in EML)   smooth            No

ReLU and GELU are not in the EML operator closure. No finite EML tree represents them exactly. Distillation fails by definition when the activations are ReLU.

The PINN insight. Physics equations are built from exp and ln. Arrhenius kinetics is A·exp(−Ea/RT). Boltzmann distributions are exp(−E/kT)/Z. Entropy is −Σ p·ln(p). When the activation IS a 1n EML operator, the network layers speak the same mathematical language as the physics. Distillation recovers the physics equation, not just an approximation.
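To make "same language" concrete: one way to write Arrhenius as an EML tree, using the expr_cost sketch from earlier and treating A and the folded constant Ea/R as free leaves, lands exactly on the 7n figure quoted later in the distillation table.

    # A * exp(-(Ea/R) / T): mul 2n + exp 1n + neg 2n + div 2n = 7n
    arrhenius = ("mul", "A", ("exp", ("neg", ("div", "Ea_over_R", "T"))))
    assert expr_cost(arrhenius) == 7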


Where EML Wins: Transcendental Targets

EML has a structural advantage on any target built from exp, ln, or trig (via the CEML complex extension).

From the NN-6 SR benchmark analysis across 10 standard functions:

EML wins on 5 of the 10 functions.

The pattern is clear: if the target is what physics textbooks call a transcendental function, EML finds a compact closed form. PySR finds an approximation requiring many more nodes.


Where EML Loses: Polynomial Targets

EML loses on 3 of the 10 functions.

The structural reason is add_gen. General addition costs 11n in SuperBEST v4 because it requires routing through the full 11-operator sub-expression. Polynomials need many additions. A 6-term polynomial like Nguyen-4 needs 5 additions at 11n each — 55n just for the additions, on top of the 15n for the power terms.

Polynomial SR tools (PySR, gplearn) represent addition and multiplication as 1-node operations. They win on polynomials by design. EML’s cost structure penalises addition because exp(x) − ln(y) is the native operation, not x + y.
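Spelling out the Nguyen-4 arithmetic with the same expr_cost helper (the left-fold tree shape is one choice of mine; any shape gives the same total):

    # Nguyen-4: x^6 + x^5 + x^4 + x^3 + x^2 + x -- 6 terms, 5 general additions
    terms = ["x"] + [("pow", "x", str(k)) for k in range(2, 7)]   # 5 pows at 3n
    tree = terms[0]
    for t in terms[1:]:
        tree = ("add_gen", tree, t)                               # 5 adds at 11n
    assert expr_cost(tree) == 5 * 11 + 5 * 3                      # 70n total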

The honest score: 5 wins, 3 losses, 2 neutral out of 10 standard benchmarks.

EML symbolic heads are not general-purpose symbolic regression. They are specialised tools for physics-informed targets built from the exp-ln family.


Expression Distillation: What Actually Works

The NN-8 distillation analysis covers five canonical targets:

Target                        EML cost   Completeness   Prediction
Arrhenius A·exp(−Ea/RT)       7n         Complete       Success
sigmoid 1/(1 + exp(−x))       7n         Complete       Success
polynomial x³ + x             14n        Complete       Success (slow)
sin(x)                        N/A        Incomplete     Fails
ReLU max(0, x)                N/A        Incomplete     Fails

The completeness theorem explains the failure cases:

sin(x) fails because it has infinitely many real zeros (at x = nπ for all integers n). Any EML expression has at most finitely many real zeros — exp(g(x)) = ln(h(x)) produces isolated solutions, not an infinite discrete set. No finite EML tree can equal sin(x) on all of ℝ.

ReLU fails because EML expressions are smooth (C^∞, since exp and ln are analytic). ReLU has a non-differentiable kink at x = 0. A smooth function cannot equal ReLU on any open interval containing that kink, let alone on all of ℝ.

Compression ratios when distillation succeeds are substantial. A 500-parameter MLP trained to compute Arrhenius kinetics distils to a 5-node EML tree: roughly 100:1 compression. The distilled expression is exact on the training distribution, not an approximation.
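Here is the distillation loop stripped to its essence, assuming PyTorch. In place of a real tree search it proposes just the 7n Arrhenius template and fits its two constants to the trunk's output in log space; the constants, layer sizes, and helper names are all illustrative, not NN-8's actual pipeline.

    import torch
    import torch.nn as nn

    # Ground truth: Arrhenius with illustrative constants A = 1e3, Ea/R = 1.2e3.
    T = torch.linspace(300.0, 1000.0, 512).unsqueeze(1)
    y = 1e3 * torch.exp(-1.2e3 / T)

    # Trunk: a few hundred parameters of softplus MLP.
    mlp = nn.Sequential(nn.Linear(1, 16), nn.Softplus(),
                        nn.Linear(16, 16), nn.Softplus(),
                        nn.Linear(16, 1))
    opt = torch.optim.Adam(mlp.parameters(), lr=1e-2)
    for _ in range(3000):
        loss = nn.functional.mse_loss(mlp(T / 1000.0), y)
        opt.zero_grad(); loss.backward(); opt.step()

    # Distill: propose the 7n template a * exp(-b / T) and fit (a, b)
    # to the *trunk's* output (not the raw data) by least squares in log space.
    with torch.no_grad():
        z = torch.log(mlp(T / 1000.0).clamp_min(1e-9))     # ~ ln a - b / T
    X = torch.cat([torch.ones_like(T), -1.0 / T], dim=1)
    coef = torch.linalg.lstsq(X, z).solution                # [ln a, b]
    print(f"distilled: {coef[0].exp().item():.0f} * exp(-{coef[1].item():.0f} / T)")

On a good training run the printout lands near 1000 * exp(-1200 / T), i.e. the distilled expression recovers the generating constants rather than approximating the curve.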


Choosing the Operator Subset

NN-9 evaluated six operator subsets across the five targets above. The key finding:

{EML, DEML, LEAd} is the sweet spot.

This subset posts the highest expected success rates of the six candidates.

The full F16 set achieves lower rates on most targets despite higher expressivity, because the heterogeneous gradient magnitudes (add_gen contributes 11n worth of gradient signal vs 1n for the exp operators) make training unstable.

The recommendation stands: use {EML, DEML, LEAd} for PINN hybrid architectures. Add CEML only if trig targets are required.
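In code, the recommended head vocabulary is just three callables; a sketch, with names and signatures of my choosing, following the definitions used throughout this post:

    import math

    PINN_OPS = {  # the {EML, DEML, LEAd} subset for hybrid PINN heads
        "eml":  lambda x, y: math.exp(x) - math.log(y),   # exp(x) - ln(y)
        "deml": lambda x, r: math.exp(-x) - math.log(r),  # damped: bounded gradient
        "lead": lambda x, r: math.log(math.exp(x) + r),   # softplus family
    }
    # Add a "ceml" entry only if trig targets are in scope.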


Honest Conclusion

EML symbolic heads are a specialised tool. They work well — sometimes remarkably well — when the target function is built from the same mathematical atoms as the EML operator family: exponentials, logarithms, and their combinations.

They do not work well on polynomials, because the addition operator is expensive in EML’s cost structure. They cannot work at all on trigonometric or piecewise functions by completeness theory.

The practical scope is narrower than general symbolic regression but precisely matched to one important domain: physics-informed machine learning. Arrhenius kinetics, Boltzmann distributions, chemical equilibria, log-linear models — these are all native EML territory. For these targets, a DNN with softplus activations and an EML symbolic head can distil the learned function into an exact closed-form expression at ~100:1 compression.

That is a specific, honest, and useful result.


Monogate project — sessions NN-1 through NN-10. Source: github.com/almagu/monogate.