The ReLU–Softplus Error is Exactly ln(2)/β

Tier: THEOREM (proved)

Softplus is the smooth stand-in for ReLU everyone’s used for a decade: $$ \mathrm{softplus}_\beta(x) \;=\; \frac{1}{\beta}\,\ln\!\bigl(1 + e^{\beta x}\bigr), \qquad \mathrm{ReLU}(x) = \max(x, 0). $$ Make β big and softplus hugs ReLU closely. Make it small and you get a soft bend. Almost every neural-network textbook mentions the approximation; almost none of them tell you how good it is.

It turns out the answer is exact.


The bound

Theorem. For every real x and every β > 0, $$ 0 \;\le\; \mathrm{softplus}_\beta(x) - \mathrm{ReLU}(x) \;\le\; \frac{\ln 2}{\beta}, $$ with the upper bound attained uniquely at x = 0. Consequently $$ \bigl\|\mathrm{softplus}_\beta - \mathrm{ReLU}\bigr\|_{L^\infty(\mathbb{R})} \;=\; \frac{\ln 2}{\beta}. $$

Two lines:

For x ≥ 0: $$ \mathrm{softplus}_\beta(x) - x \;=\; \frac{1}{\beta}\,\ln\!\bigl(1 + e^{-\beta x}\bigr), $$ which is positive, strictly decreasing in x, and equal to ln(2)/β only at x = 0. For x ≤ 0: $$ \mathrm{softplus}_\beta(x) - 0 \;=\; \frac{1}{\beta}\,\ln\!\bigl(1 + e^{\beta x}\bigr), $$ which is positive, strictly increasing in x, and again equal to ln(2)/β only at x = 0.

The maximum is hit exactly at the hinge, decays smoothly toward zero as |x| → ∞, and never exceeds ln(2)/β.
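Both cases are the same closed form: softplus_β(x) − ReLU(x) = (1/β) ln(1 + e^{−β|x|}) for every x. A minimal numerical spot-check of that identity (self-contained; the helper names here are mine, not from the Reproduce script below):

import math

def softplus(beta, x):
    return math.log1p(math.exp(beta * x)) / beta   # fine for the moderate beta*x used here

def relu(x):
    return max(x, 0.0)

beta = 8.0
for x in [-2.0, -0.5, 0.0, 0.3, 1.7]:
    gap = softplus(beta, x) - relu(x)
    closed_form = math.log1p(math.exp(-beta * abs(x))) / beta
    assert abs(gap - closed_form) < 1e-12           # the two expressions agree
    print(f"x={x:+.1f}  gap={gap:.6f}  (1/beta)*ln(1+e^(-beta*|x|))={closed_form:.6f}")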


What this looks like numerically

201 evenly spaced points on [-1, 1] (so x = 0 sits on the grid), measuring max(softplus_β − ReLU):

β      observed max error    ln(2)/β
1      0.6931                0.6931
2      0.3466                0.3466
4      0.1733                0.1733
8      0.0866                0.0866
16     0.0433                0.0433
32     0.0217                0.0217
64     0.0108                0.0108
128    0.0054                0.0054
256    0.0027                0.0027

Log-log fit across these points: slope -1.0000, exponentiated intercept e^a = 0.6931. That’s −1 and ln 2 to four decimal places. The theorem isn’t asymptotic; it’s sharp at every β.
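A sketch of that fit, assuming the model ln(err) = a + s·ln(β) and a plain least-squares regression; it recomputes the errors on the same 201-point grid rather than hard-coding the table:

import math

def softplus(beta, x):
    return math.log1p(math.exp(beta * x)) / beta

def relu(x):
    return max(x, 0.0)

betas = [1, 2, 4, 8, 16, 32, 64, 128, 256]
grid = [-1 + 2 * k / 200 for k in range(201)]
errs = [max(softplus(b, x) - relu(x) for x in grid) for b in betas]

# least-squares fit of ln(err) = a + s * ln(beta)
lx = [math.log(b) for b in betas]
ly = [math.log(e) for e in errs]
n = len(betas)
xbar, ybar = sum(lx) / n, sum(ly) / n
s = sum((u - xbar) * (v - ybar) for u, v in zip(lx, ly)) / sum((u - xbar) ** 2 for u in lx)
a = ybar - s * xbar
print(f"slope = {s:.4f}   e^a = {math.exp(a):.4f}")   # -1.0000 and 0.6931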


Why ln(2), specifically?

Because the kink of ReLU at x = 0 has value max(0, 0) = 0, whereas softplus at x = 0 evaluates to (1/β) ln(1 + e^0) = (1/β) ln(2). The constant is baked into the structure of how you round off a corner to a smooth curve. It doesn’t depend on your input distribution, on initialization, or on the architecture — you pay ln 2 for each unit of β-precision and no more.


ReLU isn’t in ELC. Softplus is.

The deeper structural point: ReLU is not a finite EML tree — it has a non-analytic kink at the origin, and finite EML trees are real-analytic on (0, ∞) with finitely many zeros on compacts. So ReLU ∉ ELC(ℝ).

Softplus, on the other hand, is in ELC: it’s literally $$ \ln\!\bigl(1 + e^{\beta x}\bigr) / \beta $$ — one exp, one log, arithmetic. Depth 2 over leaves {x, 1, β}. That means every softplus layer in a neural network is a depth-O(1) EML computation, and every softplus-driven activation stays inside ELC at every layer. Swap in ReLU and you leave ELC at the first non-differentiable point — arbitrarily well approximable, but never exactly representable.
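To make the depth-2 claim concrete, here is an illustrative expression-tree rendering of softplus over the leaves {x, 1, β}. The node class and helpers are hypothetical (not an ELC/EML library); they only track how deeply exp and log are nested over arithmetic:

import math
from dataclasses import dataclass
from typing import Callable

@dataclass
class Node:
    fn: Callable[[float], float]   # evaluate the subtree at input x
    depth: int                     # nesting depth of exp/log above the leaves

def leaf(f):       return Node(f, 0)
def exp_(n):       return Node(lambda x: math.exp(n.fn(x)), n.depth + 1)
def log_(n):       return Node(lambda x: math.log(n.fn(x)), n.depth + 1)
def add(a, b):     return Node(lambda x: a.fn(x) + b.fn(x), max(a.depth, b.depth))
def scale(c, n):   return Node(lambda x: c * n.fn(x), n.depth)

beta = 4.0
x_leaf, one = leaf(lambda x: x), leaf(lambda x: 1.0)

# softplus_beta(x) = (1/beta) * log(1 + exp(beta * x)): one exp inside one log
tree = scale(1.0 / beta, log_(add(one, exp_(scale(beta, x_leaf)))))

print(tree.depth)                           # 2
print(tree.fn(0.0), math.log(2) / beta)     # both equal ln(2)/beta at the hinge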

This is the Tier-0 boundary at the activation-function level. ReLU is in the uniform closure of ELC (for any precision ε you want, some β gets you there), but not in ELC itself. The approximation tax is ln(2)/β per unit of precision — no more, no less.


Reproduce

import math

def softplus(beta, x):
    z = beta * x
    if z > 500: return x           # avoid overflow
    if z < -500: return 0.0
    return math.log1p(math.exp(z)) / beta

def relu(x): return x if x > 0 else 0.0

for beta in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
    xs = [-1 + 2 * k / 200 for k in range(201)]   # 201 points on [-1, 1]; x = 0 is on the grid
    err = max(abs(softplus(beta, x) - relu(x)) for x in xs)
    print(f"beta={beta:4d}  max_err={err:.6f}  ln(2)/beta={math.log(2)/beta:.6f}")

Output matches the table above; because x = 0 is on the grid, the observed maximum equals ln(2)/β to double precision.


Consequence for practitioners

  1. If your network uses softplus with β ≈ 1 (the default in many frameworks), your activation is off from ReLU by up to 0.69 at the hinge. That’s often fine; sometimes it isn’t.
  2. To hit ReLU to within 0.001 you need β ≥ ln 2 / 0.001 ≈ 693. At β ≈ 693 the gradient (the sigmoid σ(βx)) switches from 0 to 1 over a window of width about 1/β around the origin, so training may behave essentially like ReLU anyway (a quick check is sketched below this list).
  3. softplus with β small gives you a genuinely smoother function with a quantified accuracy penalty. If that penalty is acceptable for your task, you keep differentiability (and ELC membership) for free.
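A back-of-envelope sketch of item 2 above (function and variable names are mine): pick a tolerance ε, take β = ln 2 / ε, and the hinge error lands exactly on ε.

import math

def softplus(beta, x):
    return math.log1p(math.exp(beta * x)) / beta

def beta_for_tolerance(eps):
    """Smallest beta whose worst-case gap to ReLU is at most eps."""
    return math.log(2) / eps

for eps in [0.1, 0.01, 0.001]:
    beta = beta_for_tolerance(eps)
    hinge_gap = softplus(beta, 0.0)     # ReLU(0) = 0, so this is the full gap
    print(f"eps={eps:<6}  beta needed={beta:8.1f}  gap at hinge={hinge_gap:.6f}")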

The ln(2) constant is cheap. It’s also exact. No tables, no ambiguity.


Cite: Monogate Research (2026). “The ReLU–Softplus Error is Exactly ln(2)/β.” monogate research blog. https://monogate.org/blog/relu-softplus-exact-error
