Equivalence Checking

Automatically proving two functions equivalent, finding counterexamples when they differ, and checking that implementations adhere to specifications.

One of Deppy's most practical features is automatic equivalence checking: given two Python functions with the same signature, Deppy can determine whether they always produce the same output — and if not, give you a concrete counterexample. No annotations, no proof terms, no decorators. Just two functions.

from deppy import check_equiv

def f(x: int) -> int:
    return x * 2

def g(x: int) -> int:
    return x + x

result = check_equiv(f, g)
print(result)
# EquivResult(EQUIVALENT, method='z3')

Deppy proved these functions identical for all integers — not by testing a handful of inputs, but by compiling both bodies to Z3 symbolic expressions and asking the SMT solver to find any input where they differ. Z3 reports UNSAT: no such input exists.


How It Works

Deppy's equivalence engine uses three strategies, tried in order:

  1. Z3 symbolic — Compile both function bodies to Z3 expressions. Check ∃x. f(x) ≠ g(x). If UNSAT → equivalent (proven). If SAT → counterexample (proven). If UNKNOWN → fall through.
  2. Property-based testing — Generate 1,000 random inputs (seeded for reproducibility), run both functions, compare outputs. High confidence but not a proof.
  3. Specification comparison — If both functions have @guarantee specs, compare the specs structurally.

The result always tells you which method was used and at what confidence level:

result = check_equiv(f, g)
result.equivalent   # True / False / None
                    #   True  — *proven* by Z3 or reflexivity
                    #   False — concrete counterexample found
                    #   None  — no proof AND no counterexample;
                    #           testing-based checks always end here
                    #           (see note below)
result.method       # 'z3', 'testing', 'spec', 'spec+z3', 'inconclusive'
result.confidence   # 1.0 for z3/refl; <1 for testing-based evidence
result.counterexample  # dict of {param: value} when inequivalent
result.likely_equivalent  # True iff proven OR testing confidence >= 0.9
                          #   — use this for candidate ranking,
                          #   never for soundness gates
Why testing returns None, not True. Testing samples inputs — 1000, 10000, however many — and checks that the two functions agree on each. Even agreement on every trial does not prove the functions are equal; the (N+1)-th input might disagree. So _testing_check_equiv returns equivalent=None with a confidence score computed from the number of trials. equivalent is True is reserved for real proofs (Z3 unsat, or reflexivity). Use likely_equivalent when you want "this is probably equivalent" for ranking; use equivalent is True when you need soundness.
verify_composition follows the same rule. The testing-only path returns verified=False now (the old behaviour of verified = (checked > 0) was wrong for the same reason above). To obtain verified=True the caller must supply a spec the Z3 path can discharge.
Spec equivalence ≠ code equivalence. When two functions have the same @guarantee string — or two guarantees that Z3 proves logically equivalent — check_equiv returns equivalent=None with a spec_info payload. Two functions can satisfy "result >= 0" while computing entirely different non-negative values; a matching spec is necessary but not sufficient for code equivalence.

Tier 1: Fully Automatic, Annotation-Free

🟢 Zero annotations needed — just pass two functions

For functions over int, float, and bool that use arithmetic, comparisons, branching, abs/min/max, and simple assignments, Deppy handles everything automatically via Z3. You write no decorators, no specs, no proof terms.

Example 1: Algebraic identity

def square_diff(a: int, b: int) -> int:
    return (a + b) * (a - b)

def explicit_diff(a: int, b: int) -> int:
    return a * a - b * b

result = check_equiv(square_diff, explicit_diff)
# EquivResult(EQUIVALENT, method='z3') ✓

Z3 proves (a+b)·(a−b) = a²−b² for all integers. No annotations needed.

Example 2: Branching refactors

def clamp_v1(x: int) -> int:
    if x < 0:
        return 0
    if x > 100:
        return 100
    return x

def clamp_v2(x: int) -> int:
    return max(0, min(x, 100))

result = check_equiv(clamp_v1, clamp_v2)
# EquivResult(EQUIVALENT, method='z3') ✓

The if-chain and the max/min expression are compiled to nested z3.If trees. Z3 proves they produce the same value for every integer x.

Example 3: Detecting a bug — counterexample

def safe_div(a: int, b: int) -> int:
    if b == 0:
        return 0
    return a // b

def unsafe_div(a: int, b: int) -> int:
    return a // b if b != 0 else a  # bug: returns a, not 0

result = check_equiv(safe_div, unsafe_div)
print(result)
# EquivResult(INEQUIVALENT, method='z3',
#   counterexample={'a': 1, 'b': 0})

Z3 found that when b=0, safe_div returns 0 but unsafe_div returns a. The counterexample is concrete — you can replay it yourself.

Example 4: Conditional assignment patterns

def relu_v1(x: int) -> int:
    if x < 0:
        x = 0
    return x

def relu_v2(x: int) -> int:
    return x if x >= 0 else 0

result = check_equiv(relu_v1, relu_v2)
# EquivResult(EQUIVALENT, method='z3') ✓

The "mutating if" pattern (if x < 0: x = 0 with no return and no else) is compiled to x = z3.If(x < 0, 0, x) — a conditional assignment. Deppy recognizes this Pythonic pattern automatically.

Example 5: Multi-parameter polynomial

def poly_factored(x: int, y: int) -> int:
    return (x + y) ** 2

def poly_expanded(x: int, y: int) -> int:
    return x * x + 2 * x * y + y * y

result = check_equiv(poly_factored, poly_expanded)
# EquivResult(EQUIVALENT, method='z3') ✓

Small-exponent ** (up to 6) is expanded to repeated multiplication, so (x+y)**2 becomes (x+y)*(x+y) in Z3.

Example 6: Augmented assignment

def accumulate(x: int, y: int) -> int:
    result = x
    result += y
    result += y
    return result

def direct(x: int, y: int) -> int:
    return x + 2 * y

result = check_equiv(accumulate, direct)
# EquivResult(EQUIVALENT, method='z3') ✓

What's covered automatically

Construct Handled by Z3? Example
Arithmetic (+, -, *, //, %)✅ Yesx * 2 vs x + x
Comparisons (<, <=, ==, !=, etc.)✅ Yesx >= 0
Boolean ops (and, or, not)✅ Yesx > 0 and x < 100
if/elif/else✅ Yesbranching → z3.If
Ternary (x if c else y)✅ Yesx if x > 0 else -x
Conditional assignment (if c: x = v)✅ Yesmutating-if pattern
abs, min, max✅ Yesabs(x)If(x≥0, x, -x)
Small powers (x**2 through x**6)✅ Yesexpanded to multiplication
+=, -=, *=✅ Yesaugmented assignment
Local variables (r = x + 1)✅ Yestracked in symbolic env
List/string operations❌ → testingxs.append(x)
Loops (for, while)❌ → testingfor x in xs:
Function calls (non-builtin)❌ → testinghelper(x)
Classes / method calls❌ → testingobj.method()
When Z3 can't handle a construct, Deppy falls back to property-based testing with 1,000 random samples. The result has method='testing' and confidence=0.95. It's a strong signal, but not a proof.

Specification Adherence (Also Automatic)

Beyond comparing two functions, Deppy can check whether a function adheres to its specification — fully automatically for arithmetic specs.

from deppy import check_adherence, guarantee

@guarantee("result >= 0")
def abs_val(x: int) -> int:
    if x >= 0:
        return x
    return -x

results = check_adherence(abs_val)
print(results[0])
# AdherenceResult(adheres=True, spec='result >= 0', method='z3')

Deppy symbolically executes the body, extracts the result expression per path, and checks that the specification holds on every path. If it finds a violation:

@guarantee("result > 0")   # wrong! fails at x=0
def abs_val_buggy(x: int) -> int:
    if x >= 0:
        return x
    return -x

results = check_adherence(abs_val_buggy)
print(results[0])
# AdherenceResult(adheres=False, spec='result > 0', method='z3',
#   counterexample={'x': 0})

The counterexample tells you exactly which input violates the spec: x=0 gives result=0, which does not satisfy result > 0.

Preconditions narrow the domain

from deppy import guarantee, requires

@requires("n > 0")
@guarantee("result > 0")
def double_pos(n: int) -> int:
    return n * 2

results = check_adherence(double_pos)
print(results[0])
# AdherenceResult(adheres=True, spec='result > 0', method='z3')
# Z3 checks: ∀n. (n > 0) → (n * 2 > 0)  ✓

The precondition n > 0 is added as a Z3 hypothesis, so Deppy only checks the guarantee for valid inputs.


Tier 2: Annotation-Guided Equivalence

🟡 Some annotations needed — specs help Deppy decide

When the function body is too complex for Z3 (lists, loops, higher-order functions), but you have specifications, Deppy can compare the specs to determine equivalence.

Example: Sorting implementations

from deppy import guarantee, check_equiv

@guarantee("result is sorted and is a permutation of input")
def my_sort_a(xs: list[int]) -> list[int]:
    # bubble sort
    xs = list(xs)
    for i in range(len(xs)):
        for j in range(len(xs) - 1):
            if xs[j] > xs[j+1]:
                xs[j], xs[j+1] = xs[j+1], xs[j]
    return xs

@guarantee("result is sorted and is a permutation of input")
def my_sort_b(xs: list[int]) -> list[int]:
    return sorted(xs)

result = check_equiv(my_sort_a, my_sort_b)
print(result)
# EquivResult(EQUIVALENT, method='testing', confidence=1.00)

Z3 can't handle the nested loops, so Deppy falls back to testing. Both functions produce the same output on all 1,000 random test inputs. The result has confidence≈1.0 — a strong signal, but not a formal proof.

However, if you want a proven result, you can tell Deppy that the functions satisfy the same spec, and use a proof term:

from deppy import Proof, by_ext

# Prove equivalence via shared spec
proof = (Proof("my_sort_a(xs) == my_sort_b(xs)")
    .by_ext("xs",
        lambda xs: by_axiom("sorted_unique",
            "a sorted permutation of a list is unique"))
    .qed())

Axiom Conflict Detection

When you build an AxiomRegistry from many sources (multiple sidecar files, several modules' axioms, a mix of user-supplied and auto-derived), the registry can flag pairs of axioms that are *syntactically* in conflict — same LHS shape, different RHS shape:

from deppy.proofs.pipeline import AxiomRegistry

reg = AxiomRegistry()
reg.add_all(axiomize(my_module, module="demo"), source="demo")
# ... add more axioms ...

for a, b, head in reg.conflicts():
    print(f"⚠ axioms {a} and {b} both define {head} differently")
conflicts() is purely syntactic. It compares axiom endpoints by their string representations (repr). This means:
  • False positives — two RHSes that are provably equal but written differently (1 + x vs x + 1) will be flagged as conflicting.
  • False negatives — two LHSes that are alpha- equivalent but written with different variable names (f(a) vs f(b)) will not match.

For semantic conflict detection, escalate the candidate pair to check_equiv (which routes through Z3) or check_spec_equiv (which compares spec strings). The registry's job is just to surface a candidate set fast enough for interactive use.


Tier 3: Manual Proof Terms

🔴 Full proof terms — for properties beyond automation

Some equivalences require genuine mathematical reasoning: induction over data structures, appeal to domain-specific lemmas, or transport across type equivalences.

Example: Proving a recursive identity by induction

from deppy import guarantee, lemma, induction, Proof

@guarantee("result == n * (n + 1) // 2")
def sum_to_loop(n: int) -> int:
    total = 0
    for i in range(n + 1):
        total += i
    return total

@guarantee("result == n * (n + 1) // 2")
def sum_to_formula(n: int) -> int:
    return n * (n + 1) // 2

# Z3 can prove the formula satisfies the spec:
from deppy import check_adherence
check_adherence(sum_to_formula)
# AdherenceResult(adheres=True, method='z3')

# But the loop version needs induction:
@lemma
@induction("n")
def sum_loop_correct(n: int):
    """sum_to_loop(n) == n * (n + 1) // 2"""
    # Base case: sum_to_loop(0) = 0 = 0 * 1 // 2  ✓
    # Inductive step: assume sum_to_loop(k) = k*(k+1)//2
    #   then sum_to_loop(k+1) = sum_to_loop(k) + (k+1)
    #                         = k*(k+1)//2 + (k+1)
    #                         = (k+1)*(k+2)//2  ✓
    pass

Example: Transport across duck-type equivalence

from deppy import path, by_duck_type, transport_proof, guarantee

class ArrayStack:
    @guarantee("result >= 0")
    def size(self) -> int: ...
    def push(self, item): ...
    def pop(self): ...

class LinkedStack:
    @guarantee("result >= 0")
    def size(self) -> int: ...
    def push(self, item): ...
    def pop(self): ...

# Build a path: ArrayStack ≃ LinkedStack (same Stack protocol)
@path(ArrayStack, LinkedStack)
def stack_path():
    return by_duck_type("Stack", evidence={
        "size": "same signature and guarantee",
        "push": "same signature",
        "pop": "same signature",
    })

# Transport: any proof about ArrayStack.size
# transfers to LinkedStack.size
@transport_proof(stack_path, "size returns non-negative")
def linked_size_nonneg():
    pass  # proof transferred automatically

This is not just testing — it's a genuine proof that any property proved about one implementation applies to the other, by virtue of their shared protocol structure.


Proving Inequivalence with Counterexamples

When Z3 finds that two functions are not equivalent, it provides a concrete counterexample — specific input values where the functions diverge. Deppy replays the counterexample to confirm:

def triple(x: int) -> int:
    return x * 3

def triple_approx(x: int) -> int:
    return x * 3 + 1

result = check_equiv(triple, triple_approx)
print(result.equivalent)       # False
print(result.method)           # 'z3'
print(result.counterexample)   # {'x': 0}
# triple(0) = 0, triple_approx(0) = 1 — different!

The counterexample is always replayed — Deppy executes both functions with the Z3-provided inputs and verifies the outputs really differ. This guards against Z3 model artifacts.

More counterexample examples

# Off-by-one
def f(x: int) -> int:
    return x + 1

def g(x: int) -> int:
    return x + 2

check_equiv(f, g)
# INEQUIVALENT, counterexample={'x': 0}
# f(0)=1, g(0)=2
# Branch difference
def sign_a(x: int) -> int:
    if x > 0: return 1
    if x < 0: return -1
    return 0

def sign_b(x: int) -> int:
    if x >= 0: return 1   # bug: 0 → 1 instead of 0
    return -1

check_equiv(sign_a, sign_b)
# INEQUIVALENT, counterexample={'x': 0}
# sign_a(0)=0, sign_b(0)=1

Ad-Hoc Spec Checking

You don't even need @guarantee — pass a spec string directly:

from deppy import check_adherence

def clamp(x: int) -> int:
    if x < 0: return 0
    if x > 100: return 100
    return x

# Check any property you like, without decorating:
check_adherence(clamp, spec="result >= 0")[0]
# AdherenceResult(adheres=True, spec='result >= 0', method='z3')

check_adherence(clamp, spec="result <= 100")[0]
# AdherenceResult(adheres=True, spec='result <= 100', method='z3')

check_adherence(clamp, spec="result >= 0 and result <= 100")[0]
# AdherenceResult(adheres=True, spec='result >= 0 and result <= 100', method='z3')

# Try a false property:
check_adherence(clamp, spec="result == x")[0]
# AdherenceResult(adheres=False, spec='result == x', method='z3',
#   counterexample={'x': -1})  — clamp(-1)=0 ≠ -1

Summary: The Three Tiers

Tier What you write What Deppy does Guarantee
Tier 1 Two functions — nothing else Z3 symbolic execution Mathematical proof or counterexample. confidence=1.0
Tier 2 Functions + @guarantee specs Spec comparison + testing High-confidence testing. confidence≈1.0
Tier 3 Functions + specs + proof terms Kernel-checked proof Machine-checked, exportable to Lean. KERNEL trust

The key insight: Tier 1 requires zero annotations. For the decidable fragment (integer/float arithmetic with branching), Deppy is a fully automatic equivalence prover — you just hand it two functions and get a proven verdict.

Try it now:
pip install git+https://github.com/thehalleyyoung/deppy.git
pip install z3-solver
python3 -c "
from deppy import check_equiv

def f(x: int) -> int: return x * x - 1
def g(x: int) -> int: return (x - 1) * (x + 1)

print(check_equiv(f, g))
"