Reliability - Recurr

Reliability lives at three layers in the Recurr stack:

1. Stripe Connect: the payment rail itself, run by Stripe. Stripe’s uptime track record applies directly to subscription billing.
2. Recurr’s webhook + control layer: the events stream, retry logic, replay surface, and operator dashboard.
3. The customer’s downstream systems: entitlement sync, BI ingest, CS tooling. Recurr’s webhook delivery is the contract; downstream handling is the customer’s responsibility.

This page covers Layer 2 — the surface Recurr owns operationally.

Service level commitments

Formal SLA terms are finalised in customer-specific MSAs. Recurr runs against the following operational targets:

Surface	Target
Webhook delivery (first attempt)	< 5 seconds from billing-event trigger
Webhook delivery (after retry)	99.9% success within 24 hours
Subscription state read API	< 200ms p95
Replay API	< 60 seconds for cohort up to 10K events
Operator dashboard	99.9% availability

These targets sit alongside Stripe’s own SLA — the underlying payment rail’s reliability propagates upward.
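To make the 99.9% figures concrete, the arithmetic below converts a percentage target into an allowance. It is a minimal sketch: the 30-day measurement window and the one-million-event volume are assumptions for illustration, since this page does not state a measurement window or a traffic volume, and the function names are hypothetical.

```python
def downtime_budget_minutes(availability_pct: float, window_days: int = 30) -> float:
    """Minutes of unavailability permitted by an availability target.

    window_days is an assumed measurement window, not a Recurr commitment.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)


def undelivered_allowance(success_pct: float, event_count: int) -> int:
    """Events that may miss the delivery target at a given success rate."""
    return round(event_count * (1 - success_pct / 100))


# Dashboard availability target over an assumed 30-day window.
dashboard_budget = round(downtime_budget_minutes(99.9), 1)

# Webhook success-after-retry target over an assumed 1,000,000 events.
webhook_allowance = undelivered_allowance(99.9, 1_000_000)
```

Under these assumptions, 99.9% availability leaves roughly 43.2 minutes of dashboard downtime per 30 days, and 99.9% delivery success leaves about 1,000 events per million outside the 24-hour target.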

Incident response

When something breaks, Recurr’s response follows the same pattern every time:

Detection

Monitoring fires on characteristic failure signals: webhook delivery failure rate above baseline, latency spikes, error-code anomalies, and dependency degradation (Stripe, Resend, downstream destinations). Synthetic checks run continuously against the API surface.
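A check of the "failure rate above baseline" kind can be sketched as follows. This is illustrative only: how Recurr computes its baseline, the window it samples, and the tolerance multiplier are not described on this page, so `baseline_rate`, `tolerance`, and both function names are assumptions.

```python
def failure_rate(outcomes: list[bool]) -> float:
    """Share of failed deliveries in a window; True means delivered."""
    if not outcomes:
        return 0.0
    return outcomes.count(False) / len(outcomes)


def should_alert(outcomes: list[bool], baseline_rate: float, tolerance: float = 2.0) -> bool:
    """Alert when the observed failure rate exceeds tolerance x baseline.

    tolerance is an assumed multiplier, chosen here only for illustration.
    """
    return failure_rate(outcomes) > baseline_rate * tolerance


# 10 failures in 100 deliveries against an assumed 1% baseline.
degraded_window = [True] * 90 + [False] * 10
alerting = should_alert(degraded_window, baseline_rate=0.01)
```

An empty window is treated as healthy here; a real monitor would more likely treat missing traffic as its own signal.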

Triage + response

On-call engineer paged within 5 minutes of detection. Initial response within 15 minutes — acknowledgement, scope assessment, communication plan.

Communication

Status page updated within 30 minutes of detection. For customer-impacting incidents, designated technical + commercial contacts notified directly. Updates at 30-minute intervals while the incident is open.
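The timing targets above can be expressed as deadlines relative to the moment of detection. The sketch below assumes every target is measured from detection (the page states this for paging and the status page; for the initial response it is an assumption), and the step names are illustrative rather than Recurr identifiers.

```python
from datetime import datetime, timedelta, timezone

# Targets quoted on this page; the key names are illustrative only.
RESPONSE_TARGETS = {
    "on_call_paged": timedelta(minutes=5),
    "initial_response": timedelta(minutes=15),
    "status_page_updated": timedelta(minutes=30),
}
UPDATE_INTERVAL = timedelta(minutes=30)


def incident_deadlines(detected_at: datetime) -> dict[str, datetime]:
    """Latest acceptable time for each step, measured from detection."""
    return {step: detected_at + target for step, target in RESPONSE_TARGETS.items()}


def next_update_due(last_update_at: datetime) -> datetime:
    """When the next update falls due while the incident stays open."""
    return last_update_at + UPDATE_INTERVAL


detected = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
deadlines = incident_deadlines(detected)
following_update = next_update_due(deadlines["status_page_updated"])
```

For an incident detected at 12:00 UTC, this gives paging by 12:05, an initial response by 12:15, a status page update by 12:30, and the next update due by 13:00.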

Resolution + postmortem

Incident closed when impact ends and root cause is understood. Postmortem within 5 business days for any customer-impacting incident, shared with affected customers — what happened, why, what changes.

Error modes + handling

The common failure surfaces and how Recurr handles them:

Webhook delivery failure to customer endpoint

The customer’s endpoint returns 5xx or times out. Recurr retries on an escalating backoff schedule (1m, 5m, 30m, 2h, 12h, then every 24h) for up to 7 days. After 7 days the event lands in a dead-letter queue for manual replay. Each event’s event_id stays stable across retries, so downstream dedup remains straightforward. See the retry policy for full mechanics.
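The schedule and the dedup guarantee can be sketched together. This is not Recurr’s implementation: it reads the schedule as five fixed delays followed by 24-hour repeats, assumes each delay is measured from the previous attempt and that the 7-day window starts at the first attempt, and assumes the webhook payload is a dictionary carrying an `event_id` key. The class and function names are hypothetical, and the in-memory set stands in for whatever durable store a real consumer would use.

```python
from datetime import timedelta

FIXED_DELAYS = [
    timedelta(minutes=1),
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=12),
]
DAILY_DELAY = timedelta(hours=24)
RETRY_WINDOW = timedelta(days=7)


def retry_offsets() -> list[timedelta]:
    """Elapsed time since the first attempt at which each retry fires."""
    offsets = []
    elapsed = timedelta(0)
    for delay in FIXED_DELAYS:
        elapsed += delay
        offsets.append(elapsed)
    # Repeat the 24-hour delay until the next retry would pass the window.
    while elapsed + DAILY_DELAY <= RETRY_WINDOW:
        elapsed += DAILY_DELAY
        offsets.append(elapsed)
    return offsets


class DedupingHandler:
    """Customer-side consumer that ignores redelivered events."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def handle(self, event: dict) -> bool:
        """Return True if the event was processed, False if a duplicate."""
        event_id = event["event_id"]  # assumed payload field name
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        return True
```

Under these assumptions the schedule yields 11 retries, the last firing 6 days, 14 hours, and 36 minutes after the first attempt. Because the identifier does not change between attempts, a handler like the one above processes a redelivered event once and safely acknowledges the repeats.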

Subscription state collision (Apple receipt vs. web sub)

If a subscriber holds both an active Apple receipt and an active web subscription (mid-migration race condition), Recurr writes both states to the entitlement system and flags the conflict. The customer’s entitlement logic decides resolution; Recurr surfaces the conflict for manual review.
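The write-both-and-flag behaviour can be pictured with a small sketch. The record shape and field names here are hypothetical, since this page does not define the entitlement payload; the point is only that neither state is dropped and the conflict flag is derived from both being active.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitlementWrite:
    """Illustrative record; field names are assumptions, not Recurr's schema."""

    subscriber_id: str
    apple_active: bool
    web_active: bool
    conflict: bool


def build_entitlement_write(subscriber_id: str, apple_active: bool, web_active: bool) -> EntitlementWrite:
    """Keep both states and flag a conflict when both are active."""
    return EntitlementWrite(
        subscriber_id=subscriber_id,
        apple_active=apple_active,
        web_active=web_active,
        conflict=apple_active and web_active,
    )


# Mid-migration race: both the Apple receipt and the web subscription are active.
racing = build_entitlement_write("sub_123", apple_active=True, web_active=True)
```

Resolution stays with the customer’s entitlement logic, which can branch on the flag and route the subscriber to manual review.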

Failed payment on web sub

Stripe’s smart retries recover ~30% of involuntary churn after the first failure. Recurr’s dunning layer then runs the configured recovery cadence (email + in-portal) before the subscription cancels. The full recovery sequence is captured as events your CS team can react to in real time.

Identity bridge failure (migration arrival)

If a migrated subscriber arrives at branded checkout and the identity bridge can’t match them to their existing app account, the flow routes to a manual-resolution surface. Recurr’s CX support layer holds the subscriber’s state; the customer’s CX team gets a ticket with the relevant subscription context.

Stripe Connect dependency

Stripe is Recurr’s payment rail, so Stripe outages affect subscription billing. Recurr can’t insulate customers from Stripe-side incidents, but it communicates them clearly when they happen: Stripe status flows into Recurr’s status page.

On-call coverage

While pre-customer, on-call is founder-led (Matt) with 24/7 paging. Response targets:

Critical incident (customer impact, data integrity): < 15 minutes
High severity (operational degradation): < 1 hour
Standard (non-blocking issue): next business day

Coverage shifts to a rotation as the engineering team grows. Customers see no change in response targets through the transition.

Status page

Recurr’s status page is at recurr.instatus.com. All incidents, scheduled maintenance, and dependency degradations land there with timestamps and impact scope. Subscribe via email or RSS for incident notifications.

What’s not yet formalised

Honest pre-customer framing:

SOC 2 Type II: in progress. Type I is targeted alongside the customer-#1 deployment, with Type II 12 months after. See compliance posture.
Multi-region failover: currently single-region (US-East). Multi-region with active failover is scoped for post-customer-5.
Formal SLA percentages with credits: finalised in customer-specific MSAs. The operational targets above are what Recurr runs against; the contractual percentages reflect the same targets with explicit credit mechanics.

Cross-references

Risk register: what could go wrong, segmented by exposure
Compliance posture: SOC 2, GDPR, audit posture
Infrastructure: where data lives, encryption, access controls
API reference overview: webhook delivery + retry mechanics
