# Reliability

> How Recurr operates in production: service level commitments, incident response, error modes, on-call coverage. Pre-customer formal SLA finalises with customer-specific MSAs.

Reliability lives at three layers in the Recurr stack:

1. **Stripe Connect**: the payment rail itself, run by Stripe. Stripe's uptime track record applies directly to subscription billing.
2. **Recurr's webhook + control layer**: the events stream, retry logic, replay surface, and operator dashboard.
3. **The customer's downstream systems**: entitlement sync, BI ingest, CS tooling. Recurr's webhook delivery is the contract; downstream handling is the customer's.

This page covers Layer 2, the surface Recurr owns operationally.

## Service level commitments

Formal SLA terms finalise in customer-specific MSAs. The operational targets Recurr operates against:

| Surface                          | Target                                    |
| -------------------------------- | ----------------------------------------- |
| Webhook delivery (first attempt) | \< 5 seconds from billing-event trigger   |
| Webhook delivery (after retry)   | 99.9% success within 24 hours             |
| Subscription state read API      | \< 200ms p95                              |
| Replay API                       | \< 60 seconds for cohort up to 10K events |
| Operator dashboard               | 99.9% availability                        |

These targets sit alongside Stripe's own SLA: the underlying payment rail's reliability propagates upward.

## Incident response

When something breaks, Recurr's response runs the same pattern:

Monitoring fires on signature failures: webhook delivery failure rate above baseline, latency spikes, error-code anomalies, dependency degradation (Stripe, Resend, downstream destinations). Synthetic checks run continuously against the API surface.

On-call engineer paged within 5 minutes of detection.
Initial response within 15 minutes: acknowledgement, scope assessment, communication plan.

Status page updated within 30 minutes of detection. For customer-impacting incidents, designated technical + commercial contacts are notified directly. Updates follow at 30-minute intervals while the incident is open.

Incident closed when impact ends and root cause is understood.

Postmortem within 5 business days for any customer-impacting incident, shared with affected customers: what happened, why, what changes.

## Error modes + handling

The common failure surfaces and how Recurr handles them:

### Webhook delivery failure to customer endpoint

The customer's endpoint returns 5xx or times out. Recurr retries on exponential backoff (1m, 5m, 30m, 2h, 12h, 24h × N) for up to 7 days. After 7 days the event lands in a dead-letter queue for manual replay. The event's `event_id` stays stable across retries, so downstream dedup remains straightforward. See the [retry policy](/api-reference/overview#retry-policy) for full mechanics.

### Subscription state collision (Apple receipt vs. web sub)

If a subscriber holds both an active Apple receipt and an active web subscription (a mid-migration race condition), Recurr writes both states to the entitlement system and flags the conflict. The customer's entitlement logic decides resolution; Recurr surfaces the conflict for manual review.

### Failed payment on web sub

Stripe's smart retries handle \~30% of involuntary churn on first failure. Recurr's dunning layer then runs the configured recovery cadence (email + in-portal) before the subscription cancels. The full recovery sequence is captured as events your CS team can react to in real time.

### Identity bridge failure (migration arrival)

If a migrated subscriber arrives at branded checkout and the identity bridge can't match them to their existing app account, the flow routes to a manual-resolution surface.
Recurr's CX support layer holds the subscriber's state; the customer's CX team gets a ticket with the relevant subscription context.

### Stripe Connect dependency

Stripe is Recurr's payment rail; Stripe outages affect subscription billing. Recurr can't insulate customers from Stripe-side incidents but communicates them clearly when they happen: Stripe status flows into Recurr's status page.

## On-call coverage

While pre-customer, on-call is founder-led (Matt) with 24/7 paging. Response targets:

* Critical incident (customer impact, data integrity): \< 15 minutes
* High severity (operational degradation): \< 1 hour
* Standard (non-blocking issue): next business day

Coverage shifts to a rotation as the engineering team grows. Customers see no change in response targets through the transition.

## Status page

Recurr's status page is at [recurr.instatus.com](https://recurr.instatus.com). All incidents, scheduled maintenance, and dependency degradations land there with timestamps and impact scope. Subscribe via email or RSS for incident notifications.

## What's not yet formalised

Honest pre-customer framing:

* **SOC 2 Type II**: in progress; Type I target alongside customer-#1 deployment, Type II 12 months after. See [compliance posture](/trust/compliance-posture).
* **Multi-region failover**: currently single-region (US-East). Multi-region with active failover scoped for post-customer-5.
* **Formal SLA percentages with credits**: finalise in customer-specific MSAs. The operational targets above are what Recurr runs against; the contractual percentages reflect the same targets with explicit credit mechanics.

## Cross-references

* [Risk register →](/trust/risk-register): what could go wrong, segmented by exposure
* [Compliance posture →](/trust/compliance-posture): SOC 2, GDPR, audit posture
* [Infrastructure →](/trust/infrastructure): where data lives, encryption, access controls
* [API reference overview →](/api-reference/overview): webhook delivery + retry mechanics
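
## Example: deduplicating retried deliveries (sketch)

The webhook delivery failure section notes that `event_id` stays stable across retries, which is what makes downstream dedup possible. The following is a minimal sketch of what that could look like on the customer side, not Recurr's own implementation. The class name `IdempotentWebhookHandler`, the dict-shaped payload, the in-memory store, and the choice of a 400 response for malformed payloads are all assumptions made for illustration; only the `event_id` field and the retry-on-5xx behaviour come from this page.

```python
from typing import Callable, Dict, Set


class IdempotentWebhookHandler:
    """Hypothetical customer-side handler: processes each event at most once."""

    def __init__(self, process: Callable[[Dict], None]) -> None:
        self._process = process
        # Assumption: an in-memory set keeps the sketch short. A real
        # deployment would need a durable store shared across workers.
        self._seen: Set[str] = set()

    def handle(self, event: Dict) -> int:
        """Return the HTTP status code to send back to the webhook sender."""
        event_id = event.get("event_id")
        if not event_id:
            # No stable ID to dedup on; treated here as a malformed payload.
            return 400
        if event_id in self._seen:
            # A retry of an event already processed: acknowledge, do nothing.
            return 200
        try:
            self._process(event)
        except Exception:
            # A 5xx tells the sender the delivery failed, so it is retried.
            return 500
        # Record the ID only after processing succeeded, so a failed
        # attempt can still be completed by a later retry.
        self._seen.add(event_id)
        return 200
```

Recording the ID only after successful processing is a deliberate choice in this sketch: a handler that fails midway returns 500, and the retried delivery gets a fresh attempt instead of being dropped as a duplicate.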