# Reliability

> How Recurr operates in production: service level commitments, incident response, error modes, and on-call coverage. Recurr is pre-customer, so formal SLA terms are finalised in customer-specific MSAs.

Reliability lives at three layers in the Recurr stack:

1. **Stripe Connect** — the payment rail itself, run by Stripe. Stripe's uptime track record applies directly to subscription billing.
2. **Recurr's webhook + control layer** — the events stream, retry logic, replay surface, and operator dashboard.
3. **The customer's downstream systems** — entitlement sync, BI ingest, CS tooling. Recurr's webhook delivery is the contract; downstream handling is the customer's.

This page covers Layer 2 — the surface Recurr owns operationally.

## Service level commitments

Formal SLA terms are finalised in customer-specific MSAs. Until then, Recurr runs against these operational targets:

| Surface                          | Target                                    |
| -------------------------------- | ----------------------------------------- |
| Webhook delivery (first attempt) | \< 5 seconds from billing-event trigger   |
| Webhook delivery (after retry)   | 99.9% success within 24 hours             |
| Subscription state read API      | \< 200ms p95                              |
| Replay API                       | \< 60 seconds for cohort up to 10K events |
| Operator dashboard               | 99.9% availability                        |

These targets sit alongside Stripe's own SLA: the reliability of the underlying payment rail bounds the reliability of everything built on top of it.
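
As a rough illustration of what a percentage target permits, the sketch below converts an availability target into a downtime budget. The 30-day measurement window and the `downtime_budget` helper are assumptions for illustration; the actual window would be set in the MSA.

```python
from datetime import timedelta


def downtime_budget(target: float, window: timedelta) -> timedelta:
    """Maximum downtime an availability target allows over a window."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be a fraction between 0 and 1")
    return window * (1.0 - target)


# Assumed 30-day window; the targets table does not state one.
budget = downtime_budget(0.999, timedelta(days=30))
print(f"99.9% over 30 days allows about {budget.total_seconds() / 60:.1f} minutes of downtime")
```

Under that assumption, the 99.9% dashboard target works out to roughly 43 minutes of downtime per 30 days.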

## Incident response

When something breaks, Recurr's response follows the same four steps every time:

<Steps>
  <Step title="Detection">
    Monitoring alerts on characteristic failure signals: webhook delivery failure rate above baseline, latency spikes, error-code anomalies, and dependency degradation (Stripe, Resend, downstream destinations). Synthetic checks run continuously against the API surface.
  </Step>

  <Step title="Triage + response">
    On-call engineer paged within 5 minutes of detection. Initial response within 15 minutes — acknowledgement, scope assessment, communication plan.
  </Step>

  <Step title="Communication">
    Status page updated within 30 minutes of detection. For customer-impacting incidents, designated technical + commercial contacts notified directly. Updates at 30-minute intervals while the incident is open.
  </Step>

  <Step title="Resolution + postmortem">
    Incident closed when impact ends and root cause is understood. Postmortem within 5 business days for any customer-impacting incident, shared with affected customers — what happened, why, what changes.
  </Step>
</Steps>
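
The timing targets in the steps above can be sketched as a set of deadlines derived from the detection timestamp. This is an illustrative model, not Recurr's tooling. It assumes every clock starts at detection, that the postmortem window is counted from incident closure, and that business days are Monday to Friday with no holiday calendar. All names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Targets taken from the incident response steps.
PAGE_WITHIN = timedelta(minutes=5)
INITIAL_RESPONSE_WITHIN = timedelta(minutes=15)
STATUS_PAGE_WITHIN = timedelta(minutes=30)
UPDATE_INTERVAL = timedelta(minutes=30)
POSTMORTEM_BUSINESS_DAYS = 5


@dataclass(frozen=True)
class IncidentDeadlines:
    page_by: datetime
    initial_response_by: datetime
    status_page_by: datetime


def deadlines_for(detected_at: datetime) -> IncidentDeadlines:
    """Assumption: all three clocks start at detection time."""
    return IncidentDeadlines(
        page_by=detected_at + PAGE_WITHIN,
        initial_response_by=detected_at + INITIAL_RESPONSE_WITHIN,
        status_page_by=detected_at + STATUS_PAGE_WITHIN,
    )


def next_update_due(last_update_at: datetime) -> datetime:
    """Updates go out at 30-minute intervals while the incident is open."""
    return last_update_at + UPDATE_INTERVAL


def postmortem_due(closed_at: datetime) -> datetime:
    """Assumption: Monday-Friday business days, counted from closure."""
    due = closed_at
    remaining = POSTMORTEM_BUSINESS_DAYS
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # Monday is 0, Friday is 4
            remaining -= 1
    return due


detected = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)  # a Friday
print(deadlines_for(detected))
print(postmortem_due(detected))
```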

## Error modes + handling

The common failure surfaces and how Recurr handles them:

### Webhook delivery failure to customer endpoint

If the customer's endpoint returns a 5xx or times out, Recurr retries with increasing backoff (1m, 5m, 30m, 2h, 12h, then every 24h) for up to 7 days. After 7 days the event lands in a dead-letter queue for manual replay.
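
To make the schedule concrete, the sketch below expands it into retry offsets measured from the first failed attempt. It assumes each listed delay is the wait between consecutive attempts and that the 7-day window starts at the first failure; the linked retry policy is the authority on both points. The function and constant names are hypothetical.

```python
from datetime import timedelta
from itertools import chain, repeat

# Delays between consecutive attempts, as listed in the schedule.
INITIAL_DELAYS = [
    timedelta(minutes=1),
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=12),
]
STEADY_DELAY = timedelta(hours=24)  # repeated until the window closes
RETRY_WINDOW = timedelta(days=7)


def retry_offsets() -> list[timedelta]:
    """Offsets from the first failure at which each retry would fire."""
    offsets = []
    elapsed = timedelta(0)
    for delay in chain(INITIAL_DELAYS, repeat(STEADY_DELAY)):
        elapsed += delay
        if elapsed > RETRY_WINDOW:
            break  # the event would move to the dead-letter queue
        offsets.append(elapsed)
    return offsets


for attempt, offset in enumerate(retry_offsets(), start=1):
    print(f"retry {attempt}: +{offset}")
```

Under these assumptions the schedule yields 11 retries, the last one 6 days, 14 hours and 36 minutes after the first failure.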

Each event's `event_id` stays stable across retries, so downstream deduplication remains straightforward. See the [retry policy](/api-reference/overview#retry-policy) for full mechanics.
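
On the receiving side, a stable `event_id` lets a handler drop redelivered events. The following is a minimal sketch of that idea, not a Recurr SDK: the `EventDeduplicator` class and the `type` field in the sample payload are hypothetical, and the in-memory set stands in for whatever durable store (for example, a database table with a unique constraint) a production consumer would use.

```python
from typing import Any, Callable


class EventDeduplicator:
    """Runs a handler at most once per event_id."""

    def __init__(self, handler: Callable[[dict[str, Any]], None]) -> None:
        self._handler = handler
        self._seen: set[str] = set()  # assumption: replace with durable storage

    def receive(self, event: dict[str, Any]) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        event_id = event["event_id"]
        if event_id in self._seen:
            return False
        self._handler(event)
        # Mark as seen only after the handler succeeds, so that a failed
        # attempt is processed again when the retry arrives.
        self._seen.add(event_id)
        return True


processed: list[str] = []
dedup = EventDeduplicator(lambda e: processed.append(e["event_id"]))

event = {"event_id": "evt_123", "type": "subscription.renewed"}  # hypothetical payload
print(dedup.receive(event))  # first delivery
print(dedup.receive(event))  # retry of the same event
```

The first call processes the event and the second is recognised as a duplicate, so the handler runs once.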

### Subscription state collision (Apple receipt vs. web sub)

If a subscriber holds both an active Apple receipt and an active web subscription (a mid-migration race condition), Recurr writes both states to the entitlement system and flags the conflict. The customer's entitlement logic decides the resolution; Recurr surfaces the conflict for manual review.
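
One way to picture the flagged state is a record that carries both sources and derives the conflict from them. This is a sketch under assumed names: `EntitlementState` and its fields are hypothetical and do not describe Recurr's actual entitlement schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntitlementState:
    subscriber_id: str
    apple_receipt_active: bool
    web_subscription_active: bool

    @property
    def has_conflict(self) -> bool:
        # Both sources claim an active subscription at the same time.
        return self.apple_receipt_active and self.web_subscription_active


def needs_manual_review(states: list[EntitlementState]) -> list[str]:
    """Subscriber IDs whose states should be surfaced for review."""
    return [s.subscriber_id for s in states if s.has_conflict]


states = [
    EntitlementState("sub_a", apple_receipt_active=True, web_subscription_active=True),
    EntitlementState("sub_b", apple_receipt_active=True, web_subscription_active=False),
    EntitlementState("sub_c", apple_receipt_active=False, web_subscription_active=True),
]
print(needs_manual_review(states))
```

Both states are kept on the record rather than collapsed, which matches the behaviour described above: the resolution decision stays with the customer's entitlement logic.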

### Failed payment on web sub

Stripe's smart retries handle \~30% of involuntary churn on first failure. Recurr's dunning layer then runs the configured recovery cadence (email + in-portal) before the subscription cancels. The full recovery sequence is emitted as events your CS team can react to in real time.

### Identity bridge failure (migration arrival)

If a migrated subscriber arrives at branded checkout and the identity bridge can't match them to their existing app account, the flow routes to a manual-resolution surface. Recurr's CX support layer holds the subscriber's state; the customer's CX team gets a ticket with the relevant subscription context.

### Stripe Connect dependency

Stripe is Recurr's payment rail; Stripe outages affect subscription billing. Recurr can't insulate customers from Stripe-side incidents, but it communicates them clearly when they happen: Stripe status flows into Recurr's status page.

## On-call coverage

While Recurr is pre-customer, on-call is founder-led (Matt) with 24/7 paging. Response targets:

* Critical incident (customer impact, data integrity): \< 15 minutes
* High severity (operational degradation): \< 1 hour
* Standard (non-blocking issue): next business day

Coverage shifts to a rotation as the engineering team grows. Customers see no change in response targets through the transition.

## Status page

Recurr's status page is at [recurr.instatus.com](https://recurr.instatus.com). All incidents, scheduled maintenance, and dependency degradations land there with timestamps and impact scope.

Subscribe via email or RSS for incident notifications.

## What's not yet formalised

Honest pre-customer framing:

* **SOC 2 Type II** — in progress; Type I target alongside customer-#1 deployment, Type II 12 months after. See [compliance posture](/trust/compliance-posture).
* **Multi-region failover** — currently single-region (US-East). Multi-region with active failover scoped for post-customer-5.
* **Formal SLA percentages with credits** — finalise in customer-specific MSAs. The operational targets above are what Recurr runs against; the contractual percentages reflect the same targets with explicit credit mechanics.

## Cross-references

* [Risk register →](/trust/risk-register) — what could go wrong, segmented by exposure
* [Compliance posture →](/trust/compliance-posture) — SOC 2, GDPR, audit posture
* [Infrastructure →](/trust/infrastructure) — where data lives, encryption, access controls
* [API reference overview →](/api-reference/overview) — webhook delivery + retry mechanics
