API latency spike
Resolved
Sep 29 at 11:00am HDT
Incident Report: Feature Flag Provider Outage
Summary
At ~17:00 UTC, our feature flag provider experienced an outage that caused elevated errors and slower response times in their API. Our caching layer initially absorbed the impact, but as the outage worsened, cached flag values began to expire and lookups fell back to the degraded provider API, resulting in elevated latency and degraded performance across our platform. Normal service was restored by ~18:40 UTC.
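For readers curious about the mechanism, here is a minimal sketch of how a TTL-based flag cache produces exactly this failure pattern. Everything here (FlagCache, fetch_fn, the 300-second TTL) is a hypothetical illustration, not our actual implementation:

    import time

    class FlagCache:
        """Illustrative TTL-based in-process cache in front of a flag provider API."""

        def __init__(self, ttl_seconds=300):
            self.ttl = ttl_seconds
            self._entries = {}  # flag_key -> (value, fetched_at)

        def get(self, flag_key, fetch_fn):
            entry = self._entries.get(flag_key)
            now = time.monotonic()
            if entry is not None and now - entry[1] < self.ttl:
                # Fresh hit: no provider call, so a provider outage is
                # invisible while entries remain within their TTL.
                return entry[0]
            # Miss or expired entry: a provider call is unavoidable. During
            # the outage these calls were slow or failing, which is what
            # surfaced as platform latency once cached values expired.
            value = fetch_fn(flag_key)
            self._entries[flag_key] = (value, now)
            return value

While entries are fresh, the provider can be down with no visible impact; once TTLs lapse, every expired flag lookup inherits the provider's latency, which matches the timeline below.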
Timeline (UTC)
- 17:00 – Feature flag provider outage begins, causing elevated errors and slower API responses.
- 17:00–18:00 – Our caching systems mitigated the impact by serving cached flag values.
- 18:00–18:30 – Cached values expired; flag lookups fell back to the degraded provider API, slowing platform jobs and causing elevated latency and platform slowness.
- ~18:30 – Provider mitigated their outage and restored normal service.
- 18:40 – Platform throughput automatically recovered; normal service resumed.
Impact
- The US region was most heavily affected.
- EU and AU regions experienced only minor impact.
- Platform latency and job execution times were elevated from ~18:00 UTC until recovery at ~18:40 UTC (roughly 40 minutes).
Next Steps / Mitigation
- We are implementing additional safeguards to further reduce our dependency on the provider; one possible approach is sketched below.
- These mitigations are intended to limit, and ultimately eliminate, the impact of similar provider incidents.
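As an illustration only (not a commitment to a specific design), one common safeguard is a cache that keeps expired entries and serves the last known value when the provider is unreachable. All names here are hypothetical:

    import time

    class StaleTolerantFlagCache:
        """Illustrative cache that degrades to stale values on provider failure."""

        def __init__(self, ttl_seconds=300):
            self.ttl = ttl_seconds
            self._entries = {}  # flag_key -> (value, fetched_at)

        def get(self, flag_key, fetch_fn, default=None):
            entry = self._entries.get(flag_key)
            now = time.monotonic()
            if entry is not None and now - entry[1] < self.ttl:
                return entry[0]  # fresh hit, no provider call
            try:
                # In practice a request timeout would convert provider
                # slowness into an exception handled below.
                value = fetch_fn(flag_key)
            except Exception:
                if entry is not None:
                    return entry[0]  # serve the stale last-known value
                return default  # no history yet: use a safe default
            self._entries[flag_key] = (value, now)
            return value

The trade-off is serving potentially outdated flag values during a provider outage in exchange for keeping platform latency flat; for most flags that is a far better failure mode than blocking on a degraded API.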
Updated
Sep 29 at 09:30am HDT
We’re back! API performance should now be stable. Thanks for bearing with us during the disruption.
Created
Sep 29 at 09:00am HDT
We’re investigating an issue with our API that is causing increased latency and degraded performance for some users. We’re working to fix the problem as quickly as we can. We’ll share another update shortly.