Resolved
Oct 1, 2025 at 11:30am UTC

Incident Report: US Database Cluster Outage

Summary

At approximately 04:10 AEST on 1 October 2025, Relevance experienced a platform outage when our US database cluster entered a failure state. Although the cluster appeared online, it could not process operations correctly, which made the global API unavailable by 04:30 AEST.

The on-call team was automatically paged, began investigating, and identified that the issue was in our database infrastructure rather than in the application itself. Initial remediation steps did not restore service, so we proceeded with a full database cluster reboot and a redeploy of dependent services. This restored the platform to a healthy state by 07:00 AEST.

Timeline (AEST)

~04:10 – US database cluster entered failure state.
~04:17 – On-call engineer paged and began investigating.
~04:30 – Global API became unavailable for users in all three regions.
~06:00 – AU and EU regions stabilised after user migration.
~07:00 – US region connections terminated, services redeployed, platform restored.
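
The timeline above is in AEST, while the status updates on this page are stamped in UTC. The following is a minimal sketch for lining the two up, assuming AEST is a fixed UTC+10 offset; the helper name `aest_to_utc` is illustrative only and not part of any Relevance tooling.

```python
from datetime import datetime, timedelta, timezone

# AEST is a fixed offset of UTC+10 (no daylight saving adjustment).
AEST = timezone(timedelta(hours=10), "AEST")


def aest_to_utc(hhmm: str, day: str = "2025-10-01") -> str:
    """Convert an AEST wall-clock time on the given day to a UTC string."""
    local = datetime.strptime(f"{day} {hhmm}", "%Y-%m-%d %H:%M").replace(tzinfo=AEST)
    return local.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")


# Timeline entries from the report, keyed by their AEST time.
for hhmm in ("04:10", "04:17", "04:30", "06:00", "07:00"):
    print(f"{hhmm} AEST = {aest_to_utc(hhmm)}")
# First line printed: 04:10 AEST = 2025-09-30 18:10 UTC
# Last line printed:  07:00 AEST = 2025-09-30 21:00 UTC
```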

Impact

The platform outage lasted approximately 3 hours.
AU and EU users recovered by ~06:00 AEST; US fully restored at ~07:00 AEST.

Next Steps / Mitigation

We have implemented immediate changes to reduce the likelihood of similar failures.
Additional redundancy has been provisioned in our database cluster.
Monitoring has been expanded to detect conditions that may lead to this type of failure earlier.
Further hardening of the system is planned to improve resilience against provider-level issues.
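
The failure described in this report, a cluster that appears online but cannot process operations, is the kind of condition a connectivity-only check can miss. As a hedged illustration of what earlier detection might look like (not a description of the monitoring we actually deployed), the sketch below runs a real operation under a deadline and classifies the result. The `probe` callable, the status labels, and the timeout value are all assumptions made for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable


def deep_health_check(probe: Callable[[], bool], timeout_s: float = 2.0) -> str:
    """Classify a cluster as 'healthy', 'degraded', or 'unresponsive'.

    `probe` is a hypothetical callable that performs a real round trip
    (for example, write a sentinel value and read it back) and returns
    True when the result is correct.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(probe)
    try:
        ok = future.result(timeout=timeout_s)
    except FutureTimeout:
        # The cluster accepted the request but never completed it.
        return "unresponsive"
    except Exception:
        # The operation failed outright.
        return "degraded"
    finally:
        # Do not block the caller on a probe that is still hanging.
        executor.shutdown(wait=False)
    # A completed probe with a wrong result is also treated as degraded.
    return "healthy" if ok else "degraded"


if __name__ == "__main__":
    print(deep_health_check(lambda: True))                     # probe succeeds
    print(deep_health_check(lambda: time.sleep(0.3) or True,   # probe hangs
                            timeout_s=0.05))
```

The design choice here is to treat a timeout as a distinct state from an error, so that alerting can page on "online but not processing" rather than waiting for connections to fail outright.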

Updated
Sep 30, 2025 at 11:25pm UTC

All services are now fully operational across all regions. The incident has been resolved, and our team will share a detailed postmortem once it is ready. Thank you for your patience and understanding during this disruption.

Updated
Sep 30, 2025 at 9:31pm UTC

We’re still investigating the database connection issue. The global API is currently available in the AU and EU regions, while the US region continues to be affected. Our team is actively working on restoring full service in the US and we’ll provide another update soon.

Created
Sep 30, 2025 at 6:15pm UTC

We’re investigating an issue with our database connections that is preventing jobs and triggers from synchronising. Our team is actively working on identifying the cause and restoring normal service. We’ll share another update shortly.

Issue with database connections
