Service Outage

Incident Report for Circle

Postmortem

Service Outage on January 15, 2024

All times are in PST

General Timeline

14:24: Our monitoring system alerted us that the website and bot had stopped responding to health checks.

14:25: Three members of our team responded to the alert and began investigating the incident. We posted a public announcement on this status page by 14:26 indicating we were aware of the issue and investigating.

14:27: Our primary hosting provider posted a status update for customers indicating they were aware of an issue affecting general network availability:
We are currently investigating an issue affecting general networking availability. This looks like an issue with our core routers causing a general bgp session drop. We are actively working to resolve this now.

14:28: We updated our status page to identify the root cause as upstream, mark the affected components, and notify subscribers.

15:08: Our hosting provider indicated that the issue was caused by NYIIX and that they were going to begin a manual failover.

15:10: A public update was posted by our hosting provider:
As an update we have already identified the root cause however taking action to fix it is proving difficult as the core routers are being overloaded by a bad network configuration on our peering point.

15:18: NYIIX sent an update:
We experienced a broadcast storm today from 05:17pm until 06:30pm EST (UTC 22:17 to 23:30) on our network which affected services to our members. We were able to find where the storm originated and have shutdown the offending port. At this time all NYIIX alarms have cleared, and all services have been restored.

We will work directly with the customer and make sure they have fixed their side before enabling their port again.

15:19: We began seeing shards reconnect to Discord and posted an update to the status page.

15:27: All systems were back online.
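
The NYIIX notice quoted above gives its window in EST and UTC, while this report uses PST. As a rough cross-check, the conversion can be sketched as below. This is a minimal illustration only: the fixed offsets (EST = UTC-5, PST = UTC-8, valid in January) and the helper name `est_to_pst` are assumptions made for this example and are not part of our tooling.

```python
from datetime import datetime, timedelta, timezone

# Fixed winter offsets (assumption: no daylight saving time in January).
EST = timezone(timedelta(hours=-5), "EST")
PST = timezone(timedelta(hours=-8), "PST")


def est_to_pst(hour: int, minute: int) -> datetime:
    """Convert a wall-clock time on Jan 15, 2024 from EST to PST (hypothetical helper)."""
    return datetime(2024, 1, 15, hour, minute, tzinfo=EST).astimezone(PST)


# Window reported by NYIIX: 05:17pm to 06:30pm EST.
storm_start = est_to_pst(17, 17)
storm_end = est_to_pst(18, 30)

print(storm_start.strftime("%H:%M"), "to", storm_end.strftime("%H:%M"), "PST")
# prints "14:17 to 15:30 PST"
```

Under these assumptions, the reported window of 05:17pm to 06:30pm EST corresponds to 14:17 to 15:30 PST.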

Incident Analysis

This incident was caused by an upstream issue outside of our control. After the initial notification, our team took under 35 seconds to respond to the alert and begin investigating. Although there was not much we could do on our end, we are highly satisfied with how quickly we responded to the downtime alert.

We distribute our core infrastructure across several providers, which enabled Circle Premium to stay online throughout this networking outage and allowed us to demonstrate our commitment to subscribers.

The Future

We’re satisfied with the current steps from upstream providers to publish postmortems and share what occurred. We do not have plans to make any major infrastructure changes as a result of this outage.

Upcoming Changes

We’re currently testing a new load-distribution system for both of our Discord bots. While not related to this outage, this new system should enable faster development, better monitoring, and better performance for all servers! We hope to release it by January 18th.

Posted Jan 15, 2024 - 19:07 PST

Resolved

Circle is back online and everything is operating normally.
Sorry for the inconvenience.

Posted Jan 15, 2024 - 15:27 PST

Monitoring

Problems have been resolved at our upstream provider and Circle is beginning to reconnect.

Posted Jan 15, 2024 - 15:22 PST

Update

Circle is unavailable due to an incident at our primary hosting provider.

"We are currently investigating an issue affecting general networking availability. We are actively working to resolve this now."

**Circle Premium is hosted on another provider and is operational at this time.** Updates will be posted as they become available.

Posted Jan 15, 2024 - 14:28 PST

Identified

We are aware of problems with Circle. The team is online and investigating.

Posted Jan 15, 2024 - 14:26 PST

This incident affected: Circle and Website.