Network Outage - New York City Metro (EWR2)
Incident Report for Equinix Metal
Postmortem

Outage Start Time: 13:17 (EDT) on July 17, 2018

Outage End Time: 16:35 (EDT) on July 17, 2018

Reason for Outage

Overview - What Happened?

Some Packet customers in the New York City metro region, serviced out of our Parsippany, NJ datacenter (EWR1), experienced a partial loss in connectivity during the above window.  This was caused by a combination of a failed network device and issues with a third-party network provider.

Please note that the outage start and end times are provided for transparency and as a detailed reference; customer infrastructure was not down throughout this entire period.
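The status updates further down are stamped in UTC, while the outage window is given in EDT. As a minimal sketch for lining the two up, assuming EDT is treated as a fixed UTC-4 offset (the names `EDT`, `outage_start`, `outage_end`, and `to_utc` are illustrative only, not part of any Packet tooling):

```python
from datetime import datetime, timedelta, timezone

# Assumption: EDT modeled as a fixed UTC-4 offset, which is sufficient
# for a single date in July.
EDT = timezone(timedelta(hours=-4), name="EDT")

# Outage window as stated in this report.
outage_start = datetime(2018, 7, 17, 13, 17, tzinfo=EDT)
outage_end = datetime(2018, 7, 17, 16, 35, tzinfo=EDT)


def to_utc(ts):
    """Convert a timezone-aware timestamp to UTC."""
    return ts.astimezone(timezone.utc)


duration = outage_end - outage_start

print(to_utc(outage_start).strftime("%H:%M UTC"))  # 17:17 UTC
print(to_utc(outage_end).strftime("%H:%M UTC"))    # 20:35 UTC
print(duration)                                    # 3:18:00
```

Under that assumption the window runs from 17:17 UTC to 20:35 UTC, a span of 3 hours 18 minutes, which can be compared directly against the UTC timestamps on the updates below.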

Identifying the Core Issue / Resolution

At the onset of the outage window, a backbone router failed in our Newark, NJ (EWR2) network point of presence (POP).  In conjunction with other area network POPs, including our New York City (JFK3) facility, this location provides our EWR1 datacenter with connectivity to other Packet datacenters as well as the wider Internet.

Given redundancies on our network, including diverse backbone fiber paths and points of Internet egress, the overwhelming majority of traffic on our network was failed over successfully without incident.

Unfortunately, we noticed that one specific provider (Verizon) continued to route traffic over our failed path for one of our network blocks long after our route advertisements were withdrawn. This created a partial outage for Verizon customers attempting to access Packet infrastructure, as well as for customers of other area broadband ISPs (e.g. Spectrum/Charter) that rely on the Verizon backbone for connectivity.  The problem cleared approximately two hours later, and connectivity was fully restored.

We dispatched an engineer to EWR2 to replace the failed network hardware. Traffic was restored gradually, returning the network to full redundancy.

Moving Forward

As a result of this outage, we’ve made some backbone-level configuration changes that will reduce our dependence on this specific transit circuit. We are also planning a comprehensive network overhaul in the NYC metro region to significantly improve capacity with additional 100G backbone and provider links, and to remove certain problematic transit providers from our vendor mix.

Please don’t hesitate to drop us a note (help@packet.net) if we can assist with any questions relating to this incident.

Regards,

Packet Infrastructure Team

Posted Aug 03, 2018 - 17:00 UTC

Resolved
We have resolved this issue and all routes are performing normally. We will continue to monitor for any issues.
Posted Jul 18, 2018 - 01:41 UTC
Update
Our engineers are continuing to work on our EWR2 site, gradually re-introducing backbone and external provider connectivity onto new hardware. We do not foresee any customer-facing outages at this time.
Posted Jul 17, 2018 - 21:49 UTC
Update
Reachability to Verizon has been restored.
Posted Jul 17, 2018 - 20:08 UTC
Identified
We have partitioned traffic away from our EWR2 network presence, where we are looking at a defective network device, and service has been restored at this time.

We are, however, continuing to receive reports of reachability issues limited to specific ISPs, namely Verizon; we are investigating.
Posted Jul 17, 2018 - 19:32 UTC
Investigating
We are currently experiencing a network outage in our EWR2 (Newark, NJ) facility.

Customers in EWR1 and other locations remain online; however, they may see temporary periods of increased packet loss as traffic is re-routed onto alternate paths.

We will provide additional information as it becomes available.
Posted Jul 17, 2018 - 17:19 UTC
This incident affected: Equinix Metal API, Equinix Metal Portal, and Equinix Metal Network.