EWR IPv6 Issue
Incident Report for Equinix Metal
Postmortem

Reason for Outage

Start Date: October 9th, 2018 @ Morning EDT

End Date: October 10th, 2018 @ 1:30 AM EDT

Internal Ticket: #40

Location: EWR1

Description: IPv6 Reachability Issues and Emergency DSR Maintenance

Outage Details

Packet experienced a loss of IPv6 reachability in its EWR1 (New Jersey) facility from the morning of October 9th, 2018 until around 1:30 AM EDT on October 10th, 2018. This impacted customers in EWR1 to varying degrees until it was resolved through a configuration change applied to our routing infrastructure. IPv4 connectivity was not affected during the incident.

During investigation, Packet discovered that IPv6 routes were incorrectly being dropped from the forwarding table at the spine layer of our network (the spine is a pair of aggregation switches that connect our top-of-rack switching to our border routers).

To correct the issue, Packet had to modify the Packet Forwarding Engine (PFE) configuration on the affected switches. This required draining traffic from the redundant switches one at a time and allowing each switch's PFE to fully reload. As load was shifted, customers may have seen increased latency while a single switch handled all traffic.
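
To illustrate the sequence, here is a minimal automation sketch using Juniper's PyEZ library. It is not Packet's actual tooling: the spine hostnames come from this report, but SSH-key authentication, the IS-IS-overload drain method, and the specific forwarding-table knob shown are assumptions that vary by platform and Junos release.

```python
# Minimal sketch of the one-spine-at-a-time procedure described above.
# Assumptions (not from this report): SSH-key auth, an IS-IS underlay used
# to drain traffic, and a QFX-style lpm-profile knob for the PFE change.
import time

from jnpr.junos import Device
from jnpr.junos.utils.config import Config

DRAIN = "set protocols isis overload"      # steer traffic onto the redundant spine
UNDRAIN = "delete protocols isis overload"
# Example forwarding-table profile change; per the report, each such change
# required a PFE restart to take effect.
PFE_CHANGE = "set chassis forwarding-options lpm-profile unicast-in-lpm"


def commit_set(dev, stanza, comment):
    """Load one set-style stanza and commit it."""
    with Config(dev, mode="exclusive") as cu:
        cu.load(stanza, format="set")
        cu.commit(comment=comment)


def modify_spine(host):
    with Device(host=host, user="netops") as dev:
        commit_set(dev, DRAIN, "drain spine before PFE-affecting change")
        time.sleep(300)   # in practice: verify traffic has shifted, don't just sleep
        commit_set(dev, PFE_CHANGE, "apply forwarding-table profile change")
        time.sleep(600)   # in practice: wait for the PFE to reload and routes to program
        commit_set(dev, UNDRAIN, "return spine to service")


# One switch at a time, so the redundant spine carries traffic during each reload.
for spine in ("dsr1.ewr1", "dsr2.ewr1"):
    modify_spine(spine)
```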

Timeline

All times are in EDT.

  • Tuesday, October 9, 2018

    • Early AM - Packet was alerted to an issue where a specific Packet customer was unable to pass IPv6 traffic properly in EWR1. Initial troubleshooting was performed with the affected customer, and Packet was able to restore reachability by bouncing the affected interface on the local ESR (top-of-rack switch).
    • 1:30 PM - Packet experienced a similar issue with its internal control plane infrastructure. At this point, Packet determined that the issue was not customer-specific but more widespread. The ticket was escalated to our senior network team and a case was opened with our routing vendor (Juniper).
    • 2:30 PM - Investigation showed that the reachability issue was the result of a prefix limit being reached in TCAM on Packet's DSR infrastructure, specifically for IPv6 traffic. The fix was to modify the forwarding-table profile to ignore multicast routes and increase the prefix limits for IPv4 and IPv6.
    • 4:30 PM - Packet was further notified that uplink sessions were bouncing for IPv6 only. This was confirmed and replicated, and a fix was identified. However, a discrepancy between Juniper's public documentation and its internal documentation prolonged the impact. After reloading the affected chassis, BGP sessions established normally and remained connected (a post-reload check of this kind is sketched after the timeline).
    • 8:45 PM - The customer maintained that they were still seeing packet loss from off-net. This was confirmed using third-party monitoring tools (an off-net reachability check of this kind is sketched after the timeline).
    • 9:00 PM - Packet identified that the forwarding tables on the DSR infrastructure did not seem to be working properly.
    • 9:15 PM - Juniper was re-engaged on the case and the ticket was upgraded to Priority 1 so that we could receive direct engagement from the ATAC team.
    • 9:20 PM - Packet shifted load off of dsr1.ewr1 (to dsr2) in an attempt to isolate traffic in case there was an ECMP issue and to better predict traffic flow patterns.
  • Wednesday, October 10th, 2018

    • 12:00 AM - It was determined that the lpm-profile modifications were not working as expected. We began trying different profiles in an attempt to find one that would handle the necessary route requirements.
    • 12:45 AM - Packet and Juniper reset the profile back to lpm-profile with a slight change to the unicast setting, while also allowing only IPv6 traffic over dsr1. Traffic to IPv6 destinations started to become reachable, but with packet loss.
    • 1:30 AM - ECMP, combined with dsr2.ewr1 not yet running the newly applied lpm-profile, was confirmed to be the cause of the remaining IPv6 packet loss. The lpm-profile on dsr2.ewr1 was changed to the new setting and traffic stabilized.
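
For reference, the kind of post-reload check noted in the 4:30 PM entry (BGP sessions re-established, IPv6 routes present) can be sketched with PyEZ as below. The hostname comes from this report; the route-count threshold is a placeholder, not a value from the incident.

```python
# Spot check after a PFE reload: confirm inet6.0 has a sane number of active
# routes and that no BGP peers are stuck outside Established.
from jnpr.junos import Device

EXPECTED_MIN_V6_ROUTES = 50000   # placeholder sanity threshold

with Device(host="dsr1.ewr1", user="netops") as dev:
    # RIB summary: report the active inet6.0 route count.
    summary = dev.rpc.get_route_summary_information()
    for table in summary.findall(".//route-table"):
        if table.findtext("table-name") == "inet6.0":
            active = int(table.findtext("active-route-count") or 0)
            print(f"inet6.0 active routes: {active}")
            if active < EXPECTED_MIN_V6_ROUTES:
                print("WARNING: IPv6 route count lower than expected")

    # BGP summary: flag any peer not in Established state.
    bgp = dev.rpc.get_bgp_summary_information()
    for peer in bgp.findall(".//bgp-peer"):
        state = peer.findtext("peer-state")
        if state != "Established":
            print(f"WARNING: peer {peer.findtext('peer-address')} is {state}")
```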
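
The off-net loss noted in the 8:45 PM entry was confirmed with third-party monitoring. A rough stand-in for that kind of external check is below: it repeatedly attempts IPv6-only TCP connections to a target and reports the failure rate. The target host and port are placeholders, not endpoints from the incident, and the script must be run from a vantage point outside the Packet network.

```python
# Rough stand-in for an external IPv6 reachability check: repeatedly attempt
# IPv6-only TCP connections to a target and report the failure rate.
import socket
import time

TARGET = "www.example.com"   # placeholder customer-facing hostname
PORT = 443
ATTEMPTS = 50


def connect_v6(host, port, timeout=3.0):
    """Return True if an IPv6 TCP connection to host:port succeeds."""
    try:
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return False
    for family, socktype, proto, _, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
                return True
        except OSError:
            continue
    return False


failures = 0
for _ in range(ATTEMPTS):
    if not connect_v6(TARGET, PORT):
        failures += 1
    time.sleep(1)

print(f"IPv6 connect failures: {failures}/{ATTEMPTS} ({100 * failures / ATTEMPTS:.0f}%)")
```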

Impact Notes

  1. At no time was IPv4 traffic impacted. This outage was isolated to IPv6 traffic only.
  2. IPv6 traffic was only affected from outside the Packet network; traffic within the EWR1 data center worked properly.
  3. Each modification required the PFE to be restarted for the changes to take effect. To preserve IPv4 routing stability, we had to shift traffic multiple times between the DSR switches, which increased troubleshooting time.
  4. Some troubleshooting took longer than expected because Juniper's public documentation differed from its internal documentation. As a result, changes did not take effect as expected until the correct procedure was identified. Juniper TAC confirmed that their internal documentation contained different procedures from what was published on their website.
Posted Oct 15, 2018 - 21:40 UTC

Resolved
This incident has been resolved.
Posted Oct 10, 2018 - 07:30 UTC
Update
The IPv6 network issues in EWR have been resolved. We will continue to monitor for service disruption over the next several hours before resolving this incident.
Posted Oct 10, 2018 - 06:24 UTC
Update
We are still seeing instability on inbound IPv6 traffic in EWR. We are working with our vendor to find a resolution to this issue.
Posted Oct 10, 2018 - 03:58 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Oct 10, 2018 - 00:27 UTC
Update
We are continuing to investigate this issue.
Posted Oct 09, 2018 - 23:57 UTC
Investigating
We are currently investigating this issue.
Posted Oct 09, 2018 - 23:52 UTC
This incident affected: Equinix Metal Network.