Nov 09 2017 outage


Posted on Nov 9, 2017 | By Hosting4Real

This morning (09-11-2017) we experienced a major outage between 08:07 and 10:38 (Europe/Amsterdam timezone). The outage affected all services located in our RBX datacenter: server6, server7, server8, and shop.hosting4real.net. This post is our official post-mortem, which all affected customers also received by email.

Glossary

  • DWDM/Optical equipment: DWDM stands for “Dense wavelength division multiplexing”; it is used to transmit multiple light wavelengths over a single optical fiber cable to increase the network capacity of that cable.
  • G: Short for gigabit
  • Interface card: Converts between electrical and optical signals to connect pieces of equipment
  • RBX: Roubaix
  • GRA: Gravelines
  • TH2: Telehouse 2, Paris
  • GSW: Global Switch Paris

Timeline (Provided by data center supplier)

08h00: All 100G links to Roubaix DC are down
08h15: Unable to connect to optical equipment
08h40: Master chassis restarted manually
09h00: Still impossible to recover equipment
09h15: Remove all cables from chassis
09h30: Chassis boots up
09h40: No alarms on chassis and all configuration gone
10h00: Restore the last backup to chassis
10h15: Circuits start to come online
10h30: Most circuits up, 8 down
11h00: Some interface cards not detected by the system, some amplifiers are broken, RMA being made
11h30: Resetting all unrecognized interface cards, all circuits are online
14h15: Replace broken amplifiers
14h30: All circuits are up again, the protection works and last alarms fixed

Connectivity to services was restored at 10:33 (shop.hosting4real.net, RBX7, RBX8) and 10:38 (RBX6).

Post-mortem

This morning an incident occurred in our datacenter provider’s optical network, which connects the RBX datacenters with 6 out of 33 points of presence (POPs) in their network: Paris (TH2 and GSW), Frankfurt (FRA), Amsterdam (AMS), London (LDN) and Brussels (BRU).

The RBX datacenters have 6 optical fibers connected to DWDM equipment, which provides 80x100G connections on each individual fiber.

Each of these 100G links is connected to the routers via two optical paths that follow geographically distinct routes, so in case of a fiber cut the system reconfigures itself within 50 milliseconds and all links stay up.
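
In protection-switching terms, each link has a working path and a geographically separate protect path, and traffic falls over to the protect path as soon as the working path loses signal. The short Python sketch below only illustrates that selection logic; it is a hypothetical toy model (the path names are made up), not the vendor’s actual implementation, and the 50 millisecond figure is the provider’s stated switchover time.

```python
from dataclasses import dataclass

@dataclass
class OpticalPath:
    name: str
    signal_ok: bool  # True while light is still received on this path

def select_active_path(working: OpticalPath, protect: OpticalPath) -> OpticalPath:
    """Toy model of path protection: prefer the working path and fall back
    to the geographically distinct protect path when the signal is lost."""
    return working if working.signal_ok else protect

# Hypothetical example: the working path suffers a fiber cut.
working = OpticalPath("working-path", signal_ok=False)
protect = OpticalPath("protect-path", signal_ok=True)

active = select_active_path(working, protect)
print(f"Traffic now rides the {active.name}")
# In the real system this switchover happens in hardware within ~50 ms.
```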

RBX is connected with a total capacity of 4.4 Tbps over 44x100G links: 12x100G to Paris, 8x100G to London, 2x100G to Brussels, 8x100G to Amsterdam, 10x100G to Frankfurt, 2x100G to the GRA datacenters and 2x100G to the SBG datacenters.
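
To make the arithmetic explicit, the 44x100G and 4.4 Tbps figures follow directly from the per-destination link counts listed above; the small Python sketch below simply sums them (no numbers beyond those in the paragraph are assumed).

```python
# Per-destination 100G link counts, as listed above.
links = {
    "Paris": 12,
    "London": 8,
    "Brussels": 2,
    "Amsterdam": 8,
    "Frankfurt": 10,
    "GRA": 2,
    "SBG": 2,
}

total_links = sum(links.values())               # 44
total_capacity_tbps = total_links * 100 / 1000  # each link carries 100 Gbit/s

print(f"{total_links} x 100G = {total_capacity_tbps} Tbps")  # 44 x 100G = 4.4 Tbps
```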

At 08:01 all 44x100G links disappeared in one go. Given the redundancy in place, the root cause could not be the simultaneous physical cut of 6 optical fibers. Remote diagnostics of the chassis were impossible because the management interfaces were not working, so the chassis had to be handled directly in the routing rooms: the cables were disconnected, the system was restarted, and the diagnostics were then run together with the equipment manufacturer. These attempts to reboot the system took a long time because the chassis needs roughly 10-12 minutes to boot, which is the main reason the incident lasted as long as it did.

Diagnostic

All the interface cards in use (ncs2k-400g-lk9, ncs2k-200g-cklc) went into “standby” mode, possibly due to a loss of configuration. The provider therefore recovered the backup and reset the configuration, which allowed the system to reconfigure all the interface cards. The 100G links in the routers automatically came back, and the RBX connection to the 6 POPs was restored at 10:34.

The problem was due to a software bug in the optical equipment. The database with the configuration is saved 3 times and copied to 2 monitoring cards, yet despite these safeguards the database disappeared. The provider will work together with the manufacturer to find the source of the problem and help fix the bug. Our datacenter provider still has trust in the equipment manufacturer (one of the biggest in the world); even if this bug is critical, uptime is a matter of design that must consider every single possibility and every single point of failure, and the system has to be designed in a more paranoid manner than it was until now.

Datacenter conclusion

Software bugs can exist, but incidents impacting customers are not acceptable. This is an issue we as a datacenter take on our shoulders: despite all the investments in the network, optical fibers, and technologies, it resulted in 2.5 hours of downtime in RBX.
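
For reference, the 2.5-hour figure follows directly from the timestamps reported earlier (outage start at 08:07, last service back at 10:38, Europe/Amsterdam time); the snippet below is just that subtraction.

```python
from datetime import datetime

start = datetime(2017, 11, 9, 8, 7)   # outage start (Europe/Amsterdam)
end = datetime(2017, 11, 9, 10, 38)   # last affected service (RBX6) restored

print(end - start)  # 2:31:00 -> roughly 2.5 hours
```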

One solution to the problem is to duplicate the optical node systems: 2 systems mean 2 databases, so even if a configuration is lost, only 50% of the capacity disappears instead of 100% as happened today. The project has already been started and the hardware will arrive in a matter of days; once it arrives, the configuration and migration work will start in 2 weeks. Given today’s incident, this project has become a priority for all of our infrastructure, all DCs, all POPs.
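
The capacity math behind that plan is simple: if the optical node is split into N independent systems, each with its own database, a single configuration loss only takes out 1/N of the capacity. Below is a minimal sketch, assuming purely for illustration that the 4.4 Tbps RBX capacity is spread evenly across the systems.

```python
def capacity_lost_tbps(total_tbps: float, independent_systems: int) -> float:
    """Capacity taken offline when one system loses its configuration,
    assuming the total capacity is spread evenly across the systems."""
    return total_tbps / independent_systems

print(capacity_lost_tbps(4.4, 1))  # 4.4 Tbps lost -> 100%, as happened today
print(capacity_lost_tbps(4.4, 2))  # 2.2 Tbps lost -> 50%, with duplicated systems
```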

Our conclusion

As a shared hosting provider there was not much we could do to prevent this. We split customers over 2 datacenters (RBX and GRA), with around 50% of customers in each datacenter, and our nameservers are spread across 4 datacenters and 3 different providers.

Backups are located in another datacenter at another provider, so even if our entire shared hosting infrastructure were to die, backups would still be available.

We’re sincerely sorry about the downtime our customers experienced today. During the incident we did everything we could to get back to people contacting us at support@hosting4real.net, and we kept our Twitter profile up to date with the latest news.

Today’s downtime was out of our hands, and we would not have been able to bring customers back online any faster than was done. We do not wish for these issues to happen, but hardware, power or networking can fail. Murphy’s Law applies.

We’re working towards moving https://shop.hosting4real.net/ out of our RBX datacenter location in the future, to prevent a similar scenario in which customers cannot reach our support page directly.

Root cause of RBX outage

The RBX optical equipment consists of 1 chassis containing a total of 20 frames. The outage was caused by 3 separate events:

1: Node controller CPU overload of the master frame

Each optical node has a master frame that exchanges information with the other nodes and can switch over to the slave frames. On the master frame, the database is saved to two controller cards.

At 07h50 the datacenter noticed communication problems between the nodes connected to the master frame, which was showing a CPU overload. The cause of the CPU overload is still unknown, and both the datacenter and the manufacturer are still investigating the root cause.

2: Cascade switchover

Following the CPU overload of the node, the master frame performed a switchover of the controller boards. After the switchover and the CPU overload, a Cisco software bug was triggered; this bug occurs on large systems and results in the controllers switching over every 30 seconds. Normally this stabilizes by itself; Cisco has scheduled software release 10.8 for the end of November to address this bug.

3: Loss of the database

At 08h00, following the cascade switchover event, another software bug was hit, one that de-synchronizes the timing between the two controller cards of the master frame. This bug caused a command to be sent to both controller cards at the same time to set the database to “0” (wipe it). As a result, all configuration was lost.

The action plan
  • Upgrade the controllers from TNCE to TNCS models, which double the CPU and RAM available per controller. These will be switched out on 13th November in RBX and on 14th November in GRA; the same replacement will be done for the controllers in Strasbourg and Frankfurt.
  • Upgrade all nodes to Cisco software release 10.8
  • Upgrade to an intermediate version to finalize upgrading to 10.8
  • Split large nodes into two separate sets of controllers per DC/POP (Big task)
Posted in: Outages