Posted on Nov 9, 2017 | By Hosting4Real
This morning (09-11-2017) we experienced a major outage between 08.07 and 10.38 (Europe/Amsterdam timezone). The outage affected all services located in our RBX datacenter: server6, server7, server8, and shop.hosting4real.net. This post is our official post-mortem, which all affected customers also received by email.
08h00: All 100G links to Roubaix DC are down
08h15: Unable to connect to optical equipment
08h40: Master chassis restarted manually
09h00: Still impossible to recover equipment
09h15: Remove all cables from chassis
09h30: Chassis boots up
09h40: No alarms on chassis and all configuration gone
10h00: Restore the last backup to chassis
10h15: Circuits start to come online
10h30: Most circuits up, 8 down
11h00: Some interface cards not detected by the system, some amplifiers are broken, RMA being made
11h30: Resetting all unrecognized interface cards, all circuits are online
14h15: Replace broken amplifiers
14h30: All circuits are up again, the protection works and last alarms fixed
Connectivity to services was restored at 10.33 (shop.hosting4real.net, RBX7, RBX8) and 10.38 (RBX6).
This morning an incident occurred in our datacenter provider's optical network, which connects the RBX datacenters with 6 out of the 33 points of presence (POPs) in their network: Paris (TH2 and GSW), Frankfurt (FRA), Amsterdam (AMS), London (LDN) and Brussels (BRU).
The RBX datacenters have 6 optical fibers connected to DWDM equipment, giving a total of 80x100G connections on each individual fiber.
Each of these 100G links is connected to the routers via two optical paths in distinct geographic locations, so in case of a fiber cut the system reconfigures itself within 50 milliseconds and all links stay up.
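The protection scheme described above can be illustrated with a toy sketch (this is our own simplified model, not the vendor's actual implementation): traffic rides a working path, and on loss of signal the node fails over to the geographically distinct protect path.

```python
# Toy sketch of 1+1 optical path protection (hypothetical model):
# traffic uses the working path; on loss of signal the node switches
# to the geographically distinct protect path.
class ProtectedLink:
    def __init__(self) -> None:
        self.paths = {"working": True, "protect": True}  # True = signal OK
        self.active = "working"

    def _other(self, path: str) -> str:
        return "protect" if path == "working" else "working"

    def signal_lost(self, path: str) -> None:
        self.paths[path] = False
        # If the active path failed and the other path is healthy,
        # switch over; real systems complete this within ~50 ms.
        if path == self.active and self.paths[self._other(path)]:
            self.active = self._other(path)

    def is_up(self) -> bool:
        return self.paths[self.active]

link = ProtectedLink()
link.signal_lost("working")       # fiber cut on the working path
print(link.active, link.is_up())  # → protect True
```

This also shows why a simultaneous failure of all links pointed away from a fiber cut: protection only helps while at least one path per link stays healthy.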
RBX is connected with a total of 4.4Tbps capacity, 44x100G links: 12x100G to Paris, 8x100G to London, 2x100G to Brussels, 8x100G to Amsterdam, 10x100G to Frankfurt, 2x100G to GRA datacenters and 2x100G to SBG datacenters.
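The per-destination link counts above add up to the quoted totals; a quick sanity check (using only the numbers from the paragraph above):

```python
# Per-destination 100G link counts for RBX, as listed above.
links = {
    "Paris (TH2/GSW)": 12,
    "London": 8,
    "Brussels": 2,
    "Amsterdam": 8,
    "Frankfurt": 10,
    "GRA datacenters": 2,
    "SBG datacenters": 2,
}

total_links = sum(links.values())        # 44 links
total_capacity_gbps = total_links * 100  # 4400 Gbps = 4.4 Tbps
print(total_links, total_capacity_gbps)  # → 44 4400
```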
At 08.01 all 44x100G links disappeared in one go. Given the redundancy system in place, the root cause could not be the simultaneous physical cut of all 6 optical fibers. Remote diagnostics of the chassis were not possible because the management interfaces were down, so the chassis had to be handled directly in the routing rooms: disconnecting the cables, restarting the system, and finally running diagnostics together with the equipment manufacturer. The reboot attempts took a long time because the chassis needs roughly 10-12 minutes to boot – this is the main reason the incident lasted as long as it did.
All the interface cards in use (ncs2k-400g-lk9, ncs2k-200g-cklc) went into “standby” mode, possibly due to a loss of configuration. The provider therefore restored the backup and reset the configuration, which allowed the system to reconfigure all the interface cards. The 100G links in the routers came back automatically, and the RBX connection to the 6 POPs was restored at 10.34.
The problem was due to a software bug in the optical equipment. The configuration database is saved 3 times and copied to 2 monitoring cards; despite these safety measures, the database disappeared. The provider will work together with the manufacturer to find the source of the problem and help fix the bug. Our datacenter provider still has trust in the equipment manufacturer (one of the biggest in the world). Even though this bug is critical, uptime is a matter of design: every possibility and every single point of failure must be considered, and the system has to be designed in a more paranoid manner than it was until now.
Software bugs can exist, but incidents impacting customers are not acceptable. This is an issue the datacenter provider takes on its shoulders: despite all the investments in network, optical fibers, and technologies, it resulted in 2.5 hours of downtime in RBX.
One solution is to duplicate the optical node systems: 2 systems mean 2 databases, so even if one configuration is lost, only 50% of the capacity disappears instead of 100% as happened today. The project has already been started and the hardware will arrive within days; once it arrives, the configuration and migration work will start and take around 2 weeks. Given today’s incident, this project has become a priority for the provider's entire infrastructure: all DCs, all POPs.
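The capacity argument behind the duplication plan can be sketched with a small calculation (the function and the even split are our own illustration; only the 44x100G figure comes from the post above):

```python
# Hypothetical sketch: capacity surviving a configuration wipe when the
# 44x100G links are split across N independent optical systems, each
# holding its own configuration database.
def surviving_capacity_gbps(total_links: int, systems: int, failed: int) -> int:
    """Capacity left when `failed` of `systems` lose their configuration.

    Assumes the links are spread evenly across the systems.
    """
    links_per_system = total_links // systems
    return (systems - failed) * links_per_system * 100

# Today's design: one system, one database, one wipe -> nothing left.
print(surviving_capacity_gbps(44, 1, 1))  # → 0
# With the planned duplication: losing one system leaves 50%.
print(surviving_capacity_gbps(44, 2, 1))  # → 2200
```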
As a shared hosting provider there was little we could do to prevent this. We split customers over 2 datacenters (RBX and GRA), with around 50% of customers in each, and our nameservers are spread across 4 datacenters and 3 different providers.
Backups are located in another datacenter at another provider, so even if all of our shared hosting infrastructure went down, backups would still be available.
We’re sincerely sorry about the downtime our customers experienced today. During the incident we did everything we could to get back to people contacting us at firstname.lastname@example.org, and we kept our Twitter profile up to date with the latest news.
The downtime today was out of our hands, and we would not have been able to bring customers back faster than was done today. We do not wish for these issues to happen, but hardware, power or networking can fail – Murphy’s Law applies.
We’re working towards moving https://shop.hosting4real.net/ out of our RBX datacenter location in the future to prevent a similar scenario where customers couldn’t access our support page directly.
The RBX optical equipment consists of 1 chassis containing a total of 20 frames. The outage was caused by 3 separate events:
Each optical node has a master frame that exchanges information with the other nodes and handles swapping with the slaves. On the master frame, the database is saved to two controller cards.
At 07h50 the datacenter noticed communication problems between the nodes connected to it, along with a CPU overload on the master frame. The cause of the CPU overload is still unknown, and both the datacenter and the manufacturer are still investigating the root cause.
Following the CPU overload of the node, the master frame performed a switchover of the controller boards. After the switchover and the CPU overload, they hit a Cisco software bug: it occurs on large systems and results in the controllers switching over every 30 seconds. Normally this stabilizes by itself; Cisco has scheduled software release 10.8, to be released at the end of November, to address it.
At 08h00, following the cascading switchover events, a second software bug was hit, which de-synchronized the timing between the two controller cards of the master frame. This bug caused a command to be sent to both controller cards at the same time, setting the database to “0” (wiping it). This resulted in all configuration being lost.