Our outage earlier today (09-01-2021)


Posted on Jan 9, 2021 | By Hosting4Real

Earlier today, between 11:45 am and 12:06 pm (Europe/Amsterdam time), we experienced an outage on three servers and a partial outage on a fourth.

We'll go over what caused three of the servers to become entirely unavailable and the remaining one to be partially affected.

The long version

All affected servers are located in our new colocation rack in the south of the Netherlands. These servers are Gigabyte R152-Z31 machines, each equipped with an AMD EPYC 7402P CPU, 128GB DDR4 ECC memory, 2x 1.92TB Samsung PM983 U.2 NVMe drives, a 10G/25G Mellanox ConnectX-4 Lx EN NIC, and redundant power supplies.

The rack our servers are in has two power feeds (Feed A and Feed B); each feed comes from a separate path in the datacenter, meaning we can lose one of the feeds without our servers losing power.

All servers connect PSU1 to Feed A and PSU2 to Feed B via C14-to-C13 cables.

As part of a visit to the data center to rack two new servers, we also had to do some work on how the power is handled. This meant removing the cables from one power supply (PSU1 in this case) and swapping them out. That went perfectly fine: all servers stayed online, since they continued to receive all their power through the second power supply connected to Feed B.

After we finished with the A feed, we had to move on to the B feed and do the same there, so we disconnected all the cables from the B feed. However, two of the cables sat rather tight in their C14 plugs on the B feed, and while "jiggling" those two cables free, the cables going to the first four servers came loose enough for them to lose their connection to the A feed.

We quickly realize this (thanks to red blinking lights and beeping from the power supplies!), plug the cables back in, and power returns to all four servers as expected.

The very first thing we do is confirm the downtime in our monitoring: we see alerts for NLCP02, NLCP03, and NLCP04, but nothing about NLCP01. Immediately after, we post on our Twitter that we're aware three servers are down and that they're booting up.

  • NLCP03 quickly boots and returns to normal operation after just 4 minutes of downtime.
  • NLCP04 also boots up as expected and returns to normal operation after just 5 minutes of downtime.

NLCP02, however, pings: the system is reachable and we're able to log into it, but the load on the system is so high that it prevents it from serving pretty much any traffic. Serving pages over HTTP works, albeit slower than usual; serving pages over HTTPS isn't possible, likely because the CPU sits at 100% load and establishing the initial TLS connection requires a fair amount of extra computing power.
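
To put a number on that, OpenSSL ships with simple benchmarks; a quick sketch (the hostname below is illustrative, not one of our servers):

    # Count how many full TLS handshakes a host can complete in 10 seconds
    openssl s_time -connect www.example.com:443 -new -time 10

    # Show the raw CPU cost of the private-key operation behind a typical handshake
    openssl speed rsa2048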

We proceed to investigate the unavailability, first by stopping the webserver outright, then by killing the hundreds of PHP processes all trying to execute and eat up resources. We let the load drop slowly; however, despite the webserver being down and all PHP processes effectively stopped (except cronjobs), the load is still quite a lot higher than expected.
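
For the curious, the triage amounts to something like the following (a rough sketch; the service names depend on the stack and are illustrative here):

    # Stop the webserver so no new requests pile up
    systemctl stop httpd

    # Kill the backlog of PHP workers fighting over the CPU
    pkill -9 php-fpm

    # Watch the load average drain before starting services back up
    watch -n 5 uptime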

We proceed with checking the logs for anything suspicious that could cause the high load, but everything looks rather normal, except for a few timeouts in services that don't affect the overall functionality of websites, only some security features we run on our systems.

Suddenly, out of the blue, 15-20 minutes after we initially lost power, everything returned to normal. Websites loaded, traffic worked as expected, and CPU usage was back to slightly above normal levels (to be expected after an outage).

We get our recovery alert for NLCP02 - 18 minutes of downtime.

We then proceed to connect the servers back to the B feed; additionally, we add some C14 plug lock-inserts from Tripp Lite (thanks to the datacenter for providing these!):

(Image credit: Tripp Lite)

What these do is tighten the connection between the C14 connector and the power bar it plugs into. Depending on the individual port in the power bar, we could either use them or not; some ports were already tight enough not to pose a problem. However, at least one feed on each server now has the lock-inserts on its cable.

We go over every single cable once again, ensuring the fit is tight enough that it doesn't simply come out when we pull slightly on the cables.

A friend and customer of ours reaches out, informing us that the load on the server he's on had been high. The catch here is that he's located on NLCP01, a server we didn't receive any downtime alerts for. We look at the monitoring graphs for that particular server and verify that the load had indeed been high around the same time.

Logging into the server and running uptime reveals an uptime of 147 days. If the server had lost power, its uptime couldn't be 147 days; it would have been only a couple of minutes by then.
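
The check is as simple as it sounds (output reconstructed for illustration; only the 147 days is the real figure):

    $ uptime
     12:15:33 up 147 days,  2:41,  1 user,  load average: 0.74, 1.52, 3.10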

Investigating the event log via ipmitool does list the assertion and de-assertion of power being lost on either PSU, but only once for PSU1.
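
For reference, the event log can be read with ipmitool's standard SEL listing (the remote variant is shown with a placeholder address and credentials):

    # Read the System Event Log on the host itself
    ipmitool sel list

    # Or remotely via the BMC
    ipmitool -I lanplus -H 192.0.2.10 -U admin -P secret sel list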

The fact that this never appears in the logs means the system itself didn't directly register a full power loss.

Now, we did reconnect the cables only seconds after we lost power, but it shouldn't be possible for the system to keep running, even for just a couple of seconds, with no actual power connected.

We know this can happen with many consumer electronics, where power often lingers for a second or two after you pull the plug from the wall (a radio, TV, or similar), but it shouldn't be possible for servers, which draw considerably more power.

After consulting a good friend about it, we learned that if you're fast enough, you can unplug and replug a power cable on a PSU without the system turning off, since there can be multiple seconds' worth of power stored inside the PSU to supply the underlying system. Depending on the capacitor size and the system's draw, this can last from literally milliseconds to multiple seconds. After all, this is how the power-loss protection mechanism in SSDs (solid-state drives) works: they carry a capacitor that holds enough charge for the SSD to flush in-flight writes to persistent storage.
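
As a purely illustrative back-of-envelope (the component values are assumed, not measured from these PSUs), the hold-up time follows from the energy stored in the PSU's bulk capacitor divided by the system's power draw:

    # Hold-up time sketch: E = 0.5 * C * (V1^2 - V2^2), t = E / P
    # 680 uF bulk capacitor, 390 V bus sagging to a 300 V cutoff, 150 W draw;
    # all values illustrative
    awk 'BEGIN {
        C = 680e-6; V1 = 390; V2 = 300; P = 150
        E = 0.5 * C * (V1^2 - V2^2)
        printf "usable energy: %.1f J, hold-up: %.0f ms\n", E, 1000 * E / P
    }'

A tenth of a second or so at light load is plenty to ride through a quick unplug and replug.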

Verifying the same information on NLCP02, we saw this also turned out to be the case: the system didn't actually experience a full power loss.

It also explains why NLCP03 and NLCP04 did end up rebooting: they were the last ones to be plugged back into the power bar.

For some time, what remained unexplained was why the NLCP01 and NLCP02 servers had a high load for about 20 minutes.

Servers are complex systems. Many components have to talk to each other, and since they're designed to handle errors themselves, there are a lot of protection mechanisms on both the hardware and the software side.

The slowdown could be a voltage-instability protection kicking in, whether from the server, the operating system, or even the CPU; a combination of all of them could play a factor. Sadly, we don't know the underlying cause, but it's safe to assume that a protection mechanism in the system resulted in the slowdown. The CPU frequency remained stable throughout.

The high load on NLCP01 and NLCP02 lasted for the same amount of time; the system load on both rose and dropped within seconds of each other.

NLCP02 had an overall higher load than NLCP01, which we expected, since that server handles more traffic overall.

The higher system load also meant that NLCP02 ended up being unavailable for most customers. It still served traffic, but very slowly and at a reduced rate.

The lack of alerts for NLCP01 is due to the system being able to keep up most of the time; however, it still saw a slowdown.

Our monitoring registered an overall higher response time for our test page on the server, though not high enough to trigger an alert.
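
Our checks essentially boil down to timing a test page and alerting past a threshold; a minimal sketch (the URL and the 2-second threshold are illustrative, not our actual monitoring stack):

    # Time a request to the test page and alert if it takes longer than 2 seconds
    t=$(curl -s -o /dev/null -w '%{time_total}' https://www.example.com/testpage)
    awk -v t="$t" 'BEGIN { exit !(t > 2.0) }' && echo "ALERT: test page took ${t}s"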

The outage sucks, no doubt, and we're sorry for the inconvenience it caused. Mistakes do happen. We're always cautious when working with any equipment because we want to ensure a stable service for everyone, but today showed us that things can still go wrong.

We also learned more about how our systems react, and we saw some of these protections in action. We're still curious about what caused the slowdown. The System Event Log from IPMI shows that the power loss on the one power supply was only de-asserted 20 minutes after it was first asserted:

38 | 01/09/2021 | 10:28:38 | Power Supply #0xe7 | Power Supply AC lost | Asserted
39 | 01/09/2021 | 10:30:46 | Power Supply #0xe7 | Power Supply AC lost | Deasserted
3a | 01/09/2021 | 10:43:59 | Power Supply #0xe6 | Power Supply AC lost | Asserted
3b | 01/09/2021 | 11:03:51 | Power Supply #0xe6 | Power Supply AC lost | Deasserted

(Times are in UTC)

The assertion of the power loss occurred 60 seconds before the load went up; the de-assertion occurred 60 seconds before the load went down.

If we find out exactly what in the server triggered the slowdown, we'll update this post.

TL;DR

  • Servers are connected to two feeds (A and B) for redundancy
  • Switching the power cables on the A feed went fine
  • Proceeding to the B feed went fine until two cables were hard to get out, so we had to jiggle them
  • Jiggling the two cables caused the cables for NLCP01 to NLCP04 to come loose enough to lose power
  • We plug all four servers back in, and power is restored
  • We see alerts for NLCP02, NLCP03, and NLCP04 coming in. We post on Twitter, acknowledging we're aware of the issue.
  • NLCP03 comes online after 4 minutes, NLCP04 after 5 minutes.
  • NLCP02 pings, but the load is high and no traffic is served. We try a few tricks to bring the system load down, such as stopping the webserver, killing all PHP processes, letting the system relax for a minute or two, and starting services back up.
  • Load remains too high on NLCP02; we continue to investigate logs for a possible explanation of the slowdown
  • Load recovers out of the blue 15-20 minutes after we initially lost power
  • NLCP02 up after 18 minutes.
  • Connect servers to B feed again, add plug lock-inserts provided by the DC
  • A customer informs us that the load on NLCP01 had been high, despite no alerts
  • After investigation, it turns out we restored power to NLCP01 and NLCP02 so quickly that they didn't reboot (yay, capacitors!)
  • We consult a friend who's an electrical nerd to confirm that capacitors can keep a system powered for multiple seconds (they can)
  • We're investigating further to see why the throttling happened; we expect it has to do with a protection mechanism for handling voltage instability, either in the server or the CPU

Posted in: Outages