KVM@TACC Outage December 6-13, 2023

Resolved Posted by Cody Hammock on December 06, 2023
Outage start Wednesday, December 06, 2023 3 p.m.
Expected end Monday, December 11, 2023 6 p.m.

Dec 13: The outage is now resolved.

You should observe that I/O performance on ephemeral disks is back to normal, as they have been migrated to a different storage backend.
Note: If you attach an additional cinder volume of type "ceph-hdd", that volume may experience unexpectedly delayed I/O as the system state settles, but this will not impact the majority of instances.


Dec 11th: Narrowing our test cases shows that the fsync issue applies to the HDD storage pool but does not seem to impact the SSD storage pool. Although the SSD pool is somewhat tight on total capacity, we are migrating all VM ephemeral disks to it to mitigate the issue.

We are also rebooting the kvm01 control node, as we're observing hardware and driver errors on its internal API and tunnel interfaces.


Dec 10th: We are still investigating using the data sources mentioned in the previous update. We observe potentially lost messages at the storage protocol level, but have not yet replicated the losses at the network level to identify which path may be problematic.


Dec 9th: We have obtained low-level debug logs from the kvm-ceph `librbd` driver, and are correlating these logs with related packet dumps to isolate where a failure may be occurring.


Dec 8th: We've isolated the performance issues to intermittent delays in the `fdatasync` system call and other `sync` operations. We were able to identify several storage components that were contributing to these latencies, but have not yet isolated a root cause. Symptoms are slightly improved, but not resolved.
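For users who want to check whether their own instance is affected, the kind of delay described above can be probed with a small script. This is a minimal sketch, not the diagnostic tooling staff used; the function name `time_fdatasync` and the payload size are illustrative assumptions.

```python
import os
import tempfile
import time

def time_fdatasync(path, payload=b"x" * 4096, rounds=5):
    """Write a small payload and time each fdatasync call.

    On an affected disk, one or more of these calls may stall
    for seconds or minutes instead of completing near-instantly.
    """
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        for _ in range(rounds):
            os.write(fd, payload)
            start = time.monotonic()
            os.fdatasync(fd)  # the syscall implicated in the stalls
            latencies.append(time.monotonic() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return latencies

latencies = time_fdatasync(os.path.join(tempfile.mkdtemp(), "probe"))
print(f"max fdatasync latency: {max(latencies):.6f}s")
```

Run this against a file on the ephemeral disk; consistently sub-second latencies suggest your instance is not affected.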


5:30 Dec 7th: We are still investigating performance issues.

We have replicated the reported behavior, namely that within a VM, internet downloads and package installation sometimes seem to "pause" for 5+ minutes before resuming as if nothing happened.

However, we do not yet have a root cause. If your work can tolerate this kind of intermittent interruption, for example by extending timeouts, you may be able to continue working, but we're not yet able to make any guarantees.
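If you want to keep scripts running through these pauses, one approach is to wrap flaky steps in a retry loop with generous timeouts. The sketch below is a hypothetical pattern, not an official workaround; `flaky_download` is a stand-in for your own download or package-install step.

```python
import time

def with_retries(operation, attempts=3, backoff_seconds=30):
    """Retry `operation`, sleeping between attempts to wait out a stall."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            if attempt == attempts:
                raise  # give up after the final attempt
            time.sleep(backoff_seconds)

# Hypothetical stand-in: a download that stalls twice before succeeding.
calls = {"n": 0}
def flaky_download():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("stalled")
    return "payload"

print(with_retries(flaky_download, attempts=3, backoff_seconds=0))
```

Pair this with longer per-operation timeouts (5+ minutes, given the observed pause length) so a stall raises a retryable error instead of hanging indefinitely.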

Again, thank you for your patience as we work to get to the bottom of this.


2:43 PM Dec 7th: We are still investigating performance issues on KVM@TACC.

Launching VMs and reconfiguring networks (among other actions) should be back to taking a "normal" amount of time, as we identified and corrected high load and frequent retries impacting the database and message bus that these actions depend on.

However, we are still receiving reports of slow performance once a VM is booted, especially for tenant network traffic.

We are still investigating these remaining issues.

3:00 PM Dec 6th: The KVM@TACC site is experiencing an issue that is causing intermittent availability. Staff is working to correct the problem. During this time, errors you encounter may not be the fault of your experiments or actions. Please try again after the outage is resolved. 

Thank you for your patience.