Back in January 2020, we introduced our new Cloud Cluster, which offers High Availability as well as simplified node migrations and shared storage.
For the new environment, we run a four-node GlusterFS cluster (plus two arbiter nodes), which normally delivers very fast I/O and high IOPS. Every node was built with SSDs from different vendors to avoid any issue with a defective series.
Using drives from a single vendor is bad
A defective series already hit us back in 2018, when almost 32 newly bought SSDs from a single vendor died within four weeks, corrupting three of a customer's eight servers. Lesson learned: since then, we mix different types of SSDs in our servers to avoid data loss at all costs.
Finding the bad performing culprit
In our current GlusterFS setup, we use SSDs from four different vendors. The four SSDs from one of those vendors performed quite badly: most nodes showed an average IOwait of 20–50%. At first we suspected the onboard HBA was slowing down the I/O, so we bought new SATA PCIe HBAs (Dell PERC H310 flashed to IT mode), which let us get rid of the onboard HBA in question.
That improved the situation slightly, but the nodes were still not performing as they should. Because we use software RAID, we were able to trace the performance issues back to a single drive vendor. The tool "atop" was pretty helpful in finding out which drive was constantly throwing high IOwait.
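atop (and iostat -x) reads its per-disk busy percentage from the kernel's I/O counters in /proc/diskstats. As a minimal sketch of the same idea, assuming standard sdX-style device names, you can sample that file twice and compare how many milliseconds each disk spent busy — the drive with the consistently highest delta is the one dragging IOwait up:

```shell
#!/bin/sh
# Sketch: sample /proc/diskstats twice; field 13 is the total milliseconds
# a device has spent doing I/O. The per-interval delta is what atop/iostat
# report as the disk's busy/util percentage.
snapshot() { awk '$3 ~ /^(sd[a-z]+|vd[a-z]+|nvme[0-9]+n[0-9]+)$/ { print $3, $13 }' /proc/diskstats; }

before=$(snapshot)
sleep 5
after=$(snapshot)

# Join both samples on the device name and print the busy-time delta.
echo "$after" | while read -r dev t1; do
    [ -n "$dev" ] || continue
    t0=$(echo "$before" | awk -v d="$dev" '$1 == d { print $2 }')
    echo "$dev busy_ms=$((t1 - t0))"
done
```

A drive that shows a much larger busy_ms than its siblings over repeated samples is a strong candidate for the slow vendor.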
Resolving the situation
We kicked the badly performing drives out of the array on every node, which immediately increased I/O throughput and dropped IOwait to nearly zero. Terrible, isn't it?
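With Linux software RAID (mdadm), removing a slow member is a two-step online operation; the array keeps serving I/O from the remaining drives. The device names below are hypothetical — /dev/md0 for the array and /dev/sdc for the slow vendor's drive:

```shell
# Mark the slow drive as failed, then pull it from the array
# (hypothetical names: md0 = the RAID device, sdc = the slow drive).
mdadm /dev/md0 --fail /dev/sdc
mdadm /dev/md0 --remove /dev/sdc

# Watch the array state and rebuild progress afterwards.
cat /proc/mdstat
```

Once a replacement drive is installed, `mdadm /dev/md0 --add` brings it back into the array and triggers a resync.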
I guess if we had designed the environment around hardware RAID, resolving those issues would have been a lot more nerve-racking and quite costly.