ESXi issues related to storage.
There are a lot of articles on the internet with information or questions regarding storage managed by ESXi. No wonder, as storage issues are among the most serious ones and, for many VMware administrators, hard to debug. Problems can be related to high storage latency as well as to general connectivity issues.
VMware has introduced several mechanisms to track storage status and, as a result, update the log files.
Please take a closer look at this article: https://kb.vmware.com/s/article/2113956
As we can see in this document (further down the page), there are several entries an admin can look for.
First of all, check vobd.log for an entry like:
Lost access to volume <uuid><volume name> due to connectivity issues. Recovery attempt is in progress and the outcome will be reported shortly
We need to understand that while a volume is in the lost-access state, the host issues no read/write I/O to it until the heartbeat I/O can be completed again.
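If you are working with a log bundle offline, a quick script can pull these events out. Below is a minimal Python sketch (my own helper, not part of any VMware tooling) that scans a copy of vobd.log for the "Lost access to volume" messages and the matching "Successfully restored access to volume" entries, so you can see when each outage started and ended. The exact message wording and the timestamp being the first token of each line are assumptions; adjust the patterns to your log.

```python
#!/usr/bin/env python3
"""Minimal sketch: list lost/restored access events from a copy of vobd.log.

Assumptions (not from the article): the log was copied off the host, the
timestamp is the first whitespace-separated token of each line, and the
messages use the usual "Lost access to volume" / "Successfully restored
access to volume" wording.
"""
import re
import sys

# Capture the volume identifier that follows the message text.
LOST = re.compile(r"Lost access to volume\s+(\S+)")
RESTORED = re.compile(r"Successfully restored access to volume\s+(\S+)")


def scan(path):
    events = []
    with open(path, errors="replace") as log:
        for line in log:
            for label, pattern in (("LOST", LOST), ("RESTORED", RESTORED)):
                match = pattern.search(line)
                if match:
                    timestamp = line.split(None, 1)[0]
                    events.append((timestamp, label, match.group(1)))
    return events


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "vobd.log"
    for ts, label, volume in scan(path):
        print(f"{ts}  {label:8s}  {volume}")
```

Pairing each LOST timestamp with the next RESTORED one for the same volume tells you how long the datastore was actually unavailable.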
For that reason, the following entries can usually be seen in vmkernel.log:
HB at offset XXXX – Reclaimed heartbeat [Timeout]:
And also in vobd.log, entries like this: [Timeout] [HB state
These reclaim attempts are very frequent and occur every second.
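To check how frequent these events really are in your environment, you can simply count the heartbeat-related lines per minute. Here is a minimal Python sketch (again my own helper; the marker strings are taken from the entries quoted above, and the ISO-like timestamp at the start of each line is an assumption):

```python
#!/usr/bin/env python3
"""Minimal sketch: count heartbeat timeout / reclaim lines per minute.

Assumption (not from the article): lines start with an ISO-like timestamp
such as 2023-01-01T12:34:56.789Z, so the first 16 characters identify the
minute the event belongs to.
"""
import sys
from collections import Counter

MARKERS = ("Reclaimed heartbeat", "[Timeout]")


def per_minute(path):
    counts = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            if any(marker in line for marker in MARKERS):
                counts[line[:16]] += 1  # e.g. "2023-01-01T12:34"
    return counts


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "vmkernel.log"
    for minute, count in sorted(per_minute(path).items()):
        print(f"{minute}  {count:4d} heartbeat events")
```

A long run of minutes with many such events is a good hint that the problem is ongoing rather than a one-off glitch.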
Of course, during a storage problem the virtual machines cannot write to that storage either. A virtual machine will try to stay online as long as it can sustain the huge latency or, eventually (if the problem persists), the disk inaccessibility. How a system behaves when it loses access to its disk really depends on the guest OS: Windows usually ends up with a blue screen, while Linux keeps running but becomes very unstable, and in the end the only thing you can do is reboot it.
In this scenario (the host is disconnected from the datastore for too long, i.e. more than 5 seconds), you should see the following in the logs:
NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "naa.XXXX" state in doubt; requested fast path state update...
This happens because of timeouts and because the HBA driver aborts the outstanding commands.
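When many LUNs are attached, it also helps to know which devices the "state in doubt" messages refer to. The following minimal Python sketch (my own helper; it assumes the device identifier appears in double quotes in the NMP message, as in the line quoted above) counts these messages per device:

```python
#!/usr/bin/env python3
"""Minimal sketch: count "state in doubt" messages per device in vmkernel.log.

Assumption (not from the article): the affected device is reported in double
quotes, e.g. "naa.600...", inside the nmp_DeviceRequestFastDeviceProbe message.
"""
import re
import sys
from collections import Counter

STATE_IN_DOUBT = re.compile(
    r'nmp_DeviceRequestFastDeviceProbe.*?"([^"]+)"\s+state in doubt'
)


def summarize(path):
    counts = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            match = STATE_IN_DOUBT.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "vmkernel.log"
    for device, count in summarize(path).most_common():
        print(f"{device}: {count} 'state in doubt' events")
```

If only one device shows up, you are probably looking at a single LUN or path problem; if every device is affected, suspect the fabric or the array itself.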
KB https://kb.vmware.com/s/article/1022026 lists possible causes:
- Array backup operations (LUN backup, replication, etc)
- General overload on the array
- Read/Write Cache on the array (misconfiguration, lack of cache, etc)
- Incorrect tiered storage used (SATA over SCSI)
- Fabric issues (Bad ISL, outdated firmware, bad fabric cable/GBIC)
One last thing worth mentioning is a great blog with a SCSI sense code decoder: https://www.virten.net/vmware/esxi-scsi-sense-code-decoder
With it you can examine log entries similar to: Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0. Act:NONE
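If you want to pre-process such lines before looking them up in the decoder, the fields can be split out with a small script. The sketch below (a helper of mine; the few labelled device-status values are standard SCSI statuses but still an assumption on my side, so verify them against the decoder) extracts the host, device and plug-in statuses plus the sense key/ASC/ASCQ triple:

```python
#!/usr/bin/env python3
"""Minimal sketch: split a vmkernel SCSI failure line into its status fields.

The field layout (H: host status, D: device status, P: plug-in status, then
sense key / ASC / ASCQ) follows the example line quoted above; the actual
decoding is left to the virten.net decoder, only a few well-known device
statuses are labelled here as an illustration.
"""
import re

SCSI = re.compile(
    r"H:(0x[0-9a-f]+)\s+D:(0x[0-9a-f]+)\s+P:(0x[0-9a-f]+)"
    r"(?:.*?sense data:\s+(0x[0-9a-f]+)\s+(0x[0-9a-f]+)\s+(0x[0-9a-f]+))?",
    re.IGNORECASE,
)

DEVICE_STATUS = {0x0: "GOOD", 0x2: "CHECK CONDITION", 0x8: "BUSY", 0x28: "TASK SET FULL"}


def parse(line):
    match = SCSI.search(line)
    if not match:
        return None
    host, device, plugin, key, asc, ascq = (
        int(value, 16) if value else None for value in match.groups()
    )
    return {
        "host_status": hex(host),
        "device_status": f"{hex(device)} ({DEVICE_STATUS.get(device, 'see decoder')})",
        "plugin_status": hex(plugin),
        "sense_key/asc/ascq": None if key is None else (hex(key), hex(asc), hex(ascq)),
    }


if __name__ == "__main__":
    sample = "Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0. Act:NONE"
    print(parse(sample))
```

For the example line this reports a device status of 0x2 (CHECK CONDITION), which tells you the sense data triple is the part worth decoding further.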
In conclusion, problems with storage systems can be very dangerous. With a hypervisor we have an additional layer which needs to know what is currently happening in the infrastructure (storage and SAN issues) and react accordingly. We do not want guest systems to stop on any such issue. On the other hand, a guest should be aware of the issue, so it can stop writing to disks which are currently not visible to the hypervisor. These kinds of issues are usually hard to debug for a system admin. Unfortunately, from