Good day all.
This article is one of several I hope to write on the subject. Storage Spaces Direct (S2D), also known as Azure Stack HCI, is a Microsoft product, but how it is deployed and supported is also specific to hardware vendors. Since #IWork4Dell, my troubleshooting is specific to Dell Platform Ready Nodes.
The scope of this document is how to approach disks that go into a Lost Communication state, and other storage-related issues such as disk replacement. The steps below will be assigned links as I complete additional articles. The basic troubleshooting questions are:
- How many disks can I lose?
- What's the best script to see the disk layout?
- How do I determine if a disk is bad?
- How do I replace the disk?
The last two are answered in this article. To find out if a disk is bad, you have a few things to look at. There are counters in the event log, a PowerShell command, and Windows Admin Center, all of which can help you assess whether a disk is bad or not.
Windows Event logs show Disk errors
In Applications and Services -> Microsoft -> Windows -> StorDiag -> Microsoft-Windows-Storage-ClassPnP/Admin
In Applications and Services -> Microsoft -> Windows -> StorDiag -> Microsoft-Windows-Storage-ClassPnP/Operational
In Applications and Services -> Microsoft -> Windows -> StorageSpaces-Driver -> Operational
Look for Event 505. This event will have Disk Failures and Sense Key codes you can look up. If there are other failures you find, please post to this article and let me know.
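The same logs can also be searched from PowerShell. This is a minimal sketch, assuming the channel names below match the Event Viewer paths listed above on your nodes. The names are my assumption from those paths, so confirm them first with Get-WinEvent -ListLog *Storage*.

```powershell
# Channel names are assumed from the Event Viewer paths above; verify with:
#   Get-WinEvent -ListLog *Storage*
$Logs = 'Microsoft-Windows-Storage-ClassPnP/Admin',
        'Microsoft-Windows-Storage-ClassPnP/Operational',
        'Microsoft-Windows-StorageSpaces-Driver/Operational'

foreach ($Log in $Logs) {
    # Get-WinEvent raises an error when nothing matches, so silence that case
    Get-WinEvent -FilterHashtable @{ LogName = $Log; Id = 505 } -MaxEvents 20 -ErrorAction SilentlyContinue |
        Format-List TimeCreated, LogName, Id, Message
}
```

The -MaxEvents value of 20 is arbitrary; raise it if you want to go further back in history.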
Test diagnostics of the pool with stordiag
It takes just two commands, and the output is an HTML report.
PS C:\WINDOWS\system32> stordiag /?
Collects storage and file-system diagnostic logs and outputs them to a folder.
StorDiag [-collectEtw] [-out <PATH>]
-collectEtw Collect a 30-second long ETW trace if run from an elevated session
-collectPerf Collect disk performance counters
-checkFSConsistency Checks for the consistency of the NTFS file system
-diagnostic outputs a storage diagnostic report
-bootdiag output boot sectors of the disk
-out <PATH> Specify the output path. If not specified, logs are saved to %TEMP%\StorDiag
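Going by the help text above, a run that collects performance counters and writes the diagnostic report to a folder of your choice might look like the sketch below. The output folder is a hypothetical example, and only switches listed in the help output are used.

```powershell
# C:\Temp\StorDiag is a hypothetical output folder; any path with free space works
$OutPath = 'C:\Temp\StorDiag'
New-Item -ItemType Directory -Path $OutPath -Force | Out-Null

# Switches are taken from the stordiag /? output shown above
stordiag -collectPerf -diagnostic -out $OutPath
```

Run it from an elevated session, since the help text notes that some collection options depend on elevation.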
Using Historical performance data
The third tool to check for a bad disk is PowerShell, which can show the history of the drives along with their read and write latencies. Below are two forms of the command:
Form 1 (use the UniqueId of the disk in your array that you are interested in checking):
$BadDisk=Get-PhysicalDisk -UniqueId 13DD1Z5155DXEX
$BadDisk | Get-StorageReliabilityCounter | FL *
Form 2 (all disks in one table):
Get-PhysicalDisk | Get-StorageReliabilityCounter | Sort-Object DeviceId | ft DeviceId,ReadErrorsTotal,ReadErrorsUncorrected,ReadLatencyMax,WriteErrorsTotal,WriteErrorsUncorrected,WriteLatencyMax -AutoSize
These two commands give you tables of disks with counters showing the error and latency history. If a disk has a dozen errors, you are not too concerned. If it has a thousand errors, or a similarly large amount, it may be time to think about replacing that disk. All the disks should fall in a tight range in terms of failure history.
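To make the "tight range" comparison easier, the counters can be joined back to each disk's serial number and sorted by error count. This is a rough sketch: the 1000-error cutoff simply mirrors the rule of thumb above and is my own illustrative value, not a Dell or Microsoft threshold.

```powershell
# Illustrative cutoff only, based on the rule of thumb in the text
$ErrorThreshold = 1000

Get-PhysicalDisk | ForEach-Object {
    $Counter = $_ | Get-StorageReliabilityCounter
    # Counters can be empty on some devices; casting $null to int64 yields 0
    $Total = [int64]$Counter.ReadErrorsTotal + [int64]$Counter.WriteErrorsTotal
    [pscustomobject]@{
        SerialNumber    = $_.SerialNumber
        UniqueId        = $_.UniqueId
        TotalErrors     = $Total
        ReadLatencyMax  = $Counter.ReadLatencyMax
        WriteLatencyMax = $Counter.WriteLatencyMax
        Suspect         = ($Total -ge $ErrorThreshold)
    }
} | Sort-Object TotalErrors -Descending | Format-Table -AutoSize
```

A disk that sits far above the others at the top of this table is the one to look at more closely, whether or not it crosses the cutoff.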
Using Historical data using Windows Admin Center
The final way to decide whether to replace a disk is to look at the same historical information as the last commands. The difference is that you can use the GUI in Windows Admin Center to check disk reliability history. Download Windows Admin Center here.
- Set up your connection to Storage Spaces Direct.
- Watch the video from about the 12:00 mark to the 12:21 mark to see where the storage counters are located.
- Another example of historical data in Server 2019 S2D (WSSD) with Windows Admin Center.
So there you have it: four different ways to look at storage data to figure out whether you need to replace your disk. Finally, I will include how to get the disk's indicator light blinking and how to replace the drive once you have received the dispatch:
Basic End to End checking for Disk Replacement
This is courtesy of the hard work of Jim Gandy. None of his work goes unappreciated, and I will tell you he is the best Dell Technical Support has to offer. You don't get support from him or the team without purchasing your S2D Ready Node with support for Storage Spaces Direct. There is real value in the support we provide.
(1)Check S2D health to see if there are other problems that take priority over the disk problem
Get-StorageSubSystem cluster* | Debug-StorageSubSystem # 2016
(2)Check if any Storage Jobs are running
Get-StorageJob
(3)Find the UniqueId of the disk that is in an unhealthy state
Get-PhysicalDisk | FT UniqueId,MediaType,CanPool,OperationalStatus,HealthStatus
(4)Add to variable $BadDisk
$BadDisk=Get-PhysicalDisk -UniqueId PlaceyourIDhere
(5)Check this disk for errors
$BadDisk | Get-StorageReliabilityCounter | FL *
(6)If you have errors (as defined in step 4 or 5), dispatch a replacement drive. This applies to step F below.
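Steps 3 through 5 can be rolled into one pass when you do not yet know which disk is at fault. A minimal sketch, assuming that any disk whose HealthStatus is not Healthy is worth inspecting (that filter is my assumption; the variable names are illustrative):

```powershell
# Treat every disk that is not reporting Healthy as a candidate
$Suspects = Get-PhysicalDisk | Where-Object HealthStatus -ne 'Healthy'

foreach ($Disk in $Suspects) {
    # Identify the disk, then dump its full reliability counters
    $Disk | Format-Table UniqueId, SerialNumber, MediaType, OperationalStatus, HealthStatus -AutoSize
    $Disk | Get-StorageReliabilityCounter | Format-List *
}
```

If exactly one disk comes back, it can be assigned to $BadDisk and carried into the replacement steps.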
Below shows how to replace or repair the disk and place back in Pool
(A)Retire the disk
Set-PhysicalDisk -UniqueId $BadDisk.UniqueId -Usage Retired
(B)Remove the disk from the Storage Pool
Remove-PhysicalDisk -PhysicalDisks $BadDisk -StoragePoolFriendlyName S2D* -Confirm:$false
(C)Optimize the Storage Pool
Optimize-StoragePool -FriendlyName S2D*
(D)Repair the Virtual Disk
Get-VirtualDisk | Repair-VirtualDisk
(E)Wait for Storage Jobs to complete
Get-StorageJob
(F)If there were no errors in step 6, re-add the disk to the Storage Pool and repeat steps C-E
Add-PhysicalDisk -StoragePoolFriendlyName S2D* -PhysicalDisks $BadDisk
(G)If you had errors in step 6 and the disk needs to be replaced, enable the drive indicator so the tech can find the drive to replace
Enable-PhysicalDiskIdentification -UniqueId $BadDisk.UniqueId
(H)How to see if indicator is on
Get-PhysicalDisk | ? IsIndicationEnabled -eq True | FT Ser*,isin*
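Waiting for Storage Jobs (step E) can be scripted as a polling loop rather than re-running the check by hand. This is a sketch under assumptions: the 30-second interval and 4-hour ceiling are arbitrary values of mine, and a job stuck in a state other than Completed will keep the loop going until the ceiling is hit.

```powershell
# Hypothetical upper bound on how long to wait before giving up
$Deadline = (Get-Date).AddHours(4)

# Keep polling while any storage job has not reached the Completed state
while (Get-StorageJob | Where-Object JobState -ne 'Completed') {
    if ((Get-Date) -gt $Deadline) {
        Write-Warning 'Storage jobs are still not complete; check them manually.'
        break
    }
    Get-StorageJob | Format-Table Name, JobState, PercentComplete -AutoSize
    Start-Sleep -Seconds 30
}
```

Once the loop exits without the warning, it is reasonable to move on to the next step and re-check virtual disk health.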
I hope this has been helpful. It should help you if you need to replace your disk or repair your virtual disk.