Troubleshooting Storage Spaces Direct Storage Problems: Replace or Reset a Disk

Good day all.

 

This article is one of many I hope to write on the subject. Storage Spaces Direct (aka Azure Stack HCI) is a Microsoft product, but it is also specific to each hardware vendor. Since #IWork4Dell, my troubleshooting here is specific to Dell Platform Ready Nodes.

 

The scope of this document is how to deal with disks that have lost communication and with other storage-related issues, such as disk replacement. I will link the steps below as I complete additional articles. The basic troubleshooting questions are:

 

  1. How many Disks Can I lose?
  2. What's the best script to see the disk layout?
  3. How do I determine if a disk is bad?
  4. How do I replace the disk?

 

The last two are answered in this article. To find out whether a disk is bad, you have a few things to look at: events in the Event Log, the StorDiag tool, a PowerShell command, and Windows Admin Center can all help you assess whether a disk is bad.

 

Windows Event logs show Disk errors

In Applications and Services Logs -> Microsoft -> Windows -> StorDiag -> Microsoft-Windows-Storage-ClassPnP/Admin

In Applications and Services Logs -> Microsoft -> Windows -> StorDiag -> Microsoft-Windows-Storage-ClassPnP/Operational

In Applications and Services Logs -> Microsoft -> Windows -> StorageSpaces-Driver -> Operational

Look for Event 505. This event records disk failures along with SCSI sense key codes that you can look up. If you find other failures, please post them in the comments and let me know.
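If you would rather pull these events from PowerShell than click through Event Viewer, here is a minimal sketch. It assumes the channel names match the three Event Viewer paths above, and it uses Get-WinEvent with a filter hashtable. The function name Get-StorageDiskEvent and its parameters are my own. They are not part of any Microsoft or Dell tooling. The logs are local to each node, so run it on every node in the cluster.

```powershell
# Sketch: collect a given event ID (505 by default) from the storage logs above.
# Assumption: the log names below are the channel names behind the Event Viewer
# paths listed in this article. Get-StorageDiskEvent is a hypothetical helper name.
function Get-StorageDiskEvent {
    [CmdletBinding()]
    param(
        [int]$Id = 505,
        [string[]]$LogName = @(
            'Microsoft-Windows-Storage-ClassPnP/Admin',
            'Microsoft-Windows-Storage-ClassPnP/Operational',
            'Microsoft-Windows-StorageSpaces-Driver/Operational'
        ),
        [int]$MaxEvents = 50
    )
    foreach ($log in $LogName) {
        $filter = @{ LogName = $log; Id = $Id }
        # SilentlyContinue: Get-WinEvent reports an error when no events match
        Get-WinEvent -FilterHashtable $filter -MaxEvents $MaxEvents -ErrorAction SilentlyContinue |
            Select-Object TimeCreated, LogName, Id, Message
    }
}
```

Run it from an elevated session with `Get-StorageDiskEvent | Format-List`, then look up the sense key codes in each message.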

 

Run diagnostics on the pool with StorDiag

It takes just two commands:

ipmo storage

stordiag -diagnostic

The output is an HTML report.

PS C:\WINDOWS\system32> stordiag /?

Collects storage and file-system diagnostic logs and outputs them to a folder.

StorDiag [-collectEtw] [-out <PATH>]

-collectEtw           Collect a 30-second long ETW trace if run from an elevated session

-collectPerf          Collect disk performance counters

-checkFSConsistency   Checks for the consistency of the NTFS file system

-diagnostic           outputs a storage diagnostic report

-bootdiag             output boot sectors of the disk

-out <PATH>           Specify the output path. If not specified, logs are saved to %TEMP%\StorDiag

 

Using Historical performance data

The third way to check for a bad disk is to use PowerShell to pull the history of the drives, including their read and write errors and latencies. Below are two forms of the command:

Form 1 (use the UniqueId of the disk in your array that you're interested in checking)

$BadDisk=Get-PhysicalDisk -UniqueId 13DD1Z5155DXEX

$BadDisk | Get-StorageReliabilityCounter | FL *

 

Form 2 (all disks)

Get-PhysicalDisk | Get-StorageReliabilityCounter | Sort-Object DeviceId | ft DeviceId,ReadErrorsTotal,ReadErrorsUncorrected,ReadLatencyMax,WriteErrorsTotal,WriteErrorsUncorrected,WriteLatencyMax -AutoSize

 

These two commands give you tables of disks with counters showing their error and latency history. If you have a dozen errors, you're not too concerned. If you have a thousand errors or some other large amount, it may be time to think about replacing that disk. All the disks should fall within a tight range in terms of error history.
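The sketch below puts the "tight range" idea into practice. It takes the output of Get-StorageReliabilityCounter and flags any disk whose total read plus write errors stand far above the rest of the pool. The thresholds are my own assumptions, not Dell or Microsoft guidance. A disk is flagged when it has at least 100 errors and more than 10 times the median of all disks. Find-ErrorOutlier is a hypothetical helper name. Adjust $Factor and $MinErrors to suit your environment.

```powershell
# Sketch: flag disks whose error count is far outside the pool's normal range.
# Assumption (not vendor guidance): "far outside" = at least $MinErrors errors
# AND more than $Factor times the median error count across all disks.
function Find-ErrorOutlier {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [object]$Counter,
        [double]$Factor = 10,
        [int]$MinErrors = 100
    )
    begin { $all = [System.Collections.Generic.List[object]]::new() }
    process { $all.Add($Counter) }
    end {
        # Total errors per disk; counters a disk does not report ($null) count as 0
        $rows = foreach ($c in $all) {
            [pscustomobject]@{
                DeviceId    = $c.DeviceId
                TotalErrors = [long]$c.ReadErrorsTotal + [long]$c.WriteErrorsTotal
            }
        }
        # Median of the totals, used as the pool's baseline
        $sorted = @($rows.TotalErrors | Sort-Object)
        $mid = [int][math]::Floor($sorted.Count / 2)
        if ($sorted.Count % 2) { $median = $sorted[$mid] }
        else { $median = ($sorted[$mid - 1] + $sorted[$mid]) / 2 }
        $rows | Where-Object {
            $_.TotalErrors -ge $MinErrors -and
            $_.TotalErrors -gt $Factor * [math]::Max($median, 1)
        }
    }
}
```

Usage: `Get-PhysicalDisk | Get-StorageReliabilityCounter | Find-ErrorOutlier`. The baseline is the median instead of the average. That way, one very bad disk cannot drag the baseline up and hide itself.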

 

Using historical data in Windows Admin Center

The final way to decide whether to replace a disk is to look at the same historical information as the previous commands. The difference is that you use the Windows Admin Center GUI to check each disk's reliability history. Windows Admin Center is a free download from Microsoft.

 

 

So there you have it: four different ways to look at storage data to figure out whether you need to replace your disk. Finally, here is how to blink the disk's indicator light and how to replace that drive once you have received the dispatch:

 

Basic End to End checking for Disk Replacement 

This is courtesy of the hard work of Jim Gandy. None of his work goes unappreciated, and I will tell you he is the best Dell Technical Support has to offer. You don't get support from him or the team without purchasing your S2D Ready Node with support for Storage Spaces Direct. There is real value in the support we provide.

(1) Check S2D health to see whether other problems take priority over the disk problem

Get-HealthFault # Windows Server 2019

Get-StorageSubSystem cluster* | Debug-StorageSubSystem # Windows Server 2016

(2) Check whether any storage jobs are running

Get-StorageJob

(3) Find the UniqueId of the disk that is in an unhealthy state

Get-PhysicalDisk | FT UniqueId,MediaType,CanPool,OperationalStatus,HealthStatus

(4) Store the disk in the variable $BadDisk

$BadDisk=Get-PhysicalDisk -UniqueId PlaceyourIDhere

(5) Check this disk for errors

$BadDisk  | Get-StorageReliabilityCounter | FL *

(6) If you have errors (as found in step 5), dispatch a replacement drive. This determines whether you follow step F or step G below.
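Steps (3) and (4) have you copy the UniqueId by hand. When exactly one disk is unhealthy, a small helper can pick it for you. It also refuses to continue when zero or several disks match, so $BadDisk cannot silently hold more than one disk before you retire anything. This is a sketch. Get-SingleUnhealthyDisk is a hypothetical name, and treating any HealthStatus other than 'Healthy' as suspect is my own assumption.

```powershell
# Sketch: return the single disk whose HealthStatus is not 'Healthy'.
# Throws if no disk or more than one disk matches, so you never retire
# several disks by accident. Get-SingleUnhealthyDisk is a hypothetical name.
function Get-SingleUnhealthyDisk {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory, ValueFromPipeline)]
        [object]$Disk
    )
    begin { $suspect = [System.Collections.Generic.List[object]]::new() }
    process {
        # Assumption: anything not reporting Healthy (Warning, Unhealthy, ...) is suspect
        if ($Disk.HealthStatus -ne 'Healthy') { $suspect.Add($Disk) }
    }
    end {
        switch ($suspect.Count) {
            0 { throw 'No unhealthy disk found; check step (1) for other faults.' }
            1 { return $suspect[0] }
            default { throw "Found $($suspect.Count) unhealthy disks; select one by UniqueId instead." }
        }
    }
}
```

Usage: `$BadDisk = Get-PhysicalDisk | Get-SingleUnhealthyDisk`, then continue with step (5).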

The steps below show how to replace or repair the disk and place it back in the pool:

(A) Retire the disk

Set-PhysicalDisk -UniqueId $BadDisk.UniqueId -Usage Retired

(B) Remove the disk from the storage pool

Remove-PhysicalDisk -PhysicalDisks $BadDisk -StoragePoolFriendlyName S2D* -Confirm:$false

(C) Optimize the storage pool

Optimize-StoragePool -FriendlyName S2D*

(D) Repair the virtual disks

Get-VirtualDisk | Repair-VirtualDisk

(E) Wait for storage jobs to complete

Get-StorageJob

(F) If there were no errors in step 6, re-add the disk to the storage pool and repeat steps C through E

Add-PhysicalDisk -StoragePoolFriendlyName S2D* -PhysicalDisks $BadDisk

(G) If you had errors in step 6, enable the drive indicator so the technician can find the drive to replace

Enable-PhysicalDiskIdentification -UniqueId $BadDisk.UniqueId

(H) Check whether the indicator is on

Get-PhysicalDisk | ? IsIndicationEnabled -eq $true | FT Ser*,IsIn*
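Step (E) says to wait for storage jobs to complete. Rather than rerunning Get-StorageJob by hand, you can poll it in a loop. This sketch rests on a few assumptions. First, JobState surfaces as the state name, as it does in Get-StorageJob's default output. Second, the states Completed, Killed and Exception count as finished. Third, the 30-second interval and 4-hour timeout are arbitrary. Wait-StorageJobIdle and its -GetJob parameter are hypothetical. -GetJob exists only so the loop can be tried without a cluster.

```powershell
# Sketch: poll storage jobs until none is still in progress, or give up at a timeout.
# Assumptions: JobState converts to its state name; Completed/Killed/Exception are
# treated as finished. -GetJob is a hypothetical test hook (defaults to Get-StorageJob).
function Wait-StorageJobIdle {
    [CmdletBinding()]
    param(
        [scriptblock]$GetJob = { Get-StorageJob },
        [int]$PollSeconds = 30,
        [int]$TimeoutMinutes = 240
    )
    $finished = 'Completed', 'Killed', 'Exception'
    $deadline = (Get-Date).AddMinutes($TimeoutMinutes)
    while ($true) {
        # Jobs whose state is not one of the finished states
        $active = @(& $GetJob | Where-Object { "$($_.JobState)" -notin $finished })
        if ($active.Count -eq 0) { return $true }
        if ((Get-Date) -ge $deadline) {
            Write-Warning "$($active.Count) storage job(s) still active after $TimeoutMinutes minutes."
            return $false
        }
        # Report progress of each active job (visible with -Verbose), then wait
        $active | ForEach-Object { Write-Verbose "$($_.Name): $($_.PercentComplete)% ($($_.JobState))" }
        Start-Sleep -Seconds $PollSeconds
    }
}
```

Usage: `Wait-StorageJobIdle -Verbose` returns $true once no job is left in progress, or $false if the timeout is reached first. In that case, look at Get-StorageJob yourself before moving on.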

 

I hope this has been helpful, whether you need to replace a disk or repair a virtual disk.

 

Thank you,

Louis
