Storage is the hardware resource shared between cluster nodes, which makes it a single point of failure, and fixing a storage error requires downtime. The challenge in replacing a cluster disk is that the Windows Cluster service records the disk's signature in the registry, so you have to make the two match again after replacing the disk. I spent sleepless hours applying this kind of change; below is what I learned.
Important
Making sure the data pages of the user databases are not corrupted is VERY IMPORTANT for this kind of change. You have 2 choices:
1. Run DBCC CHECKDB as the first step, to confirm everything is OK before the change.
2. Run DBCC CHECKDB as the last step, to confirm everything is still OK after the change.
or
1. Run DBCC CHECKDB as the first step, to confirm everything is OK before the change.
2. Get the hash value of each file with any file verification tool, such as File Checksum Integrity Verifier.
3. Calculate the hash values again as the last step, to confirm no bit changed during the file transfer.
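If you don't have a file verification tool at hand, the hash comparison in the second choice can be sketched in a few lines of Python. This is only a minimal sketch under my own assumptions: the function names are made up for illustration, and SHA-256 is my pick rather than whatever algorithm your verification tool uses. The database files must be closed (SQL Server stopped) while they are hashed.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1024 * 1024):
    """Hash one file in chunks so large data files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root):
    """Map each file's path (relative to root) to its SHA-256 hash."""
    root = Path(root)
    return {
        str(path.relative_to(root)): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(before, after):
    """Return relative paths that are missing on one side or hash differently."""
    names = set(before) | set(after)
    return sorted(name for name in names if before.get(name) != after.get(name))
```

Call hash_tree on the old drive before the copy and on the new drive after the copy, then pass both results to changed_files. An empty list means every file arrived bit-for-bit identical.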
Method 1: Change Disk Signature
You need dumpcfg.exe, which can be found in the Windows 2000 Resource Kit (for Windows 2000 and 2003), or the diskpart.exe shipped with Windows 2008. Method 1 can also be used to replace the Q (quorum) drive.
1. Stop the SQL Server service.
2. Copy the data of the whole drive to a safe place. I prefer to use Robocopy.
3. Power off all the nodes except one.
4. Stop the Windows Cluster service on the remaining node.
5. Mount the new disk on the node; typically it’s a LUN in the SAN that can be accessed by all the nodes.
6. Assign the new disk a temporary drive letter.
7. Copy the data back to the new drive.
8. Remove the old disk.
9. Change the new drive's letter to the old drive letter.
10. Start the Windows Cluster service. It will fail the first time.
11. Run eventvwr to check the error. You will see "Event ID: 1034" with the message "The disk associated with cluster disk resource %DriveLetter% could not be found. The expected signature of the disk was %Disk Signature%." Write down the disk signature shown here.
12. On 32-bit Windows 2000/2003, use dumpcfg.exe to write the expected signature to the disk: dumpcfg.exe -s signature drive_no.
13. On 64-bit Windows 2003 or Windows 2008, use the diskpart shipped with Windows 2008 to write the disk signature with its UNIQUEID DISK command (UNIQUEID DISK ID=signature, after selecting the disk). http://technet.microsoft.com/en-us/library/cc753840(WS.10).aspx
14. Start the Windows Cluster service again. It should work now.
15. Start the SQL Server service.
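Copying the signature by hand from Event Viewer is easy to get wrong at 3 a.m., so here is a rough Python sketch of steps 11 to 13. It only pulls the signature out of the event text and prints the command to run; it does not touch the disk. The function names are hypothetical, I am assuming the signature appears in the message as a plain hexadecimal string, and the diskpart lines assume the "uniqueid disk id=" syntax, so check them against diskpart's own help before use.

```python
import re

# Assumption: the signature shows up as a bare hex string after this phrase.
SIGNATURE_PATTERN = re.compile(
    r"expected signature of the disk was\s+([0-9A-Fa-f]+)"
)


def expected_signature(event_message):
    """Return the signature from an Event ID 1034 message, or None if absent."""
    match = SIGNATURE_PATTERN.search(event_message)
    return match.group(1).upper() if match else None


def dumpcfg_command(signature, disk_number):
    """Command line for 32-bit Windows 2000/2003, in the form used in step 12."""
    return f"dumpcfg.exe -s {signature} {disk_number}"


def diskpart_script(signature, disk_number):
    """Lines to type into diskpart (assumed syntax, verify before running)."""
    return [
        f"select disk {disk_number}",
        f"uniqueid disk id={signature}",
    ]
```

Paste the message text from Event Viewer into expected_signature, then feed the result and the disk number shown in Disk Management to whichever of the other two functions matches your Windows version.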
Method 2: Change Registry Value
You don't need any extra tools here.
1. Run cluadmin to start Cluster Administrator. Set the “Affect Group” setting of SQL Server, SQL Agent, etc. to “false”. This avoids a failover caused by any unexpected error.
2. Stop the SQL Server service.
3. Copy the data of the whole drive to a safe place. Again, I recommend Robocopy.
4. Go back to Cluster Administrator. For SQL Server, SQL Agent, Full-Text Search, etc., select Properties -> Dependencies -> Modify. Remove all the disks, then apply the change. This makes sure the SQL Server cluster resources won’t be deleted when you drop those disks.
5. If SQL Server resources are deleted by mistake, recreate them using the action plan from http://support.microsoft.com/kb/810056.
6. Make sure all the cluster nodes are online, then delete the disk resources from Cluster Administrator.
7. Delete the disk volumes from the Disk Management tool.
8. Mount the new disk on the active node.
9. Create a new group “Test” in Cluster Administrator. Bring it online.
10. Add all the new disks to group “Test”. Bring them online.
11. Move the group “Test” to every node in turn, making sure the disk resources work well on each node.
12. Restore the SQL data that we backed up in step 3.
13. In Cluster Administrator, move the disk resources from group “Test” to the SQL group. Delete group “Test”.
14. For each resource modified in step 4, select Properties -> Dependencies -> Modify and re-add the disk resource as a dependency. Apply the change.
15. For each resource modified in step 1, change the “Affect Group” setting back to “true”.
16. Start the SQL Server service on the active node.
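Both methods depend on copying the whole drive off and back with Robocopy (step 2 of Method 1, step 3 of Method 2). Below is a small Python sketch of how I would wrap that copy so a failed run does not go unnoticed. The function names, retry counts and log path are my own choices for illustration, not anything the procedure requires. The one fact it leans on is that Robocopy's exit code is a set of bit flags, and a value of 8 or higher means at least one file or directory failed to copy.

```python
import subprocess


def robocopy_command(source, destination, log_file):
    """Build a Robocopy command that copies a whole drive with its metadata."""
    return [
        "robocopy", source, destination,
        "/E",        # include subdirectories, even empty ones
        "/COPYALL",  # copy data, attributes, timestamps, ACLs, owner, auditing
        "/R:3",      # retry a failed file 3 times (illustrative value)
        "/W:5",      # wait 5 seconds between retries (illustrative value)
        f"/LOG:{log_file}",
    ]


def robocopy_succeeded(exit_code):
    """Exit codes 0 to 7 mean no copy failures; 8 and above mean failures."""
    return 0 <= exit_code < 8


def copy_drive(source, destination, log_file):
    """Run the copy and raise if Robocopy reports any failure."""
    result = subprocess.run(robocopy_command(source, destination, log_file))
    if not robocopy_succeeded(result.returncode):
        raise RuntimeError(
            f"Robocopy failed with exit code {result.returncode}; see {log_file}"
        )
    return result.returncode
```

A successful exit code only tells you Robocopy thinks the copy went fine. It does not replace the DBCC CHECKDB or hash comparison described at the top.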
HTH.