Failover Cluster Troubleshooting

There’s nothing quite like logging in to a customer’s system first thing Monday morning only to be greeted with this:

Cluster_report

I discovered this when I wasn’t able to log into the customer’s ILINX Capture implementation. The logged error (failure to locate the SQL Server) led me to check the SQL Server’s configuration, which confirmed that its service was not running on either node of the cluster. The error I got when trying to start that service (a clustered resource could not be activated) led me, in turn, to check on the clustered resources themselves.

Just FYI, I’m a software developer by trade with just enough network admin experience to be competent at troubleshooting things like this, so forgive me if I butcher this explanation slightly: At a high level, every clustered resource has two components: the resource itself and an IP address tied to it. If either of these components fails during initialization, the entire clustered resource becomes unavailable.

I’m still kicking myself a bit for not seeing it sooner (it took a comment from a co-worker noting that, “At least the data [was] still there, if you can ever get to it” to get me thinking on the right track), but the above readout actually tells us exactly what we need to know in this case: The resources themselves were unavailable because their IP addresses failed to activate correctly. This could have been caused by a number of things, but whatever the reason, it had to be fixed quickly. In Windows Server 2012 R2, this can be handled through the Failover Cluster Manager:

Failover_cluster_UI
Failover_cluster_UI_2
Name and addresses hidden for privacy purposes
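
For anyone who prefers the shell, the same picture can be pulled up from PowerShell. This is only a minimal sketch: it assumes the FailoverClusters module is available and that the session is running on one of the cluster nodes, and it is not necessarily how the readout at the top of this post was produced.

```powershell
# Assumes the FailoverClusters module is installed (it ships with the
# Failover Clustering feature and its management tools).
Import-Module FailoverClusters

# List every clustered resource that is not currently online.
Get-ClusterResource |
    Where-Object { $_.State -ne 'Online' } |
    Format-Table Name, State, OwnerGroup, ResourceType -AutoSize

# Narrow the list to IP Address resources, the component that failed here.
Get-ClusterResource |
    Where-Object { $_.ResourceType.Name -eq 'IP Address' } |
    Format-Table Name, State, OwnerGroup -AutoSize
```

If an IP Address resource shows up as Failed or Offline in the second list, the resources in the same group that depend on it will generally be unavailable as well.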

In both the Cluster Core Resources and Roles views, we can see an IP Address value assigned. This needs to match the IP address assigned to the respective resource in DNS, and it can be set via the resource’s Properties dialog, found under its context menu:
IP_Address_UI
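
The comparison against DNS can also be sketched in PowerShell. The resource name and host name below are hypothetical placeholders (the real ones are hidden in the screenshots), and the snippet assumes the FailoverClusters and DnsClient modules are available on the node.

```powershell
# Hypothetical names for illustration; substitute your own.
$resourceName = 'Cluster IP Address'        # name of the IP Address resource
$dnsName      = 'sqlcluster.example.local'  # DNS name clients connect to

# Read the address currently configured on the clustered IP Address resource.
$clusterAddress = (Get-ClusterResource -Name $resourceName |
    Get-ClusterParameter -Name Address).Value

# Look up the A records that DNS holds for the same name.
$dnsAddresses = Resolve-DnsName -Name $dnsName -Type A |
    Where-Object { $_.Type -eq 'A' } |
    Select-Object -ExpandProperty IPAddress

if ($dnsAddresses -contains $clusterAddress) {
    Write-Output "Match: the cluster and DNS both use $clusterAddress"
} else {
    Write-Output "Mismatch: the cluster has '$clusterAddress', DNS has '$($dnsAddresses -join ', ')'"
}
```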

From here, the IP address and its subnet mask can be configured. Once that is done, the IP Address resource can be brought online, which will cascade to the other components as well. Taking a final look at Get-ClusterResource in PowerShell will confirm the change at this point:
Cluster_report_2
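
The Properties dialog steps also have a rough PowerShell equivalent. Treat this as a sketch under assumptions rather than the exact procedure I followed: the resource name, address, and subnet mask are made-up placeholders, and the session is assumed to be running on a cluster node with the FailoverClusters module loaded.

```powershell
# Hypothetical values for illustration; substitute your own.
$resourceName = 'Cluster IP Address'
$newAddress   = '192.0.2.50'      # placeholder from a documentation-only range
$subnetMask   = '255.255.255.0'

# Set the address and subnet mask on the IP Address resource in one call.
Get-ClusterResource -Name $resourceName |
    Set-ClusterParameter -Multiple @{
        Address    = $newAddress
        SubnetMask = $subnetMask
    }

# Bring the IP Address resource online.
Start-ClusterResource -Name $resourceName

# Confirm the state of every resource afterwards.
Get-ClusterResource | Format-Table Name, State, OwnerGroup -AutoSize
```

If dependent resources do not come up on their own after the IP address does, Start-ClusterGroup can be used to bring the whole role online.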

This is only one side of the coin, however. The other is when the resource itself fails (in this case, it would be either the SQL Server services or the logical disk). We’ll save that for another time though.

Jesse Kinney
Systems Engineer
ImageSource, Inc.
