Hi all,
I need a bit of help with a really strange problem we've got with our
SQL cluster. The setup is: 2 Windows 2003 nodes sharing a DAS, SQL
2000 Enterprise SP4. The cluster was up and running for a couple of days,
then one of our offices said they could not connect to it. This box is
held in one office, we have 5 or so other offices that connect to this
cluster, and only one of them could not connect.
So, network problems? Well, remote desktop to either node of the cluster
and we can connect to the machines in the remote office. Set up Ethereal
on a PC in the remote office, try to connect via Query Analyzer to the
cluster and we can see packets going back and forth happily with the IP
of the cluster etc., and no other traffic on the line is affected. Cue
hair being pulled.
So, we fail it over to the second node. Everything starts working
again. Please explain that one. We give up for the afternoon to save
our hair.
Skip to a few days later (today now). And it all fails again, same as
last time. Fail back to the first node, it still doesn't work; fail over to
the second, oh, it works again. For a few minutes, then it dies again.
Heeeeeelllp!!!
Treat this as you would any network communications failure. Work through
the layers in order:
Name resolution
IP path resolution (ping)
SQL client connect (Query Analyzer)
Application connect
Somewhere you will find where the glitch is. I would guess DNS cache
problems on the client, but that is just to impress everyone if it really is
the problem.
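
As a rough illustration of the first few layers in that checklist, here is a minimal Python sketch. It is only an assumption-laden example: the function name `check_layers` is made up for illustration, port 1433 is assumed to be the port the clustered instance listens on, and a plain TCP connect stands in for ping because ICMP needs raw sockets. It does not log in to SQL Server, so the Query Analyzer and application steps still need to be tested by hand.

```python
import socket
import time


def check_layers(host, port=1433, timeout=3.0):
    """Check name resolution, then a TCP connect, and report the first failure.

    host is the cluster's virtual server name (hypothetical here);
    port 1433 is assumed, adjust it if the instance listens elsewhere.
    """
    results = {"name_resolution": None, "tcp_connect_seconds": None, "failed_at": None}

    # Layer 1: name resolution (this is where a stale client DNS cache would show up).
    try:
        ip = socket.gethostbyname(host)
    except OSError as exc:
        results["failed_at"] = "name resolution: %s" % exc
        return results
    results["name_resolution"] = ip

    # Layer 2: IP path, approximated by timing a TCP connect to the SQL port.
    start = time.monotonic()
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            results["tcp_connect_seconds"] = time.monotonic() - start
    except OSError as exc:
        results["failed_at"] = "TCP connect to %s:%d: %s" % (ip, port, exc)
    return results
```

Running it from a PC in the affected office and from one in a working office, against the same virtual server name, shows whether both resolve the name to the same address and whether the connect times differ.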
Geoff N. Hiten
Senior Database Administrator
Microsoft SQL Server MVP
|||Hi Geoff,
Thanks for the advice. The weird thing about this is that other network
connections seem fine; it's just SQL that fails, and even then it's not
a total block. Name resolution seems to work fine and I can connect to
the shares on the server and copy files off, so there's nothing wrong
with the physical layer. If I try to connect using Query Analyzer, it
will reject my password with a login failed error if I enter the
wrong one, or try to open the connection if I use the correct one, so
at least at some level SQL is responding to these requests.
We had an incident yesterday where the remote office lost SQL again;
however, when using odbcping and isql, I seemed to be able to connect and run
queries. I wouldn't treat this as gospel though, as everything was a bit
rushed and confused while trying to get connectivity restored. It also seemed
that if you turned off the object browser in Query Analyzer you could
get a connection.
I'm starting to think it might be SQL Server itself, but I don't know
enough about how Query Analyzer connects to verify it, and it still
doesn't explain why the ODBC application connections fail... or does
it?
Any help appreciated!
Thanks,
Kristan
|||What do your SQL error logs say?
What is your configuration? Are you running per Processor Licenses or
CAL-based?
Do you have any concurrent connection limits configured? What does
sp_configure show?
You said the system was SP4. Have you applied the 2187 hotfix yet?
I think you need to start fresh. You say the connections are intermittent
and only for one branch. Are the lost connections always from this one
branch? If so, it is your physical layer, in the network. Check your
routers for dropped packets. Make sure no one is using AUTO NEGOTIATE for
link speed and duplex on the network.
Is there anything special about this branch? Is it the furthest away? Does
it have the slowest connection speed? Does it have the most users?
If an outage is sporadic but affects everyone equally, it is the system
resource, or at least a common point in between. If you can localize a
problem (like it is affecting only one branch or one group of users), then
it will be somewhere unique to them, but at a common point for them, like a
router at the branch site.
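
That localization rule can be written down as a tiny Python sketch. The function name `localize_fault` and the site names in the usage note are made up for illustration; the wording of the conclusions simply restates the reasoning above.

```python
def localize_fault(site_status):
    """Apply the localization rule to a map of site name -> can it connect?

    Returns a (conclusion, failing_sites) tuple. site_status values are
    True when the site can connect and False when it cannot.
    """
    failing = sorted(site for site, ok in site_status.items() if not ok)
    if not failing:
        return "no outage observed", failing
    if len(failing) == len(site_status):
        # Everyone is affected equally: look at what they all share.
        return "system resource or a common point in between", failing
    # Only some sites are affected: look at what is unique to them.
    return "somewhere unique to the failing sites, e.g. a branch router", failing
```

For example, `localize_fault({"head_office": True, "branch_a": False, "branch_b": True})` points at something unique to `branch_a`.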
Hope this helps, but without any additional data, that's about the best
we've got.
Sincerely,
Anthony Thomas
|||Thanks for your help and suggestions - SQL error logs don't indicate
anything, and we're running per-processor. No connection limits set,
and I've not applied 2187 but will shortly.
We've dug a bit deeper into this now, and it seems like it may be line
related, as we've found a few more apps (Exchange for one) that seem to
suffer a similar problem.
Connecting via ISQL while the outage is taking place works; however, when
I run something like select * from sysdatabases (which I guess is
similar to what Query Analyzer does at login if the object browser is
showing), I get only the first 10-15 databases (same place each time)
and it dies. Looking at packet traces from both ends there's loads of
retransmission and out of sequence arrivals going on, so I'm now 99%
certain it's not SQL related.
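
For anyone wanting to put numbers on that, here is a rough Python sketch that counts retransmissions and out-of-sequence arrivals from a list of (sequence number, payload length) pairs. It is only an illustration under stated assumptions: the pairs are assumed to have been exported by hand from the capture for one direction of a single TCP connection, the function name `summarize_segments` is made up, sequence-number wraparound is ignored, and the classification is a simplified heuristic rather than what Ethereal itself does.

```python
def summarize_segments(segments):
    """Classify TCP data segments as in order, retransmitted or out of order.

    segments is an iterable of (seq, length) tuples in arrival order.
    """
    seen = set()          # (seq, length) pairs that have already arrived
    highest_end = None    # highest seq + length seen so far
    counts = {"in_order": 0, "retransmission": 0, "out_of_order": 0}

    for seq, length in segments:
        if (seq, length) in seen:
            # The exact same segment arrived again.
            counts["retransmission"] += 1
        elif highest_end is not None and seq < highest_end:
            # New data that arrived after later data had already been seen.
            counts["out_of_order"] += 1
        else:
            counts["in_order"] += 1
        seen.add((seq, length))
        end = seq + length
        if highest_end is None or end > highest_end:
            highest_end = end
    return counts
```

Comparing the counts for the same query run against the cluster and against the other box on the same switch gives a like-for-like figure to hand to the network provider.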
That still doesn't explain why I can run the same query on a box on exactly
the same switch and it will work, whereas doing it on the cluster will fail.
We're working with the network providers at the moment, but they're
trying to claim that if it works on one box then it must be cluster related.
Thanks guys,
Kristan
|||Then definitely check your AUTO NEGOTIATE settings on both the switch ports
and the NICs for all players, clients and servers, for both link speed and
duplex. A mismatch there would explain why sometimes you have a problem and
sometimes you don't, for some clients and not others.
These settings can often get reset after patching the BIOS on the switches.
Sincerely,
Anthony Thomas