Why OPC DA Services Fail
We are facing OPC DA service failures at a site that has both Advant/Safeguard and AC800M nodes combined in a redundant client/server network. The failure affected the AC800M side, and the reason for it is not known. We have separate redundant connectivity servers for AC800M and Advant/Safeguard. Unfortunately, the OPC DA service failed on both AC800M connectivity servers, even though the AC800M connectivity servers still had services running. Since the system is in production, we had to restart the servers to restore the services. Why are these OPC DA services failing? Are there any ways to mitigate this?
Multiple network disconnections are recorded, and you might want to dig into the reasons behind them; there are distinct events for servers, clients and AC800M CPUs.
I generally don't like to investigate event lists too deeply before ALL "node downs" have been explained by the site engineers. From a support perspective I can't tell whether someone was rebooting the PC, a network cable was pulled intentionally, or the network is just "bad" and needs an overhaul... all give the same "... connection to X lost" messages in the System Event/Alarm Lists.
The RNRP Utility's (aka Fault Tracer) first option can be used to scan all nodes running RNRP for errors and configuration problems. The report will also tell if you have suffered a network loop or storm.
I can only see AC800 OPC Server in CS-1 stopping. You need to visit that server and pull the session logs (C:\ABB Industrial IT Data\Control IT\OPC Server\...\logfiles\sessionXXX.log). Possibly also the Windows Event Logs.
While at it, I would also check the logs in the other AC800 OPC server node(s).
The following events seem related...
03:36:51 172.16.80.155-ConnectionError OPC Server(5500) Connection error to DA subscribed controller
03:36:55 OPCServerStopped OPC DA server stopped. The OPC Server terminated unexpectedly!
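To check whether a stop is regularly preceded by a connection error, the event list can be correlated in time. Below is a minimal Python sketch, not an ABB tool. It assumes the events have been exported as plain "HH:MM:SS message" lines like the two quoted above. The function names and the 30 second window are my own choices for illustration.

```python
from datetime import datetime, timedelta

# The two event lines quoted above. The "HH:MM:SS message" layout is
# assumed from them; a real export may carry more columns.
EVENTS = [
    "03:36:51 172.16.80.155-ConnectionError OPC Server(5500) Connection error to DA subscribed controller",
    "03:36:55 OPCServerStopped OPC DA server stopped. The OPC Server terminated unexpectedly!",
]


def parse_event(line):
    """Split 'HH:MM:SS message' into (datetime, message).

    The lines carry no date, so all events are assumed to be from the
    same day (a window crossing midnight is not handled).
    """
    stamp, _, message = line.partition(" ")
    return datetime.strptime(stamp, "%H:%M:%S"), message


def events_before_stop(lines, window_seconds=30, stop_marker="OPCServerStopped"):
    """For each stop event, collect the events seen in the preceding window."""
    parsed = [parse_event(line) for line in lines]
    window = timedelta(seconds=window_seconds)
    result = []
    for stop_time, stop_msg in parsed:
        if stop_marker not in stop_msg:
            continue
        earlier = [(t, m) for t, m in parsed
                   if timedelta(0) < stop_time - t <= window]
        result.append((stop_time, earlier))
    return result


for stop_time, earlier in events_before_stop(EVENTS):
    print("stop at", stop_time.time())
    for t, m in earlier:
        print("  ", int((stop_time - t).total_seconds()), "s earlier:", m)
```

For the two lines above this pairs the stop with the connection error logged 4 seconds before it.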
An AC800 OPC Server will report "bad" status if all connected controllers become unavailable. The 800xA OPC DA connector service will then abort itself, restart, and wait for the OPC server status to go "good" again. The session logs should be able to tell why the status was pulled down from good to bad (or whether the OPC server crashed, which should also result in a .dmp file in the session log directory).
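If there are many session logs to go through, a small script can do the first pass. This is only a sketch: it looks in one directory for .dmp files and for lines in session*.log files that contain a search term. The search terms are my own guesses, not documented OPC server log messages, so adjust them once you have seen what the logs actually contain. The directory is passed in by you, since the full path depends on the installation.

```python
from pathlib import Path

# Hypothetical search terms (an assumption, not a documented log format).
KEYWORDS = ("error", "bad", "terminated")


def scan_log_dir(log_dir, keywords=KEYWORDS):
    """Return (dump_files, hits) for one session log directory.

    dump_files: sorted names of *.dmp files found in the directory.
    hits: (file name, line number, line text) for every line in a
          session*.log file containing a keyword (case-insensitive).
    """
    log_dir = Path(log_dir)
    dumps = sorted(p.name for p in log_dir.glob("*.dmp"))
    hits = []
    for log in sorted(log_dir.glob("session*.log")):
        # errors="replace" keeps the scan going on odd byte sequences.
        text = log.read_text(errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if any(k in line.lower() for k in keywords):
                hits.append((log.name, number, line.strip()))
    return dumps, hits
```

Call it as `dumps, hits = scan_log_dir(path_to_logfiles_directory)`. A non-empty `dumps` list would suggest a crash rather than a status change.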
I could also see a few controller problems, e.g. communication lost with backup CPU.
If you can't make anything out of this yourself, please ask the regional ABB support center to file a support case for you.
It's not just the OPC DA services that are failing - the OPC AE and basic History services also fail.
There are many potential reasons for this, and without knowing a LOT more about your site and the history of these faults there's not much that we can tell you. You will need to build up a history of when the problem started, how often it is happening and what else could be occurring on the servers to cause an issue.
However, are you using any form of backup software that "snapshots" the server disks? Something like Acronis, for example?
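One simple way to start that history is to tally the failure times by hour of day. A cluster at one hour would point towards something scheduled on the servers, such as a backup job. The sketch below assumes you have collected the failure timestamps by hand. The dates in it are made-up placeholders, not data from this site.

```python
from collections import Counter
from datetime import datetime

# Placeholder timestamps for illustration only (hypothetical, not site data).
FAILURES = [
    "2024-01-08 03:36:55",
    "2024-01-15 03:40:12",
    "2024-01-19 14:05:30",
]


def failures_by_hour(stamps, fmt="%Y-%m-%d %H:%M:%S"):
    """Count failures per hour of day (0-23)."""
    return Counter(datetime.strptime(s, fmt).hour for s in stamps)


counts = failures_by_hour(FAILURES)
for hour, n in sorted(counts.items()):
    print(f"{hour:02d}:00  {n}")
```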