High CPU utilization by AC800M OPC server/Connectivity server communication breaks
I am describing both problems in one topic because I am not sure which one is the root cause.
Currently we are experiencing periodic short-term breaks in communication with one of our connectivity servers (a few seconds, several times per hour), observed as lost ping packets and Node_down/Node_up events in the RNRP monitor.
Also, at the moments the communication breaks occur, we observe a dramatic increase in CPU utilization by the AC800M OPC server.
In its log file the OPC server reports the following:
I 2017-03-31 12:32:50.620 OPCAE - End of subscription. OPC client Id = 0 . SaveSub not enabled.
I 2017-03-31 12:32:55.766 OPCAE - Start of subscription. OPC client Id = 0 . SaveSub not enabled.
Also, in about 90% of cases, along with the Node_down/Node_up events for the connectivity server we observe Node_down/Node_up events for one of our operator workstations.
The RNRP fault tracer shows some strange errors related to the connectivity server (172.16.80.111/22) and the workstation (172.16.80.207/22); note that the area and node numbers appear swapped in the error messages:
Errors found in 172.16.5.72
Hosts file update got reply from unknown node=1, area=207, Has address changed ?
Hosts file update got reply from unknown node=1, area=111, Has address changed ?

Please advise.
Going by the behavior described, and the fact that a restart of the connectivity server resolved the issue, it looks like the issue may have been caused by faulty/failed MMS connections from the connectivity server to the controllers. Restarting the connectivity server terminated the old connections and established fresh, successful connections to the controllers. As for what caused the failed connections in the first place, the most likely reason is some network disturbance that left some connections in a faulty state.
This is a little bit "chicken or egg": a high CPU load can break the OPC server, or the OPC server restarting after a failure can cause the high CPU load.
If restarting the CS does not resolve the issue, you could start by enabling the Windows performance counters and try to track down which process(es) are causing the high CPU load, and when exactly the problems occur, i.e. before or after the restart. If the CPU load only appears after the MMS fault, then you can eliminate CPU loading as the cause.
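As one way to do this with the built-in Windows tools (assuming typeperf is available, as it is on standard Windows installations), you could log per-process CPU to a CSV and correlate the timestamps with the Node_down/Node_up events:

```shell
:: Sample "% Processor Time" for every process every 5 seconds,
:: 720 samples (about 1 hour), written to a CSV for later inspection.
typeperf "\Process(*)\% Processor Time" -si 5 -sc 720 -o cpu_load.csv -f CSV
```

Opening the CSV in Excel and sorting by the AC800M OPC server process column should show whether its CPU spikes lead or lag the communication breaks.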
Also, I see a range of area numbers coming up in your RNRP error logs. You should work through your RNRP configuration to make sense of the error messages you're getting, and check that the IP addresses, area numbers, node numbers etc. make sense. Depending on whether you use implicit or explicit addressing, the messages should give you some hints about what is happening, e.g. is it one network card or the other, does RNRP calculate the area and node numbers correctly, etc.
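As a sanity check you can compute the expected area/node split by hand. The sketch below assumes the common 800xA implicit-addressing convention, where the third octet of a 172.16.x.y address encodes 4 × area plus the network path number (0 = primary, 1 = secondary) and the fourth octet is the node number; verify this against your own RNRP configuration before relying on it.

```python
def rnrp_area_node(ip: str) -> tuple[int, int, int]:
    """Split a 172.16.x.y address into (area, path, node).

    Assumes implicit addressing: third octet = 4 * area + path,
    fourth octet = node number.
    """
    octets = [int(o) for o in ip.split(".")]
    area, path = divmod(octets[2], 4)
    node = octets[3]
    return area, path, node

# The two nodes from the fault tracer output:
print(rnrp_area_node("172.16.80.111"))  # connectivity server -> (20, 0, 111)
print(rnrp_area_node("172.16.80.207"))  # workstation         -> (20, 0, 207)
```

Under this convention both nodes decode to area 20 on the primary network, so messages reporting node=1, area=207 do look like the fields are being decoded in the wrong order, as you noted.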