800xA Redundant CPU failure
Hello, ABB.
I'm having some problems with a redundancy setup with PM866 on an 800xA system. This is my first time working with 800xA (my previous experience has been with Freelance DCS). I didn't install this existing system either.
The current hardware setup is the following (see attached image):
MAIN CONTROLLER
- 1 AC800M PM866
- 1 CI854A
REDUNDANT CONTROLLER
- 1 AC800M PM866
- 3 CI854A
- 1 CI873
The problem was the following:
- The operator was instructed that, whenever a certain problem occurred with a module, they should unplug it and plug it back in with the system running (I don't know exactly what this specific problem was).
- They have a separate system on which they performed this procedure, and it never caused any problems.
- When they did the same on this system, the redundant CPU went to fault and dropped all communications.
- The only way they were able to restore operations was by removing the power supply from the redundant CPU. After this, all modules were functioning properly, even the ones attached directly to the redundant CPU.
On the software side, the following errors appeared:
- "Warning on primary unit RPA, RPB, Redundant mode enabled. Unit B acts as primary Backup CPU stopped. Switchover occurred."
- There was also an IP error shown, indicating that the IP address of the redundant CPU couldn't be found.
Right now, the system is working without redundancy. If they power up the redundant CPU, the whole system drops.
My questions are:
1. Is this a correct hardware setup for redundancy? As far as I remember, you needed the main and redundant CPUs to have the exact same hardware attached to each of them.
2. Is the action of removing a communication module while the system is running in order to "reset it" a safe procedure?
3. What could have caused this issue? Is there a way to know if the redundant CPU can be salvaged?
I haven't had the chance to do some basic commissioning and test the CPU, but I need to be prepared when the user is ready to do some tests.
Thanks in advance!
Answers
I'll try to answer Q1:
The PM866 has an RCU Link via the TK851 RCU Link Cable (the thick one). In a redundant system, the two processor units are linked together with this cable. Both processor units are also connected to the same CEX-bus, using the TK850 CEX-bus extension cable (the black one) or through a BC810 module (or BC820 for longer range in version 6). Using the BC810 module instead of the cable makes it possible to use redundant communication interfaces (CI) and connected expansion units.
In your system this is not the case: there is no CI redundancy, and either of the two processor units can control the expansion units. (There does, however, seem to be Profibus cable redundancy on two of your CIs (CI854:1); both the A and the B connectors are live.)
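To make the asymmetry asked about in Q1 concrete, here is a small sketch that diffs the two module lists from the original post. The variable and function names are my own and purely illustrative. As described above, a difference like this does not by itself indicate a fault, since both processor units sit on the same CEX-bus.

```python
from collections import Counter

# Module counts exactly as listed in the original post.
main_side = Counter({"PM866": 1, "CI854A": 1})
redundant_side = Counter({"PM866": 1, "CI854A": 3, "CI873": 1})

def inventory_diff(a, b):
    """Return {module: (count_a, count_b)} for modules whose counts differ."""
    return {m: (a[m], b[m]) for m in sorted(set(a) | set(b)) if a[m] != b[m]}

diff = inventory_diff(main_side, redundant_side)
for module, (n_main, n_red) in diff.items():
    print(f"{module}: main side {n_main}, redundant side {n_red}")
```

This only lists which module types are attached unevenly; it says nothing about whether the configuration in Control Builder matches the physical layout.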
Regarding Q2, not knowing anything about your system version or application:
I cannot see any reason why removing a CI would be the correct remedy for an error unless the module itself has a hardware fault. In that case I would simply replace it. The big downside of physically removing units is the considerable risk of mechanical damage to the units or sockets, which makes things worse.
(Having said that, this method will restart communication, and if that helps, I would start by reviewing all the CI bus comms and slave configs.)
Q3:
I suggest you start with a dump of the controller log (with Control Builder); you will also get logs from the CIs. Verify that the firmware versions of your different modules are the correct ones.
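Once you have the log dump, a quick first pass is to filter it for the phrases from the alarm you quoted. Below is a minimal sketch, assuming the dump is plain text with one entry per line (I have not verified the actual log layout). The keyword list and the `find_suspect_lines` helper are hypothetical, and the sample text is just the alarm wording from the question split into lines, not real controller log output.

```python
import re

# Phrases taken from the alarm text quoted in the question; extend as needed.
KEYWORDS = ("switchover", "backup cpu stopped")

def find_suspect_lines(log_text, keywords=KEYWORDS):
    """Return (line_number, line) pairs whose text mentions any keyword."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [
        (i, line.strip())
        for i, line in enumerate(log_text.splitlines(), start=1)
        if pattern.search(line)
    ]

# Stand-in text only: the alarm wording from the post, NOT a real log format.
sample = """\
Redundant mode enabled
Unit B acts as primary
Backup CPU stopped
Switchover occurred
"""

hits = find_suspect_lines(sample)
for number, text in hits:
    print(f"line {number}: {text}")
```

In practice you would read the exported file into `log_text` and look at the entries just before the first hit, since the cause usually precedes the switchover message.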
I would also check all cables for correct termination and connection. Note that the TK851 has an "UPPER" label on the end that should be connected to the upper (primary) controller (PM866). This cable has a minor design flaw: it is stiff relative to the small (and weak) connector surfaces.
BR
1. Is this a correct hardware setup for redundancy? As far as I remember, you needed the main and redundant CPUs to have the exact same hardware attached to each of them.
>>> You can keep the existing setup.
2. Is the action of removing a communication module while the system is running in order to "reset it" a safe procedure?
>>> Not at all. It is unsafe and can potentially damage the hardware.
3. What could have caused this issue? Is there a way to know if the redundant CPU can be salvaged?
>>> Please attach the controller logs for further evaluation.
BR
ATP
From your question, it seems that you have not checked the online hardware status.
Check the logs first, and then the hardware status in Control Builder.
Check what the latched status is showing.
Also, removing a CI854 while the system is running is NOT recommended at all.