Hello all,
We are having a critical issue in our production Openfire cluster: the room occupant list does not sync correctly when a node is started or restarted. We observe this issue only in CentOS / RHEL-based Openfire clusters, and we are unable to reproduce it under OSX. We have not tested this under Windows, as we don't use any Windows systems at our workplace. Here are the details:
Environment
- CentOS 6.7 (with firewalls disabled)
- Openfire 3.10.2
- Hazelcast plugin 2.1.2
- Websockets plugin 1.0.0
- Hazelcast 3.5.2
- MySQL 5.6
Scenario
1. Start the 1st Openfire node (Node-1) in the cluster.
2. Log in a few users, and have them join a room (in our tests, the room is called BEACH).
3. Start the 2nd Openfire node (Node-2).
4. Once the nodes have completed syncing, compare the "Client Sessions" page of Node-1 with that of Node-2. As expected, both show the same information.
5. Now compare the occupant lists of the BEACH room (or whichever room the users joined) on the two nodes. On Node-2, the nicknames of some or all of the room occupants are incorrect: their real nicknames have been replaced with the resource IDs of other occupants in the same room.
6. On Node-2, try to kick a room occupant who has an incorrect nickname. Openfire throws an error saying the user cannot be kicked.
7. Terminate all user sessions (through the admin console of Node-1 or Node-2). The room occupants with the incorrect nicknames still exist on Node-2 afterwards. In short, these are ghost users.
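For anyone who wants to check steps 5 to 7 without the screenshots, the comparison can be sketched roughly as below. This is only an illustration in Python, not part of our setup: the function names are made up, and the JIDs and nicknames in the usage example are placeholders, not data from our cluster. It assumes the occupant list of each node has been copied by hand into a map of full JID to nickname.

```python
def resource_of(full_jid):
    """Return the resource part of a full JID (text after the first '/')."""
    return full_jid.partition("/")[2]


def find_occupant_mismatches(node1, node2):
    """Compare {full_jid: nickname} maps from two nodes.

    Returns {full_jid: (nickname_on_node1, nickname_on_node2)} for every
    occupant whose nickname differs or who is missing on one node.
    """
    mismatches = {}
    for jid in sorted(set(node1) | set(node2)):
        nick1 = node1.get(jid)
        nick2 = node2.get(jid)
        if nick1 != nick2:
            mismatches[jid] = (nick1, nick2)
    return mismatches


def nicknames_that_are_resources(occupants):
    """Return JIDs whose nickname equals the resource ID of another occupant."""
    by_resource = {resource_of(jid): jid for jid in occupants}
    return sorted(
        jid
        for jid, nick in occupants.items()
        if nick in by_resource and by_resource[nick] != jid
    )


# Placeholder data (hypothetical): two sessions of the same user in one room.
node1 = {"mbed@example.org/res-a": "alice", "mbed@example.org/res-b": "bob"}
node2 = {"mbed@example.org/res-a": "res-b", "mbed@example.org/res-b": "bob"}

print(find_occupant_mismatches(node1, node2))
print(nicknames_that_are_resources(node2))
```

The second helper flags exactly the symptom described in step 5: a nickname on Node-2 that is really the resource ID of a different occupant.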
Please also refer to the attached screenshots:
node1.png: Node-1 sessions page and room occupants page for BEACH room.
node2.png: Node-2 sessions page and room occupants page for BEACH room. Incorrect nicknames of some occupants can be seen here.
I can reproduce these ghost users every time I run the above scenario. I use a simple JavaScript program to create user sessions via websockets. However, we have also seen this issue when using plain sockets and BOSH. I have attached the JS test program here. To run it, create a user called "mbed" with the password "mirror", and also create a room called "beach" under conference.net.
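In case it helps to see what each test session does once it is logged in, here is a rough sketch of the room join. It is not the attached JS program, just an illustration in Python of the standard MUC join presence (XEP-0045) for the "beach" room under conference.net. The helper name and the nickname "nick1" are made up for the example.

```python
import xml.etree.ElementTree as ET

# Namespace used by the MUC join request in XEP-0045.
MUC_NS = "http://jabber.org/protocol/muc"


def build_muc_join(room, service, nickname):
    """Build the presence stanza a client sends to join a MUC room.

    The occupant address has the form room@service/nickname.
    """
    presence = ET.Element("presence", {"to": f"{room}@{service}/{nickname}"})
    # The empty <x/> child in the MUC namespace signals a join request.
    ET.SubElement(presence, "x", {"xmlns": MUC_NS})
    return ET.tostring(presence, encoding="unicode")


# "nick1" is a hypothetical nickname chosen for this example.
stanza = build_muc_join("beach", "conference.net", "nick1")
print(stanza)
```

The nickname in the "to" address is what should appear in the occupant list on both nodes. On Node-2 we see a resource ID in its place.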
I have also attached our Openfire configuration (openfire.xml) and cluster configuration (cluster.xml) files.
I have also tested the above scenario with Openfire 3.10 / Hazelcast plugin 2.0.0 / Hazelcast 3.4.6 and get the same results. So, this appears to be an issue either in the plugin or in Openfire.
I would greatly appreciate it if someone could help us resolve this issue, or let us know whether this is a bug in Openfire. Please let me know if additional information is required. I am happy to share as much information as possible to nail this nasty issue.
Many thanks and kind regards,
Luki