Hi! I've been dealing with the same problem as others, with Hazelcast reporting that it failed to execute cluster tasks within the configured timeout period. Like others, I have also seen that the expected timeout had in fact not elapsed, and that the cluster nodes sometimes appear as unavailable on the clustering page of the admin console. Finally, I have also seen messages being lost and synchronization failing for clients connected to the servers, and I am assuming this last part is just a symptom of the same problem.
A colleague of mine spotted this in the Hazelcast plugin source code.
Class: ClusteredCacheFactory
public Collection<Object> doSynchronousClusterTask(ClusterTask task, boolean includeLocalMember) {
    if (cluster == null) { return Collections.emptyList(); }
    Set<Member> members = new HashSet<Member>();
    Member current = cluster.getLocalMember();
    for(Member member : cluster.getMembers()) {
        if (includeLocalMember || (!member.getUuid().equals(current.getUuid()))) {
            members.add(member);
        }
    }
    Collection<Object> result = new ArrayList<Object>();
    if (members.size() > 0) {
        // Asynchronously execute the task on the other cluster members
        try {
            logger.debug("Executing MultiTask: " + task.getClass().getName());
            Map<Member, Future<Object>> futures = hazelcast.getExecutorService(HAZELCAST_EXECUTOR_SERVICE_NAME)
                .submitToMembers(new CallableTask<Object>(task), members);
            long nanosLeft = TimeUnit.SECONDS.toNanos(MAX_CLUSTER_EXECUTION_TIME*members.size());
            for (Future<Object> future : futures.values()) {
                long start = System.nanoTime();
                result.add(future.get(nanosLeft, TimeUnit.NANOSECONDS));
                nanosLeft = (System.nanoTime() - start);
            }
        } catch (TimeoutException te) {
            logger.error("Failed to execute cluster task within " + MAX_CLUSTER_EXECUTION_TIME + " seconds", te);
        } catch (Exception e) {
            logger.error("Failed to execute cluster task", e);
        }
    } else {
        logger.warn("No cluster members selected for cluster task " + task.getClass().getName());
    }
    return result;
}
This is the implementation the plugin uses to execute cluster tasks synchronously across the members of the cluster.
Pay attention to lines 17 and 21 of the snippet I pasted above. The nanosLeft variable is supposed to keep track of how many nanoseconds remain until the timeout period expires, and that value is used as the timeout when getting each future result (line 20). But on line 21, the current value of nanosLeft is not taken into account at all: the new value of the time left becomes System.nanoTime() - start, which is simply how long the last future took to return (lines 19 and 21), not the remaining budget.
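For what it's worth, I would have expected the loop to decrement the remaining budget rather than replace it, something along these lines (my own sketch against the variables in the snippet above, not code taken from the plugin):

// Hypothetical correction: subtract the time spent waiting on each future
// from the remaining budget instead of overwriting the budget entirely.
for (Future<Object> future : futures.values()) {
    long start = System.nanoTime();
    result.add(future.get(nanosLeft, TimeUnit.NANOSECONDS));
    nanosLeft -= (System.nanoTime() - start);
}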
Given this, I would expect some tasks to fail whenever there are multiple nodes, and the more nodes there are, the higher the chance that these tasks fail.
My setup has two nodes, and the tasks sometimes fail and sometimes don't. Considering what I just described, I think the tasks succeed when the second member executes its task faster than the first node did, and fail otherwise.
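To illustrate with a standalone simulation (hypothetical class and variable names, just reproducing the same bookkeeping outside of Openfire, and assuming a per-member budget of 30 seconds for the sake of the numbers): two tasks are submitted together, the first taking about 50 ms and the second about 200 ms, and the second get() times out because nanosLeft has collapsed to roughly the 50 ms the first get() took.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class NanosLeftSimulation {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        // Fake "members": the first task finishes quickly, the second one is slower.
        List<Long> taskMillis = List.of(50L, 200L);
        List<Future<Long>> futures = new ArrayList<>();
        for (long millis : taskMillis) {
            futures.add(pool.submit(() -> {
                Thread.sleep(millis);
                return millis;
            }));
        }
        // Same budget calculation as the plugin: seconds per member times number of members.
        long nanosLeft = TimeUnit.SECONDS.toNanos(30 * taskMillis.size());
        try {
            for (Future<Long> future : futures) {
                long start = System.nanoTime();
                System.out.println("task took " + future.get(nanosLeft, TimeUnit.NANOSECONDS) + " ms");
                // The problematic assignment: the budget becomes the time spent on the last get().
                nanosLeft = (System.nanoTime() - start);
            }
        } catch (TimeoutException te) {
            // Reached whenever the second task needs more time than the first get() took,
            // because nanosLeft collapsed to roughly the first task's duration (~50 ms).
            System.out.println("TimeoutException with nanosLeft = " + nanosLeft + " ns");
        } finally {
            pool.shutdownNow();
        }
    }
}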
My question for the Openfire community is: is the behavior I'm describing correct? Have we detected a bug in the Hazelcast clustering plugin?
Note that the Hazelcast clustering plugin was published with these changes about 7 months ago (according to GitHub), and if this really were the case, I would expect the community to have been in an uproar over an implementation that fails 50% of the time. So, I believe that I may be missing part of the picture.