RoundRobin Load Balancing: When a node is down, all 1/N requests always fail. #312
Comments
Hi @lseelenbinder,
No problem, @AlexPikalov. After I realized what was happening, I knew it was a bit of an edge case that wouldn't be easy to replicate accidentally during testing, but it is quite common in production, because machines are always coming and going during maintenance. We're just going to revert to SingleNode and use HAProxy to load balance the actual instances, so this isn't a blocker for us going into production.
@lseelenbinder |
@AlexPikalov, that's a great idea! My only concern is keeping the ability to limit which nodes a specific config would ever connect to, regardless of added or removed nodes (even if that means it has no live nodes to talk to). |
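As a minimal sketch of that concern (this is not the cdrs API; the type names and addresses below are hypothetical), a config-level allowlist could pin the set of nodes a client may ever use, with discovery and cluster events only toggling liveness inside that set:

```rust
use std::collections::HashSet;

// Hypothetical allowlist: the client config pins which nodes it may ever
// contact; discovery/events can only toggle liveness within this set.
struct NodeAllowlist {
    allowed: HashSet<String>,
}

impl NodeAllowlist {
    fn new(allowed: impl IntoIterator<Item = String>) -> Self {
        Self { allowed: allowed.into_iter().collect() }
    }

    // Keep only discovered nodes the config permits; an empty result
    // (no permitted node is live) is a legitimate outcome here.
    fn filter<'a>(&self, discovered: &'a [String]) -> Vec<&'a str> {
        discovered
            .iter()
            .filter(|n| self.allowed.contains(n.as_str()))
            .map(String::as_str)
            .collect()
    }
}

fn main() {
    let allow = NodeAllowlist::new(["10.0.0.1:9042".to_string(), "10.0.0.2:9042".to_string()]);
    let discovered = vec!["10.0.0.2:9042".to_string(), "10.0.0.9:9042".to_string()];
    // Only 10.0.0.2 survives: it is both discovered and permitted.
    println!("{:?}", allow.filter(&discovered));
}
```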
Hi @lseelenbinder, These changes will remove dead nodes from cluster load balancing based on received … Comparing to …
So far, I've been able to find some issues with the proposed solution. Fixing it …
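A rough sketch of the general idea as I read it (assuming the fix reacts to server status-change notifications; this is not the actual change set, and the event type and addresses are made up for illustration):

```rust
use std::collections::HashMap;

// Hypothetical status-change event, loosely modeled on Cassandra's
// STATUS_CHANGE notifications; not the actual cdrs event type.
enum NodeEvent {
    Up(String),
    Down(String),
}

// Liveness view that a load balancer could consult before picking a node.
#[derive(Default)]
struct ClusterState {
    live: HashMap<String, bool>,
}

impl ClusterState {
    fn apply(&mut self, event: NodeEvent) {
        match event {
            NodeEvent::Up(addr) => { self.live.insert(addr, true); }
            NodeEvent::Down(addr) => { self.live.insert(addr, false); }
        }
    }

    fn live_nodes(&self) -> Vec<&str> {
        self.live
            .iter()
            .filter(|(_, up)| **up)
            .map(|(addr, _)| addr.as_str())
            .collect()
    }
}

fn main() {
    let mut state = ClusterState::default();
    state.apply(NodeEvent::Up("10.0.0.1:9042".into()));
    state.apply(NodeEvent::Up("10.0.0.2:9042".into()));
    state.apply(NodeEvent::Down("10.0.0.2:9042".into()));
    // Only 10.0.0.1 should remain in rotation after the Down event.
    println!("{:?}", state.live_nodes());
}
```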
Hi @AlexPikalov, Thanks for fixing this! I won't have a chance to test it for a few days, but one thing about the design is confusing to me.
Our method of using HAProxy to balance local DC nodes is working quite well, and it looks like this method would probably require us to continue doing that for the event source.
Due to how the RoundRobin(Sync) strategy is configured, whenever one of the backing nodes is down because of outages or maintenance, all requests that would be routed to that R2D2 pool fail (because the pool has no live connections and cannot create any more). This is a blocking bug for using the RoundRobin load balancing mechanism, in my opinion, since it removes all possibility of failover to another node without implementing somewhat complex logic in the client.
Was this a known limitation I overlooked, or should we look into adjusting the implementation so that the collection of known nodes is used equally and, when one is down, the others can be used?
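For illustration, here is a minimal sketch of the adjustment being asked for, in plain Rust rather than the cdrs types (struct names and addresses are hypothetical): rotate over every known node, skip any whose pool cannot currently provide a connection, and only fail when all nodes are down.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical node handle; `available` stands in for "the r2d2 pool behind
// this node can currently hand out a live connection".
struct Node {
    addr: String,
    available: bool,
}

// Round-robin over every known node, skipping ones that are down, so that a
// single dead node no longer fails its 1/N share of requests.
struct SkippingRoundRobin {
    nodes: Vec<Node>,
    cursor: AtomicUsize,
}

impl SkippingRoundRobin {
    fn next(&self) -> Option<&Node> {
        let n = self.nodes.len();
        if n == 0 {
            return None;
        }
        // Try at most one full pass over the ring before giving up.
        for _ in 0..n {
            let i = self.cursor.fetch_add(1, Ordering::Relaxed) % n;
            if self.nodes[i].available {
                return Some(&self.nodes[i]);
            }
        }
        None // every known node is down; the caller decides how to fail
    }
}

fn main() {
    let lb = SkippingRoundRobin {
        nodes: vec![
            Node { addr: "10.0.0.1:9042".into(), available: true },
            Node { addr: "10.0.0.2:9042".into(), available: false }, // "down" node
            Node { addr: "10.0.0.3:9042".into(), available: true },
        ],
        cursor: AtomicUsize::new(0),
    };
    // The dead node is skipped instead of failing every third request.
    for _ in 0..4 {
        println!("{}", lb.next().map(|n| n.addr.as_str()).unwrap_or("no live node"));
    }
}
```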