RoundRobin Load Balancing: When a node is down, all 1/N requests always fail. #312
Comments
Hi @lseelenbinder,
No problem, @AlexPikalov. After I realized what was happening, I knew it was a bit of an edge case that wouldn't be easy to replicate accidentally during testing, but it is quite common in production, because machines are always coming and going during maintenance. We're just going to revert to SingleNode and use HAProxy to load balance the actual instances, so this isn't a blocker for us going into production.
@lseelenbinder |
@AlexPikalov, that's a great idea! My only concern is keeping the ability to limit which nodes a specific config would ever connect to, regardless of added or removed nodes (even if that means it has no live nodes to talk to). |
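As a minimal sketch of that concern (this is not the cdrs API; the type names and addresses below are hypothetical), a config-level allowlist could pin the set of nodes a client may ever use, with discovery and cluster events only toggling liveness inside that set:

```rust
use std::collections::HashSet;

// Hypothetical allowlist: the client config pins which nodes it may ever
// contact; discovery/events can only toggle liveness within this set.
struct NodeAllowlist {
    allowed: HashSet<String>,
}

impl NodeAllowlist {
    fn new(allowed: impl IntoIterator<Item = String>) -> Self {
        Self { allowed: allowed.into_iter().collect() }
    }

    // Keep only discovered nodes the config permits; an empty result
    // (no permitted node is live) is a legitimate outcome here.
    fn filter<'a>(&self, discovered: &'a [String]) -> Vec<&'a str> {
        discovered
            .iter()
            .filter(|n| self.allowed.contains(n.as_str()))
            .map(String::as_str)
            .collect()
    }
}

fn main() {
    let allow = NodeAllowlist::new(["10.0.0.1:9042".to_string(), "10.0.0.2:9042".to_string()]);
    let discovered = vec!["10.0.0.2:9042".to_string(), "10.0.0.9:9042".to_string()];
    // Only 10.0.0.2 survives: it is both discovered and permitted.
    println!("{:?}", allow.filter(&discovered));
}
```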
Hi @lseelenbinder, These changes will remove dead nodes from cluster load balancing based on received … Comparing to …
So far, I've been able to find some issues with the proposed solution. Fixing it …
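A rough sketch of the general idea as I read it (assuming the fix reacts to server status-change notifications; this is not the actual change set, and the event type and addresses are made up for illustration):

```rust
use std::collections::HashMap;

// Hypothetical status-change event, loosely modeled on Cassandra's
// STATUS_CHANGE notifications; not the actual cdrs event type.
enum NodeEvent {
    Up(String),
    Down(String),
}

// Liveness view that a load balancer could consult before picking a node.
#[derive(Default)]
struct ClusterState {
    live: HashMap<String, bool>,
}

impl ClusterState {
    fn apply(&mut self, event: NodeEvent) {
        match event {
            NodeEvent::Up(addr) => { self.live.insert(addr, true); }
            NodeEvent::Down(addr) => { self.live.insert(addr, false); }
        }
    }

    fn live_nodes(&self) -> Vec<&str> {
        self.live
            .iter()
            .filter(|(_, up)| **up)
            .map(|(addr, _)| addr.as_str())
            .collect()
    }
}

fn main() {
    let mut state = ClusterState::default();
    state.apply(NodeEvent::Up("10.0.0.1:9042".into()));
    state.apply(NodeEvent::Up("10.0.0.2:9042".into()));
    state.apply(NodeEvent::Down("10.0.0.2:9042".into()));
    // Only 10.0.0.1 should remain in rotation after the Down event.
    println!("{:?}", state.live_nodes());
}
```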
Hi @AlexPikalov, Thanks for fixing this! I won't have a chance to test it for a few days, but one thing about the design is confusing to me.
Our method of using HAProxy to balance local DC nodes is working quite well, and it looks like this method would probably require us to continue doing that for the event source.
Due to how the RoundRobin(Sync) strategy is configured, whenever one of the backing nodes is down because of outages or maintenance, all requests that would be routed to that R2D2 pool fail (because the pool has no live connections and cannot create any more). This is a blocking bug for using the RoundRobin load balancing mechanism, in my opinion, since it removes all possibility of failover to another node without implementing somewhat complex logic in the client.
Was this a known limitation I overlooked, or should we look into adjusting the implementation so that the collection of known nodes is used equally and, when one is down, the others can be used?
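For illustration, here is a minimal sketch of the adjustment being asked for, in plain Rust rather than the cdrs types (struct names and addresses are hypothetical): rotate over every known node, skip any whose pool cannot currently provide a connection, and only fail when all nodes are down.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Hypothetical node handle; `available` stands in for "the r2d2 pool behind
// this node can currently hand out a live connection".
struct Node {
    addr: String,
    available: bool,
}

// Round-robin over every known node, skipping ones that are down, so that a
// single dead node no longer fails its 1/N share of requests.
struct SkippingRoundRobin {
    nodes: Vec<Node>,
    cursor: AtomicUsize,
}

impl SkippingRoundRobin {
    fn next(&self) -> Option<&Node> {
        let n = self.nodes.len();
        if n == 0 {
            return None;
        }
        // Try at most one full pass over the ring before giving up.
        for _ in 0..n {
            let i = self.cursor.fetch_add(1, Ordering::Relaxed) % n;
            if self.nodes[i].available {
                return Some(&self.nodes[i]);
            }
        }
        None // every known node is down; the caller decides how to fail
    }
}

fn main() {
    let lb = SkippingRoundRobin {
        nodes: vec![
            Node { addr: "10.0.0.1:9042".into(), available: true },
            Node { addr: "10.0.0.2:9042".into(), available: false }, // "down" node
            Node { addr: "10.0.0.3:9042".into(), available: true },
        ],
        cursor: AtomicUsize::new(0),
    };
    // The dead node is skipped instead of failing every third request.
    for _ in 0..4 {
        println!("{}", lb.next().map(|n| n.addr.as_str()).unwrap_or("no live node"));
    }
}
```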