Add a possibility to administratively drop a client from NATS #1023
Comments
@wutkemtt Again, in this case it would just add to the reconnecting issue discussed in the other issue. I understand that it is frustrating when a bad actor creates havoc in the NATS cluster, but a server cannot prevent a rogue client from misbehaving. It can't prevent a client from creating a TCP connection. It could try to hold that connection and do nothing with it, but the client is free to close the connection (connection timeout) and then try again, etc. You can limit the impact by vetting the applications that can connect to the cluster and giving them a username/password so that only those applications can use the cluster (again, even that does not prevent a rogue client from trying to connect without proper credentials). Keeping this open for discussion along with the other one to try to find a viable solution.
Just to explain it a little bit more: I propose a manual way for our operations personnel to drop clients from the NATS infrastructure if there is clear evidence that a client misbehaves and threatens the whole cluster. Based on alerts and information from metrics and logs, an administrator could identify these types of clients. Dropping them early would prevent further damage and hopefully a cluster outage too.
Given that the client will probably just reconnect immediately if it's administratively disconnected, and then maybe with or without an ID, simply disconnecting it won't help. What other kind of disconnect / deny method do you propose? To be of use you need to disconnect it and keep it disconnected, and keep it disconnected at a layer before the NATS server TCP port.

One option, at the NATS layer, is to set limits on the JWT. Today we have a per-account max connections limit, which might be a viable option. I don't know how feasible it is, but I imagine a per-user limit could also be supported, so you would give each service a user and then limit the number of connections it's allowed. Again, I don't know the JWT code or how feasible such a feature would be, but I am mentioning it to see if you think that would help.

Even with that, though, a badly behaving microservice can make tens of thousands of connections very rapidly, and such limits would only be evaluated and the client disconnected after the fact, only to have it reconnect again and consume more resources. So at the core it would not help that much either. Any form of limit we add would apply after a client is authenticated. In my own client code I add an exponential backoff around initial reconnects to help with this; mitigations we add on the server can only do so much.
@wutkemtt In reference to the other issue you opened, it looks to me like those two issues contradict each other. In one you want protection from a possible denial of service by clients that (re)connect a lot, but here you ask for a way to forcibly close a misbehaving client, which, as @ripienaar mentioned, would lead to just that: the client trying to reconnect right away. I have discussed at length in the other issue (#1022 (comment)) why I believe that the server unfortunately cannot take actions to prevent this issue. I also should have pointed out that for some users, having an application that connects, sends a message, then disconnects is considered normal (although this is a behavior that we discourage, we would not want to break it).
What are the possibilities in such a situation, when a client floods the NATS cluster with messages or reconnects in a harmful manner, to prevent further damage and get the cluster back online quickly?
I think the question is still: block by what means? They will just reconnect. We can't block the IP on our own to drop traffic at the TCP level. This is why in the web server world you employ a separate WAF that protects your servers from these situations. There are tools in the intrusion-detection world that can profile incoming traffic and, by means of firewall rules, drop clients making many connections; that might help in your case. We don't have experience using any of those with NATS, though I think it's something we could probably be engaged around. Ultimately that solution would be very tailored to your environment.
You can also do things like |
@wutkemtt Instead of closing, I am wondering if it would not be better in this case to put the client in a mode where the server stops reading from the socket. Since most of our client libraries use blocking sockets to send messages, it would effectively block the application at some point, and at the very least stop the flooding of the cluster.
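The idea above relies on TCP flow control: if the server stops draining a connection, the sender's writes eventually block. The Go sketch below shows that effect with an in-memory `net.Pipe` (which is unbuffered, so the effect is immediate; on a real TCP socket the kernel buffers would fill first). The names and structure are illustrative, not nats-server code.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// writeBlocksAfterStall returns true if a client write blocks (times
// out) once the server side stops reading from the connection.
func writeBlocksAfterStall() bool {
	client, server := net.Pipe()
	stall := make(chan struct{})
	done := make(chan struct{})

	go func() {
		defer close(done)
		buf := make([]byte, 1024)
		for {
			select {
			case <-stall: // operator marked this client as misbehaving
				return // stop reading: the client's writes now block
			default:
			}
			server.SetReadDeadline(time.Now().Add(10 * time.Millisecond))
			server.Read(buf) // drain whatever the client sends
		}
	}()

	client.Write([]byte("PUB ok\r\n")) // consumed by the read loop
	close(stall)
	<-done // the server side no longer reads

	client.SetWriteDeadline(time.Now().Add(100 * time.Millisecond))
	_, err := client.Write([]byte("PUB blocked\r\n"))
	return err != nil // write timed out: nobody is reading anymore
}

func main() {
	fmt.Println("write blocked after stall:", writeBlocksAfterStall())
}
```

The appeal of this approach is that the server spends no effort on the flood: the backpressure is applied by the transport itself.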
I had not thought of the standard client libraries as a possible solution vector before. I like the idea of hardening the default client implementation a little bit.
I have to be honest, this is going to be tough because if we change the library to now default to these low values unless explicitly set, then it means that users that just upgrade the library will suddenly see the performance of their app drop tremendously.
Help me understand this: for each failed connect attempt, the library currently moves the attempted URL to the end of the list. But for each server that it has already attempted to connect to, it pauses for ReconnectWait seconds. What you are suggesting is that this amount of time increases over time while it fails to connect? And to be clear, the failure to connect here is a TCP failure, or something else?
Sorry, I am not sure I understand the statement above. Could you please elaborate? Also, if easier to discuss, we can arrange for a call, or if you are on Slack, to exchange direct messages/call. Let me know. |
Our developer teams used their own reconnect implementation for the NATS connection, due to a lack of knowledge and not having read the documentation. This caused the connection issue in our cluster. In the meantime, after some discussions between the DevOps and developer teams, we are planning to provide a standardized wrapper library for the developers, which requires some mandatory options and ensures that specific parameters are set within predefined boundaries. We want to make sure that all clients use the same connection mechanism and all set a client ID, MaxInflight, and MaxPubInflight when connecting to STAN.
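Such a wrapper could enforce its mandatory options with a simple validation step before connecting. The struct, field names, and boundaries below are hypothetical, chosen only to illustrate the idea; they are not part of any NATS library.

```go
package main

import (
	"errors"
	"fmt"
)

// ConnOptions mirrors the settings a hypothetical company-wide wrapper
// would force every team to provide before it opens a streaming
// connection. Field names follow the discussion above; the boundaries
// are invented for the example.
type ConnOptions struct {
	ClientID       string
	MaxInflight    int
	MaxPubInflight int
}

// validate enforces the "predefined boundaries" before any connection
// attempt is made, so misconfigured clients fail fast and locally.
func (o ConnOptions) validate() error {
	if o.ClientID == "" {
		return errors.New("ClientID is mandatory")
	}
	if o.MaxInflight < 1 || o.MaxInflight > 1024 {
		return errors.New("MaxInflight must be between 1 and 1024")
	}
	if o.MaxPubInflight < 1 || o.MaxPubInflight > 1024 {
		return errors.New("MaxPubInflight must be between 1 and 1024")
	}
	return nil
}

func main() {
	good := ConnOptions{ClientID: "agent-1", MaxInflight: 64, MaxPubInflight: 64}
	bad := ConnOptions{MaxInflight: 64, MaxPubInflight: 64} // no client ID
	fmt.Println(good.validate()) // <nil>
	fmt.Println(bad.validate())  // ClientID is mandatory
}
```

Failing fast in the wrapper keeps the protection on the client side, which, as discussed above, is where these misbehavior problems are best addressed.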
Hello, we use NATS in our system with agents that have a persistent client ID. Sometimes, when we have an incident in production and try to restart an agent that has stopped unexpectedly, we can't, because the client ID already exists for the NATS Streaming service. Our solution is to restart the service, but that is not optimal because other publisher clients can't send messages while the service is down, and we can possibly lose some messages. Do you have a better solution? Would a way to disconnect a client via a new command be a conceivable new feature?
If a streaming client is stopped (without having a chance to call connection.Close()), then indeed the server does not know and considers this client still connected. However, if another application tries to connect with the same client ID, the server detects it as a duplicate, and part of this process is to contact the "old" client on an inbox to see if it responds. If it does, the server rejects the new client; if it does not, the server replaces the old client with the new one.
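The duplicate-ID resolution described above boils down to the following decision logic. This is an illustrative Go sketch, not the actual nats-streaming-server code; the probe function stands in for the heartbeat/inbox check.

```go
package main

import "fmt"

// registry maps a client ID to a probe standing in for the server
// pinging the existing client's inbox (true = old client still responds).
type registry map[string]func() bool

// connect applies the duplicate-ID rule: if an old client with the same
// ID still answers, reject the newcomer; otherwise (re)register the new
// client, replacing any stale entry.
func (r registry) connect(clientID string, probe func() bool) string {
	if old, dup := r[clientID]; dup && old() {
		return "duplicate client ID, connection rejected"
	}
	r[clientID] = probe
	return "connected"
}

func main() {
	r := registry{}
	fmt.Println(r.connect("agent-1", func() bool { return true })) // connected
	fmt.Println(r.connect("agent-1", func() bool { return true })) // rejected: old one answers
}
```

This is why a half-dead client whose heartbeat thread is still alive (as in the JVM case below) keeps the ID locked, while a fully dead one is replaced automatically.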
Thank you for your answer (you perfectly understood my problem). Yes, our client is down, but a NATS thread persisted in the JVM; we can see it with the Java console. Maybe it is linked to version 2.2.3, which we use, because I found this issue, which fixes a similar problem, in version 2.8.0. Do you think this version may fix my problem?
If your client is "down" but the library is still running, especially the thread that replies to heartbeats from the server, then yes, you won't be able to restart any client using that client ID. I would recommend that you try with the latest client (and server) releases (it is always a good idea to try the latest release of the lib and/or server to see if the problem is fixed). Let me know if upgrading has addressed this issue. Thanks!
[ ] Defect
[X] Feature Request or Change Proposal
Feature Requests
Use Case:
When a client threatens the health of a NATS server or cluster by opening too many connections, or drains performance by requesting messages without the MaxPubInFlight option set, it should be possible to drop this client from the server or cluster.
Proposed Change:
NATS needs a signal or a special published message to drop the offending client.
Who Benefits From The Change(s)?
All NATS server / cluster operators in a similar case
Alternative Approaches