NRI plugin add config failPolicy, e.g. skip, fail #142

Open · lengrongfu opened this issue Jan 2, 2025 · 13 comments

@lengrongfu

What is the problem you're trying to solve

When an NRI plugin is shut down, containerd does not reject container creation. If the plugin's CreateContainer method carries important logic, such as injecting devices into the container, some containers may be missed while the NRI plugin restarts.

Describe the solution you'd like

Therefore, can we configure a failPolicy for each NRI plugin, with values such as skip and fail? When it is set to skip, an unavailable or restarting NRI plugin is ignored; when it is set to fail, container creation is refused.
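
For illustration, a minimal sketch of what such a hypothetical per-plugin setting could look like, expressed as a Go config type; neither the field nor its values exist in NRI or containerd today:

```go
// Hypothetical per-plugin NRI configuration; failPolicy does not exist today.
type PluginConfig struct {
	// FailPolicy controls what happens when this plugin is unavailable
	// (not registered, disconnected, or timed out) at CreateContainer time:
	//   "skip": proceed without the plugin (today's behavior)
	//   "fail": reject container creation
	FailPolicy string `toml:"fail_policy"`
}
```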

Additional context

No response

@lengrongfu (Author)

@fuweid @klihub Hello, what do you think?

@klihub (Member) commented Jan 10, 2025

/cc @kad

@klihub
Copy link
Member

klihub commented Jan 10, 2025

@lengrongfu Just thinking aloud about alternative existing possibilities, your plugin could annotate containers it has processed, and if it restarts (due to a timeout or other reasons) it could check if some critical container is missing that annotation and then use other means to restart it.
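
A minimal sketch of this idea using the NRI plugin stub (github.com/containerd/nri/pkg/stub); the plugin name and annotation key below are made up for illustration:

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/nri/pkg/api"
	"github.com/containerd/nri/pkg/stub"
)

type plugin struct{}

// CreateContainer performs the actual adjustment (device injection is
// elided here) and leaves a marker annotation, so that after a restart
// the plugin can tell which containers it has already processed. The
// annotation key is made up for this sketch.
func (p *plugin) CreateContainer(_ context.Context, pod *api.PodSandbox, ctr *api.Container) (*api.ContainerAdjustment, []*api.ContainerUpdate, error) {
	adjust := &api.ContainerAdjustment{}
	// ... inject devices, mounts, env, etc. into adjust here ...
	adjust.AddAnnotation("example.org/processed-by", "device-injector")
	return adjust, nil, nil
}

func main() {
	// On (re)registration the plugin is synchronized with all existing
	// pods and containers; that is where it could look for critical
	// containers missing the marker annotation and act on them.
	s, err := stub.New(&plugin{}, stub.WithPluginName("device-injector"), stub.WithPluginIdx("10"))
	if err != nil {
		log.Fatalf("failed to create plugin stub: %v", err)
	}
	if err := s.Run(context.Background()); err != nil {
		log.Fatalf("plugin exited: %v", err)
	}
}
```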

@klihub (Member) commented Jan 10, 2025

@lengrongfu Also, I think this should be filed against the core NRI repo, not containerd itself.

@lengrongfu (Author)

> @lengrongfu Just thinking aloud about alternative existing possibilities, your plugin could annotate containers it has processed, and if it restarts (due to a timeout or other reasons) it could check if some critical container is missing that annotation and then use other means to restart it.

What you describe would address the problem, but it has a drawback: if we only check after the plugin restarts whether a container contains what we injected, there is a window during which the user's container runs without the injected content.

In our scenario, if we cannot inject at container creation time, we don't want the container to start at all.

@lengrongfu (Author)

> @lengrongfu Also, I think this should be filed against the core NRI repo, not containerd itself.

My understanding is that the design of NRI is somewhat similar to Kubernetes webhooks, which mutate the pod YAML; so this control logic should live in containerd.

@klihub (Member) commented Jan 10, 2025

> > @lengrongfu Also, I think this should be filed against the core NRI repo, not containerd itself.
>
> My understanding is that the design of NRI is somewhat similar to Kubernetes webhooks, which mutate the pod YAML; so this control logic should live in containerd.

I still think this belongs in core NRI. If we implemented something like this:

  1. we'd want identical behavior across all runtimes (IOW, both containerd and CRI-O),
  2. we'd probably implement the configuration setting as a per-plugin parameter passed to NRI core/the runtime during plugin registration,
  3. we'd probably implement the logic for interpreting/enforcing such a setting while collecting the combined NRI container adjustment, returning an error if a plugin with such a setting timed out and was disconnected.

And if we did so, we probably wouldn't need any changes in the runtime-specific NRI integration code, only in the common code in NRI itself. So based on this, it looks more like an NRI feature than a containerd/runtime one to me.
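
Purely to illustrate points 2 and 3, a hypothetical sketch of what such enforcement might look like in NRI's common request path; none of these types or names exist in NRI today, and stand-in types are used so the sketch is self-contained:

```go
package nri

import (
	"context"
	"fmt"
)

// FailPolicy is hypothetical; it would be passed to NRI core by each
// plugin during registration.
type FailPolicy int

const (
	FailPolicySkip FailPolicy = iota // tolerate a failed/disconnected plugin (today's behavior)
	FailPolicyFail                   // reject the request if this plugin cannot process it
)

// adjustment and pluginConn are stand-ins for NRI core's combined
// container adjustment and per-plugin connection state.
type adjustment struct{ annotations map[string]string }

type pluginConn struct {
	name            string
	failPolicy      FailPolicy
	createContainer func(ctx context.Context) (*adjustment, error)
}

// createContainer collects the combined adjustment from all plugins in
// order. A failure (e.g. a ttrpc timeout that disconnects the plugin)
// of a FailPolicyFail plugin becomes an error, which the runtime would
// turn into a failed container creation.
func createContainer(ctx context.Context, plugins []*pluginConn) (*adjustment, error) {
	combined := &adjustment{annotations: map[string]string{}}
	for _, p := range plugins {
		adjust, err := p.createContainer(ctx)
		if err != nil {
			if p.failPolicy == FailPolicyFail {
				return nil, fmt.Errorf("required NRI plugin %q failed: %w", p.name, err)
			}
			continue // FailPolicySkip: ignore this plugin's failure
		}
		if adjust == nil {
			continue // plugin had nothing to adjust
		}
		// Merge into the combined adjustment (only annotations in this sketch).
		for k, v := range adjust.annotations {
			combined.annotations[k] = v
		}
	}
	return combined, nil
}
```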

@lengrongfu (Author)

Once we have discussed this thoroughly, we can track it in the NRI repo.

> we'd probably implement the configuration setting as a per-plugin parameter passed to NRI core/the runtime during plugin registration,

This is a paradox: if the NRI plugin is not started, it will not be registered with the runtime, and the runtime will not know that there is an NRI plugin that must be called.

@klihub (Member) commented Jan 13, 2025

> Once we have discussed this thoroughly, we can track it in the NRI repo.
>
> > we'd probably implement the configuration setting as a per-plugin parameter passed to NRI core/the runtime during plugin registration,
>
> This is a paradox: if the NRI plugin is not started, it will not be registered with the runtime, and the runtime will not know that there is an NRI plugin that must be called.

Okay, sorry, I misunderstood a bit what you were after. So you would like to ensure that some plugin or plugins are always present and running when some set of special containers comes around, and that these plugins are able to adjust those special containers. And if, for any reason, that does not happen, you'd like to fail the creation of the container altogether.

One way to achieve that would be to have a 'verification' plugin as the last one in the chain of plugins (maybe installed locally), which would identify your special containers and check that all the necessary plugins have adjusted them (for instance, by checking that each plugin has annotated the container), returning an error when this check fails. This would then cause the creation of the container to fail.
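
A minimal sketch of such a verification plugin, assuming (as the suggestion implies) that annotations added by earlier plugins in the chain are visible to later ones; the label, annotation key, and plugin name are made up:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/containerd/nri/pkg/api"
	"github.com/containerd/nri/pkg/stub"
)

// requiredMarkers lists the (made-up) annotations each required plugin
// is expected to have left on a special container.
var requiredMarkers = []string{"example.org/processed-by"}

type verifier struct{}

// CreateContainer fails creation of a "special" container unless every
// required marker annotation is present. Returning an error here makes
// NRI fail the request, which makes the runtime fail the creation.
func (v *verifier) CreateContainer(_ context.Context, pod *api.PodSandbox, ctr *api.Container) (*api.ContainerAdjustment, []*api.ContainerUpdate, error) {
	if pod.Labels["example.org/special"] == "" { // identify special containers; this label is made up
		return nil, nil, nil
	}
	for _, key := range requiredMarkers {
		if _, ok := ctr.Annotations[key]; !ok {
			return nil, nil, fmt.Errorf("container %s is missing required adjustment %q", ctr.Name, key)
		}
	}
	return nil, nil, nil
}

func main() {
	// A high plugin index makes this plugin run last in the chain.
	s, err := stub.New(&verifier{}, stub.WithPluginName("verifier"), stub.WithPluginIdx("99"))
	if err != nil {
		log.Fatalf("failed to create plugin stub: %v", err)
	}
	if err := s.Run(context.Background()); err != nil {
		log.Fatalf("plugin exited: %v", err)
	}
}
```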

@lengrongfu (Author)

Your understanding is correct, but we can simplify the problem: if all containers must be adjusted by a certain NRI plugin, and the adjustment is missing or wrong, the container should be prevented from being created. A missing or wrong adjustment can have the following causes:

  1. The NRI plugin is not started.
  2. containerd fails to call the NRI plugin over ttrpc.
  3. The NRI plugin's processing logic has an error.

In any case where the adjustment fails, I want to prevent the creation of the container on the containerd side.

@klihub (Member) commented Jan 15, 2025

> Your understanding is correct, but we can simplify the problem: if all containers must be adjusted by a certain NRI plugin, and the adjustment is missing or wrong, the container should be prevented from being created. A missing or wrong adjustment can have the following causes:
>
> 1. The NRI plugin is not started.
> 2. containerd fails to call the NRI plugin over ttrpc.
> 3. The NRI plugin's processing logic has an error.
>
> In any case where the adjustment fails, I want to prevent the creation of the container on the containerd side.

@lengrongfu So, keeping in mind that if an NRI plugin returns an error to NRI itself, NRI will return an error to containerd and that will prevent the creation of the container, would the above suggestion work for you?

@yylt (Contributor) commented Feb 8, 2025

Could we agree on an annotation that is specially processed by core NRI and that requires a specified plugin to execute, failing the request otherwise?

This annotation could be added by a Kubernetes webhook.

@samuelkarp transferred this issue from containerd/containerd Feb 10, 2025
@aojea (Contributor) commented Feb 16, 2025

@klihub this is also a request from networking integrations, because they want to block pod creation until the NRI network plugin is ready. How it works today with CNI: if the specific CNI config file does not exist, the pod is not created, same as the old NRI model in v0.1.0 (https://github.com/containerd/nri/blob/main/README-v0.1.0.md). For the same reasons NRI evolved to a new model, network integrations want to move to this service-oriented model. There are several proposals and demand from the community for a CNI 2.0 (https://github.com/containernetworking/cni/issues?q=is%3Aissue%20state%3Aopen%20grpc), but it is not realistic that it is going to happen; hence NRI ticks all of these boxes.

> This is a paradox: if the NRI plugin is not started, it will not be registered with the runtime, and the runtime will not know that there is an NRI plugin that must be called.

And if the plugin is deployed as a DaemonSet, you have a deadlock, because you need to run a container to register the plugin.

Let me step back and try to dump my thoughts:

  1. We want to deploy additional functionality related to the Pod/container lifecycle to Kubernetes based on NRI plugins, because these present the right integration points.
  2. Some of this functionality MUST be required to run in the system, so we need a way to enforce that an NRI plugin is always executed when desired, and to fail closed if it times out.

Problem:

NRI plugins are commonly deployed as DaemonSets, so making a plugin required means distinguishing between different phases of node configuration/initialization. This is how the Kubernetes node startup process goes:

  1. The VM is started.
  2. The kubelet and container runtime are installed and configured.
  3. The node is tainted by the kubelet until the container runtime reports via the CRI API that it is ready (the NetworkReady and RuntimeReady conditions).
  4. NetworkReady blocks pods with networking but allows pods with hostNetwork to run.

I can't remember exactly right now what the effect of RuntimeReady is.

> Could we agree on an annotation that is specially processed by core NRI and that requires a specified plugin to execute, failing the request otherwise?
>
> This annotation could be added by a Kubernetes webhook.

We should try to avoid designing a system based on annotations. We need to find a good integration model, and for that the CRI API is the communication channel between Kubernetes and the runtimes; if we start taking shortcuts, we'll end up creating complex, custom systems that behave differently depending on the integration, killing all the benefits of standardization.

Some ideas:

  • Check whether we can use the existing CRI runtime conditions to influence the node status based on NRI plugin state.
  • Allow an NRI plugin state or configuration mode that marks some plugins as "Required", reports this to the kubelet, and also fails closed.
