This repository was archived by the owner on Jun 28, 2024. It is now read-only.

Conversation

fidencio
Member

and test with the image offload to the guest instead.

@fidencio
Member Author

/test

@katacontainersbot katacontainersbot added the size/large Task of significant size label Sep 26, 2023
.ci/run.sh Outdated
@@ -83,7 +83,7 @@ case "${CI_JOB}" in
info "Running Confidential Containers tests for AMD SEV-SNP"
sudo -E PATH="$PATH" CRI_RUNTIME="containerd" bash -c "make cc-snp-kubernetes"
;;
"CC_CRI_CONTAINERD_K8S"|"CC_CRI_CONTAINERD_K8S_TDX_QEMU"|"CC_CRI_CONTAINERD_K8S_SE_QEMU"|"CC_CRI_CONTAINERD_K8S_TDX_CLOUD_HYPERVISOR")
"CC_CRI_CONTAINERD_K8S"|"CC_CRI_CONTAINERD_K8S_TDX_QEMU"|"CC_CRI_CONTAINERD_K8S_SE_QEMU"|"CC_CRI_CONTAINERD_K8S_TDX_CLOUD_HYPERVISOR"|"CC_CRI_CONTAINERD_K8S_IMAGE_OFFLOAD_TO_GUEST")
Member

Do we still need the CC_CRI_CONTAINERD_K8S_IMAGE_OFFLOAD_TO_GUEST job if the 'original' ones are testing the image-offload to guest now?

Member

Oh, I've just noticed it was removed in a later commit, hence it's shown as outdated.

Comment on lines 162 to 171

# Print the logs
echo "-- Kata logs:"
sudo journalctl -xe -t kata --since "$test_start_date" -n 100000

echo "-- containerd logs:"
sudo journalctl -xe -t containerd --since "$test_start_date" -n 100000

echo "-- kubelet logs:"
sudo journalctl -xe -t kubelet --since "$test_start_date" -n 100000
Member

I think we can probably delete these debug logs (or move them into the teardown if they are useful).
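If the logs are worth keeping, moving them into the teardown could look something like the sketch below. This is only an illustration, assuming a bats suite where `$test_start_date` is set during setup; the `print_component_logs` helper name is made up here:

```shell
# Hypothetical helper: dump journald logs for one component.
print_component_logs() {
    local component="$1"
    echo "-- ${component} logs:"
    # Guarded so the helper degrades gracefully where journald is absent.
    if command -v journalctl >/dev/null 2>&1; then
        sudo journalctl -xe -t "${component}" --since "${test_start_date}" -n 100000
    fi
}

# bats runs teardown() after every test, pass or fail, so the logs
# would still be captured on failure without cluttering the test body.
teardown() {
    for component in kata containerd kubelet; do
        print_component_logs "${component}"
    done
}
```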

@fidencio fidencio force-pushed the topic/CC-switch-tests-from-forked-containerd-to-pulling-on-guest branch 2 times, most recently from 1bf9324 to bf56ce3 Compare September 26, 2023 16:05
@fidencio
Member Author

/test

@fidencio fidencio force-pushed the topic/CC-switch-tests-from-forked-containerd-to-pulling-on-guest branch from bf56ce3 to b27cbdd Compare September 26, 2023 18:08
@fidencio
Member Author

/test

@fidencio fidencio force-pushed the topic/CC-switch-tests-from-forked-containerd-to-pulling-on-guest branch from b27cbdd to a39f358 Compare September 27, 2023 08:28
@fidencio
Member Author

/test

@fidencio fidencio force-pushed the topic/CC-switch-tests-from-forked-containerd-to-pulling-on-guest branch 2 times, most recently from 90141bc to fa1115f Compare September 27, 2023 08:43
@fidencio
Member Author

/test

@fidencio fidencio force-pushed the topic/CC-switch-tests-from-forked-containerd-to-pulling-on-guest branch from fa1115f to 738ce6b Compare September 27, 2023 09:55
@fidencio
Member Author

/test

@stevenhorsman
Member

/test-ubuntu

@stevenhorsman stevenhorsman force-pushed the topic/CC-switch-tests-from-forked-containerd-to-pulling-on-guest branch from 16f558a to 5cce367 Compare September 28, 2023 13:09
@stevenhorsman
Member

/test-ubuntu

@stevenhorsman stevenhorsman force-pushed the topic/CC-switch-tests-from-forked-containerd-to-pulling-on-guest branch from 5cce367 to ba151db Compare September 28, 2023 13:28
@stevenhorsman
Member

/test-ubuntu

@stevenhorsman stevenhorsman force-pushed the topic/CC-switch-tests-from-forked-containerd-to-pulling-on-guest branch from d40de58 to 052f845 Compare September 28, 2023 15:11
@stevenhorsman
Member

/test-ubuntu

@stevenhorsman
Member

/test

@stevenhorsman
Member

/test

@stevenhorsman stevenhorsman force-pushed the topic/CC-switch-tests-from-forked-containerd-to-pulling-on-guest branch from fe851d5 to 8ee76f1 Compare September 29, 2023 13:28
@stevenhorsman
Member

/test-ubuntu

@stevenhorsman stevenhorsman force-pushed the topic/CC-switch-tests-from-forked-containerd-to-pulling-on-guest branch from 8ee76f1 to 5da3fb1 Compare September 29, 2023 14:03
@stevenhorsman
Member

/test-ubuntu

@stevenhorsman stevenhorsman force-pushed the topic/CC-switch-tests-from-forked-containerd-to-pulling-on-guest branch from 5da3fb1 to f847e0c Compare September 29, 2023 14:17
@fidencio
Member Author

fidencio commented Oct 3, 2023

TDX is not a blocker as the network has been terrible lately and we dropped the cached components due to the lack of maintenance.

@fidencio fidencio marked this pull request as ready for review October 3, 2023 15:02
@stevenhorsman
Member

@ryansavino - I think this is just waiting on SEV and SNP now. How is the investigation going, given that the SNP node has been offline for ~5 days now? Or would you prefer we merge this as is and you handle the AMD test fixes later?

@stevenhorsman
Member

/test

@stevenhorsman
Member

The SEV and SNP tests failed with:

09:01:59 make[1]: *** [tools/packaging/kata-deploy/local-build/Makefile:67: kernel-sev-tarball-build] Error 1
09:01:59 make[1]: Leaving directory '/home/jenkins/workspace/tests-CCv0-ubuntu-20.04_snp-x86_64-CC_SNP_CRI_CONTAINERD_K8S-PR/go/src/github.com/kata-containers/kata-containers'
09:01:59 make: *** [tools/packaging/kata-deploy/local-build/Makefile:97: kernel-sev-tarball] Error 2
09:01:59 [install_kata_image.sh:195] ERROR: sudo -E PATH=/home/jenkins/workspace/tests-CCv0-ubuntu-20.04_snp-x86_64-CC_SNP_CRI_CONTAINERD_K8S-PR/go/bin:/usr/local/go/bin:/usr/sbin:/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin make rootfs-initrd-sev-tarball
09:01:59 [install_kata.sh:47] ERROR: .ci/install_kata_image.sh 

and

09:06:28 ERROR: failed to solve: process "/bin/sh -c . /root/.cargo/env; cargo install cargo-when" did not complete successfully: exit code: 127
09:06:28 make: *** [Makefile:96: /home/jenkins/workspace/tests-CCv0-ubuntu-20.04_sev-x86_64-CC_SEV_CRI_CONTAINERD_K8S-PR/go/src/github.com/kata-containers/kata-containers/tools/packaging/kata-deploy/local-build/build/rootfs-initrd-sev/builddir/initrd-image/.ubuntu_rootfs.done] Error 1
09:06:29 make[1]: *** [tools/packaging/kata-deploy/local-build/Makefile:67: rootfs-initrd-sev-tarball-build] Error 2
09:06:29 make[1]: Leaving directory '/home/jenkins/workspace/tests-CCv0-ubuntu-20.04_sev-x86_64-CC_SEV_CRI_CONTAINERD_K8S-PR/go/src/github.com/kata-containers/kata-containers'
09:06:29 make: *** [tools/packaging/kata-deploy/local-build/Makefile:127: rootfs-initrd-sev-tarball] Error 2
09:06:29 [install_kata_image.sh:195] ERROR: sudo -E PATH=/home/jenkins/workspace/tests-CCv0-ubuntu-20.04_sev-x86_64-CC_SEV_CRI_CONTAINERD_K8S-PR/go/bin:/usr/local/go/bin:/usr/sbin:/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin make rootfs-initrd-sev-tarball
09:06:29 [install_kata.sh:47] ERROR: .ci/install_kata_image.sh 

so I'll retry them to see if that helps

@stevenhorsman
Member

Both SEV and SNP re-run failed with:

09:32:52 #11 [skopeo 3/4] RUN curl -fsSL "https://github.com/containers/skopeo/archive/v1.9.1.tar.gz"   | tar -xzf - --strip-components=1
09:32:52 #11 0.336 curl: (6) Could not resolve host: github.com

so it looks like network issues on AMD's side?

@ryansavino
Member

Yeah, I was hoping to see a network failure at around the same time, but it looks like they were a couple of hours apart. I've retriggered for now; I'll check up on it.

@stevenhorsman
Member

SNP failed with:

16:53:57 INFO: Run tests
16:53:57 1..1
16:55:36 not ok 1 [cc][kubernetes][containerd][snp] Test SNP unencrypted container launch success
16:55:36 # (from function `kubernetes_wait_for_pod_ready_state' in file lib.sh, line 42,
16:55:36 #  in test file confidential/snp.bats, line 80)
16:55:36 #   `kubernetes_wait_for_pod_ready_state "$pod_name" 20' failed

SEV looks like a network issue I think

@@ -14,6 +14,9 @@ UNENCRYPTED_IMAGE_URL="${IMAGE_REPO}:unencrypted"
# Text to grep for active feature in guest dmesg output
SNP_DMESG_GREP_TEXT="Memory Encryption Features active:.*SEV-SNP"

# Add sleep to give nydus snapshotter a chance to start-up as suggested by AMD folks
Member

I'm not sure how bats executes commands outside of the main test, setup, or teardown methods, and the snp test is still failing, so I would recommend moving this into the setup_file method. However, I think this check should really go in containerd_nydus_setup.sh. Maybe the SEV/SNP tests are the only things experiencing issues with this right now, but that doesn't necessarily mean that other things won't have issues later with the snapshotter not being fully initialized.
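As a rough sketch of what a readiness check in containerd_nydus_setup.sh could replace the fixed sleep with: poll for the snapshotter's gRPC socket instead. The socket path matches the one in the CI logs quoted later in this thread; the function name, default timeout, and placement are assumptions:

```shell
# Hypothetical readiness check: wait for the nydus snapshotter socket
# to appear instead of sleeping a fixed 30s.
wait_for_nydus_snapshotter() {
    local socket="${1:-/run/containerd-nydus/containerd-nydus-grpc.sock}"
    local timeout="${2:-60}"
    local waited=0
    # -S tests for a socket file; loop until it exists or we time out.
    while [ ! -S "${socket}" ]; do
        if [ "${waited}" -ge "${timeout}" ]; then
            echo "nydus snapshotter socket ${socket} not ready after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "nydus snapshotter ready after ${waited}s"
}
```

This fails loudly on timeout rather than silently proceeding, which would make "snapshotter not up yet" failures obvious in the CI logs.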

@@ -17,6 +17,9 @@ UNENCRYPTED_IMAGE_URL="${IMAGE_REPO}:unencrypted"
SEV_DMESG_GREP_TEXT="Memory Encryption Features active:.*\(SEV$\|SEV \)"
SEV_ES_DMESG_GREP_TEXT="Memory Encryption Features active:.*SEV-ES"

# Add sleep to give nydus snapshotter a chance to start-up as suggested by AMD folks
Member

Same comment here as for the snp.bats.

- AMD folks have suggested the test failures are due to the nydus
snapshotter not being started when the tests run, so they added
sleeps which seemed to help

Signed-off-by: stevenhorsman <[email protected]>
@stevenhorsman stevenhorsman force-pushed the topic/CC-switch-tests-from-forked-containerd-to-pulling-on-guest branch from 5882630 to 5949184 Compare October 4, 2023 17:53
@stevenhorsman
Member

/test

@stevenhorsman
Member

Network failures again with SEV?

19:00:25 #9 19.82 Reading package lists...
19:00:25 #9 20.51 W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease  Connection failed [IP: 185.125.190.36 80]
19:00:25 #9 20.51 W: Some index files failed to download. They have been ignored, or old ones used instead.
19:00:26 #9 20.55 Reading package lists...
19:00:26 #9 21.25 Building dependency tree...
19:00:26 #9 21.45 Reading state information...
19:00:26 #9 21.57 Some packages could not be installed. This may mean that you have
19:00:26 #9 21.57 requested an impossible situation or if you are using the unstable
19:00:26 #9 21.57 distribution that some required packages have not yet been created
19:00:26 #9 21.57 or been moved out of Incoming.
19:00:26 #9 21.57 The following information may help to resolve the situation:
19:00:26 #9 21.57 
19:00:26 #9 21.57 The following packages have unmet dependencies:
19:00:27 #9 21.68  clang : Depends: clang-10 (>= 10~) but it is not going to be installed
19:00:27 #9 21.68  g++ : Depends: g++-9 (>= 9.3.0-3~) but it is not going to be installed
19:00:27 #9 21.68  libdevmapper-dev : Depends: libudev-dev but it is not going to be installed
19:00:27 #9 21.68                     Depends: libselinux1-dev but it is not going to be installed
19:00:27 #9 21.68  libgpgme-dev : Depends: libc6-dev but it is not going to be installed
19:00:27 #9 21.70 E: Unable to correct problems, you have held broken packages.

@stevenhorsman
Member

The SNP tests failed again and the log times indicate that both 30s sleeps took effect, so I'm not sure that's the solution:

19:26:47     address = "/run/containerd-nydus/containerd-nydus-grpc.sock"
19:27:17 INFO: Run tests
19:27:17 1..1
19:27:56 not ok 1 [cc][kubernetes][containerd][snp] Test SNP unencrypted container launch success
19:27:56 # (from function `kubernetes_wait_for_pod_ready_state' in file lib.sh, line 42,
19:27:56 #  in test file confidential/snp.bats, line 77)
19:27:56 #   `kubernetes_wait_for_pod_ready_state "$pod_name" 20' failed

@ryansavino
Member

In the last SEV test run, all the tests failed. The pod event failure shows this error:

15:54:22 #   Normal   Scheduled  21s                default-scheduler  Successfully assigned default/sev-encrypted-dd9f8bbb9-dmsff to amd-coco-ci-ubuntu2004-001
15:54:22 #   Normal   Pulled     12s                kubelet            Successfully pulled image "ghcr.io/confidential-containers/test-container:multi-arch-encrypted" in 1.247186984s
15:54:22 #   Normal   Pulling    11s (x2 over 13s)  kubelet            Pulling image "ghcr.io/confidential-containers/test-container:multi-arch-encrypted"
15:54:22 #   Warning  Failed     11s (x2 over 12s)  kubelet            Error: failed to create containerd container: create instance 697: object with key "697" already exists: unknown
15:54:22 #   Normal   Pulled     11s                kubelet            Successfully pulled image "ghcr.io/confidential-containers/test-container:multi-arch-encrypted" in 396.519239ms

Has anyone seen this type of error before?

@stevenhorsman
Member

In the last SEV test run, all the tests failed. The pod event failure shows this error:

15:54:22 #   Normal   Scheduled  21s                default-scheduler  Successfully assigned default/sev-encrypted-dd9f8bbb9-dmsff to amd-coco-ci-ubuntu2004-001
15:54:22 #   Normal   Pulled     12s                kubelet            Successfully pulled image "ghcr.io/confidential-containers/test-container:multi-arch-encrypted" in 1.247186984s
15:54:22 #   Normal   Pulling    11s (x2 over 13s)  kubelet            Pulling image "ghcr.io/confidential-containers/test-container:multi-arch-encrypted"
15:54:22 #   Warning  Failed     11s (x2 over 12s)  kubelet            Error: failed to create containerd container: create instance 697: object with key "697" already exists: unknown
15:54:22 #   Normal   Pulled     11s                kubelet            Successfully pulled image "ghcr.io/confidential-containers/test-container:multi-arch-encrypted" in 396.519239ms

Has anyone seen this type of error before?

Yes, I saw an error like that when I was getting the non-TEE tests working locally. I can't remember the context though. Can you try cleaning up the images you have with crictl rmi, as that might be how I solved it?

@fitzthum

fitzthum commented Oct 6, 2023

Has anyone seen this type of error before?

I saw the same failure on one of the SEV runs last week. I'm a bit worried we're introducing a new intermittent issue.

@ryansavino
Member

The snp.bats was failing with the "object with key [] already exists" message. I cleaned up the test image on the host using the below command:

sudo crictl -r unix:///run/containerd/containerd.sock rmi ghcr.io/confidential-containers/test-container:unencrypted

This fixed the above error. I ran a full test after that and the snp.bats passed successfully. So I think we still need that sleep or a script to detect nydus snapshotter is fully up and running. My hunch is that this will fix the sev.bats as well. I've cleaned up that image on the SEV node, and I'm re-triggering the tests.
I'm not sure why the key exists error is only getting thrown in certain scenarios. It seems now that I've resolved it, I can run the test over and over and it passes. Weird.
One other thing I was going to recommend, @stevenhorsman: I noticed the pod takes more time to come up. Can we try increasing the timeout for checking whether the pod is ready as well?
https://github.com/fidencio/kata-tests/blob/594918474a1026108658d2113ca7f1af7cc29fb3/integration/kubernetes/confidential/snp.bats#L77
https://github.com/fidencio/kata-tests/blob/594918474a1026108658d2113ca7f1af7cc29fb3/integration/kubernetes/confidential/sev.bats#L128 (L128,162,193,228,263)
Maybe increase 20 to 30.
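Rather than hard-coding a new value at each call site, a small generic retry helper would make the timeout tunable in one place. A minimal sketch, assuming the suite's shell conventions; the `retry_until` name is made up, and the commented wiring into the existing `kubernetes_wait_for_pod_ready_state` call sites is only a suggestion:

```shell
# Hypothetical helper: retry a command once per second until it
# succeeds or the timeout (in seconds) is exceeded.
retry_until() {
    local timeout="$1"
    shift
    local waited=0
    until "$@"; do
        if [ "${waited}" -ge "${timeout}" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Possible wiring (assumed, not the actual lib.sh implementation):
#   retry_until 40 kubectl wait --for=condition=Ready "pod/${pod_name}" --timeout=0
# where --timeout=0 makes kubectl check the condition once per attempt.
```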

@ryansavino
Member

ryansavino commented Oct 6, 2023

SEV Test 1 failed, can we try increasing the timeout values?

@stevenhorsman
Member

SEV Test 1 failed, can we try increasing the timeout values?

Yeah, I'll try bumping them all to 40 to see if that helps

- Bump timeouts from 20 to 40s to see if that helps with tests passing.

Signed-off-by: stevenhorsman <[email protected]>
@stevenhorsman
Member

/test

Member

@stevenhorsman stevenhorsman left a comment

The bits that others did LGTM, and the tests are passing.

@fidencio
Member Author

fidencio commented Oct 9, 2023

LGTM as well, let's have it merged.

Successfully merging this pull request may close these issues.

CC | ci | stop testing with the forked containerd
5 participants