CC | ci | stop testing with the forked containerd ... #5775
/test |
.ci/run.sh
Outdated
@@ -83,7 +83,7 @@ case "${CI_JOB}" in
 info "Running Confidential Containers tests for AMD SEV-SNP"
 sudo -E PATH="$PATH" CRI_RUNTIME="containerd" bash -c "make cc-snp-kubernetes"
 ;;
-"CC_CRI_CONTAINERD_K8S"|"CC_CRI_CONTAINERD_K8S_TDX_QEMU"|"CC_CRI_CONTAINERD_K8S_SE_QEMU"|"CC_CRI_CONTAINERD_K8S_TDX_CLOUD_HYPERVISOR")
+"CC_CRI_CONTAINERD_K8S"|"CC_CRI_CONTAINERD_K8S_TDX_QEMU"|"CC_CRI_CONTAINERD_K8S_SE_QEMU"|"CC_CRI_CONTAINERD_K8S_TDX_CLOUD_HYPERVISOR"|"CC_CRI_CONTAINERD_K8S_IMAGE_OFFLOAD_TO_GUEST")
Do we still need the CC_CRI_CONTAINERD_K8S_IMAGE_OFFLOAD_TO_GUEST
job if the 'original' ones are testing the image-offload to guest now?
Oh - I've just noticed it was removed in a later commit, hence showing outdated

# Print the logs
echo "-- Kata logs:"
sudo journalctl -xe -t kata --since "$test_start_date" -n 100000

echo "-- containerd logs:"
sudo journalctl -xe -t containerd --since "$test_start_date" -n 100000

echo "-- kubelet logs:"
sudo journalctl -xe -t kubelet --since "$test_start_date" -n 100000
I think we can probably delete these debug logs (or move them into the teardown if they are useful)
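If the logs are worth keeping, one way to move them into the bats teardown is sketched below. It assumes `test_start_date` is set during setup, as in the diff above. The helper name `print_service_logs` and the `JOURNALCTL_CMD` override are hypothetical and not part of the repository.

```shell
# Hypothetical helper (not in the repository): print the journal entries for
# each syslog tag since a given start time.
# JOURNALCTL_CMD is an assumed override, handy for dry runs; it defaults to
# the command used in the diff above.
print_service_logs() {
    local since="$1"
    shift
    local tag
    for tag in "$@"; do
        echo "-- ${tag} logs:"
        # Deliberately unquoted so the default "sudo journalctl" splits into words.
        # "|| true" keeps a log-collection failure from failing the teardown.
        ${JOURNALCTL_CMD:-sudo journalctl} -xe -t "${tag}" --since "${since}" -n 100000 || true
    done
}

# bats runs teardown after each test.
teardown() {
    print_service_logs "${test_start_date}" kata containerd kubelet
}
```

Keeping the three `journalctl` calls in one helper also makes it easy to drop or extend the list of tags later.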
1bf9324
to
bf56ce3
Compare
/test |
bf56ce3
to
b27cbdd
Compare
/test |
b27cbdd
to
a39f358
Compare
/test |
90141bc
to
fa1115f
Compare
/test |
fa1115f
to
738ce6b
Compare
/test |
/test-ubuntu |
16f558a
to
5cce367
Compare
/test-ubuntu |
5cce367
to
ba151db
Compare
/test-ubuntu |
d40de58
to
052f845
Compare
/test-ubuntu |
/test |
/test |
fe851d5
to
8ee76f1
Compare
/test-ubuntu |
8ee76f1
to
5da3fb1
Compare
/test-ubuntu |
5da3fb1
to
f847e0c
Compare
TDX is not a blocker as the network has been terrible lately and we dropped the cached components due to the lack of maintenance. |
@ryansavino - I think this is just waiting on SEV and SNP now. How is the investigation going as the SNP node has been offline for ~5 days now, or do you want us to merge this as is and you handle the AMD test fixes later? |
/test |
The SEV and SNP tests failed with:
and
so I'll retry them to see if that helps |
Both SEV and SNP re-run failed with:
so it looks like network issues on AMD's side? |
Yeah, I was hoping to see the network failures occur around the same time, but it looks like they were a couple of hours apart. I've retriggered for now and I'll check up on it. |
SNP failed with:
SEV looks like a network issue I think |
@@ -14,6 +14,9 @@ UNENCRYPTED_IMAGE_URL="${IMAGE_REPO}:unencrypted"
 # Text to grep for active feature in guest dmesg output
 SNP_DMESG_GREP_TEXT="Memory Encryption Features active:.*SEV-SNP"

+# Add sleep to give nydus snapshotter a chance to start-up as suggested by AMD folks
I'm not sure how bats executes commands outside of the main tests, setup or teardown methods. The snp test is still failing. I would recommend moving this to the setup_file method. However, I think this check should go in the containerd_nydus_setup.sh. Maybe SEV/SNP tests are the only things experiencing issues with this right now, but that doesn't necessarily mean that other things won't have issues later with the snapshotter not being fully initialized.
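A readiness check along these lines could replace the fixed sleep. This is a sketch only: `wait_for_cmd` is a hypothetical helper, and the readiness command in the usage note is an assumption about how the snapshotter is deployed, not something confirmed in this PR.

```shell
# Hypothetical helper: retry a command until it succeeds or the timeout expires.
# Usage: wait_for_cmd <timeout_seconds> <interval_seconds> <command> [args...]
wait_for_cmd() {
    local timeout="$1"
    local interval="$2"
    shift 2
    local waited=0
    until "$@"; do
        if [ "${waited}" -ge "${timeout}" ]; then
            echo "Timed out after ${waited}s waiting for: $*" >&2
            return 1
        fi
        sleep "${interval}"
        waited=$((waited + interval))
    done
}
```

From `setup_file` or `containerd_nydus_setup.sh` it could be called as `wait_for_cmd 60 2 systemctl is-active --quiet nydus-snapshotter`, assuming the snapshotter runs as a systemd unit with that name. Unlike a fixed sleep, it returns as soon as the check passes and fails with a message if it never does. An active unit does not guarantee the snapshotter is fully initialized, so a stricter check may still be needed.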
@@ -17,6 +17,9 @@ UNENCRYPTED_IMAGE_URL="${IMAGE_REPO}:unencrypted"
 SEV_DMESG_GREP_TEXT="Memory Encryption Features active:.*\(SEV$\|SEV \)"
 SEV_ES_DMESG_GREP_TEXT="Memory Encryption Features active:.*SEV-ES"

+# Add sleep to give nydus snapshotter a chance to start-up as suggested by AMD folks
Same comment here as below for the snp.bats.
- AMD folks have suggested the test failures are due to the nydus snapshotter not being started when the tests run, so they added sleeps which seemed to help Signed-off-by: stevenhorsman <[email protected]>
5882630
to
5949184
Compare
/test |
Network failures again with SEV?
|
The SNP tests failed again and the log times indicate that both 30s sleeps took effect, so I'm not sure that's the solution:
|
In the last SEV test run, all the tests failed. The pod event failure is showing this error:
Has anyone seen this type of error before? |
Yes, I saw an error like that when I was getting the non-TEE test working locally. I can't remember the context though. Can you try cleaning up the images you have with crictl rmi, as that might be how I solved it? |
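A rough sketch of that kind of cleanup, for illustration only: `remove_cached_images` and the `CRICTL_CMD` override are hypothetical names, and it is an assumption that the stale image is visible to `crictl` on the host.

```shell
# Hypothetical helper: remove every image crictl reports for a repository.
# CRICTL_CMD is an assumed override for dry runs; it defaults to "sudo crictl".
remove_cached_images() {
    local repo="$1"
    local id
    # "crictl images -q <repo>" prints only the IDs of matching images.
    for id in $(${CRICTL_CMD:-sudo crictl} images -q "${repo}"); do
        echo "Removing image ${id}"
        ${CRICTL_CMD:-sudo crictl} rmi "${id}"
    done
}
```

With the variables from the test files, this would be invoked as `remove_cached_images "${IMAGE_REPO}"`.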
I saw the same failure on one of the SEV runs last week. I'm a bit worried we're introducing a new intermittent issue. |
The snp.bats was failing with the "object with key [] already exists" message. I cleaned up the test image on the host using the below command:
This fixed the above error. I ran a full test after that and the snp.bats passed successfully. So I think we still need that sleep, or a script to detect that the nydus snapshotter is fully up and running. My hunch is that this will fix the sev.bats as well. I've cleaned up that image on the SEV node, and I'm re-triggering the tests. |
SEV Test 1 failed, can we try increasing the timeout values? |
Yeah, I'll try bumping them all to 40 to see if that helps |
- Bump timeouts from 20 to 40s to see if that helps with tests passing. Signed-off-by: stevenhorsman <[email protected]>
/test |
The bits that others did LGTM and the tests are passing
LGTM as well, let's have it merged. |
and test with the image offload to the guest instead.