Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extend task_fetch retries #293

Merged
merged 1 commit into from
Nov 27, 2023

Conversation

lulinqing
Copy link
Contributor

More beaker jobs are now relying on fetching tasks from git repos, and intermitted connectivity issue are causing unnecessary problems when restraint only tries a couple of times in half a minute.
It's a much bigger issue when core tasks like kernelinstall was aborted for that reason. The remaining tests in the recipe become pointless and even misleading due to false positive/negative from wrong kernel.
This minor change is increasing the number and interval of task_fetch retries - to make future jobs more tolerant with intermitted connectivity issue, and mitigate false negative before we have a better solution for #288 .

(BTW I'm okay with larger numbers, even if it may trigger watchdog timeout and abort the whole recipeset - which works in our favor.)

More beaker jobs are now relying on fetching tasks from git repos, and intermitted connectivity issue are causing unnecessary problems when restraint only tries a couple of times in half a minute.
It's a much bigger issue when core tasks like kernelinstall was aborted for that reason. The remaining tests in the recipe become pointless and even misleading due to false positive/negative from wrong kernel.
This minor change is increasing the number and interval of task_fetch retries - to make future jobs more tolerant with intermitted connectivity issue.
@idorax
Copy link

idorax commented Apr 20, 2023

This minor change is increasing the number and interval of task_fetch retries - to make future jobs more tolerant with intermitted connectivity issue ...

Hi @lulinqing, can we let user specify env TASK_FETCH_RETRIES and get its value from restraint client? If it needs more effort, your patch with increasing macro TASK_FETCH_RETRIES and TASK_FETCH_INTERVAL looks good me :-)

@lulinqing
Copy link
Contributor Author

lulinqing commented Apr 20, 2023 via email

@jbastian
Copy link
Contributor

This minor change is increasing the number and interval of task_fetch retries - to make future jobs more tolerant with intermitted connectivity issue ...

Hi @lulinqing, can we let user specify env TASK_FETCH_RETRIES and get its value from restraint client? If it needs more effort, your patch with increasing macro TASK_FETCH_RETRIES and TASK_FETCH_INTERVAL looks good me :-)

I don't think a task parameter (i.e., environment variable) will work here: restraintd is started at boot and has its environment fixed at boot time. (See /proc/$(pidof restraintd)/environ) The task parameters only apply to the task itself, not the parent process (the restraintd harness).

We could, however, consider adding an environment variable to /etc/profile.d/beaker-harness-env.sh or /etc/profile.d/rh-env.sh which would impact the harness, and we could tweak the value in the Beaker kickstart script with a ks_meta variable.

@lulinqing
Copy link
Contributor Author

We could, however, consider adding an environment variable to /etc/profile.d/beaker-harness-env.sh or /etc/profile.d/rh-env.sh which would impact the harness, and we could tweak the value in the Beaker kickstart script with a ks_meta variable.

That's more tangible~
Given this feature may take a while to implement (plus most jobs would go with default values - instead of tweaking ks_meta), does it make sense to merge this PR for a quick and transparent mitigation first?
Any additional test we need to go through before approved?
Thanks!

tcler added a commit to tcler/beaker that referenced this pull request Oct 7, 2023
so that we can implement the Restraint RFE[1] in more elegant way:
by add optional attribute in fetch element:
<fetch url="http://my.download.host/path" retry="8" />

ref[1]: restraint-harness/restraint#293
tcler added a commit to tcler/beaker that referenced this pull request Oct 11, 2023
so that we can implement the Restraint RFE[1] in more elegant way:
by add optional attribute in fetch element:
<fetch url="http://my.download.host/path" options="retry=8 timeo=8" />

ref[1]: restraint-harness/restraint#293
tcler added a commit to tcler/beaker that referenced this pull request Oct 11, 2023
so that we can implement the Restraint RFE[1] in more elegant way:
by add optional attribute in fetch element:
<fetch url="http://my.download.host/path" options="retry=8 timeo=8" />

ref[1]: restraint-harness/restraint#293
@lulinqing
Copy link
Contributor Author

lulinqing commented Nov 17, 2023

I was made aware this week that internal Gerrit mirror sites which support git:// protocol will be gone soon (with Gerrit itself), just like they did for dist-git earlier.
Using git protocol has been QE's main mitigation for task-fetching timeout in Beaker jobs, the abort/failure rate would worsen after moving back to downloading tarball of entire repo via https per task.

While we are pulling together a restraint volunteer group to properly build/test/validate other more sophisticated proposals like #295 , can we have some quick review/approval on this 2-liner patch to improve fault tolerance over networking issues?

  • it should pose little to none risk of regression, yet significant reduce related failure rate across board, either git or https
  • it's not overlapping with other solutions like HTTP fetch abort_recipe_on_fail attribute #295 even both get merged later.

@StykMartin @cbouchar @p3ck @jbastian
Thanks a lot!

@lulinqing lulinqing merged commit e5645a0 into restraint-harness:master Nov 27, 2023
@lulinqing lulinqing deleted the TASK_FETCH_RETRIES branch November 27, 2023 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants