Extend task_fetch retries #293

lulinqing · 2023-04-20T01:59:36Z

More Beaker jobs now rely on fetching tasks from git repos, and intermittent connectivity issues cause unnecessary problems when restraint retries only a couple of times within half a minute.
It's a much bigger issue when a core task like kernelinstall is aborted for that reason: the remaining tests in the recipe become pointless and even misleading, because they run against the wrong kernel and produce false positives/negatives.
This minor change increases the number and interval of task_fetch retries, to make future jobs more tolerant of intermittent connectivity issues and to mitigate false negatives until we have a better solution for #288.

(BTW I'm okay with larger numbers, even if it may trigger watchdog timeout and abort the whole recipeset - which works in our favor.)
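
For readers unfamiliar with the mechanism, the retry behaviour described above can be sketched roughly as follows. This is a minimal Python illustration, not restraint's actual C implementation: the function name `fetch_with_retries`, the `fetch` callable, and the numeric values are all hypothetical. The thread only names the macros TASK_FETCH_RETRIES and TASK_FETCH_INTERVAL, not their values.

```python
import time

# Hypothetical values for illustration only; the real macro values
# are not quoted anywhere in this thread.
TASK_FETCH_RETRIES = 8
TASK_FETCH_INTERVAL = 10  # seconds


def fetch_with_retries(fetch, retries=TASK_FETCH_RETRIES,
                       interval=TASK_FETCH_INTERVAL, sleep=time.sleep):
    """Call fetch() until it succeeds or the retry budget is exhausted.

    Makes one initial attempt plus up to `retries` retries, sleeping
    `interval` seconds between attempts. Re-raises the last error if
    every attempt fails.
    """
    attempts = retries + 1
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError:
            if attempt == attempts:
                raise  # budget exhausted: let the caller abort the task
            sleep(interval)
```

In this sketch the worst case adds roughly `retries * interval` seconds of waiting before the task is aborted, which is the trade-off against the watchdog timeout mentioned above.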

idorax · 2023-04-20T02:44:36Z

This minor change is increasing the number and interval of task_fetch retries - to make future jobs more tolerant with intermittent connectivity issues ...

Hi @lulinqing, can we let the user specify an env variable TASK_FETCH_RETRIES and read its value from the restraint client? If that needs more effort, your patch increasing the macros TASK_FETCH_RETRIES and TASK_FETCH_INTERVAL looks good to me :-)

lulinqing · 2023-04-20T03:05:30Z

Not sure how this works in a Beaker job definition, but I’m okay with parameterizing them. Sounds like a nice feature. Meanwhile I’d still like to have the default number/interval increased asap, for the reasons described earlier, and this should not conflict with your future plan. In case it's helpful, Jeff has more details from our conversation with Jirka. Thanks!


jbastian · 2023-04-20T18:35:11Z

This minor change is increasing the number and interval of task_fetch retries - to make future jobs more tolerant with intermittent connectivity issues ...

Hi @lulinqing, can we let the user specify an env variable TASK_FETCH_RETRIES and read its value from the restraint client? If that needs more effort, your patch increasing the macros TASK_FETCH_RETRIES and TASK_FETCH_INTERVAL looks good to me :-)

I don't think a task parameter (i.e., environment variable) will work here: restraintd is started at boot and has its environment fixed at boot time. (See /proc/$(pidof restraintd)/environ) The task parameters only apply to the task itself, not the parent process (the restraintd harness).

We could, however, consider adding an environment variable to /etc/profile.d/beaker-harness-env.sh or /etc/profile.d/rh-env.sh which would impact the harness, and we could tweak the value in the Beaker kickstart script with a ks_meta variable.
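
As a rough illustration of that idea, the harness would read the value from its own environment at startup and fall back to the built-in default when it is unset or invalid. The sketch below is Python rather than restraint's C, and the helper name `retries_from_env` and the default of 3 are hypothetical; only the variable name TASK_FETCH_RETRIES comes from the discussion above.

```python
import os

DEFAULT_RETRIES = 3  # hypothetical compiled-in default, for illustration


def retries_from_env(environ=os.environ, name="TASK_FETCH_RETRIES",
                     default=DEFAULT_RETRIES):
    """Return the retry count from the harness environment, or the default.

    Falls back to `default` when the variable is unset, is not an
    integer, or is negative.
    """
    raw = environ.get(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value >= 0 else default
```

Because the lookup happens in the harness process, the variable has to be present in that process's environment when it starts, which is the point made above about task parameters not reaching restraintd.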

lulinqing · 2023-04-20T20:03:53Z

We could, however, consider adding an environment variable to /etc/profile.d/beaker-harness-env.sh or /etc/profile.d/rh-env.sh which would impact the harness, and we could tweak the value in the Beaker kickstart script with a ks_meta variable.

That's more tangible~
Given that this feature may take a while to implement (and most jobs would use the default values rather than tweak ks_meta), does it make sense to merge this PR first as a quick and transparent mitigation?
Are there any additional tests we need to go through before approval?
Thanks!

so that we can implement the Restraint RFE[1] in a more elegant way: by adding an optional attribute to the fetch element: <fetch url="http://my.download.host/path" retry="8" /> ref[1]: restraint-harness/restraint#293

so that we can implement the Restraint RFE[1] in a more elegant way: by adding an optional attribute to the fetch element: <fetch url="http://my.download.host/path" options="retry=8 timeo=8" /> ref[1]: restraint-harness/restraint#293
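
To make the second form concrete, here is a minimal sketch of how a consumer might read the proposed `options` attribute. It is an assumption-laden illustration in Python: the helper name `parse_fetch_options` is hypothetical, the space-separated key=value format is inferred only from the example above, and neither Beaker nor restraint is claimed to parse it this way.

```python
import xml.etree.ElementTree as ET


def parse_fetch_options(fetch_xml):
    """Return (url, options) from a <fetch> element string.

    `options` is a dict built from the space-separated key=value pairs
    in the optional `options` attribute; values are kept as strings.
    Pairs without an '=' or without a key are ignored.
    """
    elem = ET.fromstring(fetch_xml)
    options = {}
    for pair in elem.get("options", "").split():
        key, sep, value = pair.partition("=")
        if sep and key:
            options[key] = value
    return elem.get("url"), options
```

A single free-form `options` attribute keeps the schema change small, at the cost of the consumer having to validate each key and value itself.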

lulinqing · 2023-11-17T00:46:18Z

I was made aware this week that the internal Gerrit mirror sites which support the git:// protocol will be gone soon (along with Gerrit itself), just as happened with dist-git earlier.
Using the git protocol has been QE's main mitigation for task-fetching timeouts in Beaker jobs; the abort/failure rate will worsen after moving back to downloading a tarball of the entire repo via https per task.

While we are pulling together a restraint volunteer group to properly build/test/validate other, more sophisticated proposals like #295, can we have a quick review/approval of this 2-liner patch to improve fault tolerance against networking issues?

It should pose little to no risk of regression, yet significantly reduce the related failure rate across the board, for either git or https.
It does not overlap with other solutions such as the HTTP fetch abort_recipe_on_fail attribute (#295), even if both get merged later.

@StykMartin @cbouchar @p3ck @jbastian
Thanks a lot!

tcler mentioned this pull request Oct 7, 2023

beaker-job schema: add optional attr 'retry' to task.fetch element beaker-project/beaker#180

Closed

jbastian approved these changes Nov 17, 2023

lulinqing merged commit e5645a0 into restraint-harness:master Nov 27, 2023

lulinqing deleted the TASK_FETCH_RETRIES branch November 27, 2023 15:53
