
Packer hangs while running shell provisioners #11407

Open
aviadtamir opened this issue Nov 21, 2021 · 6 comments

aviadtamir commented Nov 21, 2021

Hi,
Packer is hanging while I'm running shell provisioners.
The hang happens on two different qemu builds and with two different scripts. The one thing those scripts have in common is that they produce a lot of output to stdout, which might be the cause (based on the former, closed issue #4627).
The same scripts also run on amazon-ebs builds, where this issue hasn't reproduced; it only occurs on qemu builds.
I couldn't find anything in the log file; everything looks fine except that the script never exits.

I also tried two different Packer versions, 1.7.8 and 1.5.5, and the hang occurred on both.
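
If the excessive stdout really is the trigger, one workaround worth trying (just a sketch; the log path is arbitrary) is to mute the script's output on the guest side instead of streaming it back through the communicator:

# at the top of the chatty provisioner script: send all further output
# to a file on the guest rather than back through Packer
exec > /tmp/provision.log 2>&1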
@SwampDragons - I saw that you commented on the issue I mentioned above.

Thanks

aviadtamir added the bug label Nov 21, 2021

ladar commented Apr 5, 2022

I'm seeing a similar, possibly related issue. A shell provisioner will fail, but instead of aborting the build, Packer hangs, even past the configured timeout. I've seen variations of this with other scripts and box configs, where a build pipeline hangs indefinitely.

2022/03/30 10:16:21 ui error: ==> magma-arch-virtualbox: + printf 'alias vi=vim\n'
2022/03/30 10:16:21 ui error: ==> magma-arch-virtualbox: + reboot
2022/03/30 12:15:42 ui error: ==> magma-arch-virtualbox: Cancelling provisioner after a timeout...
2022/04/05 00:46:05 packer-builder-virtualbox-iso plugin: Executing VBoxManage: []string{"controlvm", "magma-arch-virtualbox", "poweroff"}
2022/04/05 00:46:06 ui: ==> magma-arch-virtualbox: Cleaning up floppy disk...
2022/04/05 00:46:06 packer-builder-virtualbox-iso plugin: Executing VBoxManage: []string{"storageattach", "magma-arch-virtualbox", "--storagectl", "IDE Controller", "--port", "0", "--device", "1", "--type", "dvddrive", "--medium", "none"}
2022/04/05 00:46:06 ui: ==> magma-arch-virtualbox: Deregistering and deleting VM...
2022/04/05 00:46:06 packer-builder-virtualbox-iso plugin: Executing VBoxManage: []string{"unregistervm", "magma-arch-virtualbox", "--delete"}

The provisioner has a timeout of 120m, and if you look at the log, you'll see that at roughly the 120-minute mark it says "Cancelling provisioner after a timeout...", but instead of aborting the box build or moving on to the next step, Packer hangs until I come along several days later and abort it manually with Ctrl-C.

This is with Packer 1.6.6 and VirtualBox 5.2.44. At one point Packer had issues with VirtualBox 5.2, trying to use an invalid command-line option; I don't know if that's been fixed, but if so I can try updating to a newer Packer version.

The box/config that triggered the issue this time was the magma-arch-virtualbox config from the magma-virtualbox.json file.

If you want to try running it, clone the robox repo and run: ./robox.sh box magma-arch-virtualbox ...

I've seen this issue pop up for a while now, but it occurs rather inconsistently, so it's been hard to pick up on any pattern.


ladar commented May 28, 2022

I might have a fix for this. I'm testing it now, but since the hangs are intermittent, it might be a while before I'm confident about the cause/workaround.

Not posting this update to tease anyone, but to keep the nasty hashicorp robots from closing this issue prematurely.


ladar commented Jul 29, 2022

So far my fix has worked. It turns out that whenever the host renewed its DHCP lease, it triggered a dispatcher event, even when the renewal didn't result in the host IP changing. Regardless of the outcome, the renewal event was causing Packer to lose track of the guest IP address. I don't recall at this point whether the guest IP was actually changing or VirtualBox simply lost track of it, but either way, disabling the following dispatchers seems to have fixed the issue.

Please note: on my CentOS 7 server, which does not have a graphical desktop, nothing important is triggered by the dispatcher event, so it was safe to run the following commands. Your environment may be different.

systemctl disable dbus-org.freedesktop.nm-dispatcher.service
systemctl disable NetworkManager-dispatcher.service 
systemctl mask dbus-org.freedesktop.nm-dispatcher.service
systemctl mask NetworkManager-dispatcher.service
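
If you ever need the dispatchers back, the change is easy to reverse, and journalctl can confirm whether lease renewals were actually firing dispatcher runs (a sketch, using the same unit names as above):

# check whether DHCP renewals have been triggering dispatcher runs
journalctl -u NetworkManager-dispatcher.service --since "1 week ago"

# undo the workaround if the dispatchers are needed again
systemctl unmask NetworkManager-dispatcher.service dbus-org.freedesktop.nm-dispatcher.service
systemctl enable NetworkManager-dispatcher.service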


aj-cruz commented Oct 31, 2022

I'm having hanging issues with a shell provisioner as well. Mine hangs after "starting remote command" about 25% of the time.
It is happening on both Packer v1.8.3 and v1.8.4.

==> vsphere-clone.Nautobot: Provisioning with shell script: ./provision_scripts/build/install_docker.sh
2022/10/31 07:03:44 packer-provisioner-shell plugin: Opening ./provision_scripts/build/install_docker.sh for reading
2022/10/31 07:03:44 packer-provisioner-shell plugin: [INFO] 1376 bytes written for 'uploadData'
2022/10/31 07:03:44 packer-plugin-vsphere_v1.0.8_x5.0_linux_amd64 plugin: 2022/10/31 07:03:44 [DEBUG] Opening new ssh session
2022/10/31 07:03:44 [INFO] 1376 bytes written for 'uploadData'
2022/10/31 07:03:44 packer-plugin-vsphere_v1.0.8_x5.0_linux_amd64 plugin: 2022/10/31 07:03:44 [DEBUG] Starting remote scp process:  scp -vt /tmp
2022/10/31 07:03:44 packer-plugin-vsphere_v1.0.8_x5.0_linux_amd64 plugin: 2022/10/31 07:03:44 [DEBUG] Started SCP session, beginning transfers...
2022/10/31 07:03:44 packer-plugin-vsphere_v1.0.8_x5.0_linux_amd64 plugin: 2022/10/31 07:03:44 [DEBUG] Copying input data into temporary file so we can read the length
2022/10/31 07:03:44 packer-plugin-vsphere_v1.0.8_x5.0_linux_amd64 plugin: 2022/10/31 07:03:44 [DEBUG] scp: Uploading script_1669.sh: perms=C0644 size=1376
2022/10/31 07:03:44 packer-plugin-vsphere_v1.0.8_x5.0_linux_amd64 plugin: 2022/10/31 07:03:44 [DEBUG] SCP session complete, closing stdin pipe.
2022/10/31 07:03:44 packer-plugin-vsphere_v1.0.8_x5.0_linux_amd64 plugin: 2022/10/31 07:03:44 [DEBUG] Waiting for SSH session to complete.
2022/10/31 07:03:44 packer-plugin-vsphere_v1.0.8_x5.0_linux_amd64 plugin: 2022/10/31 07:03:44 [DEBUG] scp stderr (length 71): Sink: C0644 1376 script_1669.sh
2022/10/31 07:03:44 packer-plugin-vsphere_v1.0.8_x5.0_linux_amd64 plugin: scp: debug1: fd 0 clearing O_NONBLOCK
2022/10/31 07:03:44 packer-plugin-vsphere_v1.0.8_x5.0_linux_amd64 plugin: 2022/10/31 07:03:44 [DEBUG] Opening new ssh session
2022/10/31 07:03:44 packer-plugin-vsphere_v1.0.8_x5.0_linux_amd64 plugin: 2022/10/31 07:03:44 [DEBUG] starting remote command: chmod 0755 /tmp/script_1669.sh
2022/10/31 07:03:44 packer-plugin-vsphere_v1.0.8_x5.0_linux_amd64 plugin: 2022/10/31 07:03:44 [INFO] RPC endpoint: Communicator ended with: 0
2022/10/31 07:03:44 [INFO] RPC client: Communicator ended with: 0
2022/10/31 07:03:44 [INFO] RPC endpoint: Communicator ended with: 0
2022/10/31 07:03:44 packer-provisioner-shell plugin: [INFO] RPC client: Communicator ended with: 0
2022/10/31 07:03:44 packer-plugin-vsphere_v1.0.8_x5.0_linux_amd64 plugin: 2022/10/31 07:03:44 [DEBUG] Opening new ssh session
2022/10/31 07:03:44 packer-plugin-vsphere_v1.0.8_x5.0_linux_amd64 plugin: 2022/10/31 07:03:44 [DEBUG] starting remote command: chmod +x /tmp/script_1669.sh; BATFISH_VER='2022.09.08.86' DOCKER_COMPOSE_VER='2.12.2' DOCKER_VER='5:20.10.21~3-0~ubuntu-jammy' NAUTOBOT_IMAGE='1.4.7' NGINX_VER='1.23.1' OVA_VER='rc4' PACKER_BUILDER_TYPE='vsphere-clone' PACKER_BUILD_NAME='Nautobot' PACKER_HTTP_ADDR='172.24.10.141:0' PACKER_HTTP_IP='172.24.10.141' PACKER_HTTP_PORT='0' POSTGRES_VER='14' PYTHON_VER='3.9' REDIS_VER='alpine' UBUNTU_VER='22.04.1'  /tmp/script_1669.sh


ladar commented Dec 8, 2022

I can't offer any insight on the vSphere plugin/provisioners. But @aj-cruz, you might want to manually check whether the guest IP is changing. Quite simply, you can find the connecting IP in the Packer log (you might need to enable debug-level output). You can then connect yourself from the host while Packer is running and see if your connection gets dropped, and/or test whether you can still connect manually after the Packer connection is dropped. Use the -on-error=ask command-line option to keep Packer from destroying the guest after the error, which gives you time to troubleshoot.
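
For reference, a sketch of how such a debugging run might look (the template name is just a placeholder):

# write verbose plugin/communicator logs to a file and pause on errors
PACKER_LOG=1 PACKER_LOG_PATH=packer-debug.log packer build -on-error=ask nautobot.pkr.hcl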

But as I mentioned above, I haven't had problems with provisioners hanging since I disabled those dispatchers. Of course, that was on CentOS 7 build robots. Different distros and releases are likely to have different dispatchers that can trigger the underlying issue: the system gets tricked into thinking it needs to give the guest a new DHCP IP, and Packer isn't smart enough to use the platform API to find the current IP and check whether it has changed before exploding. In theory it should be able to find a new guest IP using the same process it used to find it in the first place, which varies by platform.

@palacsint

We are experiencing the same hang with the Ansible provisioner.

Packer output from a Jenkins job:

22:38:57  ==> hcloud.xxxyyy: Setting a 20m0s timeout for the next provisioner...
22:38:57  ==> hcloud.xxxyyy: Provisioning with Ansible...
22:38:57      hcloud.xxxyyy: Setting up proxy adapter for Ansible....
22:38:57  ==> hcloud.xxxyyy: Executing Ansible: ansible-playbook [...]
[...]
22:39:13      hcloud.xxxyyy:
22:39:13      hcloud.xxxyyy: TASK [add-repository : add repository] *****************************
22:55:35  ==> hcloud.xxxyyy: EOF
22:59:11  ==> hcloud.xxxyyy: Cancelling provisioner after a timeout...
Cancelling nested steps due to timeout
23:38:10  Sending interrupt signal to process
23:38:20  ==> hcloud.xxxyyy: Provisioning step had errors: Running the cleanup provisioner, if present...
23:38:20  ==> hcloud.xxxyyy: Destroying server...
23:38:20  ==> hcloud.xxxyyy: Could not destroy server (please destroy it manually): server with ID '...' not found (not_found, ...)
23:38:20  ==> hcloud.xxxyyy: Deleting temporary SSH key...

I manually deleted the cloud server at 22:39:20 for testing.

Packer should have terminated the Ansible provisioner around 23:08 (the 20-minute provisioner timeout).

At 23:38 the Jenkins pipeline timeout (60 minutes) killed the packer process.

pstree output:

    ├─sh─┬─sh───sleep
    │    └─script.sh.copy───packer─┬─17*[{packer}]
    │                              └─packer─┬─28*[{packer}]
    │                                       ├─packer-plugin-h───28*[{packer-plugin-h}]
    │                                       ├─packer───18*[{packer}]
    │                                       ├─packer───17*[{packer}]
    │                                       ├─packer-plugin-a─┬─22*[{packer-plugin-a}]
    │                                       │                 └─ansible-playboo─┬─{ansible-playboo}
    │                                       │                                   └─ansible-playboo───sftp───ssh
    │                                       └─2*[packer───16*[{packer}]]

Provisioner from the HCL file:

  provisioner "ansible" {
    playbook_file = "../playbooks/...yml"

    groups = local.ansible_groups
    host_alias = local.ansible_host_alias
    extra_arguments = local.ansible_extra_arguments
    keep_inventory_file = false
    timeout = "20m"
  }
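
One mitigation that might at least let the hung ssh child give up on its own (just a sketch; not verified to fix the Packer-side hang, and the interval values are arbitrary) is to pass SSH keepalives through to ansible-playbook via extra_arguments, which on the command line amounts to:

# let the ssh child detect a dead server instead of waiting forever
ansible-playbook --ssh-extra-args '-o ServerAliveInterval=30 -o ServerAliveCountMax=3' ...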
