A useful libvirt + QEMU debugging session in the context of OpenStack Nova,
by Daniel Berrange.

From this review: https://review.openstack.org/#/c/181781/ -- libvirt:
handle code=38 + sigkill (ebusy) in destroy()

Context
-------

When libvirt kills a process it sends it SIGTERM first and waits 10
seconds. If it hasn't gone, it sends SIGKILL and waits another 5 seconds.
If it still hasn't gone, then you get this EBUSY error.

Usually when a QEMU process fails to go away upon SIGKILL it is because it
is stuck in an uninterruptible kernel sleep waiting on I/O from some
non-responsive server.

Given the CPU load of the gate tests though, it is conceivable that the 15
second timeout is too short, particularly if the VM running tempest has a
high steal time from the cloud host. I.e. 15 wallclock seconds may have
passed, but the VM might only have had a few seconds of scheduled run time.
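For illustration, the retry pattern this review is concerned with can be
sketched with the libvirt-python bindings. This is not Nova's actual driver
code (the function name, the max_attempts default and the instance name are
placeholders), but it mirrors the "attempt N of 3" warnings quoted in the
logs below; error code 38 is libvirt's VIR_ERR_SYSTEM_ERROR, which is how
the "Failed to terminate process ... Device or resource busy" condition is
reported:

    import libvirt

    def destroy_with_retries(dom, max_attempts=3):
        """Keep calling dom.destroy() while QEMU refuses to die with EBUSY.

        Each failed destroy() already blocks inside libvirt for the SIGTERM
        and SIGKILL grace periods (10 + 5 seconds), so no extra sleep is
        needed between attempts.
        """
        for attempt in range(1, max_attempts + 1):
            try:
                dom.destroy()
                return
            except libvirt.libvirtError as e:
                busy = (e.get_error_code() == libvirt.VIR_ERR_SYSTEM_ERROR and
                        'Failed to terminate process' in e.get_error_message())
                if not busy or attempt == max_attempts:
                    raise
                print('Error from libvirt during destroy: %s; attempt %d of %d'
                      % (e.get_error_message(), attempt, max_attempts))

    conn = libvirt.open('qemu:///system')
    destroy_with_retries(conn.lookupByName('instance-0000004d'))

Note there is no explicit sleep between attempts: each failed destroy()
call already spends the full 15 seconds inside libvirt, which is where the
15-second spacing between the warnings below comes from.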
Debugging of why the issue still persists
-----------------------------------------

"There are two reasons why you'd get this failure ("Failed to terminate
process: Device or resource busy") from libvirt. The first is simply that
the host is overloaded and so the kernel doesn't clean up QEMU in time. The
second is that there is some kind of storage failure causing QEMU to get
stuck in kernelspace in an uninterruptible I/O operation. In this case no
amount of killing QEMU will work - it requires unblocking of the storage
I/O to let it go. Since we've now waited 90 seconds it seems unlikely to be
the first reason and more likely to be storage related."

"I've been looking at the change here:
https://review.openstack.org/#/c/179689/34 which exhibits a job with the
EBUSY error. What's fun is that the job showing this error actually ended
up with a Success status, not Failed :-)

http://logs.openstack.org/89/179689/34/check/check-tempest-dsvm-centos7/c7fb267/

Taking the nova log:

http://logs.openstack.org/89/179689/34/check/check-tempest-dsvm-centos7/c7fb267/logs/screen-n-cpu.txt.gz

We see the instance start:

2015-06-11 01:48:01.322 INFO nova.compute.manager [req-8b49b17b-0473-408b-8c40-aaaeddcc4822 None None] [instance: bc7a8641-d99f-4981-84c8-536cbc231382] VM Started (Lifecycle Event)

Then we try terminating it:

2015-06-11 01:48:10.322 INFO nova.compute.manager [req-9794c663-8f3b-476c-b3b2-5cf16593051a ServerRescueNegativeTestJSON-1049499083 ServerRescueNegativeTestJSON-1148490304] [instance: bc7a8641-d99f-4981-84c8-536cbc231382] Terminating instance

We see 3 failed attempts:

2015-06-11 01:48:25.360 WARNING nova.virt.libvirt.driver [req-9794c663-8f3b-476c-b3b2-5cf16593051a ServerRescueNegativeTestJSON-1049499083 ServerRescueNegativeTestJSON-1148490304] [instance: bc7a8641-d99f-4981-84c8-536cbc231382] Error from libvirt during destroy. Code=38 Error=Failed to terminate process 16884 with SIGKILL: Device or resource busy; attempt 1 of 3

2015-06-11 01:48:40.592 WARNING nova.virt.libvirt.driver [req-9794c663-8f3b-476c-b3b2-5cf16593051a ServerRescueNegativeTestJSON-1049499083 ServerRescueNegativeTestJSON-1148490304] [instance: bc7a8641-d99f-4981-84c8-536cbc231382] Error from libvirt during destroy. Code=38 Error=Failed to terminate process 16884 with SIGKILL: Device or resource busy; attempt 2 of 3

2015-06-11 01:48:55.627 WARNING nova.virt.libvirt.driver [req-9794c663-8f3b-476c-b3b2-5cf16593051a ServerRescueNegativeTestJSON-1049499083 ServerRescueNegativeTestJSON-1148490304] [instance: bc7a8641-d99f-4981-84c8-536cbc231382] Error from libvirt during destroy. Code=38 Error=Failed to terminate process 16884 with SIGKILL: Device or resource busy; attempt 3 of 3

This bc7a8641-d99f-4981-84c8-536cbc231382 uuid corresponds to
instance-0000004d. So looking at the QEMU log file for that:

http://logs.openstack.org/89/179689/34/check/check-tempest-dsvm-centos7/c7fb267/logs/libvirt/qemu/instance-0000004d.txt.gz

I see the startup marker, which corresponds to the timestamp nova has:

2015-06-11 01:48:01.084+0000: starting up

What is interesting is the shutdown marker:

qemu: terminating on signal 15 from pid 12032
2015-06-11 01:50:08.097+0000: shutting down

Notice that this is another 75 seconds after Nova's attempt 3 failed. So
the time from when we started termination to when it actually succeeded
was 2 minutes, but we only waited for 45 seconds (3 attempts * 15 seconds).

I looked in syslog but didn't see any obvious storage related errors, so
perhaps the gate system really is that slow. Which would suggest you need
to make the max attempts much, much longer - perhaps 10, or even 15
attempts, which would give 150 seconds or 225 seconds."
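To put the suggested numbers in perspective, here is a back-of-the-envelope
sketch. The 15-second cost per attempt comes from libvirt's 10 + 5 second
waits described in the Context section, and the ~118-second termination
latency is read off the timestamps quoted above (01:48:10 to 01:50:08):

    # Each failed destroy() attempt costs roughly 15 seconds inside libvirt:
    # 10 seconds after SIGTERM plus 5 seconds after SIGKILL.
    SECONDS_PER_ATTEMPT = 10 + 5

    # Termination was requested at 01:48:10; QEMU shut down at 01:50:08.
    observed_latency = 118

    # Smallest number of attempts that would have covered this latency.
    needed = -(-observed_latency // SECONDS_PER_ATTEMPT)  # ceiling division
    print(needed)                       # 8 attempts, i.e. 120 seconds

    # The values suggested above:
    print(10 * SECONDS_PER_ATTEMPT)     # 150 seconds
    print(15 * SECONDS_PER_ATTEMPT)     # 225 seconds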