A useful libvirt + QEMU debugging session in the context of OpenStack Nova,
by Daniel Berrange.

From this review: https://review.openstack.org/#/c/181781/ -- libvirt:
handle code=38 + sigkill (ebusy) in destroy()

Context
-------

When libvirt kills a process it sends it SIGTERM first and waits 10
seconds. If it hasn't gone, it sends SIGKILL and waits another 5 seconds.
If it still hasn't gone, then you get this EBUSY error.

Usually when a QEMU process fails to go away upon SIGKILL it is because it
is stuck in an uninterruptible kernel sleep waiting on I/O from some
non-responsive server.

Given the CPU load of the gate tests though, it is conceivable that the 15
second timeout is too short, particularly if the VM running tempest has a
high steal time from the cloud host. I.e. 15 wallclock seconds may have
passed, but the VM might only have had a few seconds of scheduled run time.
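For illustration, the retry pattern this review is concerned with can be
sketched with the libvirt-python bindings. This is not Nova's actual driver
code (the function name, the max_attempts default and the instance name are
placeholders), but it mirrors the "attempt N of 3" warnings quoted in the
logs below; error code 38 is libvirt's VIR_ERR_SYSTEM_ERROR, which is how
the "Failed to terminate process ... Device or resource busy" condition is
reported:

    import libvirt

    def destroy_with_retries(dom, max_attempts=3):
        """Keep calling dom.destroy() while QEMU refuses to die with EBUSY.

        Each failed destroy() already blocks inside libvirt for the SIGTERM
        and SIGKILL grace periods (10 + 5 seconds), so no extra sleep is
        needed between attempts.
        """
        for attempt in range(1, max_attempts + 1):
            try:
                dom.destroy()
                return
            except libvirt.libvirtError as e:
                busy = (e.get_error_code() == libvirt.VIR_ERR_SYSTEM_ERROR and
                        'Failed to terminate process' in e.get_error_message())
                if not busy or attempt == max_attempts:
                    raise
                print('Error from libvirt during destroy: %s; attempt %d of %d'
                      % (e.get_error_message(), attempt, max_attempts))

    conn = libvirt.open('qemu:///system')
    destroy_with_retries(conn.lookupByName('instance-0000004d'))

Note there is no explicit sleep between attempts: each failed destroy()
call already spends the full 15 seconds inside libvirt, which is where the
15-second spacing between the warnings below comes from.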
Debugging of why the issue still persists
-----------------------------------------

"There are two reasons why you'd get this failure ("Failed to terminate
process: Device or resource busy") from libvirt. The first is simply that
the host is overloaded and so the kernel doesn't clean up QEMU in time. The
second is that there is some kind of storage failure causing QEMU to get
stuck in kernelspace in an uninterruptible I/O operation. In this case no
amount of killing QEMU will work - it requires unblocking of the storage
I/O to let it go. Since we've now waited 90 seconds it seems unlikely to be
the first reason and more likely to be storage related."

"I've been looking at the change here:
https://review.openstack.org/#/c/179689/34 which exhibits a job with the
EBUSY error. What's fun is that the job showing this error actually ended
up with a Success status, not Failed :-)

http://logs.openstack.org/89/179689/34/check/check-tempest-dsvm-centos7/c7fb267/

Taking the nova log:

http://logs.openstack.org/89/179689/34/check/check-tempest-dsvm-centos7/c7fb267/logs/screen-n-cpu.txt.gz

We see the instance start:

2015-06-11 01:48:01.322 INFO nova.compute.manager [req-8b49b17b-0473-408b-8c40-aaaeddcc4822 None None] [instance: bc7a8641-d99f-4981-84c8-536cbc231382] VM Started (Lifecycle Event)

Then we try terminating it:

2015-06-11 01:48:10.322 INFO nova.compute.manager [req-9794c663-8f3b-476c-b3b2-5cf16593051a ServerRescueNegativeTestJSON-1049499083 ServerRescueNegativeTestJSON-1148490304] [instance: bc7a8641-d99f-4981-84c8-536cbc231382] Terminating instance

We see 3 failed attempts:

2015-06-11 01:48:25.360 WARNING nova.virt.libvirt.driver [req-9794c663-8f3b-476c-b3b2-5cf16593051a ServerRescueNegativeTestJSON-1049499083 ServerRescueNegativeTestJSON-1148490304] [instance: bc7a8641-d99f-4981-84c8-536cbc231382] Error from libvirt during destroy. Code=38 Error=Failed to terminate process 16884 with SIGKILL: Device or resource busy; attempt 1 of 3

2015-06-11 01:48:40.592 WARNING nova.virt.libvirt.driver [req-9794c663-8f3b-476c-b3b2-5cf16593051a ServerRescueNegativeTestJSON-1049499083 ServerRescueNegativeTestJSON-1148490304] [instance: bc7a8641-d99f-4981-84c8-536cbc231382] Error from libvirt during destroy. Code=38 Error=Failed to terminate process 16884 with SIGKILL: Device or resource busy; attempt 2 of 3

2015-06-11 01:48:55.627 WARNING nova.virt.libvirt.driver [req-9794c663-8f3b-476c-b3b2-5cf16593051a ServerRescueNegativeTestJSON-1049499083 ServerRescueNegativeTestJSON-1148490304] [instance: bc7a8641-d99f-4981-84c8-536cbc231382] Error from libvirt during destroy. Code=38 Error=Failed to terminate process 16884 with SIGKILL: Device or resource busy; attempt 3 of 3

This bc7a8641-d99f-4981-84c8-536cbc231382 uuid corresponds to
instance-0000004d. So looking at the QEMU log file for that:

http://logs.openstack.org/89/179689/34/check/check-tempest-dsvm-centos7/c7fb267/logs/libvirt/qemu/instance-0000004d.txt.gz

I see the startup marker, which corresponds to the timestamp nova has:

2015-06-11 01:48:01.084+0000: starting up

What is interesting is the shutdown marker:

qemu: terminating on signal 15 from pid 12032
2015-06-11 01:50:08.097+0000: shutting down

Notice that this is another 75 seconds after Nova's attempt 3 failed. So
the time from when we started termination to when it actually succeeded
was 2 minutes, but we only waited for 45 seconds (3 attempts * 15 seconds).

I looked in syslog but didn't see any obvious storage related errors, so
perhaps the gate system really is that slow. Which would suggest you need
to make the max attempts much, much longer - perhaps 10, or even 15
attempts, which would give 150 seconds or 225 seconds."
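To put the suggested numbers in perspective, here is a back-of-the-envelope
sketch. The 15-second cost per attempt comes from libvirt's 10 + 5 second
waits described in the Context section, and the ~118-second termination
latency is read off the timestamps quoted above (01:48:10 to 01:50:08):

    # Each failed destroy() attempt costs roughly 15 seconds inside libvirt:
    # 10 seconds after SIGTERM plus 5 seconds after SIGKILL.
    SECONDS_PER_ATTEMPT = 10 + 5

    # Termination was requested at 01:48:10; QEMU shut down at 01:50:08.
    observed_latency = 118

    # Smallest number of attempts that would have covered this latency.
    needed = -(-observed_latency // SECONDS_PER_ATTEMPT)  # ceiling division
    print(needed)                       # 8 attempts, i.e. 120 seconds

    # The values suggested above:
    print(10 * SECONDS_PER_ATTEMPT)     # 150 seconds
    print(15 * SECONDS_PER_ATTEMPT)     # 225 seconds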