Notes for Day-3 KVM Forum
=========================

Backing Chain Management in QEMU and libvirt, Eric Blake
--------------------------------------------------------

- Feature bits in qcow2v3
- qemu-io -c "write . . ." base.qcow2
- Internal snapshots
  - Optionally including live VM state
  - No I/O penalties to the active state
- External snapshots
  - Optimized QMP performance
  - I/O overhead in long chains
  - 'qemu-img map' can tell you where the clusters live
- Points in time vs. file names
  - Think of files as a delta from the backing file
  - Backing files must not change
- qcow2 block operations are NOT a substitute for overlayfs
- Rule of thumb: backing files must never be changed
- Block-stream primitive ("pull")
  - QEMU 2.5 will allow intermediate streaming
  - Always safe, restartable
- Block-commit primitive ("commit")
  - Restartable, but remember the caveat about editing a shared backing
    file
  - Future QEMU may add an additional commit mode that combines pull and
    commit, so that files removed from the chain are still consistent
- Drive-mirror primitive ("copy")
  - Copy all or part of one chain to another destination
  - The destination can be pre-created, as long as the data seen by the
    guest is identical between source and destination when starting
- Drive-backup primitive
  - Similar to drive-mirror, but with a different point in time
- Libvirt control (see the C sketch after the SR-IOV note below)
  - Name a specific chain member by index (vda) or file name (wrap.qcow2)
  - virDomainSnapshotCreateXML() API
    - blockdev-snapshot-sync
  - virDomainBlockRebase() API
  - virDomainBlockCopy() / virDomainBlockJobAbort()
  - cp --reflink=always

Libvirt (What did we do wrong?)
-------------------------------

Nice talk by Michal Privoznik.

- APIs missing a flags argument
  - OLD: virDomainShutdown(virDomainPtr domain)
  - NEW: virDomainShutdownFlags(virDomainPtr domain, unsigned int flags)
  - OLD: virDomainCoreDump(virDomainPtr dom, const char *to,
    unsigned int flags)
  - NEW: virDomainCoreDumpWithFormat(virDomainPtr dom, const char *to,
    unsigned int fmt, . . .)
- APIs too tied to a specific hypervisor
  - OLD: virDomainCreate(virDomainPtr dom)
  - NEW: virDomainCreateWithFlags(virDomainPtr dom, unsigned int flags)
- From bare C to glibc
- What did we do right?
  - Internal objects
    - Objects are not exposed to the user
    - Public header file
    - Private header file
- Analysis (APIs missing some arguments)
  - Changing requirements
  - Hard to do proper analysis and design

Block Jobs: current status, upcoming challenges -- Jeff Cody, Red Hat
---------------------------------------------------------------------

- Block jobs are executed via coroutines
  - Asynchronous
- Four block jobs
  - Backup, Stream, Commit, Mirror
- Backup
  - 'blockdev-backup'
  - Incremental 'sync' mode
  - COLO support
- Stream
  - Intermediate streaming
- Commit
  - Bug fixes / cleanups
- Mirror
  - Bitmap spoiling fix
  - Bitmap scanning speedup
- Block job infrastructure
  - Nested pause
- Future challenges/improvements
  - Operational blockers
  - Safe(r) commit
- Structure of a block job
  - Key components
    - QMP command definition
    - Block job coroutine
    - Block job events
      - BLOCK_JOB_{COMPLETED, CANCELLED, ERROR, READY}
    - Block job control
      - set-speed, cancel, pause, resume, complete

Live Migration with SR-IOV
--------------------------

TO-WATCH.
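
The "Libvirt control" list in Eric Blake's talk above maps to only a
little C. The following is a minimal sketch, not code from the talk: it
creates an external, disk-only snapshot via virDomainSnapshotCreateXML(),
so the current image becomes a read-only backing file and a new overlay
becomes the active layer. The domain name "demo" is a placeholder, the
vda/wrap.qcow2 names merely echo the examples in the notes, and error
reporting is trimmed to the bare minimum.

    /* Minimal sketch: external disk-only snapshot with libvirt.
     * Build with: gcc snap.c -lvirt  (needs libvirt dev headers).
     * "demo" and the vda/wrap.qcow2 names are placeholders. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <libvirt/libvirt.h>

    int main(void)
    {
        virConnectPtr conn = virConnectOpen("qemu:///system");
        if (!conn) {
            fprintf(stderr, "failed to connect to libvirtd\n");
            return EXIT_FAILURE;
        }

        virDomainPtr dom = virDomainLookupByName(conn, "demo");
        if (!dom) {
            virConnectClose(conn);
            return EXIT_FAILURE;
        }

        /* External snapshot: the current vda image becomes a read-only
         * backing file; wrap.qcow2 becomes the new active overlay. */
        const char *xml =
            "<domainsnapshot>"
            "  <disks>"
            "    <disk name='vda' snapshot='external'>"
            "      <source file='/var/lib/libvirt/images/wrap.qcow2'/>"
            "    </disk>"
            "  </disks>"
            "</domainsnapshot>";

        virDomainSnapshotPtr snap =
            virDomainSnapshotCreateXML(dom, xml,
                                       VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY);
        if (!snap) {
            fprintf(stderr, "snapshot creation failed\n");
        } else {
            virDomainSnapshotFree(snap);
        }

        int ret = snap ? EXIT_SUCCESS : EXIT_FAILURE;
        virDomainFree(dom);
        virConnectClose(conn);
        return ret;
    }

From here the chain can be shortened again with virDomainBlockRebase()
or a block-commit, matching the pull/commit primitives described above.
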
Improving QEMU event loop
-------------------------

References:

- http://blog.vmsplice.net/2011/03/qemu-internals-big-picture-overview.html
- http://blog.vmsplice.net/2011/03/qemu-internals-overall-architecture-and.html

The event loops in QEMU
~~~~~~~~~~~~~~~~~~~~~~~

- Threads of QEMU: main, iothreads (polling via ppoll), thread pool
  workers
- Main loop
  - Dispatches fd events
    - aio: block I/O, ioeventfd
    - iohandler: net, nbd, audio, ui, vfio, ...
    - slirp: -net user
    - chardev: -chardev XXX
  - Non-fd services
    - timers
    - bottom halves
- Main loop (again)
  - Prepare
    - slirp_pollfds_fill(gpollfd, &timeout)
    - qemu_iohandler_fill(gpollfd)
    - timeout = qemu_soonest_timeout(timeout, timer_deadline)
    - glib_pollfds_fill(gpollfd, &timeout)
  - Poll
    - qemu_poll_ns(gpollfd, timeout)
  - Dispatch
    - fd, BH, aio timers
    - main loop timers
- iothread (dataplane)
  - Equivalent to the AioContext GSource in the main loop
- Challenges I & II
  - Challenge #1: consistency
  - Challenge #2: scalability
    - The loop runs slower as more fds are polled:
      - *_pollfds_fill() and add_pollfd() take longer
      - qemu_poll_ns() (ppoll(2)) takes longer
      - the dispatch walking through more nodes takes longer
    - Benchmarking virtio-scsi on ramdisk: disk IOPS degrade with an
      increasing number of fds
    - Benchmarking virtio-scsi-dataplane: O(n)
- Solution: epoll
  - "epoll . . . scales well to large no. of watched fds"
  - epoll_create
  - epoll_ctl
    - EPOLL_CTL_ADD
    - EPOLL_CTL_MOD
    - EPOLL_CTL_DEL
  - epoll_wait
  - Doesn't fit in the current main loop model.
  - Cure: the aio interface is similar to epoll
    - New implementation:
      - aio_set_fd_handler(ctx, fd, ...)
      - aio_set_event_notifier(ctx, notifier, ...)
- Challenge #2.5: epoll timeout
  - The timeout in epoll is in ms [. . .]
  - But nanosecond timeouts are desired . . .
- Solution #2.5: epoll timeout
  - Timeout precision is kept by combining epoll with a timerfd (see the
    C sketch at the end of these notes):
    - Begin with a timerfd added to the epollfd.
    - Update the timerfd before epoll_wait().
    - Do epoll_wait() with timeout=-1.
- If AIO can use epoll, what about the main loop?
  - Move the main loop ingredients onto aio_poll().
  - Resolves challenge #1.
- Solution: consistency
  - Rebase all other ingredients in the main loop onto AIO:
    - Make the iohandler interface consistent with the aio interface by
      dropping fd_read_poll [done]
    - Convert slirp to AIO
    - Convert iohandler to AIO
    - Convert the chardev GSource to aio or an equivalent interface
    - Unify with AIO
  - Next step: convert the main loop to use aio_poll()
- Nested event loops
  - Block layer synchronous calls are implemented with nested
    aio_poll(), e.g. (abridged):

        void bdrv_aio_cancel(BlockAIOCB *acb)
        {
            qemu_aio_ref(acb);
            bdrv_aio_cancel_async(acb);
            while (acb->refcnt > 1) {
                /* nested event loop until the request completes */
                aio_poll(bdrv_get_aio_context(acb->bs), true);
            }
            qemu_aio_unref(acb);
        }

  - Example of a nested event loop (drive-backup call stack from gdb)
- Challenge #3: correctness
  - Solution: aio_client_disable/enable
  - qtest
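
To make "Solution #2.5" above concrete, here is a standalone toy in C,
not QEMU code: arm a timerfd with the nanosecond deadline, register it
in the epoll set, and block in epoll_wait() with timeout=-1 so the
millisecond granularity of epoll's own timeout argument never matters.
The 1.5 ms deadline is made up for the demo.

    /* Sketch of the epoll + timerfd trick from "Solution #2.5". */
    #include <stdio.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/epoll.h>
    #include <sys/timerfd.h>

    #define MAX_EVENTS 16

    int main(void)
    {
        int epfd = epoll_create1(0);
        int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);
        if (epfd < 0 || tfd < 0) {
            perror("setup");
            return 1;
        }

        /* Step 1: begin with a timerfd added to the epollfd. */
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = tfd };
        epoll_ctl(epfd, EPOLL_CTL_ADD, tfd, &ev);

        /* Step 2: update the timerfd before epoll_wait(); the deadline
         * (1.5 ms here, made up for the demo) has nanosecond precision,
         * unlike epoll_wait()'s millisecond timeout argument. */
        struct itimerspec its = {
            .it_value = { .tv_sec = 0, .tv_nsec = 1500000 },
        };
        timerfd_settime(tfd, 0, &its, NULL);

        /* Step 3: do epoll_wait() with timeout=-1; the timerfd firing
         * (or any other registered fd) is what wakes us up. */
        struct epoll_event events[MAX_EVENTS];
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == tfd) {
                uint64_t expirations;
                read(tfd, &expirations, sizeof(expirations)); /* drain */
                printf("timer deadline reached\n");
            }
            /* A real loop would dispatch aio handlers for other fds here. */
        }

        close(tfd);
        close(epfd);
        return 0;
    }

The design point is that the timer becomes just another fd in the epoll
set, so the wait can always block indefinitely and timer precision is
limited only by the timerfd, not by epoll_wait().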