Notes for Day-2 KVMForum
========================

QEMU Keynote
------------

- QEMU is now part of Software Freedom Conservancy
- 3 releases in the last year, w/ lots of features
  - ACPI & PCI on ARM
  - virtio-vga
  - Groundwork for MTTCG
  - TriCore
- More than 1.5 million LoC
- Lines Added
  - Most lines added are for new hardware models
  - By directory: target-ppc, target-arm (64-bit new spike), target-mips
- Contribution by authors/companies
  - 1 - Red Hat; 2 - Linaro; 3 - Individual contributors

Towards multi-threaded TCG - Alex Bennée and Frederic Konrad
------------------------------------------------------------

- Tiny Code Generator
  - cross platform emulation w/o hardware acceleration
- Current process model
- Multi-threaded TCG
- Why? -- We're living in a multi-core world
  - Raspberry-Pi: Quad-core Cortex A7 @900 MHz
  - Dragonboard 410c -- $75
  - Intel i7 (4 core + 4 hyperthreads) @3.4 GHz
- Using QEMU for System bring up
  - Increasingly used for prototyping
- As a development tool
  - Instrumentation and inspection
  - Record and playback
  - Reverse debugging
- Cross Tooling
  - qemu-linux-user
- Global State in QEMU
  - Numerous globals in TCG generation
  - TCG Runtime Structures
- Guest Memory Models
  - Atomic behaviour
  - LL/SC Semantics
  - Memory barriers
- 3 approaches
  - Use threads/locks
  - Use processes/IPC
  - Re-write from scratch
- What has been done
  - Protected code generation
  - Serialised the run loop
  - New memory heuristics
- TCG Runtime structures
  - per-CPU variables
  - Using locks
    - expensive for frequently read vCPU structures
    - complex when modifying multiple vCPUs' data
- Deferred Work
  - Existing queued_work mechanism
    - add work to queue
    - signal vCPU to exit
  - New queued_safe_work
- TCG Summary
  - Move global vars. . .
- Guest Memory Models
  - Atomic Behavior is easy when Single Threaded
  - Considerably harder when Multi-threaded
  - Load-link/Store-conditional (LL/SC)
- SoftMMU
  - Maps guest loads/stores to host memory
  - uses an addend offset
  - Fast path in generated code
  - Slow path in C code
  - Victim cache lookup
- How it works: Stage one
- Memory Model Summary
  - SoftMMU allows fairly efficient implementation
- Device emulation
  - KVM has already done it
    - added thread safety to a number of systems
    - introduced memory API
    - introduced I/O thread
  - TCG access to device memory
    - All MMIO pages are flagged in the SoftMMU TLB

Message Passing Workloads
-------------------------

- Usually, anything that frequently switches between running and idle.
- Intuition: Workloads which don't involve IO virt should run at near native performance
- Message Passing Workloads may not involve any IO but will still perform nX worse than native
  - (loopback) Memcache: 2x higher latency.
- Microbenchmark: Loopback TCP_RR
  - Client and Server ping-pong 1 byte of data over an established TCP connection
  - Loopback: No networking devices (real or virtual) involved
  - Performance: Latency of each transaction
- Virtual Overheads of TCP_RR
  - Message Passing on 1 CPU
    - Context Switch
  - Message Passing on >1 CPU
    - Interprocessor-interrupts
- What's going on under the hood
  - VMEXITs are a good place to look
- MSR_WRITEs of TCP_RR
  - 10 MSR_WRITE
  - "Write to Model Specific Register" instruction executed in the guest
- VMEXITs of TCP_RR
- APIC Timer "Initial Count" Register
  - 8 per transaction
  - 4 on the critical path
  - NOHZ (tickless guest kernel)
    - "Disable" scheduler-tick upon entering idle.
    - "Enable" scheduler-tick upon leaving idle.
    - scheduler-tick
  - Why 2 writes. . .
- HLT
  - x86 instruction
  - CPU stops executing instructions until an interrupt arrives
- IPI+HLT
  - Sending an IPI to wake up a HLT-ed CPU.
  - Same operation on bare metal is entirely implemented in hardware
- KVM versus Hardware
  - Ring 0 Microbenchmark (kvm-unit-tests)
  - Median: KVM is 12x slower
  - Pathological case: KVM is 400x slower
- Notes about the benchmark
  - No guest FPU to save/restore
  - Host otherwise idle (VCPU context switches to idle on HLT)
  - Host power management not the culprit
- KVM HLT Internals
  - Unsurprisingly, the scheduler takes some time to run the vCPU
  - Slow even in the uncontended, cache-hot case
- Experiment: Don't schedule on HLT
  - Eliminate almost all of the latency overhead by not scheduling on HLT.
  - Scheduling is often the right thing to do
    - Let other threads run on the CPU
- Halt-Polling
  - Step-1: Poll (for up to X nanoseconds)
    - If a task is waiting to run on our CPU, go to Step-2
    - Check if a guest interrupt arrived; if so, we're done
  - Step-2: schedule()
- Memcache: 1.5x latency improvement
- Windows Event Objects: 2x latency improvement
- Reduce message passing latency by 10-15 us (including n/w latency)
- It is merged in 4.0 Kernel
  - Use the KVM module parameter halt_poll_ns to control how long to poll on each HLT
- Future improvements
  - Automatic poll toggling (remove idle CPU overhead by turning polling off)
  - Automatic halt_poll_ns
- Lazy Context Switching
- Conclusion
  - Halt-Polling saves 10-15 us on message passing round-trip latency.

ARM: Caches that give you enough rope to shoot yourself in the foot - Marc Zyngier
---------------------------------------------------------------------

- ARM: The cache coherency myth, and the facts
  - Facts
    - Cache coherent architecture
    - Scales from single CPU to massive SMP systems
    - Implementer chooses to offer caches that are
      - visible to software
      - invisible to software
      - . . . or any point between those two points
- ARM: Cache architecture
  - (Modified) Harvard architecture
  - Multiple levels of caching (with snooping)
- ARM: Interacting with caches
  - The ARM arch. offers the usual (mostly) privileged operations to interact with caches
  - Caches are an essential part of the coherency protocol
  - Using uncached memory explicitly bypasses it
  - It looks logical to cope with the consequences
- Emulated devices: the uncached I/O issue
  - Top rant about KVM/ARM: "My VGA adapter in QEMU doesn't work with KVM"
    - Userspace uses cached memory (via mmap)
    - The guest uses non-cached memory
    - Why would the CPU read back from it?
- How to fix this mess
  - Hack guest attributes, forcing cacheable
    - Breaks devices that _need_ uncached access
  - Cache maintenance from userspace
    - Requires a new syscall on ARMv7
  - Allow userspace to mmap uncached
    - And what if the guest maps it as cached?
  - Just tell the guest the device is coherent
    - Only _real_ solution
- How did we end up here?
  - A VGA device on an ARM VM looks like a terrible idea.
    - VGA was invented in 1987. . .
  - ARM VMs have no legacy to care about
  - We use paravirtualized devices for most things
  - Why don't we use virtio-vga as well?
- Back to coherency: Emulated vs physical devices
  - Firmware does have some level of support to describe the cache coherency attributes:
- Conclusion
  - KVM and its ecosystem are strongly x86 oriented (tainted?)
    - Not all the solutions that worked on x86 make sense on ARM
    - Nobody needs a Franken-VM
  - We have the chance of a clean slate
  - It doesn't take much effort to fix KVM
    - All it takes is to read the
  - We already have modern, efficient solutions
    - Paravirt is the best thing since sliced bread
  - Firmware (UEFI) and high level tools (libvirt[*] and co.) seem to be the biggest issues
    - Probably the worst "x86-ism"
    - Isn't _that_ hard to address the problems
    - Just don't assume x86 is _always_ the model to follow

[*] Rich W.M. Jones corrected the presenter on this point, noting the
improved libvirt (and `virt-install`) support.

QEMU interface introspection: from hacks to solutions - Markus Armbruster
--------------------------------------------------------------

Part I

- QEMU provides interfaces
  - QMP Monitor
  - Command line
- QEMU command line
  - 139 total options
    - 14 deprecated
    - 2 internal use
    - 123 supported options
- QMP is _even_ bigger
  - 126 commands + 33 events
  - More than 700 named arguments and results
  - Defined in the (book-sized) QAPI/QMP schema
- Command line evolves fast
- QMP evolves _faster_
- Why interface introspection?
  - QEMU provides big, rapidly evolving interfaces
  - A program can
    - Tie to a specific build of QEMU

Part II Prior work

- Version numbers?
  - QEMU says 0.12.1
  - QMP commands grew by 250% (in Upstream/RHEL6 comparison)
  - Version numbers insufficient
    - git diff --shortstat v2.3.0..v2.4.0
    - git log --oneline v2.3.0..v2.4.0 | wc -l
- Just try to use it
  - Workable in simple cases
    - Example: libvirt tries QMP 'inject-nmi', falls back to old HMP 'nmi'
  - Complex, slow, fragile in not so simple cases
- A real-world failure of "just try"
  - block-commit new in v1.2, and libvirt just tried it:
    - Run `block-commit`, and if it succeeds
    - wait for event BLOCK_JOB_COMPLETED
  - Before v2.0, block-commit fails for active layer
  - Since then, it succeeds, but requires manual `block-job-complete` to complete
    -> Old libvirt never sends `block-job-complete`, so it hangs
- Pretend to be human: read help
  - -help
  - -device help
  - -device virtio-net,help
  - -drive format=help
  - Everybody did this (parsing help) until QEMU grew real interfaces
- QMP `query-commands`
  - This is very _limited_ QMP introspection
- `query-command-line-options`
  - Returns an array of options
    - array of object parameter types. . .
- `query-command-line-options` is incomplete
  - Probably better than nothing
  - Certainly less than needed
- `query-command-line-options` is inexpressive
  - Things we'd like to know, but it can't tell:
    - Formats supported by -drive?
      - It only tells us parameter 'format' is "string"
    - Parameters supported with 'chardev socket'?
- Where do we stand now?
  - Current introspection solutions work, but won't cut it much longer

Part III QMP introspection

- The basic idea
  - Interface introspection turns interface into data
  - QMP is defined by QAPI schema
  - Schema is _data_, so let clients query for it!
- Cooking: `query-schema`
  - Exposes QMP wire ABI as defined in the schema:
    - Commands, events with arguments & results
- Let's introspect a command
  - QAPI Schema for `query-block`
    - Return type is an array of BlockInfo
- Let's introspect introspection!
  - QAPI Schema for `query-schema`
    - returns: 'SchemaInfo'
  - Introspect SchemaInfo
- Quick peek under the hood
  - QAPI schema is compile time static
  - SchemaInfo is generated from it
  - Generator is 160 SLOC of Python
  - Complete info is a bit over 70KiB

Part IV - QMP introspection limitations

- Known issues
  - Can see only qapified commands
    - Can't see `device_add`
  - Can see only qapified arguments & results
    - We cheat for netdev_add
    - Can't see most of netdev_add's arguments
  - Only as good as the qapification
    - `add_client`
- Cleaning up qapification of netdev_add
  - Problem: type-specific arguments are missing
  - Need to
    - qapify the type-specific arguments. . .
    - w/o upsetting the QMP wire format
  - Wire format matches QAPI/QMP's flat union type
  - Possible solution:
    - Support unions as commands
- Qapifying `device_add`
  - Wire format like netdev_add
    - common + driver-specific arguments
  - But: drivers collected only at run time!
    - QAPI schema fixed at compile time
- QAPI follow-up work
  - On the way to introspection, we
    - got ourselves real test coverage
- What about the command line?
  - Same basic idea: turn interface into data
  - Good: our CLI definition is data
  - Bad: not QAPI, less expressive, leaves more to code
  - Choices:
    - Build non-QAPI CLI introspection?
      - Only as good as the data. . .
      - . . .

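A minimal sketch (not from the talk) of how a management tool could probe schema data of the kind described in Part III, instead of "just trying" a command. The entries below are a tiny hand-written sample, not real QEMU output; the field names (`meta-type`, `arg-type`, `ret-type`, `members`) and type names such as `BlockCommitArgs` are assumptions made for illustration.

```python
# Hand-written sample of introspection data. NOT real QEMU output: the
# field names and the type names are assumptions made for this sketch.
SAMPLE_SCHEMA = [
    {"name": "query-block", "meta-type": "command",
     "arg-type": "EmptyArgs", "ret-type": "BlockInfoList"},
    {"name": "block-commit", "meta-type": "command",
     "arg-type": "BlockCommitArgs", "ret-type": "EmptyArgs"},
    {"name": "BLOCK_JOB_COMPLETED", "meta-type": "event",
     "arg-type": "BlockJobCompletedArgs"},
    {"name": "EmptyArgs", "meta-type": "object", "members": []},
    {"name": "BlockCommitArgs", "meta-type": "object",
     "members": [{"name": "device", "type": "str"},
                 {"name": "top", "type": "str"}]},
]


def index_schema(schema):
    """Index schema entries by name for quick lookup."""
    return {entry["name"]: entry for entry in schema}


def has_command(index, name):
    """True if the schema describes a command called `name`."""
    entry = index.get(name)
    return entry is not None and entry.get("meta-type") == "command"


def command_has_arg(index, command, arg):
    """True if `command` exists and its argument type has a member `arg`."""
    if not has_command(index, command):
        return False
    arg_type = index.get(index[command].get("arg-type"), {})
    return any(m["name"] == arg for m in arg_type.get("members", []))


index = index_schema(SAMPLE_SCHEMA)
print(has_command(index, "block-commit"))             # present in the sample
print(has_command(index, "device_add"))               # absent from the sample
print(command_has_arg(index, "block-commit", "top"))  # member of its arg type
```

Against a real QEMU the list would come from the reply to the introspection command. A missing entry only shows that the command is not in the schema, which, per the known issues above, is also the case for non-qapified commands such as `device_add`.
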
Incremental Backups
-------------------

- Refer to the slides: http://events.linuxfoundation.org/sites/events/files/slides/kvm2015_rh_light_44_vfinal.pdf

qcow2 - why (not)?
------------------

http://events.linuxfoundation.org/sites/events/files/slides/p0.pp_.pdf