Notes for Day-2 KVMForum
========================

QEMU Keynote
------------

- QEMU is now part of the Software Freedom Conservancy
- 3 releases in the last year, w/ lots of features
   - ACPI & PCI on ARM
   - virtio-vga
   - Groundwork for MTTCG
   - TriCore
- More than 1.5 million LoC
- Lines Added
   - Most lines added are for new hardware models 
   - By directory: target-ppc, target-arm (64-bit new spike),
     target-mips
- Contribution by authors/companies
   - 1 - RHT; 2 - Linaro; 3 - Individual contributors


Towards multi-threaded TCG - Alex Bennée and Frederic Konrad
------------------------------------------------------------

- Tiny Code Generator - cross platform emulation w/o hardware
  acceleration

- Current process model

- Multi-threaded TCG
   - Why? -- We're living in a multi-core world
   - Raspberry-Pi: Quad-core Cortex A7 @900 MHz
   - DragonBoard 410c -- $75
   - Intel i7 (4 core + 4 hyperthreads) @3.4 GHz

- Using QEMU for System bring up
   - Increasingly used for prototyping

- As a development tool
   - Instrumentation and inspection
   - Record and playback
   - Reverse debugging

- Cross Tooling

- qemu-linux-user

- Global State in QEMU
   - Numerous globals in TCG generation
   - TCG Runtime Structures
   - 

- Guest Memory Models
   - Atomic behaviour
   - LL/SC Semantics
   - Memory barriers

- 3 approaches
   - Use threads/locks
   - Use processes/IPC
   - Re-write from scratch

- What has been done
   - Protected code generation
   - Serialised the run loop
   - New memory heuristics
   - 

- TCG Runtime structures
- per-CPU variables

- Using locks 
   - expensive for frequently read vCPU structures
   - complex when modifying multiple vCPUs' data

- Deferred Work
   - Existing queued_work mechanism
      - add work to queue
      - signal vCPU to exit
   - New queued_safe_work
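
The queued-work pattern noted above (add work to a queue, then signal
the vCPU to exit its run loop) can be sketched as a toy model. This is
only an illustration of the idea, not QEMU's code: the VCPU class,
queue_work and run_once are hypothetical names, and the Event stands in
for "the vCPU is executing translated code".

```python
import queue
import threading


class VCPU:
    """Toy vCPU that drains deferred work whenever it leaves its run loop."""

    def __init__(self, index):
        self.index = index
        self.work = queue.Queue()               # deferred work items (callables)
        self.exit_request = threading.Event()   # "please leave the run loop"
        self.log = []

    def queue_work(self, func):
        """Callable from any thread: add work to the queue, then kick the vCPU."""
        self.work.put(func)
        self.exit_request.set()

    def run_once(self):
        """Run until kicked, then execute all queued work in vCPU context."""
        self.exit_request.wait()                # stand-in for executing guest code
        self.exit_request.clear()
        while True:
            try:
                func = self.work.get_nowait()
            except queue.Empty:
                break
            func(self)
```

The point of the pattern is that the work item touches the vCPU's
structures only from the vCPU's own thread, so the frequently read
structures need no lock on the hot path.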

- TCG Summary
   - Move global vars. . .


- Guest Memory Models
   - Atomic Behavior is easy when Single Threaded
   - Considerably harder when Multi-threaded

- Load-link/Store-conditional (LL/SC)
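
For reference, the general LL/SC contract is that the conditional
store succeeds only if nothing has written to the address since the
matching load-link. A minimal single-threaded model follows; the Memory
class and its method names are made up for illustration and say nothing
about how the MTTCG work implements it.

```python
class Memory:
    """Toy memory with per-CPU reservations, illustrating LL/SC semantics."""

    def __init__(self):
        self.cells = {}
        self.reservations = {}      # cpu id -> address it holds a reservation on

    def load_link(self, cpu, addr):
        """Load a value and remember that this CPU is watching the address."""
        self.reservations[cpu] = addr
        return self.cells.get(addr, 0)

    def store(self, addr, value):
        """Ordinary store: breaks every reservation on the address."""
        self.cells[addr] = value
        self.reservations = {
            c: a for c, a in self.reservations.items() if a != addr
        }

    def store_conditional(self, cpu, addr, value):
        """Store only if this CPU's reservation on the address is intact."""
        if self.reservations.get(cpu) != addr:
            return False
        self.store(addr, value)
        return True


def atomic_increment(mem, cpu, addr):
    """Classic LL/SC retry loop."""
    while True:
        old = mem.load_link(cpu, addr)
        if mem.store_conditional(cpu, addr, old + 1):
            return old + 1
```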

- SoftMMU
   - Maps guest loads/stores to host memory
      - uses an addend offset
   - Fast path in generated code
   - Slow path in C code
      - Victim cache lookup

   - How it works: Stage one
      -

- Memory Model Summary
   - SoftMMU allows fairly efficient implementation
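
A rough sketch of the lookup structure described in the SoftMMU notes:
a fast path that hits a small TLB and adds an addend to the guest
address, and a slow path that first checks a victim cache before
refilling. The sizes, the dictionary page table and all names here are
assumptions for illustration; in QEMU the fast path is emitted as
generated code rather than called as a function.

```python
PAGE_BITS = 12
PAGE_MASK = ~((1 << PAGE_BITS) - 1)
TLB_SIZE = 4        # deliberately tiny so that evictions happen
VICTIM_SIZE = 2


class SoftMMU:
    def __init__(self, page_table):
        self.page_table = page_table      # guest page -> host page (hypothetical)
        self.tlb = [None] * TLB_SIZE      # entries are (guest_page, addend)
        self.victims = []                 # recently evicted TLB entries
        self.stats = {"fast": 0, "victim": 0, "fill": 0}

    def translate(self, guest_addr):
        """Fast path: direct-mapped TLB hit, host = guest + addend."""
        page = guest_addr & PAGE_MASK
        index = (guest_addr >> PAGE_BITS) % TLB_SIZE
        entry = self.tlb[index]
        if entry is not None and entry[0] == page:
            self.stats["fast"] += 1
            return guest_addr + entry[1]
        return self._slow_path(guest_addr, page, index)

    def _slow_path(self, guest_addr, page, index):
        """Slow path: try the victim cache, otherwise refill from the page table."""
        for i, victim in enumerate(self.victims):
            if victim[0] == page:
                self.stats["victim"] += 1
                entry = self.victims.pop(i)
                break
        else:
            self.stats["fill"] += 1
            entry = (page, self.page_table[page] - page)
        self._install(index, entry)
        return guest_addr + entry[1]

    def _install(self, index, entry):
        """Place an entry in the TLB, moving the evicted one to the victim cache."""
        old = self.tlb[index]
        if old is not None:
            self.victims.append(old)
            del self.victims[:-VICTIM_SIZE]   # keep only the newest victims
        self.tlb[index] = entry
```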

- Device emulation
   - KVM has already done it
      - added thread safety to a number of systems
      - introduced memory API
      - introduced I/O thread

- TCG access to device memory
   - All MMIO pages are flagged in the SoftMMU TLB


Message Passing Workloads
-------------------------

- Usually, anything that frequently switches between running and idle.

- Intuition: Workloads which don't involve IO virt should run at near
  native performance
   - Message Passing Workloads may not involve any IO but will still
     perform nX worse than native
      - (loopback) Memcache: 2x higher latency.

- Microbenchmark: Loopback TCP_RR
   - Client and Server ping-pong 1-byte of data over an established TCP
     connection
   - Loopback: No networking devices (real or virtual) involved
   - Performance: Latency of each transaction
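
To make the shape of the microbenchmark concrete, here is a small
loopback ping-pong in the same spirit: one byte each way over an
established TCP connection, timing each transaction. It is a sketch,
not the TCP_RR test from the talk; the function names are made up, and
because both ends are Python threads in one process the absolute
numbers mean little.

```python
import socket
import threading
import time


def echo_server(listener, transactions):
    """Accept one connection and echo single bytes back."""
    conn, _ = listener.accept()
    with conn:
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(transactions):
            data = conn.recv(1)
            if not data:
                break
            conn.sendall(data)


def loopback_tcp_rr(transactions=1000):
    """Return the latency in seconds of each 1-byte request/response pair."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))       # loopback only, kernel-chosen port
    listener.listen(1)
    server = threading.Thread(target=echo_server, args=(listener, transactions))
    server.start()

    latencies = []
    with socket.create_connection(listener.getsockname()) as client:
        client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for _ in range(transactions):
            start = time.perf_counter()
            client.sendall(b"x")
            reply = client.recv(1)
            latencies.append(time.perf_counter() - start)
            assert reply == b"x"
    server.join()
    listener.close()
    return latencies
```

Running the same script on bare metal and inside a guest, and comparing
the median of the returned list, gives a crude view of the overhead the
talk goes on to analyse.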

- Virtual Overheads of TCP_RR
   - Message Passing on 1 CPU
      - Context Switch
   - Message Passing on >1 CPU
      - Interprocessor-interrupts
   - What's going on under the hood
      - VMEXITs are a good place to look

- MSR_WRITEs of TCP_RR
   - 10 MSR_WRITE
      - "Write to Model Specific Register" instruction executed in the
        guest

- VMEXITs of TCP_RR

- APIC Timer "Initial Count" Register
   - 8 per transaction
      - 4 on the critical path
   - NOHZ (tickless guest kernel)
      - "Disable" scheduler-tick upon entering idle.
      - "Enable" scheduler-tick upon leaving idle.
      - scheduler-tick
   - Why 2 writes. . .

- HLT
   - x86 instruction
   - CPU stops executing instructions until an interrupt arrives
   - 

- IPI+HLT
   - Sending an IPI to wake up a HLT-ed CPU.
   - Same operation on bare metal is entirely implemented in hardware

- KVM versus Hardware
   - Ring 0 Microbenchmark (kvm-unit-tests)
   - Median: KVM is 12x slower
   - Pathological case: KVM is 400x slower

- Notes about the benchmark
   - No guest FPU to save/restore
   - Host otherwise idle (VCPU context switches to idle on HLT)
   - Host power management not the culprit

- KVM HLT Internals
   - Unsurprisingly, the scheduler takes some time to run the vCPU
      - Slow even in the uncontended, cache-hot, case
      - 
   - Experiment: Don't schedule on HLT
      - Eliminate almost all of the latency overhead by not scheduling on
        HLT.
      - Scheduling is often the right thing to do
         - Let other threads run the CPU

- Halt-Polling
   - Step-1: Poll (for up to X nanoseconds)
      - If a task is waiting to run on our CPU, go to Step-2
      - Check if a guest interrupt arrived, if so, we're done
   - Step-2: schedule()
      - 

   - Memcache: 1.5x latency improvement
   - Windows Event Objects: 2x latency improvement
   - Reduces message passing latency by 10-15 us (including network
     latency)

   - Merged in the 4.0 kernel
      - Use the KVM module parameter halt_poll_ns to control how long
        to poll on each HLT

   - Future improvements
      - Automatic poll toggling (remove idle CPU overhead by turning
        polling off)
      - Automatic halt_poll_ns
         - 
      - Lazy Context Switching
         -

- Conclusion
   - Halt-Polling saves 10-15 micro-sec on message passing round-trip
     latency.
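
The two halt-polling steps from the notes can be written down as a
short control-flow sketch. This is a user-space illustration only, not
the KVM implementation: halt, interrupt_pending, other_task_waiting and
block are hypothetical callables, and only the halt_poll_ns name comes
from the talk.

```python
import time


def halt(interrupt_pending, other_task_waiting, block, halt_poll_ns,
         now=time.monotonic_ns):
    """Handle a guest HLT; return "polled" or "scheduled"."""
    # Step 1: poll for up to halt_poll_ns nanoseconds.
    deadline = now() + halt_poll_ns
    while now() < deadline:
        if other_task_waiting():
            break                 # someone else wants this CPU: go to step 2
        if interrupt_pending():
            return "polled"       # woke up without paying for a reschedule
    # Step 2: give up the CPU until an interrupt arrives.
    block()
    return "scheduled"
```

With halt_poll_ns set to 0 the loop body never runs and every HLT goes
straight to the scheduler, which matches the behaviour before
halt-polling was introduced.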


ARM: Caches that give you enough rope to shoot yourself in the foot -
Marc Zyngier
---------------------------------------------------------------------

- ARM: The cache coherency myth, and the facts
   - Facts
      - Cache coherent architecture
      - Scales from single CPU to massive SMP systems
      - Implementer chooses to offer caches that are
         - visible to software
         - invisible to software
         - . . . or any point between those two points

- ARM: Cache architecture
   - (Modified) Harvard architecture
      - Multiple levels of caching (with snooping)
      - 

- ARM: Interacting with caches
   - The ARM arch. offers the usual (mostly) privileged operations to
     interact with caches

- Caches are an essential part of the coherency protocol
   - Using uncached memory explicitly bypasses it
   - It looks logical to cope with the consequences

- Emulated devices: the uncached I/O issue
   - Top rant about KVM/ARM: "My VGA adapter in QEMU doesn't work with
     KVM"
      - Userspace uses cached memory (via mmap)
      - The guest uses non-cached memory
         - Why would the CPU read back from it?

   - How to fix this mess
      - Hack guest attributes, forcing cacheable
         - Breaks devices that _need_ uncached access
      - Cache maintenance from userspace
         - Requires a new syscall on ARMv7
      - Allow userspace to mmap uncached
         - And what if the guest maps it as cached?
      - Just tell the guest the device is coherent
         - Only _real_ solution

- How did we end up here?
   - A VGA device on an ARM VM looks like a terrible idea.
   - VGA was invented in 1987. . .
   - ARM VMs have no legacy to care about
   - We use paravirtualized devices for most things
   - Why don't we use virtio-vga as well?

- Back to coherency: Emulated vs physical devices
   - Firmware does have some level of support to describe the cache
     coherency attributes:

- Conclusion
   - KVM and its ecosystem are strongly x86 oriented (tainted?)
   - Not all the solutions that worked on x86 make sense on ARM
      - Nobody needs a Franken-VM
      - We have the chance of a clean slate
   - It doesn't take much effort to fix KVM
      - All it takes is to read the 
   - We already have modern, efficient solutions
      - Paravirt is the best thing since sliced bread
   - Firmware (UEFI) and high level tools (libvirt[*] and co.) seem to
     be the biggest issues
      - Probably the worst "x86-ism"
      - Isn't _that_ hard to address the problems
      - Just don't assume x86 is _always_ the model to follow
      [*] Rich W.M. Jones corrected the presenter on this about improved
          libvirt (and `virt-install`) support.


QEMU interface introspection: from hacks to solutions - Markus
Armbruster
--------------------------------------------------------------

Part I

- QEMU provides interfaces
   - QMP Monitor
   - Command line

- QEMU command line
   - 139 total options
      - 14 deprecated
      - 2 internal use
   - 123 supported options

- QMP is _even_ bigger
   - 126 commands + 33 events
   - More than 700 named arguments and results
   - Defined in the (book-sized) QAPI/QMP schema

- Command line evolves fast
- QMP evolves _faster_

- Why interface introspection?
   - QEMU provides big, rapidly evolving interfaces
   - A program can
      - Tie to a specific build of QEMU

Part II

Prior work

- Version numbers?
   - QEMU says 0.12.1
   - QMP commands grew by 250% (in Upstream/RHEL6 comparison)

- Version numbers insufficient
   - git diff --shortstat v2.3.0..v2.4.0
   - git log --oneline v2.3.0..v2.4.0 | wc -l

- Just try to use it
   - Workable in simple cases
      - Example: libvirt tries QMP 'inject-nmi', falls back to old HMP
        'nmi'
   - Complex, slow, fragile in not so simple cases
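
The "just try" probe for the NMI case might look roughly like the
following. It is a sketch rather than libvirt's code: send is a
hypothetical callable that delivers one QMP command and returns the
decoded reply, and the exact HMP command line accepted by old QEMU
versions is an assumption.

```python
def inject_nmi(send):
    """Try QMP 'inject-nmi'; fall back to HMP 'nmi' if QEMU does not know it.

    Returns "qmp" or "hmp" to say which path worked.
    """
    reply = send({"execute": "inject-nmi"})
    if "error" not in reply:
        return "qmp"
    if reply["error"].get("class") != "CommandNotFound":
        # The command exists but failed: falling back would hide a real error.
        raise RuntimeError(reply["error"].get("desc", "inject-nmi failed"))

    # Old QEMU: route the human monitor command through QMP.
    reply = send({"execute": "human-monitor-command",
                  "arguments": {"command-line": "nmi"}})
    if "error" in reply:
        raise RuntimeError(reply["error"].get("desc", "nmi failed"))
    return "hmp"
```

Even this simple case needs a round trip per probe and careful
distinction between "command missing" and "command failed", which is
where the approach starts to get fragile.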

- A real-world failure of "just try"
   - block-commit new in v1.2, and libvirt just tried it:
      - Run `block-commit`, and if it succeeds
      - wait for event BLOCK_JOB_COMPLETED

   - Before v2.0, block-commit fails for active layer
     Since then, it succeeds, but requires manual `block-job-complete`
     to complete

     -> Old libvirt never issues `block-job-complete`, so it hangs
        waiting for BLOCK_JOB_COMPLETED

- Pretend to be human: read help
   -help
   -device help
   -device virtio-net,help
   -drive format=help

   Everybody did this (parsing help) until QEMU grew real interfaces

- QMP `query-commands`
   - This is very _limited_ QMP introspection
   - `query-command-line-options`
      - Returns an array of options
         - array of object parameter types. . .
      - `query-command-line-options` is incomplete
          - Probably better than nothing
          - Certainly less than needed
      - `query-command-line-options` is inexpressive
       
         Things we'd like to know, but it can't tell:
          - Formats supported by -drive?
          - It only tells us parameter 'format' is "string"
          - Parameters supported with 'chardev socket'?
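
As an illustration of how little `query-commands` tells a client, here
is a minimal sketch that performs the QMP handshake and collects the
command names. It assumes each QMP message arrives as one line of JSON
on a file-like stream; qmp_command and supported_commands are
hypothetical helper names.

```python
import json


def qmp_command(stream, name, arguments=None):
    """Send one QMP command and return its reply, skipping async events."""
    message = {"execute": name}
    if arguments:
        message["arguments"] = arguments
    stream.write(json.dumps(message) + "\n")
    stream.flush()
    while True:
        reply = json.loads(stream.readline())
        if "event" not in reply:
            return reply


def supported_commands(stream):
    """Return the set of command names reported by query-commands."""
    json.loads(stream.readline())             # consume the QMP greeting
    qmp_command(stream, "qmp_capabilities")   # leave capabilities negotiation
    reply = qmp_command(stream, "query-commands")
    return {entry["name"] for entry in reply["return"]}
```

A stream could be obtained by connecting a Unix socket to a QEMU
started with a `-qmp unix:...` option and calling `makefile("rw")` on
it. The result answers "does this command exist?" and nothing about its
arguments, which is the limitation noted above.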

- Where do we stand now?
   - Current introspection solutions work, but won't cut it much longer

Part III

QMP introspection

- The basic idea
   - Interface introspection turns interface into data
     QMP is defined by QAPI schema
     Schema is _data_, so let clients query for it!

- Cooking: `query-schema`
   - Exposes QMP wire ABI as defined in the schema:
      - Commands, events with arguments & results

- Let's introspect a command
   - QAPI Schema for `query-block`
   - Return type is an array of BlockInfo

- Let's introspect introspection!
   - QAPI schema for `query-schema`
      - returns: 'SchemaInfo'

- Introspect SchemaInfo
   - 

- Quick peek under the hood
   - QAPI schema is compile time static
   - SchemaInfo is generated from it
   - Generator is 160 SLOC of Python
   - Complete info is a bit over 70KiB
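
Once the schema is data, a client can walk it. The sketch below
resolves the return type of `query-block` in a tiny hand-written
excerpt. The entries are only loosely modelled on SchemaInfo: the field
names and the abbreviated BlockInfo member list are assumptions for
illustration, not output captured from QEMU.

```python
# Hand-written, heavily truncated stand-in for the introspection data.
SCHEMA = [
    {"name": "query-block", "meta-type": "command",
     "arg-type": "q_empty", "ret-type": "[BlockInfo]"},
    {"name": "[BlockInfo]", "meta-type": "array",
     "element-type": "BlockInfo"},
    {"name": "BlockInfo", "meta-type": "object",
     "members": [{"name": "device", "type": "str"},
                 {"name": "locked", "type": "bool"}]},
]


def return_members(schema, command):
    """Return (is_array, {member: type}) for a command's return value."""
    by_name = {entry["name"]: entry for entry in schema}
    entry = by_name[command]
    if entry["meta-type"] != "command":
        raise ValueError(command + " is not a command")
    ret = by_name[entry["ret-type"]]
    is_array = ret["meta-type"] == "array"
    if is_array:
        ret = by_name[ret["element-type"]]    # look through the array type
    members = {m["name"]: m["type"] for m in ret.get("members", [])}
    return is_array, members
```

Unlike `query-commands`, this lets a client check for a specific
argument or result member before relying on it.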

Part IV

- QMP introspection limitations
   - Known issues
      - Can see only qapified commands
         - Can't see `device_add`
      - Can see only qapified arguments & results
        We cheat for netdev_add
         - Can't see most of netdev_add's arguments
      - Only as good as the qapification
         - `add_client`

- Cleaning up qapification of netdev_add
   - Problem: type-specific arguments are missing

   Need to
    - qapify the type-specific arguments. . .
    - w/o upsetting the QMP wire format

   Wire format matches QAPI/QMP's flat union type
   Possible solution:
    - Support unions as commands
           
- Qapifying `device_add`
   - Wire format like netdev_add
      - common + driver-specific arguments
     But: drivers collected only at run time!
      - QAPI schema fixed at compile time

- QAPI follow-up work
   - On the way to introspection, we
      - got ourselves real test coverage

- What about the command line?
   - Same basic idea: turn interface into data
   - Good: our CLI definition is data
   - Bad: not QAPI, less expressive, leaves more to code

     Choices:
      - Build non-QAPI CLI introspection?
         - Only as good as the data. . .
      - . . .


Incremental Backups
-------------------

- Refer to the slides:

  http://events.linuxfoundation.org/sites/events/files/slides/kvm2015_rh_light_44_vfinal.pdf


qcow2 - why (not)?
------------------

http://events.linuxfoundation.org/sites/events/files/slides/p0.pp_.pdf