Notes from - New Developments and Advanced Features in the libvirt Management API, by Dan Berrange
==================================================================================================

High-level libvirt architecture view
------------------------------------
- Libvirt provides a stable API, i.e. it does not break in an incompatible
  way.
- "Stateful architecture" (for QEMU, KVM, LXC): the application talks to the
  libvirt library (libvirt.so), which uses a generic RPC mechanism to talk
  to the libvirt daemon (it maintains the state on the virtualization
  host), which in turn talks to the QEMU processes via QMP (the QEMU
  Monitor Protocol).

Disk access protection
----------------------
- Danger scenarios:
  - 2 guests using the same disk image
  - The same guest started twice on different hosts (e.g. you have a single
    VM and you're doing live migration from one host to another; you want
    to ensure that the virtual machine doesn't end up running on both hosts
    at the same time)
- Disk access modes:
  - Read-only, shared
  - Read-write, shared
    - Can be attached to multiple VMs; applicable with a cluster file
      system which is aware that there are multiple "writers" at the same
      time
  - Read-write, exclusive (default)
- Sanlock: disk lease/lock management
  - This is a way to enforce disk access modes
  - Uses the Disk Paxos algorithm
  - Discouraged with NFS; recommended with SAN storage
  - Manual leases provide more control over how the leases are stored and
    maintained (useful for OpenStack) -- see the XML sketch at the end of
    this section
- virtlockd
  - Useful if storage is NFS-based, or Gluster, Ceph, etc.
  - Default locking mechanism for libvirt
  - POSIX fcntl() based locks
  - Requires use of a shared file system
  - Automatic leases:
    - Direct file path
    - Indirect SHA256 hash of the file path -- default mechanism
    - Indirect LVM UUID
    - Indirect SCSI UUID of the LUN
    - The UUID-based methods are slightly safer than the file path: if your
      storage appears at a different file path on different hosts, the
      latter two mechanisms remain stable across hosts
- virtlockd architecture
  - The QEMU driver inside the libvirt daemon talks to the virtlockd daemon
    using an RPC mechanism. Whenever you start a guest, the first thing
    libvirt does is ask virtlockd to acquire locks for all of the guest's
    disk images -- only if this succeeds will the QEMU process be started
  - These locks are also released and reacquired whenever you pause the
    virtual machine -- which is the key to making migration work
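To make the disk access modes and manual leases concrete, here is a
minimal, illustrative domain XML sketch (these elements live under
<devices>); the image path, lockspace name, key, and lease path are
hypothetical examples, not values from the talk:

~~~~~~~~
<!-- Disk access modes: <readonly/> or <shareable/> inside a <disk>;
     omitting both gives the default read-write, exclusive mode -->
<disk type='file' device='disk'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/demo.qcow2'/>
  <target dev='vda' bus='virtio'/>
  <shareable/>  <!-- read-write, shared: needs a cluster-aware FS -->
</disk>

<!-- A manual lease, giving explicit control over where the lease is
     stored and maintained; all values here are illustrative -->
<lease>
  <lockspace>mylockspace</lockspace>
  <key>demo-guest</key>
  <target path='/var/lib/libvirt/sanlock/demo-leases' offset='0'/>
</lease>
~~~~~~~~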
Fine grained access control
---------------------------
- Historically, libvirt had a very simple access control mechanism: if
  you're talking to libvirt over a UNIX domain socket, there are two
  mechanisms:
  - 'read-only' socket: if you connect over this socket, you can get
    information about your virtual machines, but you can't make any changes
  - 'read-write' socket: you can do whatever you like, with no restrictions
    whatsoever
  These are fine for projects like OpenStack which want to be able to do
  anything at any time.
- New ACLs on (object, subject, permission)
  - The new ACLs allow you to express rules such as: user 'frank' can
    'start' guest 'apache'
- This ACL mechanism operates across all libvirt drivers (KVM, LXC, etc.)
- Pluggable backends (in-tree only) -- allows different access control
  mechanisms to be used

Polkit ACLs
~~~~~~~~~~~
- 'polkit' is the main (and currently only) backend option
- Every libvirt API has one or more permissions associated with it (the API
  documentation tells you which permissions are required for which API),
  and these permissions are mapped to polkit actions
  - e.g. if you want the 'start' permission on the 'domain' object, that
    gets mapped to a polkit action called 'org.libvirt.api.domain.start'.
    These can be found in the libvirt online API documentation, which helps
    in figuring out what the mapping is for any API
- Object identifiers as properties
  - You need to identify the object you're managing, e.g. for a VM:
    'driver', 'id', 'uuid', 'name'
- Local UNIX users only
  - i.e. identify the _user_ whose access you're trying to restrict
  - Currently, due to the limitations of polkit, we can identify only local
    UNIX users (i.e. we need to know the local UNIX user that's calling the
    APIs)

Polkit rules for managing ACLs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Has a JavaScript backend (ACLs are written in JavaScript)
- Trivial code snippet:

~~~~~~~~
polkit.addRule(function(action, subject) {
    // 'action' tells us *what* API is being invoked;
    // 'subject' tells us *who* is invoking it
    if (action.id == "org.libvirt.api.domain.getattr" &&
        subject.user == "kashyapc") {
        // Properties on 'action' identify the object being managed
        if (action.lookup("connect_driver") == "LXC" &&
            action.lookup("domain_name") == "demo") {
            return polkit.Result.YES;
        } else {
            return polkit.Result.NO;
        }
    }
});
~~~~~~~~

- There are a number of objects provided to you:
  - 'action' object: tells you *what* API is being invoked
  - 'subject' object: tells you the user who's invoking it
- 'action' has a number of properties to identify the object. E.g. in the
  above snippet, we're looking at an API call with an action of 'getattr',
  the user is 'kashyapc', and the guest is running on the LXC hypervisor
  with the name 'demo' -- if all of these match, we allow access, otherwise
  we deny it
- This is fairly new functionality; try it out and provide feedback
- The reason polkit was chosen as the first engine for access control: it
  makes it easier to hook the rules up to an LDAP rules database

sVirt: SELinux
--------------
- Each VM is a QEMU process
- Default: dynamic MCS labelling with 'svirt_t' or 'svirt_tcg_t'
  - With dynamic MCS, every QEMU process gets its own unique label
- Static label override per guest
  - i.e. you can write custom SELinux policies
- Base label override per guest
  - e.g. replace 'svirt_t' with 'my_svirt_t'
  - Still uses dynamic MCS
- Per-disk label override (a seclabel XML sketch follows the DAC
  subsection below)

sVirt: DAC
~~~~~~~~~~
- Default: fixed 'qemu:qemu' user/group
- Static label override per guest
  - With this, every QEMU process can have its own *unique* user ID, so you
    can rely on traditional UNIX permissions to separate the QEMU processes
    securely
  - Currently you have to assign those user IDs per guest statically, and
    libvirt will take care of dynamically setting the ownership of the disk
    images to match whatever UID your guest runs under
- Dynamic or static image relabelling
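As a sketch of how the sVirt label overrides map to guest XML, each of the
alternatives below would appear as a <seclabel> element in the domain XML;
the specific label values are made-up illustrations, not from the talk:

~~~~~~~~
<!-- Base label override: keep dynamic MCS labelling, but replace the
     default 'svirt_t' type with a custom one -->
<seclabel type='dynamic' model='selinux'>
  <baselabel>system_u:system_r:my_svirt_t:s0</baselabel>
</seclabel>

<!-- Static SELinux label override per guest (e.g. for a custom policy) -->
<seclabel type='static' model='selinux' relabel='yes'>
  <label>system_u:system_r:svirt_t:s0:c123,c456</label>
</seclabel>

<!-- DAC: a static per-guest user/group instead of the fixed qemu:qemu;
     libvirt relabels the disk images to match this UID/GID -->
<seclabel type='static' model='dac' relabel='yes'>
  <label>+107:+107</label>
</seclabel>
~~~~~~~~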
Audit Logging
-------------
- To keep track of who's doing _what_ on your virtualization host; the
  audit log provides a way to find this out
- Whenever libvirt starts/stops a virtual machine, it generates an audit
  record for that operation
- Audit records cover:
  - sVirt label
  - vCPU hotplug
  - Memory balloon/assignment
  - Disk/net/PCI/file-system hotplug
  - cgroup properties ACLs

General logging
---------------
- systemd journald structured data:
  - Raw message
  - Log priority
  - Log reason (debug/audit/trace/error)
  - Source file/line/function

Control Groups
--------------
- New systemd layout:
  - $ROOT/machine.slice/{guest-name}-{qemu,lxc}.scope
- Custom grouping, e.g.:
  - /machine/production

Tuning CPU
----------
- Scheduler tunables (a consolidated tuning XML sketch appears after the
  Q/A at the end of these notes):
  - cpu_shares
  - {vcpu,emulator}_period
  - {vcpu,emulator}_quota
- CPU models:
  - Named model
  - Host model
  - Host passthrough

Tuning Memory
-------------
- NUMA policies
  - Important if you want to maximize utilization of your hardware. You can
    control this manually by telling libvirt which memory nodes you want
    the VM to run on, or you can tell libvirt to do it automatically
  - Static CPU & memory placement via XML
  - Dynamic CPU & memory placement via numad
    - In this case, libvirt talks to numad -- a daemon that runs on the
      host and says: this NUMA node has a lot of resources free, put the VM
      over there
  - Guest NUMA topology
- Allocation backing
  - Huge pages, page sharing, locked
  - Automatic with current upstream kernels
  - KSM (Kernel Samepage Merging) -- useful when you are running lots of
    virtual machines with the same software stack: chances are they have
    many memory pages containing the same data. KSM identifies those
    identical pages and merges them, so you have only *one* copy of each
    such page shared among multiple virtual machines. Overall, this allows
    a higher density of virtual machines
- Memory tunables:
  - hard_limit, soft_limit, swap_hard_limit, min_guarantee

Tuning Block
------------
- Per-guest tunables:
  - weight
  - device_weight (can be applied to individual block devices)
- Per-guest disk tunables:
  - {total,read,write}_iops_sec
  - {total,read,write}_bytes_sec

Tuning Network
--------------
- Per-guest NIC tunables:
  - QoS with 'tc'
- Migration tuning:
  - MiB/second

Q/A
---
- The goal of `virsh` is to directly expose the libvirt functionality to
  the administrator, so there's a full range of control
- You can do locking "per" guest; this needs to be specified in the guest
  XML config (see the lease sketch earlier in these notes)
- CPU models: MMX instruction set
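To tie the tuning sections above together, here is a consolidated,
illustrative domain XML sketch showing the CPU, NUMA, memory, block, and
network tunables they mention; all numbers, names, and paths are made-up
examples, elided parts are marked "...", and driver support for individual
tunables varies:

~~~~~~~~
<domain type='kvm'>
  <name>demo</name>
  ...
  <vcpu placement='auto'>4</vcpu>            <!-- let numad place vCPUs -->
  <cputune>
    <shares>2048</shares>                    <!-- cpu_shares -->
    <vcpu_period>100000</vcpu_period>        <!-- vcpu period/quota -->
    <vcpu_quota>50000</vcpu_quota>
    <emulator_period>100000</emulator_period>
    <emulator_quota>25000</emulator_quota>
  </cputune>
  <numatune>
    <memory mode='strict' placement='auto'/> <!-- or nodeset='0-1' -->
  </numatune>
  <memtune>
    <hard_limit unit='MiB'>4096</hard_limit>
    <soft_limit unit='MiB'>2048</soft_limit>
    <swap_hard_limit unit='MiB'>6144</swap_hard_limit>
  </memtune>
  <memoryBacking>
    <hugepages/>                             <!-- huge-page backing -->
  </memoryBacking>
  <blkiotune>
    <weight>800</weight>                     <!-- per-guest weight -->
    <device>
      <path>/dev/sda</path>                  <!-- per-device weight -->
      <weight>500</weight>
    </device>
  </blkiotune>
  <devices>
    <disk type='file' device='disk'>
      ...
      <iotune>                               <!-- per-disk tunables -->
        <total_iops_sec>1000</total_iops_sec>
        <total_bytes_sec>10485760</total_bytes_sec>
      </iotune>
    </disk>
    <interface type='network'>
      ...
      <bandwidth>                            <!-- per-NIC QoS (via 'tc');
                                                  average/peak in KiB/s -->
        <inbound average='1024' peak='4096' burst='1024'/>
        <outbound average='1024'/>
      </bandwidth>
    </interface>
  </devices>
</domain>
~~~~~~~~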