Notes from - New Developments and Advanced Features in the libvirt
Management API, by Dan Berrange
==================================================================
High-level libvirt architecture view
-------------------------------------
- Libvirt provides a stable API, i.e. it does not break in an incompatible way.
- "Stateful architecture" (for QEMU, KVM, LXC): the application talks to
  the libvirt library (libvirt.so), which uses a generic RPC mechanism
  to talk to the libvirt daemon (it maintains the state of the
  virtualization host), which in turn talks to the QEMU processes via
  QMP (the QEMU Monitor Protocol).
Disk access protection
----------------------
- Danger scenarios:
- 2 guests using same disk image
- Same guest started twice on different hosts (e.g. you have a
single VM, and you're doing live migration from one host to
another, you want to ensure that virtual machine doesn't end up
running on both hosts at the same time)
- Disk access modes
  - Read-only, shared
  - Read-write, shared
- Can be attached to multiple VMs, applicable in cluster
file-system which is aware that there are multiple "writers"
at the same time
- Read-write, exclusive (default)
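  These modes map onto the guest XML; a minimal sketch (the disk path
  and target device below are made up for illustration):
    ~~~~~~~~
    <disk type='file' device='disk'>
      <source file='/var/lib/libvirt/images/shared.img'/>
      <target dev='vdb' bus='virtio'/>
      <!-- <shareable/> marks the disk read-write, shared;
           <readonly/> would mark it read-only, shared;
           omitting both gives the default read-write, exclusive mode -->
      <shareable/>
    </disk>
    ~~~~~~~~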
- Sanlock: Disk lease/lock management
  - This is a way to enforce disk access modes
  - Uses the Disk Paxos algorithm
  - Discouraged with NFS; recommended with SAN storage
- Manual leases provide more control on how the leases are stored
and maintained (useful for OpenStack)
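  - A manual lease is expressed with a <lease> element in the guest
    XML; a sketch (the lockspace, key, and path values below are
    illustrative, not prescribed):
    ~~~~~~~~
    <lease>
      <!-- the lockspace and key identify the lease; the target is
           where sanlock stores the lease data on shared storage -->
      <lockspace>somearea</lockspace>
      <key>somekey</key>
      <target path='/var/lib/libvirt/sanlock/somekey' offset='0'/>
    </lease>
    ~~~~~~~~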
- virtlockd
  - Useful if storage is NFS-based, or Gluster, Ceph, etc.
- Default locking mechanism for libvirt
- POSIX fcntl() based locks
- Requires use of shared file system
- Automatic leases
- Direct file path
      - Indirect SHA256 hash of the file path -- default mechanism
- Indirect LVM UUID
- Indirect SCSI UUID of the LUN
      - The UUID-based methods are slightly safer than the file path,
        because if your storage appears at a different file path on
        different hosts, the latter two mechanisms remain stable
        across hosts
- virtlockd architecture
    - The QEMU driver inside the libvirt daemon talks to the
      virtlockd daemon using an RPC mechanism. Whenever you
      first start a guest, the first thing the driver does is talk to
      the virtlockd daemon and acquire locks for all of the guest's
      disk images -- only if this succeeds will the QEMU process be
      started
    - These locks are also released and reacquired whenever you
      pause the virtual machine -- which is the key to making
      migration work.
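  - Enabling virtlockd for the QEMU driver is done in the daemon
    config files; a sketch (the lockspace directory is illustrative):
    ~~~~~~~~
    # /etc/libvirt/qemu.conf
    lock_manager = "lockd"

    # /etc/libvirt/qemu-lockd.conf
    # acquire leases automatically for every disk in the guest XML
    auto_disk_leases = 1
    # setting file_lockspace_dir switches to indirect leases (files
    # named by the SHA256 hash of the disk path, on a shared
    # file system); leave it unset to lock the image path directly
    file_lockspace_dir = "/var/lib/libvirt/lockd/files"
    ~~~~~~~~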
Fine grained access control
---------------------------
- Historically, libvirt had a very simple access control mechanism: if
  you're talking to libvirt over a UNIX domain socket, there are two
  mechanisms:
    - 'read-only' socket: if you connect over this socket, you can get
      information about your virtual machines, but you can't make any
      changes [Or]
    - 'read-write' socket: you can do whatever you like with no
      restrictions whatsoever
  These are fine for projects like OpenStack which want to be able
  to do anything at any time.
- New ACLs on (object, subject, permission)
    - New ACLs allow you to express rules such as: user
      'frank' can 'start' guest 'apache'
- This ACL mechanism operates across all libvirt drivers (KVM, LXC,
etc)
- Pluggable backends (in-tree only) -- allows different access
control mechanisms to be used
Polkit ACLs
~~~~~~~~~~~
- 'polkit' is the main (and only) backend option
- Every libvirt API has one or more permissions associated with it
(API documentation will tell you what permissions are required for
which API), and these permissions are mapped into PolicyKit actions
- e.g. if you want 'start' permission on the 'domain' object, that
gets mapped into a polkit action called
'org.libvirt.api.domain.start'. These can be found in the libvirt
online API documentation - helps in figuring out what the mapping
is for any APIs
- Object identifiers as properties
    - You identify the object you're managing, e.g. for a
      VM: 'driver', 'id', 'uuid', 'name'
- Local UNIX users only
    - i.e. identify the _user_ that you're trying to restrict access to
- Currently due to the limitations of Polkit, we can identify only
local UNIX users (i.e. we need to know the local UNIX user that's
calling the APIs)
Polkit rules for managing ACLs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- Has a JavaScript backend (ACL rules are written in JavaScript)
- Trivial code snippet:
~~~~~~~~
    polkit.addRule(function(action, subject) {
        if (action.id == "org.libvirt.api.domain.getattr" &&
            subject.user == "kashyapc") {
            if (action.lookup("connect_driver") == 'LXC' &&
                action.lookup("domain_name") == 'demo') {
                return polkit.Result.YES;
            } else {
                return polkit.Result.NO;
            }
        }
    });
~~~~~~~~
- There are a number of objects provided to you:
    - 'action' object: tells you *what* API is being invoked
    - 'subject' object: tells you the user who's invoking it
    - 'action' has a number of properties to identify the object, e.g.
      in the above snippet, we're looking at an API call with an action
      of 'getattr', the user is 'kashyapc', and the guest is running on
      the LXC hypervisor with the name 'demo' -- if all these match,
      then we allow access, or else we deny it
- This is fairly new functionality; try it out and provide feedback.
- The reason for Polkit as the first access-control engine: access
  from Polkit to an LDAP rules database is made easier.
sVirt: SELinux
--------------
- Each VM is a QEMU process
- Default: dynamic MCS with 'svirt_t' or 'svirt_tcg_t'
- Static label override per guest
- i.e. you can write custom SELinux policies
- Base label override per guest
- e.g. replace 'svirt_t' with 'my_svirt_t'
- Still uses dynamic MCS
- Per disk label override
- Every QEMU process has its own user id.
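- The SELinux overrides above are expressed via <seclabel> in the
  guest XML; a sketch (label and MCS values are illustrative, and a
  real guest uses only one of these two forms):
    ~~~~~~~~
    <!-- static label override: you manage the full label yourself -->
    <seclabel type='static' model='selinux' relabel='yes'>
      <label>system_u:system_r:svirt_t:s0:c392,c662</label>
    </seclabel>

    <!-- base label override: keep dynamic MCS, replace svirt_t -->
    <seclabel type='dynamic' model='selinux'>
      <baselabel>system_u:system_r:my_svirt_t:s0</baselabel>
    </seclabel>
    ~~~~~~~~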
sVirt: DAC
~~~~~~~~~~
- Default: fixed 'qemu:qemu' user/group
- With per-guest user IDs, every QEMU process can have its own *unique*
  ID, so you can rely on traditional UNIX permissions to separate the
  QEMU processes securely
- Static label override per guest
- Currently, you have to assign those user IDs, per guest
statically, and libvirt will take care of dynamically setting the
ownership of the disk images to match whatever UID your guest
runs under
- Dynamic or static image relabelling
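- A static DAC override is also a <seclabel> in the guest XML; a
  sketch (the UID/GID pair is illustrative):
    ~~~~~~~~
    <seclabel type='static' model='dac' relabel='yes'>
      <!-- '+' forces the numeric UID:GID even if no matching user
           account exists; relabel='yes' lets libvirt chown the disk
           images to match -->
      <label>+107:+107</label>
    </seclabel>
    ~~~~~~~~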
Audit Logging
-------------
- To keep track of who's doing _what_ on your virtualization host;
  the audit log provides a way to find this out.
- Whenever Libvirt starts/stops a virtual machine, it'll generate
an Audit record for that operation
- sVirt label
- vCPU hotplug
- Memory balloon/assignment
- Disk/net/PCI/file-system hotplug
- cgroup properties ACLs
General logging
---------------
- systemd journald structured data
- Raw message
- Log priority
- Log reason (debug/audit/trace/error)
    - Source file/line/function
Control Groups
--------------
- New systemd layout
    - $ROOT/machine.slice/{guest-name}-{qemu,lxc}.scope
- Custom control groups
    - Custom grouping, e.g.:
      /machine/production
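  The custom grouping is requested per guest via a <resource> element
  in its XML; a sketch using the partition name from the notes:
    ~~~~~~~~
    <resource>
      <!-- place this guest's cgroups under /machine/production -->
      <partition>/machine/production</partition>
    </resource>
    ~~~~~~~~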
- Tuning CPU
- Scheduler tunables
- cpu_shares
- {vcpu,emulator}_period
- {vcpu,emulator}_quota
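  These tunables appear under <cputune> in the guest XML; a sketch
  (the values are illustrative; period/quota are in microseconds,
  quota -1 means unlimited):
    ~~~~~~~~
    <cputune>
      <shares>2048</shares>
      <!-- vCPU threads -->
      <period>100000</period>
      <quota>-1</quota>
      <!-- emulator (non-vCPU) threads -->
      <emulator_period>100000</emulator_period>
      <emulator_quota>-1</emulator_quota>
    </cputune>
    ~~~~~~~~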
- CPU models
- Named model
- Host model
- Host passthrough
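  The three options map onto the <cpu> element; a sketch (the model
  name is illustrative, and a real guest uses exactly one <cpu>
  element):
    ~~~~~~~~
    <!-- named model -->
    <cpu mode='custom' match='exact'>
      <model>SandyBridge</model>
    </cpu>

    <!-- host model -->
    <cpu mode='host-model'/>

    <!-- host passthrough -->
    <cpu mode='host-passthrough'/>
    ~~~~~~~~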
Tuning Memory
-------------
- NUMA policies
    - Important if you want to maximize utilization of your hardware:
      you can control this manually by telling libvirt which memory
      nodes you want the VM to run on, or you can tell libvirt to do
      it automatically
- Static CPU & memory placement via XML
- Dynamic CPU & memory placement via numad
- In this case, libvirt will talk to numad - a daemon that runs
on the host which says: this NUMA node has a lot of resources
free, put the VM over there.
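      Both placements are expressed via <numatune> in the guest XML;
      a sketch (node set and vCPU count are illustrative, and a real
      guest uses one form or the other):
        ~~~~~~~~
        <!-- static placement: pin memory to NUMA nodes 0-1 -->
        <numatune>
          <memory mode='strict' nodeset='0-1'/>
        </numatune>

        <!-- dynamic placement: let numad choose the nodes -->
        <vcpu placement='auto'>4</vcpu>
        <numatune>
          <memory mode='strict' placement='auto'/>
        </numatune>
        ~~~~~~~~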
- Guest NUMA topology
- Allocation backing
- Huge pages, page sharing, locked
- Automatic with current upstream Kernels
- KSM (Kernel Samepage Merging) -- useful when you are running lots of
  virtual machines with the same software stack: chances are they have
  many memory pages containing the same data. KSM identifies the
  memory pages which are identical and merges them, so you have only
  *one* copy of each such page shared among multiple virtual machines.
  Overall: allows a higher density of virtual machines
- Memory tunables
    - hard_limit, soft_limit, swap_hard_limit, min_guarantee
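  The allocation backing and memory tunables go under <memoryBacking>
  and <memtune> in the guest XML; a sketch (limit values are
  illustrative, and in the XML the swap limit is spelled
  swap_hard_limit):
    ~~~~~~~~
    <memoryBacking>
      <hugepages/>
      <locked/>
    </memoryBacking>
    <memtune>
      <hard_limit unit='KiB'>4194304</hard_limit>
      <soft_limit unit='KiB'>2097152</soft_limit>
      <swap_hard_limit unit='KiB'>5242880</swap_hard_limit>
      <min_guarantee unit='KiB'>1048576</min_guarantee>
    </memtune>
    ~~~~~~~~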
Tuning Block
------------
- Per guest tunables
- weight
- device_weight (can be applied to individual block devices)
- Per guest disk tunables
- {total,read,write}_iops_sec
- {total,read,write}_bytes_sec
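  The guest-wide weights go under <blkiotune>, and the per-disk I/O
  throttles under <iotune> inside the <disk> element; a sketch (paths
  and values are illustrative; the total_* tunables are mutually
  exclusive with the read_/write_* variants):
    ~~~~~~~~
    <blkiotune>
      <weight>500</weight>
      <device>
        <path>/dev/sda</path>
        <weight>300</weight>
      </device>
    </blkiotune>

    <disk type='file' device='disk'>
      <source file='/var/lib/libvirt/images/guest.img'/>
      <target dev='vda' bus='virtio'/>
      <iotune>
        <total_iops_sec>1000</total_iops_sec>
        <total_bytes_sec>10485760</total_bytes_sec>
      </iotune>
    </disk>
    ~~~~~~~~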
Tuning Network
--------------
- Per guest NIC tunables
- QoS with 'tc'
- Migration tuning
- MiB/second
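  The NIC QoS is set with a <bandwidth> element on the interface,
  which libvirt enforces via 'tc'; a sketch (values are illustrative;
  average/peak are in KiB/s, burst in KiB). Migration bandwidth is
  capped separately, e.g. with `virsh migrate-setspeed <domain>
  <MiB/s>`:
    ~~~~~~~~
    <interface type='network'>
      <source network='default'/>
      <bandwidth>
        <inbound average='1000' peak='5000' burst='1024'/>
        <outbound average='1000'/>
      </bandwidth>
    </interface>
    ~~~~~~~~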
Q/A
---
- The goal of `virsh` is to directly expose the libvirt functionality
  to the administrator, so there's a full range of control.
- You can do locking per guest; this needs to be specified in the
  guest's XML config
- CPU models
- MMX instruction set