version   author              date         changelog
0.1       pings@juniper.net   2015-11-23   main part of the doc done
0.2       pings@juniper.net   2015-11-28   add "references"
0.3       pings@juniper.net   2016-01-04   add qemu-kvm params breakdown (TODO)

Kernel-based Virtual Machine (KVM) is a virtualization infrastructure for the Linux kernel that turns it into a hypervisor. It has been gaining industry traction and market share in a wide variety of software deployments in just a few short years. KVM has become a very interesting virtualization topic and is well positioned for the future.

Juniper MX86 (VMX: Virtual MX) is a virtual version of the MX Series 3D Universal Edge Router that is installed on an industry-standard x86 server running a Linux operating system, applicable third-party software, and the KVM hypervisor. It is one of the industry-level implementations of a router based on KVM technology.

about this doc:

This doc records some learning and testing notes about KVM/MX86 technology.

  • illustration of various installation procedures

  • illustration of verification procedure

  • illustration of features involved in the VMX installation process, mostly from the KVM/virtualization technology's perspective

  • some case studies

This doc can be used as reference when you:

  • read the official VMX documents, use this as an "accompanying" doc

  • need a quick list of commands to verify a feature or troubleshoot an issue

  • run into issues and need a "working example" as a reference

a "TODO" list:
  • breakdown of qemu command parameters

  • customization of the installation script to make it startup friendly

  • libvirt API scripting

  • DPDK programming

  • internal crabs? (flow cache, junos troubleshooting, internal packet flow, etc)

Part 1: VMX installation

1. prepare for the installation

1.1. identify current system info

before starting the installation, we need to collect some basic information about the server that VMX is about to be built on. At minimum, make sure the server complies with the "Minimum Hardware and Software Requirements".

pre-installation checklist
  • server manufacturer info

  • operating system

  • cpu and memory

  • NIC/controller

  • NIC driver

  • VT-d/IOMMU

  • kvm/qemu version

  • libvirt version

1.1.1. server Manufacturer/model/SN

this server’s manufacturer/model info:
  • this server is a ProLiant BL660c Gen8 server.

  • chassis SN is USE4379WSS

the same info can be acquired from HP iLO, but a command run from an SSH session is more convenient.
 1    ping@trinity:~$ sudo dmidecode |sed -n '/System Info/,/^$/p'
 2    System Information
 3            Manufacturer: HP
 4            Product Name: ProLiant BL660c Gen8          #<------
 5            Version: Not Specified
 6            Serial Number: USE4379WSS                   #<------
 7            UUID: 31393736-3831-5355-4534-333739575353
 8            Wake-up Type: Power Switch
 9            SKU Number: 679118-B21
10            Family: ProLiant

dmidecode -t 1 will print the same info:

 1    ping@trinity:~$ sudo dmidecode -t 1
 2        Handle 0x0100, DMI type 1, 27 bytes
 3        System Information
 4                Manufacturer: HP
 5                Product Name: ProLiant BL660c Gen8
 6                Version: Not Specified
 7                Serial Number: USE4379WSS
 8                UUID: 31393736-3831-5355-4534-333739575353
 9                Wake-up Type: Power Switch
10                SKU Number: 679118-B21
11                Family: ProLiant
-t is handy, but using sed is a more general way to extract a part of the long text output of any command.
dmidecode tool

most hardware data can be queried with the dmidecode and/or lshw commands. the DMI type code for system info is 1, so dmidecode -t 1 will print the same info. refer to the dmidecode manpage for more info about its usage.
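As a quick, hedged illustration of the lshw alternative mentioned above (options as I understand them; verify against your own lshw version), a one-line-per-device summary can be printed with:

    # short summary of the whole system
    sudo lshw -short

    # only the network controllers, similar in scope to "lspci | grep -i ethernet"
    sudo lshw -class network -short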

The SMBIOS specification defines the following DMI types:

Type   Information
   0   BIOS
   1   System
   2   Baseboard
   3   Chassis
   4   Processor
   5   Memory Controller
   6   Memory Module
   7   Cache
   8   Port Connector
   9   System Slots
  10   On Board Devices
  11   OEM Strings
  12   System Configuration Options
  13   BIOS Language
  14   Group Associations
  15   System Event Log
  16   Physical Memory Array
  17   Memory Device
  18   32-bit Memory Error
  19   Memory Array Mapped Address
  20   Memory Device Mapped Address
  21   Built-in Pointing Device
  22   Portable Battery
  23   System Reset
  24   Hardware Security
  25   System Power Controls
  26   Voltage Probe
  27   Cooling Device
  28   Temperature Probe
  29   Electrical Current Probe
  30   Out-of-band Remote Access
  31   Boot Integrity Services
  32   System Boot
  33   64-bit Memory Error
  34   Management Device
  35   Management Device Component
  36   Management Device Threshold Data
  37   Memory Channel
  38   IPMI Device
  39   Power Supply
  40   Additional Information
  41   Onboard Devices Extended Information
  42   Management Controller Host Interface

1.1.2. Operating System

this server’s current operating system info:
  • linux distribution: ubuntu 14.04.2

  • linux kernel in use: 3.19.0-25-generic

 1ping@trinity:~$ lsb_release -a
 2No LSB modules are available.
 3Distributor ID: Ubuntu
 4Description:    Ubuntu 14.04.2 LTS  #<------
 5Release:        14.04
 6Codename:       trusty
 7
 8ping@trinity:~$ uname -a
 9Linux trinity 3.19.0-25-generic #26~14.04.1-Ubuntu SMP Fri Jul 24 21:16:20 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
10              ^^^^^^^^^

there are many other alternative commands to print OS version/release info:

 1    ping@ubuntu1:~$ cat /etc/os-release
 2    NAME="Ubuntu"
 3    VERSION="14.04.1 LTS, Trusty Tahr"
 4    ID=ubuntu
 5    ID_LIKE=debian
 6    PRETTY_NAME="Ubuntu 14.04.1 LTS"
 7    VERSION_ID="14.04"
 8    HOME_URL="http://www.ubuntu.com/"
 9    SUPPORT_URL="http://help.ubuntu.com/"
10    BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
11
12    ping@trinity:~$ cat /etc/lsb-release
13    DISTRIB_ID=Ubuntu
14    DISTRIB_RELEASE=14.04
15    DISTRIB_CODENAME=trusty
16    DISTRIB_DESCRIPTION="Ubuntu 14.04.2 LTS"
17
18    ping@trinity:~$ cat /etc/issue
19    Ubuntu 14.04.2 LTS \n \l
20
21    ping@trinity:~$ cat /etc/issue.net
22    Ubuntu 14.04.2 LTS
23
24    ping@ubuntu1:~$ cat /proc/version
25    Linux version 3.13.0-32-generic (buildd@kissel)
26    (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) )
27    #57-Ubuntu SMP Tue Jul 15 03:51:08 UTC 2014

to list all kernels currently installed on the server:

1ping@trinity:~$ dpkg --get-selections | grep linux-image
2linux-image-3.13.0-32-generic                   install     #<------
3linux-image-3.16.0-30-generic                   install     #<------
4linux-image-3.19.0-25-generic                   install     #<------
5linux-image-extra-3.13.0-32-generic             install
6linux-image-extra-3.16.0-30-generic             install
7linux-image-extra-3.19.0-25-generic             install
8linux-image-generic-lts-utopic                  install

On this server multiple versions of the Linux kernel have been installed, including our target kernel 3.13.0-32-generic. There is no need to install the kernel again - we only need to select the right one and reboot the system with it.

1.1.3. cpu and memory

this server’s CPU and memory info:

 1ping@trinity:~$ lscpu
 2Architecture:          x86_64
 3CPU op-mode(s):        32-bit, 64-bit
 4Byte Order:            Little Endian
 5CPU(s):                32           #<----(1)
 6On-line CPU(s) list:   0-31
 7Thread(s) per core:    1            #<----(2)
 8Core(s) per socket:    8
 9Socket(s):             4            #<----(4)
10NUMA node(s):          4
11Vendor ID:             GenuineIntel
12CPU family:            6
13Model:                 62
14Stepping:              4
15CPU MHz:               3299.797
16BogoMIPS:              6605.01
17Virtualization:        VT-x         #<----(3)
18L1d cache:             32K
19L1i cache:             32K
20L2 cache:              256K
21L3 cache:              16384K
22NUMA node0 CPU(s):     0-7
23NUMA node1 CPU(s):     8-15
24NUMA node2 CPU(s):     16-23
25NUMA node3 CPU(s):     24-31
1 total number of CPU cores
2 hyperthreading is not enabled or not supported in this server
3 VT-x is enabled
4 max number of sockets available (to hold CPU) in the server
 1ping@trinity:~$ grep -m1 "model name"  /proc/cpuinfo
 2model name      : Intel(R) Xeon(R) CPU E5-4627 v2 @ 3.30GHz   #<------
 3
 4ping@trinity:~$ cat /proc/cpuinfo | sed -e '/^$/,$d'
 5processor       : 0
 6vendor_id       : GenuineIntel
 7cpu family      : 6
 8model           : 62
 9model name      : Intel(R) Xeon(R) CPU E5-4627 v2 @ 3.30GHz   #<------
10stepping        : 4
11microcode       : 0x427
12cpu MHz         : 3299.797
13cache size      : 16384 KB
14physical id     : 0
15siblings        : 8
16core id         : 0
17cpu cores       : 8
18apicid          : 0
19initial apicid  : 0
20fpu             : yes
21fpu_exception   : yes
22cpuid level     : 13
23wp              : yes
24flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr  (1)
25                  pge mca cmov pat pse36 clflush dts acpi mmx fxsr
26                  sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp   (2)
27                  lm constant_tsc arch_perfmon pebs bts rep_good
28                  nopl xtopology nonstop_tsc aperfmperf eagerfpu
29                  pni pclmulqdq dtes64 monitor ds_cpl vmx smx est   (3)
30                  tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2
31                  x2apic popcnt tsc_deadline_timer aes xsave avx
32                  f16c rdrand lahf_lm ida arat epb pln pts dtherm   (4)
33                  tpr_shadow vnmi flexpriority ept vpid fsgsbase
34                  smep erms xsaveopt
35bugs            :
36bogomips        : 6599.59
37clflush size    : 64
38cache_alignment : 64
39address sizes   : 46 bits physical, 48 bits virtual
40power management:
1 the pse flag indicates 2M hugepage support
2 the pdpe1gb flag indicates 1G hugepage support
3 the vmx flag indicates VT-x capability
4 the rdrand flag, required by the current VMX implementation, is supported in Ivy Bridge CPUs but not in Sandy Bridge CPUs
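A quick shortcut of my own (not from the VMX docs) to confirm all four flags at once; on this Ivy Bridge based box all four should be listed:

    ping@trinity:~$ grep -m1 -woE 'pse|pdpe1gb|vmx|rdrand' /proc/cpuinfo | sort -u
    pdpe1gb
    pse
    rdrand
    vmx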
memory
ping@trinity: free -h
             total       used       free     shared    buffers     cached
Mem:          503G        13G       490G       1.5M       176M       6.9G
-/+ buffers/cache:       5.9G       497G
Swap:          14G         0B        14G
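Since vCPU pinning later relies on the NUMA layout, it can also be useful (my own addition to the checklist) to look at per-node CPUs and memory with numactl; numactl is installed in section 1.2.4 if not already present:

    # prints the CPUs and the amount of memory attached to each of the 4 NUMA nodes
    numactl --hardware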
the dmidecode version of the CPU information is shown below:
ping@trinity: sudo dmidecode -t 4
# dmidecode 2.12
SMBIOS 2.8 present.

    Handle 0x0400, DMI type 4, 42 bytes
    Processor Information
            Socket Designation: Proc 1
            Type: Central Processor
            Family: Xeon
            Manufacturer: Intel
            ID: E4 06 03 00 FF FB EB BF
            Signature: Type 0, Family 6, Model 62, Stepping 4
            Flags:
                    FPU (Floating-point unit on-chip)
                    VME (Virtual mode extension)
                    DE (Debugging extension)
                    PSE (Page size extension)
                    TSC (Time stamp counter)
                    MSR (Model specific registers)
                    PAE (Physical address extension)
                    MCE (Machine check exception)
                    CX8 (CMPXCHG8 instruction supported)
                    APIC (On-chip APIC hardware supported)
                    SEP (Fast system call)
                    MTRR (Memory type range registers)
                    PGE (Page global enable)
                    MCA (Machine check architecture)
                    CMOV (Conditional move instruction supported)
                    PAT (Page attribute table)
                    PSE-36 (36-bit page size extension)
                    CLFSH (CLFLUSH instruction supported)
                    DS (Debug store)
                    ACPI (ACPI supported)
                    MMX (MMX technology supported)
                    FXSR (FXSAVE and FXSTOR instructions supported)
                    SSE (Streaming SIMD extensions)
                    SSE2 (Streaming SIMD extensions 2)
                    SS (Self-snoop)
                    HTT (Multi-threading)
                    TM (Thermal monitor supported)
                    PBE (Pending break enabled)
            Version:  Intel(R) Xeon(R) CPU E5-4627 v2 @ 3.30GHz
            Voltage: 1.4 V
            External Clock: 100 MHz
            Max Speed: 4800 MHz
            Current Speed: 3300 MHz
            Status: Populated, Enabled
            Upgrade: Socket LGA2011
            L1 Cache Handle: 0x0710
            L2 Cache Handle: 0x0720
            L3 Cache Handle: 0x0730
            Serial Number: Not Specified
            Asset Tag: Not Specified
            Part Number: Not Specified
            Core Count: 8
            Core Enabled: 8
            Thread Count: 8
            Characteristics:
                    64-bit capable

    Handle 0x0401, DMI type 4, 42 bytes
    Processor Information
            Socket Designation: Proc 2
            Type: Central Processor
            ......

    Handle 0x0402, DMI type 4, 42 bytes
    Processor Information
            Socket Designation: Proc 3
            ......

    Handle 0x0403, DMI type 4, 42 bytes
    Processor Information
            Socket Designation: Proc 4
            ......
the dmidecode version of the memory data is printed with dmidecode -t 17:
ping@trinity: sudo dmidecode -t 17
# dmidecode 2.12
SMBIOS 2.8 present.

    Handle 0x1100, DMI type 17, 40 bytes
    Memory Device
            Array Handle: 0x1000
            Error Information Handle: Not Provided
            Total Width: 72 bits
            Data Width: 64 bits
            Size: 32 GB         #<------
            Form Factor: DIMM
            Set: None
            Locator: PROC  1 DIMM  1
            Bank Locator: Not Specified
            Type: DDR3
            Type Detail: Synchronous
            Speed: 1866 MHz
            Manufacturer: HP
            Serial Number: Not Specified
            Asset Tag: Not Specified
            Part Number: 712384-081
            Rank: 4
            Configured Clock Speed: 1866 MHz
            Minimum voltage:  1.500 V
            Maximum voltage:  1.500 V
            Configured voltage:  1.500 V

    Handle 0x1101, DMI type 17, 40 bytes
    Memory Device
            Array Handle: 0x1000
            Error Information Handle: Not Provided
            Total Width: 72 bits
            Data Width: 64 bits
            Size: No Module Installed
            Form Factor: DIMM
            Set: 1
            Locator: PROC  1 DIMM  2
            Bank Locator: Not Specified
            Type: DDR3
            Type Detail: Synchronous
            Speed: Unknown
            Manufacturer: UNKNOWN
            Serial Number: Not Specified
            Asset Tag: Not Specified
            Part Number: NOT AVAILABLE
            Rank: Unknown
            Configured Clock Speed: Unknown
            Minimum voltage:  Unknown
            Maximum voltage:  Unknown
            Configured voltage:  Unknown

    Handle 0x1102, DMI type 17, 40 bytes
    Memory Device
            ......
            Set: 2
            Locator: PROC  1 DIMM  3
            ......

    ......

    Handle 0x111F, DMI type 17, 40 bytes
    Memory Device
            Array Handle: 0x1003
            Error Information Handle: Not Provided
            Total Width: 72 bits
            Data Width: 64 bits
            Size: 32 GB
            Form Factor: DIMM
            Set: 31
            Locator: PROC  4 DIMM  8
            Bank Locator: Not Specified
            Type: DDR3
            Type Detail: Synchronous
            Speed: 1866 MHz
            Manufacturer: HP
            Serial Number: Not Specified
            Asset Tag: Not Specified
            Part Number: 712384-081
            Rank: 4
            Configured Clock Speed: 1866 MHz
            Minimum voltage:  1.500 V
            Maximum voltage:  1.500 V
            Configured voltage:  1.500 V

1.1.4. network adapter/controller

This server’s adapter/NIC/controller info:
  • two types of HP NIC adapters are equipped:

  • Intel 82599 10GE controller

  • the ixgbe kernel driver module that comes with the Linux kernel is in use

    • SR-IOV is supported

    • no VFs have been configured yet (the default)

  • adapter/interface/PCI address mapping table of all physical network interfaces on this server:

    Table 1. server interface table
    NO.   PCI address   adapter   interface name
    1     02:00.0       560FLB    em9
    2     02:00.1       560FLB    em10
    3     06:00.0       560M      p2p1
    4     06:00.1       560M      p2p2
    5     21:00.0       560FLB    em1
    6     21:00.1       560FLB    em2
    7     23:00.0       560M      p3p1
    8     23:00.1       560M      p3p2

lspci is a Linux utility for displaying information about PCI buses in the system and the devices connected to them.

 1ping@trinity:~$ lspci | grep -i ethernet
 202:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
 302:00.1 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
 406:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
 506:00.1 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
 621:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
 721:00.1 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
 823:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
 923:00.1 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)

the number string at the beginning of each line is a "PCI address":

02:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
 ^  ^ ^
 |  | |
 |  | function (port): 0-7
 |  |
 | slot (NIC): 0-1f
 |
bus: 0-ff

there is also a "domain" before "bus:slot.function", usually with a value 0000 and is ignored. see man lspci option -s and -D.

to get more details of each device (NIC port) via PCI address:
ping@trinity:~$ sudo lspci -vs 02:00.0
02:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01) (2)
    Subsystem: Hewlett-Packard Company Ethernet 10Gb 2-port 560FLB Adapter (1)
    Flags: bus master, fast devsel, latency 0, IRQ 64
    Memory at ef700000 (32-bit, non-prefetchable) [size=1M]
    I/O ports at 4000 [size=32]
    Memory at ef6f0000 (32-bit, non-prefetchable) [size=16K]
    [virtual] Expansion ROM at ef400000 [disabled] [size=512K]
    Capabilities: [40] Power Management version 3
    Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
    Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
    Capabilities: [a0] Express Endpoint, MSI 00
    Capabilities: [e0] Vital Product Data
    Capabilities: [100] Advanced Error Reporting
    Capabilities: [140] Device Serial Number 00-00-00-ff-ff-00-00-00
    Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [160] Single Root I/O Virtualization (SR-IOV)            (4)
    Kernel driver in use: ixgbe                                            (3)

ping@trinity:~$ sudo lspci -vs 06:00.0
06:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01) (2)
    Subsystem: Hewlett-Packard Company Ethernet 10Gb 2-port 560M Adapter (1)
    Physical Slot: 2
    Flags: bus master, fast devsel, latency 0, IRQ 136
    Memory at eff00000 (32-bit, non-prefetchable) [size=1M]
    I/O ports at 6000 [size=32]
    Memory at efef0000 (32-bit, non-prefetchable) [size=16K]
    [virtual] Expansion ROM at efc00000 [disabled] [size=512K]
    Capabilities: [40] Power Management version 3
    Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
    Capabilities: [70] MSI-X: Enable+ Count=64 Masked-
    Capabilities: [a0] Express Endpoint, MSI 00
    Capabilities: [e0] Vital Product Data
    Capabilities: [100] Advanced Error Reporting
    Capabilities: [140] Device Serial Number 00-00-00-ff-ff-00-00-00
    Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [160] Single Root I/O Virtualization (SR-IOV)          (4)
    Kernel driver in use: ixgbe                                          (3)

ping@trinity:~$ sudo lspci -vs 21:* | grep -iE "controller|adapter"
21:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01) (2)
        Subsystem: Hewlett-Packard Company Ethernet 10Gb 2-port 560FLB Adapter (1)
21:00.1 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
        Subsystem: Hewlett-Packard Company Ethernet 10Gb 2-port 560FLB Adapter

ping@trinity:~$ sudo lspci -vs 23:* | grep -iE "controller|adapter"
23:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
        Subsystem: Hewlett-Packard Company Ethernet 10Gb 2-port 560M Adapter
23:00.1 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
        Subsystem: Hewlett-Packard Company Ethernet 10Gb 2-port 560M Adapter
1 adapter vendor info
2 controller vendor info
3 the driver currently in use is ixgbe
4 the driver/kernel module supports the SR-IOV feature
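To double-check from the host that no VFs have been configured yet, the sysfs attributes of an ixgbe port can be read (a hedged example of my own; p2p1 is just one interface name from Table 1):

    # maximum number of VFs the device/driver supports on this port
    cat /sys/class/net/p2p1/device/sriov_totalvfs
    # number of VFs currently configured - expected to be 0 at this point
    cat /sys/class/net/p2p1/device/sriov_numvfs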

1.1.5. ixgbe kernel driver

  • the current ixgbe driver on this server is the in-tree driver that came with the running Linux kernel (3.19.0-25-generic):

    • ixgbe version 4.0.1-k

    • support following capabilities:

      • max_vfs parameter: max 63 VFs allowed per port (default 0: VFs are not enabled by default)

      • allow_unsupported_sfp parameter: unsupported/untested SFP allowed

      • debug option

  • the README.txt file from the VMX installation package contains more info about the ixgbe driver

ping@trinity:~$ cat /sys/module/ixgbe/version
4.0.1-k
 1ping@trinity:~$ modinfo ixgbe
 2filename:
 3  /lib/modules/3.19.0-25-generic/kernel/drivers/net/ethernet/intel/ixgbe/ixgbe.ko
 4version:        4.0.1-k     #<------
 5license:        GPL
 6description:    Intel(R) 10 Gigabit PCI Express Network Driver
 7author:         Intel Corporation, <linux.nics@intel.com>
 8srcversion:     44CBFE422F8BAD726E61653
 9alias:          pci:v00008086d000015ABsv*sd*bc*sc*i*
10alias:          pci:v00008086d000015AAsv*sd*bc*sc*i*
11alias:          pci:v00008086d00001563sv*sd*bc*sc*i*
12alias:          pci:v00008086d00001560sv*sd*bc*sc*i*
13alias:          pci:v00008086d0000154Asv*sd*bc*sc*i*
14alias:          pci:v00008086d00001557sv*sd*bc*sc*i*
15alias:          pci:v00008086d00001558sv*sd*bc*sc*i*
16alias:          pci:v00008086d0000154Fsv*sd*bc*sc*i*
17alias:          pci:v00008086d0000154Dsv*sd*bc*sc*i*
18alias:          pci:v00008086d00001528sv*sd*bc*sc*i*
19alias:          pci:v00008086d000010F8sv*sd*bc*sc*i*
20alias:          pci:v00008086d0000151Csv*sd*bc*sc*i*
21alias:          pci:v00008086d00001529sv*sd*bc*sc*i*
22alias:          pci:v00008086d0000152Asv*sd*bc*sc*i*
23alias:          pci:v00008086d000010F9sv*sd*bc*sc*i*
24alias:          pci:v00008086d00001514sv*sd*bc*sc*i*
25alias:          pci:v00008086d00001507sv*sd*bc*sc*i*
26alias:          pci:v00008086d000010FBsv*sd*bc*sc*i*
27alias:          pci:v00008086d00001517sv*sd*bc*sc*i*
28alias:          pci:v00008086d000010FCsv*sd*bc*sc*i*
29alias:          pci:v00008086d000010F7sv*sd*bc*sc*i*
30alias:          pci:v00008086d00001508sv*sd*bc*sc*i*
31alias:          pci:v00008086d000010DBsv*sd*bc*sc*i*
32alias:          pci:v00008086d000010F4sv*sd*bc*sc*i*
33alias:          pci:v00008086d000010E1sv*sd*bc*sc*i*
34alias:          pci:v00008086d000010F1sv*sd*bc*sc*i*
35alias:          pci:v00008086d000010ECsv*sd*bc*sc*i*
36alias:          pci:v00008086d000010DDsv*sd*bc*sc*i*
37alias:          pci:v00008086d0000150Bsv*sd*bc*sc*i*
38alias:          pci:v00008086d000010C8sv*sd*bc*sc*i*
39alias:          pci:v00008086d000010C7sv*sd*bc*sc*i*
40alias:          pci:v00008086d000010C6sv*sd*bc*sc*i*
41alias:          pci:v00008086d000010B6sv*sd*bc*sc*i*
42depends:        mdio,ptp,dca,vxlan
43intree:         Y
44vermagic:       3.19.0-25-generic SMP mod_unload modversions
45signer:         Magrathea: Glacier signing key
46sig_key:        6A:AA:11:D1:8C:2D:3A:40:B1:B4:DB:E5:BF:8A:D6:56:DD:F5:18:38
47sig_hashalgo:   sha512
48parm:           max_vfs:Maximum number of virtual functions to allocate per
49    physical function - default is zero and maximum value is 63. (Deprecated)
50    (uint)
51parm:           allow_unsupported_sfp:Allow unsupported and untested SFP+
52    modules on 82599-based adapters (uint)
53parm:           debug:Debug level (0=none,...,16=all) (int)
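The driver and version bound to a specific port can also be checked with ethtool (interface name taken from Table 1; the grep keeps only the two lines that matter here):

    ping@trinity:~$ sudo ethtool -i p2p1 | grep -E '^(driver|version)'
    driver: ixgbe
    version: 4.0.1-k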

1.1.6. VT-d/IOMMU

The Intel VT-d feature is supported by this server’s current OS kernel.

 1ping@trinity:~$ less /boot/config-3.19.0-25-generic | grep -i iommu
 2CONFIG_GART_IOMMU=y
 3CONFIG_CALGARY_IOMMU=y
 4CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT=y
 5CONFIG_IOMMU_HELPER=y
 6CONFIG_VFIO_IOMMU_TYPE1=m
 7CONFIG_IOMMU_API=y
 8CONFIG_IOMMU_SUPPORT=y          #<------
 9CONFIG_AMD_IOMMU=y
10CONFIG_AMD_IOMMU_STATS=y
11CONFIG_AMD_IOMMU_V2=m
12CONFIG_INTEL_IOMMU=y            #<------
13# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set
14CONFIG_INTEL_IOMMU_FLOPPY_WA=y
15# CONFIG_IOMMU_DEBUG is not set
16# CONFIG_IOMMU_STRESS is not set
17
18ping@trinity:~$ grep -i remap /boot/config-3.13.0-32-generic
19CONFIG_HAVE_IOREMAP_PROT=y
20CONFIG_IRQ_REMAP=y              #<------
21
22ping@ubuntu1:~$ grep -i pci_stub /boot/config-3.13.0-32-generic
23CONFIG_PCI_STUB=m               #<------
a more "general" way to print kernel info is: less /boot/config-`uname -r`

1.1.7. kvm/qemu software version

  • qemu version 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.19)

  • kvm version (same as linux kernel version)

This looks OK compared with the "Minimum Hardware and Software" requirements from the release notes:

Virtualization: QEMU-KVM 2.0.0+dfsg-2ubuntu1.11 or later

To verify qemu/kvm version and hardware acceleration support:

  1. qemu version:

    1ping@trinity:~$ qemu-system-x86_64 --version
    2QEMU emulator version 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.19), Copyright (c) 2003-2008 Fabrice Bellard

    or

    1ping@trinity:~$ kvm --version
    2QEMU emulator version 2.0.0 (Debian 2.0.0+dfsg-2ubuntu1.19), Copyright (c) 2003-2008 Fabrice Bellard
  2. KVM kernel module info:

     1ping@trinity:~$ modinfo kvm
     2filename:       /lib/modules/3.19.0-25-generic/kernel/arch/x86/kvm/kvm.ko
     3license:        GPL
     4author:         Qumranet
     5srcversion:     F58A0F8858A02EFA0549DE5
     6depends:
     7intree:         Y
     8vermagic:       3.19.0-25-generic SMP mod_unload modversions
     9signer:         Magrathea: Glacier signing key
    10sig_key:        6A:AA:11:D1:8C:2D:3A:40:B1:B4:DB:E5:BF:8A:D6:56:DD:F5:18:38
    11sig_hashalgo:   sha512
    12parm:           allow_unsafe_assigned_interrupts:Enable device assignment on platforms without interrupt remapping support. (bool)
    13parm:           ignore_msrs:bool
    14parm:           min_timer_period_us:uint
    15parm:           tsc_tolerance_ppm:uint
  3. HW acceleration OK:

    ping@trinity:~$ kvm-ok
    INFO: /dev/kvm exists
    KVM acceleration can be used
    
    ping@MX86-host-BL660C-B1: ls -l /dev/kvm
    crw-rw---- 1 root kvm 10, 232 Nov  5 11:35 /dev/kvm
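    In addition to kvm-ok, an extra check of my own (not required by the VMX docs) is to confirm that the KVM kernel modules are loaded:

        # both the generic kvm module and the Intel-specific kvm_intel module
        # should be listed when VT-x is enabled in BIOS
        lsmod | grep kvm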

The development of QEMU is very active; code changes (bug fixes, security updates, new features) and new releases appear frequently. For Ubuntu Linux the changelogs are available here

1.1.8. libvirt

  • current libvirt version is 1.2.2

  • this needs to be upgraded to the required libvirt version.

1ping@ubuntu:~$ libvirtd --version
2libvirtd (libvirt) 1.2.2
libvirt 1.2.2 lacks some bug fixes and features that are important for VMX. One of these is numatune, which is used to "pin" the vCPUs of a guest VM to physical CPUs all in one NUMA node, reducing the NUMA misses that are one of the main contributors to performance degradation.

Now, after identifying the server HW/SW configuration, the checklist looks like this:

pre-installation checklist
  • server manufacturer info

  • operating system

  • cpu and memory

  • NIC/controller

  • NIC driver

  • VT-d/IOMMU

  • kvm/qemu version

  • libvirt version

the check marks indicate the items that do not meet the requirements and need some change.

1.2. adjust the system

after collecting the server’s current configuration, we need to change some settings according to the Preparing the System to Install vMX portion of the VMX document.

1.2.1. BIOS setting

the following (Intel) virtualization features are required to set up VMX and need to be enabled from within the BIOS, if not done yet.

  • VT-x

  • VT-d

  • SR-IOV

  • HyperThreading

the requirements may change in different VMX releases

enter BIOS setting:

(screenshot: BIOS setup screen)

even on the same hardware, different versions of the BIOS might look a little bit different. On this system we have BIOS version "I32".

Here are more details of current BIOS info in this server:

labroot@MX86-host-BL660C-B1:~/vmx_20141216$ sudo dmidecode
# dmidecode 2.12
SMBIOS 2.8 present.
227 structures occupying 7860 bytes.
Table at 0xBFBDB000.
Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
        Vendor: HP
        Version: I32        #<------
        Release Date: 02/10/2014
        Address: 0xF0000
        Runtime Size: 64 kB
        ROM Size: 8192 kB
        Characteristics:
                PCI is supported
                PNP is supported
                BIOS is upgradeable
                BIOS shadowing is allowed
                ESCD support is available
                Boot from CD is supported
                Selectable boot is supported
                EDD is supported
                5.25"/360 kB floppy services are supported (int 13h)
                5.25"/1.2 MB floppy services are supported (int 13h)
                3.5"/720 kB floppy services are supported (int 13h)
                Print screen service is supported (int 5h)
                8042 keyboard services are supported (int 9h)
                Serial services are supported (int 14h)
                Printer services are supported (int 17h)
                CGA/mono video services are supported (int 10h)
                ACPI is supported
                USB legacy is supported
                BIOS boot specification is supported
                Function key-initiated network boot is supported
                Targeted content distribution is supported
        Firmware Revision: 1.51
  1. VT-x/VT-d/HyperThreading

    BIOS ▸ System Options ▸ Intel® Virtualization technology

    BIOS ▸ System Options ▸ Intel® VT-d

    BIOS ▸ System Options ▸ Intel® HyperThreading Options

    (screenshot: BIOS System Options menu showing the virtualization settings)

    The newer VMX releases require HyperThreading to be enabled to support the flow cache feature. The installation script will abort if HT is not enabled; see the troubleshooting section on the installation script for more detail on this issue. To install VMX anyway, either modify the script to skip the HT calculation/verification, or install VMX manually.

    • For lab simulation and low performance (less than 100 Mbps) use cases, any x86 processor (Intel or AMD) with VT-d capability.

    • For all other use cases, Intel Ivy Bridge processors or later are required. Example of Ivy Bridge processor: Intel Xeon E5-2667 v2 @ 3.30 GHz 25 MB Cache

    • For single root I/O virtualization (SR-IOV) NIC type, use Intel 82599-based PCI-Express cards (10 Gbps) and Ivy Bridge processors.

    These statements indicate some current implementation info:

    • lab simulation/low performance ⇒ virtio

    • all other use cases ⇒ high performance ⇒ SRIOV

    • "Ivy Bridge CPU" is required for VMX running SR-IOV

  2. SR-IOV

    BIOS ▸ Advanced Options ▸ SR-IOV

    (screenshot: BIOS Advanced Options ▸ SR-IOV setting)

1.2.2. enable iommu/VT-d

to enable VT-d we need to change the kernel boot parameters. Here are the steps:

  1. Make sure the /etc/default/grub file contains this line:

    GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on"
  2. If not, add it:

    echo 'GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on"' >> /etc/default/grub

    So it looks like:

    ping@trinity:~$ grep -i iommu /etc/default/grub
    GRUB_CMDLINE_LINUX_DEFAULT="intel_iommu=on"
  3. Run sudo update-grub to update grub

  4. reboot system to make it start with configured kernel parameters

    sudo reboot
when using 'echo', be careful to use >> instead of >. > will "overwrite" the whole file with whatever is echoed instead of "appending" to it. in case that happens, correct it with the instructions here.
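If GRUB_CMDLINE_LINUX_DEFAULT already exists with other options, appending a second line with echo leaves two conflicting entries. A hedged alternative sketch (adapt the pattern to your own file, then review the result) is to edit the existing value in place with sed:

    # add intel_iommu=on inside the existing quoted value, keeping a backup copy
    sudo sed -i.bak \
      's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"/GRUB_CMDLINE_LINUX_DEFAULT="\1 intel_iommu=on"/' \
      /etc/default/grub
    grep GRUB_CMDLINE_LINUX_DEFAULT /etc/default/grub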

instead of checking the grub config file, looking at the parameters actually passed to the kernel at boot time can be more accurate:

ping@matrix:~$ cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-3.19.0-25-generic.efi.signed root=UUID=875bb72c-c5de-4329-af48-55af85f26398 ro intel_iommu=on pci=realloc

in this example, we are sure the current kernel has IOMMU enabled.

On a system that did not have IOMMU enabled before, after enabling it and restarting you will notice kernel logs similar to the captures below:

ping@Compute24:~$ dmesg -T | grep -iE "iommu|dmar"
[Thu Dec 17 09:07:22 2015] Command line: BOOT_IMAGE=/vmlinuz-3.16.0-30-generic
    root=/dev/mapper/Compute24--vg-root ro intel_iommu=on
    crashkernel=384M-2G:64M,2G-16G:128M,16G-:256M
[Thu Dec 17 09:07:22 2015] ACPI: DMAR 0x00000000BDDAB840 000718 (v01 HP
    ProLiant 00000001 \xffffffd2?   0000162E)
[Thu Dec 17 09:07:22 2015] Kernel command line:
    BOOT_IMAGE=/vmlinuz-3.16.0-30-generic root=/dev/mapper/Compute24--vg-root ro
    intel_iommu=on crashkernel=384M-2G:64M,2G-16G:128M,16G-:256M
[Thu Dec 17 09:07:22 2015] Intel-IOMMU: enabled
[Thu Dec 17 09:07:22 2015] dmar: Host address width 46
[Thu Dec 17 09:07:22 2015] dmar: DRHD base: 0x000000f34fe000 flags: 0x0
[Thu Dec 17 09:07:22 2015] dmar: IOMMU 0: reg_base_addr f34fe000 ver 1:0 cap d2078c106f0466 ecap f020de
[Thu Dec 17 09:07:22 2015] dmar: DRHD base: 0x000000f7efe000 flags: 0x0
[Thu Dec 17 09:07:22 2015] dmar: IOMMU 1: reg_base_addr f7efe000 ver 1:0 cap d2078c106f0466 ecap f020de
[Thu Dec 17 09:07:22 2015] dmar: DRHD base: 0x000000fbefe000 flags: 0x0
[Thu Dec 17 09:07:22 2015] dmar: IOMMU 2: reg_base_addr fbefe000 ver 1:0 cap d2078c106f0466 ecap f020de
[Thu Dec 17 09:07:22 2015] dmar: DRHD base: 0x000000ecffe000 flags: 0x1
[Thu Dec 17 09:07:22 2015] dmar: IOMMU 3: reg_base_addr ecffe000 ver 1:0 cap d2078c106f0466 ecap f020de
[Thu Dec 17 09:07:22 2015] dmar: RMRR base: 0x000000bdffd000 end: 0x000000bdffffff
......
[Thu Dec 17 09:07:22 2015] dmar: RMRR base: 0x000000bddde000 end: 0x000000bdddefff
[Thu Dec 17 09:07:22 2015] dmar: ATSR flags: 0x0
[Thu Dec 17 09:07:22 2015] IOAPIC id 12 under DRHD base  0xfbefe000 IOMMU 2
[Thu Dec 17 09:07:22 2015] IOAPIC id 11 under DRHD base  0xf7efe000 IOMMU 1
[Thu Dec 17 09:07:22 2015] IOAPIC id 10 under DRHD base  0xf34fe000 IOMMU 0
[Thu Dec 17 09:07:22 2015] IOAPIC id 8 under DRHD base  0xecffe000 IOMMU 3
[Thu Dec 17 09:07:22 2015] IOAPIC id 0 under DRHD base  0xecffe000 IOMMU 3
[Thu Dec 17 09:07:24 2015] IOMMU 2 0xfbefe000: using Queued invalidation
[Thu Dec 17 09:07:24 2015] IOMMU 1 0xf7efe000: using Queued invalidation
[Thu Dec 17 09:07:24 2015] IOMMU 0 0xf34fe000: using Queued invalidation
[Thu Dec 17 09:07:24 2015] IOMMU 3 0xecffe000: using Queued invalidation
[Thu Dec 17 09:07:24 2015] IOMMU: Setting RMRR:
[Thu Dec 17 09:07:24 2015] IOMMU: Setting identity map for device 0000:01:00.0 [0xbddde000 - 0xbdddefff]
......
[Thu Dec 17 09:07:24 2015] IOMMU: Setting identity map for device 0000:01:00.2 [0xbddde000 - 0xbdddefff]
[Thu Dec 17 09:07:24 2015] IOMMU: Setting identity map for device 0000:01:00.4 [0xbddde000 - 0xbdddefff]
[Thu Dec 17 09:07:24 2015] IOMMU: Setting identity map for device 0000:03:00.0 [0xe8000 - 0xe8fff]
[Thu Dec 17 09:07:24 2015] IOMMU: Setting identity map for device 0000:03:00.1 [0xe8000 - 0xe8fff]
[Thu Dec 17 09:07:24 2015] IOMMU: Setting identity map for device 0000:02:00.0 [0xe8000 - 0xe8fff]
[Thu Dec 17 09:07:24 2015] IOMMU: Setting identity map for device 0000:02:00.1 [0xe8000 - 0xe8fff]
[Thu Dec 17 09:07:24 2015] IOMMU: Setting identity map for device 0000:06:00.0 [0xe8000 - 0xe8fff]
[Thu Dec 17 09:07:24 2015] IOMMU: Setting identity map for device 0000:06:00.1 [0xe8000 - 0xe8fff]
[Thu Dec 17 09:07:24 2015] IOMMU: Setting identity map for device 0000:01:00.0 [0xe8000 - 0xe8fff]
[Thu Dec 17 09:07:24 2015] IOMMU: Setting identity map for device 0000:01:00.2 [0xe8000 - 0xe8fff]
[Thu Dec 17 09:07:24 2015] IOMMU: Setting identity map for device 0000:21:00.0 [0xe8000 - 0xe8fff]
[Thu Dec 17 09:07:24 2015] IOMMU: Setting identity map for device 0000:21:00.1 [0xe8000 - 0xe8fff]
[Thu Dec 17 09:07:24 2015] IOMMU: Setting identity map for device 0000:23:00.0 [0xe8000 - 0xe8fff]
[Thu Dec 17 09:07:24 2015] IOMMU: Setting identity map for device 0000:23:00.1 [0xe8000 - 0xe8fff]
......
[Thu Dec 17 09:07:24 2015] IOMMU: Setting identity map for device 0000:23:00.0 [0xbdf83000 - 0xbdf84fff]
[Thu Dec 17 09:07:24 2015] IOMMU: Setting identity map for device 0000:23:00.1 [0xbdf83000 - 0xbdf84fff]
......
[Thu Dec 17 09:07:24 2015] IOMMU: Prepare 0-16MiB unity mapping for LPC
[Thu Dec 17 09:07:24 2015] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]

1.2.3. install required linux kernel for SR-IOV based VMX

the Linux kernel needs to be changed if VMX is to be set up with SR-IOV.

currently the ixgbe driver coming with Ubuntu does not work for VMX. The main issue is the lack of multicast support on ingress - multicast packets received on a VF are silently discarded and never delivered into the guest VM; an immediate effect of this is that OSPF (and most of today’s IGPs) neighborship won’t come up. Therefore, building VMX based on SR-IOV requires recompiling the ixgbe kernel driver from source code, which is provided by Juniper with the multicast fix; the code is available in the installation package. At the time of writing this document there is a problem compiling ixgbe from source under any kernel other than 3.13.0-32-generic, which is why the kernel needs to be changed in this setup.

There is a statement about this issue in the VMX document:

Modified IXGBE drivers are included in the package. Multicast promiscuous mode for Virtual Functions is needed to receive control traffic that comes with broadcast MAC addresses. The reference driver does not come with this mode set, so the IXGBE drivers in this package contain certain modifications to overcome this limitation.
— VMX "Getting Started Guide"
there is a plan to make the ixgbe kernel module work with newer kernels in a future VMX release.

use the commands below (provided in the VMX installation doc) to change the kernel:

sudo apt-get install linux-firmware linux-image-3.13.0.32-generic \
                     linux-image-extra-3.13.0.32-generic \
                     linux-headers-3.13.0.32-generic
setup default linux kernel

After changing the kernel for the next reboot, you may want to make it the default kernel for later reboots. To achieve that, follow the steps below:

in file /boot/grub/grub.cfg locate this line:

menuentry 'Ubuntu, with Linux 3.13.0-32-generic'

then move it before the first menuentry entry:

1    export linux_gfx_mode
2                                                #<------move to here
3    menuentry 'Ubuntu' --class ubuntu -  ....

save and reboot the server.

this is optional because otherwise you will still be given a chance to select the kernel version during system reboot.
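An alternative that survives grub.cfg regeneration (my preference rather than the method above; the exact menu titles may differ on your system) is to point GRUB_DEFAULT in /etc/default/grub at the wanted entry and regenerate the config:

    # in /etc/default/grub, set the entry by its "submenu title>entry title" path:
    #   GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 3.13.0-32-generic"
    # then rebuild grub.cfg:
    sudo update-grub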

1.2.4. install required software packages

Use the commands below (provided in the VMX installation doc) to install the required software packages:

1sudo apt-get update
2sudo apt-get install bridge-utils qemu-kvm libvirt-bin python numactl \
3                    python-netifaces vnc4server libyaml-dev python-yaml\
4                    libparted0-dev libpciaccess-dev libnuma-dev libyajl-dev\
5                    libxml2-dev libglib2.0-dev libnl-dev libnl-dev python-pip\
6                    python-dev libxml2-dev libxslt-dev

a quick way to verify whether all of the required software packages in the list were installed correctly is to simply re-run the installation command above. If everything got installed correctly then you will get something like this:

 1sudo apt-get install bridge-utils qemu-kvm libvirt-bin python numactl \
 2python-netifaces vnc4server libyaml-dev python-yaml libparted0-dev \
 3libpciaccess-dev libnuma-dev libyajl-dev libxml2-dev libglib2.0-dev \
 4libnl-dev libnl-dev python-pip python-dev libxml2-dev libxslt-dev
 5Reading package lists... Done
 6Building dependency tree
 7Reading state information... Done
 8Note, selecting 'libxslt1-dev' instead of 'libxslt-dev'
 9bridge-utils is already the newest version.
10libpciaccess-dev is already the newest version.
11libxslt1-dev is already the newest version.
12libyajl-dev is already the newest version.
13python is already the newest version.
14python-dev is already the newest version.
15python-netifaces is already the newest version.
16libnl-dev is already the newest version.
17libglib2.0-dev is already the newest version.
18libnuma-dev is already the newest version.
19libparted0-dev is already the newest version.
20libvirt-bin is already the newest version.
21libxml2-dev is already the newest version.
22libyaml-dev is already the newest version.
23python-yaml is already the newest version.
24qemu-kvm is already the newest version.
25numactl is already the newest version.
26python-pip is already the newest version.
27vnc4server is already the newest version.
280 upgraded, 0 newly installed, 0 to remove and 124 not upgraded.  #<------
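Individual packages can also be checked with dpkg (just a different view of the same information):

    # status and version of a few key packages
    dpkg -s qemu-kvm libvirt-bin | grep -E '^(Package|Status|Version)'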

1.2.5. libvirt

make sure libvirt version 1.2.8 is installed for the "performance version" of VMX. refer to the VMX document for the detailed steps to install it from source code.

The command captures below demonstrate the libvirt upgrade process.

  1. original libvirt coming with Ubuntu 14.04:

    1ping@ubuntu:~$ libvirtd --version
    2libvirtd (libvirt) 1.2.2
  2. download and prepare source code:

    1cd /tmp
    2wget http://libvirt.org/sources/libvirt-1.2.8.tar.gz
    3tar zxvf libvirt-1.2.8.tar.gz
  3. stop and uninstall old version:

     1cd libvirt-1.2.8
     2sudo ./configure --prefix=/usr/local --with-numactl
     3
     4    checking for a BSD-compatible install... /usr/bin/install -c
     5    checking whether build environment is sane... yes
     6    checking for a thread-safe mkdir -p... /bin/mkdir -p
     7    checking for gawk... gawk
     8    checking whether make sets $(MAKE)... yes
     9    ......
    10
    11sudo service libvirt-bin stop
    12    libvirt-bin stop/waiting
    13
    14sudo make uninstall
    15    Making uninstall in .
    16    make[1]: Entering directory `/tmp/libvirt-1.2.8'
    17    make[1]: Leaving directory `/tmp/libvirt-1.2.8'
    18    ......
    19
     20/bin/rm -rf /usr/local/lib/libvirt*
     21    /bin/rm: cannot remove ‘/usr/local/lib/libvirt*’: No such file or directory
  4. install new version:

     1sudo ./configure --prefix=/usr --localstatedir=/ --with-numactl
     2    checking for a BSD-compatible install... /usr/bin/install -c
     3    checking whether build environment is sane... yes
     4    checking for a thread-safe mkdir -p... /bin/mkdir -p
     5......
     6
     7sudo make
     8    make  all-recursive
     9    make[1]: Entering directory `/tmp/libvirt-1.2.8'
    10    Making all in .
    11    make[2]: Entering directory `/tmp/libvirt-1.2.8'
    12    make[2]: Leaving directory `/tmp/libvirt-1.2.8'
    13    Making all in gnulib/lib
    14    make[2]: Entering directory `/tmp/libvirt-1.2.8/gnulib/lib'
    15      GEN      alloca.h
    16      GEN      c++defs.h
    17      GEN      warn-on-use.h
    18      GEN      arg-nonnull.h
    19      GEN      arpa/inet.h
    20    ......
    21
    22sudo make install
    23    Making install in .
    24    make[1]: Entering directory `/tmp/libvirt-1.2.8'
    25    make[2]: Entering directory `/tmp/libvirt-1.2.8'
    26    ......
  5. start new version:

    1ping@ubuntu:/tmp/libvirt-1.2.8$ sudo service libvirt-bin start
    2libvirt-bin start/running, process 24450
    3ping@ubuntu:/tmp/libvirt-1.2.8$ ps aux| grep libvirt
    4
    5ping@ubuntu:/tmp/libvirt-1.2.8$ ps aux| grep libvirt
    6root     24450  0.5  0.0 405252 10772 ?        Sl   21:40   0:00 /usr/sbin/libvirtd -d
  6. verify the new version:

     1ping@trinity:~$ libvirtd --version
     2libvirtd (libvirt) 1.2.8
     3
     4ping@trinity:~$ service libvirt-bin status
     5libvirt-bin start/running, process 1559
     6
     7ping@trinity:~$ which libvirtd
     8/usr/sbin/libvirtd
     9
    10ping@trinity:~$ /usr/sbin/libvirtd --version
    11/usr/sbin/libvirtd (libvirt) 1.2.8
    12
    13ping@trinity:~$ virsh --version
    141.2.8
    15
    16ping@trinity:/images/vmx_20151102.0$ sudo virsh -c qemu:///system version
    17Compiled against library: libvirt 1.2.8
    18Using library: libvirt 1.2.8
    19Using API: QEMU 1.2.8
    20Running hypervisor: QEMU 2.0.0
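    As one more optional check (my own addition), the host NUMA topology that the numatune feature relies on should now be visible through the rebuilt libvirt:

        # the 4 NUMA nodes seen earlier with lscpu should be reported as NUMA cells
        sudo virsh nodeinfo | grep -i numa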

1.3. download vmx installation package

locate the vmx tarball

download the VMX tarball from an internal or public server.

1pings@svl-jtac-tool02:/volume/publish/dev/wrlinux/mx86/15.1F_att_drop$ ls -l
2total 4180932
3-rw-rw-r--  1 rbu-builder  rbu-builder   816737275 Nov  9 12:23 vmx_20151102.0.tgz  #<------
4-rw-rw-r--  1 rbu-builder  rbu-builder  3447736320 Nov  4 12:18 vmx_20151102.0_tarball_issue.tgz

To have a quick look at the guest image files included in the tarball:

pings@svl-jtac-tool02:~$ cd /volume/publish/dev/wrlinux/mx86/15.1F_att_drop/
pings@svl-jtac-tool02:~$ tar tf vmx_20151102.0.tgz | grep images
vmx_20151102.0/images/
vmx_20151102.0/images/jinstall64-vmx-15.1F-20151104.0-domestic.img      (1)
vmx_20151102.0/images/vFPC-20151102.img                                 (2)
vmx_20151102.0/images/vmxhdd.img                                        (3)
vmx_20151102.0/images/metadata_usb.img
1 jinstall image, vRE/VCP (Virtual Control Plane) VM image
2 vFPC image, vFPC/VFP (Virtual Forwarding Plane) VM image
3 hdd image, vRE virtual harddisk image
when downloaded from a Juniper internal server, the jinstall image (vRE guest VM) may or may not be included in the VMX tarball; it is available in the normal Junos release archives. A tarball downloaded from the public URL will always include jinstall and all other necessary images.
locate the vmx jinstall image

If the tarball does not contain a vRE jinstall image, we can download the image from elsewhere and copy it into the same "images" folder after untarring the VMX tarball.

 1pings@svl-jtac-tool02:/volume/build/junos/15.1F/daily/20151102.0/ship$
 2ls -l | grep install64 | grep vmx
 31 builder   748510583 jinstall64-vmx-15.1F-20151102.0-domestic-signed.tgz
 41 builder          33 jinstall64-vmx-15.1F-20151102.0-domestic-signed.tgz.md5
 51 builder          41 jinstall64-vmx-15.1F-20151102.0-domestic-signed.tgz.sha1
 61 builder  1005715456 jinstall64-vmx-15.1F-20151102.0-domestic.img      #<----
 71 builder          33 jinstall64-vmx-15.1F-20151102.0-domestic.img.md5
 81 builder          41 jinstall64-vmx-15.1F-20151102.0-domestic.img.sha1
 91 builder   748226059 jinstall64-vmx-15.1F-20151102.0-domestic.tgz
101 builder          33 jinstall64-vmx-15.1F-20151102.0-domestic.tgz.md5
111 builder          41 jinstall64-vmx-15.1F-20151102.0-domestic.tgz.sha1
There is no guarantee that an arbitrary combination of jinstall and vFPC images will work together. The publicly released VMX packages should already include a tested and working combination. Read the official instructions and documents coming with the software release before starting to install the VMX.

1.4. prepare a "work folder" for the installation

on my server I organize the folders/files in this structure:

 1/virtualization             (1)
 2├── images                  (2)
 3│   ├── ubuntu.img          (3)
 4│   ├── vmx_20151102.0      (3)
 5│   │   ├── build
 6│   │   │   └── vmx1
 7│   │   │       ├── images
 8│   │   │       ├── logs
 9│   │   │       │   └── vmx_1448293071.log
10│   │   │       └── xml
11│   │   ├── config
12│   │   │   ├── samples
13│   │   │   │   ├── vmx.conf.sriov
14│   │   │   │   ├── vmx.conf.virtio
15│   │   │   │   └── vmx-galaxy.conf
16│   │   │   ├── vmx.conf            (6)
17│   │   │   ├── vmx.conf.sriov1     (6)
18│   │   │   ├── vmx.conf.sriov2     (6)
19│   │   │   ├── vmx.conf.ori
20│   │   │   └── vmx-junosdev.conf
21│   │   ├── docs
22
23...<snipped>...
24
25│   └── vmx_20151102.0.tgz  (3)
26├── vmx1                    (4)
27│   ├── br-ext-generated.xml \
28│   ├── br-int-generated.xml  |
29│   ├── cpu_affinitize.sh     |     (5)
30│   ├── vfconfig-generated.sh |
31│   ├── vPFE-generated.xml    |
32│   ├── vRE-generated.xml    /
33│   └── vmxhdd.img          (7)
34└── vmx2                    (4)
35    ├── br-ext-generated.xml  \
36    ├── br-int-generated.xml   |
37    ├── cpu_affinitize.sh      |    (5)
38    ├── vfconfig-generated.sh  |
39    ├── vPFE-generated.xml     |
40    ├── vRE-generated.xml     /
41    └── vmxhdd.img          (7)
42
4327 directories, 136 files
1 parent folder to hold all virtualization files/releases/images
2 "images" sub-folder holds all images for virtualizations
3 VM "images", installation files, tarballs, etc
4 sub-folder to hold all files for installation of multiple VMX VM instances
5 files generated by installation scripts
6 VMX installation configuration file
7 "hard disk" image file, holding all current VMX guest VM configs

2. install VMX using installation script (SR-IOV)

Included in the tarball is an orchestration script that automates the VMX setup on the server. Running this script without any parameters prints a quick usage help.

 1ping@trinity:/virtualization/images/vmx_20151102.0$ sudo ./vmx.sh
 2
 3Usage: vmx.sh [CONTROL OPTIONS]
 4       vmx.sh [LOGGING OPTIONS] [CONTROL OPTIONS]
 5       vmx.sh [JUNOS-DEV BIND OPTIONS]
 6       vmx.sh [CONSOLE LOGIN OPTIONS]
 7
 8    CONTROL OPTIONS:
 9       --install                      : Install And Start vMX
10       --start                        : Start vMX
11       --stop                         : Stop vMX
12       --restart                      : Restart vMX
13       --status                       : Check Status Of vMX
14       --cleanup                      : Stop vMX And Cleanup Build Files
15       --cfg <file>                   : Override With The Specified vmx.conf File
16       --env <file>                   : Override With The Specified Environment .env File
17       --build <directory>            : Override With The Specified Directory for Temporary Files
18       --help                         : This Menu
19
20    LOGGING OPTIONS:
21       -l                             : Enable Logging
22       -lv                            : Enable Verbose Logging
23       -lvf                           : Enable Foreground Verbose Logging
24
25    JUNOS-DEV BIND OPTIONS:
26       --bind-dev                     : Bind Junos Devices
27       --unbind-dev                   : Unbind Junos Devices
28       --bind-check                   : Check Junos Device Bindings
29       --cfg <file>                   : Override With The Specified vmx-junosdev.conf File
30
31    CONSOLE LOGIN OPTIONS:
32       --console [vcp|vfp] [vmx_id]   : Login to the Console of VCP/VFP
33
34    VFP Image OPTIONS:
35       --vfp-info <VFP Image Path>    : Display Information About The Specified vFP image
36
37Copyright(c) Juniper Networks, 2015

2.1. the vmx.conf file

The config file vmx.conf is a centralized place where all of the installation parameters and options are defined. The installation script vmx.sh reads this file as input and generates XML files and shell scripts as output. The generated XML files are later read by the libvirt/virsh tools as the input needed to set up and manipulate the VMs.

The generated shell script will be executed to configure:

  • vcpu pinning (both SRIOV and VIRTIO)

  • VF properties (SRIOV only)

For better readability, vmx.conf is written in YAML format. However, in VMX the guest VM instances are manipulated via libvirt, which currently relies solely on XML. This is why two software packages are included in the pre-installed list to support the YAML to XML conversion:

  • libyaml-dev Fast YAML 1.1 parser and emitter library (development)

  • python-yaml YAML parser and emitter for Python

a short introduction about YAML

YAML (YAML Ain’t Markup Language) is a human friendly data serialization language. The VMX config file uses YAML because it is about as close to plain English as data serialization and configuration formats get. The advantage of YAML is that it does not require curly braces and lets you omit quotation marks for strings in most cases, relying on indentation for structure, which makes it much more readable than XML.

Important tips about YAML:

  • indentation matters: YAML relies on indentation to understand the data structure

  • use spaces instead of tabs; tabs are not universally supported across implementations

  • YAML is case sensitive

In short, every space/indentation matters. Take caution when modifying the vmx.conf file. A good practice is to just replace the parameters (numbers, image path, interface names, MAC, etc.) and leave everything else intact.
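Since python-yaml is installed as one of the prerequisites, a quick sanity check of my own (not part of the official workflow) is to let Python parse the file before running vmx.sh; it will complain about tabs or broken indentation:

    cd /virtualization/images/vmx_20151102.0/config
    # vmx.conf contains several YAML documents separated by "---", hence safe_load_all
    python -c "import yaml; list(yaml.safe_load_all(open('vmx.conf'))); print('vmx.conf: YAML OK')"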

The vmx.conf template coming with the VMX tarball looks like this:

 1ping@trinity:/virtualization/images/vmx_20151102.0/config$ cat vmx.conf
 2##############################################################
 3#
 4#  vmx.conf
 5#  Config file for vmx on the hypervisor.
 6#  Uses YAML syntax.
 7#  Leave a space after ":" to specify the parameter value.
 8#
 9##############################################################
10
11---
12#Configuration on the host side - management interface, VM images etc.
13HOST:                                                                      (1)
14    identifier                : vmx1   # Maximum 4 characters              (2)
15    host-management-interface : eth0                                       (3)
16    routing-engine-image      : "/home/vmx/vmxlite/images/jinstall64-vmx.img"(4)
17    routing-engine-hdd        : "/home/vmx/vmxlite/images/vmxhdd.img"      (4)
18    forwarding-engine-image   : "/home/vmx/vmxlite/images/vPFE.img"        (4)
19
20---
21#External bridge configuration                                             (5)
22BRIDGES:
23    - type  : external
24      name  : br-ext                  # Max 10 characters
25
26---
27#vRE VM parameters                                                         (6)
28CONTROL_PLANE:
29    vcpus       : 1
30    memory-mb   : 1024
31    console_port: 8601
32
33    interfaces  :
34      - type      : static
35        ipaddr    : 10.102.144.94
36        macaddr   : "0A:00:DD:C0:DE:0E"
37
38---
39#vPFE VM parameters                                                        (7)
40FORWARDING_PLANE:
41    memory-mb   : 6144
42    vcpus       : 3
43    console_port: 8602
44    device-type : virtio                                                   (8)
45
46    interfaces  :
47      - type      : static
48        ipaddr    : 10.102.144.98
49        macaddr   : "0A:00:DD:C0:DE:10"
50
51---
52#Interfaces                                                                (9)
53JUNOS_DEVICES:
54   - interface            : ge-0/0/0
55     mac-address          : "02:06:0A:0E:FF:F0"
56     description          : "ge-0/0/0 interface"
57
58   - interface            : ge-0/0/1
59     mac-address          : "02:06:0A:0E:FF:F1"
60     description          : "ge-0/0/0 interface"
61
62   - interface            : ge-0/0/2
63     mac-address          : "02:06:0A:0E:FF:F2"
64     description          : "ge-0/0/0 interface"
65
66   - interface            : ge-0/0/3
67     mac-address          : "02:06:0A:0E:FF:F3"
68     description          : "ge-0/0/0 interface"
1 "HOST" config section.
2 ID of the vmx instance. this string will be encoded into the final VM name: vcp-<ID> or vfp-<ID> .
3 current management interface of the server. the installation script will "move" the IP/MAC property from this port to an external bridge named "br-ext".
4 vRE/vFPC/Harddisk VM images location.
5 external bridge configuration section: a bridge utility, named br-ext, will be created, for managment connection from/to the external networks.
6 vRE configuration section: this template uses 1 vCPU, 1G mem, console port 8601 to start vRE guest VM. the VM mgmt interface’s "peer interface" [1]from the host - vcp_ext-vmx1 ("attached" to fxp0 port from inside the guest VM), will be configured with the specified MAC address. the corresponding fxp0 interface from inside of vRE VM will inherit same MAC from it. [2]
7 vFPC configuration section: this template uses 3 vCPU, 6G mem, console port 8602 to build vFPC guest VM. the VM mgmt interface’s "peer interface" [1] from the host - vfp_ext-vmx1 ("attached" to ext port from side the guest VM) will be configured with the specified MAC address. the ext interface from inside the vFPC guest VM will inherit same MAC from it.
8 as a KVM implementation, VMX currently supports two type of network IO virtualization : VT-d + SRIOV, or VIRTIO. this config knob device-type will determine which IO virtualization technology will be used to build VMX. This template uses "virtio" IO virtualization.
9 VMX router interface configuration section: This is where the router ge-0/0/z properties can be configured. depending on device-type value the available configurable properties will be different. Since this template uses "virtio" virtualization, only "mac-address" is configurable. more details will be covered in later sections of this doc.
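
Since vmx.conf is plain YAML, a quick syntax check before running the installation script can save a failed run. Below is a minimal sketch using the python-yaml package that the installer already requires; the path is just this template's location and may differ in your setup:

    # parse every YAML document in the file; any indentation/syntax error raises an exception
    python -c "import sys, yaml; list(yaml.safe_load_all(open(sys.argv[1]))); print('YAML OK')" config/vmx.conf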

2.2. assigning MAC address

the "virtual" MX will use virual NIC running inside the guest VM. so unlike any real NIC coming with a built-in MAC address which is globally uniquely assigned, the MAC address of "vNIC" is what you assigned before or after the VMX instances were brought up.

Figure 1. internal and external interfaces of VFP and VCP VM

To avoid conflicts with other external devices, at least the following MAC addresses in the diagram need to be unique:

  1. MAC for ge-0/0/z

  2. fxp0 on VRE(VCP)

  3. eth1 or ext on VPFE(VFP)

The VFP VM interface names changed in newer VMX releases.

Table 2. VFP interface name

  before 15.1    from 15.1    description
  eth1           ext          VFP interface facing external networks
  eth0           int          VFP interface for internal use

These MAC addresses will exit the server and be learned by external devices.

Below are the internal/isolated interfaces which never communicate with external devices; the MAC address doesn’t matter for these interfaces:

  1. em1 on VRE

  2. eth0 or int on VPFE

"Locally Administered MAC Address" is good candidate to be used in lab test environment:

x2-xx-xx-xx-xx-xx
x6-xx-xx-xx-xx-xx
xA-xx-xx-xx-xx-xx
xE-xx-xx-xx-xx-xx
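
The "locally administered" property is simply bit 1 (value 0x02) of the first octet. A small shell sketch of the bit test, just for illustration (not part of any VMX tooling):

    mac="02:04:17:01:02:02"
    first_octet=$(( 16#${mac%%:*} ))      # first octet as a number
    if (( first_octet & 0x02 )); then
        echo "$mac is locally administered"
    else
        echo "$mac is globally unique (OUI-based)"
    fi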

In my setup I follow the simple rule below to avoid MAC address conflicts with other systems in the lab network:

mac-address          : "02:04:17:01:02:02"
                         - ----- -- -- --
                         |    |   |  |  |
                         |    |   |  | (5)
                         |    |   | (4)
                         |    |  (3)
                         |   (2)
                        (1)
1 locally administered MAC address
2 last 4 digits of the management interface’s IP (e.g. x.y.4.17 gives 04:17) or of its MAC address.
3 VMX instance number, first VMX instance uses 01, second uses 02, etc.
4 01 for control plane interface (fxp0 for VCP, ext for VFP) of a VM, 02 for forwarding plane interface (ge-0/0/z) of a VM
5 assign a unique number for each type of interface
Table 3. an example of MAC address allocation with this rule

  VMX instance1    vRE fxp0     02:04:17:01:01:01
                   vFPC ext     02:04:17:01:01:02
                   ge-0/0/0     02:04:17:01:02:01
                   ge-0/0/1     02:04:17:01:02:02

  VMX instance2    vRE fxp0     02:04:17:02:01:01
                   vFPC ext     02:04:17:02:01:02
                   ge-0/0/0     02:04:17:02:02:01
                   ge-0/0/1     02:04:17:02:02:02
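
The rule is mechanical enough to script. A small bash sketch that prints MAC addresses following the scheme above (the 04:17 part encodes my own management IP x.y.4.17 - adjust it to your environment):

    gen_mac() {     # usage: gen_mac <instance> <type: 1=mgmt, 2=ge-0/0/z> <sequence>
        printf "02:04:17:%02d:%02d:%02d\n" "$1" "$2" "$3"
    }
    gen_mac 1 1 1   # vmx1 vRE fxp0  -> 02:04:17:01:01:01
    gen_mac 1 1 2   # vmx1 vFPC ext  -> 02:04:17:01:01:02
    gen_mac 1 2 1   # vmx1 ge-0/0/0  -> 02:04:17:01:02:01
    gen_mac 2 2 2   # vmx2 ge-0/0/1  -> 02:04:17:02:02:02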

2.3. modify the vmx.conf (SR-IOV)

The config file needs to be changed according to the installation plan. This is the vmx.conf file that I use to set up an SR-IOV based VMX.

 1ping@trinity:/virtualization/images/vmx_20151102.0/config$ cat vmx.conf
 2##############################################################
 3#
 4#  vmx.conf
 5#  Config file for vmx on the hypervisor.
 6#  Uses YAML syntax.
 7#  Leave a space after ":" to specify the parameter value.
 8#
 9##############################################################
10
11---
12#Configuration on the host side - management interface, VM images etc.
13HOST:
14    identifier                : vmx1   # Maximum 4 characters
15    host-management-interface : em1
16    routing-engine-image      : "/virtualization/images/vmx_20151102.0/images/jinstall64-vmx-15.1F-20151104.0-domestic.img"
17    routing-engine-hdd        : "/virtualization/images/vmx_20151102.0/vmx1/vmxhdd.img"
18    forwarding-engine-image   : "/virtualization/images/vmx_20151102.0/images/vFPC-20151102.img"
19
20---
21#External bridge configuration
22BRIDGES:
23    - type  : external
24      name  : br-ext                  # Max 10 characters
25
26---
27#vRE VM parameters
28CONTROL_PLANE:
29    vcpus       : 1
30    memory-mb   : 2048
31    console_port: 8816
32
33    interfaces  :
34      - type      : static
35        ipaddr    : 10.85.4.105
36        macaddr   : "02:04:17:01:01:01"
37---
38#vPFE VM parameters
39FORWARDING_PLANE:
40    memory-mb   : 16384
41    vcpus       : 4                     (1)
42    console_port: 8817
43    device-type : sriov
44
45    interfaces  :
46      - type      : static
47        ipaddr    : 10.85.4.106
48        macaddr   : "02:04:17:01:01:02"
49
50---
51#Interfaces
52JUNOS_DEVICES:
53    - interface            : ge-0/0/0
54      port-speed-mbps      : 10000      (2)
55      nic                  : p3p1       (2)
56      mtu                  : 2000             # DO NOT EDIT (2)
57      virtual-function     : 0          (2)
58      mac-address          : "02:04:17:01:02:01"
59      description          : "ge-0/0/0 connects to eth6"
60
61    - interface            : ge-0/0/1
62      port-speed-mbps      : 10000      (2)
63      nic                  : p2p1       (2)
64      mtu                  : 2000             # DO NOT EDIT
65      virtual-function     : 0          (2)
66      mac-address          : "02:04:17:01:02:02"
67      description          : "ge-0/0/1 connects to eth7"
1 assign 4 CPUs to vPFE VM
2 SR-IOV only options
key parameters in this conf file:
  • my server’s mgmt interface name is em1

  • vRE and vFPC image locations point to the images folder from the untarred installation package. These images can then be shared by all VMX instances.

  • the virtual harddisk image vmxhdd.img is placed in a separate folder created specifically for the current VMX instance

  • use console port 88x6 for the vRE guest VM and 88x7 for the vFPC guest VM, where x = instance number. Any other port number not yet in use is fine.

  • MAC addresses are allocated following the above mentioned rule

  • for SR-IOV virtualization, these link properties need to be configured:

    • physical NIC name

    • VF number

    • MAC address

    • link speed

    • MTU

for virtio virtualization, only the MAC address can be defined in the config file. the MTU can be defined in a separate file vmx-junosdev.conf.

In this example only 4 CPUs were assigned to the vPFE VM, which is less than what is required for full performance in a production environment ("performance mode").

In VMX 15.1, the recommended number of CPUs for "performance mode" is calculated as follows:

  • 1 for host-if

  • 1 for flow-manager

  • 1 for vmxt process

  • 2 per "IO thread": 1 for receiving and 1 for sending

  • remaining for "worker thread"

non-performance mode, or "lite" mode, requires fewer CPUs - a minimum of 3 CPUs is fine to bring up the VFP.

Here in the lab environment, for test/study purposes, I was able to bring up a 2-interface SR-IOV VMX with 5 CPUs in total - 1 for the VCP and 4 for the VFP.
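
As a rough back-of-the-envelope sketch of the "performance mode" math above, assuming one IO thread per port and a chosen number of worker threads (my assumption for illustration, not an official formula):

    ports=2; workers=4                        # example values only
    io=$(( 2 * ports ))                       # 1 receiving + 1 sending CPU per IO thread
    total=$(( 1 + 1 + 1 + io + workers ))     # host-if + flow-manager + vmxt + IO + workers
    echo "vPFE vCPUs needed: $total"          # -> 11 for this example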

2.4. run the installation script: SR-IOV

After modifying the vmx.conf file, we can run the vmx.sh script to set up the VMX.

  1ping@trinity:/virtualization/images/vmx_20151102.0$ sudo ./vmx.sh --install
  2==================================================
  3    Welcome to VMX
  4==================================================
  5Date..............................................11/24/15 21:18:36
  6VMX Identifier....................................vmx1
  7Config file......................................./virtualization/images/vmx_20151102.0/config/vmx.conf
  8Build Directory.................................../virtualization/images/vmx_20151102.0/build/vmx1
  9Environment file................................../virtualization/images/vmx_20151102.0/env/ubuntu_sriov.env
 10Junos Device Type.................................sriov
 11Initialize scripts................................[OK]
 12Copy images to build directory....................[OK]
 13==================================================
 14    VMX Environment Setup Completed
 15==================================================
 16==================================================
 17    VMX Install & Start
 18==================================================
 19Linux distribution................................ubuntu
 20Intel IOMMU status................................[Enabled]
 21Verify if GRUB needs reboot.......................[No]
 22Installation status of qemu-kvm...................[OK]
 23Installation status of libvirt-bin................[OK]
 24Installation status of bridge-utils...............[OK]
 25Installation status of python.....................[OK]
 26Installation status of libyaml-dev................[OK]
 27Installation status of python-yaml................[OK]
 28Installation status of numactl....................[OK]
 29Installation status of libnuma-dev................[OK]
 30Installation status of libparted0-dev.............[OK]
 31Installation status of libpciaccess-dev...........[OK]
 32Installation status of libyajl-dev................[OK]
 33Installation status of libxml2-dev................[OK]
 34Installation status of libglib2.0-dev.............[OK]
 35Installation status of libnl-dev..................[OK]
 36Check Kernel version..............................[OK]
 37Check Qemu version................................[OK]
 38Check libvirt version.............................[OK]
 39Check virsh connectivity..........................[OK]
 40Check IXGBE drivers...............................[OK]
 41==================================================
 42    Pre-Install Checks Completed
 43==================================================
 44Check for VM vcp-vmx1.............................[Running]
 45Shutdown vcp-vmx1.................................[OK]
 46Check for VM vfp-vmx1.............................[Running]
 47Shutdown vfp-vmx1.................................[OK]
 48Cleanup VM states.................................[OK]
 49Check if bridge br-ext exists.....................[Yes]
 50Get Configured Management Interface...............em1
 51Find existing management gateway..................br-ext
 52Mgmt interface needs reconfiguration..............[Yes]
 53Gateway interface needs change....................[Yes]
 54Check if br-ext has valid IP address..............[Yes]
 55Get Management Address............................10.85.4.17
 56Get Management Mask...............................255.255.255.128
 57Get Management Gateway............................10.85.4.1
 58Del em1 from br-ext...............................[OK]
 59Configure em1.....................................[Yes]
 60Cleanup VM bridge br-ext..........................[OK]
 61Cleanup VM bridge br-int-vmx1.....................[OK]
 62Cleanup IXGBE drivers.............................[OK]
 63==================================================
 64    VMX Stop Completed
 65==================================================
 66Check VCP image...................................[OK]
 67Check VFP image...................................[OK]
 68VMX Model.........................................FPC
 69Check VCP Config image............................[OK]
 70Check management interface........................[OK]
 71Check interface p3p1..............................[OK]
 72Check interface p2p1..............................[OK]
 73Setup huge pages to 32768.........................[OK]
 74Number of Intel 82599 NICs........................8
 75Configuring Intel 82599 Adapters for SRIOV........[OK]
 76Number of Virtual Functions created...............[OK]
 77Attempt to kill libvirt...........................[OK]
 78Attempt to start libvirt..........................[OK]
 79Sleep 2 secs......................................[OK]
 80Check libvirt support for hugepages...............[OK]
 81==================================================
 82    System Setup Completed
 83==================================================
 84Get Management Address of em1.....................[OK]
 85Generate libvirt files............................[OK]
 86Sleep 2 secs......................................[OK]
 87Configure virtual functions for SRIOV.............[OK]
 88Find configured management interface..............em1
 89Find existing management gateway..................em1
 90Check if em1 is already enslaved to br-ext........[No]
 91Gateway interface needs change....................[Yes]
 92Create br-ext.....................................[OK]
 93Get Management Gateway............................10.85.4.1
 94Flush em1.........................................[OK]
 95Start br-ext......................................[OK]
 96Bind em1 to br-ext................................[OK]
 97Get Management MAC................................38:ea:a7:37:7c:54
 98Assign Management MAC 38:ea:a7:37:7c:54...........[OK]
 99Add default gw 10.85.4.1..........................[OK]
100Create br-int-vmx1................................[OK]
101Start br-int-vmx1.................................[OK]
102Check and start default bridge....................[OK]
103Define vcp-vmx1...................................[OK]
104Define vfp-vmx1...................................[OK]
105Wait 2 secs.......................................[OK]
106Start vcp-vmx1....................................[OK]
107Start vfp-vmx1....................................[OK]
108Wait 2 secs.......................................[OK]
109Perform CPU pinning...............................[OK]
110==================================================
111    VMX Bringup Completed
112==================================================
113Check if br-ext is created........................[Created]
114Check if br-int-vmx1 is created...................[Created]
115Check if VM vcp-vmx1 is running...................[Running]
116Check if VM vfp-vmx1 is running...................[Running]
117Check if tap interface vcp_ext-vmx1 exists........[OK]
118Check if tap interface vcp_int-vmx1 exists........[OK]
119Check if tap interface vfp_ext-vmx1 exists........[OK]
120Check if tap interface vfp_int-vmx1 exists........[OK]
121==================================================
122    VMX Status Verification Completed.
123==================================================
124Log file........................................../dev/null
125==================================================
126    Thankyou for using VMX
127==================================================
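
The output above reports the hugepage setup and VF creation; both can be double-checked from the host shell. A quick sketch, assuming p3p1 is one of the 82599 ports as in my config (the sriov_numvfs sysfs file is present on SR-IOV capable NICs with recent kernels):

    grep -i huge /proc/meminfo                     # HugePages_Total should reflect the 32768 setup
    cat /sys/class/net/p3p1/device/sriov_numvfs    # number of VFs created on this PF
    ip link show p3p1 | grep "vf 0"                # VF 0 and the MAC configured in vmx.conf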

2.5. quick verification

After the installation we need to verify that the installed VMX is working well.

  1. list running VMs: vRE and vFPC

    ping@trinity:~$ sudo virsh list
    [sudo] password for ping:
     Id    Name                           State
    ----------------------------------------------------
     2     vcp-vmx1                       running
     3     vfp-vmx1                       running
  2. login to vRE

     1ping@trinity:~$ telnet localhost 8816
     2Trying ::1...
     3Trying 127.0.0.1...
     4Connected to localhost.
     5Escape character is '^]'.
     6
     7Amnesiac (ttyd0)
     8
     9login: root
    10root@% cli
    11root>
  3. verify if virtual PFE is "online"

    1labroot> show chassis fpc
    2                     Temp  CPU Utilization (%)   CPU Utilization (%)  Memory    Utilization (%)
    3Slot State            (C)  Total  Interrupt      1min   5min   15min  DRAM (MB) Heap     Buffer
    4  0  Online           Testing   3         0        3      3      2      1         6          0
    5
    6labroot> show chassis fpc pic-status
    7Slot 0   Online       Virtual FPC
    8  PIC 0  Online       Virtual

    This may take a couple of minutes.

  4. login to vPFE

    root@trinity:~# telnet localhost 8817
    Trying ::1...
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    
    
    Wind River Linux 6.0.0.12 vfp-vmx1 console
    
    vfp-vmx1 login: pfe
    Password:
    pfe@vfp-vmx1:~$ cat /etc/issue.net
    Wind River Linux 6.0.0.12 %h

there are two built-in accounts to log in to the VFP yocto VM:

  • pfe with password pfe

  • root with password root

Later we’ll demonstrate a ping and OSPF neighborship test. A more intensive verification requires enabling more features and protocols (OSPF/BGP/multicast/etc.), which is beyond the scope of this doc.
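
Besides the console logins above, a quick check from the host that both console ports are actually being served can also be useful (a sketch; 8816/8817 are the ports from my vmx.conf):

    sudo netstat -tlnp | grep -E ':(8816|8817) '   # expect a listener for each console port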

2.6. uninstall (cleanup) VMX with installation script

To uninstall VMX, just run the same script with the --cleanup option:

ping@trinity:/virtualization/images/vmx_20151102.0$ sudo ./vmx.sh --cleanup
==================================================
    Welcome to VMX
==================================================
Date..............................................11/24/15 21:25:41
VMX Identifier....................................vmx1
Config file......................................./virtualization/images/vmx_20151102.0/config/vmx.conf
Build Directory.................................../virtualization/images/vmx_20151102.0/build/vmx1
Environment file................................../virtualization/images/vmx_20151102.0/env/ubuntu_sriov.env
Junos Device Type.................................sriov
Initialize scripts................................[OK]
==================================================
    VMX Environment Setup Completed
==================================================
==================================================
    VMX Stop & Cleanup
==================================================
Check if vMX is running...........................[Yes]
Check for VM vcp-vmx1.............................[Running]
Shutdown vcp-vmx1.................................[OK]
Check for VM vfp-vmx1.............................[Running]
Shutdown vfp-vmx1.................................[OK]
Cleanup VM states.................................[OK]
Check if bridge br-ext exists.....................[Yes]
Get Configured Management Interface...............em1
Find existing management gateway..................br-ext
Mgmt interface needs reconfiguration..............[Yes]
Gateway interface needs change....................[Yes]
Check if br-ext has valid IP address..............[Yes]
Get Management Address............................10.85.4.17
Get Management Mask...............................255.255.255.128
Get Management Gateway............................10.85.4.1
Del em1 from br-ext...............................[OK]
Configure em1.....................................[Yes]
Cleanup VM bridge br-ext..........................[OK]
Cleanup VM bridge br-int-vmx1.....................[OK]
Cleanup IXGBE drivers.............................[OK]
==================================================
    VMX Stop Completed
==================================================
Cleanup auto-generated files......................[OK]
==================================================
    VMX Cleanup Completed
==================================================
Log file........................................../dev/null
==================================================
    Thankyou for using VMX
==================================================

If a different vmx config file was specified when installing VMX, specify the same file as a parameter of the script to clean up the VMX.

ping@trinity:/virtualization/images/vmx_20151102.0$ sudo ./vmx.sh --install --cfg config/vmx.conf.sriov.1
ping@trinity:/virtualization/images/vmx_20151102.0$ sudo ./vmx.sh --cleanup --cfg config/vmx.conf.sriov.1
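
After a cleanup it is worth confirming that the guest VMs and the VMX bridges are really gone (a quick sketch):

    sudo virsh list --all      # vcp-vmx1/vfp-vmx1 should no longer show up as running
    brctl show                 # br-ext and br-int-vmx1 should be removed, em1 back to a plain interface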

3. install VMX using installation script (VIRTIO)

3.1. modify the vmx.conf (virtio)

Most configuration parameters in the vmx.conf config file used to set up the VIRTIO version of VMX are the same as in the one used to set up the SRIOV VMX shown above; a few differences are:

  • for virtio virtualization, only MAC address can be defined in this config file

  • the MTU can be configured in a separate junosdev.conf config file

  • virtio generates separate virtual NICs that are used to build VMX; all L1 properties remain with the physical NIC. therefore there is no need to specify L1 properties like physical NIC, VF, link speed, etc. for a VIRTIO setup.

  • To communicate with external networks, a virtio virtual NIC can be "bound" to a physical NIC port, or to another virtual NIC, using any existing technology like a linux bridge, OVS, etc.

 1ping@trinity:/virtualization/images/vmx_20151102.0/config$ cat vmx.conf.virtio.1
 2##############################################################
 3#
 4#  vmx.conf
 5#  Config file for vmx on the hypervisor.
 6#  Uses YAML syntax.
 7#  Leave a space after ":" to specify the parameter value.
 8#
 9##############################################################
10
11---
12#Configuration on the host side - management interface, VM images etc.
13HOST:
14    identifier                : vmx1   # Maximum 4 characters
15    host-management-interface : em1
16    routing-engine-image      : "/virtualization/images/vmx_20151102.0/images/jinstall64-vmx-15.1F-20151104.0-domestic.img"
17    routing-engine-hdd        : "/virtualization/images/vmx_20151102.0/vmx1/vmxhdd.img"
18    forwarding-engine-image   : "/virtualization/images/vmx_20151102.0/images/vFPC-20151102.img"
19
20---
21#External bridge configuration
22BRIDGES:
23    - type  : external
24      name  : br-ext                  # Max 10 characters
25
26---
27#vRE VM parameters
28CONTROL_PLANE:
29    vcpus       : 1
30    memory-mb   : 2048
31    console_port: 8816
32
33    interfaces  :
34      - type      : static
35        ipaddr    : 10.85.4.105
36        macaddr   : "02:04:17:01:01:01"
37        #macaddr   : "0A:00:DD:C0:DE:0E"
38---
39#vPFE VM parameters
40FORWARDING_PLANE:
41    memory-mb   : 4096
42    vcpus       : 3
43    console_port: 8817
44    device-type : virtio
45
46    interfaces  :
47      - type      : static
48        ipaddr    : 10.85.4.106
49        macaddr   : "02:04:17:01:01:02"
50        #macaddr   : "0A:00:DD:C0:DE:10"
51
52---
53#Interfaces
54JUNOS_DEVICES:
55    - interface            : ge-0/0/0
56      mac-address          : "02:04:17:01:02:01"
57      description          : "ge-0/0/0 connects to eth6"
58
59    - interface            : ge-0/0/1
60      mac-address          : "02:04:17:01:02:02"
61      description          : "ge-0/0/1 connects to eth7"
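
To see exactly what differs from the SR-IOV config shown earlier, a plain diff of the two files is enough (a sketch, assuming both files sit under config/ as in my setup):

    cd /virtualization/images/vmx_20151102.0
    diff config/vmx.conf config/vmx.conf.virtio.1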

3.2. run the installation script: virtio

After modifying the vmx config file, we can run the same vmx.sh script to set up the VMX. This time we use another option, --cfg, to tell the script where the configuration file can be found. This is necessary if we defined a separate config file other than the default config/vmx.conf.

  1ping@trinity:/virtualization/images/vmx_20151102.0$ sudo ./vmx.sh --install --cfg config/vmx.conf.virtio.1
  2==================================================
  3    Welcome to VMX
  4==================================================
  5Date..............................................11/30/15 21:40:16
  6VMX Identifier....................................vmx1
  7Config file.......................................
  8    /virtualization/images/vmx_20151102.0/config/vmx.conf.virtio.1
  9Build Directory.................................../virtualization/images/vmx_20151102.0/build/vmx1
 10Environment file................................../virtualization/images/vmx_20151102.0/env/ubuntu_virtio.env
 11Junos Device Type.................................virtio
 12Initialize scripts................................[OK]
 13Copy images to build directory....................[OK]
 14==================================================
 15    VMX Environment Setup Completed
 16==================================================
 17==================================================
 18    VMX Install & Start
 19==================================================
 20Linux distribution................................ubuntu
 21Check GRUB........................................[Disabled]
 22Installation status of qemu-kvm...................[OK]
 23Installation status of libvirt-bin................[OK]
 24Installation status of bridge-utils...............[OK]
 25Installation status of python.....................[OK]
 26Installation status of libyaml-dev................[OK]
 27Installation status of python-yaml................[OK]
 28Installation status of numactl....................[OK]
 29Installation status of libnuma-dev................[OK]
 30Installation status of libparted0-dev.............[OK]
 31Installation status of libpciaccess-dev...........[OK]
 32Installation status of libyajl-dev................[OK]
 33Installation status of libxml2-dev................[OK]
 34Installation status of libglib2.0-dev.............[OK]
 35Installation status of libnl-dev..................[OK]
 36Check Kernel Version..............................[Disabled]
 37Check Qemu Version................................[Disabled]
 38Check libvirt Version.............................[Disabled]
 39Check virsh connectivity..........................[OK]
 40IXGBE Enabled.....................................[Disabled]
 41==================================================
 42    Pre-Install Checks Completed
 43==================================================
 44Check for VM vcp-vmx1.............................[Not Running]
 45Check for VM vfp-vmx1.............................[Not Running]
 46Cleanup VM states.................................[OK]
 47Check if bridge br-ext exists.....................[No]
 48Cleanup VM bridge br-ext..........................[OK]
 49Cleanup VM bridge br-int-vmx1.....................[OK]
 50==================================================
 51    VMX Stop Completed
 52==================================================
 53Check VCP image...................................[OK]
 54Check VFP image...................................[OK]
 55VMX Model.........................................FPC
 56Check VCP Config image............................[OK]
 57Check management interface........................[OK]
 58Setup huge pages to 32768.........................[OK]
 59Attempt to kill libvirt...........................[OK]
 60Attempt to start libvirt..........................[OK]
 61Sleep 2 secs......................................[OK]
 62Check libvirt support for hugepages...............[OK]
 63==================================================
 64    System Setup Completed
 65==================================================
 66Get Management Address of em1.....................[OK]
 67Generate libvirt files............................[OK]
 68Sleep 2 secs......................................[OK]
 69Find configured management interface..............em1
 70Find existing management gateway..................em1
 71Check if em1 is already enslaved to br-ext........[No]
 72Gateway interface needs change....................[Yes]
 73Create br-ext.....................................[OK]
 74Get Management Gateway............................10.85.4.1
 75Flush em1.........................................[OK]
 76Start br-ext......................................[OK]
 77Bind em1 to br-ext................................[OK]
 78Get Management MAC................................38:ea:a7:37:7c:54
 79Assign Management MAC 38:ea:a7:37:7c:54...........[OK]
 80Add default gw 10.85.4.1..........................[OK]
 81Create br-int-vmx1................................[OK]
 82Start br-int-vmx1.................................[OK]
 83Check and start default bridge....................[OK]
 84Define vcp-vmx1...................................[OK]
 85Define vfp-vmx1...................................[OK]
 86Wait 2 secs.......................................[OK]
 87Start vcp-vmx1....................................[OK]
 88Start vfp-vmx1....................................[OK]
 89Wait 2 secs.......................................[OK]
 90==================================================
 91    VMX Bringup Completed
 92==================================================
 93Check if br-ext is created........................[Created]
 94Check if br-int-vmx1 is created...................[Created]
 95Check if VM vcp-vmx1 is running...................[Running]
 96Check if VM vfp-vmx1 is running...................[Running]
 97Check if tap interface vcp_ext-vmx1 exists........[OK]
 98Check if tap interface vcp_int-vmx1 exists........[OK]
 99Check if tap interface vfp_ext-vmx1 exists........[OK]
100Check if tap interface vfp_int-vmx1 exists........[OK]
101==================================================
102    VMX Status Verification Completed.
103==================================================
104Log file........................................../dev/null
105==================================================
106    Thankyou for using VMX
107==================================================
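
To confirm that the generated guest definition really uses virtio NICs, the libvirt XML of the VFP VM can be inspected directly (a sketch; the exact element layout depends on the VMX release):

    sudo virsh dumpxml vfp-vmx1 | grep -B1 -A2 "model type='virtio'"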

3.3. uninstall (cleanup) VMX with installation script

To uninstall VMX, just run the same script with the --cleanup option. Again we use the --cfg option to specify the vmx.conf file which we used to set up the VMX:

 1ping@trinity:/virtualization/images/vmx_20151102.0$ sudo ./vmx.sh --cleanup --cfg config/vmx.conf.virtio.1
 2==================================================
 3    Welcome to VMX
 4==================================================
 5Date..............................................11/30/15 21:14:08
 6VMX Identifier....................................vmx1
 7Config file.......................................
 8    /virtualization/images/vmx_20151102.0/config/vmx.conf.virtio.1
 9Build Directory.................................../virtualization/images/vmx_20151102.0/build/vmx1
10Environment file................................../virtualization/images/vmx_20151102.0/env/ubuntu_virtio.env
11Junos Device Type.................................virtio
12Initialize scripts................................[OK]
13==================================================
14    VMX Environment Setup Completed
15==================================================
16==================================================
17    VMX Stop & Cleanup
18==================================================
19Check if vMX is running...........................[Yes]
20Check for VM vcp-vmx1.............................[Running]
21Shutdown vcp-vmx1.................................[OK]
22Check for VM vfp-vmx1.............................[Running]
23Shutdown vfp-vmx1.................................[OK]
24Cleanup VM states.................................[OK]
25Check if bridge br-ext exists.....................[Yes]
26Get Configured Management Interface...............em1
27Find existing management gateway..................br-ext
28Mgmt interface needs reconfiguration..............[Yes]
29Gateway interface needs change....................[Yes]
30Check if br-ext has valid IP address..............[Yes]
31Get Management Address............................10.85.4.17
32Get Management Mask...............................255.255.255.128
33Get Management Gateway............................10.85.4.1
34Del em1 from br-ext...............................[OK]
35Configure em1.....................................[Yes]
36Cleanup VM bridge br-ext..........................[OK]
37Cleanup VM bridge br-int-vmx1.....................[OK]
38==================================================
39    VMX Stop Completed
40==================================================
41Cleanup auto-generated files......................[OK]
42==================================================
43    VMX Cleanup Completed
44==================================================
45Log file........................................../dev/null
46==================================================
47    Thankyou for using VMX
48==================================================

4. setup multiple VMX instances

Building multiple VMX instances on a server is not much different from building a single-instance VMX - just running the installation script multiple times, with a different config file each time, is almost sufficient.

Depending on the release, the installation script may or may not have issues building multiple instances on one server. There is always work in progress to improve the script (based on feedback). In case it doesn’t work well (e.g. bailing out in the middle), you can always install VMX manually. In this example we use virtio to demonstrate the process; SRIOV follows exactly the same process, and the only difference is that more physical NICs or VFs are needed.

In this section we demonstrate how to set up 2 VIRTIO-based VMX instances; setting up multiple SRIOV-based VMX instances is the same process.

4.1. config file for the first VMX instance

 1ping@trinity:/virtualization/images/vmx_20151102.0$ cat config/vmx.conf.virtio.1
 2##############################################################
 3#
 4#  vmx.conf
 5#  Config file for vmx on the hypervisor.
 6#  Uses YAML syntax.
 7#  Leave a space after ":" to specify the parameter value.
 8#
 9##############################################################
10
11---
12#Configuration on the host side - management interface, VM images etc.
13HOST:
14    identifier                : vmx1   # Maximum 4 characters
15    host-management-interface : em1
16    routing-engine-image      : "/virtualization/images/vmx_20151102.0/images/jinstall64-vmx-15.1F-20151104.0-domestic.img"
17    routing-engine-hdd        : "/virtualization/images/vmx_20151102.0/images/vmxhdd.img"
18    forwarding-engine-image   : "/virtualization/images/vmx_20151102.0/images/vFPC-20151102.img"
19
20---
21#External bridge configuration
22BRIDGES:
23    - type  : external
24      name  : br-ext                  # Max 10 characters
25
26---
27#vRE VM parameters
28CONTROL_PLANE:
29    vcpus       : 1
30    memory-mb   : 2048
31    console_port: 8816
32
33    interfaces  :
34      - type      : static
35        ipaddr    : 10.85.4.105
36        macaddr   : "02:04:17:01:01:01"
37        #macaddr   : "0A:00:DD:C0:DE:0E"
38---
39#vPFE VM parameters
40FORWARDING_PLANE:
41    memory-mb   : 4096
42    vcpus       : 3
43    console_port: 8817
44    device-type : virtio
45
46    interfaces  :
47      - type      : static
48        ipaddr    : 10.85.4.106
49        macaddr   : "02:04:17:01:01:02"
50        #macaddr   : "0A:00:DD:C0:DE:10"
51
52---
53#Interfaces
54JUNOS_DEVICES:
55    - interface            : ge-0/0/0
56      mac-address          : "02:04:17:01:02:01"
57      description          : "ge-0/0/0 connects to eth6"
58
59    - interface            : ge-0/0/1
60      mac-address          : "02:04:17:01:02:02"
61      description          : "ge-0/0/1 connects to eth7"

4.2. config file for the second VMX instance

The following parameters need to be modified before running the script again - this is to avoid conflicts with the first instance (a scripted way to derive the second config file is sketched after this list).

  • HOST section: identifier - the script will create a folder based on this string, and the VM "domain name" will also be based on this value.

  • console_port in both the CONTROL_PLANE and FORWARDING_PLANE sections

  • macaddr in both the CONTROL_PLANE and FORWARDING_PLANE sections: these are for the VM mgmt ports

  • JUNOS_DEVICES section: settings for the Junos ge-0/0/x interface [1]:

    • for VIRTIO-based VMX:

      1. interface

      2. mac-address:

    • for SRIOV-based VMX, these extra settings are also needed to be configured:

      1. physical NIC name: nic

      2. virtual-function

      3. port-speed-mbps

      4. mtu
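
Rather than editing the second file by hand, it can be derived from the first with a few substitutions. Below is a sketch that matches my numbering scheme (review the generated file before using it):

    cd /virtualization/images/vmx_20151102.0
    sed -e 's/vmx1/vmx2/' \
        -e 's/8816/8826/' -e 's/8817/8827/' \
        -e 's/10\.85\.4\.105/10.85.4.107/' -e 's/10\.85\.4\.106/10.85.4.108/' \
        -e 's/02:04:17:01:/02:04:17:02:/' \
        config/vmx.conf.virtio.1 > config/vmx.conf.virtio.2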

 1ping@trinity:/virtualization/images/vmx_20151102.0$ cat config/vmx.conf.virtio.2
 2##############################################################
 3#
 4#  vmx.conf
 5#  Config file for vmx on the hypervisor.
 6#  Uses YAML syntax.
 7#  Leave a space after ":" to specify the parameter value.
 8#
 9##############################################################
10
11---
12#Configuration on the host side - management interface, VM images etc.
13HOST:
14    identifier                : vmx2   # Maximum 4 characters
15    host-management-interface : em1
16    routing-engine-image      : "/virtualization/images/vmx_20151102.0/images/jinstall64-vmx-15.1F-20151104.0-domestic.img"
17    routing-engine-hdd        : "/virtualization/images/vmx_20151102.0/images/vmxhdd.img"
18    forwarding-engine-image   : "/virtualization/images/vmx_20151102.0/images/vFPC-20151102.img"
19
20---
21#External bridge configuration
22BRIDGES:
23    - type  : external
24      name  : br-ext                  # Max 10 characters
25
26---
27#vRE VM parameters
28CONTROL_PLANE:
29    vcpus       : 1
30    memory-mb   : 2048
31    console_port: 8826
32
33    interfaces  :
34      - type      : static
35        ipaddr    : 10.85.4.107
36        macaddr   : "02:04:17:02:01:01"
37        #macaddr   : "0A:00:DD:C0:DE:0E"
38---
39#vPFE VM parameters
40FORWARDING_PLANE:
41    memory-mb   : 4096
42    vcpus       : 3
43    console_port: 8827
44    device-type : virtio
45
46    interfaces  :
47      - type      : static
48        ipaddr    : 10.85.4.108
49        macaddr   : "02:04:17:02:01:02"
50        #macaddr   : "0A:00:DD:C0:DE:10"
51
52---
53#Interfaces
54JUNOS_DEVICES:
55    - interface            : ge-0/0/0
56      mac-address          : "02:04:17:02:02:01"
57      description          : "ge-0/0/0 connects to eth6"
58
59    - interface            : ge-0/0/1
60      mac-address          : "02:04:17:02:02:02"
61      description          : "ge-0/0/1 connects to eth7"

4.3. run installation script with --cfg option

After the two config files are ready, run vmx.sh to bring them up:

./vmx.sh -lvf --install --cfg config/vmx.conf.virtio.1
./vmx.sh -lvf --install --cfg config/vmx.conf.virtio.2

4.4. verify the 2 running VMX

list the 2 running VMX instances: vmx1 and vmx2, each containing a vcp and a vfp VM.

ping@trinity:/virtualization/images/vmx_20151102.0$ sudo virsh list
 Id    Name                           State
----------------------------------------------------
 2     vcp-vmx1                       running
 3     vfp-vmx1                       running
 5     vcp-vmx2                       running
 6     vfp-vmx2                       running

4.4.1. ifconfig (virtio)

Here is a dump of all interfaces after the two VMX instances are up and running:

ping@trinity:/virtualization/images/vmx_20151102.0$ ifconfig -a
br-ext    Link encap:Ethernet  HWaddr 38:ea:a7:37:7c:54               \
          inet addr:10.85.4.17  Bcast:10.85.4.127  Mask:255.255.255.128\
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1            |
          RX packets:26432 errors:0 dropped:0 overruns:0 frame:0        |
          TX packets:10158 errors:0 dropped:0 overruns:0 carrier:0      |
          collisions:0 txqueuelen:0                                     |
          RX bytes:2025742 (2.0 MB)  TX bytes:1184591 (1.1 MB)          |
                                                                        |
br-ext-nic Link encap:Ethernet  HWaddr 52:54:00:9f:a0:77                |
          BROADCAST MULTICAST  MTU:1500  Metric:1                       |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0            |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0          |
          collisions:0 txqueuelen:500                                   |
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                        |
                                                                        |(1)
vcp_ext-vmx1 Link encap:Ethernet  HWaddr fe:04:17:01:01:01              |
          inet6 addr: fe80::fc04:17ff:fe01:101/64 Scope:Link            |
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1            |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0            |
          TX packets:19408 errors:0 dropped:0 overruns:0 carrier:0      |
          collisions:0 txqueuelen:500                                   |
          RX bytes:0 (0.0 B)  TX bytes:1826586 (1.8 MB)                 |
                                                                        |
vcp_ext-vmx2 Link encap:Ethernet  HWaddr fe:04:17:02:01:01              |
          inet6 addr: fe80::fc04:17ff:fe02:101/64 Scope:Link            |
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1            |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0            |
          TX packets:19248 errors:0 dropped:0 overruns:0 carrier:0      |
          collisions:0 txqueuelen:500                                   |
          RX bytes:0 (0.0 B)  TX bytes:1812426 (1.8 MB)                 |
                                                                        |
vfp_ext-vmx1 Link encap:Ethernet  HWaddr fe:04:17:01:01:02              |
          inet6 addr: fe80::fc04:17ff:fe01:102/64 Scope:Link            |
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1            |
          RX packets:18 errors:0 dropped:0 overruns:0 frame:0           |
          TX packets:19409 errors:0 dropped:0 overruns:0 carrier:0      |
          collisions:0 txqueuelen:500                                   |
          RX bytes:2840 (2.8 KB)  TX bytes:1827538 (1.8 MB)             |
                                                                        |
vfp_ext-vmx2 Link encap:Ethernet  HWaddr fe:04:17:02:01:02              |
          inet6 addr: fe80::fc04:17ff:fe02:102/64 Scope:Link            |
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1            |
          RX packets:118 errors:0 dropped:0 overruns:0 frame:0          |
          TX packets:19129 errors:0 dropped:0 overruns:0 carrier:0      |
          collisions:0 txqueuelen:500                                  /
          RX bytes:38268 (38.2 KB)  TX bytes:1774106 (1.7 MB)         /

br-int-vmx1 Link encap:Ethernet  HWaddr 52:54:00:56:2f:2b             \
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1           \
          RX packets:12823 errors:0 dropped:12587 overruns:0 frame:0    |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0          |
          collisions:0 txqueuelen:0                                     |
          RX bytes:949948 (949.9 KB)  TX bytes:0 (0.0 B)                |
                                                                        |
br-int-vmx1-nic Link encap:Ethernet  HWaddr 52:54:00:56:2f:2b           |
          BROADCAST MULTICAST  MTU:1500  Metric:1                       |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0            |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0          |
          collisions:0 txqueuelen:500                                   |
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                        |
                                                                        | (2)
vcp_int-vmx1 Link encap:Ethernet  HWaddr fe:54:00:d6:6e:85              |
          inet6 addr: fe80::fc54:ff:fed6:6e85/64 Scope:Link             |
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1            |
          RX packets:109078 errors:0 dropped:0 overruns:0 frame:0       |
          TX packets:112829 errors:0 dropped:0 overruns:0 carrier:0     |
          collisions:0 txqueuelen:500                                   |
          RX bytes:6930175 (6.9 MB)  TX bytes:10139042 (10.1 MB)        |
                                                                        |
vfp_int-vmx1 Link encap:Ethernet  HWaddr fe:54:00:f5:4e:ee              |
          inet6 addr: fe80::fc54:ff:fef5:4eee/64 Scope:Link             |
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1            |
          RX packets:108951 errors:0 dropped:0 overruns:0 frame:0       |
          TX packets:112338 errors:0 dropped:0 overruns:0 carrier:0     |
          collisions:0 txqueuelen:500                                  /
          RX bytes:9928502 (9.9 MB)  TX bytes:7099927 (7.0 MB)        /

br-int-vmx2 Link encap:Ethernet  HWaddr 52:54:00:14:94:e5             \
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1           \
          RX packets:12638 errors:0 dropped:12463 overruns:0 frame:0    |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0          |
          collisions:0 txqueuelen:0                                     |
          RX bytes:938940 (938.9 KB)  TX bytes:0 (0.0 B)                |
                                                                        |
br-int-vmx2-nic Link encap:Ethernet  HWaddr 52:54:00:14:94:e5           |
          BROADCAST MULTICAST  MTU:1500  Metric:1                       |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0            |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0          |
          collisions:0 txqueuelen:500                                   |
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                        |(3)
                                                                        |
vcp_int-vmx2 Link encap:Ethernet  HWaddr fe:54:00:46:01:02              |
          inet6 addr: fe80::fc54:ff:fe46:102/64 Scope:Link              |
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1            |
          RX packets:107951 errors:0 dropped:0 overruns:0 frame:0       |
          TX packets:111724 errors:0 dropped:0 overruns:0 carrier:0     |
          collisions:0 txqueuelen:500                                   |
          RX bytes:6866493 (6.8 MB)  TX bytes:10053803 (10.0 MB)        |
                                                                        |
vfp_int-vmx2 Link encap:Ethernet  HWaddr fe:54:00:b9:67:1d              |
          inet6 addr: fe80::fc54:ff:feb9:671d/64 Scope:Link             |
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1            |
          RX packets:107878 errors:0 dropped:0 overruns:0 frame:0       |
          TX packets:111185 errors:0 dropped:0 overruns:0 carrier:0     |
          collisions:0 txqueuelen:500                                  /
          RX bytes:9845011 (9.8 MB)  TX bytes:7034893 (7.0 MB)        /

em1       Link encap:Ethernet  HWaddr 38:ea:a7:37:7c:54              \
          inet6 addr: fe80::3aea:a7ff:fe37:7c54/64 Scope:Link         \
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1           |
          RX packets:1642746 errors:0 dropped:606 overruns:0 frame:0   |
          TX packets:15147394 errors:0 dropped:0 overruns:0 carrier:0  |
          collisions:0 txqueuelen:1000                                 |
          RX bytes:139306967 (139.3 MB)  TX bytes:21847662292 (21.8 GB)|
                                                                       |
em2       Link encap:Ethernet  HWaddr 38:ea:a7:37:7c:55                |
          BROADCAST MULTICAST  MTU:1500  Metric:1                      |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0           |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0         |
          collisions:0 txqueuelen:1000                                 |
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                       |
                                                                       |
em9       Link encap:Ethernet  HWaddr 38:ea:a7:37:7b:d0                |
          BROADCAST MULTICAST  MTU:1500  Metric:1                      |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0           |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0         |
          collisions:0 txqueuelen:1000                                 |
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                       |
                                                                       |
em10      Link encap:Ethernet  HWaddr 38:ea:a7:37:7b:d1                |
          BROADCAST MULTICAST  MTU:1500  Metric:1                      | (4)
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0           |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0         |
          collisions:0 txqueuelen:1000                                 |
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                       |
                                                                       |
p2p1      Link encap:Ethernet  HWaddr 38:ea:a7:17:65:a0                |
          inet6 addr: fe80::3aea:a7ff:fe17:65a0/64 Scope:Link          |
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1           |
          RX packets:5 errors:0 dropped:0 overruns:0 frame:0           |
          TX packets:8 errors:0 dropped:0 overruns:0 carrier:0         |
          collisions:0 txqueuelen:1000                                 |
          RX bytes:300 (300.0 B)  TX bytes:648 (648.0 B)               |
                                                                       |
p2p2      Link encap:Ethernet  HWaddr 38:ea:a7:17:65:a1                |
          BROADCAST MULTICAST  MTU:1500  Metric:1                      |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0           |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0         |
          collisions:0 txqueuelen:1000                                 |
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                       |
                                                                       |
p3p1      Link encap:Ethernet  HWaddr 38:ea:a7:17:65:84                |
          BROADCAST MULTICAST  MTU:1500  Metric:1                      |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0           |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0         |
          collisions:0 txqueuelen:1000                                 |
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                       |
                                                                       |
p3p2      Link encap:Ethernet  HWaddr 38:ea:a7:17:65:85                |
          BROADCAST MULTICAST  MTU:1500  Metric:1                      |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0           |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0         |
          collisions:0 txqueuelen:1000                                /
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                     /

ge-0.0.0-vmx1 Link encap:Ethernet  HWaddr fe:04:17:01:02:01    \
          inet6 addr: fe80::fc04:17ff:fe01:201/64 Scope:Link    \
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1     |
          RX packets:4 errors:0 dropped:0 overruns:0 frame:0     |
          TX packets:3263 errors:0 dropped:0 overruns:0 carrier:0|
          collisions:0 txqueuelen:500                            |
          RX bytes:280 (280.0 B)  TX bytes:169990 (169.9 KB)     |
                                                                 |
ge-0.0.0-vmx2 Link encap:Ethernet  HWaddr fe:04:17:02:02:01      |
          inet6 addr: fe80::fc04:17ff:fe02:201/64 Scope:Link     |
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1     |
          RX packets:3 errors:0 dropped:0 overruns:0 frame:0     |
          TX packets:3238 errors:0 dropped:0 overruns:0 carrier:0\
          collisions:0 txqueuelen:500                             X (5)
          RX bytes:238 (238.0 B)  TX bytes:168680 (168.6 KB)     /
                                                                 |
ge-0.0.1-vmx1 Link encap:Ethernet  HWaddr fe:04:17:01:02:02      |
          inet6 addr: fe80::fc04:17ff:fe01:202/64 Scope:Link     |
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1     |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0     |
          TX packets:3262 errors:0 dropped:0 overruns:0 carrier:0|
          collisions:0 txqueuelen:500                            |
          RX bytes:0 (0.0 B)  TX bytes:169836 (169.8 KB)         |
                                                                 |
ge-0.0.1-vmx2 Link encap:Ethernet  HWaddr fe:04:17:02:02:02      |
          inet6 addr: fe80::fc04:17ff:fe02:202/64 Scope:Link     |
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1     |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0     |
          TX packets:3236 errors:0 dropped:0 overruns:0 carrier:0|
          collisions:0 txqueuelen:500                            |
          RX bytes:0 (0.0 B)  TX bytes:168484 (168.4 KB)        /

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:424272 errors:0 dropped:0 overruns:0 frame:0
          TX packets:424272 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:22393676 (22.3 MB)  TX bytes:22393676 (22.3 MB)

virbr0    Link encap:Ethernet  HWaddr fe:04:17:01:02:01
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:30 errors:0 dropped:0 overruns:0 frame:0
          TX packets:10 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2751 (2.7 KB)  TX bytes:812 (812.0 B)
1 external bridge
2 internal bridge for vmx1
3 internal bridge for vmx2
4 physical interfaces
5 virtio tap interfaces, peer interfaces [1] of JUNOS ge-0/0/x interfaces in VFP VM
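
Each vcp_*/vfp_*/ge-* tap above is the host-side peer of a guest NIC; virsh can show the per-VM mapping directly (a quick sketch):

    sudo virsh domiflist vcp-vmx1    # lists vcp_ext-vmx1 / vcp_int-vmx1 with their bridges and MACs
    sudo virsh domiflist vfp-vmx1    # lists vfp_ext-vmx1 / vfp_int-vmx1 and the ge-0.0.x-vmx1 taps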

4.4.2. virtio bridging: linux bridge

virtio does not define anything special for interface bridging. Any open technique (linux bridge, OVS, etc.) can be used for this purpose.

A separate config file, vmx-junosdev.conf, is used to automate virtio interface bridging using the native linux bridge utility.

In this example, I’ll define a new bridge named bridge1 and put the following interfaces into that same bridge:

  • ge-0/0/0 in vmx1

  • ge-0/0/0 in vmx2

  • p2p1

The vmx-junosdev.conf file:

 1ping@trinity:/virtualization/images/vmx_20151102.0$ cat config/vmx-junosdev.conf.1
 2##############################################################
 3#
 4#  vmx-junos-dev.conf
 5#  - Config file for junos device bindings.
 6#  - Uses YAML syntax.
 7#  - Leave a space after ":" to specify the parameter value.
 8#  - For physical NIC, set the 'type' as 'host_dev'
 9#  - For junos devices, set the 'type' as 'junos_dev' and
10#    set the mandatory parameter 'vm-name' to the name of
11#    the vPFE where the device exists
12#  - For bridge devices, set the 'type' as 'bridge_dev'
13#
14##############################################################
15interfaces :
16
17     - link_name  : vmx_bridge1      \
18       mtu        : 1500              |
19       endpoint_1 :                   |
20         - type        : junos_dev    \
21           vm_name     : vmx1          X  (1)
22           dev_name    : ge-0/0/0     /
23       endpoint_2 :                   |
24         - type        : bridge_dev   |
25           dev_name    : bridge1     /
26
27     - link_name  : vmx_bridge1      \
28       mtu        : 1500              |
29       endpoint_1 :                   |
30         - type        : junos_dev    \
31           vm_name     : vmx2          X  (2)
32           dev_name    : ge-0/0/0     /
33       endpoint_2 :                   |
34         - type        : bridge_dev   |
35           dev_name    : bridge1     /
36
37     - link_name  : vmx_bridge1      \
38       mtu        : 1500              |
39       endpoint_1 :                   |
40         - type        : host_dev     \
41           dev_name    : p2p1          X  (3)
42       endpoint_2 :                   /
43         - type        : bridge_dev   |
44           dev_name    : bridge1     /
1 put the Junos interface (type junos_dev) ge-0/0/0 of the first VMX instance vmx1 into linux bridge bridge1
2 put the Junos interface (type junos_dev) ge-0/0/0 of the second VMX instance vmx2 into the same linux bridge bridge1
3 put the host server interface (type host_dev) p2p1 into the same linux bridge bridge1
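
What the --bind-dev run (shown below) does for this file is essentially plain brctl work. A manual equivalent would look roughly like this - a sketch of the same bindings, using the tap names from the ifconfig output above; the taps first have to leave virbr0 since a port can only belong to one bridge:

    sudo brctl addbr bridge1                  # create the new bridge
    sudo brctl delif virbr0 ge-0.0.0-vmx1     # taps start out under virbr0 (see the "before" output below)
    sudo brctl delif virbr0 ge-0.0.0-vmx2
    sudo brctl addif bridge1 ge-0.0.0-vmx1
    sudo brctl addif bridge1 ge-0.0.0-vmx2
    sudo brctl addif bridge1 p2p1
    sudo ip link set bridge1 up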
bridging before binding:

Before binding interfaces to bridge1, all four virtio tap interfaces were under the virbr0 interface, which represents the default libvirt network.

ping@trinity:/virtualization/images/vmx_20151102.0$ brctl show

bridge name     bridge id               STP enabled     interfaces

br-ext          8000.38eaa7377c54       yes             br-ext-nic
                                                        em1
                                                        vcp_ext-vmx1
                                                        vcp_ext-vmx2
                                                        vfp_ext-vmx1
                                                        vfp_ext-vmx2

br-int-vmx1             8000.525400562f2b       yes     br-int-vmx1-nic
                                                        vcp_int-vmx1
                                                        vfp_int-vmx1

br-int-vmx2             8000.5254001494e5       yes     br-int-vmx2-nic
                                                        vcp_int-vmx2
                                                        vfp_int-vmx2

virbr0          8000.fe0417010201       yes             ge-0.0.0-vmx1
                                                        ge-0.0.0-vmx2
                                                        ge-0.0.1-vmx1
                                                        ge-0.0.1-vmx2
the bridges diagram

The current bridging structure in this server is illustrated below:

 vcp-vmx1                   br-int-vmx1           br-int-vmx2                   vcp-vmx2
+--------+                         |                 |                          +--------+
|        |                         |                 |                          |        |
|     em1+-------------------------+                 +--------------------------+em1     |
|        |             vcp_int_vmx1|                 |vcp_int_vmx2              |        |
| fxp0   |                         |                 |                          | fxp0   |
+---+----+                         |                 |                          +---+----+
    |               vfp-vmx1       |                 |      vfp-vmx2                |
    |              +--------+      |                 |      +--------+              |
    |              |        | vfp_int-vmx1      vfp-int-vmx2|        |              |
    |              |     int+------+        +-+      +------+int     |              |
    |              |        |               | |             |        |              |
    |              |ext     +---------------+ +-------------+     ext|              |
    |              ++-----+-+ge-0.0.0-vmx1  | |ge-0.0.0-vmx2+-+-----++              |
    |               |     |                 | |               |     |               |
    |               |     +-----------------+ +---------------+     |               |
    |               |   ge-0.0.1-vmx1       | |  ge-0.0.1-vmx2      |               |
    |               |                       +-+                     |               |
    |               |                     virbr0                    |               |
    |               |                                               |               |
    |               |                                               |               |
    |               |                                               |               |
    |vcp_ext-vmx1   |vfp_ext-vmx1                       vfp_ext-vmx2|   vcp_ext-vmx2|
+---+---------------+-----------------------------------------------+---------------+----+
|                                         br-ext    br-ext-nic                           |
+-------------------------------------------------+--------------------------------------+
                                                  |
                                                  |em1

Now we run the vmx.sh script with the --bind-dev and --cfg options:

ping@trinity:/virtualization/images/vmx_20151102.0$ sudo ./vmx.sh --bind-dev --cfg config/vmx-junosdev.conf.1
Checking package ethtool..........................[OK]
Bind Bridge port bridge1(ge-0.0.0-vmx1)...........[OK]
Bind Bridge port bridge1(ge-0.0.0-vmx2)...........[OK]
Bind Bridge port bridge1(p2p1)....................[OK]
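If the binding ever needs to be undone (e.g. to rebuild the bridge differently), the same result can be reverted by hand with standard bridge-utils/iproute2 commands - a minimal sketch, assuming the interface names shown above:

# detach the ports from bridge1 and move the taps back under the default libvirt bridge
sudo brctl delif bridge1 ge-0.0.0-vmx1
sudo brctl delif bridge1 ge-0.0.0-vmx2
sudo brctl delif bridge1 p2p1
sudo brctl addif virbr0 ge-0.0.0-vmx1
sudo brctl addif virbr0 ge-0.0.0-vmx2
# remove the now-empty bridge
sudo ip link set bridge1 down
sudo brctl delbr bridge1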
bridging after binding:

After running the vmx.sh script, the peer interface of the JUNOS interface ge-0/0/0 from each VMX VM (ge-0.0.0-vmx1 and ge-0.0.0-vmx2), along with the physical port p2p1, is now bound to my new bridge bridge1.

ping@trinity:/virtualization/images/vmx_20151102.0$ brctl show
bridge name     bridge id               STP enabled     interfaces
br-ext          8000.38eaa7377c54       yes             br-ext-nic
                                                        em1
                                                        vcp_ext-vmx1
                                                        vcp_ext-vmx2
                                                        vfp_ext-vmx1
                                                        vfp_ext-vmx2

br-int-vmx1             8000.525400562f2b       yes     br-int-vmx1-nic
                                                        vcp_int-vmx1
                                                        vfp_int-vmx1

br-int-vmx2             8000.5254001494e5       yes     br-int-vmx2-nic
                                                        vcp_int-vmx2
                                                        vfp_int-vmx2

bridge1         8000.38eaa71765a0       no              ge-0.0.0-vmx1
                                                        ge-0.0.0-vmx2
                                                        p2p1

virbr0          8000.fe0417010202       yes             ge-0.0.1-vmx1
                                                        ge-0.0.1-vmx2
the bridges diagram

Now the bridging structure has changed, as illustrated in the diagram below: a packet coming from an external device will be bridged to the tap interfaces ge-0.0.0-vmx1 and ge-0.0.0-vmx2 via bridge1, and then goes to the corresponding ge-0/0/0 JUNOS interface in each VMX VM.

 vcp-vmx1                   br-int-vmx1        br-int-vmx2                   vcp-vmx2
+--------+                         +-+           +-+                          +--------+
|        |                         | |           | |                          |        |
|     em1+-------------------------+ |           | +--------------------------+em1     |
|        |             vcp_int_vmx1| |           | |vcp_int_vmx2              |        |
| fxp0   |                         | |           | |                          | fxp0   |
+---+----+                         | |           | |                          +---+----+
    |               vfp-vmx1       | |           | |      vfp-vmx2                |
    |              +--------+      | |           | |      +--------+              |
    |              |        | vfp_int-vmx1    vfp-int-vmx2|        |              |
    |              |     int+------+ |           | +------+int     |              |
    |              |        |      | |    virbr0 | |      |        |              |
    |              |ext     |      +-+     +-+   +-+      |     ext|              |
    |              ++-----+-+\ge-0.0.1-vmx1| |           /+-+-+---++              |
    |               |     |   -------------+ +-----------   |     |               |
    |               |     |                | |              |     |               |
    |               |     |ge-0.0.0-vmx1   +-+ ge-0.0.0-vmx2|     |               |
    |               |    ++---------------------------------+-+   |               |
    |               |    |             bridge1                |   |               |
    |               |    +----------------+-------------------+   |               |
    |               |                     |  p2p1                 |               |
    |vcp_ext-vmx1   |vfp_ext-vmx1                     vfp_ext-vmx2|   vcp_ext-vmx2|
+---+---------------+---------------------------------------------+---------------+----+
|                                       br-ext    br-ext-nic                           |
+-----------------------------------------------+--------------------------------------+
                                                |
                                                |em1

4.4.3. routing test

A quick way to verify the two VMX instances and the bridging structure between them is to run OSPF and a ping reachability test between them.

For that purpose I enabled IP/OSPF on the external layer 3 switch attached to the p2p1 interface of the server where the VMX instances were built.

.............................................
. HP server                                 .
.                                           .
.                               bridge1     .
.    +--------+                 +--+        .
.    |VMX1    |                 |  |        .
.    |ge-0/0/0+-----------------+  |        .
.    |1.1.1.11|    ge-0.0.0-vmx1|  |        .
.    +--------+                 |  |        .   external router
.                               |  |        .   +-------+
.                               |  |p2p1    .   |       |
.                               |  +--------.---+1.1.1.13
.                               |  |        .   |       |
.                               |  |        .   +-------+
.    +--------+                 |  |        .
.    |VMX2    |    ge-0.0.0-vmx2|  |        .
.    |ge-0/0/0+-----------------+  |        .
.    |1.1.1.12|                 |  |        .
.    +--------+                 +--+        .
.                                           .
.                                           .
.............................................
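From the host side, the traffic crossing bridge1 can also be observed directly, which is a handy sanity check before looking at the routers themselves (standard tcpdump/brctl usage, interface names as in my setup):

# watch OSPF hellos and ICMP flowing through the new bridge
sudo tcpdump -ni bridge1 -c 20 'icmp or ip proto ospf'
# list the MAC addresses learned on bridge1's ports
sudo brctl showmacs bridge1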

In my setup, both OSPF adjacency and ping look good from the first VMX instance vmx1. This indicates both instances are running well.

[edit]
root@vmx1# run ping 1.1.1.12
PING 1.1.1.12 (1.1.1.12): 56 data bytes
64 bytes from 1.1.1.12: icmp_seq=0 ttl=64 time=3.081 ms
^C
--- 1.1.1.12 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 3.081/3.081/3.081/0.000 ms

[edit]
root@vmx1# run ping 1.1.1.13
PING 1.1.1.13 (1.1.1.13): 56 data bytes
64 bytes from 1.1.1.13: icmp_seq=0 ttl=64 time=16.046 ms

root@vmx1# run show ospf neighbor
Address          Interface              State     ID               Pri  Dead
1.1.1.13         ge-0/0/0.0             Full      30.0.0.1         128    32
1.1.1.12         ge-0/0/0.0             Full      1.1.1.12         128    32

[edit]
root@vmx1# run show route

inet.0: 3 destinations, 3 routes (3 active, 0 holddown, 0 hidden)
+ = Active Route, - = Last Active, * = Both

1.1.1.0/24         *[Direct/0] 00:55:52
                    > via ge-0/0/0.0
1.1.1.11/32        *[Local/0] 00:55:52
                      Local via ge-0/0/0.0
224.0.0.5/32       *[OSPF/10] 00:02:50, metric 1

5. install VMX manually

The installation script is a handy way to set up VMX, because it automates most of the configuration tasks that would otherwise need to be done manually.

On the other hand, the script does nothing more than prepare the work environment for VMX and run some commands to bring up the two VM images. Therefore, having the script does not stop us from doing all of the installation work manually. One advantage of doing so is that if the script bails out for any reason (e.g. the server does not meet all of the assumptions or criteria the script requires to run smoothly), we can still fine-tune the system by hand and proceed with the installation. To do that, however, we need to know exactly what the script does, so that we can at least use its steps as a tested-and-working reference when something goes wrong in the middle.

Here are the steps the installation script goes through:

  1. prepare the environment before starting:

    • copy all images into a separate working folder

    • check if all required software packages are installed

  2. if not done yet, re-compile the ixgbe driver kernel module from source code for SR-IOV

  3. check if there are currently running VMX VMs and bridges with the same name; if yes, destroy and undefine any VMs and internal bridges with the same name

  4. remove all existing VFs by reloading the ixgbe kernel module without the max_vfs option

  5. configure "hugepage" and libvirt security, and prepare for VT-d/PCI passthrough ("PCI stub")

  6. configure the VF feature by reloading the ixgbe kernel module with the max_vfs option

  7. restart the libvirt service daemon to make sure libvirt security is successfully enabled

  8. run a python script to parse the YAML config file (vmx.conf) and generate all necessary XML files, which will be used later by libvirt (virsh) to bring up the VMs. The python script also generates some shell scripts (groups of commands), which can be used later to tune VM properties (vCPU pinning, VF MAC, etc.)

  9. for SR-IOV-based VMX only, run one of the generated shell scripts, vfconfig-generated.sh, to configure the properties of the physical ports and VFs. The script also uses the VT-d/PCI passthrough feature to detach the VFs from the host

  10. define external / internal bridge

  11. define and bring up VMX VMs

  12. perform vCPU pinning

The running log of the installation script, especially with the -lvf debug option, reveals the commands and steps used to prepare and set up VMX.
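For example, running the installation with the -lvf debug option captures every command it executes into a log file (the same invocation appears again in the troubleshooting section):

sudo ./vmx.sh -lvf --install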

5.1. script generated XML files

the installation script will generate some XML and shell scripts:

  • all necessary libvirt XML files

    these will be used by the virsh tool to bring up the VMX.

    ping@trinity:/virtualization/images/vmx_20151102.0/build/vmx1/xml$ ls -l | grep xml
    -rw-r--r-- 1 root root  383 Nov 24 22:23 br-ext-generated.xml
    -rw-r--r-- 1 root root   99 Nov 24 22:23 br-int-generated.xml
    -rw-r--r-- 1 root root 2749 Nov 24 22:23 vPFE-generated.xml
    -rw-r--r-- 1 root root 3560 Nov 24 22:23 vRE-generated.xml
  • shell scripts

    these will be executed by the installation script to configure VF properties and to configure vCPU pinning

    ping@trinity:/virtualization/images/vmx_20151102.0/build/vmx1/xml$ ls -l | grep sh
    -rw-r--r-- 1 root root  226 Nov 24 22:23 cpu_affinitize.sh
    -rw-r--r-- 1 root root  712 Nov 24 22:23 vfconfig-generated.sh
prepare the XML files for libvirt/virsh

The XML files play a crucial role when bringing up VMs using the libvirt/virsh tools - libvirt relies on XML to store structured data. Numerous API functions of libvirt (and their implementations in virsh) take XML documents as their arguments. XML documents are passed to an XML parser that detects syntax errors and then are processed internally.

We need to have the XML files ready for use before starting to install VMX manually.

  1. create all the necessary XML files

    XML files can be generated by any of the below methods:

    • modify existing files, e.g. ones generated by the VMX installation script in another setup (recommended)

    • create them from scratch (not recommended)

  2. update all parameters in the XML files

    • domain/bridge names

    • image path

    • MAC of ext mgmt (vcp_ext-vmx) interface in external bridge

    • console/serial port

    • for VFP VM: hostdev (SR-IOV VF) function# and MAC#

    • external bridge xml: external MAC and IP of mgmt i/f (vcp_ext-vmx)

The XML files are printed in the appendix for reference.
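Before defining anything, it can also be worth validating the hand-edited XML against the libvirt schemas. The virt-xml-validate tool ships with libvirt and catches most structural mistakes; the paths below are just examples from this setup:

virt-xml-validate build/vmx1/xml/vRE-generated.xml domain
virt-xml-validate build/vmx1/xml/vPFE-generated.xml domain
virt-xml-validate build/vmx1/xml/br-ext-generated.xml network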

5.2. manual installation steps

  1. enable iommu/VT-d (SR-IOV only)

    enabling IOMMU/VT-d allows physical PCI devices to be assigned directly to a VM.

    skip if this was done once
  2. recompile and reload ixgbe kernel driver module (SR-IOV only)

    The Juniper-modified ixgbe kernel driver is required for VMX. The driver needs to be re-compiled from the source code that comes with the installation package.

    skip if this was done once

    to verify if the ixgbe version is right:

    ping@ubuntu1:~$ modinfo ixgbe | grep version
    version:        3.19.1
    srcversion:     B97B1E7CF79A25F5E4D7B96
    vermagic:       3.13.0-32-generic SMP mod_unload modversions
    • compile and install ixgbe driver from source code

    cd /virtualization/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src
    rm -f ixgbe.ko
    make install
    sleep 5
    cmp ixgbe.ko /lib/modules/3.13.0-32-generic/kernel/drivers/net/ethernet/intel/ixgbe/ixgbe.ko
    • reload ixgbe driver

    rmmod ixgbevf
    rmmod ixgbe
    modprobe ixgbe
  3. setup hugepage

    skip if this was done once
     1sudo -i
     2echo 32768 > /proc/sys/vm/nr_overcommit_hugepages
     3exit
     4
     5mkdir /HugePage_vPFE
     6sudo mount -t hugetlbfs hugetlbfs /HugePage_vPFE
     7sudo mount | grep "HugePage_vPFE"
     8
     9sudo service libvirt-bin restart
    10
    11cat /etc/apparmor.d/abstractions/libvirt-qemu  | grep HugePage_vPFE
    12sudo -i
    13echo "owner \"/HugePage_vPFE/libvirt/qemu/**\" rw," >> /etc/apparmor.d/abstractions/libvirt-qemu
    14exit
    • VM won’t start properly without hugetlbfs mounted beforehand.

      ping@matrix:~$ ls /HugePage_vPFE/
      ping@matrix:~$ mount | grep "HugePage_vPFE"
      ping@matrix:~$
      ping@matrix:~$ sudo virsh start AVPN-vfp
      error: Failed to start domain AVPN-vfp
      error: internal error: hugetlbfs filesystem is not mounted or disabled by administrator config
    • libvirt needs to be restarted after hugetlbfs got mounted:

      ping@matrix:~$ sudo mount -t hugetlbfs hugetlbfs /HugePage_vPFE
      ping@matrix:~$ sudo virsh start AVPN-vfp
      error: Failed to start domain AVPN-vfp
      error: internal error: hugetlbfs filesystem is not mounted or disabled by administrator config
      
      ping@matrix:~$ ls /HugePage_vPFE/
      
      ping@matrix:~$ sudo service libvirt-bin restart
      libvirt-bin stop/waiting
      libvirt-bin start/running, process 46220
      
      ping@matrix:~$ ls /HugePage_vPFE/
      libvirt
      ping@matrix:~$ sudo virsh start AVPN-vfp
      Domain AVPN-vfp started
  4. prepare for pci-stub (SR-IOV only)

    skip if this was done once
    1sudo modprobe pci-stub
    2
    3sudo -i
    4echo "8086 10ed" > /sys/bus/pci/drivers/pci-stub/new_id
    5> /sys/bus/pci/drivers/pci-stub/bind
    6exit
  5. reload ixgbe kernel driver and configure SR-IOV VF (SR-IOV only)

    1sudo rmmod ixgbevf; sudo rmmod ixgbe; \
    2sudo modprobe ixgbe max_vfs=2,2,2,2,2,2,2,2; \
    3sudo modprobe tun ; \                              (1)
    4sleep 5; sudo brctl addif br-ext em9
    1 tun/tap interface: TUN (network TUNnel) simulates a network layer device and operates on layer 3 packets such as IP packets. TAP (network tap) simulates a link layer device and operates on layer 2 packets such as Ethernet frames. TUN is used with routing, while TAP is used for creating a network bridge (see wiki). In VMX, "tap" interfaces and bridges are used, and that is why this kernel module is required; it is not directly related to SR-IOV.
    these VF configs, as well as much of the other configuration here, are not persistent across a system reboot.
  6. configure interface and VF properties (SR-IOV only)

    follow the MAC assignment rule to assign MACs for the 2 VMX instances

    MAC address assignment
    02:04:17:01:01:01 vcp-vmx1 fxp0
    02:04:17:01:01:02 vfp-vmx1 eth0
    02:04:17:01:02:01 vmx1 ge-0/0/0 p3p1 vf0
    02:04:17:01:02:02 vmx1 ge-0/0/1 p2p1 vf0
    
    02:04:17:02:01:01 vcp-vmx2 fxp0
    02:04:17:02:01:02 vfp-vmx2 eth0
    02:04:17:02:02:01 vmx2 ge-0/0/0 p3p1 vf1
    02:04:17:02:02:02 vmx2 ge-0/0/1 p2p1 vf1
    1export XML_FOLDER="/virtualization/vmx1/xml"
    2sudo sh $XML_FOLDER/vfconfig-generated.sh
    3export XML_FOLDER="/virtualization/vmx2/xml"
    4sudo sh $XML_FOLDER/vfconfig-generated.sh

    or

     1#VMX1
     2sudo ifconfig p3p1 up promisc allmulti mtu 2000
     3sudo ifconfig p2p1 up promisc allmulti mtu 2000
     4sudo ip link set p3p1 vf 0 mac 02:04:17:01:02:01
     5sudo ip link set p2p1 vf 0 mac 02:04:17:01:02:02
     6sudo ip link set p3p1 vf 0 rate 10000
     7sudo ip link set p2p1 vf 0 rate 10000
     8sudo ip link set p3p1 vf 0 spoofchk off
     9sudo ip link set p2p1 vf 0 spoofchk off
    10
    11#VMX2
    12sudo ifconfig p3p1 up promisc allmulti mtu 2000
    13sudo ifconfig p2p1 up promisc allmulti mtu 2000
    14sudo ip link set p3p1 vf 1 mac 02:04:17:02:02:01
    15sudo ip link set p2p1 vf 1 mac 02:04:17:02:02:02
    16sudo ip link set p3p1 vf 1 rate 10000
    17sudo ip link set p2p1 vf 1 rate 10000
    18sudo ip link set p3p1 vf 1 spoofchk off
    19sudo ip link set p2p1 vf 1 spoofchk off
  7. pci-stub hide (SR-IOV only)

    unbind the VF from the host kernel driver and bind it to the pci-stub module; this effectively hides the VF from the host kernel so that it can later be assigned directly to the VM.

    skip if this was done once
     1sudo -i
     2#p3p1 vf0
     3echo 0000:23:10.0 > /sys/bus/pci/devices/0000:23:10.0/driver/unbind
     4echo 0000:23:10.0 >> /sys/bus/pci/drivers/pci-stub/bind
     5#p3p1 vf1
     6echo 0000:23:10.2 > /sys/bus/pci/devices/0000:23:10.2/driver/unbind
     7echo 0000:23:10.2 >> /sys/bus/pci/drivers/pci-stub/bind
     8
     9#p2p1 vf0
    10echo 0000:06:10.0 > /sys/bus/pci/devices/0000:06:10.0/driver/unbind
    11echo 0000:06:10.0 >> /sys/bus/pci/drivers/pci-stub/bind
    12#p2p1 vf1
    13echo 0000:06:10.2 > /sys/bus/pci/devices/0000:06:10.2/driver/unbind
    14echo 0000:06:10.2 >> /sys/bus/pci/drivers/pci-stub/bind
    15exit
  8. config VN & bridges

     1#vmx1:
     2export XML_FOLDER="/virtualization/vmx1"
     3sudo virsh net-define $XML_FOLDER/br-ext-generated.xml
     4
     5sudo ifconfig em1 0; \
     6sudo virsh net-start br-ext; \
     7sudo brctl addif br-ext em1; \
     8sudo route add default gw 10.85.4.1
     9sudo ifconfig br-ext hw ether 38:ea:a7:37:7c:54;
    10
    11sudo virsh net-define $XML_FOLDER/br-int-generated.xml
    12sudo virsh net-start br-int-vmx1
    13
    14#vmx2:
    15export XML_FOLDER="/virtualization/vmx2"
    16
    17sudo virsh net-define $XML_FOLDER/br-int-generated.xml
    18sudo virsh net-start br-int-vmx2
  9. define and start VMs using the generated xml file:

     1sudo virsh net-list | grep default
     2
     3#vmx1
     4sudo virsh define /virtualization/vmx1/vRE-generated1.xml
     5sudo virsh define /virtualization/vmx1/vPFE-generated1.xml
     6sudo virsh start vcp-vmx1
     7sudo virsh start vfp-vmx1
     8
     9#vmx2
    10sudo virsh define /virtualization/vmx2/vRE-generated2.xml
    11sudo virsh define /virtualization/vmx2/vPFE-generated2.xml
    12sudo virsh start vcp-vmx2
    13sudo virsh start vfp-vmx2
  10. vcpupin:

    1export XML_FOLDER="/virtualization/vmx1"
    2sudo sh $XML_FOLDER/cpu_affinitize.sh
    3export XML_FOLDER="/virtualization/vmx2"
    4sudo sh $XML_FOLDER/cpu_affinitize.sh

    or

     1#vmx1:
     2sudo virsh emulatorpin vcp-vmx1 0
     3sudo virsh emulatorpin vfp-vmx1 0
     4
     5sudo virsh vcpupin vcp-vmx1 0 7
     6
     7sudo virsh vcpupin vfp-vmx1 0 0
     8sudo virsh vcpupin vfp-vmx1 1 1
     9sudo virsh vcpupin vfp-vmx1 2 2
    10sudo virsh vcpupin vfp-vmx1 3 3
    11
    12#vmx2:
    13sudo virsh emulatorpin vcp-vmx2 0
    14sudo virsh emulatorpin vfp-vmx2 0
    15
    16sudo virsh vcpupin vcp-vmx2 0 15
    17
    18sudo virsh vcpupin vfp-vmx2 0 8
    19sudo virsh vcpupin vfp-vmx2 1 9
    20sudo virsh vcpupin vfp-vmx2 2 10
    21sudo virsh vcpupin vfp-vmx2 3 11
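    The resulting pinning can be verified with virsh itself - called without a cpulist, vcpupin (and emulatorpin) simply reports the current affinity:

    sudo virsh vcpupin vcp-vmx1
    sudo virsh vcpupin vfp-vmx1
    sudo virsh emulatorpin vfp-vmx1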
multiple VMX instances

using the above files generated from the installation script, it’s easy to bring up multiple instances of VMX manually.

5.3. verify installations

show running VMX instances:
ping@trinity:/virtualization/vmx1$ sudo virsh list --all
 Id    Name                           State
----------------------------------------------------
 4     vcp-vmx1                       running
 6     vfp-vmx1                       running
 7     vcp-vmx2                       running
 8     vfp-vmx2                       running
identify the console ports of the VCP and VFP VMs for both instances:
ping@trinity:/virtualization/vmx1$ netstat -na | grep ":8[12][67] "
tcp        0      0 127.0.0.1:8826          0.0.0.0:*               LISTEN              #<------
tcp        0      0 127.0.0.1:8827          0.0.0.0:*               LISTEN              #<------
tcp        0      0 127.0.0.1:8816          0.0.0.0:*               LISTEN              #<------
tcp        0      0 127.0.0.1:8817          0.0.0.0:*               LISTEN              #<------
tcp        0      0 127.0.0.1:8816          127.0.0.1:42765         ESTABLISHED
tcp        0      0 127.0.0.1:41726         127.0.0.1:8826          ESTABLISHED
tcp        0      0 127.0.0.1:42765         127.0.0.1:8816          ESTABLISHED
tcp        0      0 127.0.0.1:8826          127.0.0.1:41726         ESTABLISHED

the port number that qemu listens on can be configured to any number, as long as the port is unique system-wide and not in use by other applications. In the above example I use 88x6 for the VCP console port and 88x7 for the VFP console port, where x is the instance number.

Another approach is to use the same port number for all instances but change the local IP to 127.0.0.x, where x is the instance number - making sure the local "socket" (IP + port pair) is unique is sufficient.

ping@ubuntu4.54:~$ sudo netstat -nap | grep qemu
[sudo] password for ping:
tcp        0      0 127.0.0.2:8896          0.0.0.0:*               LISTEN      33284/qemu-system-x     #<------
tcp        0      0 127.0.0.1:8896          0.0.0.0:*               LISTEN      15030/qemu-system-x     #<------
tcp        0      0 127.0.0.2:8897          0.0.0.0:*               LISTEN      33330/qemu-system-x     #<------
tcp        0      0 127.0.0.1:8897          0.0.0.0:*               LISTEN      15075/qemu-system-x     #<------
tcp        0      0 127.0.0.1:5900          0.0.0.0:*               LISTEN      15030/qemu-system-x
tcp        0      0 127.0.0.1:5901          0.0.0.0:*               LISTEN      15075/qemu-system-x
tcp        0      0 127.0.0.2:5902          0.0.0.0:*               LISTEN      33284/qemu-system-x
tcp        0      0 127.0.0.2:5903          0.0.0.0:*               LISTEN      33330/qemu-system-x
unix  2      [ ACC ]     STREAM     LISTENING     62021    15030/qemu-system-x //lib/libvirt/qemu/MIS-VMX-VCP.monitor
unix  2      [ ACC ]     STREAM     LISTENING     23706    15075/qemu-system-x //lib/libvirt/qemu/MIS-VMX-VFP.monitor
unix  2      [ ACC ]     STREAM     LISTENING     62277    33284/qemu-system-x //lib/libvirt/qemu/AVPN-VMX-VCP.monitor
unix  2      [ ACC ]     STREAM     LISTENING     24867    33330/qemu-system-x //lib/libvirt/qemu/AVPN-VMX-VFP.monitor
unix  3      [ ]         STREAM     CONNECTED     63371    33330/qemu-system-x //lib/libvirt/qemu/AVPN-VMX-VFP.monitor
unix  3      [ ]         STREAM     CONNECTED     27800    15030/qemu-system-x //lib/libvirt/qemu/MIS-VMX-VCP.monitor
unix  3      [ ]         STREAM     CONNECTED     63366    33284/qemu-system-x //lib/libvirt/qemu/AVPN-VMX-VCP.monitor
unix  3      [ ]         STREAM     CONNECTED     59462    15075/qemu-system-x //lib/libvirt/qemu/MIS-VMX-VFP.monitor
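When in doubt about which port belongs to which VM, the mapping can also be read straight from the domain XML, where the console is defined as a tcp-type serial device (the grep pattern below is an assumption about how the generated XML looks; adjust as needed):

sudo virsh dumpxml vcp-vmx1 | grep -i -A 3 "serial type"
sudo virsh dumpxml vfp-vmx1 | grep -i -A 3 "serial type"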
login to console of the two VMX instances:

vmx1:

ping@trinity:~$ telnet localhost 8816
Trying ::1...
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
Amnesiac (ttyd0)

login: root

--- JUNOS 15.1F-20151104.0 built 2015-11-04 05:39:37 UTC
root@% cli
root >

root> show version
Model: vmx
Junos: 15.1F-20151104.0
JUNOS Base OS boot [15.1F-20151104.0]
JUNOS Base OS Software Suite [15.1F-20151104.0]
...

labroot> show chassis fpc
                     Temp  CPU Utilization (%)   CPU Utilization (%)  Memory    Utilization (%)
Slot State            (C)  Total  Interrupt      1min   5min   15min  DRAM (MB) Heap     Buffer
  0  Online           Testing   3         0        3      3      2      1         6          0

labroot> show chassis fpc pic-status
Slot 0   Online       Virtual FPC
  PIC 0  Online       Virtual

vmx2:

ping@trinity:~$ telnet localhost 8826
Trying ::1...
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
......
Amnesiac (ttyd0)
login: lab
root@% cli
root>
ping test:
labroot@vmx1# run show interfaces ge-0/0/0 terse
Interface               Admin Link Proto    Local                 Remote
ge-0/0/0                up    up
ge-0/0/0.0              up    up   inet     1.1.1.1/24
                                   multiservice

root@vmx2# run show interfaces ge-0/0/0 terse
Interface               Admin Link Proto    Local                 Remote
ge-0/0/0                up    up
ge-0/0/0.0              up    up   inet     1.1.1.2/24
                                   multiservice

[edit]
labroot@vmx1# run ping 1.1.1.2
PING 1.1.1.2 (1.1.1.2): 56 data bytes
64 bytes from 1.1.1.2: icmp_seq=0 ttl=64 time=1.694 ms
64 bytes from 1.1.1.2: icmp_seq=1 ttl=64 time=1.751 ms
64 bytes from 1.1.1.2: icmp_seq=2 ttl=64 time=1.817 ms
^C
--- 1.1.1.2 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 1.694/1.754/1.817/0.050 ms

[edit]
labroot@vmx1# run show arp
MAC Address       Address         Name                      Interface               Flags
02:04:17:02:02:01 1.1.1.2         1.1.1.2                   ge-0/0/0.0              none
52:54:00:9c:b3:f2 128.0.0.16      128.0.0.16                em1.0                   none
Total entries: 2

[edit]
labroot@vmx1# run ping 10.85.4.107
PING 10.85.4.107 (10.85.4.107): 56 data bytes
64 bytes from 10.85.4.107: icmp_seq=0 ttl=64 time=0.822 ms
64 bytes from 10.85.4.107: icmp_seq=1 ttl=64 time=0.794 ms
^C
--- 10.85.4.107 ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 0.794/0.808/0.822/0.014 ms

5.4. internal and external bridges in multiple tenant environment

some highlights of the multiple VMX installation process we demonstrated:

  • external management interfaces of all VMs from both instances share the same bridge: br-ext

  • each instance has its own internal bridge (br-int-vmx1 and br-int-vmx2) for the internal management interfaces of its VCP/VFP VMs

  • the physical NIC ports p2p1 and p3p1 were each "split" into two virtual ports (VFs), and each VF is used by a different instance.

These can be illustrated in this diagram:

 vcp-vmx1                   br-int-vmx1 br-int-vmx2                   vcp-vmx2
+--------+                         |       |                          +--------+
|        |                         |       |                          |        |
|     em1+-------------------------+       +--------------------------+em1     |
|        |             vcp_int_vmx1|       |vcp_int_vmx2              |        |
| fxp0   |                         |       |                          | fxp0   |
+---+----+                         |       |                          +---+----+
    |               vfp-vmx1       |       |      vfp-vmx2                |
    |              +--------+      |       |      +--------+              |
    |              |   vfp_int-vmx1|       |vfp-int-vmx2   |              |
    |              |    eth0+------+       +------+eth0    |              |
    |              |        |-vmx1 |       |      |        |              |
    |              |eth1    |                     |  eth1  |              |
    |              ++-------+       p2p1          ++--+---++              |
    |               |              +--------+             |               |
    |               |              |        |             |               |
    |               |      --------+--VF0-- |             |               |
    |               |     ge-0/0/1 | -VF1---+------+      |               |
    |               |              |        | ge-0/0/1    |               |
    |               |              +--------+             |               |
    |               |                                     |               |
    |               |              p3p1                   |               |
    |               |              +--------+             |               |
    |               |     ge-0/0/0 |        |             |               |
    |               |      --------+--VF0-- | ge-0/0/0    |               |
    |               |              | -VF1---+---------    |               |
    |               |              |        |             |               |
    |               |              +--------+             |               |
    |vcp_ext-vmx1   | vfp_ext-vmx1            vfp_ext-vmx2|  vcp_ext-vmx2 |
+---+---------------+---+---------------------------------+---------------+----+
|                                 br-ext  br-ext-nic                           |
+---------------------------------------+--------------------------------------+
                                        |
                                        |em1

5.5. manually uninstall VMXs

the steps to "uninstall" VMX are much simpler:

  1. "poweroff" and remove VMs:

    sudo virsh destroy vcp-vmx1
    sudo virsh destroy vfp-vmx1
    sudo virsh undefine vcp-vmx1
    sudo virsh undefine vfp-vmx1
    sudo virsh destroy vcp-vmx2
    sudo virsh destroy vfp-vmx2
    sudo virsh undefine vcp-vmx2
    sudo virsh undefine vfp-vmx2
  2. disable and delete the VNs and associated bridges:

    sudo virsh net-destroy br-ext; \
    sudo ifconfig em1 10.85.4.17/25; \
    sudo route add default gw 10.85.4.1
    sudo virsh net-undefine br-ext
    sudo virsh net-destroy br-int-vmx1
    sudo virsh net-undefine br-int-vmx1
    sudo virsh net-destroy br-int-vmx2
    sudo virsh net-undefine br-int-vmx2
  3. disable VF of SR-IOV:

    sudo rmmod ixgbevf; \
    sudo rmmod ixgbe; \
    sleep 5
    sudo modprobe ixgbe

6. upgrading VMX

The procedure for upgrading VMX is conceptually similar to upgrading a physical Junos router - you load the new images and reboot the box with them.

In practice, the steps are different. Since a standalone VMX uses libvirt to manage the VMs, the upgrade steps involve modifying the corresponding XML files that the libvirt tools rely on. These XML files provide all of the parameters that describe in detail how a VM will be brought up. The parameters include, but are not limited to:

  • domain name - VM name

  • guest VM memory

  • guest VM vcpu numbers

  • guest VM emulated cpu architecture

  • guest VM (emulated) hard disk images

  • guest VM bridges

  • guest VM (emulated) peripherals devices

    • video/audio/usb/serial/keyboard/mouse/etc…​

  • guest VM (emulated) NICs

  • other devices/components of guest VM

With regard to guest OS upgrading, only the "hard disk images" are relevant. So essentially, to upgrade VMX, only the guest OS hard disk image locations need to be modified, followed by a guest VM reboot.

  1. (optional) backup VCP/VFP VM configuration XML files:

    1. virsh dumpxml vfp-vmx1 > vfp-vmx1.xml

    2. virsh dumpxml vcp-vmx1 > vcp-vmx1.xml

    3. virsh net-dumpxml br-int-vmx1 > br-int-vmx1.xml

  2. Edit the generated vfp-vmx1.xml & vcp-vmx1.xml so they point to desired VCP image, PFE image, and HDD image file.

    VCP
     1<disk device="disk" type="file">
     2  <driver cache="directsync" name="qemu" type="qcow2" />
     3  <source file="/path/to/jinstall64-vmx-15.1F-20151104.0-domestic.img" />   #<------
     4  <target bus="ide" dev="hda" />
     5  <address bus="0" controller="0" target="0" type="drive" unit="0" />
     6</disk>
     7
     8<disk device="disk" type="file">
     9  <driver cache="directsync" name="qemu" type="qcow2" />
    10  <source file="/path/to/vmxhdd.img" />     #<------
    11  <target bus="ide" dev="hdb" />
    12  <address bus="0" controller="0" target="0" type="drive" unit="1" />
    13</disk>
    VFP
    1<disk device="disk" type="file">
    2  <driver cache="directsync" name="qemu" type="raw" />
    3  <source file="/images/vmx_20151102.0/build/vmx1/images/vFPC-20151102.img" />      #<------
    4  <target bus="ide" dev="hda" />
    5</disk>
  3. shutdown VCP/VFP VMs

    1. virsh destroy vfp-vmx1

    2. virsh destroy vcp-vmx1

    3. virsh undefine vfp-vmx1

    4. virsh undefine vcp-vmx1

    5. virsh net-undefine br-int-vmx1

  4. start VM with new configuration

    1. virsh define vfp-vmx1.xml

    2. virsh define vcp-vmx1.xml

    3. virsh net-define br-int-vmx1.xml

    4. virsh net-start br-int-vmx1

    5. virsh start vfp-vmx1

    6. virsh start vcp-vmx1
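After the VMs come back up, it is worth double-checking that they really booted from the new images, using the same virsh domblklist command shown elsewhere in this doc:

sudo virsh domblklist vcp-vmx1
sudo virsh domblklist vfp-vmx1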

7. troubleshooting installation script

7.1. ixgbe compilation error

 1ping@trinity:/images/vmx_20151102.0$ sudo ./vmx.sh -lvf --install
 2......
 3
 4[OK]
 5Check IXGBE drivers...............................
 6[Command] cd /virtualization/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src
 7
 8[Command] rm -f ixgbe.ko
 9
10[Command] make install
11make -C /lib/modules/3.19.0-25-generic/build SUBDIRS=/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src modules
12make[1]: Entering directory `/usr/src/linux-headers-3.19.0-25-generic'
13  CC [M]  /virtualization/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src/ixgbe_main.o
14/virtualization/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src/ixgbe_main.c: In function ‘ixgbe_service_event_complete’:
15/virtualization/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src/ixgbe_main.c:339:2: error: implicit declaration of function ‘smp_mb__before_clear_bit’ [-Werror=implicit-function-declaration]
16  smp_mb__before_clear_bit();
17  ^
18/virtualization/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src/ixgbe_main.c: In function ‘ixgbe_rx_hash’:
19/virtualization/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src/ixgbe_main.c:923:6: error: ‘struct sk_buff’ has no member named ‘rxhash’
20   skb->rxhash = le32_to_cpu(rx_desc->wb.lower.hi_dword.rss);
21      ^
22/virtualization/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src/ixgbe_main.c: In function ‘ixgbe_del_mac_filter’:
23/virtualization/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src/ixgbe_main.c:4702:3: error: implicit declaration of function ‘compare_ether_addr’ [-Werror=implicit-function-declaration]
24   if (!compare_ether_addr(addr, adapter->mac_table[i].addr) &&
25   ^
26/virtualization/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src/ixgbe_main.c: In function ‘ixgbe_ndo_bridge_getlink’:
27/virtualization/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src/ixgbe_main.c:8800:2: error: too few arguments to function ‘ndo_dflt_bridge_getlink’
28  return ndo_dflt_bridge_getlink(skb, pid, seq, dev, mode);
29  ^
30In file included from include/net/dst.h:13:0,
31                 from include/net/sock.h:68,
32                 from include/linux/tcp.h:22,
33                 from /virtualization/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src/ixgbe_main.c:40:
34include/linux/rtnetlink.h:110:12: note: declared here
35 extern int ndo_dflt_bridge_getlink(struct sk_buff *skb, u32 pid, u32 seq,
36            ^
37/virtualization/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src/ixgbe_main.c: At top level:
38/virtualization/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src/ixgbe_main.c:8830:2: error: unknown field ‘ndo_set_vf_tx_rate’ specified in initializer
39  .ndo_set_vf_tx_rate = ixgbe_ndo_set_vf_bw,
40  ^
41/virtualization/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src/ixgbe_main.c:8830:2: warning: initialization from incompatible pointer type [enabled by default]
42/virtualization/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src/ixgbe_main.c:8830:2: warning: (near initialization for ‘ixgbe_netdev_ops.ndo_set_vf_rate’) [enabled by default]
43/virtualization/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src/ixgbe_main.c:8872:2: warning: initialization from incompatible pointer type [enabled by default]
44  .ndo_fdb_add  = ixgbe_ndo_fdb_add,
45  ^
46/virtualization/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src/ixgbe_main.c:8872:2: warning: (near initialization for ‘ixgbe_netdev_ops.ndo_fdb_add’) [enabled by default]
47/virtualization/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src/ixgbe_main.c: In function ‘ixgbe_ndo_bridge_getlink’:
48/virtualization/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src/ixgbe_main.c:8801:1: warning: control reaches end of non-void function [-Wreturn-type]
49 }
50 ^
51cc1: some warnings being treated as errors
52make[2]: *** [/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src/ixgbe_main.o] Error 1
53make[1]: *** [_module_/images/vmx_20151102.0/drivers/ixgbe-3.19.1/src] Error 2
54make[1]: Leaving directory `/usr/src/linux-headers-3.19.0-25-generic'
55make: *** [default] Error 2
56[Failed]
57Log file........................................../images/vmx_20151102.0/build/vmx1/logs/vmx_1447967843.log
58==================================================
59 Aborted!. 1 error(s) and 0 warning(s)
60==================================================
61ping@trinity:/images/vmx_20151102.0$
62ping@trinity:/images/vmx_20151102.0$
63ping@trinity:/images/vmx_20151102.0$

7.1.1. analysis and solution

This is an ixgbe kernel driver module compilation error, and the cause is a "wrong" (unsupported) kernel version.

Currently the solution is to change the host Ubuntu kernel to 3.13; see the section "install required linux kernel".
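To check what the host is currently running, and (on Ubuntu 14.04) to pull in a 3.13 kernel, something like the following can be used - the exact package names are an assumption based on the 3.13.0-32-generic kernel shown elsewhere in this doc:

uname -r
sudo apt-get install linux-image-3.13.0-32-generic linux-headers-3.13.0-32-generic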

7.2. 82599 NIC not recognized

 1......
 2[OK]
 3Setup huge pages to 32768.........................
 4[Command] echo 32768 > /proc/sys/vm/nr_overcommit_hugepages
 5
 6[Command] mkdir /HugePage_vPFE
 7
 8[Command] mount | grep "HugePage_vPFE"
 9
10[Command] mount -t hugetlbfs hugetlbfs /HugePage_vPFE
11[OK]
12
13[Command] cat /etc/apparmor.d/abstractions/libvirt-qemu  | grep HugePage_vPFE
14
15[Command] echo "owner \"/HugePage_vPFE/libvirt/qemu/**\" rw," >> /etc/apparmor.d/abstractions/libvirt-qemu
16
17[Command] modprobe pci-stub
18
19[Command] echo "8086 10ed" > /sys/bus/pci/drivers/pci-stub/new_id
20
21[Command] > /sys/bus/pci/drivers/pci-stub/bind
22Number of Intel 82599 NICs........................0
23Minimum Intel 82599 NICs available................No
24==================================================
25    Aborted!
26==================================================

7.2.1. analysis and solution

the script does not find any of the specific NIC variants it was looking for:

  • 82599 ES

  • 82599 EB

  • 82599 EN

and so it aborted.

66 vmx_system_setup_ixgbe()
67 {
68     nic_count=$(lspci | grep 82599 | grep 'ES\|EB\|EN' | wc -l)      #<------
69     vmx_echo_textval "Number of Intel 82599 NICs" "$nic_count"
70     if [ $nic_count -eq 0 ];
71     then
72         vmx_echo_textval_red "Minimum Intel 82599 NICs available" "No"
73         vmx_echo_summary_banner "Aborted!"
74         exit 2
75     fi
76     vmx_echo_text "Configuring Intel 82599 Adapters for SRIOV"
77     cmd="rmmod ixgbevf"
78     vmx_exec_cmd -cmd "$cmd"
79     cmd="rmmod ixgbe"
80     vmx_exec_cmd -cmd "$cmd"
81     max_vfs_str="1"
82     nic_count=$(expr $nic_count - 1)
83     while [ $nic_count -gt 0 ];
84     do
85         max_vfs_str="$max_vfs_str,1"
86         nic_count=$(expr $nic_count - 1)
87     done
88     cmd="modprobe ixgbe max_vfs=$max_vfs_str"
89     vmx_exec_cmd -cmd "$cmd"
90     cmd="modprobe ixgbevf"
91     vmx_exec_cmd -cmd "$cmd"
92     cmd="modprobe tun"
93     vmx_exec_cmd -cmd "$cmd"
94     vmx_echo "[OK]"
95     cmd_descr="Number of Virtual Functions created"
96     cmd="lspci |grep 82599 | grep \"Virtual Function\" | wc -l"
97     vmx_exec_cmd -cmd "$cmd" -cmd_descr "$cmd_descr"
98 }

The solution is to modify the installation script to drop the ES/EB/EN keyword filter when looking for 82599 NICs.
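A minimal sketch of that change, based on the vmx_system_setup_ixgbe() excerpt above - simply drop the ES/EB/EN filter so that any 82599 variant is counted:

# original line in the script:
#   nic_count=$(lspci | grep 82599 | grep 'ES\|EB\|EN' | wc -l)
# relaxed version:
nic_count=$(lspci | grep 82599 | wc -l)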

7.3. hyperthread

 1==================================================
 2    System Setup Completed
 3==================================================
 4
 5[Command] brctl show br-ext | grep em1
 6can't get info No such device
 7Get Management Address of em1.....................
 8[Command] expr "10.85.4.17" != ""
 91
10[OK]
11Generate libvirt files............................
12[Command] python /virtualization/images/vmx_20151102.0/scripts/common/vmx_configure.py /virtualization/images/vmx_20151102.0/config/vmx.conf
13Handling Host config
14Initializing interface names for vmx1
15Model=FPC
1610.85.4.17
17255.255.255.128
18handling bridge config
19Handling Routing Engine params
20Handling Forwarding Engine params
21['0', '1', '2', '3', '4', '5', '6', '7']
22Core list HTO:
23['0', '1', '2', '3', '4', '5', '6', '7']
24Core list HT1:
25[]
260
27[Failed]
28Traceback (most recent call last):
29  File "/images/vmx_20151102.0/scripts/common/vmx_configure.py", line 673, in <module>
30    node_list[index].cpu_list[2*corenum + 1] = core_list_ht1[corenum]
31IndexError: list index out of range
32Log file........................................../images/vmx_20151102.0/build/vmx1/logs/vmx_1448050510.log
33==================================================
34 Aborted!. 1 error(s) and 0 warning(s)
35==================================================

7.3.1. analysis and solution:

Starting from 15.x, VMX supports flow-cache, which by design requires the hyperthreading feature.

this can be fixed by either:

  • enabling hyperthreading

  • changing the script to skip the hyperthreading check

 12134     #core_list_ht1 = core_list.cpu_list[num_physical_cores_per_node : num_physical_cores_per_node*2]
 22135
 32136     print "Core list HTO: "
 42137     print core_list_ht0
 52138     #print "Core list HT1: "       (1)
 62139     #print core_list_ht1           (1)
 72140
 82141     for corenum in range(num_physical_cores_per_node):
 92142
102143         print corenum
112144         #node_list[index].cpu_list[2*corenum] = core_list_ht0[corenum]    (2)
122145         node_list[index].cpu_list[corenum] = core_list_ht0[corenum]       (2)
132146         #node_list[index].cpu_list[2*corenum + 1] = core_list_ht1[corenum](2)
142147
152148         #for corenum in range(len(node_list)):
162149         #       node_list[index].cpu_list[corenum] =
172150     print node_list[index].cpu_list
1 comment these lines
2 disable hyperthreading check

Part 2: VMX verification

8. VCP/VFP guest VM overview

After installation of VMX, at least two VMs should have been created:

  • VM for virtual forwarding plane of VMX (VFP/vFPC)

  • VM for virtual control plane of VMX (VCP/vRE)

The VFP VM runs the (Yocto-based) virtual Trio forwarding plane software and the VCP VM runs the (FreeBSD-based) Junos OS.

This is demonstrated in the VMX architecture:

Figure 2. architecture of VMX instance

These VMs should be in the "running" state, as can be listed with the libvirt virsh command below:

ping@trinity:~$ sudo virsh list
[sudo] password for ping:
 Id    Name                           State
----------------------------------------------------
 2     vcp-vmx1                       running
 3     vfp-vmx1                       running

The next thing we want to verify is the images these VMs were built from:

ping@trinity:~$ sudo virsh domblklist 2
Target     Source
------------------------------------------------
hda        /virtualization/images/vmx_20151102.0/build/vmx1/images/jinstall64-vmx-15.1F-20151104.0-domestic.img
hdb        /virtualization/images/vmx_20151102.0/build/vmx1/images/vmxhdd.img
sda        /virtualization/images/vmx_20151102.0/images/metadata_usb.img
ping@trinity:~$ sudo virsh domblklist 3
Target     Source
------------------------------------------------
hda        /virtualization/images/vmx_20151102.0/build/vmx1/images/vFPC-20151102.img

NOTE: the libvirt virsh utility is a very powerful frontend toolset that can be used to manage the VMs under its control [3]. A lot more data can be gathered simply from the virsh tool, without the need to log in to the guest VMs; see the "VM management - virsh" section for more details on this.
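For example, the following standard virsh subcommands report a VM's resources and vNIC attachments without touching the guest at all:

sudo virsh dominfo vcp-vmx1      # vCPUs, memory, state, autostart
sudo virsh domiflist vcp-vmx1    # which bridge/network each vNIC is attached to
sudo virsh vcpuinfo vfp-vmx1     # per-vCPU state and CPU affinity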

9. Login to the VNC console

Right after a fresh VMX installation, the guest VCP/VFP VMs will come up with no configuration. So the initial configuration (mgmt IP, GW, login, etc.) needs to be done via a "console" connection. Previously we demonstrated how to log in to the vRE and vPFE console via telnet, which is also a convenient way to collect the guest VM boot messages when an IP-based management session is not available. Another way to acquire a "console" of a guest VM is via the built-in VNC support provided by KVM.

locate the console/VNC access ports

A quick way to find out the TCP port for a console connection is to check netstat:

 1ping@trinity: $ sudo netstat -lntp
 2Active Internet connections (only servers)
 3Proto Recv-Q Send-Q Local Address  Foreign Address State   PID/Program name
 4tcp    0     0   127.0.0.1:5900    0.0.0.0:*       LISTEN  99035/qemu-system-x (1)
 5tcp    0     0   127.0.0.1:5901    0.0.0.0:*       LISTEN  99192/qemu-system-x (1)
 6tcp    0     0   127.0.0.1:8816    0.0.0.0:*       LISTEN  99035/qemu-system-x (2)
 7tcp    0     0   127.0.0.1:8817    0.0.0.0:*       LISTEN  99192/qemu-system-x (2)
 8tcp    0     0   10.85.4.17:53     0.0.0.0:*       LISTEN  98162/dnsmasq       (3)
 9tcp    0     0   192.168.122.1:53  0.0.0.0:*       LISTEN  1731/dnsmasq        (3)
10tcp    0     0   0.0.0.0:22        0.0.0.0:*       LISTEN  1438/sshd
11tcp    0     0   127.0.0.1:6010    0.0.0.0:*       LISTEN  2677/1
12tcp    0     0   127.0.0.1:6011    0.0.0.0:*       LISTEN  10570/2
13tcp    0     0   127.0.0.1:6012    0.0.0.0:*       LISTEN  10636/3
14tcp    0     0   127.0.0.1:6013    0.0.0.0:*       LISTEN  14223/4
1 VNC service port
2 console port
3 DNS service port [4]

VNC ports can also be acquired via virsh command:

ping@trinity:~$ sudo virsh vncdisplay 2
127.0.0.1:0

ping@trinity:~$ sudo virsh vncdisplay 3
127.0.0.1:1

This indicates VNC ports 5900 and 5901 for VCP and VFP respectively, the same as shown in the previous netstat output.

Now the VNC GUI can be accessed via the legacy vncviewer or the virt-viewer tool, from the server’s GUI.
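For example (the display numbers follow the vncdisplay output above; the SSH tunnel is only needed when working from a remote desktop, since qemu here listens on 127.0.0.1 only):

# from the server's own GUI session:
vncviewer 127.0.0.1:0                    # VCP console (display :0 = TCP 5900)
virt-viewer -c qemu:///system vcp-vmx1

# from a remote machine, tunnel the VNC port first, then point the viewer at it:
ssh -L 5900:127.0.0.1:5900 ping@trinity
vncviewer 127.0.0.1:0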

Figure 3. example of using libvirt virt-viewer to access VNC console
Figure 4. example of using linux vncviewer to access VNC console

10. ixgbe driver and ixgbe-driven interfaces

With the SR-IOV version of the VMX installation, the Juniper-modified ixgbe driver will be in use:

  • ixgbe kernel driver version is 3.19.1 (Juniper modified)

  • more parameters are supported than in the default ixgbe driver that comes with the linux kernel

  • the ixgbevf kernel driver module is now loaded to support SR-IOV Virtual Functions

ping@trinity:~$ modinfo ixgbe
filename:       /lib/modules/3.13.0-32-generic/kernel/drivers/net/ethernet/intel/ixgbe/ixgbe.ko
version:        3.19.1      #<------
license:        GPL
description:    Intel(R) 10 Gigabit PCI Express Network Driver
author:         Intel Corporation, <linux.nics@intel.com>
srcversion:     B97B1E7CF79A25F5E4D7B96
alias:          pci:v00008086d00001560sv*sd*bc*sc*i*
alias:          pci:v00008086d00001558sv*sd*bc*sc*i*
alias:          pci:v00008086d0000154Asv*sd*bc*sc*i*
alias:          pci:v00008086d00001557sv*sd*bc*sc*i*
alias:          pci:v00008086d0000154Fsv*sd*bc*sc*i*
alias:          pci:v00008086d0000154Dsv*sd*bc*sc*i*
alias:          pci:v00008086d00001528sv*sd*bc*sc*i*
alias:          pci:v00008086d000010F8sv*sd*bc*sc*i*
alias:          pci:v00008086d0000151Csv*sd*bc*sc*i*
alias:          pci:v00008086d00001529sv*sd*bc*sc*i*
alias:          pci:v00008086d0000152Asv*sd*bc*sc*i*
alias:          pci:v00008086d000010F9sv*sd*bc*sc*i*
alias:          pci:v00008086d00001514sv*sd*bc*sc*i*
alias:          pci:v00008086d00001507sv*sd*bc*sc*i*
alias:          pci:v00008086d000010FBsv*sd*bc*sc*i*
alias:          pci:v00008086d00001517sv*sd*bc*sc*i*
alias:          pci:v00008086d000010FCsv*sd*bc*sc*i*
alias:          pci:v00008086d000010F7sv*sd*bc*sc*i*
alias:          pci:v00008086d00001508sv*sd*bc*sc*i*
alias:          pci:v00008086d000010DBsv*sd*bc*sc*i*
alias:          pci:v00008086d000010F4sv*sd*bc*sc*i*
alias:          pci:v00008086d000010E1sv*sd*bc*sc*i*
alias:          pci:v00008086d000010F1sv*sd*bc*sc*i*
alias:          pci:v00008086d000010ECsv*sd*bc*sc*i*
alias:          pci:v00008086d000010DDsv*sd*bc*sc*i*
alias:          pci:v00008086d0000150Bsv*sd*bc*sc*i*
alias:          pci:v00008086d000010C8sv*sd*bc*sc*i*
alias:          pci:v00008086d000010C7sv*sd*bc*sc*i*
alias:          pci:v00008086d000010C6sv*sd*bc*sc*i*
alias:          pci:v00008086d000010B6sv*sd*bc*sc*i*
depends:        dca
vermagic:       3.13.0-32-generic SMP mod_unload modversions
parm:           InterruptType:Change Interrupt Mode (0=Legacy, 1=MSI, 2=MSI-X), default IntMode (deprecated) (array of int)
parm:           IntMode:Change Interrupt Mode (0=Legacy, 1=MSI, 2=MSI-X), default 2 (array of int)
parm:           MQ:Disable or enable Multiple Queues, default 1 (array of int)
parm:           DCA:Disable or enable Direct Cache Access, 0=disabled, 1=descriptor only, 2=descriptor and data (array of int)
parm:           RSS:Number of Receive-Side Scaling Descriptor Queues, default 0=number of cpus (array of int)
parm:           VMDQ:Number of Virtual Machine Device Queues: 0/1 = disable, 2-16 enable (default=8) (array of int)
parm:           max_vfs:Number of Virtual Functions: 0 = disable (default), 1-63 = enable this many VFs (array of int)
parm:           L2LBen:L2 Loopback Enable: 0 = disable, 1 = enable (default) (array of int)
parm:           InterruptThrottleRate:Maximum interrupts per second, per vector, (0,1,956-488281), default 1 (array of int)
parm:           LLIPort:Low Latency Interrupt TCP Port (0-65535) (array of int)
parm:           LLIPush:Low Latency Interrupt on TCP Push flag (0,1) (array of int)
parm:           LLISize:Low Latency Interrupt on Packet Size (0-1500) (array of int)
parm:           LLIEType:Low Latency Interrupt Ethernet Protocol Type (array of int)
parm:           LLIVLANP:Low Latency Interrupt on VLAN priority threshold (array of int)
parm:           FdirPballoc:Flow Director packet buffer allocation level:
                        1 = 8k hash filters or 2k perfect filters
                        2 = 16k hash filters or 4k perfect filters
                        3 = 32k hash filters or 8k perfect filters (array of int)
parm:           AtrSampleRate:Software ATR Tx packet sample rate (array of int)
parm:           LRO:Large Receive Offload (0,1), default 1 = on (array of int)
parm:           allow_unsupported_sfp:Allow unsupported and untested SFP+ modules on 82599 based adapters, default 0 = Disable (array of int)

The ixgbe driver used here is the Juniper-modified version, which adds the ability to accept ingress multicast packets arriving on a VF. So it is different from what is available from the Intel website, even though both may show the same version number.

root@ubuntu1:~# modinfo ixgbevf
filename:       /lib/modules/3.13.0-32-generic/kernel/drivers/net/ethernet/intel/ixgbevf/ixgbevf.ko
version:        2.11.3-k
license:        GPL
description:    Intel(R) 82599 Virtual Function Driver
author:         Intel Corporation, <linux.nics@intel.com>
srcversion:     AE2D8A25951B508611E943D
alias:          pci:v00008086d00001515sv*sd*bc*sc*i*
alias:          pci:v00008086d000010EDsv*sd*bc*sc*i*
depends:
intree:         Y
vermagic:       3.13.0-32-generic SMP mod_unload modversions
signer:         Magrathea: Glacier signing key
sig_key:        5E:3C:0F:9C:A6:E3:65:43:53:5F:A2:BB:5B:70:9E:84:F1:6D:A7:C7
sig_hashalgo:   sha512
parm:           debug:Debug level (0=none,...,16=all) (int)

To verify interface status on a linux server, the most commonly used commands are:

  • ifconfig command: the "legacy" system administration utility in Unix-like OSes to show or manipulate interfaces

  • ip command: a relatively new command to show/manipulate interfaces, routing, devices, policy routing, tunnels, etc.; much more feature-rich than the traditional ifconfig tool

  • ethtool command: to query or control network driver and low-level hardware settings, mostly used to poll L1 physical layer information of an interface (e.g. PCI address)

in this section we’ll use ifconfig and ip link show to list the physical ports, VFs and bridges.
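A few quick one-liners that complement them (all standard Linux tools; p2p1 is one of the 82599 ports in this setup):

ip link show p2p1        # configured VFs show up as "vf 0", "vf 1" lines under the PF
ethtool -i p2p1          # driver name/version and PCI bus address of the port
sudo brctl show          # all linux bridges and their member ports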

10.1. legacy ifconfig command

This is what it looks like after one VMX instance was built and brought up to running status:

ping@trinity:/virtualization/images/vmx_20151102.0$ ifconfig -a

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:37 errors:0 dropped:0 overruns:0 frame:0
          TX packets:37 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:6468 (6.4 KB)  TX bytes:6468 (6.4 KB)

br-ext    Link encap:Ethernet  HWaddr 38:ea:a7:37:7c:54                 \     (1)
          inet addr:10.85.4.17  Bcast:10.85.4.127  Mask:255.255.255.128  |
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1             |
          RX packets:661 errors:0 dropped:0 overruns:0 frame:0           |
          TX packets:584 errors:0 dropped:0 overruns:0 carrier:0         |
          collisions:0 txqueuelen:0                                      |
          RX bytes:36305 (36.3 KB)  TX bytes:77528 (77.5 KB)             |
                                                                         |
br-ext-nic Link encap:Ethernet  HWaddr 52:54:00:9f:a0:77                 |    (2)
          BROADCAST MULTICAST  MTU:1500  Metric:1                        |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0             |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0           |
          collisions:0 txqueuelen:500                                    \
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                          X br-ext bridge
                                                                         |  and I/Fs
vcp_ext-vmx1 Link encap:Ethernet  HWaddr fe:04:17:01:01:01               |    (3)
          inet6 addr: fe80::fc04:17ff:fe01:101/64 Scope:Link             |
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1             |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0             |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0           |
          collisions:0 txqueuelen:500                                    |
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                         |
                                                                         |
vfp_ext-vmx1 Link encap:Ethernet  HWaddr fe:04:17:01:01:02               |    (4)
          inet6 addr: fe80::fc04:17ff:fe01:102/64 Scope:Link             |
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1             |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0             |
          TX packets:85 errors:0 dropped:0 overruns:0 carrier:0          |
          collisions:0 txqueuelen:500                                    |
          RX bytes:0 (0.0 B)  TX bytes:6934 (6.9 KB)                    /

br-int-vmx1 Link encap:Ethernet  HWaddr 52:54:00:ad:64:15               \     (5)
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1             |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0             |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0           |
          collisions:0 txqueuelen:0                                      |
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                         |
                                                                         |    (6)
br-int-vmx1-nic Link encap:Ethernet  HWaddr 52:54:00:ad:64:15            |
          BROADCAST MULTICAST  MTU:1500  Metric:1                        |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0             |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0           |
          collisions:0 txqueuelen:500                                    \
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                          X
                                                                         |
vcp_int-vmx1 Link encap:Ethernet  HWaddr fe:54:00:84:52:fb               |    (7)
          inet6 addr: fe80::fc54:ff:fe84:52fb/64 Scope:Link              |
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1             |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0             |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0           |
          collisions:0 txqueuelen:500                                    |
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                         |
                                                                         |
vfp_int-vmx1 Link encap:Ethernet  HWaddr fe:54:00:c0:ff:1f               |    (8)
          inet6 addr: fe80::fc54:ff:fec0:ff1f/64 Scope:Link              |
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1             |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0             |
          TX packets:22 errors:0 dropped:0 overruns:0 carrier:0          |
          collisions:0 txqueuelen:500                                    |
          RX bytes:0 (0.0 B)  TX bytes:1376 (1.3 KB)                    /

virbr0    Link encap:Ethernet  HWaddr 2a:e1:60:a8:87:40                     \
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0 |
          UP BROADCAST MULTICAST  MTU:1500  Metric:1                         | (9)
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0                 |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0               |
          collisions:0 txqueuelen:0                                          |
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                            /

em1       Link encap:Ethernet  HWaddr 38:ea:a7:37:7c:54          \
          inet6 addr: fe80::3aea:a7ff:fe37:7c54/64 Scope:Link     \
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1       |
          RX packets:1167 errors:0 dropped:1 overruns:0 frame:0    |
          TX packets:1071 errors:0 dropped:0 overruns:0 carrier:0  |
          collisions:0 txqueuelen:1000                             |
          RX bytes:80815 (80.8 KB)  TX bytes:129388 (129.3 KB)     |
                                                                   |
em2       Link encap:Ethernet  HWaddr 38:ea:a7:37:7c:55            |
          BROADCAST MULTICAST  MTU:1500  Metric:1                  |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0       |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0     |
          collisions:0 txqueuelen:1000                             |
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                   |
                                                                   |
em9       Link encap:Ethernet  HWaddr 38:ea:a7:37:7b:d0            |
          BROADCAST MULTICAST  MTU:1500  Metric:1                  |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0       |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0     |
          collisions:0 txqueuelen:1000                             |
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                   |
                                                                   |
em10      Link encap:Ethernet  HWaddr 38:ea:a7:37:7b:d1            |
          BROADCAST MULTICAST  MTU:1500  Metric:1                  |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0       |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0     |
          collisions:0 txqueuelen:1000                             |
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                   |
                                                                   \
p2p1      Link encap:Ethernet  HWaddr 38:ea:a7:17:65:a0             X (10)
          inet6 addr: fe80::3aea:a7ff:fe17:65a0/64 Scope:Link      /
          UP BROADCAST RUNNING PROMISC ALLMULTI MULTICAST  MTU:2000| Metric:1
          RX packets:9 errors:0 dropped:0 overruns:0 frame:0       |
          TX packets:7 errors:0 dropped:0 overruns:0 carrier:0     |
          collisions:0 txqueuelen:1000                             |
          RX bytes:810 (810.0 B)  TX bytes:558 (558.0 B)           |
                                                                   |
p2p2      Link encap:Ethernet  HWaddr 38:ea:a7:17:65:a1            |
          BROADCAST MULTICAST  MTU:1500  Metric:1                  |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0       |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0     |
          collisions:0 txqueuelen:1000                             |
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                   |
                                                                   |
p3p1      Link encap:Ethernet  HWaddr 38:ea:a7:17:65:84            |
          inet6 addr: fe80::3aea:a7ff:fe17:6584/64 Scope:Link      |
          UP BROADCAST RUNNING PROMISC ALLMULTI MULTICAST  MTU:2000| Metric:1
          RX packets:1 errors:0 dropped:1 overruns:0 frame:0       |
          TX packets:7 errors:0 dropped:0 overruns:0 carrier:0     |
          collisions:0 txqueuelen:1000                             |
          RX bytes:301 (301.0 B)  TX bytes:558 (558.0 B)           |
                                                                   |
p3p2      Link encap:Ethernet  HWaddr 38:ea:a7:17:65:85            |
          BROADCAST MULTICAST  MTU:1500  Metric:1                  |
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0       |
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0     |
          collisions:0 txqueuelen:1000                            /
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)                 /
1 the "external" bridge that bridges vRE (fxp0) and vPFE (ext) mgmt I/F to the mgmt I/F of the "host" or server (em1)
2 br-ext-nic interface
3 vcp_ext-vmx1: external tap interface for vcp-vmx1 fxp0 interface
4 vfp_ext-vmx1: external tap interface for vfp-vmx1 ext interface
5 the "internal" bridge, that connects vRE (em1) and vPFE (eth1) for the internal communications (software module download, IPC, PFE-RE host bound traffic, etc)
6 br-int-nic interface
7 internal tap interface for vcp-vmx1 em1
8 internal tap interface for vfp-vmx1 eth1
9 the default virtual network (VN), created when the libvirtd daemon was first installed and started. The absence of this interface may indicate something wrong with libvirt
the tap interface with "configured" MAC address
ping@trinity:/virtualization/images/vmx_20151102.0$ ifconfig -a | grep 04:17
vcp_ext-vmx1 Link encap:Ethernet  HWaddr fe:04:17:01:01:01      (1)
          inet6 addr: fe80::fc04:17ff:fe01:101/64 Scope:Link
vfp_ext-vmx1 Link encap:Ethernet  HWaddr fe:04:17:01:01:02      (2)
          inet6 addr: fe80::fc04:17ff:fe01:102/64 Scope:Link
1 "peer" tap interface [1] for mgmt interface (fxp0) in VCP VM
2 "peer" tap interface for mgmt interface (ext) in VFP VM

10.2. ip tool (SR-IOV)

Compared with the legacy ifconfig command, the newer ip tool is more powerful - in this example it prints the configured SR-IOV VF info, which ifconfig does not provide, along with many more interface properties.

ping@trinity:/virtualization/images/vmx_20151102.0$ ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
10: virbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default
    link/ether 2a:e1:60:a8:87:40 brd ff:ff:ff:ff:ff:ff
27: em9: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 38:ea:a7:37:7b:d0 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto
28: em10: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 38:ea:a7:37:7b:d1 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto
29: p2p1: <BROADCAST,MULTICAST,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 2000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 38:ea:a7:17:65:a0 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 02:04:17:01:02:02, tx rate 10000 (Mbps), spoof checking off, link-state auto               (1)
30: p2p2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 38:ea:a7:17:65:a1 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto
31: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master br-ext state UP mode DEFAULT group default qlen 1000
    link/ether 38:ea:a7:37:7c:54 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto
32: em2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 38:ea:a7:37:7c:55 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto
33: p3p1: <BROADCAST,MULTICAST,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 2000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 38:ea:a7:17:65:84 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 02:04:17:01:02:01, tx rate 10000 (Mbps), spoof checking off, link-state auto               (2)
34: p3p2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 38:ea:a7:17:65:85 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 00:00:00:00:00:00, spoof checking on, link-state auto
35: br-ext: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 38:ea:a7:37:7c:54 brd ff:ff:ff:ff:ff:ff
36: br-ext-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master br-ext state DOWN mode DEFAULT group default qlen 500
    link/ether 52:54:00:9f:a0:77 brd ff:ff:ff:ff:ff:ff
37: br-int-vmx1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
    link/ether 52:54:00:ad:64:15 brd ff:ff:ff:ff:ff:ff
38: br-int-vmx1-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast master br-int-vmx1 state DOWN mode DEFAULT group default qlen 500
    link/ether 52:54:00:ad:64:15 brd ff:ff:ff:ff:ff:ff
39: vcp_ext-vmx1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br-ext state UNKNOWN mode DEFAULT group default qlen 500
    link/ether fe:04:17:01:01:01 brd ff:ff:ff:ff:ff:ff
40: vcp_int-vmx1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br-int-vmx1 state UNKNOWN mode DEFAULT group default qlen 500
    link/ether fe:54:00:84:52:fb brd ff:ff:ff:ff:ff:ff
41: vfp_ext-vmx1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br-ext state UNKNOWN mode DEFAULT group default qlen 500
    link/ether fe:04:17:01:01:02 brd ff:ff:ff:ff:ff:ff
42: vfp_int-vmx1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br-int-vmx1 state UNKNOWN mode DEFAULT group default qlen 500
    link/ether fe:54:00:c0:ff:1f brd ff:ff:ff:ff:ff:ff
the tap interface and VF with "configured" MAC address
29: p2p1: <BROADCAST,MULTICAST,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 2000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 38:ea:a7:17:65:a0 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 02:04:17:01:02:02, tx rate 10000 (Mbps), spoof checking off, link-state auto               (1)
33: p3p1: <BROADCAST,MULTICAST,ALLMULTI,PROMISC,UP,LOWER_UP> mtu 2000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 38:ea:a7:17:65:84 brd ff:ff:ff:ff:ff:ff
    vf 0 MAC 02:04:17:01:02:01, tx rate 10000 (Mbps), spoof checking off, link-state auto               (2)
39: vcp_ext-vmx1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br-ext state UNKNOWN mode DEFAULT group default qlen 500
    link/ether fe:04:17:01:01:01 brd ff:ff:ff:ff:ff:ff
41: vfp_ext-vmx1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master br-ext state UNKNOWN mode DEFAULT group default qlen 500
    link/ether fe:04:17:01:01:02 brd ff:ff:ff:ff:ff:ff
1 p2p1 VF 0 L1/L2 info, this will map to the VMX router ge-0/0/1 interface
2 p3p1 VF 0 L1/L2 info, this will map to the VMX router ge-0/0/0 interface
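These VF properties are normally programmed by the VMX installation scripts; for reference, the underlying iproute2 commands look roughly like the sketch below (MAC and rate values taken from the capture above, shown only to illustrate the mechanism):

# assign a MAC address to VF 0 of p2p1 and disable spoof checking
sudo ip link set p2p1 vf 0 mac 02:04:17:01:02:02
sudo ip link set p2p1 vf 0 spoofchk off
# cap the VF transmit rate (Mbps)
sudo ip link set p2p1 vf 0 rate 10000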

10.3. host interfaces (virtio)

The list of interfaces after installation of a virtio version of VMX looks very similar to the one after an SR-IOV version of VMX installation, but with one exception - besides the external and internal bridges and the associated tap interfaces for management, we can now see 2 more tap interfaces generated by virtio.

 1ge-0.0.0-vmx1 Link encap:Ethernet  HWaddr fe:04:17:01:02:01     \
 2          inet6 addr: fe80::fc04:17ff:fe01:201/64 Scope:Link     |
 3          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1     |
 4          RX packets:0 errors:0 dropped:0 overruns:0 frame:0     |
 5          TX packets:401 errors:0 dropped:0 overruns:0 carrier:0 |
 6          collisions:0 txqueuelen:500                            \
 7          RX bytes:0 (0.0 B)  TX bytes:21084 (21.0 KB)            X (1)
 8                                                                 |
 9ge-0.0.1-vmx1 Link encap:Ethernet  HWaddr fe:04:17:01:02:02      |
10          inet6 addr: fe80::fc04:17ff:fe01:202/64 Scope:Link     |
11          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1     |
12          RX packets:0 errors:0 dropped:0 overruns:0 frame:0     |
13          TX packets:401 errors:0 dropped:0 overruns:0 carrier:0 |
14          collisions:0 txqueuelen:500                            /
15          RX bytes:0 (0.0 B)  TX bytes:21084 (21.0 KB)          /
1 virtio interfaces
"virtio interface" is tap interface
 1ping@trinity:~$ sudo ethtool -i ge-0.0.0-vmx1
 2driver: tun
 3version: 1.6
 4firmware-version:
 5bus-info: tap   #<------
 6supports-statistics: no
 7supports-test: no
 8supports-eeprom-access: no
 9supports-register-dump: no
10supports-priv-flags: no
MAC address (virtio)
 1ping@trinity:/virtualization/images/vmx_20151102.0$ ifconfig -a | grep -i 04:17
 2ge-0.0.0-vmx1 Link encap:Ethernet  HWaddr fe:04:17:01:02:01
 3          inet6 addr: fe80::fc04:17ff:fe01:201/64 Scope:Link
 4ge-0.0.1-vmx1 Link encap:Ethernet  HWaddr fe:04:17:01:02:02
 5          inet6 addr: fe80::fc04:17ff:fe01:202/64 Scope:Link
 6vcp_ext-vmx1 Link encap:Ethernet  HWaddr fe:04:17:01:01:01
 7          inet6 addr: fe80::fc04:17ff:fe01:101/64 Scope:Link
 8vfp_ext-vmx1 Link encap:Ethernet  HWaddr fe:04:17:01:01:02
 9          inet6 addr: fe80::fc04:17ff:fe01:102/64 Scope:Link
10virbr0    Link encap:Ethernet  HWaddr fe:04:17:01:02:01

10.4. VMX fxp0 and em1 interface

In a physical MX box, fxp0 is used for the external Ethernet port on the RE, while em0/em1 (previously fxp1/fxp2 in some old platforms) are internal NICs on the RE that are cross-connected to the internal 24x1GE switch on each SCB.

With an up-and-running VMX, we can now look at the fxp0 and em1 interfaces as if this were a physical MX router - with exactly the same Junos commands.

fxp0 emulates the physical fxp0 NIC on the MX RE board, and is still used for the same purpose - serving management connections from external devices.

VCP VM management interface - fxp0 (external interface)
 1root@vmx1> show interfaces fxp0
 2Physical interface: fxp0, Enabled, Physical link is Up
 3  Interface index: 8, SNMP ifIndex: 1
 4  Type: Ethernet, Link-level type: Ethernet, MTU: 1514
 5  Device flags   : Present Running
 6  Interface flags: SNMP-Traps
 7  Current address: 02:04:17:01:01:01, Hardware address: 02:04:17:01:01:01
 8  Last flapped   : 2015-12-01 03:03:56 UTC (00:01:20 ago)
 9    Input packets : 384
10    Output packets: 2
11
12  Logical interface fxp0.0 (Index 4) (SNMP ifIndex 13)
13    Flags: Up SNMP-Traps 0x4000000 Encapsulation: ENET2
14    Input packets : 169
15    Output packets: 2
16    Protocol inet, MTU: 1500
17      Flags: Sendbcast-pkt-to-re, Is-Primary
18      Addresses, Flags: Is-Default Is-Preferred Is-Primary
19        Destination: 10.85.4.0/25, Local: 10.85.4.105, Broadcast: 10.85.4.127

Although there is no real CB or SCB board in VMX, the interface em1 remains - in the VCP VM it now connects directly to the only FPC (the VFP VM). As on an MX router, it can be viewed with the same show interfaces CLI:

VCP VM interface - em1 (internal interface connecting to VFP only)
 1root# run show interfaces em1
 2Physical interface: em1, Enabled, Physical link is Up
 3  Interface index: 9, SNMP ifIndex: 23
 4  Type: Ethernet, Link-level type: Ethernet, MTU: 1514, Speed: 1000mbps
 5  Device flags   : Present Running
 6  Interface flags: SNMP-Traps
 7  Link type      : Full-Duplex
 8  Current address: 52:54:00:a8:60:36, Hardware address: 52:54:00:a8:60:36
 9  Last flapped   : Never
10    Input packets : 41993
11    Output packets: 41520
12
13  Logical interface em1.0 (Index 3) (SNMP ifIndex 24)
14    Flags: Up SNMP-Traps 0x4000000 Encapsulation: ENET2
15    Input packets : 41993
16    Output packets: 41520
17    Protocol inet, MTU: 1500
18      Flags: Is-Primary
19      Addresses, Flags: Is-Preferred
20        Destination: 10/8, Local: 10.0.0.4, Broadcast: 10.255.255.255
21      Addresses, Flags: Preferred Kernel Is-Preferred
22        Destination: 128/2, Local: 128.0.0.1, Broadcast: 191.255.255.255
23      Addresses, Flags: Primary Is-Default Is-Primary
24        Destination: 128/2, Local: 128.0.0.4, Broadcast: 191.255.255.255
25    Protocol inet6, MTU: 1500
26    Max nh cache: 75000, New hold nh limit: 75000, Curr nh cnt: 0,
27    Curr new hold cnt: 0, NH drop cnt: 0
28      Flags: Is-Primary
29      Addresses, Flags: Is-Preferred
30        Destination: fe80::/64, Local: fe80::5254:ff:fea8:6036
31      Addresses, Flags: Is-Default Is-Preferred Is-Primary
32        Destination: fec0::/64, Local: fec0::a:0:0:4
33    Protocol tnp, MTU: 1500
34      Flags: Primary, Is-Primary
35      Addresses
36        Local: 0x4

10.5. ge-x/y/z interface

the ge-x/y/z interface, built from an SR-IOV VF or virtio tap interface at the low level, represents the forwarding plane of VMX. The name ge- may not be accurate - it can be a 10GE or even 100GE capable port, depending on the type of (Intel) NIC card in use.

ge-0/0/0 interface from Junos
 1labroot> show interfaces ge-0/0/0
 2Physical interface: ge-0/0/0, Enabled, Physical link is Up
 3  Interface index: 139, SNMP ifIndex: 517
 4  Link-level type: Ethernet, MTU: 1514, MRU: 1522, LAN-PHY mode,
 5  Speed: 1000mbps, BPDU Error: None, MAC-REWRITE Error: None,
 6  Loopback: Disabled, Source filtering: Disabled, Flow control: Enabled
 7  Pad to minimum frame size: Disabled
 8  Device flags   : Present Running
 9  Interface flags: SNMP-Traps Internal: 0x4000
10  Link flags     : None
11  CoS queues     : 8 supported, 8 maximum usable queues
12  Current address: 02:04:17:01:02:01, Hardware address: 02:04:17:01:02:01
13  Last flapped   : 2015-11-25 07:14:41 UTC (00:00:08 ago)
14  Input rate     : 0 bps (0 pps)
15  Output rate    : 0 bps (0 pps)
16  Active alarms  : None
17  Active defects : None
18  Interface transmit statistics: Disabled

11. virtual networks/bridging (SR-IOV)

To list Virtual Networks (VN) generated by libvirt:

1    ping@trinity:~$ sudo virsh net-list
2    [sudo] password for ping:
3     Name                 State      Autostart     Persistent
4    ----------------------------------------------------------
5     br-ext               active     no            yes
6     br-int-vmx1          active     no            yes
7     default              active     yes           yes

These virtual networks were constructed via linux bridges:

VN br-ext
 1sudo virsh net-dumpxml br-ext
 2
 3<network>
 4  <name>br-ext</name>
 5  <forward mode="route" />                              (1)
 6  <bridge delay="0" name="br-ext" stp="on" />           (2)
 7  <mac address="52:54:00:9f:a0:77" />                   (3)
 8  <ip address="10.85.4.17" netmask="255.255.255.128">   (3)
 9    <dhcp>                                              (4)
10      <host ip="10.85.4.105" mac="02:04:17:01:01:01" name="vcp-vmx1" /> (4)
11      <host ip="10.85.4.106" mac="02:04:17:01:01:02" name="vfp-vmx1" /> (4)
12    </dhcp>
13  </ip>
14</network>
1 here the VN br-ext is in "route" mode, traffic to/from this VN will be "routed" to/from the host interface
2 the host interface is a bridge with same name br-ext
3 the host bridge’s MAC and IP info
4 IP address can be assigned based on MAC address from the libvirt built-in DHCP server
currently the VMX mgmt interface implementation doesn’t acquire its IP address via DHCP. This may change in the future.
VN br-int-vmx1
1sudo virsh net-dumpxml br-int-vmx1
2
3<network>
4  <name>br-int-vmx1</name>
5  <bridge delay="0" name="br-int-vmx1" stp="on" />      #<------
6</network>
  • the VN br-int-vmx1 is an "isolated" network - there is no "forward" attribute as present in the br-ext VN

  • the VN is implemented as a bridge with the same name br-int-vmx1

the default VN
 1ping@trinity:~$ sudo virsh net-dumpxml default
 2<network connections='2'>
 3  <name>default</name>
 4  <uuid>35f8cee2-d217-4e1f-a4c0-e0e08c422a4e</uuid>
 5  <forward mode='nat'>                                  (2)
 6    <nat>
 7      <port start='1024' end='65535'/>
 8    </nat>
 9  </forward>
10  <bridge name='virbr0' stp='on' delay='0'/>            (1)
11  <ip address='192.168.122.1' netmask='255.255.255.0'>  (3)
12    <dhcp>
13      <range start='192.168.122.2' end='192.168.122.254'/> (4)
14    </dhcp>
15  </ip>
16</network>
1 implemented as a bridge named virbr0,
2 nat is used for traffic forwarding
3 IP address of virbr0 bridge
4 IP address pool for the libvirt built-in DHCP server.

Essentially, the default VN emulates a typical CPE router: guest VMs can interact with the external world through it without each having to configure its own interface with an external IP and routing entry, just like a home router eliminates the need for each home PC to acquire an external IP in order to access the Internet.
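The NAT behavior of the default VN can be confirmed from the host side: libvirt installs MASQUERADE rules for the 192.168.122.0/24 subnet (a quick check; the exact rule layout may vary with the libvirt version):

# look for the NAT rules libvirt adds for the default VN
sudo iptables -t nat -S | grep 192.168.122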

the VN-to-bridge binding relationship can also be verified via this virsh command: virsh net-info

 1ping@trinity:~$ sudo virsh net-info default
 2Name:           default
 3UUID:           35f8cee2-d217-4e1f-a4c0-e0e08c422a4e
 4Active:         yes
 5Persistent:     yes
 6Autostart:      yes
 7Bridge:         virbr0
 8
 9ping@trinity:~$ sudo virsh net-info br-ext
10Name:           br-ext
11UUID:           b7abcd87-114e-40f3-8b4e-e913135aae56
12Active:         yes
13Persistent:     yes
14Autostart:      no
15Bridge:         br-ext
16
17ping@trinity:~$ sudo virsh net-info br-int-vmx1
18Name:           br-int-vmx1
19UUID:           36aa8f84-84e2-4bec-92cb-f2b41a4ed2f7
20Active:         yes
21Persistent:     yes
22Autostart:      no
23Bridge:         br-int-vmx1

The bridge-to-interface binding relationship is as follows:

 1    ping@trinity:~$ brctl show
 2    bridge name        bridge id                STP enabled        interfaces
 3    br-ext                8000.38eaa7377c54        yes        br-ext-nic
 4                                                        em1
 5                                                        vcp_ext-vmx1
 6                                                        vfp_ext-vmx1
 7    br-int-vmx1                8000.525400b00308        yes        br-int-vmx1-nic
 8                                                        vcp_int-vmx1
 9                                                        vfp_int-vmx1
10    virbr0                8000.000000000000        yes

Without logging into the VMs, the VM virtual NIC info (SR-IOV) can be listed with the virsh command:

1    ping@trinity:~$ sudo virsh qemu-monitor-command vcp-vmx1 --hmp "info network"
2    net0: index=0,type=nic,model=e1000,macaddr=02:04:17:01:01:01
3     \ hostnet0: index=0,type=tap,fd=18
4    net1: index=0,type=nic,model=virtio-net-pci,macaddr=52:54:00:8d:45:1e
5     \ hostnet1: index=0,type=tap,fd=19
ping@trinity:~$ sudo virsh qemu-monitor-command vfp-vmx1 --hmp "info network"
net0: index=0,type=nic,model=virtio-net-pci,macaddr=02:04:17:01:01:02
 \ hostnet0: index=0,type=tap,fd=18
net1: index=0,type=nic,model=virtio-net-pci,macaddr=52:54:00:5e:96:98
 \ hostnet1: index=0,type=tap,fd=21

11.1. mapping of bridge, tap, guest vNIC (SR-IOV)

This diagram illustrates the relationship between these interfaces:

  • external/internal mgmt interfaces in VCP VM: fxp0, em1

  • external/internal mgmt interfaces in VFP VM: eth0(ext), eth1(int)

  • qemu tap interfaces: vcp(vfp)_ext(int)-vmx1

  • linux bridge interface: br-ext, br-int-vmx1

     vcp-vmx1                                   br-int-vmx1
    +--------+                                  |52:54:00:b0:03:08
    |     em1|52:54:00:8d:45:1e                 |
    |        +-------------------------+--------+
    |        |                     vcp_int-vmx1 |fe:54:00:8d:45:1e
    | fxp0   |                                  |
    +---+----+                                  |
        |02:04:17:01:01:01                      |
        |              +--------+               |
        |              |vfp-vmx1|   vfp_int-vmx1|fe:54:00:5e:96:98
        |              |    eth1+---------------+
        |              |      |52:54:00:5e:96:98|
        |              |  eth0  |               |
        |              +----+---+               |
        |                   |02:04:17:01:01:02
        |                   |
        |                   |
        |                   |
        |                   |
        |                   |
        |fe:04:17:01:01:01  |02:04:17:01:01:02
        |vcp_ext-vmx1       |vfp_ext-vmx1
    ----+---------------+---+-------------+-----br-ext
                        |38:ea:a7:37:7c:54| (mac configured
                        |                 |   same as em1)
                        |em1              |br-ext-nic (a0:77)
  • bridge br-int-vmx1, with the attached qemu tap interfaces vcp_int-vmx1 and vfp_int-vmx1, and the corresponding guest VM interfaces em1 and int, is for internal communication only; no packets will exit to outside networks, so the MAC addresses do not matter

  • bridge br-ext clones its MAC and IP from the host mgmt port, so it represents the server from the external networks' point of view.

  • the VCP(RE) VM mgmt port fxp0 and the VFP(vFPC) VM mgmt port ext or eth0 communicate with the external network, so they are exposed to the outside network and therefore their MAC addresses need to be unique in the segment.

  • packets from fxp0 go to the external network via the following path (a tcpdump sketch to observe it follows this list):

    • packet exits fxp0 of guest VM vcp-vmx1

    • packet is received from host tap interface vcp_ext-vmx1

    • packet is forwarded to physical mgmt interface em1 via bridge br-ext

    • packet is forwarded to external networks
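This path can be watched hop by hop from the host with tcpdump while, for example, pinging the fxp0 address 10.85.4.105 from an external station (interface names are from this setup; a verification sketch only):

sudo tcpdump -ni vcp_ext-vmx1 icmp    # tap interface facing the VCP VM
sudo tcpdump -ni br-ext icmp          # the external bridge
sudo tcpdump -ni em1 icmp             # the physical mgmt port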

Table 4. bridge/VN/tap/guest interface mapping table

bridge/VN     forward mode   qemu tap interface   guest VM interface   guest VM
br-int-vmx1   none           vcp_int-vmx1         em1                  VCP(VRE)
                             vfp_int-vmx1         int or eth1          VFP(VPFE)
br-ext        routed         vcp_ext-vmx1         fxp0                 VCP(VRE)
                             vfp_ext-vmx1         ext or eth0          VFP(VPFE)
virbr0        nat            -                    -                    -

12. vcpu essential

In the vmx.conf file of the previous example, 4 vCPUs were allocated to the vPFE VM for "lite mode":

1FORWARDING_PLANE:
2    memory-mb   : 16384
3    vcpus       : 4

In another example we have 16 vCPUs allocated for "performance mode":

1FORWARDING_PLANE:
2    memory-mb   : 16384
3    vcpus       : 16

After logging into vPFE VM we can check the guest VM CPU properties:

vfp
 1root@localhost:~# lscpu
 2Architecture:          x86_64
 3CPU op-mode(s):        32-bit, 64-bit
 4Byte Order:            Little Endian
 5CPU(s):                16       #<------
 6On-line CPU(s) list:   0-15
 7Thread(s) per core:    1
 8Core(s) per socket:    16       #<------
 9Socket(s):             1
10NUMA node(s):          1
11Vendor ID:             GenuineIntel
12CPU family:            6
13Model:                 42
14Model name:            Intel Xeon E312xx (Sandy Bridge)
15Stepping:              1
16CPU MHz:               2992.582
17BogoMIPS:              5985.16
18Virtualization:        VT-x     #<------
19Hypervisor vendor:     KVM      #<------
20Virtualization type:   full
21L1d cache:             32K
22L1i cache:             32K
23L2 cache:              4096K
24NUMA node0 CPU(s):     0-15
25
26root@localhost:~# cat /proc/cpuinfo
27processor       : 0
28vendor_id       : GenuineIntel
29cpu family      : 6
30model           : 42
31model name      : Intel Xeon E312xx (Sandy Bridge)
32stepping        : 1
33microcode       : 0x1
34cpu MHz         : 2992.582
35cache size      : 4096 KB
36physical id     : 0
37siblings        : 16
38core id         : 0
39cpu cores       : 16
40apicid          : 0
41initial apicid  : 0
42fpu             : yes
43fpu_exception   : yes
44cpuid level     : 13
45wp              : yes
46flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
47                  pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb
48                  rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq
49                  vmx ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt
50                  tsc_deadline_timer aes xsave avx f16c rdrand hypervisor
51                  lahf_lm xsaveopt vnmi ept fsgsbase smep erms
52bogomips        : 5985.16
53clflush size    : 64
54cache_alignment : 64
55address sizes   : 40 bits physical, 48 bits virtual
56power management:
57......

The "Hypervisor vendor" field indicates the type of hypervisor the current vCPU is emulated by. In this case it is a KVM-emulated CPU. With other hypervisors this field changes accordingly; below is a sample taken from a VMware environment:

 1Architecture:          x86_64
 2CPU op-mode(s):        32-bit, 64-bit
 3Byte Order:            Little Endian
 4CPU(s):                8
 5On-line CPU(s) list:   0-7
 6Thread(s) per core:    1
 7Core(s) per socket:    8
 8Socket(s):             1
 9NUMA node(s):          1
10Vendor ID:             GenuineIntel
11CPU family:            6
12Model:                 47
13Stepping:              2
14CPU MHz:               2394.000
15BogoMIPS:              4788.00
16Hypervisor vendor:     VMware
17Virtualization type:   full
18L1d cache:             32K
19L1i cache:             32K
20L2 cache:              256K
21L3 cache:              30720K
22NUMA node0 CPU(s):     0-7

For comparison purposes, the original CPU info from the host server is also listed here:

 1ping@ubuntu1:~$ lscpu
 2Architecture:          x86_64
 3CPU op-mode(s):        32-bit, 64-bit
 4Byte Order:            Little Endian
 5CPU(s):                20
 6On-line CPU(s) list:   0-19
 7Thread(s) per core:    1
 8Core(s) per socket:    10
 9Socket(s):             2
10NUMA node(s):          2
11Vendor ID:             GenuineIntel
12CPU family:            6
13Model:                 62
14Stepping:              4
15CPU MHz:               2992.939
16BogoMIPS:              6000.66
17Virtualization:        VT-x
18L1d cache:             32K
19L1i cache:             32K
20L2 cache:              256K
21L3 cache:              25600K
22NUMA node0 CPU(s):     0-9
23NUMA node1 CPU(s):     10-19
24
25ping@ubuntu1:~$ cat /proc/cpuinfo
26processor       : 0
27vendor_id       : GenuineIntel
28cpu family      : 6
29model           : 62
30model name      : Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
31stepping        : 4
32microcode       : 0x427
33cpu MHz         : 2992.939
34cache size      : 25600 KB
35physical id     : 0
36siblings        : 10
37core id         : 0
38cpu cores       : 10
39apicid          : 0
40initial apicid  : 0
41fpu             : yes
42fpu_exception   : yes
43cpuid level     : 13
44wp              : yes
45flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
46                  pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
47                  syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs
48                  bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu
49                  pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3
50                  cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt
51                  tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat
52                  epb xsaveopt pln pts dtherm tpr_shadow vnmi
53                  flexpriority ept vpid fsgsbase smep erms
54bogomips        : 5985.87
55clflush size    : 64
56cache_alignment : 64
57address sizes   : 46 bits physical, 48 bits virtual
58power management:

From the comparison it can be noticed that the CPU feature sets (flags) are not identical between the host and guest OS. The host CPU features are provided by the CPU hardware capability and OS support, while the guest CPU features can be either emulated by QEMU software, or simply "exposed" directly from the host CPU capability by KVM. This is described in libvirt in XML format (see virsh capabilities), and passed to the qemu-system-x86_64 process in the form of a CPU parameter list (-cpu) before the guest VM is brought up.
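Both sides can be pulled out with virsh for comparison - the CPU features the host offers, and the CPU definition handed to a guest (the domain name vcp-vmx1 is from this setup; a quick sketch, assuming the domain XML defines a cpu element):

sudo virsh capabilities | sed -n '/<cpu>/,/<\/cpu>/p'       # host side
sudo virsh dumpxml vcp-vmx1 | sed -n '/<cpu/,/<\/cpu>/p'    # guest side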

VMX qemu processes
 1ping@trinity:/virtualization/images/vmx_20151102.0/config$ ps -ef | grep -i qemu
 2root      17757      1 18 19:01 ?        00:11:45
 3/usr/bin/qemu-system-x86_64 -name vcp-vmx1 -S -machine
 4pc-0.13,accel=kvm,usb=off -cpu
 5SandyBridge,+invtsc,+erms,+smep,+fsgsbase,+pdpe1gb,+rdrand,+f16c,+osxsave,+
 6dca,+pcid,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
 7-m 1954 -realtime mlock=off -smp 1,sockets=1,cores=1,threads=1 -uuid
 879ab0bbf-b63a-4a08-87ce-ad7670186754 -smbio s type=0,vendor=Juniper -smbios
 9type=1,manufacturer=VMX,product=VM-vcp_vmx1-161-re-0,version=0.1.0
10-no-user-config -nodefaults -chardev
11socket,id=charmonitor,path=//lib/libvirt/qemu/vcp-vmx1.monitor,server,nowai
12t -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
13-no-shutdown -boot strict=on -device
14piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
15file=/virtualization/images/vmx_20151102.0/build/vmx1/ima
16ges/jinstall64-vmx-15.1F-20151104.0-domestic.img,if=none,id=drive-ide0-0-0,format=qcow2,cache=directsync
17-device
18ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -drive
19file=/virtualization/i
20mages/vmx_20151102.0/build/vmx1/images/vmxhdd.img,if=none,id=drive-ide0-0-1,format=qcow2,cache=directsync
21-device ide-hd,bus=ide.0,unit=1,drive=drive-ide0-0-1,id=ide0-0-1 -drive
22file=/virtualization/images/vmx_2
230151102.0/images/metadata_usb.img,if=none,id=drive-usb-disk0,format=raw,cache=directsync
24-device usb-storage,drive=drive-usb-disk0,id=usb-disk0,removable=off
25-netdev tap,fd=18,id=hostnet0 -device e1000,netdev=ho
26stnet0,id=net0,mac=02:04:17:01:01:01,bus=pci.0,addr=0x3 -netdev
27tap,fd=19,id=hostnet1,vhost=on,vhostfd=20 -device
28virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:84:52:fb,bus=pci.0,addr=0x5
29-chardev socket,id=charserial0,host=127.0.0.1,port=8816,telnet,server,nowait -device
30isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -vnc
31127.0.0.1:0 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device
32AC97,id=sound0,bus=pci.0,addr=0x4 -device
33virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on
34
35root      17880      1 62 19:01 ?        00:40:31
36/usr/bin/qemu-system-x86_64 -name vfp-vmx1 -S -machine
37pc-i440fx-trusty,accel=kvm,usb=off,mem-merge=off
38-cpu SandyBridge,+invtsc,+erms,+smep,+fsgsbase,+pdpe1gb,+
39rdrand,+f16c,+osxsave,+dca,+pcid,+pdcm,+xtpr,+tm2,+est,+smx,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme
40-m 15625 -mem-prealloc -mem-path /HugePage_vPFE/libvirt/qemu -realtime
41mlock=off -smp 4,s ockets=1,cores=4,threads=1 -uuid
425d7f31e7-b678-4605-973f-0cb6a3e3b8c1 -no-user-config -nodefaults -chardev
43socket,id=charmonitor,path=//lib/libvirt/qemu/vfp-vmx1.monitor,server,nowait
44-mon chardev=charmonitor,id =monitor,mode=control -rtc base=utc
45-no-shutdown -boot strict=on -device
46piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
47file=/virtualization/images/vmx_20151102.0/build/vmx1/images/vFPC-20151102.img,if=none
48,id=drive-ide0-0-0,format=raw,cache=directsync -device
49ide-hd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1
50-netdev tap,fd=18,id=hostnet0,vhost=on,vhostfd=20 -device
51virtio-net-pci,netdev=hostnet
520,id=net0,mac=02:04:17:01:01:02,bus=pci.0,addr=0x3 -netdev
53tap,fd=21,id=hostnet1,vhost=on,vhostfd=22 -device
54virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:c0:ff:1f,bus=pci.0,addr=0x4
55-chardev socket,id=charserial0,host=127.0.0.1,port=8817,telnet,server,nowait -device
56isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -vnc
57127.0.0.1:1 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device AC97
58,id=sound0,bus=pci.0,addr=0x5 -device
59pci-assign,configfd=23,host=23:10.0,id=hostdev0,bus=pci.0,addr=0x6 -device
60pci-assign,configfd=24,host=06:10.0,id=hostdev1,bus=pci.0,addr=0x7 -device
61virtio-balloon-pci,id=b alloon0,bus=pci.0,addr=0x8 -msg timestamp=on

Here are some highlights about QEMU processes:

  • /usr/bin/qemu-system-x86_64 is the QEMU command to start a VM process

  • the long list of QEMU parameters describe how the process will run and what resources will be allocated to it

  • the break-down of all these QEMU parameters will be covered later (TODO)

  • the VCP(RE) and VFP(FPC) VM are each running as a QEMU process.

    ping@trinity:~$ ps -ef | grep qemu | cut -c -105
    root      42025      1 18 Nov23 ?        02:40:58 /usr/bin/qemu-system-x86_64 -name vcp-vmx1 -S -machine
    root      42148      1 63 Nov23 ?        09:24:41 /usr/bin/qemu-system-x86_64 -name vfp-vmx1 -S -machine
  • each VCPU(Virtual CPU) is essentially just a "thread" running inside one of the QEMU processes

    threads (vCPU):
     1ping@trinity:~$ sudo virsh qemu-monitor-command 2 --hmp "info cpus"
     2* CPU #0: pc=0xffffffff80921b83 thread_id=42028             (1)
     3
     4ping@trinity:~$ sudo virsh qemu-monitor-command 3 --hmp "info cpus"
     5* CPU #0: pc=0xffffffff8100af50 (halted) thread_id=42151    (2)
     6  CPU #1: pc=0xffffffff8100af50 (halted) thread_id=42152    (2)
     7  CPU #2: pc=0xffffffff8100af50 (halted) thread_id=42153    (2)
     8  CPU #3: pc=0xffffffff8100af50 (halted) thread_id=42154    (2)
     9
    10ping@trinity:~$ ps -ef | grep qemu | cut -c -105
    11root      42025      1 18 Nov23 ?        02:40:58 /usr/bin/qemu-system-x86_64 -name vcp-vmx1 -S -machine
    12root      42148      1 63 Nov23 ?        09:24:41 /usr/bin/qemu-system-x86_64 -name vfp-vmx1 -S -machine
    13
    14ping@trinity:~$ ps -efL | grep qemu | cut -c -105
    15root      42025      1  42025 12    4 Nov23 ?        01:47:36 /usr/bin/qemu-system-x86_64 -name vcp-vmx1
    16root      42025      1  42028  5    4 Nov23 ?        00:52:06 /usr/bin/qemu-system-x86_64 -name vcp-vmx1    (1)
    17root      42025      1  42030  0    4 Nov23 ?        00:00:00 /usr/bin/qemu-system-x86_64 -name vcp-vmx1
    18root      42025      1  46726  0    4 11:51 ?        00:00:00 /usr/bin/qemu-system-x86_64 -name vcp-vmx1
    19
    20root      42148      1  42148  0    7 Nov23 ?        00:00:14 /usr/bin/qemu-system-x86_64 -name vfp-vmx1
    21root      42148      1  42151  9    7 Nov23 ?        01:21:37 /usr/bin/qemu-system-x86_64 -name vfp-vmx1    (2)
    22root      42148      1  42152 25    7 Nov23 ?        03:42:01 /usr/bin/qemu-system-x86_64 -name vfp-vmx1    (2)
    23root      42148      1  42153 20    7 Nov23 ?        02:58:23 /usr/bin/qemu-system-x86_64 -name vfp-vmx1    (2)
    24root      42148      1  42154  9    7 Nov23 ?        01:20:30 /usr/bin/qemu-system-x86_64 -name vfp-vmx1    (2)
    25root      42148      1  42157  0    7 Nov23 ?        00:00:00 /usr/bin/qemu-system-x86_64 -name vfp-vmx1
    26root      42148      1  46722  0    7 11:51 ?        00:00:00 /usr/bin/qemu-system-x86_64 -name vfp-vmx1

    another capture:

     1ping@trinity:$ sudo virsh qemu-monitor-command 2 --hmp "info cpus"
     2* CPU #0: pc=0xffffffff80911183 (halted) thread_id=17760
     3
     4ping@trinity:/virtualization/images/vmx_20151102.0/config$ sudo virsh qemu-monitor-command 3 --hmp "info cpus"
     5* CPU #0: pc=0xffffffff8100af50 (halted) thread_id=17883
     6  CPU #1: pc=0xffffffff8100af50 (halted) thread_id=17884
     7  CPU #2: pc=0xffffffff8100af50 thread_id=17885
     8  CPU #3: pc=0xffffffff8100af50 (halted) thread_id=17886
     9
    10ping@trinity:$ ps -ef | grep qemu | cut -c -105
    11root      17757      1 17 19:01 ?        00:12:14 /usr/bin/qemu-system-x86_64 -name vcp-vmx1 -S -machine
    12root      17880      1 62 19:01 ?        00:42:29 /usr/bin/qemu-system-x86_64 -name vfp-vmx1 -S -machine
    13
    14ping@trinity:$ ps -efL | grep qemu | cut -c -105
    15root      17757      1  17757 11    3 19:01 ?        00:07:58 /usr/bin/qemu-system-x86_64 -name vcp-vmx1
    16root      17757      1  17760  6    3 19:01 ?        00:04:08 /usr/bin/qemu-system-x86_64 -name vcp-vmx1
    17root      17757      1  17762  0    3 19:01 ?        00:00:00 /usr/bin/qemu-system-x86_64 -name vcp-vmx1
    18root      17880      1  17880  0    6 19:01 ?        00:00:11 /usr/bin/qemu-system-x86_64 -name vfp-vmx1
    19root      17880      1  17883  9    6 19:01 ?        00:06:37 /usr/bin/qemu-system-x86_64 -name vfp-vmx1
    20root      17880      1  17884 24    6 19:01 ?        00:17:01 /usr/bin/qemu-system-x86_64 -name vfp-vmx1
    21root      17880      1  17885 19    6 19:01 ?        00:13:29 /usr/bin/qemu-system-x86_64 -name vfp-vmx1
    22root      17880      1  17886  7    6 19:01 ?        00:05:14 /usr/bin/qemu-system-x86_64 -name vfp-vmx1
    23root      17880      1  17889  0    6 19:01 ?        00:00:00 /usr/bin/qemu-system-x86_64 -name vfp-vmx1
    1 RE VCPU thread
    2 PFE VCPU thread
Unix thread vs process

A *nix thread is a "flow of execution" within a process; from the "coding" perspective it is (almost) nothing but a "procedure" that runs independently from its main program. Internally, from the OS perspective, a thread consists of:

  • Program counter

  • Register set

  • Stack space

So if a main program contains a number of procedures, all of these procedures can be scheduled to run simultaneously and/or independently by the operating system.

A thread runs inside a process, and differs from the concept of a "process" in the sense that a thread shares with its peer threads its:

  • Code segment

  • Data segment

  • Operating-system resources

For example, when one thread alters a shared memory item, all other threads see the change, and a file opened by one thread is available to the others. This differs from a "process" in that each process has its own address space and resources; communication between processes (called "IPC") is more "expensive" and requires special OS facilities.

In the context of VMX, the "worker thread" and "IO thread" are examples of threads running in the host OS.
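To watch these threads live, the per-thread CPU usage of a QEMU process can be monitored from the host, for example (a sketch; the pgrep pattern assumes a single vfp-vmx1 process is running):

top -H -p $(pgrep -f 'qemu.*vfp-vmx1')     # -H shows individual threads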

13. memory

The memory allocated in vmx.conf is 16 GB:

#vPFE VM parameters
FORWARDING_PLANE:
    memory-mb   : 16384
    vcpus       : 4                     (1)

memory available in guest VM:

vPFE VM:
1root@localhost:~# free -m
2             total       used       free     shared    buffers     cached
3Mem:         15297      11380       3917          0         13       2697
4-/+ buffers/cache:       8669       6628
5Swap:            0          0          0

This looks a little smaller than expected: 16384 - 15297 = 1087 MB.

So about 1 GB of memory is "missing".

By looking at the kernel boot log we can find more clues about what really happened:

 1root@localhost:~# dmesg -T | grep -i memory
 2[Sun Dec 20 23:59:58 2015] Scanning 1 areas for low memory corruption
 3[Sun Dec 20 23:59:58 2015] Base memory trampoline at [ffff880000099000] 99000 size 24576
 4[Sun Dec 20 23:59:58 2015] init_memory_mapping: [mem 0x00000000-0x000fffff]
 5[Sun Dec 20 23:59:58 2015] init_memory_mapping: [mem 0x410600000-0x4107fffff]
 6[Sun Dec 20 23:59:58 2015] init_memory_mapping: [mem 0x410000000-0x4105fffff]
 7[Sun Dec 20 23:59:58 2015] init_memory_mapping: [mem 0x400000000-0x40fffffff]
 8[Sun Dec 20 23:59:58 2015] init_memory_mapping: [mem 0x00100000-0xbfffbfff]
 9[Sun Dec 20 23:59:58 2015] init_memory_mapping: [mem 0x100000000-0x3ffffffff]
10[Sun Dec 20 23:59:58 2015] init_memory_mapping: [mem 0x410800000-0x4108fffff]
11[Sun Dec 20 23:59:58 2015] Early memory node ranges
12[Sun Dec 20 23:59:58 2015] Memory: 15660932k/17048576k available (8700k kernel code, 1048984k absent, 338660k reserved, 6548k data, 1172k init)
13[Sun Dec 20 23:59:58 2015] please try 'cgroup_disable=memory' option if you don't want memory cgroups
14[Sun Dec 20 23:59:58 2015] Initializing cgroup subsys memory
15[Sun Dec 20 23:59:58 2015] Freeing initrd memory: 728k freed
16[Sun Dec 20 23:59:58 2015] Scanning for low memory corruption every 60 seconds
17[Sun Dec 20 23:59:59 2015] Freeing unused kernel memory: 1172k freed
18[Sun Dec 20 23:59:59 2015] Freeing unused kernel memory: 1528k freed
19[Sun Dec 20 23:59:59 2015] Freeing unused kernel memory: 584k freed

So the kernel detected 17048576k (=16649M), which is much closer to what was allocated. The rest of the memory is what the kernel reserved for internal tasks.

/proc/meminfo reveals more details about memory usage.

 1root@localhost:~# cat /proc/meminfo
 2MemTotal:       15664972 kB
 3MemFree:         4010928 kB
 4Buffers:           13956 kB
 5Cached:          2762276 kB
 6SwapCached:            0 kB
 7Active:           148364 kB
 8Inactive:        2723568 kB
 9Active(anon):      86864 kB
10Inactive(anon):  2665792 kB
11Active(file):      61500 kB
12Inactive(file):    57776 kB
13Unevictable:           0 kB
14Mlocked:               0 kB
15SwapTotal:             0 kB
16SwapFree:              0 kB
17Dirty:                 0 kB
18Writeback:             0 kB
19AnonPages:        112212 kB
20Mapped:           177764 kB
21Shmem:           2640572 kB
22Slab:              70948 kB
23SReclaimable:      30708 kB
24SUnreclaim:        40240 kB
25KernelStack:        1968 kB
26PageTables:         2712 kB
27NFS_Unstable:          0 kB
28Bounce:                0 kB
29WritebackTmp:          0 kB
30CommitLimit:     3638180 kB
31Committed_AS:    3430712 kB
32VmallocTotal:   34359738367 kB
33VmallocUsed:       51268 kB
34VmallocChunk:   34359683660 kB
35HugePages_Total:    4096
36HugePages_Free:        0
37HugePages_Rsvd:        0
38HugePages_Surp:        0
39Hugepagesize:       2048 kB
40DirectMap4k:       13296 kB
41DirectMap2M:     2355200 kB
42DirectMap1G:    13631488 kB

14. virtio bridging

virtio provides emulation of NICs, disks and other IO devices with high performance, but it does not have any "built-in" bridging facility or anything for that purpose. So in order to "bind" the virtio interfaces to the target physical NIC, we usually need to add that "bridging" to our virtio VMX setup after the VMs are brought up and running. This can be done with one of the existing utilities, whichever is suitable for the needs.

Some popular examples of these utilities are:

  • legacy linux bridge,

  • open vSwitch (OVS),

  • Juniper vRouter.

Here is the simplest example of "binding" 2 virtio interfaces to each other, with a linux bridge.

before "binding"
 1ping@trinity:~$ brctl show
 2bridge name     bridge id               STP enabled     interfaces
 3br-ext          8000.38eaa7377c54       yes             br-ext-nic
 4                                                        em1
 5                                                        vcp_ext-vmx1
 6                                                        vfp_ext-vmx1
 7br-int-vmx1             8000.5254006b156e       yes     br-int-vmx1-nic
 8                                                        vcp_int-vmx1
 9                                                        vfp_int-vmx1
10virbr0          8000.fe0417010201       yes             ge-0.0.0-vmx1
11                                                        ge-0.0.1-vmx1

"binding" 2 virtio interfaces is like to connect a "loopback cable" between them , say , ge-0/0/0 and ge-0/0/1. Changing the config/vmx-junosdev.conf file so it looks:

 1ping@trinity:/virtualization/images/vmx_20151102.0/config$ vim vmx-junosdev.conf.1
 2##############################################################
 3#
 4#  vmx-junos-dev.conf
 5#  - Config file for junos device bindings.
 6#  - Uses YAML syntax.
 7#  - Leave a space after ":" to specify the parameter value.
 8#  - For physical NIC, set the 'type' as 'host_dev'
 9#  - For junos devices, set the 'type' as 'junos_dev' and
10#    set the mandatory parameter 'vm-name' to the name of
11#    the vPFE where the device exists
12#  - For bridge devices, set the 'type' as 'bridge_dev'
13#
14##############################################################
15interfaces :
16
17     - link_name  : vmx_link4
18       endpoint_1 :
19         - type        : junos_dev
20           vm_name     : vmx1
21           dev_name    : ge-0/0/0
22       endpoint_2 :
23         - type        : junos_dev
24           vm_name     : vmx1
25           dev_name    : ge-0/0/1

This reads as: "please create a linux bridge vmx_link4, and use it to bind the junos device interface ge-0/0/0 of VM vmx1 to the other junos device interface ge-0/0/1 in the same VM"

Now run the vmx.sh script to execute this action:

1sudo ./vmx.sh -lvf --bind-dev --cfg config/vmx-junosdev.conf.1
2Checking package ethtool..........................[OK]
3Bind Link vmx_link4(ge-0.0.0-vmx1, ge-0.0.1-vmx1)
4[OK]

As expected, this creates a new bridge named vmx_link4 and adds the virtio tap interfaces (ge-0.0.0-vmx1 and ge-0.0.1-vmx1) corresponding to the JUNOS interfaces (ge-0/0/0 and ge-0/0/1) into the bridge.

 1ping@trinity:/virtualization/images/vmx_20151102.0$ brctl show
 2bridge name     bridge id               STP enabled     interfaces
 3br-ext          8000.38eaa7377c54       yes             br-ext-nic
 4                                                        em1
 5                                                        vcp_ext-vmx1
 6                                                        vfp_ext-vmx1
 7
 8br-int-vmx1             8000.525400fa6e56       yes     br-int-vmx1-nic
 9                                                        vcp_int-vmx1
10                                                        vfp_int-vmx1
11
12virbr0          8000.000000000000       yes                            (1)
13
14vmx_link4               8000.fe0417010201       no      ge-0.0.0-vmx1  (2)
15                                                        ge-0.0.1-vmx1
1 the links were removed from the default virbr0 bridge
2 the links were added to new bridge vmx_link4.
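For reference, a rough manual equivalent of what the script just did would be the linux bridge commands below (only a sketch to show the idea; the script also takes care of further details):

# detach the two virtio taps from the default virbr0 bridge
sudo brctl delif virbr0 ge-0.0.0-vmx1
sudo brctl delif virbr0 ge-0.0.1-vmx1
# create the new bridge and attach both taps
sudo brctl addbr vmx_link4
sudo brctl addif vmx_link4 ge-0.0.0-vmx1
sudo brctl addif vmx_link4 ge-0.0.1-vmx1
sudo ip link set vmx_link4 up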

14.1. build a setup with internal connections

Now we can set up a Junos logical-router-based virtual test environment like the one below:

192.168.122.10            192.168.122.3
        .....vmx_link4........
        .                    .
        .                    .
   +----.--------------------.-----+
   | ge-0.0.0-vmx1  ge-0.0.1-vmx1  |
   |    .                    .     |
   |    .                    .     |
   | +--.-------+    +-------.--+  |
   | |(ge-0/0/0)|    |(ge-0/0/1)|  |
   | |192.168.  |    |192.168.  |  |
   | |122.10    |    |122.3     |  |
   | |          |    |          |  |
   | |LR:default|    |LR: r1    |  |
   | +----------+    +----------+  |
   |                               |
   +-------------------------------+

With logical routers (LR) and an internal connection between the virtio interfaces, we now have a handy virtual multi-router setup.
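A minimal Junos configuration that produces the addressing in the diagram could look like this (a sketch; only the interface and logical-system pieces are shown):

set interfaces ge-0/0/0 unit 0 family inet address 192.168.122.10/24
set logical-systems r1 interfaces ge-0/0/1 unit 0 family inet address 192.168.122.3/24
commit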

ping test:
 1[edit]
 2root# run show interfaces routing
 3Interface        State Addresses
 4ge-0/0/0.0       Up    INET  192.168.122.10         #<------
 5pfe-0/0/0.16383  Up
 6pfh-0/0/0.16384  Up
 7pfh-0/0/0.16383  Up
 8lc-0/0/0.32769   Up
 9lo0.16385        Up
10lo0.16384        Up    INET  127.0.0.1
11fxp0.0           Up    INET  192.168.1.105
12em1.0            Up    INET  10.0.0.4
13                       INET  128.0.0.1
14                       INET  128.0.0.4
15                       INET6 fe80::5254:ff:fe5b:901e
16                       INET6 fec0::a:0:0:4
17
18[edit]
19root# run show interfaces routing logical-system r1
20Interface        State Addresses
21ge-0/0/1.0       Up    INET  192.168.122.3          #<------
22
23[edit]
24root# run ping 192.168.122.3
25PING 192.168.122.3 (192.168.122.3): 56 data bytes
2664 bytes from 192.168.122.3: icmp_seq=0 ttl=63 time=15.073 ms
27^C
28--- 192.168.122.3 ping statistics ---
291 packets transmitted, 1 packets received, 0% packet loss
30round-trip min/avg/max/stddev = 15.073/15.073/15.073/nan ms
31
32[edit]
33root# run ping 192.168.122.10 logical-system r1
34PING 192.168.122.10 (192.168.122.10): 56 data bytes
3564 bytes from 192.168.122.10: icmp_seq=0 ttl=64 time=2.058 ms
36^C
37--- 192.168.122.10 ping statistics ---
381 packets transmitted, 1 packets received, 0% packet loss
39round-trip min/avg/max/stddev = 2.058/2.058/2.058/0.000 ms

Part 3: VMX/KVM features

In this part I’ll explore some of the common and key KVM features involved in VMX. For each topic, I’ll start with the "basic concept", then proceed with command illustrations.

15. VM management - virsh

15.1. virsh basic concept

Libvirt is a collection of open source APIs, a daemon and management tools that provide a convenient way to manage virtual machines and other virtualization functionality, such as storage and network interface management. The software components include:

  • an API library,

  • a daemon (libvirtd), and

  • a command line utility (virsh).

It can be used to manage KVM, Xen, VMware ESX, QEMU and many other (maybe all?) currently popular virtualization technologies.

Figure 5. libvirt

A primary goal of libvirt is to provide a single (and simple) way to manage multiple different virtualization providers/hypervisors - the same commands can be used to manage the existing virtual machines of any supported hypervisor (KVM, Xen, VMware ESX, etc.)

For example, to list currently running guest VMs:

ping@trinity:~$ sudo virsh list
[sudo] password for ping:
 Id    Name                           State
----------------------------------------------------
 2     vcp-vmx1                       running
 3     vfp-vmx1                       running

The above command will list any running VMs that are managed by libvirt, regardless of which hypervisor is currently in use. In this doc we’ll focus on VMX, which is currently implemented based on qemu-kvm hypervisor.

Use the --all option to list all guest VMs, including those not currently running.
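For example (output omitted; inactive domains are listed with the state "shut off"):

sudo virsh list --all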

Libvirt also provides extensive tools and commands to manage the domain, node and network.

In libvirt terminology, a domain refers to a guest VM and a node refers to the host machine.

15.2. libvirt/virsh commonly used commands

In this section we’ll demonstrate some of the most commonly used virsh commands to manage the host and guest OS. Refer to the virsh online help (virsh help) or the libvirt website for more details on their usage.

15.2.1. domain management

dominfo: (virtual) CPU/Memory info of each domain(VM):
 1ping@trinity:~$ sudo virsh dominfo 2
 2Id:             2
 3Name:           vcp-vmx1
 4UUID:           90520384-41af-4029-9523-040ec59bb7f2
 5OS Type:        hvm
 6State:          running
 7CPU(s):         1       #<------
 8CPU time:       7461.9s
 9Max memory:     2000896 KiB     #<------
10Used memory:    2000000 KiB     #<------
11Persistent:     yes
12Autostart:      disable
13Managed save:   no
14Security model: none
15Security DOI:   0
16
17ping@trinity:~$ sudo virsh dominfo 3
18Id:             3
19Name:           vfp-vmx1
20UUID:           34991bba-ecbf-4680-905f-dc1d6b1c2308
21OS Type:        hvm
22State:          running
23CPU(s):         4       #<------
24CPU time:       29660.4s
25Max memory:     16000000 KiB    #<------
26Used memory:    16000000 KiB    #<------
27Persistent:     yes
28Autostart:      disable
29Managed save:   no
30Security model: none
31Security DOI:   0
vcpuinfo: state of each virtual cpu assigned to guest VM
 1ping@trinity:~$ sudo virsh vcpuinfo vcp-vmx1
 2VCPU:           0
 3CPU:            7
 4State:          running
 5CPU time:       465.7s
 6CPU Affinity:   -------y------------------------
 7
 8ping@trinity:~$ sudo virsh vcpuinfo vfp-vmx1
 9VCPU:           0
10CPU:            0
11State:          running
12CPU time:       571.7s
13CPU Affinity:   y-------------------------------
14
15VCPU:           1
16CPU:            1
17State:          running
18CPU time:       1432.5s
19CPU Affinity:   -y------------------------------
20
21VCPU:           2
22CPU:            2
23State:          running
24CPU time:       1133.6s
25CPU Affinity:   --y-----------------------------
26
27VCPU:           3
28CPU:            3
29State:          running
30CPU time:       517.1s
31CPU Affinity:   ---y----------------------------

the "CPU affinity" indicates the "binding" between vcpu in guest VM and physical cpu core in host machine [5]. This can be archived by using vcpupin command below.

vcpupin: display or control the vcpu to host cpu affinity
 1virsh emulatorpin vcp-vmx1 0
 2virsh emulatorpin vfp-vmx1 0
 3
 4virsh vcpupin vcp-vmx1 0 7
 5
 6virsh vcpupin vfp-vmx1 0 0
 7virsh vcpupin vfp-vmx1 1 1
 8virsh vcpupin vfp-vmx1 2 2
 9virsh vcpupin vfp-vmx1 3 3

With these commands, we can bind VCPU 0 of the vcp-vmx1 VM to core 7 of the host machine, VCPU 0 of the vfp-vmx1 VM to core 0 of the host machine, and so on. From the section on VCPU essentials, we know that each "VCPU" is essentially just a thread running in the space of one of the two QEMU processes, so binding a "VCPU" to a host cpu core essentially "scopes" the running of a thread to a specific CPU core in the host - that is why this feature is called "CPU affinity".

The emulatorpin command pins the emulator thread of the hypervisor itself.
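
To double-check the effect at the thread level, here is a quick sketch: list the QEMU threads on the host and query the affinity of one VCPU thread directly with taskset (take the thread ID from the "info cpus" output shown later in this section, or from ps; 42151 below is just an example from this setup):

# list all threads of the qemu processes, with the CPU each thread last ran on (PSR column)
ps -eLo pid,lwp,psr,comm | grep qemu

# query the affinity list of one VCPU thread (replace 42151 with a real thread id)
sudo taskset -pc 42151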

vcpucount: count of VCPU in use by a VM
1ping@ubuntu1404:~$ sudo virsh vcpucount contrail
2maximum      config         2
3maximum      live           2
4current      config         2
5current      live           2
storage/images of each VM:
1    ping@trinity:~$ sudo virsh domblklist 2
2    Target     Source
3    ------------------------------------------------
4    hda        /virtualization/images/vmx_20151102.0/build/vmx1/images/jinstall64-vmx-15.1F-20151104.0-domestic.img
5    hdb        /virtualization/images/vmx_20151102.0/build/vmx1/images/vmxhdd.img
6    sda        /virtualization/images/vmx_20151102.0/images/metadata_usb.img
ping@trinity:~$ sudo virsh domblklist 3
Target     Source
------------------------------------------------
hda        /virtualization/images/vmx_20151102.0/build/vmx1/images/vFPC-20151102.img

This command is pretty handy: with it we can quickly locate which images the current VMs were built from, and where the image files are currently located.
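
Once the image paths are known, the format and size of an image can be inspected with qemu-img; a minimal sketch using the vmxhdd.img path found above:

sudo qemu-img info /virtualization/images/vmx_20151102.0/build/vmx1/images/vmxhdd.img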

a useful one is qemu-monitor-command: it can be used to send QMP/HMP monitor commands to a guest VM
ping@trinity:~$ sudo virsh qemu-monitor-command vcp-vmx1 --hmp "info network"
net0: index=0,type=nic,model=e1000,macaddr=02:04:17:01:01:01
 \ hostnet0: index=0,type=tap,fd=18
net1: index=0,type=nic,model=virtio-net-pci,macaddr=52:54:00:8d:45:1e
 \ hostnet1: index=0,type=tap,fd=19

ping@trinity:~$ sudo virsh qemu-monitor-command vfp-vmx1 --hmp "info network"
net0: index=0,type=nic,model=virtio-net-pci,macaddr=02:04:17:01:01:02
 \ hostnet0: index=0,type=tap,fd=18
net1: index=0,type=nic,model=virtio-net-pci,macaddr=52:54:00:5e:96:98
 \ hostnet1: index=0,type=tap,fd=21

ping@trinity:~$ sudo virsh qemu-monitor-command 2 --hmp "info cpus"
* CPU #0: pc=0xffffffff80921b83 thread_id=42028

ping@trinity:~$ sudo virsh qemu-monitor-command 3 --hmp "info cpus"
* CPU #0: pc=0xffffffff8100af50 (halted) thread_id=42151
  CPU #1: pc=0xffffffff8100af50 (halted) thread_id=42152
  CPU #2: pc=0xffffffff8100af50 (halted) thread_id=42153
  CPU #3: pc=0xffffffff8100af50 (halted) thread_id=42154

This is sometimes quite useful because it removes the need to acquire a QMP console of a guest VM in order to send QMP commands.

The QEMU Machine Protocol (QMP) is a JSON-based protocol which allows applications to control a QEMU instance. In other words, you can query or manipulate the running status of a VM without even logging into the VM itself.
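
Since QMP itself is JSON, the same virsh front-end can also send a raw QMP command when --hmp is omitted. A minimal sketch (query-status is a standard QMP command; the domain name is from this setup):

# ask QEMU for the run state of the guest - raw QMP: JSON in, JSON out
sudo virsh qemu-monitor-command vcp-vmx1 --pretty '{"execute":"query-status"}'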

15.2.2. node management

"node" means host machine under the context of virsh. So all commands listed here is about the host, not the VM - This effectively provides another unified CLI to manage the host machine, regardless of the OS types. Considering there are so many OS variations and each may come with different , aften incompatible CLIes, using virsh may be easier to identify some common data about the host.

nodeinfo: cpu+memory brief overview
ping@trinity:~$ sudo virsh nodeinfo
CPU model:           x86_64
CPU(s):              32
CPU frequency:       3300 MHz
CPU socket(s):       1
Core(s) per socket:  8
Thread(s) per core:  1
NUMA cell(s):        4
Memory size:         528417400 KiB

labroot@MX86-host-BL660C-B1:~$ sudo virsh nodeinfo
CPU model:           x86_64
CPU(s):              32
CPU frequency:       1200 MHz
CPU socket(s):       1
Core(s) per socket:  8
Thread(s) per core:  1
NUMA cell(s):        4
Memory size:         528417400 KiB
sysinfo: host system general info in more detail (bios/cpu/memory/etc)
ping@trinity:~$ sudo virsh sysinfo
<sysinfo type='smbios'>
  <bios>        #<------similar to "dmidecode -s bios-version"
    <entry name='vendor'>HP</entry>
    <entry name='version'>I32</entry>
    <entry name='date'>02/10/2014</entry>
  </bios>
  <system>      #<------similar to "dmidecode -t 1"
    <entry name='manufacturer'>HP</entry>
    <entry name='product'>ProLiant BL660c Gen8</entry>
    <entry name='version'>Not Specified</entry>
    <entry name='serial'>USE4379WSS      </entry>
    <entry name='uuid'>31393736-3831-5355-4534-333739575353</entry>
    <entry name='sku'>679118-B21      </entry>
    <entry name='family'>ProLiant</entry>
  </system>
  <processor>   #<------CPU socket: similar to "dmidecode -t 4"
    <entry name='socket_destination'>Proc 1</entry>
    <entry name='type'>Central Processor</entry>
    <entry name='family'>Xeon</entry>
    <entry name='manufacturer'>Intel</entry>
    <entry name='signature'>Type 0, Family 6, Model 62, Stepping 4</entry>
    <entry name='version'> Intel(R) Xeon(R) CPU E5-4627 v2 @ 3.30GHz</entry>
    <entry name='external_clock'>100 MHz</entry>
    <entry name='max_speed'>4800 MHz</entry>
    <entry name='status'>Populated, Enabled</entry>
    <entry name='serial_number'>Not Specified</entry>
    <entry name='part_number'>Not Specified</entry>
  </processor>
  <processor>
    ......
  </processor>
  <memory_device> #<------similar to "dmidecode -t 17"
    <entry name='size'>32 GB</entry>
    <entry name='form_factor'>DIMM</entry>
    <entry name='locator'>PROC  1 DIMM  1</entry>
    <entry name='bank_locator'>Not Specified</entry>
    <entry name='type'>DDR3</entry>
    <entry name='type_detail'>Synchronous</entry>
    <entry name='speed'>1866 MHz</entry>
    <entry name='manufacturer'>HP</entry>
    <entry name='serial_number'>Not Specified</entry>
    <entry name='part_number'>712384-081</entry>
  </memory_device>
  <memory_device>
  ....
  </memory_device>
  ....

</sysinfo>
freecell: available memory
ping@trinity:~$ sudo virsh freecell
Total: 520897088 KiB
nodememstats
ping@ubuntu1404:~$ sudo virsh nodememstats
total  :              8033536 KiB
free   :              2874116 KiB
buffers:              1333740 KiB
cached :              1462096 KiB
capabilities: system supported features
ping@trinity:~$ sudo virsh capabilities
<capabilities>

  <host>
    <uuid>36373931-3138-5553-4534-333739575353</uuid>
    <cpu>       #<------cpu flags: "cat /proc/cpuinfo"
      <arch>x86_64</arch>
      <model>SandyBridge</model>
      <vendor>Intel</vendor>
      <topology sockets='1' cores='8' threads='1'/>
      <feature name='invtsc'/>
      <feature name='erms'/>
      <feature name='smep'/>
      <feature name='fsgsbase'/>
      <feature name='pdpe1gb'/>
      <feature name='rdrand'/>
      <feature name='f16c'/>
      <feature name='osxsave'/>
      <feature name='dca'/>
      <feature name='pcid'/>
      <feature name='pdcm'/>
      <feature name='xtpr'/>
      <feature name='tm2'/>
      <feature name='est'/>
      <feature name='smx'/>
      <feature name='vmx'/>
      <feature name='ds_cpl'/>
      <feature name='monitor'/>
      <feature name='dtes64'/>
      <feature name='pbe'/>
      <feature name='tm'/>
      <feature name='ht'/>
      <feature name='ss'/>
      <feature name='acpi'/>
      <feature name='ds'/>
      <feature name='vme'/>
      <pages unit='KiB' size='4'/>
      <pages unit='KiB' size='2048'/>
    </cpu>
    <power_management>
      <suspend_disk/>
      <suspend_hybrid/>
    </power_management>
    <migration_features>
      <live/>
      <uri_transports>
        <uri_transport>tcp</uri_transport>
      </uri_transports>
    </migration_features>
    <topology>
      <cells num='4'>
        <cell id='0'>   #<------NUMA node0
          <memory unit='KiB'>132067432</memory>
          <pages unit='KiB' size='4'>33016858</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <distances>
            <sibling id='0' value='10'/>
            <sibling id='1' value='21'/>
            <sibling id='2' value='21'/>
            <sibling id='3' value='21'/>
          </distances>
          <cpus num='8'>
            <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
            <cpu id='1' socket_id='0' core_id='1' siblings='1'/>
            <cpu id='2' socket_id='0' core_id='2' siblings='2'/>
            <cpu id='3' socket_id='0' core_id='3' siblings='3'/>
            <cpu id='4' socket_id='0' core_id='4' siblings='4'/>
            <cpu id='5' socket_id='0' core_id='5' siblings='5'/>
            <cpu id='6' socket_id='0' core_id='6' siblings='6'/>
            <cpu id='7' socket_id='0' core_id='7' siblings='7'/>
          </cpus>
        </cell>
        ......
    </topology>
    <secmodel>
      <model>none</model>
      <doi>0</doi>
    </secmodel>
    <secmodel>
      <model>dac</model>
      <doi>0</doi>
      <baselabel type='kvm'>+0:+0</baselabel>
      <baselabel type='qemu'>+0:+0</baselabel>
    </secmodel>
  </host>

  <guest>
    <os_type>hvm</os_type>
    <arch name='i686'>
      <wordsize>32</wordsize>
      <emulator>/usr/bin/qemu-system-i386</emulator>
      <machine canonical='pc-i440fx-trusty' maxCpus='255'>pc</machine>
      <machine maxCpus='255'>pc-0.12</machine>
        ......
      <machine maxCpus='255'>pc-i440fx-2.0</machine>
      <machine maxCpus='255'>pc-0.13</machine>
      <domain type='qemu'>
      </domain>
      <domain type='kvm'>
        <emulator>/usr/bin/kvm</emulator>
        <machine canonical='pc-i440fx-trusty' maxCpus='255'>pc</machine>
        <machine maxCpus='255'>pc-1.3</machine>
        ......
        <machine maxCpus='255'>pc-0.13</machine>
      </domain>
    </arch>
    <features>
      <cpuselection/>
      <deviceboot/>
      <disksnapshot default='on' toggle='no'/>
      <acpi default='on' toggle='yes'/>
      <apic default='on' toggle='no'/>
      <pae/>
      <nonpae/>
    </features>
  </guest>

  <guest>
    ......
  </guest>

</capabilities>
this output is quite long so not everything is printed here. For completeness, the whole capture is available in the appendix.

15.2.3. network and interface management

network and interface management virsh commands are used to manage the network resources in the host.

net-list: list active (or all if also with --all) virtual networks
labroot@MX86-host-BL660C-B1:~$ sudo virsh net-list --all
 Name                 State      Autostart     Persistent
----------------------------------------------------------
 br-ext               active     no            yes
 br-int               active     no            yes
 default              active     yes           yes
net-info: print properties of a specific network
labroot@MX86-host-BL660C-B1:~$ sudo virsh net-info br-ext
Name:           br-ext
UUID:           efef3d11-cf4f-4cc2-9d21-d1814c0e2e0e
Active:         yes
Persistent:     yes
Autostart:      no
Bridge:         br-ext
labroot@MX86-host-BL660C-B1:~$ sudo virsh net-info br-int
Name:           br-int
UUID:           e8af7e46-f574-4d8d-96ef-210a8d61067b
Active:         yes
Persistent:     yes
Autostart:      no
Bridge:         br-int
labroot@MX86-host-BL660C-B1:~$ sudo virsh net-info default
Name:           default
UUID:           0969babb-7b8f-457d-8548-1783152ad05d
Active:         yes
Persistent:     yes
Autostart:      yes
Bridge:         virbr0
iface-list: list all active (up) interfaces in host
ping@trinity:~$ sudo virsh iface-list
 Name                 State      MAC Address
---------------------------------------------------
 br-ext               active     38:ea:a7:37:7c:54
 p2p1                 active     38:ea:a7:17:65:a0
 p3p1                 active     38:ea:a7:17:65:84
ping@matrix:/home$ sudo virsh iface-list
[sudo] password for ping:
 Name                 State      MAC Address
---------------------------------------------------
 br-int-vmx1          active     52:54:00:04:5f:fe
 br-int-vmx2          active     52:54:00:d8:7b:dc
 br0                  active     5c:b9:01:8a:f0:b8
 br1                  active     5c:b9:01:8a:f0:b9
 p1p1                 active     8c:dc:d4:b7:79:d0
 p1p2                 active     8c:dc:d4:b7:79:d1
 p2p1                 active     8c:dc:d4:b7:7a:e8
 p2p2                 active     8c:dc:d4:b7:7a:e9
iface-mac: print MAC address of one specific interface in host
ping@trinity:~$ sudo virsh iface-mac p2p1
38:ea:a7:17:65:a0

16. iommu/VT-d

16.1. VT-d basic concept

I/O Virtualization (IOV) involves sharing a single I/O resource between multiple virtual machines. Approaches for IOV include (but may not be limited to):

  • software based approach

  • direct assignment

  • SR-IOV

16.1.1. software based approach

Software based sharing utilizes emulation techniques to provide a logical I/O hardware device to the VM. The emulation layer interposes itself between the driver running in the guest OS and the underlying hardware. This level of indirection allows the VMM to intercept all traffic issued by the guest driver.

Figure 6. software based I/O sharing

Emulation software involves the following common tasks:

  • parse the I/O commands,

  • translate guest addresses into host physical addresses

  • ensure that all referenced memory pages are present in memory.

  • resolve the multiple I/O requests from all the virtual machines

  • serialize them into a single I/O stream that can be handled by the underlying hardware.

There are at least 2 common approaches to software-based sharing:

  • device emulation

  • the split-driver model

Device emulation models

In this method, the hypervisor mimics widely supported real devices (such as an Intel 1Gb NIC) and utilizes the existing drivers in the guest OS. The VMM emulates the I/O device to ensure compatibility and then processes I/O operations before passing them on to the physical device (which may be different).

An example of an implementation of this model is QEMU, which can emulate a lot of legacy devices for use in the VM, regardless of the real physical device types available in the host.

ping@trinity:~$ qemu-system-x86_64 -device ?
...<snippet>...
name "ioh3420", bus PCI, desc "Intel IOH device id 3420 PCIE Root Port"
...<snippet>...
name "piix3-ide", bus PCI
name "piix3-ide-xen", bus PCI
...<snippet>...
name "e1000", bus PCI, desc "Intel Gigabit Ethernet"
name "i82550", bus PCI, desc "Intel i82550 Ethernet"
name "i82551", bus PCI, desc "Intel i82551 Ethernet"
name "i82557a", bus PCI, desc "Intel i82557A Ethernet"
name "i82557b", bus PCI, desc "Intel i82557B Ethernet"
name "i82557c", bus PCI, desc "Intel i82557C Ethernet"
name "i82558a", bus PCI, desc "Intel i82558A Ethernet"
...<snippet>...
name "isa-parallel", bus ISA
name "isa-serial", bus ISA
...<snippet>...
name "isa-vga", bus ISA
...<snippet>...
name "adlib", bus ISA, desc "Yamaha YM3812 (OPL2)"

A potential problem is that I/O operations then have to traverse two I/O stacks, one in the VM and one in the VMM. Because of the low performance achieved, this method is rarely used in environments where performance is key.

split-driver model

This method takes a similar approach but, instead of emulating a legacy device, the split-driver uses a front-end driver in the guest that works in concert with a back-end driver in the VMM. These drivers are optimized for sharing and have the benefit of not needing to emulate an entire device. The back-end driver communicates with the physical device.

An example of the implementation of this model is virtio, which will be discussed in section virtio.

Both the device emulation and split-driver (para-virtualized driver) provide a subset of the total functionality provided by physical hardware and may as a result not have the ability to take advantage of advanced capabilities provided by the device.

In addition, significant CPU overhead may be required by the VMM to implement a virtual software-based switch that routes packets to and from the appropriate VMs. This CPU overhead can (and generally does) reduce the maximum throughput on an I/O device. As an example, extensive testing has shown that, using only device emulation, a 10Gbps Ethernet controller can achieve a maximum throughput of 4.5 to around 6.5 Gbps (the range varies with the architecture of the server being tested on).

One reason that line rate, or near line rate, cannot be achieved is that each packet must go through the software switch, which requires CPU cycles to process the packets.

16.1.2. direct assignment

Software-based sharing adds overhead to each I/O operation due to the emulation layer between the guest driver and the I/O hardware. This indirection has the additional effect of eliminating the use of hardware acceleration that may be available in the physical device.

Such problems can be reduced by directly exposing the hardware to the guest OS and running a native device driver.

Intel has added enhancements to facilitate memory translation and ensure protection of memory that enables a device to directly DMA to/from host memory. These enhancements provide the ability to bypass the VMM’s I/O emulation layer and can result in throughput improvement for the VMs.

  • Intel® VT-x : allows a VM to have direct access to a physical address (if so configured by the VMM). This allows a device driver within a VM to write directly to the registers of an I/O device (such as configuring DMA descriptors).

  • Intel® VT-d : provides a similar capability for IO devices to be able to write directly to the memory space of a virtual machine, for example a DMA operation. VT-d technology is illustrated below:

    Figure 7. direct assignment

VT-d is Intel’s IOMMU implementation. It provides I/O device assignment, DMA remapping, interrupt remapping and other features to improve the performance and security of the virtualization environment.

At least in the context and test environment of this doc, the following technical terms refer to almost the same thing:

  • IOMMU

  • VT-d

  • PCI device passthrough

  • direct assignment

In simple terms, with VT-d support it's possible to "detach" a device from the host and then attach it to a guest VM. This way the IO performance of the VM can be sharply increased.

In my setup, VT-d was enabled in the IOMMU section as part of system preparation.

the installation script checks the /boot/grub/grub.cfg file to verify whether IOMMU has been enabled or not.

1ping@trinity:~$ cat /boot/grub/grub.cfg | grep "intel_iommu=on"
2        linux   /vmlinuz-3.19.0-25-generic root=/dev/mapper/trinity--vg-root ro  intel_iommu=on
3                linux   /vmlinuz-3.19.0-25-generic root=/dev/mapper/trinity--vg-root ro  intel_iommu=on
4                linux   /vmlinuz-3.16.0-30-generic root=/dev/mapper/trinity--vg-root ro  intel_iommu=on
5                linux   /vmlinuz-3.13.0-32-generic root=/dev/mapper/trinity--vg-root ro  intel_iommu=on
about VT-d, VT-x, SR-IOV virtualization features
  • Intel VT-d extensions: "d" for directed-I/O (direct assignment): adds virtualization support to Intel chipsets so that specific I/O devices can be "assigned" to specific virtual machines (VMs).

  • Intel VT-x is a virtualization extension specifically for the CPU, while VT-d is virtualization on the chipset. VT-x can be thought of as the "basic virtualization architecture" of Intel CPUs.

  • SR-IOV is mostly (but, in theory, not limited to) virtualization on the NIC card.

Here is a brief and not-so-accurate summary of each term and its category of virtualization:

term     virtualization category
VT-x     CPU
VT-d     chipset
SR-IOV   NIC

16.2. pci_stub

pci_stub is a kernel driver module that is used to "hide" a device from the host or from any other VMs that were not assigned to use this device. This is required to complete the VT-d/PCI-passthrough operation.

 1ping@trinity:~$ modinfo pci_stub
 2filename:       /lib/modules/3.13.0-32-generic/kernel/drivers/pci/pci-stub.ko
 3author:         Chris Wright <chrisw@sous-sol.org>
 4license:        GPL
 5srcversion:     194150416ED68735C4D2803
 6depends:
 7intree:         Y
 8vermagic:       3.13.0-32-generic SMP mod_unload modversions
 9signer:         Magrathea: Glacier signing key
10sig_key:        5E:3C:0F:9C:A6:E3:65:43:53:5F:A2:BB:5B:70:9E:84:F1:6D:A7:C7
11sig_hashalgo:   sha512
12parm:           ids:Initial PCI IDs to add to the stub driver, format is
13                "vendor:device[:subvendor[:subdevice[:class[:class_mask]]]]" and multiple
14                comma separated entries can be specified (string)

pci_stub was compiled as a kernel module and is not loaded by default in our case. It can be loaded manually with modprobe.

1ping@trinity:~$ lsmod | grep pci_stub
2ping@trinity:~$ modprobe pci_stub
3ping@trinity:~$ lsmod | grep pci_stub
4pci_stub               12622  0
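
As the modinfo output above shows, the module also accepts an ids parameter, so it can claim matching devices already at load time. A sketch using the 82599 VF vendor:device ID (8086:10ed) that appears in the next section:

# load pci_stub and let it claim every device matching vendor:device 8086:10ed
sudo modprobe pci_stub ids=8086:10ed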

16.3. process to assign PCI device in KVM

Here are the steps to isolate a PCI device, in this example, a SR-IOV VF:

  1. before assignment, p3p1 VF 0 is with ixgbevf driver

    ping@trinity:/images/vmx_20151102.0/build$ sudo lspci -nvks 0000:23:10.0
    23:10.0 0200: 8086:10ed (rev 01)
            Subsystem: 103c:17d2
            Flags: bus master, fast devsel, latency 0
            [virtual] Memory at f3400000 (64-bit, prefetchable) [size=16K]
            [virtual] Memory at f3300000 (64-bit, prefetchable) [size=16K]
            Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
            Capabilities: [a0] Express Endpoint, MSI 00
            Capabilities: [100] Advanced Error Reporting
            Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
            Kernel driver in use: ixgbevf

    the -n option shows PCI vendor and device codes as numbers instead of looking them up in the PCI ID list. In this output it gives the vendor:device number 8086:10ed, which we’ll need in later commands.

  2. now "unbind" the VF from host, and "bind" it to pci-stub driver

    1ping@trinity:/images/vmx_20151102.0/build$ sudo -i
    2ping@trinity:~# echo "8086 10ed" > /sys/bus/pci/drivers/pci-stub/new_id
    3root@trinity:~# echo 0000:23:10.0 > /sys/bus/pci/devices/0000:23:10.0/driver/unbind
    4root@trinity:~# echo 0000:23:10.0 >> /sys/bus/pci/drivers/pci-stub/bind
    5root@trinity:~# exit
    these operations require root privileges to perform.
  3. after the "unbind", the p3p1 VF 0 device is now with the pci-stub driver instead of the ixgbevf driver, effectively getting "detached" from the host.

     1ping@trinity:/images/vmx_20151102.0/build$ sudo lspci -nvks 0000:23:10.0
     223:10.0 0200: 8086:10ed (rev 01)
     3        Subsystem: 103c:17d2
     4        Flags: fast devsel
     5        [virtual] Memory at f3400000 (64-bit, prefetchable) [size=16K]
     6        [virtual] Memory at f3300000 (64-bit, prefetchable) [size=16K]
     7        Capabilities: [70] MSI-X: Enable- Count=3 Masked-
     8        Capabilities: [a0] Express Endpoint, MSI 00
     9        Capabilities: [100] Advanced Error Reporting
    10        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
    11        Kernel driver in use: pci-stub      #<------
    12
    13ping@trinity:/images/vmx_20151102.0/build$ sudo lspci -nvks 0000:06:10.0
    1406:10.0 0200: 8086:10ed (rev 01)
    15        Subsystem: 103c:17d2
    16        Flags: fast devsel
    17        [virtual] Memory at ee300000 (64-bit, prefetchable) [size=16K]
    18        [virtual] Memory at ee200000 (64-bit, prefetchable) [size=16K]
    19        Capabilities: [70] MSI-X: Enable- Count=3 Masked-
    20        Capabilities: [a0] Express Endpoint, MSI 00
    21        Capabilities: [100] Advanced Error Reporting
    22        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
    23        Kernel driver in use: pci-stub      #<------

now the p3p1 VF 0 device is ready to be used in guest VM.

The process to assign a device back from the guest VM to the host machine is similar - just unbind it from the pci-stub driver and then bind it back to the ixgbevf driver:

 1ping@trinity:/images/vmx_20151102.0/build$ sudo -i
 2root@trinity:~# echo "8086 10ed" > /sys/bus/pci/drivers/ixgbevf/new_id
 3root@trinity:~# echo 0000:23:10.0 > /sys/bus/pci/devices/0000:23:10.0/driver/unbind
 4root@trinity:~# echo 0000:23:10.0 >> /sys/bus/pci/drivers/ixgbevf/bind
 5root@trinity:~# exit
 6
 7ping@trinity:/images/vmx_20151102.0/build$ sudo lspci -nvvs 0000:23:10.0
 823:10.0 0200: 8086:10ed (rev 01)
 9        Subsystem: 103c:17d2
10        Flags: bus master, fast devsel, latency 0
11        [virtual] Memory at f3400000 (64-bit, prefetchable) [size=16K]
12        [virtual] Memory at f3300000 (64-bit, prefetchable) [size=16K]
13        Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
14        Capabilities: [a0] Express Endpoint, MSI 00
15        Capabilities: [100] Advanced Error Reporting
16        Capabilities: [150] Alternative Routing-ID Interpretation (ARI)
17        Kernel driver in use: ixgbevf       #<------ back to ixgbevf
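
The bind/unbind steps lend themselves to a small wrapper. Below is a sketch of a hypothetical helper script (rebind.sh, not part of the VMX tooling) that moves a device between drivers, taking the BDF, the target driver and the vendor/device ID as parameters:

#!/bin/bash
# rebind.sh - sketch: move a PCI device to another driver
# usage: sudo ./rebind.sh 0000:23:10.0 pci-stub "8086 10ed"
#        sudo ./rebind.sh 0000:23:10.0 ixgbevf  "8086 10ed"
BDF=$1; DRV=$2; ID=$3

# teach the target driver about this vendor/device ID (ignore "already exists" errors)
echo "$ID" > /sys/bus/pci/drivers/$DRV/new_id 2>/dev/null

# unbind from whatever driver currently owns the device, if any
[ -e /sys/bus/pci/devices/$BDF/driver ] && \
    echo "$BDF" > /sys/bus/pci/devices/$BDF/driver/unbind

# bind to the target driver and show the result
echo "$BDF" > /sys/bus/pci/drivers/$DRV/bind
lspci -nks $BDF | grep -i "driver in use"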

17. SRIOV

17.1. SRIOV basic concept

17.1.1. what is "IO virtualization"?

Input/output (I/O) virtualization is a methodology to simplify management, lower costs and improve performance of servers in enterprise environments. I/O virtualization environments are created by abstracting the upper layer protocols from the physical connections.

SR-IOV is a type of "IO virtualization" technique developed by the PCI-SIG (PCI Special Interest Group), so it is also called PCI-SIG SR-IOV.

The SR-IOV spec defines a very basic concept: enabling a single "root function" (for example, a single Ethernet port) to appear as multiple, separate devices. A physical device with SR-IOV capabilities can be configured to appear in the PCI configuration space as multiple "virtual functions".

17.1.2. what is a "function"?

Traditionally, a PCIe device has a unique PCI function address in the form of Bus, Device and Function, sometimes called BDF notation, which can be viewed with the lspci command on a Linux system.

ping@trinity:~$ lspci | grep -i ethernet
02:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
02:00.1 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)

The last number here, 0 or 1, is called a "function", and usually maps to a physical port.


17.1.3. PCI functions defined in SR-IOV

SR-IOV introduces two new function types:

  • Physical Functions (PFs) are full PCIe devices that include the SR-IOV capabilities. Physical Functions are discovered, managed, and configured as normal PCI devices. Physical Functions configure and manage the SR-IOV functionality by assigning Virtual Functions.

  • Virtual Functions (VFs) are simple PCIe functions that only process I/O. These are "lightweight" PCIe functions that contain the resources necessary for data movement but have a carefully minimized set of configuration resources. Each Virtual Function is derived from a Physical Function. The number of Virtual Functions a device may have is limited by the device hardware. A single Ethernet port, the Physical Device, may map to many Virtual Functions that can be shared to virtual machines.

The hypervisor can map one or more Virtual Functions to a virtual machine. The Virtual Function's configuration space is then mapped to the configuration space presented to the guest.

In other words: with SR-IOV we can "split" a single physical NIC port into multiple "virtual NIC ports", and each virtual NIC port can then be assigned to a different VM.


Each Virtual Function can only be mapped to a single guest at a time, as Virtual Functions require real hardware resources. A virtual machine can have multiple Virtual Functions. A Virtual Function appears as a network card in the same way as a normal network card would appear to an operating system.

The SR-IOV drivers are implemented in the kernel. The core implementation is contained in the PCI subsystem, but there must also be driver support for both the Physical Function (PF) and Virtual Function (VF) devices. An SR-IOV capable device can allocate VFs from a PF. The VFs appear as PCI devices which are backed on the physical PCI device by resources such as queues and register sets. illustrated below:

afd9c7f4 b072 11e5 9916 302ff60b6148
Figure 8. A typical network adapter supporting SR-IOV functionality
  • pDRV: Physical Driver (ixgbe for the Intel 82599 NIC used in our setup)

  • vDRV: Virtual Driver (ixgbevf for the Intel 82599 NIC used in our setup)

17.1.4. what is a "root"?

This is a PCIe term. A root complex connects the processor and memory subsystem to the PCIe switch fabric composed of one or more switch devices - similar to a host bridge in a PCI system. It generates transaction requests on behalf of the processor, which is interconnected through a local bus, and it may contain more than one PCIe port and multiple switch devices. A root port is the portion of the motherboard containing the host bridge, which allows the PCIe ports to talk to the rest of the computer.

Figure 9. root port

17.1.5. why is it called "single"?

This is to differentiate it from another IO virtualization technology named "MR-IOV" (Multiple Root IOV), which is defined to share an IO resource across multiple HW domains - for example, multiple servers and VMs sharing one I/O adapter, which can be placed in a separate chassis outside of the servers. The bandwidth of the I/O adapter is shared among the servers.

Figure: a typical MR-IOV topology

Figure: a brief comparison between SR-IOV and MR-IOV

17.1.6. correlation with VT-d

Intel’s implementation of PCI-SIG SR-IOV functionality requires the VMM software to configure the direct assignment of the virtual function to the virtual machine using Intel® Virtualization Technology for Directed I/O (Intel® VT-d). Memory Translation technologies in Intel® VT-d provide hardware assisted techniques to allow direct DMA transfers. Intel VT-d helps with secure translation, and SRIOV provides separate data spaces for the virtual machines. [6]

Figure 10. SR-IOV vs. VT-d

In brief, SR-IOV aims to "partition" or "split" the NIC port into VFs, while VT-d aims to "assign" each VF to a different VM. If the intention is to always use the entire bandwidth and processing capability of a NIC, then VT-d can be used without SR-IOV. See the VT-d section for more detailed information.

17.1.7. other requirement

currently SR-IOV needs to be supported by different components in a server:

  • BIOS support

  • HyperVisor support

  • NIC and driver support

In addition, Juniper’s modified NIC kernel driver requires Linux kernel 3.13. This may change later to be compatible with newer kernel releases.
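
A quick sanity check of the running kernel and the ixgbe driver version before installing the modified driver (a sketch; the version strings will differ per setup):

# running kernel release (the modified driver expects a 3.13.x kernel)
uname -r

# version of the ixgbe module the kernel would load
modinfo ixgbe | grep -E "^(filename|version|vermagic)"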

17.2. SR-IOV packet flow

The [SR-IOV-prime] illustrates the packet processing flow in SR-IOV:

Figure 11. SR-IOV packet flow
  1. The Ethernet packet arrives at the Intel® Ethernet NIC.

  2. The packet is sent to the Layer 2 sorter/switch/classifier.

    • This Layer 2 sorter is configured by the Master Driver (MD).

    • When either the MD or the VF driver configures a MAC address or VLAN, this Layer 2 sorter is configured.

  3. After being sorted by the Layer 2 Switch, the packet is placed into a receive queue dedicated to the target VF.

    For SR-IOV, MAC address configuration on the VF is important. Missing this step will make SR-IOV and the VF fail to work, or lead to unexpected behavior.
  4. The DMA operation is initiated. The target memory address for the DMA operation is defined within the descriptors in the VF, which have been configured by the VF driver within the VM.

    In order for SR-IOV to work, the ixgbevf driver kernel module also needs to be loaded in the guest VM.

    root@localhost:~# lspci | grep 82599
    00:06.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
    00:07.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
    00:08.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
    00:09.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
    
    root@localhost:~# lspci -vks 00:06.0
    00:06.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
            Subsystem: Hewlett-Packard Company Device 17d3
            Flags: bus master, fast devsel, latency 0
            Memory at fe000000 (64-bit, prefetchable) [size=16K]
            Memory at fe004000 (64-bit, prefetchable) [size=16K]
            Capabilities: [a0] Express Endpoint, MSI 00
            Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
            Kernel driver in use: igb_uio
            Kernel modules: ixgbevf
  5. The DMA Operation has reached the chipset. Intel® VT-d, which has been configured by the VMM then remaps the target DMA address from a virtual host address to a physical host address. The DMA operation is completed; the Ethernet packet is now in the memory space of the VM.

    As mentioned earlier, Intel’s SR-IOV implementation requires VT-d to be enabled.
  6. The Intel® Ethernet NIC fires an interrupt, indicating a packet has arrived. This interrupt is handled by the VMM.

  7. The VMM fires a virtual interrupt to the VM, so that it is informed that the packet has arrived.

    this is why it is sometimes called the "interrupt-driven" work mode.

17.3. SR-IOV configuration in VMX

In the vmx.conf file, the device-type : sriov option will instruct the installation script to configure Virtual Functions (VFs) on the NIC card.

the actual CLI commands executed by the script to configure SR-IOV are very simple: just reload the ixgbe module with the max_vfs parameter:

sudo rmmod ixgbe;
sudo modprobe ixgbe max_vfs=1,1,1,1,1,1,1,1;
just change the max_vfs parameter (here 1) to the number of VFs needed in your setup.
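
The max_vfs value given on the modprobe command line does not survive a reboot. A common way to make it persistent is a modprobe.d options file; a sketch, assuming the stock module loading mechanism is used (the VMX installation script may handle this differently):

# /etc/modprobe.d/ixgbe.conf - applied every time the ixgbe module is loaded
options ixgbe max_vfs=1,1,1,1,1,1,1,1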

In practice, there is a small problem here:

  • The first command rmmod ixgbe will remove the kernel driver module

  • as a result all interfaces driven by the ixgbe module will be removed from the platform as well.

  • In the case that the mgmt port is also ixgbe-driven, this will disconnect the current telnet/ssh session and we’ll need to log in via console (e.g. through iLO).

To work around this and also to reduce the interruption to our telnet/ssh session, I prefer to use the concatenated commands below, which configure the SR-IOV VFs immediately after the ixgbe module is removed and add our mgmt interface back to the external bridge, etc., all in one go.

sudo rmmod ixgbevf; sudo rmmod ixgbe; \
sudo modprobe ixgbe max_vfs=1,1,1,1,1,1,1,1; \
sudo modprobe tun ; sleep 5; \
sudo brctl addif br-ext em1
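
On kernels and drivers that expose the sriov_numvfs knob (visible in the /sys listings later in this chapter), VFs can also be created per port through sysfs, without reloading the whole module. A sketch for a single port - the value has to be reset to 0 before it can be changed:

# create 1 VF on p3p1 only, leaving the other ports untouched
echo 0 | sudo tee /sys/class/net/p3p1/device/sriov_numvfs
echo 1 | sudo tee /sys/class/net/p3p1/device/sriov_numvfs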

The interface configuration below in vmx.conf specifies the mapping between a VF in the host and a NIC in the guest VM, plus the properties of the VF.

    - interface            : ge-0/0/0
      port-speed-mbps      : 10000
      nic                  : p3p1
      mtu                  : 2000
      virtual-function     : 0
      mac-address          : "02:04:17:01:02:01"
      description          : "ge-0/0/0 connects to eth6"

    - interface            : ge-0/0/1
      port-speed-mbps      : 10000
      nic                  : p2p1
      mtu                  : 2000
      virtual-function     : 0
      mac-address          : "02:04:17:01:02:02"
      description          : "ge-0/0/1 connects to eth7"

17.4. example of 1 VF

to list all PFs and VFs, use lspci
 1ping@trinity:~$ lspci | grep -i "ethernet"
 202:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
 302:00.1 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
 402:10.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
 502:10.1 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
 606:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
 706:00.1 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
 806:10.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
 906:10.1 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
1021:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
1121:00.1 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
1221:10.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
1321:10.1 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
1423:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
1523:00.1 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
1623:10.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
1723:10.1 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)

kernel logs will display how many VFs have been enabled for each and every port.

kernel logs:
 1ping@trinity:~$ dmesg | grep SR-IOV
 2[203216.832237] ixgbe 0000:02:00.0 (unregistered net_device): SR-IOV enabled with 1 VFs
 3[203217.067941] ixgbe 0000:02:00.1 (unregistered net_device): SR-IOV enabled with 1 VFs
 4[203217.303815] ixgbe 0000:06:00.0 (unregistered net_device): SR-IOV enabled with 1 VFs
 5[203217.539589] ixgbe 0000:06:00.1 (unregistered net_device): SR-IOV enabled with 1 VFs
 6[203217.783265] ixgbe 0000:21:00.0 (unregistered net_device): SR-IOV enabled with 1 VFs
 7[203218.022982] ixgbe 0000:21:00.1 (unregistered net_device): SR-IOV enabled with 1 VFs
 8[203218.254658] ixgbe 0000:23:00.0 (unregistered net_device): SR-IOV enabled with 1 VFs
 9[203218.490378] ixgbe 0000:23:00.1 (unregistered net_device): SR-IOV enabled with 1 VFs

One way to find out the PCI bus address of an interface is to use ethtool with the -i option:

 1ping@trinity:~$ ethtool -i p2p1
 2driver: ixgbe
 3version: 3.19.1
 4firmware-version: 0x8000076f, 1.475.0
 5bus-info: 0000:06:00.0  #<------
 6supports-statistics: yes
 7supports-test: yes
 8supports-eeprom-access: yes
 9supports-register-dump: yes
10supports-priv-flags: no
11
12ping@trinity:~$ ethtool -i p3p1
13driver: ixgbe
14version: 3.19.1
15firmware-version: 0x8000076f, 1.475.0
16bus-info: 0000:23:00.0  #<------
17supports-statistics: yes
18supports-test: yes
19supports-eeprom-access: yes
20supports-register-dump: yes
21supports-priv-flags: no
22ping@trinity:~$

Another option is to use the lshw command, which can list vendor/manufacturer info for all interfaces in the platform; an example is shown below:

 1ping@ubuntu1:~$ sudo lshw -class network | grep -iE "bus info|logical name"
 2       bus info: pci@0000:07:00.0
 3       logical name: p2p1
 4       bus info: pci@0000:07:00.1
 5       logical name: p2p2
 6       bus info: pci@0000:07:10.0
 7       bus info: pci@0000:07:10.1
 8       bus info: pci@0000:07:10.2
 9       bus info: pci@0000:07:10.3
10       bus info: pci@0000:07:10.4
11       bus info: pci@0000:07:10.5
12       bus info: pci@0000:07:10.6
13       bus info: pci@0000:07:10.7
14       bus info: pci@0000:03:00.0
15       logical name: em1
16       bus info: pci@0000:03:00.1
17       logical name: em2
18       bus info: pci@0000:03:00.2
19       logical name: em3
20       bus info: pci@0000:03:00.3
21       logical name: em4
22       bus info: pci@0000:0a:00.0
23       logical name: p3p1
24       bus info: pci@0000:0a:00.1
25       logical name: p3p2
26       bus info: pci@0000:24:00.0
27       logical name: p5p1
28       bus info: pci@0000:24:00.1
29       logical name: p5p2
30       bus info: pci@0000:24:10.0
31       bus info: pci@0000:24:10.1
32       bus info: pci@0000:24:10.2
33       bus info: pci@0000:24:10.3
34       bus info: pci@0000:24:10.4
35       bus info: pci@0000:24:10.5
36       bus info: pci@0000:24:10.6
37       bus info: pci@0000:24:10.7
38       logical name: br-int-vmx1-nic
39       logical name: vfp_int-vmx1
40       logical name: vfp_ext-vmx1
41       logical name: br-ext-nic
42       logical name: vcp_int-vmx1
43       logical name: vcp_ext-vmx1

the advantage of lshw over ethtool is that it lists hardware information for all network interfaces instead of just one specific interface, and it also shows the mapping between PCI address and interface name, which comes in very handy.

with the bus address, the PF-VF mapping relationship can be printed:

1ping@trinity:~$ ls -l /sys/bus/pci/devices/0000\:06\:00.0/virtfn*
2lrwxrwxrwx 1 root root 0 Nov 23 21:03 /sys/bus/pci/devices/0000:06:00.0/virtfn0 -> ../0000:06:10.0
3
4ping@trinity:~$ ls -l /sys/bus/pci/devices/0000\:23\:00.0/virtfn*
5lrwxrwxrwx 1 root root 0 Nov 23 21:03 /sys/bus/pci/devices/0000:23:00.0/virtfn0 -> ../0000:23:10.0
to get the full PF-VF map of a platform, "find" it like this:
 1ping@trinity:~$ sudo find /sys -name virtfn* | xargs ls -l
 2... /sys/.../0000:02:00.0/virtfn0 -> ../0000:02:10.0
 3... /sys/.../0000:02:00.1/virtfn0 -> ../0000:02:10.1
 4... /sys/.../0000:06:00.0/virtfn0 -> ../0000:06:10.0
 5... /sys/.../0000:06:00.1/virtfn0 -> ../0000:06:10.1
 6... /sys/.../0000:21:00.0/virtfn0 -> ../0000:21:10.0
 7... /sys/.../0000:21:00.1/virtfn0 -> ../0000:21:10.1
 8... /sys/.../0000:23:00.0/virtfn0 -> ../0000:23:10.0
 9... /sys/.../0000:23:00.1/virtfn0 -> ../0000:23:10.1

another method to find out the VF-PF mapping is to use the libvirt virsh command:

list running VMs and locate the VFP name/ID:

ping@ubuntu1:~$ sudo virsh list
 Id    Name                           State
----------------------------------------------------
 34    vhepe-vcp                      running
 37    vhepe-vfp                      running

list the VFs (normally with slot number '0x10' or '0x11') in use:

ping@ubuntu1:~$ sudo virsh dumpxml 37 | grep -iE "0x10|0x11"
        <address type='pci' domain='0x0000' bus='0x24' slot='0x10' function='0x0'/>
        <address type='pci' domain='0x0000' bus='0x24' slot='0x10' function='0x1'/>
        <address type='pci' domain='0x0000' bus='0x07' slot='0x10' function='0x0'/>
        <address type='pci' domain='0x0000' bus='0x07' slot='0x10' function='0x1'/>

locate the nodedev name and print the mapping relationship

 1ping@ubuntu1:~$ sudo virsh nodedev-list | grep 0000_24_10
 2pci_0000_24_10_0
 3pci_0000_24_10_1
 4pci_0000_24_10_2
 5pci_0000_24_10_3
 6pci_0000_24_10_4
 7pci_0000_24_10_5
 8pci_0000_24_10_6
 9pci_0000_24_10_7
10
11ping@ubuntu1:~$ sudo virsh nodedev-dumpxml pci_0000_24_10_0 | grep function
12    <function>0</function>
13    <capability type='phys_function'>
14      <address domain='0x0000' bus='0x24' slot='0x00' function='0x0'/> (1)
15      <address domain='0x0000' bus='0x24' slot='0x10' function='0x0'/> (2)
1 PF address
2 VF address

If we put together all the info collected from the above commands, we can produce a table listing the mapping between PF and VF PCI bus addresses, logical interface names, and the VMX (guest VM) virtual interface names, like this:

Table 5. NIC mapping table (1 VF)
PCI address   adapter   interface   VF address   VF#   VMX interface
02:00.0       560FLB    em9         02:10.0      0
02:00.1       560FLB    em10        02:10.1      0
06:00.0       560M      p2p1        06:10.0      0     ge-0/0/1
06:00.1       560M      p2p2        06:10.1      0
21:00.0       560FLB    em1         21:10.0      0
21:00.1       560FLB    em2         21:10.1      0
23:00.0       560M      p3p1        23:10.0      0     ge-0/0/0
23:00.1       560M      p3p2        23:10.1      0

It is very handy to have this table in hand before you proceed to plan your VMX installation. It becomes even more helpful if you plan to set up a multi-instance VMX lab involving multiple VFs being configured and used. In the sections below I’ll demonstrate examples of configuring 4 VFs and 8 VFs.
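
Building such a table by hand gets tedious once more VFs are involved. The sketch below (a one-off loop, not part of any VMX tooling) walks /sys and prints each PF interface name next to its VF PCI addresses:

# print "<PF interface> virtfnN -> <VF PCI address>" for every SR-IOV enabled port
for link in /sys/class/net/*/device/virtfn*; do
    pf=$(echo $link | cut -d/ -f5)          # interface name, e.g. p3p1
    vf=$(basename $(readlink $link))        # VF bus address, e.g. 0000:23:10.0
    echo "$pf $(basename $link) -> $vf"
done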

17.5. example of 4 VFs

To enable a maximum of 4 VFs, give a max_vfs value of 4 instead of 1 and repeat the exact same commands:

1sudo rmmod ixgbevf; sudo rmmod ixgbe; \
2sudo modprobe ixgbe max_vfs=4,4,4,4,4,4,4,4; \
3sudo modprobe tun ; sleep 5; \
4sudo brctl addif br-ext em1
list all generated VFs:
 1ping@trinity:~$ lspci | grep -i ethernet
 202:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
 302:00.1 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
 402:10.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
 5...<snippet>...
 602:10.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
 706:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
 806:00.1 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
 906:10.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
1006:10.1 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
1106:10.3 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
1206:10.5 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
1306:10.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
1421:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
1521:00.1 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
1621:10.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
17...<snippet>...
1821:10.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
1923:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
2023:00.1 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
2123:10.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
2223:10.1 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
2323:10.3 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
2423:10.5 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
2523:10.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
the kernel logs (dmesg) now show 4 VFs enabled
1[26029.301160] ixgbe 0000:02:00.0 (unregistered net_device): SR-IOV enabled with 4 VFs
2[26029.540597] ixgbe 0000:02:00.1 (unregistered net_device): SR-IOV enabled with 4 VFs
3[26029.671906] ixgbe 0000:06:00.0 (unregistered net_device): SR-IOV enabled with 1 VFs
4[26029.908841] ixgbe 0000:06:00.1 (unregistered net_device): SR-IOV enabled with 4 VFs
5[26030.144053] ixgbe 0000:21:00.0 (unregistered net_device): SR-IOV enabled with 4 VFs
6[26030.383774] ixgbe 0000:21:00.1 (unregistered net_device): SR-IOV enabled with 4 VFs
7[26030.515073] ixgbe 0000:23:00.0 (unregistered net_device): SR-IOV enabled with 1 VFs
8[26030.752200] ixgbe 0000:23:00.1 (unregistered net_device): SR-IOV enabled with 4 VFs
Not all NIC ports now have 4 VFs, which will be explained later.
number of VFs currently enabled, vs number of maximum VFs supported, per NIC port:
 1ping@trinity:~$ sudo find /sys/ -name "*vfs*"
 2/sys/devices/pci0000:00/0000:00:02.2/0000:02:00.0/sriov_numvfs
 3/sys/devices/pci0000:00/0000:00:02.2/0000:02:00.0/sriov_totalvfs
 4/sys/devices/pci0000:00/0000:00:02.2/0000:02:00.1/sriov_numvfs
 5/sys/devices/pci0000:00/0000:00:02.2/0000:02:00.1/sriov_totalvfs
 6/sys/devices/pci0000:00/0000:00:03.0/0000:06:00.0/sriov_numvfs
 7/sys/devices/pci0000:00/0000:00:03.0/0000:06:00.0/sriov_totalvfs
 8/sys/devices/pci0000:00/0000:00:03.0/0000:06:00.1/sriov_numvfs
 9/sys/devices/pci0000:00/0000:00:03.0/0000:06:00.1/sriov_totalvfs
10/sys/devices/pci0000:20/0000:20:02.2/0000:21:00.0/sriov_numvfs
11/sys/devices/pci0000:20/0000:20:02.2/0000:21:00.0/sriov_totalvfs
12/sys/devices/pci0000:20/0000:20:02.2/0000:21:00.1/sriov_numvfs
13/sys/devices/pci0000:20/0000:20:02.2/0000:21:00.1/sriov_totalvfs
14/sys/devices/pci0000:20/0000:20:03.0/0000:23:00.0/sriov_numvfs
15/sys/devices/pci0000:20/0000:20:03.0/0000:23:00.0/sriov_totalvfs
16/sys/devices/pci0000:20/0000:20:03.0/0000:23:00.1/sriov_numvfs
17/sys/devices/pci0000:20/0000:20:03.0/0000:23:00.1/sriov_totalvfs
18/sys/kernel/debug/tracing/events/vfs
19
20ping@trinity:~$ sudo cat /sys/devices/pci0000:00/0000:00:02.2/0000:02:00.0/sriov_numvfs
214
22ping@trinity:~$ sudo cat /sys/devices/pci0000:00/0000:00:02.2/0000:02:00.0/sriov_totalvfs
2363
24ping@trinity:~$ sudo cat /sys/devices/pci0000:00/0000:00:03.0/0000:06:00.0/sriov_numvfs
251
26ping@trinity:~$ sudo cat /sys/devices/pci0000:00/0000:00:03.0/0000:06:00.0/sriov_totalvfs
2763
28ping@trinity:~$ sudo cat /sys/devices/pci0000:00/0000:00:03.0/0000:06:00.1/sriov_numvfs
294
30ping@trinity:~$ sudo cat /sys/devices/pci0000:00/0000:00:03.0/0000:06:00.1/sriov_totalvfs
3163
32ping@trinity:~$ sudo cat /sys/devices/pci0000:20/0000:20:02.2/0000:21:00.0/sriov_numvfs
334
34ping@trinity:~$ sudo cat /sys/devices/pci0000:20/0000:20:02.2/0000:21:00.0/sriov_totalvfs
3563
36ping@trinity:~$ sudo cat /sys/devices/pci0000:20/0000:20:02.2/0000:21:00.1/sriov_numvfs
374
38ping@trinity:~$ sudo cat /sys/devices/pci0000:20/0000:20:02.2/0000:21:00.1/sriov_totalvfs
3963
40ping@trinity:~$ sudo cat /sys/devices/pci0000:20/0000:20:03.0/0000:23:00.0/sriov_numvfs
411
42ping@trinity:~$ sudo cat /sys/devices/pci0000:20/0000:20:03.0/0000:23:00.0/sriov_totalvfs
4363
44ping@trinity:~$ sudo cat /sys/devices/pci0000:20/0000:20:03.0/0000:23:00.1/sriov_numvfs
454
46ping@trinity:~$ sudo cat /sys/devices/pci0000:20/0000:20:03.0/0000:23:00.1/sriov_totalvfs
4763

VF-PF mapping:

 1ping@trinity:~$ ls -l /sys/bus/pci/devices/0000\:02\:00.0/virtfn*
 2lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:02:00.0/virtfn0 -> ../0000:02:10.0
 3lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:02:00.0/virtfn1 -> ../0000:02:10.2
 4lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:02:00.0/virtfn2 -> ../0000:02:10.4
 5lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:02:00.0/virtfn3 -> ../0000:02:10.6
 6ping@trinity:~$
 7
 8ping@trinity:~$ ls -l /sys/bus/pci/devices/0000\:02\:00.1/virtfn*
 9lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:02:00.1/virtfn0 -> ../0000:02:10.1
10lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:02:00.1/virtfn1 -> ../0000:02:10.3
11lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:02:00.1/virtfn2 -> ../0000:02:10.5
12lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:02:00.1/virtfn3 -> ../0000:02:10.7
13ping@trinity:~$
14
15ping@trinity:~$ ls -l /sys/bus/pci/devices/0000\:06\:00.0/virtfn*
16lrwxrwxrwx 1 root root 0 Nov 21 11:57 /sys/bus/pci/devices/0000:06:00.0/virtfn0 -> ../0000:06:10.0
17
18ping@trinity:~$ ls -l /sys/bus/pci/devices/0000\:06:00.1/virtfn*
19lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:06:00.1/virtfn0 -> ../0000:06:10.1
20lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:06:00.1/virtfn1 -> ../0000:06:10.3
21lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:06:00.1/virtfn2 -> ../0000:06:10.5
22lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:06:00.1/virtfn3 -> ../0000:06:10.7
23
24ping@trinity:~$ ls -l /sys/bus/pci/devices/0000\:21:00.0/virtfn*
25lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:21:00.0/virtfn0 -> ../0000:21:10.0
26lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:21:00.0/virtfn1 -> ../0000:21:10.2
27lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:21:00.0/virtfn2 -> ../0000:21:10.4
28lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:21:00.0/virtfn3 -> ../0000:21:10.6
29
30ping@trinity:~$ ls -l /sys/bus/pci/devices/0000\:21:00.1/virtfn*
31lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:21:00.1/virtfn0 -> ../0000:21:10.1
32lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:21:00.1/virtfn1 -> ../0000:21:10.3
33lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:21:00.1/virtfn2 -> ../0000:21:10.5
34lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:21:00.1/virtfn3 -> ../0000:21:10.7
35
36ping@trinity:~$ ls -l /sys/bus/pci/devices/0000\:23:00.0/virtfn*
37lrwxrwxrwx 1 root root 0 Nov 21 11:57 /sys/bus/pci/devices/0000:23:00.0/virtfn0 -> ../0000:23:10.0
38
39ping@trinity:~$ ls -l /sys/bus/pci/devices/0000\:23:00.1/virtfn*
40lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:23:00.1/virtfn0 -> ../0000:23:10.1
41lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:23:00.1/virtfn1 -> ../0000:23:10.3
42lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:23:00.1/virtfn2 -> ../0000:23:10.5
43lrwxrwxrwx 1 root root 0 Nov 21 18:22 /sys/bus/pci/devices/0000:23:00.1/virtfn3 -> ../0000:23:10.7

The PCI bus addresses may look different on a different server, so the correct PCI address has to be used when executing each command above. A more convenient and general way to display all PF-VF mappings is to find them under the /sys folder. Below is a sample captured from a different server.

 1ping@ubuntu1:~$ sudo find /sys -name virtfn* | xargs ls -l
 2... /sys/devices/pci0000:00/0000:00:01.0/0000:07:00.0/virtfn0 -> ../0000:07:10.0
 3... /sys/devices/pci0000:00/0000:00:01.0/0000:07:00.0/virtfn1 -> ../0000:07:10.2
 4... /sys/devices/pci0000:00/0000:00:01.0/0000:07:00.0/virtfn2 -> ../0000:07:10.4
 5... /sys/devices/pci0000:00/0000:00:01.0/0000:07:00.0/virtfn3 -> ../0000:07:10.6
 6... /sys/devices/pci0000:00/0000:00:01.0/0000:07:00.1/virtfn0 -> ../0000:07:10.1
 7... /sys/devices/pci0000:00/0000:00:01.0/0000:07:00.1/virtfn1 -> ../0000:07:10.3
 8... /sys/devices/pci0000:00/0000:00:01.0/0000:07:00.1/virtfn2 -> ../0000:07:10.5
 9... /sys/devices/pci0000:00/0000:00:01.0/0000:07:00.1/virtfn3 -> ../0000:07:10.7
10... /sys/devices/pci0000:20/0000:20:02.2/0000:24:00.0/virtfn0 -> ../0000:24:10.0
11... /sys/devices/pci0000:20/0000:20:02.2/0000:24:00.0/virtfn1 -> ../0000:24:10.2
12... /sys/devices/pci0000:20/0000:20:02.2/0000:24:00.0/virtfn2 -> ../0000:24:10.4
13... /sys/devices/pci0000:20/0000:20:02.2/0000:24:00.0/virtfn3 -> ../0000:24:10.6
14... /sys/devices/pci0000:20/0000:20:02.2/0000:24:00.1/virtfn0 -> ../0000:24:10.1
15... /sys/devices/pci0000:20/0000:20:02.2/0000:24:00.1/virtfn1 -> ../0000:24:10.3
16... /sys/devices/pci0000:20/0000:20:02.2/0000:24:00.1/virtfn2 -> ../0000:24:10.5
17... /sys/devices/pci0000:20/0000:20:02.2/0000:24:00.1/virtfn3 -> ../0000:24:10.7

Similarly, to print the number of currently enabled VFs on each interface:

ping@ubuntu1:/export/home/vhepe/images$ sudo find /sys/ -name "*numvfs" | xargs cat
4
4
0
0
4
4

to print the maximum number of VFs supported on each port:

ping@ubuntu1:/export/home/vhepe/images$ sudo find /sys/ -name "*totalvfs" | xargs cat
64
64
64
64
64
64
Table 6. NIC mapping table (4 VFs)

PCI address   adapter   interface   VF address   VF#   VMX interface
02:00.0       560FLB    em9         02:10.0      0
                                    02:10.2      1
                                    02:10.4      2
                                    02:10.6      3
02:00.1       560FLB    em10        02:10.1      0
                                    02:10.3      1
                                    02:10.5      2
                                    02:10.7      3
06:00.0       560M      p2p1        06:10.0      0     ge-0/0/1
06:00.1       560M      p2p2        06:10.1      0
                                    06:10.3      1
                                    06:10.5      2
                                    06:10.7      3
21:00.0       560FLB    em1         21:10.0      0
                                    21:10.2      1
                                    21:10.4      2
                                    21:10.6      3
21:00.1       560FLB    em2         21:10.1      0
                                    21:10.3      1
                                    21:10.5      2
                                    21:10.7      3
23:00.0       560M      p3p1        23:10.0      0     ge-0/0/0
23:00.1       560M      p3p2        23:10.1      0
                                    23:10.3      1
                                    23:10.5      2
                                    23:10.7      3

Now, the reason that p2p1 and p3p1 still hold just 1 VF instead of 4 in this example is that the guest VMX VMs were not torn down when the SR-IOV VFs were reconfigured, so the previously allocated VF resources were still held, unchanged, on the ports in use.

After removing the VMs and reconfiguring the ixgbe driver, the expected number of VFs can be seen:

 1sudo virsh destroy vcp-vmx1
 2sudo virsh destroy vfp-vmx1
 3sudo virsh undefine vcp-vmx1
 4sudo virsh undefine vfp-vmx1
 5sudo rmmod ixgbevf; sudo rmmod ixgbe; \
 6sudo modprobe ixgbe max_vfs=4,4,4,4,4,4,4,4; \
 7sudo modprobe tun ; sleep 5; sudo brctl addif br-ext em1
 8
 9[206499.970842] ixgbe 0000:02:00.0 (unregistered net_device): SR-IOV enabled with 4 VFs
10[206500.206573] ixgbe 0000:02:00.1 (unregistered net_device): SR-IOV enabled with 4 VFs
11[206500.446766] ixgbe 0000:06:00.0 (unregistered net_device): SR-IOV enabled with 4 VFs
12[206500.682536] ixgbe 0000:06:00.1 (unregistered net_device): SR-IOV enabled with 4 VFs
13[206500.917846] ixgbe 0000:21:00.0 (unregistered net_device): SR-IOV enabled with 4 VFs
14[206501.153595] ixgbe 0000:21:00.1 (unregistered net_device): SR-IOV enabled with 4 VFs
15[206501.389801] ixgbe 0000:23:00.0 (unregistered net_device): SR-IOV enabled with 4 VFs
16[206501.625554] ixgbe 0000:23:00.1 (unregistered net_device): SR-IOV enabled with 4 VFs

17.5.1. the "rule of thumb" for 4 VF

The previous table for "4 VFs" scenario will now be updated as below:

Table 7. NIC mapping table

PCI address   adapter   interface   VF address   VF#   VMX interface
02:00.0       560FLB    em9         02:10.0      0
                                    02:10.2      1
                                    02:10.4      2
                                    02:10.6      3
02:00.1       560FLB    em10        02:10.1      0
                                    02:10.3      1
                                    02:10.5      2
                                    02:10.7      3
06:00.0       560M      p2p1        06:10.0      0     ge-0/0/1
                                    06:10.2      1
                                    06:10.4      2
                                    06:10.6      3
06:00.1       560M      p2p2        06:10.1      0
                                    06:10.3      1
                                    06:10.5      2
                                    06:10.7      3
21:00.0       560FLB    em1         21:10.0      0
                                    21:10.2      1
                                    21:10.4      2
                                    21:10.6      3
21:00.1       560FLB    em2         21:10.1      0
                                    21:10.3      1
                                    21:10.5      2
                                    21:10.7      3
23:00.0       560M      p3p1        23:10.0      0     ge-0/0/0
                                    23:10.2      1
                                    23:10.4      2
                                    23:10.6      3
23:00.1       560M      p3p2        23:10.1      0
                                    23:10.3      1
                                    23:10.5      2
                                    23:10.7      3

From this table, we seem to be able to conclude the following rules:

  • a device number of 10 represents a "virtual" device (or bus)

  • the "interleaved" function numbers 0 2 4 6 represent the 4 virtual functions 0 to 3 of the first physical port (port 0)

  • the "interleaved" function numbers 1 3 5 7 represent the 4 virtual functions 0 to 3 of the second physical port (port 1)

A question is: will this always hold true for other maximum-VF configurations? A quick test can show the answer.

17.6. example of 8 VFs:

As with the 1 VF and 4 VF examples shown above, to enable 8 VFs per port (making it 64 VFs in total platform-wide), just changing max_vfs accordingly and using the same commands is sufficient:

sudo rmmod ixgbevf; \
sudo rmmod ixgbe; \
sudo modprobe ixgbe max_vfs=8,8,8,8,8,8,8,8; \
sudo modprobe tun ; \
sleep 5; sudo brctl addif br-ext em1

Now we can verify all the VF-related data points using the same commands as those used in the previous examples.

to list all generated VFs
 1ping@trinity:/images/vmx_20151102.0/build$ lspci | grep -i ethernet
 202:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
 302:00.1 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
 402:10.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
 5...<snippet>...
 602:10.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
 702:11.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
 8...<snippet>...
 902:11.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
1006:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
1106:00.1 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
1206:10.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
13...<snippet>...
1406:10.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
1506:11.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
16...<snippet>...
1706:11.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
1821:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
1921:00.1 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
2021:10.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
21...<snippet>...
2221:10.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
2321:11.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
24...<snippet>...
2521:11.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
2623:00.0 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
2723:00.1 Ethernet controller: Intel Corporation 82599 10 Gigabit Dual Port Backplane Connection (rev 01)
2823:10.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
29...<snippet>...
3023:10.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
3123:11.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
32...<snippet>...
3323:11.7 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
kernel message:
 1ping@trinity:/images/vmx_20151102.0/build$ dmesg | grep SR-IOV
 2[72130.750581] ixgbe 0000:02:00.0 (unregistered net_device): SR-IOV enabled with 8 VFs
 3[72130.986275] ixgbe 0000:02:00.1 (unregistered net_device): SR-IOV enabled with 8 VFs
 4[72131.223191] ixgbe 0000:06:00.0 (unregistered net_device): SR-IOV enabled with 8 VFs
 5[72131.458462] ixgbe 0000:06:00.1 (unregistered net_device): SR-IOV enabled with 8 VFs
 6[72131.693388] ixgbe 0000:21:00.0 (unregistered net_device): SR-IOV enabled with 8 VFs
 7[72131.937491] ixgbe 0000:21:00.1 (unregistered net_device): SR-IOV enabled with 8 VFs
 8[72132.173784] ixgbe 0000:23:00.0 (unregistered net_device): SR-IOV enabled with 8 VFs
 9[72132.409462] ixgbe 0000:23:00.1 (unregistered net_device): SR-IOV enabled with 8 VFs
PF-VF mapping:
 1ping@trinity:~$ ls -l /sys/bus/pci/devices/0000\:02\:00.0/virtfn*
 2lrwxrwxrwx 1 root root 0 Nov 24 14:10 /sys/bus/pci/devices/0000:02:00.0/virtfn0 -> ../0000:02:10.0
 3lrwxrwxrwx 1 root root 0 Nov 24 14:10 /sys/bus/pci/devices/0000:02:00.0/virtfn1 -> ../0000:02:10.2
 4lrwxrwxrwx 1 root root 0 Nov 24 14:10 /sys/bus/pci/devices/0000:02:00.0/virtfn2 -> ../0000:02:10.4
 5lrwxrwxrwx 1 root root 0 Nov 24 14:10 /sys/bus/pci/devices/0000:02:00.0/virtfn3 -> ../0000:02:10.6
 6lrwxrwxrwx 1 root root 0 Nov 24 14:10 /sys/bus/pci/devices/0000:02:00.0/virtfn4 -> ../0000:02:11.0
 7lrwxrwxrwx 1 root root 0 Nov 24 14:10 /sys/bus/pci/devices/0000:02:00.0/virtfn5 -> ../0000:02:11.2
 8lrwxrwxrwx 1 root root 0 Nov 24 14:10 /sys/bus/pci/devices/0000:02:00.0/virtfn6 -> ../0000:02:11.4
 9lrwxrwxrwx 1 root root 0 Nov 24 14:10 /sys/bus/pci/devices/0000:02:00.0/virtfn7 -> ../0000:02:11.6

17.6.1. the "rule of thumb" for 8 VF

Now we end up with the below table for the "8 VFs" scenario:

Table 8. NIC mapping table (8 VFs)

PCI address   adapter   interface   VF address   VF#   VMX interface
02:00.0       560FLB    em9         02:10.0      0
                                    02:10.2      1
                                    02:10.4      2
                                    02:10.6      3
                                    02:11.0      4
                                    02:11.2      5
                                    02:11.4      6
                                    02:11.6      7
02:00.1       560FLB    em10        02:10.1      0
                                    02:10.3      1
                                    02:10.5      2
                                    02:10.7      3
                                    02:11.1      4
                                    02:11.3      5
                                    02:11.5      6
                                    02:11.7      7
06:00.0       560M      p2p1        06:10.0      0     ge-0/0/1
                                    06:10.2      1
                                    06:10.4      2
                                    06:10.6      3
                                    06:11.0      4
                                    06:11.2      5
                                    06:11.4      6
                                    06:11.6      7
06:00.1       560M      p2p2        06:10.1      0
                                    06:10.3      1
                                    06:10.5      2
                                    06:10.7      3
                                    06:11.1      4
                                    06:11.3      5
                                    06:11.5      6
                                    06:11.7      7
21:00.0       560FLB    em1         21:10.0      0
                                    21:10.2      1
                                    21:10.4      2
                                    21:10.6      3
                                    21:11.0      4
                                    21:11.2      5
                                    21:11.4      6
                                    21:11.6      7
21:00.1       560FLB    em2         21:10.1      0
                                    21:10.3      1
                                    21:10.5      2
                                    21:10.7      3
                                    21:11.1      4
                                    21:11.3      5
                                    21:11.5      6
                                    21:11.7      7
23:00.0       560M      p3p1        23:10.0      0     ge-0/0/0
                                    23:10.2      1
                                    23:10.4      2
                                    23:10.6      3
                                    23:11.0      4
                                    23:11.2      5
                                    23:11.4      6
                                    23:11.6      7
23:00.1       560M      p3p2        23:10.1      0
                                    23:10.3      1
                                    23:10.5      2
                                    23:10.7      3
                                    23:11.1      4
                                    23:11.3      5
                                    23:11.5      6
                                    23:11.7      7

From this table, it looks like the previous conclusions were almost right, except that one more device number, 11, now shows up, so the rules are updated as below (a small sketch that encodes these rules follows the list):

  • the device numbers 10 and 11 represent "virtual" devices (or buses)

  • the "interleaved" function numbers 0 2 4 6 represent the virtual functions of the first physical port (port 0): VFs 0 to 3 under device number 10, and VFs 4 to 7 under device number 11

  • the "interleaved" function numbers 1 3 5 7 represent the virtual functions of the second physical port (port 1): VFs 0 to 3 under device number 10, and VFs 4 to 7 under device number 11
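
The pattern can be captured in a small shell function. This is only a sketch based on the VF layout observed on this particular 82599 setup (VF area starting at device 0x10, with a stride of 2); the authoritative mapping is always the virtfn* symlinks under /sys, so treat it as an illustration rather than a rule for every adapter:

# sketch: compute the VF PCI address for VF number "vf" of physical port
# "port" (0 or 1) on PCI bus "bus", assuming the observed offset/stride:
#   device   = 0x10 + (2*vf + port) / 8
#   function = (2*vf + port) % 8
vf_bdf() {
    local bus=$1 port=$2 vf=$3
    local off=$(( 2 * vf + port ))
    printf "%s:%02x.%x\n" "$bus" $(( 0x10 + off / 8 )) $(( off % 8 ))
}

vf_bdf 23 0 0    # -> 23:10.0 (p3p1 VF0)
vf_bdf 23 0 5    # -> 23:11.2 (p3p1 VF5)
vf_bdf 06 1 3    # -> 06:10.7 (p2p2 VF3)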

18. virtio

18.1. virtio basic concept

Compared with the "full virtualization" that the QEMU software provides, virtio provides a "paravirtualization" environment, where the guest operating system is "aware" that it is running on a hypervisor and cooperates with it.

Figure 12. Device emulation in full virtualization and paravirtualization environments

Virtio is a virtualization standard for network and disk device drivers, which enables guest VMs to get high-performance network and disk operations, and gives most of the performance benefits of paravirtualization.

The key features of virtio are highlighted below:

  • Virtio is an abstraction of common emulated devices like PCI devices, hard drives and NICs

  • the guest OS needs to have virtio drivers, called "front end" drivers, to be able to work in this environment. Examples of these front-end drivers are:

    • virtio-blk

    • virtio-net

    • virtio-balloon.

      currently Ubuntu and most other Linux variants have virtio drivers supported in the kernel, hence no extra installation is needed. For Windows or other OSs, virtio drivers may need to be installed separately.
  • The hypervisor itself implements "backend drivers" for device emulation

  • These "front end" and "backend drivers" work together to make up virtio

  • Usually the front-end drivers live in the guest kernel, while the back-end drivers are implemented in the hypervisor (QEMU/KVM) on the host

Figure 13. driver abstraction with virtio

In addition to the front-end drivers (implemented in the guest operating system) and the back-end drivers (implemented in the hypervisor), virtio defines two layers to support guest-to-hypervisor communication:

  • At the top level (called virtio) is the "virtual queue" interface that conceptually attaches front-end drivers to back-end drivers.

  • virtio-ring is used to buffer the information processed by the front-end/back-end drivers. It can also buffer the I/O requests from the front-end driver before handing them over to the back end, improving I/O processing efficiency.

Figure 14. High-level architecture of the virtio framework

virtio can be used to create virtual NIC interfaces, independent of the back-end drivers in the hypervisor.
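
As a stand-alone illustration (not taken from the VMX scripts; the image and tap names are made up for the example), a tap-backed virtio NIC can be attached to a plain QEMU/KVM guest roughly like this:

sudo qemu-system-x86_64 -enable-kvm -m 2048 \
    -drive file=test.img,format=qcow2 \
    -netdev tap,id=net0,ifname=tap0,script=no,downscript=no \
    -device virtio-net-pci,netdev=net0,mac=52:54:00:12:34:56 \
    -nographic

inside the guest, such a NIC shows up as a "Virtio network device" handled by the virtio-net front-end driver, as the vPFE captures later in this chapter will show.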

Even for SR-IOV based MX86, virtio is still in use, but only for the vRE/vPFE internal and external management interfaces, not for the data plane interfaces
virtio support in kvm

Virtio was chosen to be the main platform for IO virtualization in KVM. The idea behind it is to have a common framework for hypervisors for IO virtualization. At the moment, network/block/balloon devices are supported for KVM. The host implementation is in userspace (qemu), so no driver is needed in the host.

Virtio is a relatively new technique, so it is not supported from day one of qemu-kvm. According to [linux-kvm-virtio], the KVM version needs to be >60 and the Linux kernel needs to be later than 2.6.25. It also needs some specific configuration options to be activated in the kernel. To verify whether the QEMU-KVM installed on the server supports virtio, run this command:

ping@trinity:~$ qemu-system-x86_64 -net nic,model=?
qemu: Supported NIC models: ne2k_pci,i82551,i82557b,i82559er,rtl8139,e1000,pcnet,virtio
                                                                                 ^^^^^^ (1)
1 virtio is supported in this qemu-kvm release.
virtio in VMX

the installation of the virtio version of VMX does not enforce the following requirements:

  • physical NIC with SR-IOV capability

  • IXGBE driver

  • specific linux kernel version (e.g. 3.13)

  • IOMMU/VT-d

Therefore, if performance is not a concern, it is more convenient to set up the virtio version of VMX for learning/testing purposes.

virtio linux kernel driver support

Current linux kernel already has virtio driver module support:

 1ping@trinity:~$ grep -i virtio /boot/config-3.13.0-32-generic
 2CONFIG_NET_9P_VIRTIO=m
 3CONFIG_VIRTIO_BLK=y
 4CONFIG_SCSI_VIRTIO=m
 5CONFIG_VIRTIO_NET=y
 6CONFIG_CAIF_VIRTIO=m
 7CONFIG_VIRTIO_CONSOLE=y
 8CONFIG_HW_RANDOM_VIRTIO=m
 9CONFIG_VIRTIO=y
10# Virtio drivers
11CONFIG_VIRTIO_PCI=y
12CONFIG_VIRTIO_BALLOON=y
13CONFIG_VIRTIO_MMIO=y
14CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES=y
15
16ping@trinity:~$ find /lib/modules/3.13.0-32-generic/ -name "virtio*"
17/lib/modules/3.13.0-32-generic/kernel/drivers/scsi/virtio_scsi.ko
18/lib/modules/3.13.0-32-generic/kernel/drivers/char/hw_random/virtio-rng.ko

18.2. virtio verification in vPFE guest VM

logging into the vPFE VM, more details about virtio can be observed.

  1. logging into vPFE

    1ping@trinity:~$ telnet localhost 8817
    2Trying ::1...
    3Trying 127.0.0.1...
    4Connected to localhost.
    5Escape character is '^]'.
    6Wind River Linux 6.0.0.12 vfp-vmx1 console
    7vfp-vmx1 login: pfe
    8Password:
  2. para-virtualization

    para-virtualization simply means the guest VM is "aware of" the virtualization environment, which can be verified by looking at the vFPC guest VM boot log:

     1/boot/modules/virtio.ko size 0x69a0 at 0x1616000
     2/boot/modules/virtio_pci.ko size 0x6fb8 at 0x161d000
     3/boot/modules/virtio_blk.ko size 0x7988 at 0x1624000
     4/boot/modules/if_vtnet.ko size 0x34f10 at 0x162c000
     5/boot/modules/virtio_console.ko size 0x9740 at 0x1661000
     6
     7virtio_pci0: <VirtIO PCI Network adapter> port 0xc560-0xc57f mem
     8    0xfebf1000-0xfebf1fff irq 10 at device 5.0 on pci0
     9em1: <VirtIO Networking Adapter> on virtio_pci0
    10virtio_pci0: host features: 0x511fffe3
    11    <RingIndirect,NotifyOnEmpty,RxModeExtra,VLanFilter,RxMode,ControlVq,Status,
    12    MrgRxBuf,TxUFO,TxTSOECN,TxTSOv6,TxTSOv4,RxUFO,RxECN,RxTSOv6,RxTSOv4,TxAllGSO,
    13    MacAddress,RxChecksum,TxChecksum>
    14virtio_pci0: negotiated features: 0x110f8020 <RingIndirect,NotifyOnEmpty,
    15    VLanFilter,RxMode,ControlVq,Status,MrgRxBuf,MacAddress>
    16virtio_pci1: <VirtIO PCI Balloon adapter> port 0xc580-0xc59f irq 10 at device
    17    6.0 on pci0
  3. virtio NIC

    To view the emulated PCI devices, use the same lspci command in the vPFE guest VM as we did on the host:

    pfe@vfp-vmx1:~$ lspci
    00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
    00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
    00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
    00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
    00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
    00:02.0 VGA compatible controller: Cirrus Logic GD 5446
    00:03.0 Ethernet controller: Red Hat, Inc Virtio network device         (1)
    00:04.0 Ethernet controller: Red Hat, Inc Virtio network device         (2)
    00:05.0 Ethernet controller: Red Hat, Inc Virtio network device         (3)
    00:06.0 Ethernet controller: Red Hat, Inc Virtio network device         (4)
    00:07.0 Multimedia audio controller: Intel Corporation 82801AA AC'97 Audio Controller (rev 01)
    00:08.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon

    The output captured here matches the virsh command that we demonstrated previously, which lists all VM virtual network (VN) info from outside of the VM (from the host) using the QEMU monitor command info network:

    ping@trinity:~$ sudo virsh qemu-monitor-command vcp-vmx1 --hmp "info network"
    [sudo] password for ping:
    net0: index=0,type=nic,model=e1000,macaddr=02:04:17:01:01:01
     \ hostnet0: index=0,type=tap,fd=18
    net1: index=0,type=nic,model=virtio-net-pci,macaddr=52:54:00:10:91:fe
     \ hostnet1: index=0,type=tap,fd=19
    
    ping@trinity:~$ sudo virsh qemu-monitor-command vfp-vmx1 --hmp "info network"
    net0: index=0,type=nic,model=virtio-net-pci,macaddr=02:04:17:01:01:02   (1)
     \ hostnet0: index=0,type=tap,fd=18
    net1: index=0,type=nic,model=virtio-net-pci,macaddr=52:54:00:db:34:a9   (2)
     \ hostnet1: index=0,type=tap,fd=21
    net2: index=0,type=nic,model=virtio-net-pci,macaddr=02:04:17:01:02:01   (3)
     \ hostnet2: index=0,type=tap,fd=23
    net3: index=0,type=nic,model=virtio-net-pci,macaddr=02:04:17:01:02:02   (4)
     \ hostnet3: index=0,type=tap,fd=25
    1 virtio interface for vPFE external mgmt interface: ext
    2 virtio interface for vPFE internal mgmt interface: int
    3 virtio interface for JUNOS interface ge-0/0/0
    4 virtio interface for JUNOS interface ge-0/0/1

    to get more detail about the virtio emulated PCI device:

    pfe@vfp-vmx1:~$ lspci -vvks 00:03.0
    00:03.0 Ethernet controller: Red Hat, Inc Virtio network device
            Subsystem: Red Hat, Inc Device 0001
            Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
            Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
            Latency: 0
            Interrupt: pin A routed to IRQ 10
            Region 0: I/O ports at c520 [size=32]
            Region 1: Memory at febd1000 (32-bit, non-prefetchable) [size=4K]
            Expansion ROM at feac0000 [disabled] [size=256K]
            Capabilities: <access denied>
            Kernel driver in use: virtio-pci        #<------

    as expected, the virtio-pci driver module is currently in use for these virtio-emulated virtual NICs.

  4. virtio-balloon

    besides NICs, virtio also brings a memory-saving technique called the "balloon", which will be covered in more detail later.

    pfe@vfp-vmx1:~$ lspci -vvks 00:08.0
    00:08.0 Unclassified device [00ff]: Red Hat, Inc Virtio memory balloon
            Subsystem: Red Hat, Inc Device 0005
            Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
            Status: Cap- 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
            Latency: 0
            Interrupt: pin A routed to IRQ 11
            Region 0: I/O ports at c5a0 [size=32]
            Kernel driver in use: virtio-pci

    TODO

19. cpu pinning (affinitization)

19.1. 2 terms

There are 2 related terms regarding CPU pinning:

CPU pinning

CPU pinning is the ability to run a specific VM's virtual CPU (vCPU) on a specific physical CPU (pCPU) of a specific host.

As explained in vcpu essential, a "vCPU" is essentially nothing but a thread running inside the space of a QEMU process. Therefore, pinning a "vCPU" means pinning a thread: restricting that thread to run on a dedicated physical CPU.

Restricting a thread to run on a single CPU avoids the performance cost caused by the cache invalidation that occurs when a thread ceases to execute on one CPU and then recommences execution on a different CPU. This is why CPU pinning usually gains performance, if tuned properly.

CPU affinity

CPU affinity is a scheduler property that "bonds" a process to a given set of CPUs on the system. The Linux scheduler will honor the given CPU affinity and the process will not run on any other CPUs. The CPU affinity is represented as a bitmask, with the lowest order bit corresponding to the first logical CPU and the highest order bit corresponding to the last logical CPU. The masks are typically given in hexadecimal.

A thread's CPU affinity mask determines the set of CPUs on which it is eligible to run. On a multiprocessor system, vCPU pinning can internally be achieved by setting the CPU affinity mask.

It is possible to ensure maximum execution speed for a particular thread by dedicating one CPU to it:

  • setting the affinity mask of that thread to specify the single CPU

  • setting the affinity mask of all other threads to exclude that CPU

for example:

affinity mask   eligible processors
0x00000001      processor #0
0x00000003      processors #0 and #1
0xFFFFFFFF      all processors (#0 through #31)

the Linux scheduler also supports natural CPU affinity: the scheduler attempts to keep processes on the same CPU as long as practical for performance reasons. Therefore, forcing a specific CPU affinity is useful only in certain applications.

Not all CPUs may exist on a given system but a mask may specify more CPUs than are present. A retrieved mask will reflect only the bits that correspond to CPUs physically on the system.

If an invalid mask is given (i.e., one that corresponds to no valid CPUs on the current system) an error is returned.

19.2. 2 virsh commands

There are at least 2 commands in libvirt virsh tool regarding "CPU affinity" feature:

vcpuinfo

used to retrieve the CPU affinity of a running VM.

vcpupin

used to "pin" guest domain virtual CPUs to physical host CPUs.

these 2 virsh commands play a very similar role to the Linux utility taskset, which is used to set or retrieve the CPU affinity of a running process given its PID, or to launch a new command with a given CPU affinity; a quick illustration of taskset follows below.
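
For example (a generic illustration, not specific to VMX; the PID 12345 below is made up):

taskset -pc 12345                 # show the CPU affinity list of an existing process
sudo taskset -pc 8-11 12345       # restrict that process to CPUs 8-11
taskset -c 2 ping 10.85.4.17      # launch a new command already pinned to CPU 2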

19.3. a simple test

here is a simple test to demonstrate this feature:

first, list vcpuinfo BEFORE vcpupin:

vmx1 vRE:
1ping@trinity:~$ sudo virsh vcpuinfo vcp-vmx1
2VCPU:           0
3CPU:            0
4State:          running
5CPU time:       93.2s
6CPU Affinity:   y----------------------------- (1)
vmx1 vFPC:
 1ping@trinity:~$ sudo virsh vcpuinfo vfp-vmx1
 2VCPU:           0
 3CPU:            6
 4State:          running
 5CPU time:       18.1s
 6CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy (1)
 7
 8VCPU:           1
 9CPU:            24
10State:          running
11CPU time:       12.9s
12CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy (1)
13
14VCPU:           2
15CPU:            28
16State:          running
17CPU time:       12.4s
18CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy (1)
19
20VCPU:           3
21CPU:            5
22State:          running
23CPU time:       12.2s
24CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy (1)
1 "CPU Affinity mask". a y on a bit indicate the eligibility of the specific CPU NO. to run current process.

as can be seen, before CPU pinning the 4 vCPUs/threads can be running on any of the physical cores, because the "CPU Affinity" is all y.

This is because by default, libvirt provisions guests using the hypervisor’s default policy. For most hypervisors, the policy is to run guests on any available processing core or CPU.

There are times when an explicit policy may be better, in particular for systems with a NUMA (Non-Uniform Memory Access) architecture. A guest on a NUMA system should be pinned to a processing core so that its memory allocations are always local to the node it is running on. This avoids cross-node memory transfers, which have less bandwidth and can significantly degrade performance. In this setup the pinning will be applied with the following vcpupin and emulatorpin commands (the same ones the installation script generates into cpu_affinitize.sh):

virsh vcpupin vfp-vmx1 0 11
virsh vcpupin vfp-vmx1 1 12
virsh vcpupin vfp-vmx1 2 13
virsh vcpupin vfp-vmx1 3 8
virsh vcpupin vfp-vmx1 4 9
virsh vcpupin vcp-vmx1 0 15
virsh emulatorpin vcp-vmx1 0
virsh emulatorpin vfp-vmx1 0

This is the current mapping between vCPU (thread) and CPU before the pinning is applied (to shorten the capture, the vcpuinfo output of the vmx2 instance is not shown):

vCPU (vmx1)   CPU   vCPU (vmx2)   CPU   Affinity
0 (vcp)       0     0 (vcp)       0     any CPU
0             6     0             27    any CPU
1             24    1             31    any CPU
2             28    2             25    any CPU
3             5     3             30    any CPU

after vcpupin
 1ping@trinity:~$ sudo virsh vcpuinfo vcp-vmx1
 2VCPU:           0
 3CPU:            7
 4State:          running
 5CPU time:       465.7s
 6CPU Affinity:   -------y------------------------
 7
 8ping@trinity:~$ sudo virsh vcpuinfo vfp-vmx1
 9VCPU:           0
10CPU:            0
11State:          running
12CPU time:       571.7s
13CPU Affinity:   y-------------------------------
14
15VCPU:           1
16CPU:            1
17State:          running
18CPU time:       1432.5s
19CPU Affinity:   -y------------------------------
20
21VCPU:           2
22CPU:            2
23State:          running
24CPU time:       1133.6s
25CPU Affinity:   --y-----------------------------
26
27VCPU:           3
28CPU:            3
29State:          running
30CPU time:       517.1s
31CPU Affinity:   ---y----------------------------
vCPU (vmx1)   CPU   vCPU (vmx2)   CPU   Affinity
0 (vcp)       7     0 (vcp)       15    only specified CPU
0             0     0             8     only specified CPU
1             1     1             9     only specified CPU
2             2     2             10    only specified CPU
3             3     3             11    only specified CPU

20. hugepage

20.1. hugepage basic concept

Whenever a process uses some memory, the CPU marks that RAM as used by the process. For efficiency, x86 CPUs allocate RAM in chunks of 4K bytes, each of which is called "a page". Those pages can be swapped to disk, etc.

Since the process address space is virtual, the CPU and the operating system have to remember which page belongs to which process, and where it is stored. This mapping is maintained in the page tables, and the most recently used translations are cached in the TLB (translation lookaside buffer), which is often implemented as CAM (content-addressable memory).

The CAM search key is the virtual address and the search result is a physical address. If the requested address is present in the TLB, the CAM search yields a match quickly and the retrieved physical address can be used to access memory. This is called a TLB hit. If the requested address is not in the TLB, it is a TLB miss, and the translation has to be resolved by the much slower page table walk.

Obviously, the more pages there are, the smaller the fraction of memory the TLB can cover, and the more time the CPU spends resolving addresses that miss the TLB. Increasing the page size (hence the name "hugepage") reduces the number of pages, and eventually improves CPU performance.
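
A rough back-of-the-envelope illustration, using the 16G of memory that the vPFE VM gets later in this doc:

echo $(( 16 * 1024 * 1024 / 4 ))    # 4K pages needed to map 16G: 4194304
echo $(( 16 * 1024 / 2 ))           # 2M pages needed to map 16G: 8192

with 2M hugepages the same amount of memory is covered by 512 times fewer pages (and TLB entries).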

Most current CPU architectures support hugepages, possibly under a different name/term, but they are all the same thing:

  • Huge pages (on Linux)

  • Super Pages (on BSD)

  • Large Pages (on Windows)

20.2. hugepage allocation

The allocation of hugepages should be done at boot time, or as soon as possible after system boot, to prevent memory from becoming fragmented in physical memory. To reserve hugepages at boot time, a parameter is passed to the Linux kernel on the kernel command line.
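
For example, on an Ubuntu host this could be done roughly as below (a sketch only; the numbers should match your deployment, and the VMX setup scripts may already take care of the hugepage configuration):

# /etc/default/grub: reserve 32768 x 2M hugepages at boot time
GRUB_CMDLINE_LINUX="default_hugepagesz=2M hugepagesz=2M hugepages=32768"

# regenerate the grub config and reboot for it to take effect
sudo update-grub
sudo reboot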

x86 by default uses a 4K page size

ping@ubuntu1:~$ getconf PAGESIZE
4096
root@ubuntu1:~# getconf -a | grep -i page
PAGESIZE                           4096
PAGE_SIZE                          4096
_AVPHYS_PAGES                      72309823
_PHYS_PAGES                        99076139

VMX uses a hugepage size of 2M, and the total number of hugepages is configured to be 32K, making it 64G (32768 x 2M) of hugepage memory for the guest VMs.

before hugepage was enabled:

ping@trinity:/images/vmx_20151102.0/build$ cat /proc/meminfo | grep -i huge
AnonHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

after:

root@ubuntu1:~# grep -i page /proc/meminfo
AnonPages:       2244644 kB
PageTables:        17280 kB
AnonHugePages:   2076672 kB
HugePages_Total:   32786            #<------
HugePages_Free:    24969
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

the same info can be printed from the kernel options using the sysctl utility:

ping@ubuntu1:/sys/module/ixgbevf$ sudo sysctl -a | grep -i hugepage
vm.hugepages_treat_as_movable = 0
vm.nr_hugepages = 32786     #<------
vm.nr_hugepages_mempolicy = 32786
vm.nr_overcommit_hugepages = 0
ping@ubuntu1:~$ sysctl vm.nr_hugepages
vm.nr_hugepages = 32786

In Ubuntu, and all the other Linux variants, these parameters are exposed in the below file systems:

  • under /proc/sys/vm/ folder in the /proc file system

  • under /sys/kernel/mm/hugepages/ folder in the /sys file system

    ping@ubuntu1:~$ grep -R "" /proc/sys/vm/*huge*
    /proc/sys/vm/hugepages_treat_as_movable:0
    /proc/sys/vm/hugetlb_shm_group:0
    /proc/sys/vm/nr_hugepages:32768
    /proc/sys/vm/nr_hugepages_mempolicy:32768
    /proc/sys/vm/nr_overcommit_hugepages:0
    ping@ubuntu1:~$ grep -R "" /sys/kernel/mm/hugepages/
    /sys/kernel/mm/hugepages/hugepages-2048kB/nr_overcommit_hugepages:0
    /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages:32768
    /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages_mempolicy:32768
    /sys/kernel/mm/hugepages/hugepages-2048kB/surplus_hugepages:0
    /sys/kernel/mm/hugepages/hugepages-2048kB/resv_hugepages:0
    /sys/kernel/mm/hugepages/hugepages-2048kB/free_hugepages:24951
    ping@ubuntu1:~$
about /proc and /sys

/proc is very special in that it is also a virtual filesystem. It’s sometimes referred to as a process information pseudo-file system. It doesn’t contain 'real' files but runtime system information (e.g. system memory, devices mounted, hardware configuration, etc). For this reason it can be regarded as a control and information centre for the kernel. In fact, quite a lot of system utilities are simply calls to files in this directory. For example, 'lsmod' is the same as 'cat /proc/modules' while 'lspci' is a synonym for 'cat /proc/pci'. By altering files located in this directory you can even read/change kernel parameters (sysctl) while the system is running. refer to [proc_file_system] for more details of proc file system.

sysfs is a feature of the Linux 2.6 kernel that allows kernel code to export information to user processes via an in-memory filesystem. The organization of the filesystem directory hierarchy is strict, and based on the internal organization of kernel data structures. The files that are created in the filesystem are (mostly) ASCII files with (usually) one value per file. These features ensure that the information exported is accurate and easily accessible, making sysfs one of the most intuitive and useful features of the 2.6 kernel.

Another tool to query the same data is hugeadm:

ping@ubuntu1:~$ hugeadm --pool-list
      Size  Minimum  Current  Maximum  Default
   2097152    32768    32768    32768        *

Interestingly, with the --explain option it shows some "explanation" of the output.

ping@ubuntu1:~$ hugeadm --pool-list --explain
      Size  Minimum  Current  Maximum  Default
   2097152    32768    32768    32768        *
Total System Memory: 387016 MB
Mount Point          Options
/run/hugepages/kvm   rw,relatime,mode=775,gid=112
/HugePage_vPFE       rw,relatime
/HugePage_vPFE       rw,relatime
Huge page pools:
      Size  Minimum  Current  Maximum  Default
   2097152    32768    32768    32768        *
Huge page sizes with configured pools:
A /proc/sys/kernel/shmmax value of 33554432 bytes may be sub-optimal. To maximise
shared memory usage, this should be set to the size of the largest shared memory
segment size you want to be able to use. Alternatively, set it to a size matching
the maximum possible allocation size of all huge pages. This can be done
automatically, using the --set-recommended-shmmax option.
The recommended shmmax for your currently allocated huge pages is 68719476736 bytes.
To make shmmax settings persistent, add the following line to /etc/sysctl.conf:
  kernel.shmmax = 68719476736
hugeadm:WARNING: User ping (uid: 1003) is not a member of the hugetlb_shm_group root (gid: 0)!
Note: Permanent swap space should be preferred when dynamic huge page pools are used.

20.3. change hugepages number

To change the number of hugepages at runtime, use the kernel tool sysctl. In this example we change the number of hugepages from 32786 to 32768:

ping@ubuntu1:~$ sudo sysctl vm.nr_hugepages=32768
vm.nr_hugepages = 32768

Now the total number of hugepages has changed, as well as the number of free hugepages:

ping@ubuntu1:~$ grep -i huge /proc/meminfo
AnonHugePages:   2076672 kB
HugePages_Total:   32768        #<------
HugePages_Free:    24951        #<------
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
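
Note that a change made with sysctl in this way does not survive a reboot. Assuming the standard sysctl mechanism (nothing VMX-specific), it can be made persistent like this:

echo "vm.nr_hugepages = 32768" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p       # reload the settings, or simply reboot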

20.4. hugepage and numa

The hugepages are distributed between all NUMA nodes. In this server there are 2 NUMA nodes, so each node was allocated 16384 pages. The interesting part is the number of free hugepages in each node: the hugepages from the first node (node 0) are consumed first (by the VMX instance in our context), so only 8567 pages are left. VMX consumed 7817 pages, which is 15634M of memory.

ping@ubuntu1:~$ cat /sys/devices/system/node/node*/meminfo | grep Huge
Node 0 AnonHugePages:     88064 kB
Node 0 HugePages_Total: 16384   #<------
Node 0 HugePages_Free:   8567
Node 0 HugePages_Surp:      0
Node 1 AnonHugePages:   2050048 kB
Node 1 HugePages_Total: 16384   #<------
Node 1 HugePages_Free:  16384
Node 1 HugePages_Surp:      0

The hugepages from the second node (node1) will be consumed if we start another VMX instance. The hugepage allocation is performed by libvirt "under the hood".
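
If needed, the hugepage pool can also be inspected or resized per NUMA node directly through sysfs (a generic illustration; the VMX installation does not require this manual step):

# show the per-node hugepage counters
cat /sys/devices/system/node/node*/hugepages/hugepages-2048kB/nr_hugepages

# reserve 16384 2M hugepages on node 0 only
echo 16384 | sudo tee /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages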

20.5. a simple test of hugepage

currently 24951 hugepages are free:

ping@ubuntu1:~$ grep -i huge /proc/meminfo
AnonHugePages:   2076672 kB
HugePages_Total:   32768
HugePages_Free:    24951        #<------
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

if we destroy the vPFE VM:

ping@ubuntu1:~$ sudo virsh destroy 36
Domain 36 destroyed

now all hugepages were claimed back:

ping@ubuntu1:~$ grep -i huge /proc/meminfo
AnonHugePages:   2068480 kB
HugePages_Total:   32768
HugePages_Free:    32768        #<------
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

start VM:

ping@ubuntu1:~$ sudo virsh start vhepe-vfp
Domain vhepe-vfp started

the free hugepages are again consumed by the VM.

ping@ubuntu1:~$ grep -i huge /proc/meminfo
AnonHugePages:   2076672 kB
HugePages_Total:   32768
HugePages_Free:    24951        #<------
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

in total, (32768 - 24951) x 2048K = 15634M of hugepage memory was allocated to the vPFE VM.
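
The arithmetic can be double-checked in one line:

echo $(( (32768 - 24951) * 2048 / 1024 ))M    # -> 15634M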

refer to huge page for more details on this topic.

appendix

reference

VMX package folder structure

  1ping@trinity:~/vmx_20151102.0$ tree
  2.
  3├── config                          #<------all config files that will be read
  4│   ├── samples                             and parsed by the installation script
  5│   │   ├── vmx.conf.sriov
  6│   │   ├── vmx.conf.virtio
  7│   │   └── vmx-galaxy.conf
  8│   ├── vmx.conf                    #<------the main config file used
  9│   └── vmx-junosdev.conf
 10├── docs
 11│   ├── mx86_on_openstack_release_notes.pdf
 12│   └── VMX_Release_Notes_Installation_Guide_Beta.pdf
 13├── drivers
 14│   ├── galaxy
 15│   │   ├── Makefile
 16│   │   └── network_add.c
 17│   └── ixgbe-3.19.1                #<------ixgbe driver source code
 18│       ├── COPYING
 19│       ├── ixgbe.7
 20│       ├── ixgbe.spec
 21│       ├── pci.updates
 22│       ├── README
 23│       ├── scripts
 24│       │   └── set_irq_affinity
 25│       ├── src
 26│       │   ├── ixgbe.7.gz
 27│       │   ├── ixgbe_82598.c
 28│       │   ├── ixgbe_82598.h
 29│       │   ├── ixgbe_82599.c
 30│       │   ├── ixgbe_82599.h
 31│       │   ├── ixgbe_api.c
 32│       │   ├── ixgbe_api.h
 33│       │   ├── ixgbe_common.c
 34│       │   ├── ixgbe_common.h
 35│       │   ├── ixgbe_dcb_82598.c
 36│       │   ├── ixgbe_dcb_82598.h
 37│       │   ├── ixgbe_dcb_82599.c
 38│       │   ├── ixgbe_dcb_82599.h
 39│       │   ├── ixgbe_dcb.c
 40│       │   ├── ixgbe_dcb.h
 41│       │   ├── ixgbe_dcb_nl.c
 42│       │   ├── ixgbe_debugfs.c
 43│       │   ├── ixgbe_ethtool.c
 44│       │   ├── ixgbe_fcoe.c
 45│       │   ├── ixgbe_fcoe.h
 46│       │   ├── ixgbe.h
 47│       │   ├── ixgbe_lib.c
 48│       │   ├── ixgbe_main.c
 49│       │   ├── ixgbe_mbx.c
 50│       │   ├── ixgbe_mbx.h
 51│       │   ├── ixgbe.mod.c
 52│       │   ├── ixgbe_osdep2.h
 53│       │   ├── ixgbe_osdep.h
 54│       │   ├── ixgbe_param.c
 55│       │   ├── ixgbe_phy.c
 56│       │   ├── ixgbe_phy.h
 57│       │   ├── ixgbe_procfs.c
 58│       │   ├── ixgbe_ptp.c
 59│       │   ├── ixgbe_sriov.c
 60│       │   ├── ixgbe_sriov.h
 61│       │   ├── ixgbe_sysfs.c
 62│       │   ├── ixgbe_type.h
 63│       │   ├── ixgbe_x540.c
 64│       │   ├── ixgbe_x540.h
 65│       │   ├── kcompat.c
 66│       │   ├── kcompat_ethtool.c
 67│       │   ├── kcompat.h
 68│       │   ├── Makefile
 69│       │   ├── modules.order
 70│       │   ├── Module.supported
 71│       │   └── Module.symvers
 72│       └── SUMS
 73├── env
 74│   ├── ubuntu_sriov.env
 75│   └── ubuntu_virtio.env
 76├── images                          #<------all VM images that will be
 77│   ├── metadata_usb.img                    loaded by the KVM-qumu hypervisor
 78│   ├── vFPC-20151102.img
 79│   └── vmxhdd.img
 80├── scripts                         #<------installation scripts
 81│   ├── common
 82│   │   ├── get_numa.sh
 83│   │   ├── get_pci.sh
 84│   │   ├── vmx_common_utils.sh
 85│   │   ├── vmx_configure.py        #<------python script to parse YAML config file
 86│   │   ├── vmx_env.sh                      and generate the libvirt friendly XML files
 87│   │   ├── vmx_galaxy.py
 88│   │   ├── vmx_img_cli.sh
 89│   │   └── vmx_preinstall_checks.sh
 90│   ├── junosdev-bind
 91│   │   ├── vmx_brctl.sh
 92│   │   ├── vmx-junosdev-bind.py
 93│   │   ├── vmx_linkctl.sh
 94│   │   └── vmx_vhost_pin.sh
 95│   ├── kvm
 96│   │   ├── common
 97│   │   │   ├── vmx_kvm_bringup.sh
 98│   │   │   ├── vmx_kvm_cleanup.sh
 99│   │   │   ├── vmx_kvm_system_setup.sh
100│   │   │   └── vmx_kvm_verify.sh
101│   │   ├── sriov
102│   │   │   └── vmx_kvm_sriov.sh
103│   │   └── virtio
104│   │       └── vmx_kvm_virtio.sh
105│   └── templates
106│       ├── _br_ext-ref.xml
107│       ├── _br_int-ref.xml
108│       ├── _vPFE-ref-ubuntu.xml
109│       ├── _vPFE-ref.xml
110│       └── _vRE-ref.xml
111└── vmx.sh                          #<------the main script to be executed
112
11318 directories, 91 files

VMX installation script generated XML file

generated vRE xml

build/vmx1/xml/vRE-generated.xml:

  1ping@trinity:~$ cat xml.backup/vRE-generated.xml
  2<domain type="kvm">
  3
  4  <name>vcp-vmx1</name>                             #<------domain name (VM)
  5
  6  <memory unit="Mb">2048</memory>                   #<------memory: 2G
  7
  8  <vcpu placement="static">1</vcpu>                 #<------1 CPU for RE
  9
 10  <cputune>                                         #<------finetune: vcpupin
 11    <vcpupin cpuset="0" vcpu="0" />
 12  </cputune>
 13
 14  <resource>
 15    <partition>/machine</partition>
 16  </resource>
 17
 18  <sysinfo type="smbios">
 19    <bios>
 20      <entry name="vendor">Juniper</entry>
 21    </bios>
 22    <system>
 23      <entry name="manufacturer">VMX</entry>
 24      <entry name="product">VM-vcp_vmx1-161-re-0</entry>
 25      <entry name="version">0.1.0</entry>
 26    </system>
 27  </sysinfo>
 28
 29  <os>                                              #<------os info
 30    <smbios mode="sysinfo" />
 31    <type arch="x86_64" machine="pc-0.13">hvm</type>
 32    <boot dev="hd" />                               #<------boot sequence
 33  </os>
 34
 35  <features>                                        #<------HW featured supported
 36    <acpi />
 37    <apic />
 38    <pae />
 39  </features>
 40
 41  <cpu mode="host-model">
 42    <topology cores="1" sockets="1" threads="1" />
 43  </cpu>
 44
 45  <clock offset="utc" />
 46
 47  <on_poweroff>destroy</on_poweroff>
 48
 49  <on_reboot>restart</on_reboot>
 50
 51  <on_crash>restart</on_crash>
 52
 53  <devices>
 54    <emulator>/usr/bin/qemu-system-x86_64</emulator>
 55
 56    <disk device="disk" type="file">
 57      <driver cache="directsync" name="qemu" type="qcow2" />
 58      <source file="/images/vmx_20151102.0/build/vmx1/images/jinstall64-vmx-15.1F-20151104.0-domestic.img" />
 59      <target bus="ide" dev="hda" />
 60      <address bus="0" controller="0" target="0" type="drive" unit="0" />
 61    </disk>
 62
 63    <disk device="disk" type="file">
 64      <driver cache="directsync" name="qemu" type="qcow2" />
 65      <source file="/images/vmx_20151102.0/build/vmx1/images/vmxhdd.img" />
 66      <target bus="ide" dev="hdb" />
 67      <address bus="0" controller="0" target="0" type="drive" unit="1" />
 68    </disk>
 69
 70    <disk device="disk" type="file">
 71      <driver cache="directsync" name="qemu" type="raw" />
 72      <source file="/images/vmx_20151102.0/images/metadata_usb.img" />
 73      <target bus="usb" dev="sda" />
 74    </disk>
 75
 76    <controller index="0" type="usb">               #<------pci device
 77      <address bus="0x00" domain="0x0000" function="0x2" slot="0x01" type="pci" />
 78    </controller>
 79    <controller index="0" type="ide">
 80      <address bus="0x00" domain="0x0000" function="0x1" slot="0x01" type="pci" />
 81    </controller>
 82    <controller index="0" model="pci-root" type="pci" />
 83
 84    <interface type="bridge">               #<------generate bridge interface
 85      <source bridge="br-ext" />            #<------named vcp_ext-vmx1
 86      <target dev="vcp_ext-vmx1" />         #<------bound to br-ext bridge
 87      <model type="e1000" />                #<------soft emulation
 88      <mac address="02:12:DE:C0:DE:22" />
 89    </interface>
 90
 91    <interface type="bridge">               #<------internal bridge
 92      <source bridge="br-int-vmx1" />
 93      <target dev="vcp_int-vmx1" />
 94      <model type="virtio" />               #<------virtio emulation
 95    </interface>
 96
 97    <serial type="tcp">
 98      <source host="127.0.0.1" mode="bind" service="8896" />
 99      <protocol type="telnet" />
100      <target port="0" />
101    </serial>
102
103    <console type="tcp">
104      <source host="127.0.0.1" mode="bind" service="8896" />
105      <protocol type="telnet" />
106      <target port="0" type="serial" />
107    </console>
108
109    <input bus="usb" type="tablet" />
110    <input bus="ps2" type="mouse" />
111    <input bus="ps2" type="keyboard" />
112
113    <graphics autoport="yes" listen="127.0.0.1" port="-1" type="vnc">
114                                          #<------automatic vnc server port
115      <listen address="127.0.0.1" type="address" />
116    </graphics>
117
118    <sound model="ac97">
119      <address bus="0x00" domain="0x0000" function="0x0" slot="0x04" type="pci" />
120    </sound>
121
122    <video>
123      <model heads="1" type="cirrus" vram="9216" />
124      <address bus="0x00" domain="0x0000" function="0x0" slot="0x02" type="pci" />
125    </video>
126
127    <memballoon model="virtio">            #<------memory balloon location
128      <address bus="0x00" domain="0x0000" function="0x0" slot="0x06" type="pci" />
129    </memballoon>
130
131  </devices>
132</domain>

generated vPFE xml

build/vmx1/xml/vPFE-generated.xml:

  1ping@trinity:~$ cat xml.backup/vPFE-generated.xml
  2<domain type="kvm">
  3
  4  <name>vfp-vmx1</name>
  5
  6  <memory unit="MB">16384</memory>
  7
  8  <memoryBacking>
  9    <nosharepages />
 10    <hugepages />
 11  </memoryBacking>
 12
 13  <vcpu placement="static">4</vcpu>
 14
 15  <numatune>
 16    <memory mode="strict" nodeset="1" />
 17  </numatune>
 18
 19  <os>
 20    <type arch="x86_64" machine="pc-i440fx-trusty">hvm</type>
 21    <boot dev="hd" />
 22  </os>
 23
 24  <features>
 25    <acpi />
 26  </features>
 27
 28  <cpu mode="host-model">
 29    <topology cores="4" sockets="1" threads="1" />
 30  </cpu>
 31
 32  <clock offset="utc" />
 33  <on_poweroff>destroy</on_poweroff>
 34  <on_reboot>restart</on_reboot>
 35  <on_crash>restart</on_crash>
 36
 37  <devices>
 38    <emulator>/usr/bin/qemu-system-x86_64</emulator>
 39
 40    <disk device="disk" type="file">
 41      <driver cache="directsync" name="qemu" type="raw" />
 42      <source file="/images/vmx_20151102.0/build/vmx1/images/vFPC-20151102.img" />
 43      <target bus="ide" dev="hda" />
 44    </disk>
 45
 46    <controller index="0" model="pci-root" type="pci" />
 47
 48    <interface type="bridge">
 49      <source bridge="br-ext" />
 50      <target dev="vfp_ext-vmx1" />
 51      <model type="virtio" />
 52      <alias name="net0" />
 53      <address bus="0x00" domain="0x0000" function="0x0" slot="0x03" type="pci" />
 54      <mac address="02:12:DE:C0:DE:23" />
 55    </interface>
 56
 57    <interface type="bridge">
 58      <source bridge="br-int-vmx1" />
 59      <target dev="vfp_int-vmx1" />
 60      <model type="virtio" />
 61      <alias name="net0" />
 62      <address bus="0x00" domain="0x0000" function="0x0" slot="0x04" type="pci" />
 63    </interface>
 64
 65    <serial type="tcp">
 66      <source host="127.0.0.1" mode="bind" service="8897" />
 67      <protocol type="telnet" />
 68      <target port="0" />
 69    </serial>
 70
 71    <console type="tcp">
 72      <source host="127.0.0.1" mode="bind" service="8897" />
 73      <protocol type="telnet" />
 74      <target port="0" type="serial" />
 75    </console>
 76
 77    <input bus="usb" type="tablet" />
 78    <input bus="ps2" type="mouse" />
 79    <input bus="ps2" type="keyboard" />
 80    <graphics autoport="yes" listen="127.0.0.1" port="-1" type="vnc">
 81      <listen address="127.0.0.1" type="address" />
 82    </graphics>
 83    <sound model="ac97">
 84    </sound>
 85    <video>
 86      <model heads="1" type="cirrus" vram="9216" />
 87    </video>
 88
 89    <memballoon model="virtio">
 90    </memballoon>
 91
 92    <interface managed="yes" type="hostdev">        #<------VT-d SR-IOV VF assignment
 93      <mac address="02:16:0A:0E:FF:31" />           #<------MAC appeared to VM
 94      <source>                                      #<------VF to be assigned
 95        <address bus="0x23" domain="0x0000" function="0x0" slot="0x10" type="pci" />
 96      </source>                                     #<------23:10.0 is p3p1 vf0
 97    </interface>
 98
 99    <interface managed="yes" type="hostdev">
100      <mac address="02:16:0A:0E:FF:32" />
101      <source>
102        <address bus="0x06" domain="0x0000" function="0x0" slot="0x10" type="pci" />
103      </source>                                     #<------06:10.0 is p2p1 vf0
104    </interface>
105  </devices>
106
107</domain>

generated virtual network XML

br-ext-generated.xml:
 1<network>
 2  <name>br-ext</name>
 3  <forward mode="route" />
 4  <bridge delay="0" name="br-ext" stp="on" />
 5  <mac address="52:54:00:9f:a0:77" />
 6  <ip address="10.85.4.17" netmask="255.255.255.128">
 7    <dhcp>
 8      <host ip="10.85.4.105" mac="02:04:17:01:01:01" name="vcp-vmx1" />
 9      <host ip="10.85.4.106" mac="02:04:17:01:01:02" name="vfp-vmx1" />
10    </dhcp>
11  </ip>
12</network>
br-int-generated.xml:
1<network>
2  <name>br-int-vmx1</name>
3  <bridge delay="0" name="br-int-vmx1" stp="on" />
4</network>

generated shell scripts

build/vmx1/xml/cpu_affinitize.sh:
1virsh vcpupin vfp-vmx1 0 11
2virsh vcpupin vfp-vmx1 1 12
3virsh vcpupin vfp-vmx1 2 13
4virsh vcpupin vfp-vmx1 3 8
5virsh vcpupin vfp-vmx1 4 9
6virsh vcpupin vcp-vmx1 0 15
7virsh emulatorpin vcp-vmx1 0
8virsh emulatorpin vfp-vmx1 0
build/vmx1/xml/vfconfig-generated.sh
 1#Handling interface p3p1
 2ifconfig p3p1 up
 3sleep 2
 4ifconfig p3p1 promisc
 5ifconfig p3p1 allmulti
 6ifconfig p3p1 mtu 2000
 7ip link set p3p1 vf 0 mac 02:04:17:01:02:01
 8ip link set p3p1 vf 0 rate 10000
 9echo 0000:23:10.0 > /sys/bus/pci/devices/0000:23:10.0/driver/unbind
10echo 0000:23:10.0 >> /sys/bus/pci/drivers/pci-stub/bind
11ip link set p3p1 vf 0 spoofchk off
12
13#Handling interface p2p1
14ifconfig p2p1 up
15sleep 2
16ifconfig p2p1 promisc
17ifconfig p2p1 allmulti
18ifconfig p2p1 mtu 2000
19ip link set p2p1 vf 0 mac 02:04:17:01:02:02
20ip link set p2p1 vf 0 rate 10000
21echo 0000:06:10.0 > /sys/bus/pci/devices/0000:06:10.0/driver/unbind
22echo 0000:06:10.0 >> /sys/bus/pci/drivers/pci-stub/bind
23ip link set p2p1 vf 0 spoofchk off

generated files for VIRTIO

compared with SR-IOV, the generated files for VIRTIO remain the same except:

  1. the "interface" part for vPFE VM now looks:

     1......
     2<interface type="network">
     3  <mac address="02:04:17:01:02:01" />
     4  <source network="default" />
     5  <model type="virtio" />
     6  <target dev="ge-0.0.0-vmx1" />
     7</interface>
     8<interface type="network">
     9  <mac address="02:04:17:01:02:02" />
    10  <source network="default" />
    11  <model type="virtio" />
    12  <target dev="ge-0.0.1-vmx1" />
    13</interface>
  2. since there is no concept of VF for VIRTIO virtual interface, there is no script generated for VF configuration.

dumpxml vcp-vmx1

  1ping@trinity:/virtualization/vmx1$ cat vRE-generated-all.xml
  2<domain type='kvm' id='2'>
  3  <name>vcp-vmx1</name>
  4  <uuid>956ce015-cf26-4752-8679-6925f512870c</uuid>
  5  <memory unit='KiB'>2000896</memory>
  6  <currentMemory unit='KiB'>2000000</currentMemory>
  7  <vcpu placement='static'>1</vcpu>
  8  <cputune>
  9    <vcpupin vcpu='0' cpuset='15'/>
 10    <emulatorpin cpuset='0'/>
 11  </cputune>
 12  <resource>
 13    <partition>/machine</partition>
 14  </resource>
 15    <sysinfo type='smbios'>
 16      <bios>
 17        <entry name='vendor'>Juniper</entry>
 18      </bios>
 19      <system>
 20        <entry name='manufacturer'>VMX</entry>
 21        <entry name='product'>VM-vcp_vmx1-161-re-0</entry>
 22        <entry name='version'>0.1.0</entry>
 23      </system>
 24    </sysinfo>
 25  <os>
 26    <type arch='x86_64' machine='pc-0.13'>hvm</type>
 27    <boot dev='hd'/>
 28    <smbios mode='sysinfo'/>
 29  </os>
 30  <features>
 31    <acpi/>
 32    <apic/>
 33    <pae/>
 34  </features>
 35  <cpu mode='host-model'>
 36    <model fallback='allow'/>
 37    <topology sockets='1' cores='1' threads='1'/>
 38  </cpu>
 39  <clock offset='utc'/>
 40  <on_poweroff>destroy</on_poweroff>
 41  <on_reboot>restart</on_reboot>
 42  <on_crash>restart</on_crash>
 43  <devices>
 44    <emulator>/usr/bin/qemu-system-x86_64</emulator>
 45    <disk type='file' device='disk'>
 46      <driver name='qemu' type='qcow2' cache='directsync'/>
 47      <source file='/virtualization/images/vmx_20151102.0/build/vmx1/images/jinstall64-vmx-15.1F-20151104.0-domestic.img'/>
 48      <backingStore/>
 49      <target dev='hda' bus='ide'/>
 50      <alias name='ide0-0-0'/>
 51      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
 52    </disk>
 53    <disk type='file' device='disk'>
 54      <driver name='qemu' type='qcow2' cache='directsync'/>
 55      <source file='/virtualization/images/vmx_20151102.0/build/vmx1/images/vmxhdd.img'/>
 56      <backingStore/>
 57      <target dev='hdb' bus='ide'/>
 58      <alias name='ide0-0-1'/>
 59      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
 60    </disk>
 61    <disk type='file' device='disk'>
 62      <driver name='qemu' type='raw' cache='directsync'/>
 63      <source file='/virtualization/images/vmx_20151102.0/images/metadata_usb.img'/>
 64      <backingStore/>
 65      <target dev='sda' bus='usb'/>
 66      <alias name='usb-disk0'/>
 67    </disk>
 68    <controller type='usb' index='0'>
 69      <alias name='usb0'/>
 70      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
 71    </controller>
 72    <controller type='ide' index='0'>
 73      <alias name='ide0'/>
 74      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
 75    </controller>
 76    <controller type='pci' index='0' model='pci-root'>
 77      <alias name='pci.0'/>
 78    </controller>
 79    <interface type='bridge'>
 80      <mac address='02:04:17:01:01:01'/>
 81      <source bridge='br-ext'/>
 82      <target dev='vcp_ext-vmx1'/>
 83      <model type='e1000'/>
 84      <alias name='net0'/>
 85      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
 86    </interface>
 87    <interface type='bridge'>
 88      <mac address='52:54:00:7a:91:1a'/>
 89      <source bridge='br-int-vmx1'/>
 90      <target dev='vcp_int-vmx1'/>
 91      <model type='virtio'/>
 92      <alias name='net1'/>
 93      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
 94    </interface>
 95    <serial type='tcp'>
 96      <source mode='bind' host='127.0.0.1' service='8816'/>
 97      <protocol type='telnet'/>
 98      <target port='0'/>
 99      <alias name='serial0'/>
100    </serial>
101    <console type='tcp'>
102      <source mode='bind' host='127.0.0.1' service='8816'/>
103      <protocol type='telnet'/>
104      <target type='serial' port='0'/>
105      <alias name='serial0'/>
106    </console>
107    <input type='tablet' bus='usb'>
108      <alias name='input0'/>
109    </input>
110    <input type='mouse' bus='ps2'/>
111    <input type='keyboard' bus='ps2'/>
112    <graphics type='vnc' port='5900' autoport='yes' listen='127.0.0.1'>
113      <listen type='address' address='127.0.0.1'/>
114    </graphics>
115    <sound model='ac97'>
116      <alias name='sound0'/>
117      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
118    </sound>
119    <video>
120      <model type='cirrus' vram='9216' heads='1'/>
121      <alias name='video0'/>
122      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
123    </video>
124    <memballoon model='virtio'>
125      <alias name='balloon0'/>
126      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
127    </memballoon>
128  </devices>
129</domain>

dumpxml vfp-vmx1

  1ping@trinity:/virtualization/vmx1$ cat vPFE-generated-all.xml
  2<domain type='kvm' id='3'>
  3  <name>vfp-vmx1</name>
  4  <uuid>399d395f-d583-42e5-ace9-86118b346565</uuid>
  5  <memory unit='KiB'>16000000</memory>
  6  <currentMemory unit='KiB'>16000000</currentMemory>
  7  <memoryBacking>
  8    <hugepages/>
  9    <nosharepages/>
 10  </memoryBacking>
 11  <vcpu placement='static'>4</vcpu>
 12  <cputune>
 13    <vcpupin vcpu='0' cpuset='11'/>
 14    <vcpupin vcpu='1' cpuset='12'/>
 15    <vcpupin vcpu='2' cpuset='13'/>
 16    <vcpupin vcpu='3' cpuset='8'/>
 17    <emulatorpin cpuset='0'/>
 18  </cputune>
 19  <numatune>
 20    <memory mode='strict' nodeset='1'/>
 21  </numatune>
 22  <resource>
 23    <partition>/machine</partition>
 24  </resource>
 25  <os>
 26    <type arch='x86_64' machine='pc-i440fx-trusty'>hvm</type>
 27    <boot dev='hd'/>
 28  </os>
 29  <features>
 30    <acpi/>
 31  </features>
 32  <cpu mode='host-model'>
 33    <model fallback='allow'/>
 34    <topology sockets='1' cores='4' threads='1'/>
 35  </cpu>
 36  <clock offset='utc'/>
 37  <on_poweroff>destroy</on_poweroff>
 38  <on_reboot>restart</on_reboot>
 39  <on_crash>restart</on_crash>
 40  <devices>
 41    <emulator>/usr/bin/qemu-system-x86_64</emulator>
 42    <disk type='file' device='disk'>
 43      <driver name='qemu' type='raw' cache='directsync'/>
 44      <source file='/virtualization/images/vmx_20151102.0/build/vmx1/images/vFPC-20151102.img'/>
 45      <backingStore/>
 46      <target dev='hda' bus='ide'/>
 47      <alias name='ide0-0-0'/>
 48      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
 49    </disk>
 50    <controller type='pci' index='0' model='pci-root'>
 51      <alias name='pci.0'/>
 52    </controller>
 53    <controller type='usb' index='0'>
 54      <alias name='usb0'/>
 55      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
 56    </controller>
 57    <controller type='ide' index='0'>
 58      <alias name='ide0'/>
 59      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
 60    </controller>
 61    <interface type='bridge'>
 62      <mac address='02:04:17:01:01:02'/>
 63      <source bridge='br-ext'/>
 64      <target dev='vfp_ext-vmx1'/>
 65      <model type='virtio'/>
 66      <alias name='net0'/>
 67      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
 68    </interface>
 69    <interface type='bridge'>
 70      <mac address='52:54:00:9c:b3:f2'/>
 71      <source bridge='br-int-vmx1'/>
 72      <target dev='vfp_int-vmx1'/>
 73      <model type='virtio'/>
 74      <alias name='net1'/>
 75      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
 76    </interface>
 77    <interface type='hostdev' managed='yes'>
 78      <mac address='02:04:17:01:02:01'/>
 79      <driver name='kvm'/>
 80      <source>
 81        <address type='pci' domain='0x0000' bus='0x23' slot='0x10' function='0x0'/>
 82      </source>
 83      <alias name='hostdev0'/>
 84      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' function='0x0'/>
 85    </interface>
 86    <interface type='hostdev' managed='yes'>
 87      <mac address='02:04:17:01:02:02'/>
 88      <driver name='kvm'/>
 89      <source>
 90        <address type='pci' domain='0x0000' bus='0x06' slot='0x10' function='0x0'/>
 91      </source>
 92      <alias name='hostdev1'/>
 93      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
 94    </interface>
 95    <serial type='tcp'>
 96      <source mode='bind' host='127.0.0.1' service='8817'/>
 97      <protocol type='telnet'/>
 98      <target port='0'/>
 99      <alias name='serial0'/>
100    </serial>
101    <console type='tcp'>
102      <source mode='bind' host='127.0.0.1' service='8817'/>
103      <protocol type='telnet'/>
104      <target type='serial' port='0'/>
105      <alias name='serial0'/>
106    </console>
107    <input type='tablet' bus='usb'>
108      <alias name='input0'/>
109    </input>
110    <input type='mouse' bus='ps2'/>
111    <input type='keyboard' bus='ps2'/>
112    <graphics type='vnc' port='5901' autoport='yes' listen='127.0.0.1'>
113      <listen type='address' address='127.0.0.1'/>
114    </graphics>
115    <sound model='ac97'>
116      <alias name='sound0'/>
117      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
118    </sound>
119    <video>
120      <model type='cirrus' vram='9216' heads='1'/>
121      <alias name='video0'/>
122      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
123    </video>
124    <memballoon model='virtio'>
125      <alias name='balloon0'/>
126      <address type='pci' domain='0x0000' bus='0x00' slot='0x08' function='0x0'/>
127    </memballoon>
128  </devices>
129</domain>
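
note that the two XML files above came from the generated files used to define the domains; once the domains are running, essentially the same XML can also be dumped live from libvirt (minor runtime-only attributes such as the domain id may differ). a minimal sketch, output file names are just examples:

sudo virsh dumpxml vcp-vmx1 > /tmp/vcp-vmx1-live.xml    # runtime XML of the vRE domain
sudo virsh dumpxml vfp-vmx1 > /tmp/vfp-vmx1-live.xml    # runtime XML of the vPFE domain
diff /tmp/vfp-vmx1-live.xml vPFE-generated-all.xml      # compare with the generated file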

virsh capabilities complete output

ping@trinity:~$ sudo virsh capabilities
<capabilities>

  <host>
    <uuid>36373931-3138-5553-4534-333739575353</uuid>
    <cpu>
      <arch>x86_64</arch>
      <model>SandyBridge</model>
      <vendor>Intel</vendor>
      <topology sockets='1' cores='8' threads='1'/>
      <feature name='invtsc'/>
      <feature name='erms'/>
      <feature name='smep'/>
      <feature name='fsgsbase'/>
      <feature name='pdpe1gb'/>
      <feature name='rdrand'/>
      <feature name='f16c'/>
      <feature name='osxsave'/>
      <feature name='dca'/>
      <feature name='pcid'/>
      <feature name='pdcm'/>
      <feature name='xtpr'/>
      <feature name='tm2'/>
      <feature name='est'/>
      <feature name='smx'/>
      <feature name='vmx'/>
      <feature name='ds_cpl'/>
      <feature name='monitor'/>
      <feature name='dtes64'/>
      <feature name='pbe'/>
      <feature name='tm'/>
      <feature name='ht'/>
      <feature name='ss'/>
      <feature name='acpi'/>
      <feature name='ds'/>
      <feature name='vme'/>
      <pages unit='KiB' size='4'/>
      <pages unit='KiB' size='2048'/>
    </cpu>
    <power_management>
      <suspend_disk/>
      <suspend_hybrid/>
    </power_management>
    <migration_features>
      <live/>
      <uri_transports>
        <uri_transport>tcp</uri_transport>
      </uri_transports>
    </migration_features>
    <topology>
      <cells num='4'>
        <cell id='0'>
          <memory unit='KiB'>132067432</memory>
          <pages unit='KiB' size='4'>33016858</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <distances>
            <sibling id='0' value='10'/>
            <sibling id='1' value='21'/>
            <sibling id='2' value='21'/>
            <sibling id='3' value='21'/>
          </distances>
          <cpus num='8'>
            <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
            <cpu id='1' socket_id='0' core_id='1' siblings='1'/>
            <cpu id='2' socket_id='0' core_id='2' siblings='2'/>
            <cpu id='3' socket_id='0' core_id='3' siblings='3'/>
            <cpu id='4' socket_id='0' core_id='4' siblings='4'/>
            <cpu id='5' socket_id='0' core_id='5' siblings='5'/>
            <cpu id='6' socket_id='0' core_id='6' siblings='6'/>
            <cpu id='7' socket_id='0' core_id='7' siblings='7'/>
          </cpus>
        </cell>
        <cell id='1'>
          <memory unit='KiB'>132116676</memory>
          <pages unit='KiB' size='4'>33029169</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <distances>
            <sibling id='0' value='21'/>
            <sibling id='1' value='10'/>
            <sibling id='2' value='21'/>
            <sibling id='3' value='21'/>
          </distances>
          <cpus num='8'>
            <cpu id='8' socket_id='1' core_id='0' siblings='8'/>
            <cpu id='9' socket_id='1' core_id='1' siblings='9'/>
            <cpu id='10' socket_id='1' core_id='2' siblings='10'/>
            <cpu id='11' socket_id='1' core_id='3' siblings='11'/>
            <cpu id='12' socket_id='1' core_id='4' siblings='12'/>
            <cpu id='13' socket_id='1' core_id='5' siblings='13'/>
            <cpu id='14' socket_id='1' core_id='6' siblings='14'/>
            <cpu id='15' socket_id='1' core_id='7' siblings='15'/>
          </cpus>
        </cell>
        <cell id='2'>
          <memory unit='KiB'>132116676</memory>
          <pages unit='KiB' size='4'>33029169</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <distances>
            <sibling id='0' value='21'/>
            <sibling id='1' value='21'/>
            <sibling id='2' value='10'/>
            <sibling id='3' value='21'/>
          </distances>
          <cpus num='8'>
            <cpu id='16' socket_id='2' core_id='0' siblings='16'/>
            <cpu id='17' socket_id='2' core_id='1' siblings='17'/>
            <cpu id='18' socket_id='2' core_id='2' siblings='18'/>
            <cpu id='19' socket_id='2' core_id='3' siblings='19'/>
            <cpu id='20' socket_id='2' core_id='4' siblings='20'/>
            <cpu id='21' socket_id='2' core_id='5' siblings='21'/>
            <cpu id='22' socket_id='2' core_id='6' siblings='22'/>
            <cpu id='23' socket_id='2' core_id='7' siblings='23'/>
          </cpus>
        </cell>
        <cell id='3'>
          <memory unit='KiB'>132116616</memory>
          <pages unit='KiB' size='4'>33029154</pages>
          <pages unit='KiB' size='2048'>0</pages>
          <distances>
            <sibling id='0' value='21'/>
            <sibling id='1' value='21'/>
            <sibling id='2' value='21'/>
            <sibling id='3' value='10'/>
          </distances>
          <cpus num='8'>
            <cpu id='24' socket_id='3' core_id='0' siblings='24'/>
            <cpu id='25' socket_id='3' core_id='1' siblings='25'/>
            <cpu id='26' socket_id='3' core_id='2' siblings='26'/>
            <cpu id='27' socket_id='3' core_id='3' siblings='27'/>
            <cpu id='28' socket_id='3' core_id='4' siblings='28'/>
            <cpu id='29' socket_id='3' core_id='5' siblings='29'/>
            <cpu id='30' socket_id='3' core_id='6' siblings='30'/>
            <cpu id='31' socket_id='3' core_id='7' siblings='31'/>
          </cpus>
        </cell>
      </cells>
    </topology>
    <secmodel>
      <model>none</model>
      <doi>0</doi>
    </secmodel>
    <secmodel>
      <model>dac</model>
      <doi>0</doi>
      <baselabel type='kvm'>+0:+0</baselabel>
      <baselabel type='qemu'>+0:+0</baselabel>
    </secmodel>
  </host>

  <guest>
    <os_type>hvm</os_type>
    <arch name='i686'>
      <wordsize>32</wordsize>
      <emulator>/usr/bin/qemu-system-i386</emulator>
      <machine canonical='pc-i440fx-trusty' maxCpus='255'>pc</machine>
      <machine maxCpus='255'>pc-0.12</machine>
      <machine maxCpus='255'>pc-1.3</machine>
      <machine maxCpus='255'>pc-q35-1.6</machine>
      <machine canonical='pc-1.0-qemu-kvm' maxCpus='255'>pc-1.0-precise</machine>
      <machine maxCpus='255'>pc-q35-1.5</machine>
      <machine maxCpus='1'>xenpv</machine>
      <machine maxCpus='255'>pc-i440fx-1.6</machine>
      <machine maxCpus='255'>pc-i440fx-1.7</machine>
      <machine canonical='pc-i440fx-1.5-qemu-kvm' maxCpus='255'>pc-i440fx-1.5-saucy</machine>
      <machine maxCpus='255'>pc-0.11</machine>
      <machine maxCpus='255'>pc-0.10</machine>
      <machine maxCpus='255'>pc-1.2</machine>
      <machine maxCpus='1'>isapc</machine>
      <machine maxCpus='255'>pc-q35-1.4</machine>
      <machine maxCpus='128'>xenfv</machine>
      <machine maxCpus='255'>pc-0.15</machine>
      <machine maxCpus='255'>pc-0.14</machine>
      <machine maxCpus='255'>pc-i440fx-1.5</machine>
      <machine canonical='pc-q35-2.0' maxCpus='255'>q35</machine>
      <machine maxCpus='255'>pc-i440fx-1.4</machine>
      <machine maxCpus='255'>pc-1.1</machine>
      <machine maxCpus='255'>pc-q35-1.7</machine>
      <machine canonical='pc-1.0' maxCpus='255'>pc-1.0-qemu-kvm</machine>
      <machine maxCpus='255'>pc-i440fx-2.0</machine>
      <machine maxCpus='255'>pc-0.13</machine>
      <domain type='qemu'>
      </domain>
      <domain type='kvm'>
        <emulator>/usr/bin/kvm</emulator>
        <machine canonical='pc-i440fx-trusty' maxCpus='255'>pc</machine>
        <machine maxCpus='255'>pc-1.3</machine>
        <machine maxCpus='255'>pc-0.12</machine>
        <machine maxCpus='255'>pc-q35-1.6</machine>
        <machine canonical='pc-1.0-qemu-kvm' maxCpus='255'>pc-1.0-precise</machine>
        <machine maxCpus='255'>pc-q35-1.5</machine>
        <machine maxCpus='1'>xenpv</machine>
        <machine maxCpus='255'>pc-i440fx-1.6</machine>
        <machine canonical='pc-i440fx-1.5-qemu-kvm' maxCpus='255'>pc-i440fx-1.5-saucy</machine>
        <machine maxCpus='255'>pc-i440fx-1.7</machine>
        <machine maxCpus='255'>pc-0.11</machine>
        <machine maxCpus='255'>pc-1.2</machine>
        <machine maxCpus='255'>pc-0.10</machine>
        <machine maxCpus='1'>isapc</machine>
        <machine maxCpus='255'>pc-q35-1.4</machine>
        <machine maxCpus='128'>xenfv</machine>
        <machine maxCpus='255'>pc-0.15</machine>
        <machine maxCpus='255'>pc-0.14</machine>
        <machine maxCpus='255'>pc-i440fx-1.5</machine>
        <machine maxCpus='255'>pc-i440fx-1.4</machine>
        <machine canonical='pc-q35-2.0' maxCpus='255'>q35</machine>
        <machine maxCpus='255'>pc-1.1</machine>
        <machine maxCpus='255'>pc-q35-1.7</machine>
        <machine canonical='pc-1.0' maxCpus='255'>pc-1.0-qemu-kvm</machine>
        <machine maxCpus='255'>pc-i440fx-2.0</machine>
        <machine maxCpus='255'>pc-0.13</machine>
      </domain>
    </arch>
    <features>
      <cpuselection/>
      <deviceboot/>
      <disksnapshot default='on' toggle='no'/>
      <acpi default='on' toggle='yes'/>
      <apic default='on' toggle='no'/>
      <pae/>
      <nonpae/>
    </features>
  </guest>

  <guest>
    <os_type>hvm</os_type>
    <arch name='x86_64'>
      <wordsize>64</wordsize>
      <emulator>/usr/bin/qemu-system-x86_64</emulator>
      <machine canonical='pc-i440fx-trusty' maxCpus='255'>pc</machine>
      <machine maxCpus='255'>pc-1.3</machine>
      <machine maxCpus='255'>pc-0.12</machine>
      <machine maxCpus='255'>pc-q35-1.6</machine>
      <machine canonical='pc-1.0-qemu-kvm' maxCpus='255'>pc-1.0-precise</machine>
      <machine maxCpus='255'>pc-q35-1.5</machine>
      <machine maxCpus='1'>xenpv</machine>
      <machine maxCpus='255'>pc-i440fx-1.6</machine>
      <machine canonical='pc-i440fx-1.5-qemu-kvm' maxCpus='255'>pc-i440fx-1.5-saucy</machine>
      <machine maxCpus='255'>pc-i440fx-1.7</machine>
      <machine maxCpus='255'>pc-0.11</machine>
      <machine maxCpus='255'>pc-1.2</machine>
      <machine maxCpus='255'>pc-0.10</machine>
      <machine maxCpus='1'>isapc</machine>
      <machine maxCpus='255'>pc-q35-1.4</machine>
      <machine maxCpus='128'>xenfv</machine>
      <machine maxCpus='255'>pc-0.15</machine>
      <machine maxCpus='255'>pc-0.14</machine>
      <machine maxCpus='255'>pc-i440fx-1.5</machine>
      <machine maxCpus='255'>pc-i440fx-1.4</machine>
      <machine canonical='pc-q35-2.0' maxCpus='255'>q35</machine>
      <machine maxCpus='255'>pc-1.1</machine>
      <machine maxCpus='255'>pc-q35-1.7</machine>
      <machine canonical='pc-1.0' maxCpus='255'>pc-1.0-qemu-kvm</machine>
      <machine maxCpus='255'>pc-i440fx-2.0</machine>
      <machine maxCpus='255'>pc-0.13</machine>
      <domain type='qemu'>
      </domain>
      <domain type='kvm'>
        <emulator>/usr/bin/kvm</emulator>
        <machine canonical='pc-i440fx-trusty' maxCpus='255'>pc</machine>
        <machine maxCpus='255'>pc-1.3</machine>
        <machine maxCpus='255'>pc-0.12</machine>
        <machine maxCpus='255'>pc-q35-1.6</machine>
        <machine canonical='pc-1.0-qemu-kvm' maxCpus='255'>pc-1.0-precise</machine>
        <machine maxCpus='255'>pc-q35-1.5</machine>
        <machine maxCpus='1'>xenpv</machine>
        <machine maxCpus='255'>pc-i440fx-1.6</machine>
        <machine canonical='pc-i440fx-1.5-qemu-kvm' maxCpus='255'>pc-i440fx-1.5-saucy</machine>
        <machine maxCpus='255'>pc-i440fx-1.7</machine>
        <machine maxCpus='255'>pc-0.11</machine>
        <machine maxCpus='255'>pc-1.2</machine>
        <machine maxCpus='255'>pc-0.10</machine>
        <machine maxCpus='1'>isapc</machine>
        <machine maxCpus='255'>pc-q35-1.4</machine>
        <machine maxCpus='128'>xenfv</machine>
        <machine maxCpus='255'>pc-0.15</machine>
        <machine maxCpus='255'>pc-0.14</machine>
        <machine maxCpus='255'>pc-i440fx-1.5</machine>
        <machine maxCpus='255'>pc-i440fx-1.4</machine>
        <machine canonical='pc-q35-2.0' maxCpus='255'>q35</machine>
        <machine maxCpus='255'>pc-1.1</machine>
        <machine maxCpus='255'>pc-q35-1.7</machine>
        <machine canonical='pc-1.0' maxCpus='255'>pc-1.0-qemu-kvm</machine>
        <machine maxCpus='255'>pc-i440fx-2.0</machine>
        <machine maxCpus='255'>pc-0.13</machine>
      </domain>
    </arch>
    <features>
      <cpuselection/>
      <deviceboot/>
      <disksnapshot default='on' toggle='no'/>
      <acpi default='on' toggle='yes'/>
      <apic default='on' toggle='no'/>
    </features>
  </guest>

</capabilities>
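
the capabilities XML is long; when only a quick summary of the host CPU and NUMA layout is needed, a few shorter commands give roughly the same picture (numactl assumed to be installed):

sudo virsh nodeinfo          # CPU model, sockets/cores/threads and total memory
sudo virsh freecell --all    # free memory per NUMA cell
numactl --hardware           # NUMA node/CPU/memory layout as seen by the host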

ixgbe driver issue

The original IXGBE driver that comes with Ubuntu has an issue with multicast - the VMX "Getting Started Guide" mentions this:

Multicast promiscuous mode for Virtual Functions is needed to receive control traffic that comes with broadcast MAC addresses

So it looks like an issue with multicast support specifically on the VFs. A quick test here illustrates the issue.

To test this I manually set up a VMX with the original IXGBE driver that comes with Ubuntu. The setup is simple: an HP server with VMX installed, connected to another L3 switch (QFX), as shown below:

ixgbe multicast issue test diagram
   HP server
.............
.  +------+ .                    +------+
.  |      | .                    |      |
.  |VMX   | .TX RX               |QFX   |
.  |      | . |  |               |      |
.  +------+ . |  |               |      |
.           . |  |               |      |
.10.10.10.1 . |  X               |      |
.  NIC/VF   . |  |               |      |
.  +------+ . v  +-<-----------  |      |
.  +---+--+ . |                  |      |
.      |    . +---->-----------  |      |
.............                    +--+---+
       |                            | 10.10.10.2
       +----------------------------+

the configuration on both ends is simple: an IP interface with OSPF enabled on it.

QFX configuration
set groups test-ixgbe interfaces xe-0/0/18 unit 0 family inet address 10.10.10.2/24
set groups test-ixgbe routing-instances test instance-type virtual-router
set groups test-ixgbe routing-instances test interface xe-0/0/18.0
set groups test-ixgbe routing-instances test protocols ospf area 0.0.0.0 interface xe-0/0/18.0
VMX configuration
root# show | compare
[edit]
+  protocols {
+      ospf {
+          area 0.0.0.0 {
+              interface ge-0/0/0.0;
+          }
+      }
+  }
[edit]
root# run show ospf interface
Interface           State   Area            DR ID           BDR ID          Nbrs
ge-0/0/0.0          DR      0.0.0.0         10.10.10.1      0.0.0.0            0
the issue

the issue is that ping works, but the OSPF adjacency does not come up:

vmx# run ping 10.10.10.2
PING 10.10.10.2 (10.10.10.2): 56 data bytes
64 bytes from 10.10.10.2: icmp_seq=0 ttl=64 time=13.889 ms
^C
--- 10.10.10.2 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max/stddev = 13.889/13.889/13.889/0.000 ms
root# root# run show ospf neighbor
root#
packet capture

packet capture on the VMX host server shows the physical port (p2p1) is able to:

  • receive OSPF packets from the peer device

  • send OSPF packets out once received from VMX

    ping@matrix:/home/vAVPN/images$ sudo tcpdump -ni p2p1
    tcpdump: WARNING: p2p1: no IPv4 address assigned
    tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
    listening on p2p1, link-type EN10MB (Ethernet), capture size 65535 bytes
    12:18:12.963197 IP 10.10.10.1 > 224.0.0.5: OSPFv2, Hello, length 56
    12:18:17.477019 IP 10.10.10.2 > 224.0.0.5: OSPFv2, Hello, length 60
    12:18:21.483668 IP 10.10.10.1 > 224.0.0.5: OSPFv2, Hello, length 56
    12:18:21.989157 LLDP, length 296: sonata
    ^C

but packet capture from inside the VMX shows it doesn’t receive anything:

root# run monitor traffic interface ge-0/0/0 size 2000
verbose output suppressed, use <detail> or <extensive> for full protocol decode
Address resolution is ON. Use <no-resolve> to avoid any reverse lookup delay.
Address resolution timeout is 4s.
Listening on ge-0/0/0, capture size 2000 bytes
Reverse lookup for 224.0.0.5 failed (check DNS reachability).
Other reverse lookup failures will not be reported.
Use <no-resolve> to avoid reverse lookups on IP addresses.
20:17:20.318417 Out IP 10.10.10.1 > 224.0.0.5: OSPFv2, Hello, length 56
20:17:29.588411 Out IP 10.10.10.1 > 224.0.0.5: OSPFv2, Hello, length 56
20:17:37.958421 Out IP 10.10.10.1 > 224.0.0.5: OSPFv2, Hello, length 56
20:17:47.058547 Out IP 10.10.10.1 > 224.0.0.5: OSPFv2, Hello, length 56
20:17:54.598639 Out IP 10.10.10.1 > 224.0.0.5: OSPFv2, Hello, length 56
20:18:03.078845 Out IP 10.10.10.1 > 224.0.0.5: OSPFv2, Hello, length 56
20:18:12.408862 Out IP 10.10.10.1 > 224.0.0.5: OSPFv2, Hello, length 56
20:18:20.929166 Out IP 10.10.10.1 > 224.0.0.5: OSPFv2, Hello, length 56

meanwhile, the capture on the QFX shows it is receiving and sending packets without a problem:

labroot@sonata# run monitor traffic interface xe-0/0/18
verbose output suppressed, use <detail> or <extensive> for full protocol decode
Address resolution is ON. Use <no-resolve> to avoid any reverse lookup delay.
Address resolution timeout is 4s.
Listening on xe-0/0/18, capture size 96 bytes
Reverse lookup for 224.0.0.5 failed (check DNS reachability).
Other reverse lookup failures will not be reported.
Use <no-resolve> to avoid reverse lookups on IP addresses.
16:00:25.375850  In IP 10.10.10.1 > 224.0.0.5: OSPFv2, Hello, length 56
16:00:28.511860 Out IP truncated-ip - 20 bytes missing! 10.10.10.2 > 224.0.0.5: OSPFv2, Hello, length 60
16:00:30.867355 Out LLDP, name sonata, length 60
        [|LLDP]
16:00:35.119801  In IP 10.10.10.1 > 224.0.0.5: OSPFv2, Hello, length 56
16:00:37.549066 Out IP truncated-ip - 20 bytes missing! 10.10.10.2 > 224.0.0.5: OSPFv2, Hello, length 60
16:00:42.633445  In IP 10.10.10.1 > 224.0.0.5: OSPFv2, Hello, length 56
16:00:47.589120 Out IP truncated-ip - 20 bytes missing! 10.10.10.2 > 224.0.0.5: OSPFv2, Hello, length 60
16:00:52.077874  In IP 10.10.10.1 > 224.0.0.5: OSPFv2, Hello, length 56
16:00:56.129699 Out IP truncated-ip - 20 bytes missing! 10.10.10.2 > 224.0.0.5: OSPFv2, Hello, length 60
16:00:58.003400 Out LLDP, name sonata, length 60
        [|LLDP]
16:01:01.841421  In IP 10.10.10.1 > 224.0.0.5: OSPFv2, Hello, length 56
16:01:05.589241 Out IP truncated-ip - 20 bytes missing! 10.10.10.2 > 224.0.0.5: OSPFv2, Hello, length 60
16:01:11.136212  In IP 10.10.10.1 > 224.0.0.5: OSPFv2, Hello, length 56
16:01:14.940515 Out IP truncated-ip - 20 bytes missing! 10.10.10.2 > 224.0.0.5: OSPFv2, Hello, length 60

So it looks like OSPF packets arriving at the server NIC are not handed over to the VMX. ethtool with the -S option prints counters for all VFs; here is a capture taken while the issue is ongoing:

ping@matrix:~$ ethtool -S p2p1  | grep -iE "rx_packets|tx_packets|VF 0"
     rx_packets: 33685
     tx_packets: 8
     VF 0 Rx Packets: 315
     VF 0 Rx Bytes: 19546
     VF 0 Tx Packets: 20407
     VF 0 Tx Bytes: 1821654
     VF 0 MC Packets: 0

The "VF 0 MC Packets" counter stays at 0, i.e. VF 0 did not receive any multicast packet. Given that the physical NIC p2p1 (the PF) did receive the packets, we can conclude that the multicast packets were dropped at VF 0.
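
one simple way to confirm this while the OSPF hellos keep arriving is to watch the PF and per-VF counters repeatedly; a small sketch, using the same interface and counter names as in the capture above:

# refresh every 2 seconds: the PF rx_packets counter keeps increasing with
# every hello from the QFX, while "VF 0 MC Packets" stays at 0
watch -n 2 'ethtool -S p2p1 | grep -iE "rx_packets|VF 0"'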

For the physical NIC, multicast/promiscuous reception can be enabled by this command:

ifconfig p3p1 up promisc allmulti mtu 2000

unfortunately this does not apply to the VFs. The current solution is that Juniper modified the IXGBE code to fix this, so recompiling the IXGBE driver from the Juniper-provided source code resolves the issue.
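
the exact build steps depend on the package Juniper provides; the sketch below only shows the typical out-of-tree ixgbe build-and-reload sequence (archive name, path and max_vfs value are illustrative, and the VMX should be shut down before the reload):

tar xzf ixgbe-modified.tar.gz        # illustrative archive name
cd ixgbe-modified/src
make
sudo make install                    # typically installs ixgbe.ko under /lib/modules/$(uname -r)/
sudo rmmod ixgbe                     # unload the stock driver (VMX stopped, VFs released)
sudo modprobe ixgbe max_vfs=8        # reload the rebuilt driver and re-create the VFs
modinfo ixgbe | grep -i '^version'   # confirm the new driver version is active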

end


1. an interface showing in the host, peering with a guest VM interface; e.g. the vcp_ext-vmx1 interface is the "peer interface" of fxp0 in VMX
2. the IP address specified here still needs to be configured manually from inside the vRE guest VM, unless a DHCP client is implemented in the guest VM to request an IP from the DHCP server running in the host OS; this may come in future releases
3. on the same host there can be other VMs that were not spun up by libvirt; those VMs won’t be under the control of libvirt
4. the dns service here is not of our concern at this time; knowing we also have these "extra" services enabled does no harm anyway
5. or, it can be a "logical" CPU core when hyperthreading is enabled
6. from the SR-IOV spec