Oracle Cloud Reference Design
Distribution of the Oracle Cloud Cookbook or derivative of the work in any form for commercial purposes is prohibited unless prior permission is obtained from the copyright holder.
Author: Roddy Rodstein
Change Log
| Revision | Change Description | Updated By | Date |
| 1.0 | First Release | Roddy Rodstein | 11/21/11 |
| 1.1 | Oracle VM for x86 Disaster Recovery | Roddy Rodstein | 11/06/12 |
| 1.2 | Oracle VM Servers backup and restoration | Roddy Rodstein | 04/20/12 |
| 1.3 | Oracle VM Fault Testing & Oracle VM Architecture | Roddy Rodstein | 07/31/12 |
| 1.4 | Hardware sizing and content refresh | Roddy Rodstein | 05/20/13 |
Table of Contents
...Oracle VM for x86 Security Standards
...Oracle VM for x86 Administration and Monitoring Standards
...Virtual Machine Operating System Standards
...Support Service Standards
Oracle Cloud Implementation
The Oracle Cloud Reference Design Introduction
This chapter of the Oracle Cloud Cookbook presents the Oracle Cloud reference design. The Oracle Cloud reference designs encompass the software, hardware, storage, and network components required to deploy a scalable, secure, and supportable internal or external Oracle cloud.
The Oracle Cloud reference design is a field-tested best-practice standard, designed with simplicity, reproducibility, usability, scalability, supportability and security. The Oracle Cloud reference designs represent a complete Oracle Cloud standard that can be leveraged as a vanilla solution or modified to more accurately reflect organization-specific needs. The Oracle Cloud reference design includes the following categories and solutions:
|
Disaster Recovery as a Service DRaaS
|
Software as a Service SaaS
|
Oracle Applications
|
Oracle Enterprise Manager
Oracle VM Manager
Open Source
|
|
Platform as a Service PaaS
|
Oracle Fusion Middleware
|
|
Oracle Database (DaaS)
|
|
Infrastructure as a Service IaaS
|
Virtual Machines (Linux, Windows & Solaris) |
|
Oracle VM for x86
|
|
x86 64 Servers
|
|
Storage
|
Note: A detailed explanation of each category and solution in the Oracle Cloud reference design is presented in the architectural overview section.
The Oracle Cloud Reference Design Implementation Overview
The Oracle Cloud reference design provides a well-defined starting point for each Oracle Cloud implementation. It also serves as the baseline on which all solution additions, revisions, and tools will be based. As such, there is increasing value in keeping implementations as close to the reference design as possible.
Prior to implementing an Oracle Cloud, it’s important that an infrastructure assessment (IA) and gap analysis (GA) be performed. During the IA/GA, the architecture of the solution will match the customer’s business needs while maintaining the integrity of the Oracle Cloud reference design. Implementation and support will follow the analysis phase after careful consideration has been given to any specific design modifications that deviate from the Oracle Cloud reference design.
This document outlines the decision points necessary for implementing the Oracle Cloud reference design. For decisions that rely on preexisting factors or specific organizational needs, the appropriate best practice will be discovered in the infrastructure assessment (IA) and gap analysis (GA). The best practices should be analyzed carefully and decisions should be made based on organizational needs, existing architecture, and budget resource availability.
The Oracle Cloud reference design is designed to be scalable and resilient for ease of implementation, high availability, and ease of maintenance for internal and external Oracle clouds. The complete solution is made up of four architectural components that work together to provide flexibility and options with respect to on-demand self-service, broad network access, resource pooling, elasticity, measured service, high availability, security and ease of maintenance. The design breaks down into the following four components:
- Disaster Recovery as a Service (DRaaS). Disaster Recovery as a Service provides businesses with a variety of cost-effective ways to replicate and recover data center infrastructure, applications, and data in internal, external and hybrid clouds. The Oracle Cloud reference design outlines the decision points necessary for implementing an Oracle cloud infrastructure with Oracle Disaster Recovery options.
- Software as a Service (SaaS). Software as a Service is the capability to host and deliver applications over the Internet, accessible from various client devices. The provider manages the cloud infrastructure and application portfolio that is accessed by the consumer. The Oracle Cloud reference design outlines the decision points necessary for implementing an Oracle cloud infrastructure and Oracle Software as a Service delivery model.
- Platform as a Service (PaaS). Platform as a Service is the capability to host and allow access to a computing platform and software stack for application development. The provider hosts the computing platform and software stack on the cloud infrastructure that is accessed by the consumer. The consumer manages the computing platform and software stack used for application development. The Oracle Cloud reference design outlines the decision points necessary for implementing the cloud infrastructure and the Oracle Platform as a Service delivery model.
- Infrastructure as a Service (IaaS). Infrastructure as a Service is the capability to provision and deliver fundamental computing resources as a service to the consumer. The Oracle Cloud reference design outlines the decision points necessary for implementing the cloud infrastructure to deliver Oracle Software as a Service and Oracle Infrastructure as a Service.
The next Figure shows a high-level overview of the Oracle Cloud reference design components.

The Oracle Cloud reference design isolates Oracle VM server pools into the following four security domains:
- Controlled: A controlled security domain is used to restrict access between security domains. A controlled security domain could contain groups of users with their network equipment or a demilitarized zone (DMZ).
- Uncontrolled: An uncontrolled security domain refers to any network not in control of an organization, such as the Internet.
- Restricted: A restricted security domain can represent an organization’s production, test and development networks. Access is restricted to authorized personnel, and there is no direct access from the Internet.
- Secured: A secured security domain is a network that is only accessible to a small group of highly trusted users, such as administrators and auditors.
Note: The classification of security domains is very similar to data classification. FIPS PUB 199, Standards for Security Categorization of Federal Information and Information Systems, can be used to determine the security category of systems and the security domain in which each system should reside.
The Oracle Cloud Reference Design Support Infrastructure
Support is an integral part of the Oracle Cloud reference design and includes a combination of Oracle support agreements and on-site and off-site support from the implementing party. Administrators will have several options for support, including live assistance, phone support, and forums.
This table outlines the decision points for the support infrastructure for the Oracle Cloud reference design. For decisions that rely on preexisting factors or specific organizational needs, the appropriate best practice will be discovered in the infrastructure assessment (IA) and gap analysis (GA). The best practices should be analyzed carefully and decisions should be made based on organizational needs, existing architecture, and budget resource availability.
| Decision Point |
Decision |
Justification |
| Oracle Support Agreements |
Oracle Support Agreements for the Oracle technologies will be active and up to date. |
Support is an integral part of every successful IT project. Oracle support agreements are necessary to be able to create and manage service requests as well as to be able to receive software patches and updates from Oracle Enterprise Manager and My Oracle Support.
|
| On-site and Off-site support |
On-site and off-site support from the implementing party will be used for maintenance, site reviews, upgrades, and security audits. |
On-site and off-site support from the implementing party for problem resolution, system maintenance, site reviews, upgrades, and security audits augments the Oracle support agreement and internal IT operations staff. |
Oracle Cloud Architectural Design
The following sections provide the decision matrices for the Oracle Cloud reference design. Implementers of the Oracle Cloud reference design can use the decision matrices as a quick reference guide to identify settings and configuration decisions to be implemented in the environment. These decisions should be carefully analyzed during a gap analysis phase.
Oracle VM for x86 Hardware Architecture
The server hardware for your Oracle VM environment is a critical component in the success of your Oracle cloud project. The first step in selecting an Oracle VM hardware platform is to size the server hardware, followed by calculating the total number of servers required in each Oracle VM server pool. The formula for Oracle VM server sizing is: the total aggregate virtual machine CPU, RAM and storage requirements, plus your N+x availability requirements, gives the total server count along with the hardware requirements.
Oracle VM server sizing is calculated by adding the aggregate CPU, RAM and storage requirements for all of the virtual machines that could run on an Oracle VM server, and then selecting server hardware with ample CPU, RAM and storage resources. Once the server hardware has been selected, the number of servers in a server pool is calculated by selecting enough servers to support the aggregate CPU, RAM and storage requirements of all of the virtual machines within a server pool, including the number of additional servers for availability, i.e. HA, Live Migration and Distributed Resource Scheduling (DRS). Oracle VM server pools that use HA, Live Migration and DRS must have excess CPU and RAM capacity for hardware failures and virtual machine migrations. The number of network interfaces for an Oracle VM server is determined by the network switch VLAN setup and the total number of Oracle VM management network ports, and the virtual machine network ports for your environment.
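The server count rule described above can be sketched as a short calculation. The following Python is a minimal illustration and not an Oracle tool: the function name `servers_required` and its parameters are hypothetical, it assumes no CPU oversubscription, and it assumes virtual machine storage lives on shared storage that is sized separately.

```python
import math


def servers_required(total_vcpus, total_ram_gb,
                     threads_per_server, ram_gb_per_server,
                     spare_servers=1):
    """Estimate the number of servers in a pool (hypothetical helper).

    total_vcpus / total_ram_gb: aggregate requirements of every virtual
    machine that could run in the pool.
    threads_per_server / ram_gb_per_server: capacity of the selected
    hardware that is available to virtual machines.
    spare_servers: the "x" in N+x, i.e. extra servers kept for HA,
    Live Migration and DRS.
    """
    needed_for_cpu = math.ceil(total_vcpus / threads_per_server)
    needed_for_ram = math.ceil(total_ram_gb / ram_gb_per_server)
    # The scarcer resource decides N; availability adds x on top.
    return max(needed_for_cpu, needed_for_ram) + spare_servers


if __name__ == "__main__":
    # Illustrative numbers only: 200 vCPUs and 1500 GB of RAM on
    # servers with 32 threads and 512 GB each, with one spare server.
    print(servers_required(200, 1500, 32, 512))
```

In this sketch CPU is the limiting resource (seven servers for CPU versus three for RAM), so the pool would need eight servers with one spare.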
Oracle VM server can be installed on an x86 64 bit server with up to 160 CPU cores or threads, up to 4 TB of RAM, with up to 40 network ports. The default behavior of the Oracle VM server installation program is to allocate only 3 GB of disk space for the entire Oracle VM installation, regardless of the amount of available disk space. Since Oracle VM server only requires 3 GB of storage, you might consider procuring disk-less hardware with a flash storage module or boot from SAN to reduce operating costs.
Tip: I have had the opportunity to support and benchmark Oracle VM server installations on a slow single 4 GB SSD drive (18 MB/second read transfer rate, 17 MB/second write transfer rate) as well as Oracle VM server installations using local 7k, 10k and 15k disks. The read and write performance from either type of Oracle VM server installation disk on the remote virtual machine storage (SAN, NFS or iSCSI) from the Oracle VM server and the virtual machines was identical. The disk speed of the Oracle VM server installation does not affect the remote storage read and write performance.
The next table shows the maximum number of CPUs, RAM and NICs for Oracle VM server release 3.2.x and above.
| Item | Maximum |
| CPU Cores or Threads | 160 |
| RAM | 4 TB |
| NICs | 40 |
Oracle VM Server CPU, RAM and storage hardware sizing is calculated by determining the total CPU, RAM, and storage (I/O and disk) requirements of the virtual machines that will run on each Oracle VM server. For example, if a single virtual machine with 16 CPUs, 128 GB RAM, 1 TB of disk space and 1500 IOPS will run on one Oracle VM server, the Oracle VM server hardware should have at least 16 CPU cores or threads, 130 GB RAM, 1 TB of disk space and the ability to support 1500 IOPS with local or remote storage. If two virtual machines, each with 16 CPUs, 128 GB RAM, 1 TB of disk space and 1500 IOPS, will run on one Oracle VM server, the Oracle VM server hardware must have at least 32 CPU cores or threads, 300 GB RAM, 2 TB of disk space and the ability to support 3000 IOPS with local or remote storage.
A single Oracle VM 3.2.x server can support up to 160 CPU cores or threads and 4 TB of memory, with local or remote storage. An Oracle VM server with 4 TB of RAM and 160 CPU cores or threads could allocate the majority of the 4 TB of RAM and more than 160 CPU cores or threads to running virtual machines. Oracle VM server supports CPU oversubscription. CPU oversubscription means that an Oracle VM server with 160 CPU cores could overallocate the total number of CPU cores to virtual machines. Oracle VM server does not support memory oversubscription, which means that an Oracle VM server with 4 TB of RAM cannot overallocate RAM to virtual machines. By default, each Oracle VM server reserves 512 MB of RAM for Oracle VM server (dom0). The average memory overhead for each running virtual machine on an Oracle VM server is approximately 20 MB plus 1% of each virtual machine's memory allocation. The remaining RAM can be allocated to virtual machines.
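The memory overhead figures above can be turned into a rough host RAM estimate. The sketch below uses only the approximate values stated in the text (512 MB for dom0, 20 MB plus 1% per running virtual machine). The function names are hypothetical, and treating 1 GB as 1024 MB is an assumption made for the arithmetic.

```python
DOM0_RESERVED_MB = 512          # default dom0 reservation from the text
PER_VM_FIXED_OVERHEAD_MB = 20   # approximate fixed overhead per running VM
PER_VM_OVERHEAD_PERCENT = 1     # approximate overhead as a share of VM RAM


def vm_overhead_mb(vm_ram_mb):
    """Approximate overhead for one running VM, rounded up to whole MB."""
    percent_part = -((-vm_ram_mb * PER_VM_OVERHEAD_PERCENT) // 100)  # ceiling
    return PER_VM_FIXED_OVERHEAD_MB + percent_part


def host_ram_required_mb(vm_ram_mb_list):
    """Physical RAM needed to run the given VMs on one server.

    Memory cannot be oversubscribed, so every VM allocation and its
    overhead must fit in physical RAM alongside dom0.
    """
    vm_total = sum(ram + vm_overhead_mb(ram) for ram in vm_ram_mb_list)
    return DOM0_RESERVED_MB + vm_total


if __name__ == "__main__":
    one_vm = host_ram_required_mb([128 * 1024])
    print(f"one 128 GB VM needs about {one_vm / 1024:.1f} GB of host RAM")
```

For a single 128 GB virtual machine this works out to just under 130 GB, which is consistent with the sizing example earlier in this section. Real overhead varies, so treat the result as a floor and not a target.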
A best practice is to avoid oversubscribing CPU-bound workloads such as the Oracle Database. CPU oversubscription with CPU-bound workloads negatively affects the performance and availability of an Oracle VM server along with all of the virtual machines running on the server. CPU oversubscription for non-CPU-bound workloads, such as Oracle Fusion Middleware products, is highly recommended. It is common to oversubscribe CPU cores 3-to-1 with non-CPU-bound workloads. For example, one CPU core could be allocated to 3 virtual CPUs for non-CPU-bound workloads without a performance penalty.
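The oversubscription guidance above reduces to a simple rule of thumb, sketched below. The function name `max_vcpus` is hypothetical, and the default 3-to-1 ratio is the commonly used figure mentioned in the text, not a product limit.

```python
def max_vcpus(hardware_threads, cpu_bound, ratio=3):
    """Virtual CPUs that can be allocated on one server (hypothetical helper).

    CPU-bound workloads (for example Oracle Database) get one virtual CPU
    per hardware thread. Other workloads may be oversubscribed by `ratio`.
    """
    if cpu_bound:
        return hardware_threads
    return hardware_threads * ratio


if __name__ == "__main__":
    # Illustrative: a server exposing 24 hardware threads.
    print(max_vcpus(24, cpu_bound=True))   # database hosts: no oversubscription
    print(max_vcpus(24, cpu_bound=False))  # middleware hosts: 3-to-1
```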
Note: Virtual machines cannot aggregate CPU and memory resources from more than one Oracle VM server. That is, a virtual machine consumes resources only from the Oracle VM server where the virtual machine is running.
Oracle VM has two high-availability features, HA and Live Migration. Oracle VM HA and Live Migration along with Distributed Resource Scheduling (DRS) must be considered to calculate the total number of servers required to respond to hardware failures and virtual machine migrations.
The next Figure shows an Oracle VM server pool designed with excess CPU and RAM capacity to be able to use HA, DRS and Live Migration. Excess CPU and RAM capacity is a requirement for HA, DRS and Live Migration.
|
This image shows an Oracle VM server pool with excess capacity able to use HA, Live Migration and DRS.
|
This image shows an Oracle VM server pool responding to a HA event, with DRS and/or Live Migration moving running virtual machines.
|
This image shows an Oracle VM server pool migrating running virtual machines using DRS and/or Live Migration.
|

Oracle VM HA minimizes unplanned downtime by automatically restarting virtual machines when an Oracle VM pool member fails or restarts. Live Migration is used to eliminate planned downtime by migrating running virtual machines from one Oracle VM pool member to another during a maintenance event, for example, for repairs or an upgrade. DRS is an Oracle VM feature which provides policy-based, real-time utilization monitoring of Oracle VM servers with the goal of distributing virtual machine loads across a server pool. DRS migrates virtual machines from heavily utilized Oracle VM servers to less utilized Oracle VM servers. HA, Live Migration and DRS all require a server pool with at least three servers and enough excess CPU and RAM capacity to run and migrate virtual machines across the server pool even if one Oracle VM server fails.
Tip: There is a known limitation with two-node OCFS2 clusters and network failures that causes the node with the higher node number to self-fence. For example, with a two-node Oracle VM server pool, if one node has a network failure that triggers a HA event, both Oracle VM servers will reboot. A best practice is to use a minimum of three Oracle VM servers for a server pool to eliminate the two-node OCFS2 limitation.
Oracle VM HA monitors the status of each server pool member using a network and storage heartbeat. If a server pool member fails to update or respond to network and/or storage heartbeats due to hardware failure, the server pool member is fenced from the pool, promptly reboots, then all HA-enabled virtual machines are restarted on a live node in the pool. Oracle VM does not support memory oversubscription, which means that an Oracle VM server pool must have sufficient RAM capacity to be able to respond to a hardware failure using HA, or to support virtual machine migrations.
Oracle VM Live Migration and DRS move running virtual machines between server pool members across a LAN without loss of availability. Live Migration and DRS have two primary use cases. The first use case is to eliminate planned downtime by live migrating running virtual machines from one server pool member to another during planned maintenance events. The second use case is to use DRS policies to load balance running virtual machines from heavily utilized Oracle VM servers to less utilized Oracle VM servers. Since Oracle VM does not support memory oversubscription, an Oracle VM server pool must have available RAM capacity to be able to migrate virtual machines between servers.
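Because memory cannot be oversubscribed, a pool design can be checked on paper by simulating the loss of each server in turn. The sketch below is one way to do that, under stated assumptions: the function name, the server names and the RAM figures are all hypothetical, CPU is ignored, and virtual machines are placed with a simple first-fit heuristic (largest first), which can report a failure even when a smarter placement exists.

```python
def survives_single_failure(capacity_gb, placement):
    """Check whether the pool has RAM headroom for any one server failing.

    capacity_gb: dict of server name -> RAM available to virtual machines.
    placement:   dict of server name -> list of VM RAM sizes running there.
    Returns True if, for every possible single failure, the failed
    server's VMs can be restarted on the surviving servers.
    """
    for failed in capacity_gb:
        # Free RAM on every surviving server before the failure.
        free = {
            server: capacity_gb[server] - sum(placement.get(server, []))
            for server in capacity_gb
            if server != failed
        }
        # Restart the failed server's VMs, largest first (first-fit).
        for vm_ram in sorted(placement.get(failed, []), reverse=True):
            target = next((s for s, gb in free.items() if gb >= vm_ram), None)
            if target is None:
                return False
            free[target] -= vm_ram
    return True


if __name__ == "__main__":
    pool = {"ovs1": 256, "ovs2": 256, "ovs3": 256}  # hypothetical servers
    half_full = {name: [64, 64] for name in pool}
    nearly_full = {name: [128, 96] for name in pool}
    print(survives_single_failure(pool, half_full))
    print(survives_single_failure(pool, nearly_full))
```

The same check applies to planned maintenance: a server can only be evacuated with Live Migration if the remaining pool members can hold its virtual machines.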
The exact number of network interfaces for an Oracle VM server is determined by the network switch VLAN setup and the number of Oracle VM management and virtual machine network ports. Oracle VM supports both 802.1Q trunk port VLANs and port-based VLANs, with Linux bonding modes 1 (Active-Backup), 4 (802.3ad) and 6 (Adaptive load balancing). 802.1Q trunk ports can have two or more VLANs per port, in contrast to port-based VLANs, which are limited to one VLAN per port or port channel. 802.1Q uses fewer network switch ports and fewer Oracle VM server NICs than port-based VLANs, which require a dedicated switch port and NIC per network. A network switch VLAN configuration must first be selected to be able to calculate the exact number of network switch ports and NICs for your Oracle VM servers.
Oracle VM uses a total of five discrete networks: server management, cluster heartbeat, live migration, storage (only for NFS and iSCSI) and virtual machines. Each Oracle VM server pool should have a discrete network for each of the five aforementioned networks, with a discrete network for each virtual machine network. For example, an Oracle VM Server on a 1-gigabit copper network with NFS or iSCSI storage could easily use 12 or more bonded NICs with access ports just for the server management networks and one virtual machine network. By contrast, an Oracle VM Server on a 10-gigabit fiber network using 802.1Q trunk ports with NFS or iSCSI storage could easily use up to 4 bonded ports just for the server management networks and 2 bonded ports for the virtual machine networks.
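The difference between the two examples above comes down to how many networks each bond can carry. The sketch below models a bond plan as a mapping from bond name to the networks it carries. Everything in it is illustrative: the bond and network names are hypothetical, two ports per bond is an assumption, and the six-network layout is only one reading of how the 12 NIC figure could arise.

```python
def access_port_plan(networks):
    """One bond per network: an access port carries exactly one VLAN."""
    return {f"bond{i}": [name] for i, name in enumerate(networks)}


def ports_required(plan, ports_per_bond=2):
    """NIC ports needed for a bond plan (two ports per bond assumed)."""
    return len(plan) * ports_per_bond


if __name__ == "__main__":
    # Hypothetical 1-gigabit layout: every network gets its own bond.
    networks = ["server-mgmt", "heartbeat", "live-migration",
                "storage-a", "storage-b", "vm-net-1"]
    print(ports_required(access_port_plan(networks)))

    # Hypothetical 10-gigabit layout: 802.1Q trunks share bonds.
    trunk_plan = {
        "bond0": ["server-mgmt", "heartbeat"],
        "bond1": ["live-migration", "storage-a", "storage-b"],
        "bond2": ["vm-net-1", "vm-net-2", "vm-net-3"],
    }
    print(ports_required(trunk_plan))
```

With access ports the port count grows with every network added, while with trunk ports it grows only when a new bond is needed for bandwidth or isolation.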
Tip: In a clustered Oracle VM server pool, the loss of network connectivity for the Oracle VM cluster heartbeat network causes a HA event. When a HA event occurs, the Oracle VM server that loses cluster heartbeat connectivity is fenced from the server pool and reboots, then all HA-enabled guests are restarted on a live Oracle VM pool member.
Prior to implementing an Oracle Cloud, it’s important that an infrastructure assessment (IA) and gap analysis (GA) be performed. During the IA/GA, the hardware specifications will be matched to the customer’s business needs.
This table outlines the decision points for Oracle VM for x86 server hardware. For decisions that rely on preexisting factors or specific organizational needs, the appropriate best practice will be discovered in the infrastructure assessment (IA) and gap analysis (GA). The best practices should be analyzed carefully and decisions should be made based on organizational needs, existing architecture, and budget resource availability.
|
Decision Point
|
Decision
|
Justification
|
| Certification |
The server hardware must be jointly supported by the hardware vendor and Oracle.
Note: The following link is Oracle's hardware certification page. http://linux.oracle.com/pls/apex/f?p=117:1:5773793518142288::NO:RP::
|
Only jointly supported hardware products receive vendor support when problems occur and service tickets are created. The server hardware must be jointly supported by the hardware vendor and Oracle. |
| CPU |
Server hardware will be ordered with two socket Intel or AMD multiple-core CPUs for small and medium workloads and four socket multiple-core CPUs for large CPU-bound workloads. |
The maximum number of CPU cores or threads an Oracle VM server can support is 160. Oracle VM server maps a virtual CPU to a hardware thread on a CPU core in a CPU socket.
Oracle VM Server supports CPU oversubscription. CPU oversubscription allows an Oracle VM Server with 160 CPU cores to overallocate the total number of CPU cores to virtual machines. For example, a server with an Intel Xeon processor 5600-series CPU with hyperthreading can have up to six cores and twelve threads per socket. A two socket server with an Intel Xeon processor 5600-series CPU could allocate twenty four virtual CPUs without oversubscribing the physical CPUs.
CPU-bound workloads, such as Oracle Databases, should not be on Oracle VM Servers with oversubscribed CPUs.
|
| RAM |
Server hardware will be ordered with the maximum amount of physical memory.
Note: Oracle VM Server supports up to 4TB of RAM.
|
Oracle VM Server does not support memory oversubscription. For example, an Oracle VM Server with 1TB of RAM cannot overallocate RAM to virtual machines. By default, each Oracle VM Server reserves 512MB of RAM for dom0. The average memory overhead for each running guest on a dom0 is approximately 20MB plus 1% of the guest’s memory size. The remaining physical RAM can be allocated to guests.
An Oracle VM Server in a server pool with Live Migration, DRS, DPM and/or HA must have excess RAM capacity to accept virtual machines from a Live Migration, DRS, DPM and/or HA operation. Oracle VM pool members without available RAM can not support Live Migration, DRS, DPM and/or HA. Having available RAM on each server provides flexibility in terms of adding new virtual machines to the server pool, and to allow Live Migration, DRS, DPM and/or HA within a server pool.
|
| Storage |
Unless the Oracle VM server is booting from SAN, redundant SSD internal hard drives are recommended.
Virtual machine image and configuration files are hosted on shared SAN, iSCSI, or NFS repositories.
|
Oracle VM Server requires “only” 3 GB of local storage for the entire Oracle VM Server installation. The design goal for Oracle VM is to support multiple node Oracle VM Server pools with shared fibre channel SAN, iSCSI and/or NFS storage.
Oracle VM supports local storage without HA or Live Migration. With local storage, the OCFS2 virtual machine file system must be on a dedicated non-SAS hard drive. For example, a partition on the same disk as the Oracle VM server installation is not supported. Local SAS storage for virtual machines is not supported.
|
|
Network Interface Cards
|
A minimum of one Ethernet network interface card (NIC) is required just to install Oracle VM, although four or more 10G NICs are strongly recommended. NIC bonding with port-based VLANs and/or 802.1Q tag-based VLANs is supported and configured post Oracle VM Server installation with Oracle VM Manager or Enterprise Manager. Oracle VM 3.0.1 through 3.1.1 supports two NIC ports per network bond, and a total of five network bonds per Oracle VM Server. Oracle VM 3.2.x and above supports four NIC ports per network bond, and a total of ten network bonds per Oracle VM Server.
The exact number of network interfaces for an Oracle VM Server entirely depends on your organization’s business requirements and network and storage infrastructure capabilities. For example, an Oracle VM Server with four 10G NICs, configured with two 802.1Q bonds, could support the most demanding network and storage requirements with only four 10G NICs. By contrast, an Oracle VM Server using access ports/port-based VLANs or 802.1Q tag-based VLANs on a 1G copper network could easily use the maximum number of supported NIC ports (<= 3.1.1 = 10 ports, >= 3.2 = 40 ports) to meet the minimum network requirements.
Name                  Rate (bit/s)   Rate (byte/s)
Gigabit Ethernet      1 Gbit/s       125 MB/s
10 Gigabit Ethernet   10 Gbit/s      1.25 GB/s
Infiniband DDR        16 Gbit/s      2 GB/s
Tip: One thing to consider is NIC firmware levels between bonded internal NIC ports and PCI NIC ports. Consider only bonding internal NICs with internal NICs and PCI NICs with PCI NICs.
|
802.3AD NIC bonds, port-based VLANs and/or 802.1Q tag-based VLANs are supported and configured post Oracle VM Server installation with Oracle VM Manager. Network redundancy, i.e. 802.3AD NIC bonding, doubles the number of required NICs.
Oracle VM uses a total of five discrete networks: Server Management, Cluster Heartbeat, Live Migration, Storage and Virtual Machines. All five networks can be supported using one or more 802.1Q tag-based VLANs (2 NICs) or using up to five 802.3AD bonds (10 NICs).
Each Oracle VM server pool should have a discrete network for the Server Management, Cluster Heartbeat, Live Migration, Storage and Virtual Machines. Isolating the Server Management, Cluster Heartbeat, Live Migration and Storage networks protects the server pool from unexpected server reboots by eliminating OCFS2 heartbeat interruptions that could cause a pool member to lose network connectivity, fence from the pool and reboot.
Each Oracle VM Server will be assigned a unique IP address on the Server Management, Cluster Heartbeat, Live Migration and Storage networks.
|
| Host Bus Adapter Cards |
SAN Storage: 2 Host Bus Adapter Cards (HBAs).
Name    Line Rate (GBaud)   Throughput (MBps)
4GFC    4.25                800
8GFC    8.5                 1600
10GFC   10.52               2550
16GFC   14.025              3200
20GFC   21.04               5100
|
2 HBAs are used to eliminate a single point of failure.
|
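The per-release bonding limits quoted in the Network Interface Cards row can be checked against a planned layout before hardware is ordered. The sketch below encodes only the limits stated in this document. The function names are hypothetical, and releases earlier than 3.2 are all treated as having the 3.0.1 through 3.1.1 limits, which is an assumption.

```python
def bond_limits(release):
    """Return (max ports per bond, max bonds per server) for a release.

    Only the major and minor numbers are compared, so "3.2.x" works.
    """
    major, minor = (int(part) for part in release.split(".")[:2])
    if (major, minor) >= (3, 2):
        return 4, 10
    return 2, 5


def validate_bond_plan(release, ports_per_bond):
    """List the problems found in a plan given as ports-per-bond counts."""
    max_ports, max_bonds = bond_limits(release)
    problems = []
    if len(ports_per_bond) > max_bonds:
        problems.append(
            f"{len(ports_per_bond)} bonds exceeds the limit of {max_bonds}")
    for index, ports in enumerate(ports_per_bond):
        if ports > max_ports:
            problems.append(
                f"bond {index} has {ports} ports, limit is {max_ports}")
    return problems


if __name__ == "__main__":
    print(validate_bond_plan("3.1.1", [2, 2, 4]))  # one oversized bond
    print(validate_bond_plan("3.2.1", [4] * 10))   # at the limit, no problems
```

The maximum port counts quoted in the table follow from these limits: two ports across five bonds gives 10, and four ports across ten bonds gives 40.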
Oracle VM for x86 Server Pool Design
Oracle VM uses the concept of a "server pool" to group together and centrally manage Oracle VM servers. An environment can have one or more server pools, each with up to 32 Oracle VM servers. If more than one location exists, Oracle VM server pools may be dispersed to different locations. Oracle VM Manager with Oracle Enterprise Manager 12c provide a single point of administration for one or more dispersed Oracle VM server pools.
Oracle VM server pools can accommodate organization-specific needs, i.e., Oracle technology license management (hard and soft partitioning), defense in depth, the principle of least privilege, compartmentalization of information, security domains, and different applications with their own performance, authentication, and security requirements.
The next Figure shows a high-level overview of how server pools can be used to implement security domains, defense in depth, the principle of least privilege and compartmentalization of information.

This table outlines the decision points for an Oracle VM server pool. For decisions that rely on preexisting factors or specific organizational needs, the appropriate best practice will be discovered in the infrastructure assessment (IA) and gap analysis (GA). The best practices should be analyzed carefully and decisions should be made based on organizational needs, existing architecture, and budget resource availability.
|
Decision Point
|
Decision
|
Justification
|
| Oracle VM Server Pool Design |
Prior to implementing an Oracle Cloud, it’s important that an infrastructure assessment (IA) and gap analysis (GA) be performed. During the IA/GA, the architecture of the solution will be matched to the customer’s business needs.
|
Server pool design is a strategic, architectural security decision. Server pools can be used to control Oracle licensing costs (hard and soft partitioning) and as a way to implement security domains, defense in depth, the principle of least privilege and compartmentalization of information.
|
| Oracle VM Manager |
The Oracle VM Manager installer provides two installation options. Oracle VM 3.0.1 up to 3.1.1 offers a Demo or Production installation. Oracle VM 3.2.1 and above offers a Simple or Custom installation.
Oracle VM Manager will be installed in Production, Simple or Custom mode on a dedicated physical or virtual server. Production and Custom modes use a local or external Oracle 11g Enterprise or RAC database on a dedicated physical or virtual server. Simple mode uses a local MySQL database.
The Oracle VM Manager Database repository will not be shared with other production or test databases on the same server.
The Oracle Enterprise Manager Agent and the Virtualization plug-in will be installed to enable Oracle Enterprise Manager integration.
|
For large environments (>33 hosts), the Oracle VM Manager Database repository should be on dedicated virtual or physical servers. If your Oracle VM environment starts out small and scales out, make sure to have a plan to scale up Oracle VM Manager with more RAM and CPUs and scale out the Oracle VM Manager Database repository on dedicated virtual or physical servers with RAC.
For the Oracle VM Manager Database repository, scaling out means moving from a single-server database to a multi-node RAC cluster. An important consideration when scaling out an Oracle VM Manager environment is to determine whether the underlying hardware where the Oracle VM Manager Database repository runs is capable of transitioning to RAC. If the hardware is not capable of transitioning to RAC, it is possible to move and/or export the Oracle VM Manager Database repository to a different system with more resources.
|
| Oracle VM Server Agent Roles |
Each Oracle VM server pool has one server pool master agent with a Virtual IP address for failover. The Virtual IP address is a unique IP address on the Server Management network.
There are a total of three Oracle VM agent roles; 1) the Server Pool Master, 2) the Utility Server and 3) the Virtual Machine Server.
|
The server pool "Virtual IP" feature is a mandatory Oracle VM server pool property that detects the loss of the server pool master agent and responds with automatic failover to the first pool member that can lock the pool file system. The server pool "Virtual IP" feature removes the single point of failure (SPOF) for the server pool master agent role.
|
| Storage |
Back-end storage
Each Oracle VM server pool uses one dedicated OCFS2 12G mount point for the server pool's OCFS2 cluster configurations (the pool file system) and one or more shared OCFS2 or NFS repositories to host virtual machine configuration files and images.
Front-end storage
The virtual machine layer is where the storage is presented to virtual machines as either a flat file (UUID.img), as RAW disks (LUN), or as a combination of flat files and RAW disks.
|
An Oracle VM storage solution consists of three distinct layers. Each layer has its own unique requirements, configurations, dependencies and features. The first layer is the storage array, which is referred to as back-end storage. Oracle VM supports Fibre Channel and iSCSI SAN and NFS back-end storage. The second layer is the server layer, consisting of the Oracle VM Server's Device-Mapper Multipath configurations and the shared Oracle Cluster File System 2 (OCFS2) or NFS virtual machine file system. The third layer is the guest front-end storage, consisting of multiple guest storage (file and RAW) and driver options. RAW disks have the best performance of the two front-end storage options. In most cases, RAW disks are the best option for high I/O workloads like Oracle Databases.
|
| Networks |
Each Oracle VM server pool will have isolated Oracle VM management networks and isolated virtual machine networks.
Oracle VM uses a total of five discrete networks: Server Management, Cluster Heartbeat, Live Migration, Storage and Virtual Machines.
The exact number of network interfaces for an Oracle VM Server depends entirely on your organization’s business requirements and network and storage infrastructure capabilities. For example, an Oracle VM Server with four 10G NICs configured as two 802.1Q bonds could support the most demanding network and storage requirements. By contrast, an Oracle VM Server using access ports/port-based VLANs or 802.1Q tag-based VLANs on a 1G copper network could easily use the maximum number of supported NIC ports (10 ports for release 3.1.1 and earlier, 40 ports for release 3.2 and later) to meet the minimum network requirements.
|
Each Oracle VM server pool should have a discrete network for the Server Management, Cluster Heartbeat, Live Migration, Storage and Virtual Machines. Isolating the Server Management, Cluster Heartbeat, Live Migration and Storage networks protects the server pool from unexpected server reboots by eliminating OCFS2 heartbeat interruptions that cause pool members to fence from the pool and reboot.
Each Oracle VM Server will be assigned a unique IP address on the Server Management, Cluster Heartbeat, Live Migration and Storage networks.
Note: The heartbeat traffic is TCP on port 7777. Each Oracle VM server in a pool must be able to communicate to all of the pool members over TCP on port 7777.
|
| RAM |
The server pool must be designed with excess RAM capacity to accommodate the memory requirements of virtual machines that could migrate or start on any pool member.
|
Oracle VM Server does not support memory oversubscription, which means that an Oracle VM Server cannot accept DRS, Live Migration or HA requests unless the server has available RAM for the virtual machines. Having excess RAM on each Oracle VM Server provides flexibility in terms of adding new virtual machines to the server pool, and allows DRS, Live Migration and HA to operate within a server pool.
|
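The RAM sizing rule in the table above can be sanity-checked with simple arithmetic. The following is a minimal sketch, not an Oracle tool: the function name and the server and virtual machine sizes are hypothetical, and the check is only a necessary condition because it ignores dom0 memory and whether each virtual machine fits on an individual server.

```python
def survives_single_failure(server_ram_gb, vm_ram_gb):
    """Return True if total VM RAM still fits after losing the largest server.

    Necessary condition only: dom0 overhead and per-server placement
    of individual virtual machines are not modeled.
    """
    if len(server_ram_gb) < 2:
        return False  # a single-server pool has nowhere to restart VMs
    remaining = sum(server_ram_gb) - max(server_ram_gb)
    return sum(vm_ram_gb) <= remaining


# Hypothetical pool: three 256 GB servers.
servers = [256, 256, 256]
print(survives_single_failure(servers, [64] * 8))  # 512 GB of VMs -> True
print(survives_single_failure(servers, [64] * 9))  # 576 GB of VMs -> False
```

A pool that fails this check cannot restart all of its virtual machines after the loss of its largest member, regardless of how HA is configured.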
Oracle VM for x86 Disaster Recovery
An Oracle VM disaster recovery architecture includes the design and process to maintain business continuity following a disastrous event affecting the availability of an organization's primary site. Failover to a disaster recovery site is prompted by the results of a disaster assessment. The failover process is the restoration of the primary site's services at the disaster recovery site.
Note: Disaster recovery requirements are calculated using Service-Level Agreements (SLAs), Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs). SLAs, RTOs, RPOs and budget influence the disaster recovery architecture and design.
Oracle VM uses the concept of a server pool to group together and manage one or more clustered Oracle VM servers. Once an Oracle VM server pool is created, the physical and virtual resources are managed within the boundary of the server pool. Physical resources include server hardware, networks, storage, infrastructure services (DNS, NTP, LDAP, HTTP, etc.), operating system installation media and administrative accounts. The virtual resources include virtual disks, virtual network interfaces, and virtual machine configuration files. For example, an Oracle VM environment with multiple server pools located in one or more sites could be managed from a single Oracle VM Manager instance with each server pool's resources isolated to their respective server pool. An Oracle VM server pool's resources from one site can be replicated and restored to another site for disaster recovery.
Restoration of the primary site's services at a disaster recovery site requires a replica of the primary site's physical and virtual resources at the disaster recovery site. A disaster recovery site hosts a replica of the primary site's Oracle VM physical and virtual resources, i.e. server hardware, networks, storage, infrastructure services, virtual disks, and virtual machine configuration files. The failover process involves restoring the primary site's virtual machines at the disaster recovery site, then systematically starting the virtual machines and services.
Note: Oracle VM Servers are not backed up and restored at the DR site. The time required to back up and restore an Oracle VM Server is significantly greater than a PXE boot kickstart installation.
A disaster recovery site can be a warm failover site waiting idle to respond to a disastrous occurrence, or part of a multi-site high availability design. A multi-site design uses excess capacity with application high availability to mirror services across sites to handle the loss of one or more sites.
The next Figure shows a warm Oracle VM failover site waiting idle to respond to a disastrous occurrence.

The next Figure shows a warm Oracle VM failover site responding to a disastrous occurrence and running the primary site's services.

The next Figure shows a multi-site Oracle VM design with application high availability solutions to mirror services across sites as well as excess capacity to handle the loss of one or more sites.

Virtual machines that are restored at a disaster recovery site expect the same networks, storage, and infrastructure services as in the primary site. In the event that the disaster recovery site has different networks, storage, and infrastructure services, the properties of each virtual machine would need to be edited to use the new networks, storage and infrastructure services before services can be restored.
The virtual machine operating systems are typically installed in virtual disks that are actually flat files hosted on shared OCFS2 or NFS repositories. RAW disks such as ASM disks, log and archive files, etc. are presented to the virtual machines from the Oracle VM Servers as local devices. Each virtual machine's virtual network interface cards (vNICs) are connected to one or more discrete networks using Xen bridges that are managed and presented to the virtual machines by the Oracle VM pool members. Virtual disk and virtual network interface card allocations are managed using Oracle VM Manager and/or Oracle Enterprise Manager, with the configurations saved in each virtual machine's vm.cfg file.
The virtual machine vm.cfg files, virtual disk images and RAW disks (ASM disks) can be replicated between sites using storage array replication and/or mirroring solutions. Rsync is an option if an array does not have replication and/or mirroring functionality.
As soon as the replicated storage repositories are available, the failover process for a warm recovery site starts with the installation of Oracle VM Manager with the runInstall.sh --uuid option using the primary site's Oracle VM Manager UUID. An Oracle VM Manager --uuid installation allows Oracle VM Manager to use the primary site's replicated repositories with the virtual machines.
Tip: The Oracle VM Manager UUID is listed in the .config file on the Oracle VM Manager host in the /u01/app/oracle/ovm-manager-3/ directory as well as in each server pool's .ovsrepo file in the pool file system.
The next example shows the content of the .config file; the Oracle VM Manager UUID is the value of the UUID entry.
# cat /u01/app/oracle/ovm-manager-3/.config
DBHOST=localhost
SID=orcl
LSNR=1521
APEX=None
OVSSCHEMA=ovs
WLSADMIN=weblogic
OVSADMIN=admin
COREPORT=54321
UUID=0004fb00000100009edfaa0f93184f44
BUILDID=3.0.3.126
The next example shows the content of the .ovsrepo file; the Oracle VM Manager UUID is the value of the OVS_REPO_MGR_UUID entry.
# cat .ovsrepo
OVS_REPO_UUID=0004fb0000030000554308a6997a6b2f
OVS_REPO_MGR_UUID=0004fb00000100009edfaa0f93184f44
OVS_REPO_VERSION=3.0
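Before a --uuid installation, it can be useful to confirm that the UUID recorded in .config is the same one that owns a replicated repository. The following is a minimal sketch that compares the two values. The helper names are hypothetical and not part of Oracle VM, and the sketch assumes each file holds one KEY=VALUE entry per line.

```python
def parse_key_values(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def manager_owns_repo(config_text, ovsrepo_text):
    """True if UUID in .config matches OVS_REPO_MGR_UUID in .ovsrepo."""
    manager_uuid = parse_key_values(config_text).get("UUID")
    repo_owner = parse_key_values(ovsrepo_text).get("OVS_REPO_MGR_UUID")
    return manager_uuid is not None and manager_uuid == repo_owner


# Sample contents taken from the two listings above.
config_text = """UUID=0004fb00000100009edfaa0f93184f44
BUILDID=3.0.3.126"""
ovsrepo_text = """OVS_REPO_UUID=0004fb0000030000554308a6997a6b2f
OVS_REPO_MGR_UUID=0004fb00000100009edfaa0f93184f44
OVS_REPO_VERSION=3.0"""

print(manager_owns_repo(config_text, ovsrepo_text))  # True
```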
This table outlines the decision points for an Oracle VM disaster recovery solution. For decisions that rely on preexisting factors or specific organizational needs, the appropriate best practice will be discovered in the infrastructure assessment (IA) and gap analysis (GA). The best practices should be analyzed carefully and decisions should be made based on organizational needs, existing architecture, and budget resource availability.
|
Decision Point
|
Decision
|
Justification
|
|
Disaster Recovery
Design |
Prior to implementing an Oracle VM Disaster Recovery solution, it’s important that an infrastructure assessment (IA) and gap analysis (GA) be performed. During the IA/GA, the architecture of the solution will be matched to the customer’s SLAs, Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs).
|
Implementing disaster recovery is a strategic decision. Disaster recovery requirements are calculated using SLAs, Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs). SLAs, RTOs, RPOs and budget influence the disaster recovery architecture and design.
|
| Oracle VM Manager |
Oracle VM Manager will be installed in Production mode using the runInstall.sh --uuid option with the primary site's Oracle VM Manager UUID.
Oracle VM Manager will be hosted on a dedicated physical server using an external or local Oracle 11g Standard, Enterprise or RAC database.
Once Oracle Enterprise Manager is restored, the Oracle Enterprise Manager Agent and Virtualization plug-in will be installed to enable Oracle Enterprise Manager integration.
|
As soon as the replicated storage repositories are available, the failover process for a warm recovery site starts with the installation of Oracle VM Manager with the runInstall.sh --uuid option using the primary site's Oracle VM Manager UUID. An Oracle VM Manager --uuid installation allows Oracle VM Manager to use the primary site's replicated repositories and virtual machines.
The Oracle VM Manager UUID is listed in the .config file on the Oracle VM Manager host in the /u01/app/oracle/ovm-manager-3/ directory as well as in each server pool's .ovsrepo file in the pool file system.
|
| Oracle VM Server Builds |
Oracle VM Servers will be installed using an automated build process.
|
Oracle VM servers are installed using an automated PXE boot configuration to ensure that each server has a consistent installation configuration.
|
| Oracle VM Server Backups |
Oracle VM Servers will not be backed up at the primary site and restored at the DR site. |
The time required to back up and restore an Oracle VM Server is significantly greater than an automated PXE boot kickstart installation.
Oracle VM servers are installed using an automated PXE boot configuration to ensure that each server has a consistent installation configuration.
|
| Storage |
A replica of the primary site's repositories with the virtual machine resources and RAW disks will be hosted at the disaster recovery site. |
As soon as the replicated storage repositories and RAW disks are available, the failover process for a warm recovery site starts with the installation of Oracle VM Manager with the runInstall.sh --uuid option using the primary site's Oracle VM Manager UUID. An Oracle VM Manager --uuid installation allows Oracle VM Manager to use the primary site's replicated repositories and virtual machines.
Virtual machines that are restored at a disaster recovery site expect the same storage as in the primary site. In the event that the disaster recovery site has different storage, each virtual machine would need to be recreated or edited to use the new storage before services can be restored.
|
| Networks |
A replica of the primary site's Oracle VM networks will be maintained at the disaster recovery site. |
Virtual machines that are restored at a disaster recovery site expect the same networks as in the primary site. In the event that the disaster recovery site has different networks, each virtual machine would need to be edited to use the new networks before services can be restored. |
| Infrastructure Services |
A replica of the primary site's infrastructure services will be maintained at the disaster recovery site.
|
Virtual machines that are restored at a disaster recovery site expect the same infrastructure services as in the primary site. In the event that the disaster recovery site has different infrastructure services, each virtual machine operating system would need to be edited to use the new infrastructure services before services can be restored.
|
Oracle VM Fault Testing
Oracle VM uses OCFS2 to manage up to 32 clustered Oracle VM Servers in an Oracle VM server pool. OCFS2 monitors the status of each server pool member using a network and storage heartbeat. If a server pool member fails to update or respond to network and/or storage heartbeats, the server pool member is fenced from the pool, promptly reboots, then all HA-enabled virtual machines are restarted on a live node in the pool. Fencing forcefully removes dead servers from a pool to ensure that active servers are not obstructed from accessing the fenced servers' cluster resources. The term “node eviction” is also used to describe server fencing and reboots. A best practice is to design Oracle VM Server pools with dedicated network and storage network channels to avoid contention and unexpected server reboots.
Before an Oracle VM server pool is placed into production, both network and storage fault tests should be conducted to find compatible o2cb timeout values, a NIC bond mode (1, 4 or 6; mode 4 is 802.3ad) and network and storage configurations that provide a predictable failure response. For example, Oracle VM Servers should be able to lose a bond port, a redundant network or storage switch and/or an HBA without node evictions. Incompatible o2cb timeout values, bond modes, network switch and storage configurations can trigger node evictions and unexpected server reboots.
Oracle VM Architecture and Fault Testing
A slightly modified version of OCFS2 (o2dlm) is bundled with Oracle VM. The OCFS2 file system and cluster stack are installed and configured as part of an Oracle VM Server installation. The o2cb service manages the cluster stack and the ocfs2 service manages the OCFS2 file system. The o2cb cluster service is a set of modules and in-memory file systems that manage the ocfs2 file system service, network and storage heartbeats and node evictions.
The Oracle Cluster File System 2 (OCFS2) is a general-purpose journaling file system developed by Oracle. Oracle released OCFS2 under the GNU General Public License (GPL), version 2. The OCFS2 source code and its tool set are part of the mainline Linux 2.6 kernel and above. The OCFS2 source code and its tool set can be downloaded from kernel.org, the Oracle Public Yum Server and from the Unbreakable Linux Network.
Oracle VM facilitates centralized server pool management using an agent-based architecture. The Oracle VM agent is a Python application that is installed by default with Oracle VM Server. Oracle VM Manager dispatches commands using XML RPC over a dedicated network called the Server Management network channel using TCP/8899 to each server pool's Master Server agent. Each Master Server agent dispatches commands to subordinate agent servers using TCP/8899. The Oracle VM agent is also responsible for propagating the /etc/ocfs2/cluster.conf file to subordinate agent servers. There is only one Master Server in a server pool at any one point in time. The Master Server is the only server in a server pool to communicate with Oracle VM Manager.
Note: When Oracle VM Server is installed, the IP address entered during the installation is assigned to the Server Management network channel.
Once an Oracle VM server pool is created, two cluster configuration files are shared across all nodes in the server pool that maintain the cluster layout and cluster timeout configurations. The /etc/ocfs2/cluster.conf file maintains the cluster layout and the /etc/sysconfig/o2cb file maintains the cluster timeouts. Both configuration files are read by the o2cb user-space tools and loaded into configfs, an in-memory file system. configfs communicates the list of nodes in the /etc/ocfs2/cluster.conf file to the in-kernel node manager, along with the resource used for the heartbeat to the in-kernel heartbeat thread.
An Oracle VM server must be online to be a member of an Oracle VM pool/cluster. Once the cluster is online, each pool member starts a process, o2net. The o2net process creates TCP/IP intra-cluster node communication channels on port 7777 and sends regular keepalive packets to each node in the cluster to validate whether the nodes are alive. The intra-cluster node communication uses the Cluster Heartbeat network channel. If a pool member loses network connectivity, the keepalive connection becomes silent, causing the node to self-fence. The keepalive connection timeout value is managed in each node's /etc/sysconfig/o2cb file's O2CB_IDLE_TIMEOUT_MS setting.
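Because every pool member must reach every other member on TCP port 7777, a quick reachability probe can be run from each server before fault testing. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and not part of Oracle VM. A successful connection only shows that the port is reachable and that something is listening on it. It does not validate the o2net keepalive exchange itself.

```python
import socket


def can_reach(host, port=7777, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable_members(hosts, port=7777, timeout=3.0):
    """Return the pool members that did not accept a TCP connection."""
    return [host for host in hosts if not can_reach(host, port, timeout)]
```

Calling `unreachable_members()` with the Cluster Heartbeat addresses of the other pool members returns an empty list when every member accepted the connection.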
Along with the keepalive packets that check for node connectivity, the cluster stack also employs a disk heartbeat check. o2hb is the process that is responsible for the disk heartbeat component of the cluster stack that actively monitors the status of all pool members. The heartbeat system uses a file on the OCFS2 file system to which each pool member periodically writes a block, along with a time stamp. The time stamps are read by each pool member and are used to check if a pool member is alive or dead. If a pool member’s block stops getting updated, the node is considered dead, and self-fences. The disk heartbeat timeout value is managed in each node's /etc/sysconfig/o2cb file's O2CB_HEARTBEAT_THRESHOLD setting.
The OCFS2 network and storage heartbeat timeout values are managed in each Oracle VM Server's /etc/sysconfig/o2cb file. Each pool member must have the same /etc/sysconfig/o2cb values. The default timeout values should be tested and tuned to your network and storage infrastructure to provide a predictable failure response.
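Since each pool member must carry identical /etc/sysconfig/o2cb values, a comparison across nodes is worth automating. The following is a minimal sketch that works on the file contents as text. How the text is collected from each server (for example over ssh) is left out, and the function and node names are hypothetical.

```python
def parse_o2cb(text):
    """Collect O2CB_* settings from /etc/sysconfig/o2cb style text."""
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if line.startswith("O2CB_") and "=" in line:
            key, _, value = line.partition("=")
            settings[key.strip()] = value.strip().strip('"')
    return settings


def find_mismatches(files_by_node):
    """Return {setting: {node: value}} for settings that differ between nodes."""
    parsed = {node: parse_o2cb(text) for node, text in files_by_node.items()}
    keys = set().union(*parsed.values())
    mismatches = {}
    for key in sorted(keys):
        values = {node: cfg.get(key) for node, cfg in parsed.items()}
        if len(set(values.values())) > 1:
            mismatches[key] = values
    return mismatches


# Hypothetical two-node pool where one node was tuned and the other was not.
pool = {
    "ovs1": "O2CB_IDLE_TIMEOUT_MS=60000\nO2CB_HEARTBEAT_THRESHOLD=31\n",
    "ovs2": "O2CB_IDLE_TIMEOUT_MS=60000\nO2CB_HEARTBEAT_THRESHOLD=61\n",
}
print(find_mismatches(pool))
# {'O2CB_HEARTBEAT_THRESHOLD': {'ovs1': '31', 'ovs2': '61'}}
```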
Tip: If a SAN storage controller failover takes 120 seconds, and OCFS2 is set to the default value of 60 seconds, Oracle VM Servers will reboot halfway through the controller failover. The O2CB_HEARTBEAT_THRESHOLD timeout value must be longer than the SAN storage controller failover timeout value.
The next example shows the default /etc/sysconfig/o2cb timeout values.
O2CB_IDLE_TIMEOUT_MS=60000 (60 secs)
O2CB_HEARTBEAT_THRESHOLD=31 (60 secs)
O2CB_RECONNECT_DELAY_MS=2000 (2 secs)
O2CB_KEEPALIVE_DELAY_MS=2000 (2 secs)
The next list explains each o2cb timeout value.
1- O2CB_IDLE_TIMEOUT_MS: Default 60000 = 60 secs
Time in ms before a network connection is considered dead.
2- O2CB_HEARTBEAT_THRESHOLD: Default 31 = 60 secs
The disk heartbeat timeout is the number of two-second iterations before a node is considered dead. The exact formula used to convert the timeout in seconds to the number of iterations is:
O2CB_HEARTBEAT_THRESHOLD = (((timeout in seconds) / 2) + 1)
For example, to specify a 60 sec timeout, set it to 31. For 120 secs, set it to 61. The default for this is 31 (60 secs).
3- O2CB_RECONNECT_DELAY_MS: Default 2000 = 2 secs
Min time in ms between connection attempts
4- O2CB_KEEPALIVE_DELAY_MS: Default 2000 = 2 secs
Max time in ms before a keepalive packet is sent
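The O2CB_HEARTBEAT_THRESHOLD formula in item 2 is easy to get wrong by one iteration, so a worked conversion may help. The following is a minimal sketch of the formula and its inverse. The function names are hypothetical, and the requirement that the timeout be an even number of seconds is an assumption of this sketch that follows from the two-second iteration.

```python
def threshold_for_timeout(timeout_seconds):
    """O2CB_HEARTBEAT_THRESHOLD = (timeout in seconds / 2) + 1."""
    if timeout_seconds <= 0 or timeout_seconds % 2:
        raise ValueError("timeout must be a positive, even number of seconds")
    return timeout_seconds // 2 + 1


def timeout_for_threshold(threshold):
    """Inverse of the formula: seconds before a node is considered dead."""
    return (threshold - 1) * 2


def threshold_covers_failover(threshold, failover_seconds):
    """True if the disk heartbeat timeout is longer than a storage failover."""
    return timeout_for_threshold(threshold) > failover_seconds


print(threshold_for_timeout(60))           # 31
print(threshold_for_timeout(120))          # 61
print(timeout_for_threshold(81))           # 160
print(threshold_covers_failover(31, 120))  # False
```

The last line reproduces the situation in the earlier Tip: a default threshold of 31 (60 secs) does not cover a 120-second controller failover.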
Note: If reboots are occurring and the root cause has not yet been identified, the following time-out values may provide a temporary solution.
O2CB_IDLE_TIMEOUT_MS=90000 (90 secs)
O2CB_HEARTBEAT_THRESHOLD=81 (160 secs)
O2CB_RECONNECT_DELAY_MS=4000 (4 secs)
O2CB_KEEPALIVE_DELAY_MS=4000 (4 secs)
A minimum of one Ethernet network interface card (NIC) is required to install Oracle VM, although at least four 10G NICs or four or more 1G NICs are recommended for fault testing. Trunk ports/802.1Q and/or access ports with NIC bonding modes 1, 4 and 6 are supported and are configured after the Oracle VM Server installation with Oracle VM Manager and/or Oracle Enterprise Manager. Oracle VM supports two NIC ports per network bond and a total of five network bonds per Oracle VM Server.
The next Figures show two different Oracle VM networking strategies.
The next image shows a four port 10G 802.1q/LACP trunk port design with two mode 4 bonds.

The next Figure shows a ten port 1G or 10G access port design with five mode 1 or 6 bonds.

Tip: I highly recommend the four port 10G 802.1q/LACP trunk port design with two mode 4 bonds. Trunk ports can have two or more VLANs per port. An access port is limited to one VLAN per port.
The NIC bond modes (1, 4 and 6) of the Cluster Heartbeat, Storage and Virtual Machine network channels should be fault tested with various network switch settings to confirm which combination of NIC bonding mode and network switch setting provides a predictable failure response. The next table shows each of the Oracle VM network channels, NIC bonding modes, and a variety of network switch options that should be fault tested.
|
Network Channel
|
Description
|
Network Type
|
Bond Modes
|
Network Switch Options
|
|
iLO (Integrated Lights Out Manager)
Note: iLO is not managed by Oracle VM Manager. iLO is included in this table for completeness.
|
iLO ports enable browser based access to servers for installations and management.
iLO ports should be on a network isolated from the server payload networks.
|
Class A, B or C
|
Not applicable
|
Trunk ports and/or Access ports.
•An access port is limited to one VLAN per port.
•A trunk port can have two or more VLANs per port.
Most network switches support the following standards: EtherChannel, Port Channels, Link Aggregation Control Protocol (LACP) / 802.3ad, etc...
Consult Cisco for network switch configuration details: http://www.cisco.com/en/US/tech/tk389/tk213/tsd_technology_support_protocol_home.html
|
|
Server Management
|
The Server Management network is the communication channel for Oracle VM Manager, Oracle VM Server agents as well as administrative ssh access to Oracle VM Servers, and http/https/VNC access to and from Oracle VM Manager.
Oracle VM Manager dispatches commands using XML RPC over the Server Management network using TCP/8899 to each server pool's Master Server agent. Each Master Server agent dispatches commands to subordinate agent servers using the Server Management network with TCP/8899.
The Server Management ports should be on an isolated routable network.
Note: The Server Management network should be dedicated to Oracle VM, not a shared corporate-wide Server Management network.
|
Class A, B or C
|
Mode 1, 4 or 6
Mode 1 and 6 do not require special network switch settings.
Mode 4 requires the network switch to support 802.3ad/LACP.
|
Trunk ports and/or Access ports.
•An access port is limited to only "one" VLAN on the port, i.e. the port can carry traffic for one VLAN.
•A trunk port can have two or more VLANs on the port, i.e. the port can carry traffic for multiple simultaneous VLANs.
Most network switches support the following standards: EtherChannel, Port Channels, Link Aggregation Control Protocol (LACP) / 802.3ad, etc...
Consult Cisco for network switch configuration details: http://www.cisco.com/en/US/tech/tk389/tk213/tsd_technology_support_protocol_home.html
|
|
Cluster Heartbeat
|
An Oracle VM server must be online to be a member of an Oracle VM pool/cluster. Once the cluster is online, each pool member starts a process, o2net. The o2net process creates TCP/IP intra-cluster node communication channels on port 7777 and sends regular keepalive packets to each node in the cluster to validate whether the nodes are alive. The intra-cluster node communication uses the Cluster Heartbeat network channel. If a pool member loses network connectivity, the keepalive connection becomes silent, causing the node to self-fence. The keepalive connection timeout value is managed in each node's /etc/sysconfig/o2cb file's O2CB_IDLE_TIMEOUT_MS setting.
|
Class A, B or C
*The network could be a private non-routable network.
|
Mode 1, 4 or 6
Mode 1 and 6 do not require special network switch settings.
Mode 4 requires the network switch to support 802.3ad/LACP.
|
Trunk ports and/or Access ports.
•An access port is limited to only "one" VLAN on the port, i.e. the port can carry traffic for one VLAN.
•A trunk port can have two or more VLANs on the port, i.e. the port can carry traffic for multiple simultaneous VLANs.
Most network switches support the following standards: EtherChannel, Port Channels, Link Aggregation Control Protocol (LACP) / 802.3ad, etc...
Consult Cisco for network switch configuration details: http://www.cisco.com/en/US/tech/tk389/tk213/tsd_technology_support_protocol_home.html
|
|
Live Migration
|
The Oracle VM Live Migration feature moves running virtual machines between server pool members across a LAN without loss of availability. Oracle VM uses an iterative precopy method to migrate running virtual machines between two pool members over the Live Migration network channel. A Live Migration event starts when the source server sends a migration request to the target server, which contains the virtual machine's resource requirements. If the target accepts the migration request, the source starts the iterative precopy phase. The iterative precopy phase starts by iteratively copying the guest’s memory pages from the source to the target server over the Live Migration network channel. If a memory page changes during the precopy phase, it is marked dirty and resent. Once the majority of the pages are copied, the stop-and-copy phase begins. The stop-and-copy phase starts by pausing the guest while the remaining dirty pages are copied to the target, which usually takes 60 to 300 milliseconds. Once the pages are copied to the target, the virtual machine is started on the target server.
|
Class A, B or C *The network could be a private non-routable network.
|
Mode 1, 4 or 6
Mode 1 and 6 do not require special network switch settings.
Mode 4 requires the network switch to support 802.3ad/LACP.
|
Trunk ports and/or Access ports.
•An access port is limited to only "one" VLAN on the port, i.e. the port can carry traffic for one VLAN.
•A trunk port can have two or more VLANs on the port, i.e. the port can carry traffic for multiple simultaneous VLANs.
Most network switches support the following standards: EtherChannel, Port Channels, Link Aggregation Control Protocol (LACP) / 802.3ad, etc...
Consult Cisco for network switch configuration details: http://www.cisco.com/en/US/tech/tk389/tk213/tsd_technology_support_protocol_home.html
|
|
Storage
|
The Storage network channel is only used for iSCSI and NFS storage. FC/SAN uses a dedicated fibre fabric.
|
Class A, B, C or FC/SAN
|
Mode 1, 4 or 6
Mode 1 and 6 do not require special network switch settings.
Mode 4 requires the network switch to support 802.3ad/LACP.
|
Trunk ports and/or Access ports.
•An access port is limited to only "one" VLAN on the port, i.e. the port can carry traffic for one VLAN.
•A trunk port can have two or more VLANs on the port, i.e. the port can carry traffic for multiple simultaneous VLANs.
Most network switches support the following standards: EtherChannel, Port Channels, Link Aggregation Control Protocol (LACP) / 802.3ad, etc...
Consult Cisco for network switch configuration details: http://www.cisco.com/en/US/tech/tk389/tk213/tsd_technology_support_protocol_home.html
|
|
Virtual Machine(s)
|
The virtual machine network channels provide access to one or more networks provisioned for virtual machines.
|
Class A, B or C
|
Mode 1, 4 or 6
Mode 1 and 6 do not require special network switch settings.
Mode 4 requires the network switch to support 802.3ad/LACP.
|
Trunk ports and/or Access ports.
•An access port is limited to only "one" VLAN on the port, i.e. the port can carry traffic for one VLAN.
•A trunk port can have two or more VLANs on the port, i.e. the port can carry traffic for multiple simultaneous VLANs.
Most network switches support the following standards: EtherChannel, Port Channels, Link Aggregation Control Protocol (LACP) / 802.3ad, etc...
Consult Cisco for network switch configuration details: http://www.cisco.com/en/US/tech/tk389/tk213/tsd_technology_support_protocol_home.html
|
Tip: A best practice is to use a minimum of three nodes per cluster to ensure quorum and to be able to generate meaningful network heartbeat (O2CB_IDLE_TIMEOUT_MS timeout) fault tests. A known limitation with two-node clusters and network failures causes the node with the higher node number to self-fence.
Oracle VM Network Fault Testing
Network Bond Mode, Network Bond failover, and O2CB_IDLE_TIMEOUT_MS Fault Tests:
If a server pool member fails to respond to its network heartbeat, the server pool member is fenced from the pool, promptly reboots, then all HA-enabled virtual machines are restarted on a live node in the pool. The network heartbeat should be placed on a routable or non-routable dedicated class A, B or C network. A best practice is to provision a dedicated network channel for the network heartbeat to avoid network contention and unexpected reboots. The network heartbeat is referred to as the Cluster Heartbeat network channel in Oracle VM Manager.
Before an Oracle VM server pool is placed into production, network fault testing should be conducted on the Cluster Heartbeat, Storage and Virtual Machine network channels to find a suitable O2CB_IDLE_TIMEOUT_MS o2cb timeout value, bond mode (1, 4 or 6) and network switch configuration that provides a predictable failure response. For example, Oracle VM Servers should be able to lose a bond port/NIC and/or a redundant network switch without node evictions. Incompatible o2cb timeout values, bond modes and network switch configurations can trigger node evictions and unexpected server reboots.
Tip: When fault testing a two-node cluster's O2CB_IDLE_TIMEOUT_MS timeout value, the node with the higher node number will reboot when the network fails. A best practice is to use a minimum of three nodes per cluster to ensure quorum and to be able to fault test the O2CB_IDLE_TIMEOUT_MS timeout value.
The next table shows each of the fault tests with the expected failure results. The example is for a four-port 10G 802.1q/LACP trunk port design with two mode 4 bonds. Modify the table to reflect your design and fault tests.
- Disable switch ports in various fault patterns to test NIC, bond and switch failures and OCFS2 compatibility.
- Use the following commands to confirm the test results.
- watch cat /proc/net/bonding/bond0
- watch cat /proc/net/bonding/bond1
- tail -f /var/log/messages
Complete and document each test to confirm which settings provide the expected failure results.
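When many fault patterns are tested, recording the bond state programmatically is less error-prone than reading `watch` output by eye. The following is a minimal sketch that extracts the bond and slave MII status from text in the format the Linux bonding driver writes to /proc/net/bonding/bondN. The function name is hypothetical and the sample text is an abbreviated, made-up illustration rather than output captured from an Oracle VM Server.

```python
def parse_bond_status(text):
    """Return (bond_mii_status, {slave: mii_status}) from bonding status text."""
    bond_status = None
    slaves = {}
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Slave Interface":
            current = value  # following MII Status lines belong to this slave
        elif key == "MII Status":
            if current is None:
                bond_status = value  # seen before any slave section
            else:
                slaves[current] = value
    return bond_status, slaves


# Abbreviated, hypothetical sample: eth0 disabled at the switch, eth1 still up.
sample = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
MII Status: up

Slave Interface: eth0
MII Status: down

Slave Interface: eth1
MII Status: up
"""
bond, slaves = parse_bond_status(sample)
print(bond, slaves)  # up {'eth0': 'down', 'eth1': 'up'}
```

On a server, the text would come from reading /proc/net/bonding/bond0 or /proc/net/bonding/bond1, and the parsed result can be written into the Results column of the test table.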
|
Use Case
|
Port(s)
|
Bond Mode
|
Switch Mode
|
Expected Results
|
Results
|
|
1- Disable the port and wait +20 secs over the O2CB_IDLE_TIMEOUT_MS time out value.
Enable the port and wait +20 secs over the O2CB_IDLE_TIMEOUT_MS time out value.
|
eth0
|
|
Mode x
|
|
Bond stays active; the node does not self-fence.
|
|
|
2- Disable the port and wait +20 secs over the O2CB_IDLE_TIMEOUT_MS time out value.
Enable the port and wait +20 secs over the O2CB_IDLE_TIMEOUT_MS time out value.
|
|
eth1
|
Mode x
|
|
Bond stays active; the node does not self-fence.
|
|
|
3- Disable the port and wait +20 secs over the O2CB_IDLE_TIMEOUT_MS time out value.
Enable the port and wait +20 secs over the O2CB_IDLE_TIMEOUT_MS time out value.
|
eth2
|
|
Mode x
|
|
Bond stays active; the node does not self-fence.
|
|
|
4- Disable the port and wait +20 secs over the O2CB_IDLE_TIMEOUT_MS time out value.
Enable the port and wait +20 secs over the O2CB_IDLE_TIMEOUT_MS time out value.
|
|
eth3
|
Mode x
|
|
Bond stays active; the node does not self-fence.
|
|
|
5- Disable both ports and wait +20 secs over the O2CB_IDLE_TIMEOUT_MS time out value.
Enable both ports and wait +20 secs over the O2CB_IDLE_TIMEOUT_MS time out value.
|
eth0
|
eth2
|
Mode x
|
|
Both bonds stay active; the node does not self-fence.
|
|
|
6- Disable both ports and wait +20 secs over the O2CB_IDLE_TIMEOUT_MS time out value.
Enable both ports and wait +20 secs over the O2CB_IDLE_TIMEOUT_MS time out value.
|
eth1
|
eth3
|
Mode x
|
|
Both bonds stay active; the node does not self-fence.
|
|
|
7- Disable both ports and wait +20 secs over the O2CB_IDLE_TIMEOUT_MS time out value.
Enable both ports and wait +20 secs over the O2CB_IDLE_TIMEOUT_MS time out value.
|
eth2
|
eth3
|
Mode x
|
|
Bond loses connectivity, VMs lose connectivity, the node does not self-fence.
|
|
|
5- Disable both ports and wait +20 secs over the O2CB_IDLE_TIMEOUT_MS time out value.
Enable both ports and wait +20 secs over the O2CB_IDLE_TIMEOUT_MS time out value.
|
eth0
|
eth1
|
Mode x
|
|
At the O2CB_IDLE_TIMEOUT_MS timeout value the node self-fences.
|
|
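Each network use case holds a port down for 20 seconds longer than the O2CB_IDLE_TIMEOUT_MS value, so it helps to work out that wait time before starting. The following is a minimal Python sketch under stated assumptions: the o2cb settings are KEY=VALUE lines (typically found in /etc/sysconfig/o2cb), the function names are hypothetical, and the 60000 in the sample is an illustrative value, not a recommendation.

```python
# Illustrative o2cb settings text; the value shown is an assumption.
SAMPLE_O2CB = """\
# O2CB_IDLE_TIMEOUT_MS: network idle timeout in milliseconds
O2CB_IDLE_TIMEOUT_MS=60000
"""


def parse_o2cb_config(text):
    """Return the KEY=VALUE settings found in o2cb configuration text."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and anything that is not a setting
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def network_test_wait_seconds(settings, margin_seconds=20):
    """Seconds to keep a port disabled: the idle timeout plus the test margin."""
    return int(settings["O2CB_IDLE_TIMEOUT_MS"]) / 1000 + margin_seconds


wait = network_test_wait_seconds(parse_o2cb_config(SAMPLE_O2CB))
print(f"Keep the port disabled for {wait:.0f} seconds")
```

With the illustrative value of 60000 milliseconds, the wait works out to 80 seconds per disable step and again per enable step.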
Oracle VM Storage Fault Testing
If a server pool member fails to read and write storage heartbeats, the server pool member is fenced from the pool and promptly reboots, then all HA-enabled virtual machines are restarted on a live node in the pool. The storage heartbeat should be on a dedicated IP or Fibre Channel network. A best practice is to use dedicated, not shared, storage with performance monitoring and storage quota alerts to avoid contention and full disks that may cause unexpected reboots. A best practice with iSCSI and NFS storage is to provision the storage on a dedicated class A, B, or C network channel to avoid network contention and unexpected server reboots.
Oracle VM uses two different types of storage repositories. The first type of storage repository, called a pool file system, is used to host a server pool's cluster configurations including the storage heartbeat. There can only be one pool file system per server pool. The other type of storage repository, called a virtual machine file system, is used to host virtual machine configuration files and disks. There can be one or more virtual machine file system repositories in a server pool. Virtual machine file system repositories do not have a storage heartbeat.
The storage heartbeat, also known as a “quorum disk”, is used to monitor the status of each Oracle VM Server in a pool. With a quorum disk, every Oracle VM Server in a pool regularly reads and writes a small amount of status data to a reserved section of the pool file system. Each Oracle VM Server writes its own status and reads the status of all the other Oracle VM Servers in the pool. If any Oracle VM Server in a pool fails to update its status within its O2CB_HEARTBEAT_THRESHOLD o2cb timeout value, the Oracle VM Server is fenced from the pool and promptly reboots, then all HA-enabled virtual machines are restarted on a live Oracle VM Server in the pool.
Before an Oracle VM server pool is placed into production, storage fault testing should be conducted on the quorum disk to find an O2CB_HEARTBEAT_THRESHOLD o2cb timeout value and SAN and FC switch configurations that provide a predictable failure response. For example, Oracle VM Servers should be able to lose one HBA without node evictions. Incompatible o2cb timeout values and SAN and FC switch configurations can trigger node evictions and unexpected server reboots.
The next table shows each fault test with the expected failure results. Modify the table to reflect your design and fault tests.
Use the following commands to confirm the test results.
Complete and document each test to confirm which settings provide the expected failure results.
| Use Case | Port(s) | Bond Mode | Expected Results | Results |
| 1- Disable the port and wait 20 seconds beyond the O2CB_HEARTBEAT_THRESHOLD timeout value. | HBA0 | | HBA0 paths go down, HBA1 paths stay active, the node does not self-fence. | |
| 2- Disable the port and wait 20 seconds beyond the O2CB_HEARTBEAT_THRESHOLD timeout value. | HBA1 | | HBA1 paths go down, HBA0 paths stay active, the node does not self-fence. | |
| 3- Disable both ports and wait 20 seconds beyond the O2CB_HEARTBEAT_THRESHOLD timeout value. | HBA0, HBA1 | | After the O2CB_HEARTBEAT_THRESHOLD timeout value the node self-fences. | |
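Unlike O2CB_IDLE_TIMEOUT_MS, O2CB_HEARTBEAT_THRESHOLD is a count of heartbeat iterations and not a time, so it has to be converted before the wait in each storage use case can be timed. The following is a minimal Python sketch that assumes the commonly documented OCFS2 relationship, in which the disk heartbeat runs every 2 seconds and the timeout is (threshold - 1) * 2 seconds. Confirm that relationship against the OCFS2 documentation for your release. The function names and the threshold of 31 used in the example are illustrative.

```python
# Assumption: the o2cb disk heartbeat interval is 2 seconds.
HEARTBEAT_INTERVAL_SECONDS = 2


def heartbeat_timeout_seconds(threshold):
    """Convert an O2CB_HEARTBEAT_THRESHOLD iteration count to seconds."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECONDS


def storage_test_wait_seconds(threshold, margin_seconds=20):
    """Seconds to keep an HBA port disabled: the timeout plus the test margin."""
    return heartbeat_timeout_seconds(threshold) + margin_seconds


threshold = 31  # illustrative value only
print("Heartbeat timeout:", heartbeat_timeout_seconds(threshold), "seconds")
print("Wait per use case:", storage_test_wait_seconds(threshold), "seconds")
```

Under these assumptions a threshold of 31 corresponds to a 60 second timeout, so each storage use case would keep the port disabled for 80 seconds.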
Oracle VM Master Server VIP Failover Testing
Oracle VM facilitates centralized server pool management using an agent-based architecture. The Oracle VM agent is a Python application that is installed by default with Oracle VM Server. Oracle VM Manager dispatches commands using XML-RPC over a dedicated network using TCP/8899 to each server pool's Master Server. Each Master Server dispatches commands to subordinate agent servers using TCP/8899. There is only one Master Server agent in a server pool at any one point in time. The Master Server agent is the only server in a server pool to communicate with Oracle VM Manager. Agent intra-component traffic should be isolated to a dedicated class A, B, or C Server Management network channel.
To address the single point of failure for the Master Server agent, the server pool "Virtual IP" feature was introduced. The Virtual IP feature detects the loss of the Master Server agent and automatically fails over the Master Server role to the first node that can lock the cluster.
To test Virtual IP failover, first confirm which node is the Master Server by accessing Oracle VM Manager => Servers and VMs => Right Click the desired Server Pool => Confirm the host name in the Master Server drop down list. Next, access the Master Server as root and stop the ovs-agent service by typing “service ovs-agent stop --disable-nowayout”. After 60 seconds, ssh to the Virtual IP address to confirm that the Master Server agent failed over to a new node.
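As a complement to the ssh check, the agent port on the Virtual IP can be probed from another host on the Server Management network. The following is a minimal Python sketch and not an Oracle VM tool: the function name agent_port_open is hypothetical, and it only tests whether something accepts TCP connections on port 8899. An open port suggests an agent is answering on the Virtual IP, but it does not show which node holds the Master Server role, so the ssh check is still needed.

```python
import socket

AGENT_PORT = 8899  # Oracle VM agent port named in this section


def agent_port_open(host, port=AGENT_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, calling agent_port_open with the server pool's Virtual IP address before stopping the ovs-agent service, and again about 60 seconds afterwards, gives a quick before-and-after record for the test log.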