Recommendations for Ensuring Resilient Environments on the OUTSCALE IaaS

This page presents deployment recommendations to minimize the risk of service interruption on 3DS OUTSCALE’s infrastructure.

Prerequisites

The usual recommendations apply to ensure a good level of resilience for the deployment of your services:

  • Make sure that adequate monitoring is implemented so that you are alerted in case of an incident on your machines, whether or not they are directly connected to 3DS OUTSCALE. This allows you to react as soon as possible.

  • If possible, build your architecture so that the machines are fully or partially redundant (active-passive or active-active). Better still, prepare a business continuity plan (BCP) or a business recovery plan (BRP).

In addition to these recommendations, this page presents how tags, paired with metadata, can help you efficiently distribute the VMs of your architecture on the OUTSCALE Cloud.

Using repulse and attract Tags

VM Positioning Principle

At 3DS OUTSCALE, we use the TINA OS orchestrator, which has been developed by our research and development teams since 2010. Our orchestrator can be used to manage all Cloud resources. When a user deploys a VM, the orchestrator determines the most suitable physical host for it according to several factors: the number of vCPUs, the number of gibibytes of RAM, and the requested vCPU generation.

Tags can be used to indicate the desired position for the VM. We will go over two possible scenarios, repulse_server and attract_server, to reduce the risk of unavailability due to hypervisor failure. If the Region allows it, we also recommend using several Subregions for your architectures deployed in the Cloud.

For example, if you want to place two VMs on different hypervisors (for a database cluster, say), you should add a repulse_server tag on each VM of the cluster, as follows:

osc.fcu.repulse_server = host-db

All VMs with the "host-db" tag value will try to repulse one another and position themselves on different hypervisors. By acting on the VMs’ position, you reduce the risk of having two VMs from the same cluster on the same physical server. In this example, the database cluster becomes more robust because its VMs are deliberately spread across hypervisors.

The attract_server tag is the reverse mechanism of the repulse_server tag. Typically, if you want to reduce the network latency between VMs, you can use it to place them on the same physical server.

A similar mechanism is available for Cisco UCS clusters. In this case, you need to use the repulse_cluster and attract_cluster tags.

Strict Application

These repulsion and attraction mechanisms work on a best-effort basis, meaning TINA OS tries as much as possible to position the VMs according to the specified tags. If the application of the tags is not possible at a given time, then TINA OS will still start the VMs regardless of the specified tags.

To prevent VMs from starting when the tags cannot be applied, you can add the _strict suffix to the chosen tags. In that case, the VMs do not start and an InsufficientCapacity error is returned. The _strict suffix allows for a more precise management of your infrastructure.

Examples

$ cat repulse.clair
-----BEGIN OUTSCALE SECTION-----
tags.osc.fcu.repulse_server=host-db
-----END OUTSCALE SECTION-----

# Encoding of user data in Base64, with line breaks removed from the output
$ openssl enc -base64 -in repulse.clair | tr -d "\n" > repulse
$ cat repulse
LS0tLS1CRUdJTiBPVVRTQ0FMRSBTRUNUSU9OLS0tLS0KdGFncy5vc2MuZmN1LnJlcHVsc2Vfc2VydmVyPWhvc3QtZGIKLS0tLS1FTkQgT1VUU0NBTEUgU0VDVElPTi0tLS0tCg==

# Creation of a VM with the repulse_server tag
$ osc-cli api CreateVms \
    --ImageId "ami-976177b8" \
    --KeypairName "MyKey" \
    --VmType "tinav4.c4r4" \
    --Placement '{"SubregionName": "eu-west-2a", "Tenancy": "default"}' \
    --UserData "LS0tLS1CRUdJTiBPVVRTQ0FMRSBTRUNUSU9OLS0tLS0KdGFncy5vc2MuZmN1LnJlcHVsc2Vfc2VydmVyPWhvc3QtZGIKLS0tLS1FTkQgT1VUU0NBTEUgU0VDVElPTi0tLS0tCg=="
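
The encoding steps above can be wrapped in a small shell function, which also makes it easy to switch to a _strict tag. This is only a minimal sketch: the function name make_placement_userdata and its arguments are hypothetical, and it assumes the coreutils base64 command is available instead of openssl.

```shell
# Hypothetical helper: build the user data section for one placement tag
# and print it Base64-encoded on a single line.
#   $1: tag name, for example repulse_server or repulse_server_strict
#   $2: tag value shared by the VMs of the group, for example host-db
make_placement_userdata() {
    printf '%s\n' \
        "-----BEGIN OUTSCALE SECTION-----" \
        "tags.osc.fcu.$1=$2" \
        "-----END OUTSCALE SECTION-----" \
    | base64 | tr -d '\n'
}
```

For example, the output of make_placement_userdata repulse_server_strict host-db could be passed as the --UserData value of the CreateVms request.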

An efficient way to verify that position tags have been applied is to retrieve, from inside the VM, a hash of the server or of the cluster (corresponding to a rack of 8 servers) on which the machine is positioned, and compare it with the hash retrieved from another VM. This information is available on the metadata server, which is accessible from any VM in the Cloud:

# For a cluster
$ curl http://169.254.169.254/latest/meta-data/placement/cluster
042c0d0863ef30fcd4e8ae28d1b21021730738a

# For a server (= hypervisor)
$ curl http://169.254.169.254/latest/meta-data/placement/server
15e5a7d3ccf781868601140d46d2ad23588ff55b

If the hashes returned by both VMs are identical, then the VMs are on the same server (or in the same cluster). Otherwise, they are on different servers (or in different clusters).
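
When more than two VMs are involved, the comparison can be automated. The following sketch assumes that you have already collected one line per VM in the form "<vm-name> <hash>" (for example by running the curl command above on each VM); this input format and the function name find_colocated are hypothetical.

```shell
# Hypothetical helper: read "<vm-name> <placement-hash>" lines on standard
# input and print each hash shared by more than one VM, followed by the
# names of the VMs concerned. Prints nothing if every hash is unique.
find_colocated() {
    awk '
        {
            if (count[$2]++) { vms[$2] = vms[$2] " " $1 }
            else             { vms[$2] = $1 }
        }
        END {
            for (h in count) {
                if (count[h] > 1) { print h ": " vms[h] }
            }
        }
    '
}
```

An empty output for a group of VMs carrying the same repulse_server tag value suggests that the tag has been applied as expected.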

Using the ShutDownBehavior Attribute

Each VM in the OUTSCALE Cloud has a list of attributes. We are going to focus on the vmInitiatedShutdownBehavior attribute (or instanceInitiatedShutdownBehavior, depending on the API you wish to use). You can also modify this attribute directly from the Cockpit web interface.

This attribute is used to define the behavior of a VM in case of an interruption of the hypervisor process that supports the VM. Requesting the shutdown of a VM via the API corresponds to an interruption of the process attached to the VM.

In case of a major malfunction, 3DS OUTSCALE may perform VM migration operations so that the services associated with your VMs can restart as soon as possible.

If possible, we recommend specifying a value of restart for this attribute. This way, if the VM is migrated to another hypervisor, the VM will be able to restart automatically without any intervention from you. If you want to shut down the VM via an API request at some point, you will have to modify the attribute’s value before doing so.
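
As an illustration, the attribute could be set on several VMs with a small wrapper such as the one below. This is a sketch under assumptions: it relies on the UpdateVm method of the OUTSCALE API accepting the VmId and VmInitiatedShutdownBehavior parameters through osc-cli, and the function name set_shutdown_behavior and the DRY_RUN variable are hypothetical. Check the parameter names against the API reference before use.

```shell
# Hypothetical helper: set the shutdown behavior of one or more VMs.
#   $1: behavior (for example restart or stop); other arguments: VM IDs.
# With DRY_RUN=1, the commands are printed instead of being executed.
set_shutdown_behavior() {
    behavior="$1"
    shift
    for vm_id in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "osc-cli api UpdateVm --VmId $vm_id --VmInitiatedShutdownBehavior $behavior"
        else
            osc-cli api UpdateVm --VmId "$vm_id" --VmInitiatedShutdownBehavior "$behavior"
        fi
    done
}
```

Running it first with DRY_RUN=1 lets you review the requests before they are sent.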

Here is an example of an API request that returns information on a given VM, including the value of the vmInitiatedShutdownBehavior attribute:

$ osc-cli api ReadVms --Filters '{"VmIds": ["i-7965861d"]}'
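
To read only the attribute from the response, the JSON output can be filtered. The sketch below assumes that the response contains a field named VmInitiatedShutdownBehavior for each VM; the function name shutdown_behaviors is hypothetical, and the text-based filtering is a simplification that a JSON-aware tool would handle more robustly.

```shell
# Hypothetical helper: read a ReadVms JSON response on standard input and
# print the value of each VmInitiatedShutdownBehavior field (assumed name).
shutdown_behaviors() {
    grep -o '"VmInitiatedShutdownBehavior"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed 's/.*"\([^"]*\)"$/\1/'
}
```

It would typically be used at the end of a pipeline, after the ReadVms request for the VM in question.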

Conclusion

Regardless of which applications you deploy in the Cloud, we highly recommend using repulsion and attraction tags, and deploying VMs across multiple Subregions, to improve the resilience of your applications. If you do not take the position of your VMs into account, they may be placed on the same hypervisor, and a hardware failure of that hypervisor could then interrupt your service.
