Hybrid and Multi-Cloud Overlay — Part 5 — Challenges — AWS, Azure, GCP, OCI and Alicloud

Ramesh Rajendran
6 min read · Sep 24, 2020


AWS

Let's look at the AWS public cloud environment.

I used Terraform to bring up the entire environment in all of the public clouds, and provisioned the virtual machines with shell scripts.
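The Terraform workflow itself was the standard one, run once per cloud. Roughly (the directory name here is illustrative, not the repository's actual layout):

# Standard per-cloud workflow (directory name illustrative)
cd aws/
terraform init     # download the provider plugins
terraform plan     # review the networks, NICs and instances to be created
terraform apply    # build the environment; provisioners then run the shell scripts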

In AWS, you can directly specify multiple network interfaces within the virtual machine resource.

resource "aws_instance" "layer2-aws-router" {
instance_type = "t2.small"
ami = data.aws_ami.ubuntu.id
key_name = aws_key_pair.VM_SSH_KEY.key_name
availability_zone = data.aws_availability_zones.available.names[0]
tags = {
Name = "layer2-aws-router"
environment = "l2project"
}
network_interface {
network_interface_id = aws_network_interface.Router_Front_NIC.id
device_index = 0
}
network_interface {
network_interface_id = aws_network_interface.Router_Backend_NIC.id
device_index = 1
}

provisioner "remote-exec" {
connection {
type = "ssh"
user = "ubuntu"
host = aws_eip.router.public_ip
private_key = file(var.VM_SSH_KEY_FILE)
timeout = "5m"
}
inline = [
"sudo adduser --disabled-password --gecos \"\" ${var.VM_USER}",
"sudo mkdir /home/${var.VM_USER}/.ssh",
"sudo cp -a /home/ubuntu/.ssh/* /home/${var.VM_USER}/.ssh/",
"sudo echo '${var.VM_USER} ALL=(ALL) NOPASSWD: ALL' > ${var.VM_USER}",
"sudo mv ${var.VM_USER} /etc/sudoers.d/",
"sudo chown -R 0:0 /etc/sudoers.d/${var.VM_USER}",
"sudo chown -R ${var.VM_USER}:${var.VM_USER} /home/${var.VM_USER}/.ssh",
"sudo hostname layer2-aws-router",
]
}
provisioner "file" {
source = "../common/tools.sh"
destination = "/tmp/tools.sh"
connection {
type = "ssh"
user = var.VM_USER
host = aws_eip.router.public_ip
private_key = file(var.VM_SSH_KEY_FILE)
timeout = "5m"
}
}
provisioner "remote-exec" {
connection {
type = "ssh"
user = var.VM_USER
host = aws_eip.router.public_ip
private_key = file(var.VM_SSH_KEY_FILE)
timeout = "5m"
}
inline = [
"chmod +x /tmp/tools.sh",
"sudo bash -x /tmp/tools.sh",
"sudo pkill -u ubuntu",
"sudo deluser ubuntu",
]
}
provisioner "file" {
source = "../common/dhclient_metric.sh"
destination = "/tmp/dhclient_metric.sh"
connection {
type = "ssh"
user = var.VM_USER
host = aws_eip.router.public_ip
private_key = file(var.VM_SSH_KEY_FILE)
timeout = "5m"
}
}
provisioner "remote-exec" {
connection {
type = "ssh"
user = var.VM_USER
host = aws_eip.router.public_ip
private_key = file(var.VM_SSH_KEY_FILE)
timeout = "5m"
}
inline = [
"chmod +x /tmp/dhclient_metric.sh",
"sudo bash -x /tmp/dhclient_metric.sh ",
]
}
}
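The tools.sh script copied above (and reused by the other clouds below) is not listed in this part. Purely as an illustration of the kind of provisioning a router VM needs here, a sketch might look like the following; the package list and sysctl settings are my assumptions, not the project's actual script:

#!/bin/bash
# Illustrative sketch only, not the project's actual tools.sh
# Install basic networking and overlay tooling on the router VM
sudo apt-get update
sudo apt-get install -y openvswitch-switch tcpdump traceroute
# Enable IPv4 forwarding so the VM can act as a router
echo 'net.ipv4.ip_forward=1' | sudo tee /etc/sysctl.d/99-router.conf
sudo sysctl -p /etc/sysctl.d/99-router.conf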

AWS also allows GRE and all IP-based protocols, whereas most of the other public clouds are limited to TCP, UDP and ICMP traffic.
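For example, because AWS security groups can pass IP protocol 47, a GRE tunnel between two router virtual machines can be brought up with plain iproute2. This is only a sketch; the addresses are placeholders, not the project's values:

# Minimal GRE tunnel sketch (placeholder addresses, not the project's values)
# On router A (local private IP 10.0.1.10, remote peer 203.0.113.20):
sudo ip tunnel add gre1 mode gre local 10.0.1.10 remote 203.0.113.20 ttl 255
sudo ip link set gre1 up
sudo ip addr add 192.168.100.1/30 dev gre1
# Router B runs the mirror commands with local/remote and the tunnel IP swapped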

But I had an issue with routing. The router virtual machine had two interfaces, and both received a default route. If the router picked the wrong interface, I would lose management connectivity. I exercised a simple trick here: a small script that sets a higher metric on the backend interface of the router. Because the backend routes carry the higher metric, the routing table treats them as the least preferred whenever two routes exist for the same destination.

#!/bin/bash
#This script sets a higher metric on the 2nd NIC and enables DHCP on it
sudo echo $(ls -t /sys/class/net/)
# Pick the second physical NIC, skipping ovs/bridge/docker/vxlan/loopback/virtual/wireless interfaces
backend=$(sudo ip link | awk -F: '$0 !~ "ovs|br|docker|vxlan|lo|vir|wl|^[^0-9]"{print $2;getline}' | sed -n 2p)
sudo echo $backend
sudo ifconfig $backend up
# IF_METRIC=200 makes routes learned on this interface less preferred than the primary NIC's
sudo dhclient -e IF_METRIC=200 $backend
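With this in place, both default routes stay in the table, but the backend route carries the higher metric. A quick way to confirm it (interface names and gateways below are illustrative):

ip route show default
# Expected shape on the router (illustrative values):
# default via 172.31.0.1 dev eth0
# default via 172.31.16.1 dev eth1 metric 200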

Everything else in AWS was a piece of cake to set up and easy to manage.

Azure

It was not hard to automate Azure, though I wish Azure made API key generation as straightforward as the other clouds; it is quite tricky. I originally started with GRE tunnelling and then stumbled upon a document saying that IP-based ACLs, GRE and IP-in-IP tunnelling are blocked within Azure, so I shifted to the GENEVE and VXLAN overlay protocols.

I didn't have any issue configuring multiple interfaces, but I noticed a few warning messages from Terraform because some resource syntax has changed: "azurerm_virtual_machine" has been superseded by "azurerm_linux_virtual_machine" and "azurerm_windows_virtual_machine". I couldn't specify the provisioners directly within the resource block, so once the virtual machines were created I used a null_resource to connect back to them and run the provisioning. Routing with dual interfaces in Azure is handled automatically, and there is no need to explicitly employ any fix for asymmetric routing; the first interface in the network interface list is always used as the primary interface.
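Because GRE and IP-in-IP are blocked, the overlay between the Azure routers has to ride on UDP. As a rough sketch, a point-to-point VXLAN interface can be created with iproute2 like this; the VNI, port and addresses are placeholders, not the project's configuration:

# Point-to-point VXLAN sketch (placeholder VNI, port and addresses)
sudo ip link add vxlan100 type vxlan id 100 dstport 4789 \
  local 10.1.1.10 remote 203.0.113.30 dev eth0
sudo ip link set vxlan100 up
sudo ip addr add 192.168.200.1/30 dev vxlan100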

resource "azurerm_linux_virtual_machine" "layer2-azure-router" {
name = "layer2-azure-router"
location = var.AZURE_LOCATION
resource_group_name = azurerm_resource_group.l2project_rg.name
network_interface_ids = [azurerm_network_interface.Router_Front_NIC.id, azurerm_network_interface.Router_Backend_NIC.id]
size = "Standard_B1s"
computer_name = "layer2-azure-router"
admin_username = var.VM_USER

os_disk {
name = "router-disk"
caching = "ReadWrite"
storage_account_type = "Premium_LRS"
}
source_image_reference {
publisher = "Canonical"
offer = "UbuntuServer"
sku = "18.04-LTS"
version = "latest"
}
admin_ssh_key {
username = var.VM_USER
public_key = file(var.VM_SSH_PUBLICKEY_FILE)
}
boot_diagnostics {
storage_account_uri = azurerm_storage_account.storage_account.primary_blob_endpoint
}
tags = {
environment = "l2project"
}
}
resource "null_resource" "router_config" {
depends_on = [ azurerm_linux_virtual_machine.layer2-azure-router, azurerm_public_ip.router]
connection {
type = "ssh"
user = var.VM_USER
host = azurerm_public_ip.router.fqdn
private_key = file(var.VM_SSH_KEY_FILE)
agent = false
timeout = "5m"
}
provisioner "file" {
source = "../common/tools.sh"
destination = "/tmp/tools.sh"
}
provisioner "remote-exec" {
inline = [
"chmod +x /tmp/tools.sh",
"sudo echo '${var.VM_USER} ALL=(ALL) NOPASSWD: ALL' > ${var.VM_USER}",
"sudo mv ${var.VM_USER} /etc/sudoers.d/",
"sudo chown -R 0:0 /etc/sudoers.d/${var.VM_USER}",
"sudo bash -x /tmp/tools.sh",
]
}
}

GCP

In GCP, I didn't have any issue configuring multiple interfaces, and I could specify the provisioners directly within the resource block. Routing with dual interfaces in GCP is handled automatically, so there is no asymmetric routing issue; the first interface in the network interface list is always used as the primary interface.

resource "google_compute_instance" "layer2-gcp-router" {
name = "layer2-gcp-router"
machine_type = "f1-micro"
zone = var.GCP_ZONE
tags = ["l2-project", "external-ssh-tunnel", "internal-icmp-ssh-tunnel"]
boot_disk {
initialize_params {
image = data.google_compute_image.ubuntu.self_link
}
}
metadata = {
ssh-keys = "${var.VM_USER}:${file(var.VM_SSH_PUBLICKEY_FILE)}"
}
depends_on = [
google_compute_firewall.internal-icmp-ssh-tunnel,
google_compute_firewall.external-ssh-tunnel,
google_compute_address.router_backend,
google_compute_address.router_publicip,
]
network_interface {
subnetwork = google_compute_subnetwork.front-subnet.name
access_config {
nat_ip = google_compute_address.router_publicip.address
}
}
network_interface {
subnetwork = google_compute_subnetwork.underlay-subnet.name
network_ip = google_compute_address.router_backend.address
}

provisioner "file" {
source = "../common/tools.sh"
destination = "/tmp/tools.sh"
connection {
type = "ssh"
user = var.VM_USER
host = google_compute_address.router_publicip.address
private_key = file(var.VM_SSH_KEY_FILE)
timeout = "6m"
}
}
provisioner "remote-exec" {
connection {
type = "ssh"
user = var.VM_USER
host = google_compute_address.router_publicip.address
private_key = file(var.VM_SSH_KEY_FILE)
timeout = "6m"
}
inline = [
"chmod +x /tmp/tools.sh",
"sudo echo '${var.VM_USER} ALL=(ALL) NOPASSWD: ALL' > ${var.VM_USER}",
"sudo mv ${var.VM_USER} /etc/sudoers.d/",
"sudo chown -R 0:0 /etc/sudoers.d/${var.VM_USER}",
"sudo /tmp/tools.sh",
]
}
}

OCI

In OCI, virtual machine and public IP deployment was sometimes 1–2 minutes faster than in the other clouds. Just like Azure, OCI API key generation is quite tricky; all other configuration is straightforward. In OCI, creating multiple interfaces is a two-step process: the primary interface can be created directly within "oci_core_instance", but any additional interface has to be created outside the resource block and attached to the virtual machine. That interface tends to be down inside the virtual machine because it is attached after the machine has been built. You can fix this with a null_resource that logs into the virtual machine and brings the interface up. There is no limitation with asymmetric routing.

resource "oci_core_vnic_attachment" "backend_vnic_attachment" {
create_vnic_details {
subnet_id = oci_core_subnet.underlay-subnet.id
display_name = "backendvnic"
assign_public_ip = false
nsg_ids = [
oci_core_network_security_group.ssh_icmp_tunnel_web.id
]
}
instance_id = oci_core_instance.layer2-oci-router.id
}
resource "null_resource" "backend_interface_config" {
depends_on = [oci_core_vnic_attachment.backend_vnic_attachment]
connection {
type = "ssh"
user = var.VM_USER
host = oci_core_instance.layer2-oci-router.public_ip
private_key = file(var.VM_SSH_KEY_FILE)
timeout = "5m"
}
provisioner "file" {
source = "../common/interface.sh"
destination = "/tmp/interface.sh"
}
provisioner "remote-exec" {
inline = [
"chmod +x /tmp/interface.sh",
"/tmp/interface.sh ${oci_core_vnic_attachment.backend_vnic_attachment.create_vnic_details[0].private_ip} ${var.UNDERLAY_SUBNETMASK}",
]
}
}
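The interface.sh script itself is not listed in this part. Given that it receives the VNIC's private IP and the subnet mask as arguments, a minimal sketch of what such a script could do is below; the interface-detection line mirrors the dhclient_metric.sh approach and is an assumption, not the project's actual script:

#!/bin/bash
# Illustrative sketch only, not the project's actual interface.sh
# $1 = private IP of the secondary VNIC, $2 = subnet mask
# Pick the second non-virtual NIC, same detection trick as dhclient_metric.sh
backend=$(sudo ip link | awk -F: '$0 !~ "ovs|br|docker|vxlan|lo|vir|wl|^[^0-9]"{print $2;getline}' | sed -n 2p)
sudo ifconfig $backend $1 netmask $2 up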

Alicloud

In Alicloud, I had an asymmetric routing issue with the router virtual machine. Once the virtual machine was provisioned, I created an interface and attached it to the router. Just like in AWS, the new backend interface was down. Using a null_resource provisioner, I brought the new interface up with a less preferred dhclient metric, which fixed the asymmetric routing issue.

resource "alicloud_network_interface" "Router_Backend_NIC" {
name = "Router_Backend_NIC"
vswitch_id = alicloud_vswitch.backend-vswitch.id
security_groups = [alicloud_security_group.ssh_icmp_tunnel_web.id]
}
resource "alicloud_network_interface_attachment" "router_backend_nic_attachment" {
instance_id = alicloud_instance.layer2-ali-router.id
network_interface_id = alicloud_network_interface.Router_Backend_NIC.id
}
resource "null_resource" "backend_interface_config" {
depends_on = [alicloud_network_interface_attachment.router_backend_nic_attachment]
connection {
type = "ssh"
user = var.ALI_VM_USER
host = alicloud_instance.layer2-ali-router.public_ip
private_key = file(var.VM_SSH_KEY_FILE)
timeout = "5m"
}
provisioner "file" {
source = "../common/dhclient_metric.sh"
destination = "/tmp/dhclient_metric.sh"
}
provisioner "remote-exec" {
inline = [
"chmod +x /tmp/dhclient_metric.sh",
"sudo bash -x /tmp/dhclient_metric.sh ",
]
}
}

The dhclient_metric.sh used here is the same metric script shown earlier in the AWS section.
Part 5 — Video blog
