How to Fix Ansible Timeout Error on Azure VM

As Senior DevOps Engineers, we’ve all encountered the dreaded Timeout (12s) waiting for privilege escalation prompt or Timeout waiting for SSH connection messages when automating deployments. On Azure Virtual Machines, these Ansible timeout errors usually point to a handful of underlying issues that, once understood, are straightforward to resolve. This guide provides a direct, systematic approach to diagnosing and fixing them.


1. The Root Cause: Why Ansible Timeouts Occur on Azure VMs

Ansible timeouts typically stem from a failure to establish or maintain an SSH connection within the configured timeframe. On Azure, this is frequently compounded by its robust networking security model.

  • Azure Network Security Groups (NSGs): The most common culprit. NSGs act as a virtual firewall for your Azure resources. If NSG rules block inbound SSH traffic (port 22 by default) from your Ansible control node’s IP address to the target Azure VM, Ansible’s connection attempts will simply hang until they hit the configured timeout: NSGs silently drop denied traffic, so you see a timeout rather than an immediate “connection refused.”
  • Azure Firewall or VNet Routing Issues: If you have an Azure Firewall in place, or complex VNet peering/routing, it might also be blocking SSH traffic.
  • Target VM Performance/SSH Daemon:
    • Resource Exhaustion: An undersized or heavily loaded Azure VM might be slow to respond to new SSH connection requests, causing the SSH daemon to lag or even crash.
    • sshd_config Misconfiguration: Less common, but a misconfigured sshd_config on the target VM can cause trouble: an overly restrictive MaxStartups limit can drop new connection attempts under load, and aggressive ClientAliveInterval/ClientAliveCountMax values can kill sessions that sit idle mid-play.
  • Ansible/SSH Client Timeout Settings:
    • Default Ansible Timeout: Ansible’s default connection timeout is 10 seconds, which might be too short for environments with higher latency or slower VM provisioning.
    • SSH Client Timeout: The underlying SSH client also has its own connection timeout settings that can be overridden.
  • Long-Running Tasks: While not a “connection” timeout, if a specific Ansible task runs longer than the timeout set for that task or the global timeout, it will also fail with a timeout error.
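For that last case, the fix belongs in the play itself rather than in the connection settings. Here is a hypothetical playbook fragment sketching two approaches; the task names and script paths are placeholders, and the per-task timeout keyword requires Ansible 2.10 or later:

```yaml
# Illustrative only: two ways to keep a slow task from tripping a timeout
- name: Long-running import that may exceed the default task timeout
  ansible.builtin.command: /opt/scripts/import_data.sh   # placeholder path
  timeout: 600          # per-task limit in seconds (Ansible >= 2.10)

- name: Very long job, run asynchronously instead
  ansible.builtin.command: /opt/scripts/rebuild_index.sh # placeholder path
  async: 1800           # let it run for up to 30 minutes
  poll: 30              # check on it every 30 seconds
```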

2. Quick Fix (CLI)

Before diving into configuration files, let’s use the CLI for immediate diagnosis and potential temporary fixes.

  1. Direct SSH Connectivity Test (Crucial First Step): The most fundamental test. Attempt to SSH directly from your Ansible control node to the target Azure VM. This bypasses Ansible entirely and tells you if the underlying network path and SSH daemon are working.

    ssh -v your_azure_user@your_azure_vm_public_ip
    • -v (verbose) is essential. Look for clues like “Connection timed out,” “No route to host,” or where the connection hangs. This output often directly points to NSG issues or network routing.
    • If this fails, Ansible will always fail. Focus on fixing the direct SSH connection first (likely NSG).
  2. Increase Ansible Timeout (Environment Variable): For a quick test without modifying files, temporarily increase the global Ansible timeout.

    export ANSIBLE_TIMEOUT=60 # Set to 60 seconds (or higher)
    ansible -m ping your_target_vm -i inventory.ini
    # Or run your playbook
    ansible-playbook your_playbook.yml -i inventory.ini
    • This applies the timeout globally for the current shell session.
  3. Ad-Hoc Command with Specific Timeout: Test with a higher timeout for a specific command, overriding defaults.

    ansible -m ping your_target_vm -i inventory.ini -u your_azure_user --timeout 60
    • This checks basic connectivity to your_target_vm with a 60-second timeout.
  4. Specify SSH Client Connection Timeout: Use ANSIBLE_SSH_ARGS to pass specific options to the underlying SSH client. This can be more effective for initial connection issues than just ANSIBLE_TIMEOUT.

    export ANSIBLE_SSH_ARGS="-o ConnectTimeout=30 -o ServerAliveInterval=10"
    ansible -m ping your_target_vm -i inventory.ini
    • ConnectTimeout=30: Instructs the SSH client to wait 30 seconds for the initial TCP connection to establish.
    • ServerAliveInterval=10: Sends a keep-alive message every 10 seconds to prevent idle connections from timing out on firewalls or routers.
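The direct SSH test in step 1 can also be approximated with a pure-bash TCP probe, which is handy in scripts that gate an Ansible run. Below is a minimal sketch (check_ssh_port is a hypothetical helper, assuming bash’s /dev/tcp support and coreutils timeout). Note the failure modes differ: a reachable host with sshd stopped fails instantly with “connection refused,” while an NSG that drops traffic produces a hang until the timeout fires.

```shell
#!/usr/bin/env bash
# check_ssh_port: hypothetical pre-flight probe. Prints "reachable" if a TCP
# connection to host:port succeeds within $wait seconds, else "unreachable".
check_ssh_port() {
  local host="$1" port="${2:-22}" wait="${3:-5}"
  if timeout "$wait" bash -c "cat < /dev/null > /dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "reachable"
  else
    # Either refused immediately (sshd down) or timed out (likely NSG drop).
    echo "unreachable"
  fi
}

check_ssh_port 127.0.0.1 9 2   # TCP port 9 (discard) is rarely open locally
```

If this probe reports unreachable for your VM’s public IP on port 22, fix the network path (usually the NSG) before touching any Ansible settings.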

3. Configuration Check

Systematic inspection of your Ansible and Azure configurations is key for a permanent solution.

3.1. Ansible Configuration (ansible.cfg)

Check your global Ansible configuration file (usually /etc/ansible/ansible.cfg or ./ansible.cfg in your project root).

# Example ansible.cfg
[defaults]
# Global connection timeout in seconds (Ansible's built-in default is 10).
# Increase it if connections frequently time out due to latency; try 30 or 60.
# Keep comments on their own lines: inline "#" comments after a value are
# not reliably parsed in ansible.cfg.
timeout = 30

[ssh_connection]
# Pass custom arguments to the SSH client.
# Crucial for fine-tuning connection behavior.
# -o ConnectTimeout: timeout for the initial TCP connection.
# -o ServerAliveInterval: sends keep-alive probes so idle connections stay open.
# -o ControlMaster/ControlPersist: SSH connection multiplexing for faster subsequent connections (advanced).
ssh_args = -o ConnectTimeout=30 -o ServerAliveInterval=10
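To see how these settings interact, remember that Ansible resolves the connection timeout with a fixed precedence: the ANSIBLE_TIMEOUT environment variable beats ansible.cfg, and the built-in default is 10 seconds. A rough sketch of that resolution logic in shell (effective_timeout is an illustrative helper, not part of Ansible, and it only handles the simple `timeout = N` form):

```shell
#!/usr/bin/env bash
# effective_timeout: illustrates Ansible's precedence for the connection
# timeout -- environment variable > ansible.cfg > built-in default of 10s.
effective_timeout() {
  local cfg="${1:-ansible.cfg}"
  if [ -n "${ANSIBLE_TIMEOUT:-}" ]; then
    echo "$ANSIBLE_TIMEOUT"                 # 1. environment variable wins
  elif [ -f "$cfg" ] && grep -Eq '^[[:space:]]*timeout[[:space:]]*=' "$cfg"; then
    # 2. fall back to the value set in ansible.cfg
    grep -E '^[[:space:]]*timeout[[:space:]]*=' "$cfg" | head -n1 | sed -E 's/.*=[[:space:]]*//'
  else
    echo 10                                 # 3. Ansible's built-in default
  fi
}

printf '[defaults]\ntimeout = 30\n' > /tmp/demo_ansible.cfg
effective_timeout /tmp/demo_ansible.cfg                     # -> 30
ANSIBLE_TIMEOUT=60 effective_timeout /tmp/demo_ansible.cfg  # -> 60
```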

3.2. Ansible Inventory File

You can set timeouts specific to hosts or groups within your inventory.

# Example hosts.ini or inventory.yml
[azure_vms]
webserver1 ansible_host=20.10.10.10 ansible_user=azureuser ansible_port=22
dbserver1 ansible_host=20.10.10.11 ansible_user=azureuser ansible_port=22 ansible_connect_timeout=60

[all:vars]
# You can also set a default connect timeout for all hosts in the inventory.
# Inventory variables take precedence over the global timeout in ansible.cfg.
# ansible_connect_timeout=45

  • ansible_connect_timeout: Sets the SSH connection timeout (in seconds) for a specific host or group, overriding the global timeout setting.

3.3. Azure Network Security Groups (NSGs)

This is the most critical Azure-specific check. You need an inbound rule allowing TCP port 22 (SSH) from your Ansible control node’s public IP address.

Check via Azure CLI:

# First, find the NSG associated with your VM's network interface
# Replace <resource-group-name> and <vm-name>
vm_nic_id=$(az vm show -g <resource-group-name> -n <vm-name> --query "networkProfile.networkInterfaces[0].id" -o tsv)
nsg_id=$(az network nic show --ids "$vm_nic_id" --query "networkSecurityGroup.id" -o tsv)

# Note: the NSG may be attached to the subnet rather than the NIC.
# If nsg_id comes back empty, check the subnet instead:
#   subnet_id=$(az network nic show --ids "$vm_nic_id" --query "ipConfigurations[0].subnet.id" -o tsv)
#   nsg_id=$(az network vnet subnet show --ids "$subnet_id" --query "networkSecurityGroup.id" -o tsv)

nsg_name=$(az network nsg show --ids "$nsg_id" --query "name" -o tsv)
nsg_resource_group=$(az network nsg show --ids "$nsg_id" --query "resourceGroup" -o tsv)

echo "VM's NSG: $nsg_name in RG: $nsg_resource_group"

# Then, list the inbound rules for that NSG
az network nsg rule list -g "$nsg_resource_group" --nsg-name "$nsg_name" --query "[?direction=='Inbound']" -o table

What to look for:

  • A rule with Access=Allow, Protocol=Tcp, DestinationPortRange=22.
  • Crucially, check SourceAddressPrefixes:
    • * (any source) will work, but is too permissive for production, or
    • It should include the public IP address of your Ansible control node.
    • If your control node is inside the same VNet, use its private IP or a service tag (e.g., VirtualNetwork) instead.

If missing or incorrect, add/update the rule:

# Get your control node's public IP (from the machine running Ansible)
my_ip=$(curl -s ifconfig.me)
echo "Your current public IP: $my_ip"

# Priority 100 is evaluated before higher-numbered rules; make sure it is
# lower than any conflicting Deny rule.
az network nsg rule create -g "$nsg_resource_group" --nsg-name "$nsg_name" -n AllowSSHFromControlNode \
    --priority 100 \
    --direction Inbound \
    --source-address-prefixes "$my_ip" \
    --source-port-ranges "*" \
    --destination-address-prefixes "*" \
    --destination-port-ranges 22 \
    --protocol Tcp \
    --access Allow

Check via Azure Portal:

  1. Navigate to your Virtual Machine.
  2. Go to Networking.
  3. Under Inbound port rules, ensure there’s a rule allowing SSH (Port 22) from your control node’s IP.

3.4. Target VM Health and SSH Daemon Configuration

If direct SSH works but is slow, or if the VM is generally unresponsive:

  • Azure Portal: Check VM metrics (CPU, Memory, Disk I/O) for signs of high utilization. Consider resizing to a larger SKU.
  • Inside VM (if you can get in):
    • Check system load: uptime, top, htop.
    • Check memory: free -h.
    • Check disk I/O: iostat -x 1.
    • Review /etc/ssh/sshd_config: Ensure ClientAliveInterval and ClientAliveCountMax are set reasonably (e.g., ClientAliveInterval 30, ClientAliveCountMax 2). Restart sshd after changes (sudo systemctl restart sshd).
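As a reference point, a conservative keep-alive configuration for the target VM might look like the sketch below. The values are suggestions, not requirements; validate with sudo sshd -t before restarting the daemon.

```
# /etc/ssh/sshd_config (target VM) -- suggested keep-alive settings
ClientAliveInterval 30    # probe an idle client every 30 seconds
ClientAliveCountMax 2     # drop the session after 2 unanswered probes (~60s)
MaxStartups 10:30:100     # OpenSSH's default; raise the first number if many
                          # concurrent unauthenticated connections are expected
```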

4. Verification

After applying any changes, verify your fix.

  1. Rerun the Direct SSH Test:

    ssh your_azure_user@your_azure_vm_public_ip

    Confirm you can connect swiftly and reliably.

  2. Execute the Failing Ansible Playbook or Ad-Hoc Command: Run the original Ansible command or playbook that was previously timing out.

    ansible-playbook your_playbook.yml -i inventory.ini
    # Or
    ansible -m ping your_target_vm -i inventory.ini
  3. Monitor Output: Look for successful connection messages and task execution. If it still fails, review the verbose output from ssh -v, or add -vvvv to your Ansible command to see the exact SSH invocation and connection-level debugging.


By systematically working through network connectivity, Ansible’s timeout configurations, and underlying VM health, you can efficiently diagnose and resolve Ansible timeout errors on Azure VMs. Remember, the direct SSH test is always your most valuable diagnostic tool.