Troubleshooting RHEL EC2 Boot Failure and Instance Status Check Issues

Checo | 4/6/26 | About 2 min

When an EC2 Linux instance fails to boot or fails its instance status check, the cause can sit in the AWS control plane, inside the operating system, or in both at once. In this case there were two causes: missing KMS permissions prevented the encrypted EBS volume from being decrypted, and /etc/fstab referenced unstable device names, which dropped the system into maintenance mode.

Symptoms

The failures fall into two categories:

1. The instance fails to boot, and CloudTrail shows KMS CreateGrant or Decrypt permission errors.
2. The instance attempts to boot, but the system log is stuck at:

Give root password for maintenance
(or press Control-D to continue):

In the second case the OS never finishes booting, so the instance cannot respond to the underlying health checks, and the failure surfaces as an instance status check failure.
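To tell the second case apart quickly, the console output can be searched for the maintenance prompt. The following is a minimal sketch: the function name stuck_in_maintenance is hypothetical, and the instance ID in the usage comment is a placeholder.

```shell
# Reads console output on stdin.
# Succeeds (exit 0) if the maintenance prompt quoted above is present.
stuck_in_maintenance() {
    grep -q 'Give root password for maintenance'
}

# Assumed usage (requires the AWS CLI and credentials; --latest is optional
# and only supported on Nitro-based instances):
#   aws ec2 get-console-output --instance-id <instance-id> --latest --output text \
#       | stuck_in_maintenance && echo "boot stopped at the maintenance prompt"
```

If the function reports no match, the KMS side is the more likely place to look first.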

Problem 1: Insufficient KMS Permissions

If the EBS volume is encrypted with a customer managed KMS key (CMK), the IAM role used to launch the instance must be allowed to use that key. At minimum, the following statement is required (the aws-cn partition in the ARN applies to the AWS China regions):

{
  "Effect": "Allow",
  "Action": [
    "kms:Decrypt",
    "kms:GenerateDataKey",
    "kms:CreateGrant"
  ],
  "Resource": "arn:aws-cn:kms:<region>:<account-id>:key/<key-id>"
}

If the role lacks permissions, EC2 cannot decrypt the system or data volumes during the boot phase, and the instance will fail to start.
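As a rough first check, a saved copy of the role's policy document can be scanned for the three actions listed above. This is only a sketch: policy_has_kms_actions is a hypothetical helper, it matches the action names literally (so a wildcard such as kms:* is not recognized), and it does not evaluate Resource, Condition, or the key policy.

```shell
# policy_has_kms_actions <policy-json-file>
# Succeeds only if all three actions appear literally in the file.
policy_has_kms_actions() {
    for action in kms:Decrypt kms:GenerateDataKey kms:CreateGrant; do
        grep -q "\"$action\"" "$1" || { echo "missing: $action" >&2; return 1; }
    done
}

# Assumed usage for an inline role policy (names are placeholders):
#   aws iam get-role-policy --role-name <role-name> --policy-name <policy-name> \
#       --query PolicyDocument > policy.json
#   policy_has_kms_actions policy.json && echo "all three actions present"
```

CloudTrail remains the authoritative source for which call was actually denied.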

Problem 2: fstab Blocking Boot

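On Linux, NVMe device names such as /dev/nvme2n1 are assigned in the order the devices are enumerated, so they can change across reboots or after volumes are attached or detached. If /etc/fstab contains a hardcoded entry like: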

/dev/nvme2n1p1 /data ext4 defaults 1 2

When the device name changes or the volume is missing, the boot waits for the device, the mount times out, and the system drops into emergency / maintenance mode.

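A more stable approach is to identify the filesystem by UUID, which does not depend on the device name, and to add nofail so that a missing volume does not stop the boot: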

UUID=<volume-uuid> /data ext4 defaults,nofail 1 2
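A line in this form can be generated rather than typed by hand. The sketch below assumes a hypothetical helper named fstab_entry and reuses the options and the dump/pass fields (1 2) from the example above.

```shell
# fstab_entry <uuid> <mount-point> <fs-type>
# Prints a UUID-based fstab line with nofail.
fstab_entry() {
    printf 'UUID=%s %s %s defaults,nofail 1 2\n' "$1" "$2" "$3"
}

# Assumed usage, with the device taken from the earlier example:
#   uuid=$(sudo blkid -s UUID -o value /dev/nvme2n1p1)
#   fstab_entry "$uuid" /data ext4
```

The output is meant to be reviewed and pasted into /etc/fstab, not appended blindly.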

Entries for network filesystems should also include _netdev, which marks the mount as dependent on the network:

server:/share /mnt/share nfs defaults,_netdev,nofail 0 0
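An existing fstab can be screened for entries that still mount by kernel device name. The following is a minimal sketch, not a complete linter: find_risky_fstab_entries is a hypothetical name, and the prefixes it looks for (/dev/nvme, /dev/sd, /dev/xvd) are an assumption about which names count as unstable.

```shell
# find_risky_fstab_entries <fstab-file>
# Prints one line per entry that mounts by kernel device name,
# noting whether the entry already carries nofail.
find_risky_fstab_entries() {
    awk '
        $1 ~ /^#/ || NF < 4 { next }      # skip comments, blanks, short lines
        $1 ~ /^\/dev\/(nvme|sd|xvd)/ {
            opts = "," $4 ","
            note = (opts ~ /,nofail,/) ? "has nofail" : "no nofail"
            printf "line %d: %s on %s (%s)\n", NR, $1, $2, note
        }
    ' "$1"
}

# Assumed usage:
#   find_risky_fstab_entries /etc/fstab
```

Entries under /dev/mapper, UUID=, and network sources are deliberately not reported.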

Fixing fstab via a Rescue Instance

1. Prepare a Rescue Instance

Launch a Linux rescue instance in the same availability zone. Stop the original instance, detach the original root volume, and attach it to the rescue instance.
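The same steps can be done with the AWS CLI instead of the console. This is a sketch under assumptions: the helper names (run, stop_instance, move_volume) are hypothetical, all IDs are placeholders, and /dev/sdf is just an example device name for the rescue attachment. Commands are only printed unless DRY_RUN=0 is set.

```shell
# Print commands instead of running them unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# stop_instance <instance-id>: stop and wait until fully stopped.
stop_instance() {
    run aws ec2 stop-instances --instance-ids "$1"
    run aws ec2 wait instance-stopped --instance-ids "$1"
}

# move_volume <volume-id> <target-instance-id> <device-name>
# Detach the volume, wait until it is available, attach it to the target.
move_volume() {
    run aws ec2 detach-volume --volume-id "$1"
    run aws ec2 wait volume-available --volume-ids "$1"
    run aws ec2 attach-volume --volume-id "$1" --instance-id "$2" --device "$3"
}

# Assumed usage:
#   stop_instance <original-instance-id>
#   move_volume <root-volume-id> <rescue-instance-id> /dev/sdf
```

The same move_volume call, with the original instance ID and its root device name, covers the return trip once the repair is done.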

2. Mount the Original System Root Partition

If the original system uses LVM, first install the tools and activate the volume group:

sudo dnf install lvm2 -y
sudo vgscan
sudo vgchange -ay
sudo lvs

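Mount the original root partition. The nouuid option applies only to XFS, where it allows mounting a filesystem whose UUID duplicates one that is already mounted (typical when the rescue instance was launched from the same image); omit it for other filesystems: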

sudo mkdir -p /mnt/rescue
sudo mount -o nouuid /dev/<vg-name>/<root-lv> /mnt/rescue
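Because nouuid is an XFS mount option, a rescue script may want to choose the option from the detected filesystem type. A minimal sketch, with rescue_mount_options as a hypothetical helper name:

```shell
# rescue_mount_options <fs-type>
# Prints the mount options to use for the rescued root filesystem.
rescue_mount_options() {
    case "$1" in
        xfs) echo "nouuid" ;;     # tolerate a UUID that is already mounted
        *)   echo "defaults" ;;   # ext4 and others do not accept nouuid
    esac
}

# Assumed usage:
#   fstype=$(sudo blkid -o value -s TYPE /dev/<vg-name>/<root-lv>)
#   sudo mount -o "$(rescue_mount_options "$fstype")" /dev/<vg-name>/<root-lv> /mnt/rescue
```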

3. Modify fstab

sudo vi /mnt/rescue/etc/fstab

First comment out the risky data volume mount entries so the system can boot. After booting, use lsblk -f to get the UUID and switch to a stable configuration.
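Commenting out an entry can also be scripted, which is less error-prone than editing under pressure. The sketch below assumes a hypothetical helper, disable_fstab_entry, that selects the entry by mount point and keeps a .bak copy next to the file; it needs to run in a root shell because it writes to the rescued fstab.

```shell
# disable_fstab_entry <fstab-path> <mount-point>
# Comments out the active entry for the given mount point.
# The previous content is kept in <fstab-path>.bak.
disable_fstab_entry() {
    fstab=$1
    mount_point=$2
    cp -p "$fstab" "$fstab.bak" || return 1
    awk -v mp="$mount_point" '
        $1 !~ /^#/ && $2 == mp { print "#" $0; next }
        { print }
    ' "$fstab.bak" > "$fstab"
}

# Assumed usage (in a root shell on the rescue instance):
#   disable_fstab_entry /mnt/rescue/etc/fstab /data
```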

4. Unmount and Reattach to the Original Instance

sudo umount /mnt/rescue
sudo vgchange -an <vg-name>

Then, in the console, detach the root volume from the rescue instance, reattach it to the original instance under its original root device name, start the instance, and verify.

Verification

After the instance boots, execute:

lsblk -f
sudo systemctl daemon-reload
sudo mount -a
df -h

If mount -a completes without errors, the fstab configuration is essentially correct.
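One caveat: with nofail, a missing device is not reported as an error, so a clean mount -a does not prove the data volume is actually mounted. A small check against the mount table closes that gap. This is a sketch; is_mounted is a hypothetical helper and /data is the mount point from the earlier example.

```shell
# is_mounted <mount-point> [mounts-file]
# Succeeds if the mount point appears in the mount table
# (defaults to /proc/self/mounts).
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' \
        "${2:-/proc/self/mounts}"
}

# Assumed usage:
#   is_mounted /data || echo "/data is NOT mounted" >&2
```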

Summary

EC2 Linux boot failures should be diagnosed at two layers:

AWS control plane: KMS, IAM, EBS status, volume attachment relationships.
OS internals: fstab, LVM, filesystem, network mounts.

For boot failures involving encrypted volumes, check CloudTrail and the KMS permissions first; for systems stuck in maintenance mode, check the console system log and /etc/fstab first. For data volume mounts, use UUID plus nofail consistently so that a device name change cannot block system startup.