Troubleshooting RHEL EC2 Boot Failure and Instance Status Check Issues
When an EC2 Linux instance fails to boot or fails its instance status check, the cause may lie in the AWS control plane, inside the operating system, or in both at once. In this case there were two causes: missing KMS permissions prevented the encrypted EBS volumes from being decrypted, and /etc/fstab referenced unstable device names, which dropped the system into maintenance mode.
Symptoms
The failures fall into two categories:
- Instance fails to boot, and CloudTrail shows KMS CreateGrant or Decrypt permission errors.
- Instance attempts to boot, but the system log is stuck at:
Give root password for maintenance
(or press Control-D to continue):
In the second case the OS never finishes starting, so the instance cannot respond to the underlying health checks, and the result is an instance status check failure.
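To tell the two symptoms apart, the console output can be saved to a file and searched for the maintenance prompt. The following is a minimal sketch: the function name `is_maintenance_mode` and the file name `console.log` are illustrative, not part of any AWS tooling.

```shell
# Hypothetical helper: report whether a saved console log contains the
# maintenance-mode prompt quoted above.
is_maintenance_mode() {
  local log_file="$1"
  # -q: no output, -F: match a fixed string rather than a regex
  if grep -qF "Give root password for maintenance" "$log_file"; then
    echo "maintenance"
  else
    echo "ok"
  fi
}

# Possible usage (needs AWS CLI credentials; the instance ID is a placeholder):
#   aws ec2 get-console-output --instance-id <instance-id> \
#     --query Output --output text > console.log
#   is_maintenance_mode console.log
```

For the first symptom, a query such as `aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventName,AttributeValue=CreateGrant` may help surface the denied KMS calls, assuming CloudTrail event history is available in the region.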
Problem 1: Insufficient KMS Permissions
If the EBS volume is encrypted with a customer managed KMS key, the IAM principal (user or role) that launches or starts the instance must be allowed to use that key. At minimum, the following are required:
{
  "Effect": "Allow",
  "Action": [
    "kms:Decrypt",
    "kms:GenerateDataKey",
    "kms:CreateGrant"
  ],
  "Resource": "arn:aws-cn:kms:<region>:<account-id>:key/<key-id>"
}
If these permissions are missing, EC2 cannot decrypt the system or data volumes during the boot phase, and the instance will fail to start.
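As a quick sanity check before launching, a saved copy of the policy document can be searched for the three actions listed above. This is only a sketch: `missing_kms_actions` is a hypothetical helper, and it performs a plain text search, so it does not understand wildcards such as `kms:*` and does not evaluate key policies or conditions.

```shell
# Hypothetical helper: print each of the three actions named above that does
# not appear literally (as a quoted string) in the given policy JSON file.
missing_kms_actions() {
  local policy_file="$1" action
  for action in kms:Decrypt kms:GenerateDataKey kms:CreateGrant; do
    # The surrounding quotes keep "kms:GenerateDataKey" from matching
    # longer names such as "kms:GenerateDataKeyWithoutPlaintext".
    grep -qF "\"$action\"" "$policy_file" || echo "$action"
  done
}

# Possible usage (policy.json is an assumed file name):
#   missing_kms_actions policy.json
```

For a check against the live configuration, `aws iam simulate-principal-policy --policy-source-arn <role-arn> --action-names kms:Decrypt kms:GenerateDataKey kms:CreateGrant --resource-arns <key-arn>` is likely a better fit, with the caveat that it evaluates the principal's identity policies and may not reflect the key policy.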
Problem 2: fstab Blocking Boot
On Linux, NVMe device names may change with reboots or underlying changes. If /etc/fstab contains a hardcoded entry like:
/dev/nvme2n1p1 /data ext4 defaults 1 2
When the device name changes or the volume does not exist, the system will wait for the mount during boot and eventually enter emergency / maintenance mode.
A more stable approach is to use UUID with nofail:
UUID=<volume-uuid> /data ext4 defaults,nofail 1 2
Network filesystems should also add _netdev:
server:/share /mnt/share nfs defaults,_netdev,nofail 0 0
Fixing fstab via a Rescue Instance
1. Prepare a Rescue Instance
Launch a Linux rescue instance in the same availability zone. Stop the original instance, detach the original root volume, and attach it to the rescue instance.
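The stop, detach, and attach sequence can also be driven from the AWS CLI. The sketch below only prints the commands so they can be reviewed before running. `rescue_plan` is a hypothetical helper, all IDs are placeholders, and `/dev/sdf` is an assumed attachment name (on Nitro instances the volume will still appear as an NVMe device inside the OS).

```shell
# Hypothetical helper: print the AWS CLI commands for moving the root volume
# of a broken instance onto a rescue instance. Nothing is executed.
rescue_plan() {
  local instance_id="$1" volume_id="$2" rescue_id="$3" device="${4:-/dev/sdf}"
  printf 'aws ec2 stop-instances --instance-ids %s\n' "$instance_id"
  printf 'aws ec2 wait instance-stopped --instance-ids %s\n' "$instance_id"
  printf 'aws ec2 detach-volume --volume-id %s\n' "$volume_id"
  printf 'aws ec2 wait volume-available --volume-ids %s\n' "$volume_id"
  printf 'aws ec2 attach-volume --volume-id %s --instance-id %s --device %s\n' \
    "$volume_id" "$rescue_id" "$device"
}

# Possible usage:
#   rescue_plan <instance-id> <root-volume-id> <rescue-instance-id>
```

Printing rather than executing keeps the destructive steps under manual control, which seems a reasonable default when the root volume of a production instance is involved.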
2. Mount the Original System Root Partition
If the original system uses LVM, first install the tools and activate the volume group:
sudo dnf install lvm2 -y
sudo vgscan
sudo vgchange -ay
sudo lvs
Mount the original root partition:
sudo mkdir -p /mnt/rescue
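sudo mount -o nouuid /dev/<vg-name>/<root-lv> /mnt/rescue
The nouuid option is specific to XFS. It lets the mount succeed when the rescue instance's own filesystem carries the same UUID, for example when both instances were launched from the same AMI. Omit it if the root filesystem is ext4.
3. Modify fstab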
sudo vi /mnt/rescue/etc/fstab
First comment out the risky data volume mount entries so the system can boot. After booting, use lsblk -f to get the UUID and switch to a stable configuration.
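To decide which entries count as risky, a small filter over the fstab file can help. The following is a sketch under stated assumptions: `risky_fstab_entries` is a hypothetical helper, it treats `/dev/nvme*`, `/dev/sd*`, and `/dev/xvd*` sources as unstable, and it ignores the root filesystem and swap.

```shell
# Hypothetical helper: print fstab entries that could block boot, either
# because they name a raw device path or because they lack nofail.
risky_fstab_entries() {
  local fstab="$1"
  awk '
    # Skip comments and lines with fewer than four fields (including blanks).
    /^[ \t]*#/ || NF < 4 { next }
    # The root filesystem and swap are out of scope for this check.
    $2 == "/" || $3 == "swap" { next }
    # Raw device names can change between boots.
    $1 ~ /^\/dev\/(nvme|sd|xvd)/ { print "device-name: " $0; next }
    # Wrap the option list in commas so nofail matches only as a whole option.
    ("," $4 ",") !~ /,nofail,/ { print "no-nofail: " $0 }
  ' "$fstab"
}

# Possible usage on the rescue instance:
#   risky_fstab_entries /mnt/rescue/etc/fstab
```

LVM paths such as `/dev/mapper/...` are deliberately not flagged, since those names do not depend on device enumeration order.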
4. Unmount and Reattach to the Original Instance
sudo umount /mnt/rescue
sudo vgchange -an
Then in the console, detach the volume from the rescue instance, reattach it to the original instance under its original root device name, start the instance, and verify.
Verification
After the instance boots, execute:
lsblk -f
sudo systemctl daemon-reload
sudo mount -a
df -h
If mount -a completes without errors, the fstab configuration is essentially correct.
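Once the UUID is known from lsblk -f, the replacement fstab line can be generated rather than typed by hand. This is a minimal sketch: `uuid_fstab_entry` is a hypothetical helper, and the options and the trailing `1 2` fields simply mirror the UUID example given earlier.

```shell
# Hypothetical helper: build a UUID-based fstab line with nofail.
uuid_fstab_entry() {
  local uuid="$1" mount_point="$2" fs_type="$3"
  printf 'UUID=%s %s %s defaults,nofail 1 2\n' "$uuid" "$mount_point" "$fs_type"
}

# Possible usage (the device path is an example; blkid prints only the UUID):
#   uuid_fstab_entry "$(sudo blkid -s UUID -o value /dev/nvme2n1p1)" /data ext4
```

After adding the generated line to /etc/fstab, rerunning sudo mount -a confirms that the new entry resolves.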
Summary
EC2 Linux boot failures should be diagnosed at two separate layers:
- AWS control plane: KMS, IAM, EBS status, volume attachment relationships.
- OS internals: fstab, LVM, filesystem, network mounts.
For encrypted volume boot failures, prioritize checking CloudTrail and KMS permissions; for systems stuck in maintenance mode, prioritize checking the console system log and /etc/fstab. For data volume mounts, it is recommended to consistently use UUID + nofail to prevent device name changes from blocking system startup.
