Counting Objects with Specific Extensions in an S3 Bucket by Size and Count
Sometimes you need to quickly count the number and total size of a certain type of file in an S3 bucket, such as .jpg or .png image objects. If the bucket has versioning enabled, historical versions need to be included as well. When the answer is needed right away, waiting for S3 Inventory is usually not practical; streaming the listing through the AWS CLI is faster.
Scenario
The goal is to count the following in a versioning-enabled S3 bucket:
- Number of objects with a specified file extension.
- Total size of objects with a specified file extension.
- Including all historical versions, not just the current version.
If the business requires "results right now," S3 Inventory may not be suitable because it is an asynchronous report and the first generation typically has a delay.
Why Use list-object-versions
The regular list-objects only looks at the current object version and cannot cover historical versions. For buckets with versioning enabled, use:
aws s3api list-object-versions
Then use --query 'Versions[*].[Key, Size]' to extract only the object Key and Size, reducing downstream processing cost.
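To make the shape of that output concrete, here is a minimal sketch that runs offline. It assumes the text output is one "Key<TAB>Size" line per version; the keys and sizes are made up for illustration, and the helper names sample_listing and filter_images are hypothetical. It shows which lines an extension filter of the kind used in the counting command keeps.

```shell
# Hypothetical sample shaped like the tab-separated "Key<TAB>Size" text output.
# These keys and sizes are invented for illustration, not real bucket data.
sample_listing() {
  printf 'photos/a.jpg\t1024\n'
  printf 'photos/a.jpg\t2048\n'
  printf 'docs/readme.txt\t512\n'
  printf 'icons/logo.PNG\t4096\n'
}

# Keep only lines whose key ends in .jpg or .png (case-insensitive),
# followed by whitespace and the numeric size at the end of the line.
filter_images() {
  grep -Ei '\.(jpg|png)[[:space:]]+[0-9]+$'
}

# Prints the two photos/a.jpg versions and icons/logo.PNG; the .txt line is dropped.
sample_listing | filter_images
```

The same key appearing twice stands for two versions of one object, which is why both lines are counted.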
Counting Command
The following example counts .jpg and .png objects:
aws s3api list-object-versions \
  --bucket <bucket-name> \
  --region <region> \
  --query 'Versions[*].[Key, Size]' \
  --output text |
grep -Ei "\.(jpg|png)[[:space:]]+[0-9]+$" |
awk '
BEGIN {
  fmt = "Total image objects (including historical versions): %d\n"
}
{
  count++;
  size += $NF;
}
END {
  print "======================";
  printf fmt, count;
  printf "Total size (including historical versions): %.2f GB\n", size/1024/1024/1024;
  print "======================";
}'
Example output:
======================
Total image objects (including historical versions): 120446
Total size (including historical versions): 56.33 GB
======================
Notes
- Run this on an EC2 instance in the same region as the bucket to reduce network latency.
- Each List call returns at most 1,000 entries, so a bucket with many objects needs many calls, and each one is billed as a List request.
- If you only need the current version, do not use list-object-versions; use list-objects-v2 instead.
- If the object scale is very large and some delay is acceptable, S3 Inventory is more suitable for periodic reporting.
- If keys contain special characters such as newlines, text pipeline processing will have edge-case issues. For rigorous scenarios, use JSON + jq.
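For the JSON + jq route mentioned in the notes, one possible sketch is shown below. It assumes jq is installed and that the listing fits in memory, since the CLI assembles JSON output into a single document. The function name count_images_json and its output format are hypothetical. It reads either a list-object-versions document (Versions) or a list-objects-v2 document (Contents), so the current-version-only case is handled by the same filter.

```shell
# Hypothetical helper: reads listing JSON on stdin, prints "count=N bytes=M".
# Keys are matched as JSON strings, so newlines or tabs inside a key
# cannot break the line-based parsing the way they can with grep + awk.
count_images_json() {
  jq -r '
    [ (.Versions // .Contents // [])[]
      | select(.Key | ascii_downcase | (endswith(".jpg") or endswith(".png")))
      | .Size ]
    | "count=\(length) bytes=\(add // 0)"
  '
}

# Usage (not executed here; <bucket-name> and <region> are placeholders):
#   aws s3api list-object-versions --bucket <bucket-name> --region <region> \
#     --output json | count_images_json
#   aws s3api list-objects-v2 --bucket <bucket-name> --region <region> \
#     --output json | count_images_json
```

Matching with endswith on the lowercased key avoids regular-expression anchoring questions for keys that contain line breaks; the byte total can be converted to GB the same way as in the awk version.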
Summary
When you need to urgently count objects with specific extensions in an S3 bucket, list-object-versions + grep + awk is a simple and effective solution. Its advantages are that it is real-time, lightweight, and requires no waiting for Inventory; its disadvantage is that it is more suited for one-off counting, not long-term periodic reporting.
