BASH
Extract Unique Domain Names from a List of URLs
A short bash script, built around a single pipeline, that parses a file of URLs (one per line) and prints only the unique domain names. Useful for summarizing link lists or URLs pulled from web server access logs, and for building whitelists/blacklists.
#!/bin/bash
URL_FILE="urls.txt" # Your file containing one URL per line
if [ ! -f "$URL_FILE" ]; then
    echo "Error: URL file '$URL_FILE' not found." >&2
    exit 1
fi
echo "Extracting unique domains from '$URL_FILE':"
# Use sed and sort -u to extract unique domains
cat "$URL_FILE" | \
sed -E 's/^[[:alpha:]]+:\/\/([[:alnum:]_.-]+).*/\1/' | \
sed -E 's/^www\.//' | \
sort -u
# Explanation of the first sed regex:
# ^[[:alpha:]]+:\/\/   - Matches the scheme prefix: 'http://', 'https://', etc.
# ([[:alnum:]_.-]+)    - Captures the hostname (alphanumerics, underscore, dot, hyphen)
# .*                   - Matches the rest of the line (port, path, query), which is discarded
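How it works: The script reads a file with one URL per line and prints the unique domain names it contains. `cat` feeds the file into the pipeline. The first `sed` command matches the scheme (`http://`, `https://`, or any other alphabetic scheme), captures the hostname that follows, and replaces the whole line with that capture, which drops the port, path, and query string. The second `sed` removes a leading `www.` so that `www.example.com` and `example.com` count as the same domain. Finally, `sort -u` sorts the hostnames and removes duplicates. Note the limits of the regex: lines that do not start with a scheme pass through the first `sed` unchanged, a URL with credentials (`user:pass@host`) yields the user name rather than the host, and hostnames that differ only in letter case are listed separately.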
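As a sketch of how the pipeline could be hardened, the variant below skips lines that have no scheme, steps over `user:pass@` credentials, and lowercases hostnames before deduplicating. The function name `extract_domains` and the sample URLs are hypothetical and not part of the original script; the regex is one reasonable choice, not a full URL parser (bracketed IPv6 hosts, for example, are simply dropped).

```bash
#!/bin/bash
# Hypothetical helper (not in the original script): reads URLs on stdin and
# prints unique, lowercased hostnames without a leading "www.".
extract_domains() {
    # -n plus the trailing "p" prints only lines where the substitution
    # matched, so lines without a scheme://host prefix are dropped.
    # ([^/?#@]*@)? optionally consumes "user:pass@" before the hostname.
    sed -nE 's|^[[:alpha:]][[:alnum:]+.-]*://([^/?#@]*@)?([[:alnum:]_.-]+).*|\2|p' |
        tr '[:upper:]' '[:lower:]' |
        sed -E 's/^www\.//' |
        LC_ALL=C sort -u   # fixed locale keeps the sort order predictable
}

# Sample input, made up for illustration
printf '%s\n' \
    'https://www.Example.com/path?q=1' \
    'http://user:pass@example.org:8080/x' \
    'ftp://files.example.net' \
    'HTTP://EXAMPLE.COM' \
    'not a url' |
    extract_domains
# prints:
# example.com
# example.org
# files.example.net
```

To run it against a real file, redirect the file into the function, for example `extract_domains < urls.txt`.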