BASH

Extract Unique Domain Names from a List of URLs

A short bash script, built around a single `sed`/`sort` pipeline, that parses a file of URLs and extracts the unique hostnames. Useful for analyzing link lists or URLs pulled from web server access logs, and for building whitelists/blacklists.

#!/bin/bash

URL_FILE="urls.txt" # Your file containing one URL per line

if [ ! -f "$URL_FILE" ]; then
  echo "Error: URL file '$URL_FILE' not found." >&2
  exit 1
fi

echo "Extracting unique domains from '$URL_FILE':"

# Use two sed passes and sort -u to extract unique hostnames
sed -E 's/^[[:alpha:]]+:\/\/([[:alnum:]_.-]+).*/\1/' "$URL_FILE" | \
  sed -E 's/^www\.//' | \
  sort -u

# Explanation of sed regex:
# ^[[:alpha:]]+:\/\/       - Matches the scheme prefix: 'http://', 'https://', etc.
# ([[:alnum:]_.-]+)        - Captures the hostname (alphanumeric, underscore, dot, hyphen)
# .*                       - Matches the rest of the URL (port, path, query)

How it works: The script reads the URL file line by line. The first `sed` command matches the scheme (`http://`, `https://`, etc.), captures the hostname that follows, and discards everything after it (port, path, query string), leaving just the hostname. The second `sed` removes a leading `www.` so that `www.example.com` and `example.com` count as the same domain. Finally, `sort -u` sorts the hostnames and removes duplicates. The result is the full hostname, subdomains included, so `blog.example.com` and `example.com` are separate entries. A line with no `scheme://` prefix does not match the first expression and passes through unchanged, and the script does not lowercase hostnames, so `Example.com` and `example.com` are typically listed separately.
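The pipeline can be tried on a few sample lines without creating a file. The sketch below wraps the same two `sed` passes and `sort -u` in a function that reads from standard input. The function names `extract_domains` and `extract_domains_strict` and the sample URLs are made up for illustration. The strict variant is not part of the original script: it assumes you want hostnames lowercased and lines without a `scheme://` prefix skipped rather than passed through.

```bash
#!/bin/bash

# Hypothetical helper: the same sed/sort pipeline as the script, reading URLs from stdin.
extract_domains() {
  sed -E 's/^[[:alpha:]]+:\/\/([[:alnum:]_.-]+).*/\1/' |
    sed -E 's/^www\.//' |
    sort -u
}

# Hypothetical stricter variant (an assumption, not the original behavior):
#   - tr lowercases each line so hostnames differing only in case collapse
#   - sed -n with the 'p' flag prints only lines where the scheme pattern matched
extract_domains_strict() {
  tr '[:upper:]' '[:lower:]' |
    sed -nE 's/^[[:alpha:]]+:\/\/([[:alnum:]_.-]+).*/\1/p' |
    sed -E 's/^www\.//' |
    sort -u
}

# Sample input (made-up URLs)
printf '%s\n' \
  'https://www.example.com/index.html' \
  'http://example.com/about?ref=home' \
  'https://blog.example.org:8443/post/1' \
  'ftp://files.example.net/pub/' |
  extract_domains
# Prints:
#   blog.example.org
#   example.com
#   files.example.net

printf '%s\n' \
  'HTTPS://WWW.Example.COM/x' \
  'not a url' |
  extract_domains_strict
# Prints:
#   example.com
```

Reading from standard input keeps the function usable on a file (`extract_domains < urls.txt`) as well as at the end of a longer pipeline.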
