PYTHON
Extract All URLs from a String
A Python regex pattern to find and extract URLs (http://, https://, www.-prefixed, and bare domain references) embedded within a larger text string, for content parsing.
import re

def extract_urls(text):
    # Match http(s):// URLs, www.-prefixed hosts, and bare domains (e.g. example.net)
    url_regex = r'(?:https?://|www\.)[^\s]+|\b[a-zA-Z0-9-]+\.[a-z]{2,}(?:/[^\s]*)?'
    urls = re.findall(url_regex, text)
    # Strip trailing punctuation that the greedy [^\s]+ can swallow,
    # such as the period ending a sentence
    return [url.rstrip('.,;:!?)') for url in urls]
# Example Usage:
text_content = "Visit our website at https://example.com or find more info at http://www.anothersite.org/page.html. Also, check out example.net."
found_urls = extract_urls(text_content)
print(found_urls)
# Expected: ['https://example.com', 'http://www.anothersite.org/page.html', 'example.net']
How it works: This Python function uses a regular expression to find every URL-like pattern ('http://', 'https://', 'www.'-prefixed, and bare domain references) in a given text string. It's useful for parsing user-generated content, extracting links from articles, or web scraping tasks.
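A regex match is only a candidate, not a guaranteed valid URL. If stricter filtering is needed, the standard library's urllib.parse can check each candidate's structure. The sketch below is illustrative and not part of the snippet above; the normalize_url helper name is an assumption:

```python
from urllib.parse import urlparse

def normalize_url(candidate):
    # Illustrative helper: prepend a scheme so bare domains parse with a netloc
    if not candidate.startswith(('http://', 'https://')):
        candidate = 'https://' + candidate
    parsed = urlparse(candidate)
    # A plausible URL needs a network location containing at least one dot
    if parsed.netloc and '.' in parsed.netloc:
        return candidate
    return None

print(normalize_url('example.net'))  # → https://example.net
print(normalize_url('not_a_url'))    # → None
```

Candidates returned by extract_urls can be passed through this check to discard false positives before further processing.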