PYTHON
Extract All URLs from a String
A Python regex pattern to find and extract URLs (http://, https://, www.-prefixed, and bare domain references) embedded within a larger text string, for content parsing.
import re

def extract_urls(text):
    # Match http(s):// URLs, www.-prefixed hosts, and bare domains (e.g. example.net)
    url_regex = r'(?:https?://|www\.)[^\s]+|\b[a-zA-Z0-9-]+\.[a-z]{2,}(?:/[^\s]*)?'
    urls = re.findall(url_regex, text)
    # Strip trailing punctuation that the greedy [^\s]+ can swallow,
    # such as the period ending a sentence
    return [url.rstrip('.,;:!?)') for url in urls]
# Example Usage:
text_content = "Visit our website at https://example.com or find more info at http://www.anothersite.org/page.html. Also, check out example.net."
found_urls = extract_urls(text_content)
print(found_urls)
# Expected: ['https://example.com', 'http://www.anothersite.org/page.html', 'example.net']
How it works: This Python function uses a regular expression to find every URL-like pattern ('http://', 'https://', 'www.'-prefixed, and bare domain references) in a given text string. It's useful for parsing user-generated content, extracting links from articles, or web scraping tasks.
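A regex match is only a candidate, not a guaranteed valid URL. If stricter filtering is needed, the standard library's urllib.parse can check each candidate's structure. The sketch below is illustrative and not part of the snippet above; the normalize_url helper name is an assumption:

```python
from urllib.parse import urlparse

def normalize_url(candidate):
    # Illustrative helper: prepend a scheme so bare domains parse with a netloc
    if not candidate.startswith(('http://', 'https://')):
        candidate = 'https://' + candidate
    parsed = urlparse(candidate)
    # A plausible URL needs a network location containing at least one dot
    if parsed.netloc and '.' in parsed.netloc:
        return candidate
    return None

print(normalize_url('example.net'))  # → https://example.net
print(normalize_url('not_a_url'))    # → None
```

Candidates returned by extract_urls can be passed through this check to discard false positives before further processing.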