JAVASCRIPT
Extracting All Links and HREFs from an HTML String
A JavaScript regex snippet that scans an HTML string and extracts each anchor (<a>) tag's href attribute and link text, useful for web scraping or content analysis.
const extractLinksAndHrefs = (htmlString) => {
  const links = [];
  // This regex captures the href attribute and the inner text of <a> tags.
  // Group 1 is the opening quote, group 2 the href value, group 3 the link text.
  // It's a simple approach and might not handle complex or malformed HTML perfectly.
  const linkRegex = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1(?:[^>]*?)?>(.*?)<\/a>/gi;
  let match;
  // With the g flag, each exec() call resumes after the previous match.
  while ((match = linkRegex.exec(htmlString)) !== null) {
    links.push({
      href: match[2],
      text: match[3].trim()
    });
  }
  return links;
};
const htmlContent = `
<p>Visit <a href="https://example.com/page1">Page One</a> or
<a class="nav-link" href="/local/path/to/page2.html" target="_blank">Page Two</a>.</p>
<img src="/image.jpg">
<a href="mailto:[email protected]">Email Us</a>
`;
console.log(extractLinksAndHrefs(htmlContent));
/* Expected Output:
[
  { href: "https://example.com/page1", text: "Page One" },
  { href: "/local/path/to/page2.html", text: "Page Two" },
  { href: "mailto:[email protected]", text: "Email Us" }
]
*/
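How it works: The `extractLinksAndHrefs` function calls `exec` on a regular expression in a loop, so each iteration returns the next `<a>` (anchor) tag in the HTML string. For each tag it reads two capture groups: the value of the `href` attribute (group 2) and the link's inner text (group 3), which is trimmed. Group 1 captures the opening quote so that the backreference `\1` requires the same closing quote. The `g` flag makes `exec` resume after the previous match, so every link is found, and the `i` flag makes the match case-insensitive. The results are returned as an array of `{ href, text }` objects. The pattern expects a quoted `href` and link text on a single line: because `.` does not match newlines, anchors whose text spans several lines are skipped, and any markup nested inside a link is returned as-is.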
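To make the capture groups and flags concrete, here is a minimal sketch that runs the same pattern against a single anchor. The `sample` string and the `groups` object are hypothetical, chosen only to exercise the `i` flag (uppercase tag and attribute names) and the quote backreference (a single-quoted `href`).

```javascript
// Same pattern as in the snippet, written as a regex literal with single backslashes.
const linkRegex = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1(?:[^>]*?)?>(.*?)<\/a>/gi;

// Hypothetical sample: uppercase tag and attribute names, single-quoted href.
const sample = `<A class="nav" HREF='/docs' target="_blank"> Docs </A>`;

// First call: returns the match array and advances lastIndex (g flag).
const match = linkRegex.exec(sample);
const groups = {
  quote: match[1],               // the quote character that opened the href value
  href: match[2],                // everything up to the matching closing quote
  text: match[3].trim(),         // inner text with surrounding whitespace removed
  nextIndex: linkRegex.lastIndex // where the next exec() call would resume
};
console.log(groups);

// Second call: no further anchors, so exec() returns null and resets lastIndex to 0.
const second = linkRegex.exec(sample);
console.log(second, linkRegex.lastIndex);
```

Because `lastIndex` lives on the regex object, creating the pattern inside the function, as `extractLinksAndHrefs` does, gives every call a fresh starting position.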