Dan Levy, DanLevy.net

From Zero to Regex Hero

Extract & Parse URL-like Strings with a Single Regex

TL;DR: Jump ahead to the 120+ Byte Regex.

🚀 Introduction

Extracting URLs from raw text can feel like a tedious game of whack-a-mole. Punctuation, parenthetical wrappers, and ambiguous formatting all conspire to frustrate your efforts. Whether you’re building a web scraper, a data analyzer, or a chat application, accurately extracting URLs is essential.

In this post, we’ll tackle the problem head-on with a flexible, two-step approach. Our goal is to capture all potential URL-like strings first and then handle validation in a subsequent process.

💡 Note: This pattern is not for validating URLs! It’s intentionally permissive with punctuation and bad spelling.

🔍 Goal: Extract URLs from Text

When extracting URLs from raw text, a two-step approach is effective:

  1. Capture Everything URL-like: Cast a wide net to grab all strings that could be URLs. This is where our “120+ byte regex” shines.
  2. Validate: Once you’ve captured these candidates, use secondary checks (e.g., DNS resolution, comparison against known domains) to weed out invalid entries.
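
The two steps above can be sketched in JavaScript. This is a minimal illustration under stated assumptions, not a definitive implementation: `extractCandidates`, `keepKnownDomains`, and the `knownDomains` set are hypothetical names, the regex is the one introduced later in this post, and the known-domain comparison stands in for whatever validation your project needs. It also assumes `String.prototype.matchAll` (ES2020) is available.

```javascript
// Step 1 pattern: the permissive regex from this post
// (groups: protocol, domain, path, query, fragment).
const urlRegex = /([-.a-z0-9]+:\/{1,3})([^-\/\.[\](|)\s?][^`\/\s\]?]+)([-_a-z0-9!@$%^&*()=+;/~\.]*)[?]?([^#\s`?]*)[#]?([^#\s'"`\.,!]*)/gi;

// Step 1: capture everything URL-like, keeping the domain for later checks.
function extractCandidates(text) {
  return Array.from(text.matchAll(urlRegex), (m) => ({ url: m[0], domain: m[2] }));
}

// Step 2 (hypothetical validator): keep only candidates whose domain is in a known set.
function keepKnownDomains(candidates, knownDomains) {
  return candidates.filter((c) => knownDomains.has(c.domain.toLowerCase()));
}

const text = 'Try https://example.com/docs or http://typo.exmaple/x today';
const candidates = extractCandidates(text);
const valid = keepKnownDomains(candidates, new Set(['example.com']));

console.log(candidates.length);       // 2 candidates captured
console.log(valid.map((c) => c.url)); // [ 'https://example.com/docs' ]
```

Keeping the two steps as separate functions means the validator can be swapped (DNS lookup, allow-list, reachability check) without touching the regex.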

Visualizing the Challenge

Terms like extract and parse are often used interchangeably, but they refer to distinct processes. Extracting URLs means identifying and capturing potential URLs in a larger body of text. Parsing, on the other hand, means breaking those URLs down into their constituent parts.

When I mention parsing or ‘URL parts’, I’m referring to the following components:

The 5 Parts of all URLs: protocol, domain, path, query, and fragment
[Figure: URL anatomy, visualized]

Before we get too deep into the regex, let’s use a visual tool to see how well my pattern captures many matches:

[Figure: Using RegEx101.com to preview 'bulk' multi-line matches]

🛳️ The 120+ Byte Regex

Below is a concise regex designed to extract and parse URLs in a single shot. It supports various protocols, domains, paths, and optional query/fragment sections.

Don’t worry—we’ll break it down step by step!

120+ Byte URL Regex
const urlRegex = /([-.a-z0-9]+:\/{1,3})([^-\/\.[\](|)\s?][^`\/\s\]?]+)([-_a-z0-9!@$%^&*()=+;/~\.]*)[?]?([^#\s`?]*)[#]?([^#\s'"`\.,!]*)/gi;
// Compatibility: ES5+
// Same pattern, split on newlines for readability:
([-.a-z0-9]+:\/{1,3})
([^-\/\.[\](|)\s?][^`\/\s\]?]+)
([-_a-z0-9!@$%^&*()=+;/~\.]*)
[?]?([^#\s`?]*)
[#]?([^#\s'"`\.,!]*)
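
If you prefer to keep the split form in real code, one option is to store each line as its own regex and join the sources. This is only a sketch: the `parts` array and the `composedRegex` name are hypothetical, and the per-line labels come from the group names used in the breakdown below.

```javascript
// Each entry mirrors one line of the split pattern above.
const parts = [
  /([-.a-z0-9]+:\/{1,3})/,           // protocol
  /([^-\/\.[\](|)\s?][^`\/\s\]?]+)/, // domain
  /([-_a-z0-9!@$%^&*()=+;/~\.]*)/,   // path
  /[?]?([^#\s`?]*)/,                 // query
  /[#]?([^#\s'"`\.,!]*)/,            // fragment
];

// Join the sources back into one pattern with the same flags as the one-liner.
const composedRegex = new RegExp(parts.map((p) => p.source).join(''), 'gi');

const sample = 'https://example.com/path?x=1#top';
console.log(composedRegex.exec(sample).slice(1));
// [ 'https://', 'example.com', '/path', 'x=1', 'top' ]
```

The trade-off is a little runtime assembly in exchange for a pattern you can comment line by line.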
Share the wildest regexes you’ve encountered (or authored) in the comments below! 🚀

🧩 Breaking It Down Step by Step

Let’s dissect the regex into its components to understand how it works:

1. Protocol (Group 1): ([-.a-z0-9]+:\/{1,3})

2. Domain (Group 2): ([^-\/\.[\](|)\s?][^`\/\s\]?]+)

3. Path (Group 3): ([-_a-z0-9!@$%^&*()=+;/~\.]*)

4. Query (Group 4): [?]?([^#\s`?]*)

5. Fragment (Group 5): [#]?([^#\s'"`\.,!]*)
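
One way to make the five groups easier to work with is to give them names. The sketch below is a variant of the pattern using named capture groups, which need ES2018 or later, so it no longer fits the ES5+ compatibility note. The group names are illustrative labels taken from the headings above, and `namedUrlRegex` is a hypothetical name.

```javascript
// Same five sub-patterns as above, with each group named after its URL part.
const namedUrlRegex = /(?<protocol>[-.a-z0-9]+:\/{1,3})(?<domain>[^-\/\.[\](|)\s?][^`\/\s\]?]+)(?<path>[-_a-z0-9!@$%^&*()=+;/~\.]*)[?]?(?<query>[^#\s`?]*)[#]?(?<fragment>[^#\s'"`\.,!]*)/i;

const match = namedUrlRegex.exec('https://example.com/path/to/page?x=1&y=2#top');
const { protocol, domain, path, query, fragment } = match.groups;

console.log({ protocol, domain, path, query, fragment });
// protocol: 'https://', domain: 'example.com', path: '/path/to/page',
// query: 'x=1&y=2', fragment: 'top'
```

Note that the `?` and `#` separators sit outside the groups, so the query and fragment values come back without them.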

🛠️ Parsing Example

Here’s how you can put this monster regex to work, with a bit of JavaScript:
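
A minimal sketch of one way to do this, assuming `String.prototype.matchAll` (ES2020) is available. The `parseUrls` function and its property names are hypothetical, chosen to mirror the five groups above.

```javascript
const urlRegex = /([-.a-z0-9]+:\/{1,3})([^-\/\.[\](|)\s?][^`\/\s\]?]+)([-_a-z0-9!@$%^&*()=+;/~\.]*)[?]?([^#\s`?]*)[#]?([^#\s'"`\.,!]*)/gi;

// Turn every URL-like match in `text` into an object holding its five parts.
function parseUrls(text) {
  return Array.from(
    text.matchAll(urlRegex),
    ([url, protocol, domain, path, query, fragment]) => ({
      url, protocol, domain, path, query, fragment,
    })
  );
}

const text = 'Docs: https://example.com/path/to/page?x=1&y=2#top and ftp://files.example.org/pub/readme.txt end';
const parsed = parseUrls(text);

console.log(parsed.length);    // 2
console.log(parsed[0].query);  // 'x=1&y=2'
console.log(parsed[1].domain); // 'files.example.org'
```

Groups that match nothing come back as empty strings rather than `undefined`, because every group in the pattern is allowed to match zero characters except the protocol and domain. In an ES5-only environment, a `while` loop over `urlRegex.exec(text)` would do the same job.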

☑️ Next Steps

Depending on your use case, you might need to refine this regex or add more validation and post-processing steps.

Different Projects, Different Needs

Projects have varied requirements and security concerns:

  1. Web Scraping: Validate URLs to ensure they’re reachable and trustworthy.
  2. Data Processing: Extract URLs from user-generated content while ensuring safety.
  3. Data Analysis: Filter out duplicates or irrelevant links for research or marketing purposes.
  4. User-facing Applications: Automatically hyperlink URLs in chat apps or forums.

Post-Processing and Validation

After gathering potential URLs, apply additional checks:
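
As one illustration of such checks, the sketch below parses each candidate with the standard `URL` constructor, applies a protocol allow-list, drops hosts without a dot, and removes duplicates. The function name `postProcess`, the default allow-list, and the dot heuristic are assumptions for this example, not rules from the regex itself. DNS or reachability checks would need network calls and are left out.

```javascript
// Hypothetical post-processing step for candidates captured by the regex.
function postProcess(candidates, allowedProtocols = ['http:', 'https:']) {
  const seen = new Set();
  const results = [];
  for (const candidate of candidates) {
    let parsed;
    try {
      parsed = new URL(candidate); // throws a TypeError on unparseable input
    } catch {
      continue;
    }
    if (!allowedProtocols.includes(parsed.protocol)) continue; // protocol allow-list
    if (!parsed.hostname.includes('.')) continue;              // crude "looks like a domain" check
    if (seen.has(parsed.href)) continue;                       // de-duplicate on the normalized form
    seen.add(parsed.href);
    results.push(parsed.href);
  }
  return results;
}

const cleaned = postProcess([
  'https://Example.com/a',     // kept, host is lowercased by URL
  'https://example.com/a',     // duplicate after normalization
  'ftp://files.example.org/x', // protocol not in the allow-list
  'http://localhost/x',        // no dot in the hostname
  'https://exa<mple.com/x',    // invalid host character, URL throws
]);

console.log(cleaned); // [ 'https://example.com/a' ]
```

Because the regex is intentionally permissive, expect this step to reject a fair share of candidates; tune the allow-list and heuristics to your project's risk tolerance.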

📝 Summary

Extracting semi-structured string data just might be the most satisfying part of regex mastery.

Here’s a recap of the key takeaways:

By following these steps, you can effectively extract any semi-structured string data, setting the foundation for further processing and validation.

📚 Further Learning
