Python RegEx

Learn how to use regular expressions in Python for pattern matching and text processing.

RegEx in Python

A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.

RegEx can be used to check if a string contains the specified search pattern.

RegEx Module

Python has a built-in package called re, which can be used to work with Regular Expressions.

Example - Import the re module:

import re

Using RegEx in Python

When you have imported the re module, you can start using regular expressions:

Example - Search the string to see if it starts with "The" and ends with "Spain":

import re

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)

if x:
    print("YES! We have a match!")
else:
    print("No match")

RegEx Functions

The re module offers a set of functions that allow us to search a string for a match:

  • findall: returns a list containing all matches
  • search: returns a Match object if there is a match anywhere in the string
  • split: returns a list where the string has been split at each match
  • sub: replaces one or many matches with a string

Metacharacters

Metacharacters are characters with a special meaning:

  • [] : a set of characters. Example: "[a-m]"
  • \ : signals a special sequence (can also be used to escape special characters). Example: r"\d"
  • . : any character (except the newline character). Example: "he..o"
  • ^ : starts with. Example: "^hello"
  • $ : ends with. Example: "planet$"
  • * : zero or more occurrences. Example: "he.*o"
  • + : one or more occurrences. Example: "he.+o"
  • ? : zero or one occurrence. Example: "he.?o"
  • {} : exactly the specified number of occurrences. Example: "he.{2}o"
  • | : either or. Example: "falls|stays"
  • () : capture and group. Example: "(he|she)"

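As a minimal sketch of the metacharacters listed above, the following applies several of them to a sample sentence. The sample string and the variable names are illustrative assumptions, not part of the original examples.

```python
import re

txt = "hello planet, the rain falls or stays"

# [] : any single character from the set a-m
set_matches = re.findall(r"[a-m]", "hello")

# .  : any character except newline (here: exactly two between "he" and "o")
dot_match = re.search(r"he..o", txt)

# ^ and $ : anchor the pattern to the start and the end of the string
starts = re.search(r"^hello", txt)
ends = re.search(r"stays$", txt)

# {} : exactly two occurrences of the preceding item
exact = re.search(r"he.{2}o", txt)

# | and () : alternation inside a capturing group
either = re.findall(r"(falls|stays)", txt)

print(set_matches)         # ['h', 'e', 'l', 'l']
print(dot_match.group())   # hello
print(bool(starts), bool(ends))
print(exact.group())       # hello
print(either)              # ['falls', 'stays']
```
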
Special Sequences

A special sequence is a \ followed by one of the characters in the list below, and has a special meaning:

  • \A : returns a match if the specified characters are at the beginning of the string. Example: r"\AThe"
  • \b : returns a match where the specified characters are at the beginning or at the end of a word. Examples: r"\bain", r"ain\b"
  • \B : returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word. Examples: r"\Bain", r"ain\B"
  • \d : returns a match for any digit (0-9). Example: r"\d"
  • \D : returns a match for any character that is NOT a digit. Example: r"\D"
  • \s : returns a match for any white-space character. Example: r"\s"
  • \S : returns a match for any character that is NOT a white-space character. Example: r"\S"
  • \w : returns a match for any word character (letters from a to z and A to Z, digits from 0-9, and the underscore _ character). Example: r"\w"
  • \W : returns a match for any character that is NOT a word character. Example: r"\W"
  • \Z : returns a match if the specified characters are at the end of the string. Example: r"Spain\Z"

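The special sequences above can be tried out with a short sketch like the one below. The sample sentence and the variable names are assumptions chosen for illustration. Raw strings (the r prefix) are used so that Python does not interpret the backslashes itself.

```python
import re

txt = "The rain in Spain costs 42 euros"

word_start = re.findall(r"\bain", txt)   # "ain" at the start of a word
word_end = re.findall(r"ain\b", txt)     # "ain" at the end of a word
digits = re.findall(r"\d", txt)          # every digit character
spaces = re.findall(r"\s", txt)          # every white-space character
at_start = re.search(r"\AThe", txt)      # "The" at the very beginning
at_end = re.search(r"euros\Z", txt)      # "euros" at the very end

print(word_start)    # []
print(word_end)      # ['ain', 'ain']
print(digits)        # ['4', '2']
print(len(spaces))   # 6
print(bool(at_start), bool(at_end))
```
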
Sets

A set is a set of characters inside a pair of square brackets [] with a special meaning:

  • [arn] : returns a match where one of the specified characters (a, r, or n) is present
  • [a-n] : returns a match for any lower case character, alphabetically between a and n
  • [^arn] : returns a match for any character EXCEPT a, r, and n
  • [0123] : returns a match where any of the specified digits (0, 1, 2, or 3) is present
  • [0-9] : returns a match for any digit between 0 and 9
  • [0-5][0-9] : returns a match for any two-digit number from 00 to 59
  • [a-zA-Z] : returns a match for any character alphabetically between a and z, lower case OR upper case
  • [+] : in sets, +, *, ., |, (), $, and {} have no special meaning, so [+] means: return a match for any + character in the string
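
A small sketch of how the sets above behave in practice. The sample strings and variable names are assumptions made for this illustration.

```python
import re

txt = "Call at 08:45 or 23:59, cost 3+4"

in_set = re.findall(r"[arn]", "rain")        # only a, r, or n
not_in_set = re.findall(r"[^arn]", "rain")   # everything except a, r, n
two_digits = re.findall(r"[0-5][0-9]", txt)  # two-digit numbers 00 to 59
plus_signs = re.findall(r"[+]", txt)         # + is literal inside a set

print(in_set)       # ['r', 'a', 'n']
print(not_in_set)   # ['i']
print(two_digits)   # ['08', '45', '23', '59']
print(plus_signs)   # ['+']
```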

The findall() Function

The findall() function returns a list containing all matches.

Example - Print a list of all matches:

import re

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

The list contains the matches in the order they are found.

If no matches are found, an empty list is returned:

Example - Return an empty list if no match was found:

import re

txt = "The rain in Spain"
x = re.findall("Portugal", txt)
print(x)
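
One behaviour worth knowing is sketched below: when the pattern contains a single capturing group, findall() returns the text captured by that group rather than the whole match. The patterns and variable names are illustrative assumptions and are not taken from the examples above.

```python
import re

txt = "The rain in Spain"

# No group: each list item is the whole match
whole = re.findall(r"\w+ain", txt)

# One capturing group: each list item is only the captured part
prefixes = re.findall(r"(\w+)ain", txt)

print(whole)      # ['rain', 'Spain']
print(prefixes)   # ['r', 'Sp']
```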

The search() Function

The search() function searches the string for a match, and returns a Match object if there is a match.

If there is more than one match, only the first occurrence of the match will be returned:

Example - Search for the first white-space character in the string:

import re

txt = "The rain in Spain"
x = re.search(r"\s", txt)

print("The first white-space character is located in position:", x.start())

If no matches are found, the value None is returned:

Example - Make a search that returns no match:

import re

txt = "The rain in Spain"
x = re.search("Portugal", txt)
print(x)
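
Because search() returns None when nothing matches, calling a method such as .start() directly on the result would raise an AttributeError in that case. The sketch below shows one possible guard. The helper name first_position and the choice of -1 as the "not found" value are assumptions made for this example.

```python
import re

def first_position(pattern, text):
    """Return the start index of the first match, or -1 if there is none."""
    match = re.search(pattern, text)
    if match is None:
        return -1          # hypothetical "not found" marker
    return match.start()

txt = "The rain in Spain"
print(first_position(r"\s", txt))         # 3
print(first_position(r"Portugal", txt))   # -1
```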

The split() Function

The split() function returns a list where the string has been split at each match:

Example - Split at each white-space character:

import re

txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

You can limit the number of splits by specifying the maxsplit parameter:

Example - Split the string only at the first occurrence:

import re

txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)
print(x)

The sub() Function

The sub() function replaces the matches with the text of your choice:

Example - Replace every white-space character with the number 9:

import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

You can control the number of replacements by specifying the count parameter:

Example - Replace the first 2 occurrences:

import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)
print(x)

Match Object

A Match Object is an object containing information about the search and the result.

Note: If there is no match, the value None will be returned, instead of the Match Object.

Example - Do a search that will return a Match Object:

import re

txt = "The rain in Spain"
x = re.search("ai", txt)
print(x)  # this will print a Match object

The Match object has properties and methods used to retrieve information about the search, and the result:

  • .span() returns a tuple containing the start and end positions of the match
  • .string returns the string passed into the function
  • .group() returns the part of the string where there was a match

Example - Print the position (start- and end-position) of the first match occurrence:

import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.span())

Example - Print the string passed into the function:

import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.string)

Example - Print the part of the string where there was a match:

import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.group())
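
When the pattern contains capturing groups, the same Match object methods also accept a group number. The following is a minimal sketch of that; the pattern with two groups is an assumption made for illustration.

```python
import re

txt = "The rain in Spain"
match = re.search(r"(\w+) in (\w+)", txt)

if match:
    whole = match.group()        # the entire match
    first = match.group(1)       # text captured by the first group
    second = match.group(2)      # text captured by the second group
    all_groups = match.groups()  # all captured groups as a tuple
    second_span = match.span(2)  # start and end position of group 2

    print(whole)         # rain in Spain
    print(first)         # rain
    print(second)        # Spain
    print(all_groups)    # ('rain', 'Spain')
    print(second_span)   # (12, 17)
```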

Practical RegEx Examples

Email Validation

import re

def validate_email(email):
    """Validate email address using regex."""
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None

# Test emails
emails = [
    "user@example.com",
    "test.email+tag@domain.co.uk",
    "invalid.email",
    "user@",
    "@domain.com",
    "valid_email@test-domain.org"
]

for email in emails:
    if validate_email(email):
        print(f"✓ {email} is valid")
    else:
        print(f"✗ {email} is invalid")

Phone Number Extraction

import re

def extract_phone_numbers(text):
    """Extract phone numbers from text."""
    # Pattern for various phone number formats
    patterns = [
        r'\b\d{3}-\d{3}-\d{4}\b',           # 123-456-7890
        r'\(\d{3}\)\s*\d{3}-\d{4}\b',       # (123) 456-7890 (no leading \b: "(" is not a word character)
        r'\b\d{3}\.\d{3}\.\d{4}\b',         # 123.456.7890
        r'\b\d{10}\b',                      # 1234567890
        r'\+1\s*\d{3}\s*\d{3}\s*\d{4}\b'   # +1 123 456 7890
    ]
    
    phone_numbers = []
    for pattern in patterns:
        matches = re.findall(pattern, text)
        phone_numbers.extend(matches)
    
    return phone_numbers

text = """
Contact us at 123-456-7890 or (555) 123-4567.
You can also reach us at 555.987.6543 or 9876543210.
International: +1 800 555 0199
"""

phones = extract_phone_numbers(text)
print("Found phone numbers:")
for phone in phones:
    print(f"- {phone}")

Password Strength Checker

import re

def check_password_strength(password):
    """Check password strength using regex."""
    criteria = {
        'length': len(password) >= 8,
        'lowercase': bool(re.search(r'[a-z]', password)),
        'uppercase': bool(re.search(r'[A-Z]', password)),
        'digit': bool(re.search(r'\d', password)),
        'special': bool(re.search(r'[!@#$%^&*(),.?":{}|<>]', password))
    }
    
    score = sum(criteria.values())
    
    if score == 5:
        strength = "Very Strong"
    elif score == 4:
        strength = "Strong"
    elif score == 3:
        strength = "Medium"
    elif score == 2:
        strength = "Weak"
    else:
        strength = "Very Weak"
    
    return strength, criteria

# Test passwords
passwords = [
    "password",
    "Password123",
    "P@ssw0rd!",
    "MySecureP@ssw0rd2023",
    "12345678"
]

for pwd in passwords:
    strength, criteria = check_password_strength(pwd)
    print(f"\nPassword: {pwd}")
    print(f"Strength: {strength}")
    print("Criteria met:")
    for criterion, met in criteria.items():
        status = "✓" if met else "✗"
        print(f"  {status} {criterion}")

URL Extraction and Validation

import re

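def extract_urls(text):
    """Extract URLs from text."""
    # Scheme, host, optional :port, then optional path, query string, and fragment.
    # The path set includes "-" so that paths like /python-tutorial are not cut short.
    url_pattern = r'https?://[-\w.]+(?::\d+)?(?:/[\w/_.-]*(?:\?[\w&=%.]*)?(?:#\w*)?)?'
    return re.findall(url_pattern, text)

def validate_url(url):
    """Validate URL format."""
    # Same pattern as in extract_urls, anchored to the whole string
    pattern = r'^https?://[-\w.]+(?::\d+)?(?:/[\w/_.-]*(?:\?[\w&=%.]*)?(?:#\w*)?)?$'
    return re.match(pattern, url) is not None

def extract_domain(url):
    """Extract domain from URL."""
    # Stop at "/" or ":" so that a port number is not returned as part of the domain
    pattern = r'https?://(?:www\.)?([^/:]+)'
    match = re.search(pattern, url)
    return match.group(1) if match else None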

text = """
Visit our website at https://www.example.com or check out
http://blog.example.com/posts/2023/python-tutorial?ref=homepage#section1
Also see: https://api.service.com:8080/v1/data
Invalid: htp://broken-url.com
"""

print("Extracted URLs:")
urls = extract_urls(text)
for url in urls:
    print(f"- {url}")
    print(f"  Valid: {validate_url(url)}")
    print(f"  Domain: {extract_domain(url)}")
    print()

Log File Parser

import re

def parse_log_entry(log_line):
    """Parse a log file entry."""
    # Common log format: IP - - [timestamp] "method path protocol" status size
    pattern = r'(\d+\.\d+\.\d+\.\d+) - - \[(.*?)\] "(.*?)" (\d+) (\d+|-)'
    
    match = re.match(pattern, log_line)
    if match:
        ip, timestamp, request, status, size = match.groups()
        
        # Parse request
        request_parts = request.split()
        method = request_parts[0] if len(request_parts) > 0 else ""
        path = request_parts[1] if len(request_parts) > 1 else ""
        
        return {
            'ip': ip,
            'timestamp': timestamp,
            'method': method,
            'path': path,
            'status': int(status),
            'size': int(size) if size != '-' else 0
        }
    return None

def analyze_logs(log_lines):
    """Analyze log entries."""
    parsed_logs = []
    status_counts = {}
    ip_counts = {}
    
    for line in log_lines:
        entry = parse_log_entry(line.strip())
        if entry:
            parsed_logs.append(entry)
            
            # Count status codes
            status = entry['status']
            status_counts[status] = status_counts.get(status, 0) + 1
            
            # Count IPs
            ip = entry['ip']
            ip_counts[ip] = ip_counts.get(ip, 0) + 1
    
    return parsed_logs, status_counts, ip_counts

# Sample log data
log_data = [
    '192.168.1.1 - - [25/Dec/2023:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1234',
    '192.168.1.2 - - [25/Dec/2023:10:01:00 +0000] "POST /api/login HTTP/1.1" 401 567',
    '192.168.1.1 - - [25/Dec/2023:10:02:00 +0000] "GET /dashboard HTTP/1.1" 200 2345',
    '192.168.1.3 - - [25/Dec/2023:10:03:00 +0000] "GET /nonexistent HTTP/1.1" 404 -',
]

parsed, status_counts, ip_counts = analyze_logs(log_data)

print("Parsed log entries:")
for entry in parsed:
    print(f"  {entry['ip']} - {entry['method']} {entry['path']} - {entry['status']}")

print("\nStatus code distribution:")
for status, count in sorted(status_counts.items()):
    print(f"  {status}: {count}")

print("\nTop IPs:")
for ip, count in sorted(ip_counts.items(), key=lambda x: x[1], reverse=True):
    print(f"  {ip}: {count} requests")

RegEx Flags

Regular expression flags modify how the pattern matching works:

Example - Using regex flags:

import re

text = "Hello WORLD\nPython Programming"

# Case insensitive matching
pattern = r'hello'
print("Case sensitive:", re.findall(pattern, text))
print("Case insensitive:", re.findall(pattern, text, re.IGNORECASE))

# Multiline matching
pattern = r'^Python'
print("Without MULTILINE:", re.findall(pattern, text))
print("With MULTILINE:", re.findall(pattern, text, re.MULTILINE))

# Dot matches newline
pattern = r'WORLD.*Python'
print("Without DOTALL:", re.findall(pattern, text))
print("With DOTALL:", re.findall(pattern, text, re.DOTALL))

# Verbose mode for readable patterns
verbose_pattern = r'''
    \b              # Word boundary
    [a-zA-Z0-9._%+-]+   # Username part
    @               # @ symbol
    [a-zA-Z0-9.-]+      # Domain name
    \.              # Dot
    [a-zA-Z]{2,}        # Top-level domain
    \b              # Word boundary
'''

email_text = "Contact: user@example.com or admin@test.org"
emails = re.findall(verbose_pattern, email_text, re.VERBOSE)
print("Emails found:", emails)
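
Several flags can be active at once. The sketch below assumes a slightly extended sample text and combines re.IGNORECASE and re.MULTILINE with the bitwise OR operator; it also shows the equivalent inline form (?im) at the start of the pattern. Using re.compile() here is a choice made for this example, not something the examples above require.

```python
import re

text = "Hello WORLD\nPython Programming\npython scripting"

# Combine flags with | so that both apply to the same pattern
pattern = re.compile(r"^python", re.IGNORECASE | re.MULTILINE)
combined = pattern.findall(text)

# Equivalent inline flags: i = IGNORECASE, m = MULTILINE
inline = re.findall(r"(?im)^python", text)

print(combined)   # ['Python', 'python']
print(inline)     # ['Python', 'python']
```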