Python RegEx
Learn how to use regular expressions in Python for pattern matching and text processing.
RegEx in Python
A RegEx, or Regular Expression, is a sequence of characters that forms a search pattern.
RegEx can be used to check if a string contains the specified search pattern.
RegEx Module
Python has a built-in package called re, which can be used to work with Regular Expressions.
Example - Import the re module:
import re
RegEx in Python
When you have imported the re module, you can start using regular expressions:
Example - Search the string to see if it starts with "The" and ends with "Spain":
import re
txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)
if x:
    print("YES! We have a match!")
else:
    print("No match")
RegEx Functions
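The re module offers a set of functions for searching a string for a match:
findall - Returns a list containing all matches
search - Returns a Match object if there is a match anywhere in the string
split - Returns a list where the string has been split at each match
sub - Replaces one or more matches with a string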
Metacharacters
Metacharacters are characters with a special meaning inside a pattern, such as . ^ $ * + ? { } [ ] \ | ( ).
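As a rough illustration, the sketch below exercises a few of the most common metacharacters. The sample strings and variable names are made up for this example and are not part of the tutorial's own data.

```python
import re

txt = "The rain in Spain"

# .  matches any single character except a newline
any_char = re.findall(r"r..n", txt)
# ^  anchors the pattern to the start of the string, $ to the end
anchored = re.search(r"^The.*Spain$", txt)
# +  means one or more of the preceding item
one_or_more = re.findall(r"a+", "a aa aaa")
# ?  makes the preceding item optional
optional = re.findall(r"colou?r", "color colour")
# |  means either the pattern on the left or the one on the right
either = re.findall(r"rain|Spain", txt)
# \  removes the special meaning of the next character
literal_dot = re.findall(r"\.", "a.b.c")

print(any_char)              # ['rain']
print(anchored is not None)  # True
print(one_or_more)           # ['a', 'aa', 'aaa']
print(optional)              # ['color', 'colour']
print(either)                # ['rain', 'Spain']
print(literal_dot)           # ['.', '.']
```

Writing patterns as raw strings (the r prefix) keeps Python from interpreting the backslashes before the re module sees them.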
Special Sequences
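A special sequence is a \ followed by a character such as A, b, B, d, D, s, S, w, W, or Z, and has a special meaning. For example, \d matches a digit, \s matches a white-space character, and \w matches a word character.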
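The following sketch shows how a handful of special sequences behave. The sample sentence and variable names are assumptions chosen for illustration only.

```python
import re

txt = "The rain in Spain fell on 3 days"

# \d matches a digit
digits = re.findall(r"\d", txt)
# \w matches a word character (letter, digit, or underscore)
words = re.findall(r"\w+", txt)
# \s matches a white-space character
spaces = re.findall(r"\s", txt)
# \b matches the empty string at the start or end of a word
starts_with_s = re.findall(r"\bS\w+", txt)
ends_with_ain = re.findall(r"ain\b", txt)
# \A matches only at the very start of the string
at_start = re.search(r"\AThe", txt)

print(digits)                # ['3']
print(len(words))            # 8
print(len(spaces))           # 7
print(starts_with_s)         # ['Spain']
print(ends_with_ain)         # ['ain', 'ain']
print(at_start is not None)  # True
```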
Sets
A set is a group of characters inside a pair of square brackets [] that matches any one of them. For example, [arn] matches a, r, or n, and [a-n] matches any lowercase letter from a to n.
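A short sketch of how sets behave in practice follows. The sample strings and variable names are invented for this example.

```python
import re

txt = "Spain 2024"

# [arn] matches any one of the listed characters
listed = re.findall(r"[arn]", txt)
# [a-n] matches any lowercase character from a to n
ranged = re.findall(r"[a-n]", txt)
# [^a-z ] matches any character that is NOT a lowercase letter or a space
negated = re.findall(r"[^a-z ]", txt)
# [0-9]+ matches one or more digits in a row
number = re.findall(r"[0-9]+", txt)
# Inside a set, + has no special meaning and matches a literal plus sign
plus_sign = re.findall(r"[+]", "1+1=2")

print(listed)     # ['a', 'n']
print(ranged)     # ['a', 'i', 'n']
print(negated)    # ['S', '2', '0', '2', '4']
print(number)     # ['2024']
print(plus_sign)  # ['+']
```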
The findall() Function
The findall() function returns a list containing all matches.
Example - Print a list of all matches:
import re
txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)
The list contains the matches in the order they are found.
If no matches are found, an empty list is returned:
Example - Return an empty list if no match was found:
import re
txt = "The rain in Spain"
x = re.findall("Portugal", txt)
print(x)
The search() Function
The search() function searches the string for a match, and returns a Match object if there is a match.
If there is more than one match, only the first occurrence of the match will be returned:
Example - Search for the first white-space character in the string:
import re
txt = "The rain in Spain"
x = re.search(r"\s", txt)
print("The first white-space character is located in position:", x.start())
If no matches are found, the value None is returned:
Example - Make a search that returns no match:
import re
txt = "The rain in Spain"
x = re.search("Portugal", txt)
print(x)
The split() Function
The split() function returns a list where the string has been split at each match:
Example - Split at each white-space character:
import re
txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)
You can control the number of occurrences by specifying the maxsplit parameter:
Example - Split the string only at the first occurrence:
import re
txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=1)
print(x)
The sub() Function
The sub() function replaces the matches with the text of your choice:
Example - Replace every white-space character with the number 9:
import re
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)
You can control the number of replacements by specifying the count parameter:
Example - Replace the first 2 occurrences:
import re
txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)
print(x)
Match Object
A Match Object is an object containing information about the search and the result.
Note: If there is no match, the value None will be returned, instead of the Match Object.
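Because a failed search yields None, calling a method such as .start() on the result raises an AttributeError. A minimal sketch of guarding against that follows; the helper name first_whitespace_position is hypothetical and exists only for this illustration.

```python
import re

def first_whitespace_position(text):
    """Return the index of the first white-space character, or None."""
    match = re.search(r"\s", text)
    if match is None:
        # No match: there is no Match object to query
        return None
    return match.start()

print(first_whitespace_position("The rain in Spain"))  # 3
print(first_whitespace_position("Spain"))              # None
```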
Example - Do a search that will return a Match Object:
import re
txt = "The rain in Spain"
x = re.search("ai", txt)
print(x)  # this will print a Match object
The Match object has properties and methods used to retrieve information about the search and the result:
.span() returns a tuple containing the start and end positions of the match
.string returns the string passed into the function
.group() returns the part of the string where there was a match
Example - Print the position (start- and end-position) of the first match occurrence:
import re
txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.span())
Example - Print the string passed into the function:
import re
txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.string)
Example - Print the part of the string where there was a match:
import re
txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.group())
Practical RegEx Examples
Email Validation
import re
def validate_email(email):
    """Validate email address using regex."""
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None

# Test emails
emails = [
    "user@example.com",
    "test.email+tag@domain.co.uk",
    "invalid.email",
    "user@",
    "@domain.com",
    "valid_email@test-domain.org"
]
for email in emails:
    if validate_email(email):
        print(f"✓ {email} is valid")
    else:
        print(f"✗ {email} is invalid")
Phone Number Extraction
import re
def extract_phone_numbers(text):
    """Extract phone numbers from text."""
    # Patterns for various phone number formats
    patterns = [
        r'\b\d{3}-\d{3}-\d{4}\b',           # 123-456-7890
        # No leading \b here: "(" is not a word character, so \b would
        # prevent a match when the number follows a space.
        r'\(\d{3}\)\s*\d{3}-\d{4}\b',       # (123) 456-7890
        r'\b\d{3}\.\d{3}\.\d{4}\b',         # 123.456.7890
        r'\b\d{10}\b',                      # 1234567890
        r'\+1\s*\d{3}\s*\d{3}\s*\d{4}\b'    # +1 123 456 7890
    ]
    phone_numbers = []
    # Results are grouped by pattern, not by their position in the text
    for pattern in patterns:
        matches = re.findall(pattern, text)
        phone_numbers.extend(matches)
    return phone_numbers

text = """
Contact us at 123-456-7890 or (555) 123-4567.
You can also reach us at 555.987.6543 or 9876543210.
International: +1 800 555 0199
"""
phones = extract_phone_numbers(text)
print("Found phone numbers:")
for phone in phones:
    print(f"- {phone}")
Password Strength Checker
import re
def check_password_strength(password):
    """Check password strength using regex."""
    criteria = {
        'length': len(password) >= 8,
        'lowercase': bool(re.search(r'[a-z]', password)),
        'uppercase': bool(re.search(r'[A-Z]', password)),
        'digit': bool(re.search(r'\d', password)),
        'special': bool(re.search(r'[!@#$%^&*(),.?":{}|<>]', password))
    }
    # Each satisfied criterion adds one point to the score
    score = sum(criteria.values())
    if score == 5:
        strength = "Very Strong"
    elif score == 4:
        strength = "Strong"
    elif score == 3:
        strength = "Medium"
    elif score == 2:
        strength = "Weak"
    else:
        strength = "Very Weak"
    return strength, criteria

# Test passwords
passwords = [
    "password",
    "Password123",
    "P@ssw0rd!",
    "MySecureP@ssw0rd2023",
    "12345678"
]
for pwd in passwords:
    strength, criteria = check_password_strength(pwd)
    print(f"\nPassword: {pwd}")
    print(f"Strength: {strength}")
    print("Criteria met:")
    for criterion, met in criteria.items():
        status = "✓" if met else "✗"
        print(f" {status} {criterion}")
URL Extraction and Validation
import re
# Scheme, host, optional port, optional path, query string, and fragment.
# Hyphens are allowed in the path and query so that URLs such as
# .../python-tutorial are not cut off at the hyphen.
URL_PATTERN = r'https?://[-\w.]+(?::\d+)?(?:/[-\w/.]*(?:\?[-\w&=%.]*)?(?:#\w*)?)?'

def extract_urls(text):
    """Extract URLs from text."""
    return re.findall(URL_PATTERN, text)

def validate_url(url):
    """Validate URL format."""
    return re.fullmatch(URL_PATTERN, url) is not None

def extract_domain(url):
    """Extract domain from URL (without a leading www. or a port)."""
    pattern = r'https?://(?:www\.)?([^/:?#]+)'
    match = re.search(pattern, url)
    return match.group(1) if match else None

text = """
Visit our website at https://www.example.com or check out
http://blog.example.com/posts/2023/python-tutorial?ref=homepage#section1
Also see: https://api.service.com:8080/v1/data
Invalid: htp://broken-url.com
"""
print("Extracted URLs:")
urls = extract_urls(text)
for url in urls:
    print(f"- {url}")
    print(f" Valid: {validate_url(url)}")
    print(f" Domain: {extract_domain(url)}")
    print()
Log File Parser
import re
def parse_log_entry(log_line):
    """Parse a log file entry."""
    # Common log format: IP - - [timestamp] "method path protocol" status size
    pattern = r'(\d+\.\d+\.\d+\.\d+) - - \[(.*?)\] "(.*?)" (\d+) (\d+|-)'
    match = re.match(pattern, log_line)
    if match:
        ip, timestamp, request, status, size = match.groups()
        # Parse request
        request_parts = request.split()
        method = request_parts[0] if len(request_parts) > 0 else ""
        path = request_parts[1] if len(request_parts) > 1 else ""
        return {
            'ip': ip,
            'timestamp': timestamp,
            'method': method,
            'path': path,
            'status': int(status),
            'size': int(size) if size != '-' else 0
        }
    return None

def analyze_logs(log_lines):
    """Analyze log entries."""
    parsed_logs = []
    status_counts = {}
    ip_counts = {}
    for line in log_lines:
        entry = parse_log_entry(line.strip())
        if entry:
            parsed_logs.append(entry)
            # Count status codes
            status = entry['status']
            status_counts[status] = status_counts.get(status, 0) + 1
            # Count IPs
            ip = entry['ip']
            ip_counts[ip] = ip_counts.get(ip, 0) + 1
    return parsed_logs, status_counts, ip_counts

# Sample log data
log_data = [
    '192.168.1.1 - - [25/Dec/2023:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1234',
    '192.168.1.2 - - [25/Dec/2023:10:01:00 +0000] "POST /api/login HTTP/1.1" 401 567',
    '192.168.1.1 - - [25/Dec/2023:10:02:00 +0000] "GET /dashboard HTTP/1.1" 200 2345',
    '192.168.1.3 - - [25/Dec/2023:10:03:00 +0000] "GET /nonexistent HTTP/1.1" 404 -',
]
parsed, status_counts, ip_counts = analyze_logs(log_data)
print("Parsed log entries:")
for entry in parsed:
    print(f" {entry['ip']} - {entry['method']} {entry['path']} - {entry['status']}")
print("\nStatus code distribution:")
for status, count in sorted(status_counts.items()):
    print(f" {status}: {count}")
print("\nTop IPs:")
for ip, count in sorted(ip_counts.items(), key=lambda x: x[1], reverse=True):
    print(f" {ip}: {count} requests")
RegEx Flags
Regular expression flags modify how the pattern matching works:
Example - Using regex flags:
import re
text = "Hello WORLD\nPython Programming"
# Case insensitive matching
pattern = r'hello'
print("Case sensitive:", re.findall(pattern, text))
print("Case insensitive:", re.findall(pattern, text, re.IGNORECASE))
# Multiline matching
pattern = r'^Python'
print("Without MULTILINE:", re.findall(pattern, text))
print("With MULTILINE:", re.findall(pattern, text, re.MULTILINE))
# Dot matches newline
pattern = r'WORLD.*Python'
print("Without DOTALL:", re.findall(pattern, text))
print("With DOTALL:", re.findall(pattern, text, re.DOTALL))
# Verbose mode for readable patterns
verbose_pattern = r'''
    \b                  # Word boundary
    [a-zA-Z0-9._%+-]+   # Username part
    @                   # @ symbol
    [a-zA-Z0-9.-]+      # Domain name
    \.                  # Dot
    [a-zA-Z]{2,}        # Top-level domain
    \b                  # Word boundary
'''
email_text = "Contact: user@example.com or admin@test.org"
emails = re.findall(verbose_pattern, email_text, re.VERBOSE)
print("Emails found:", emails)
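Flags can also be combined, written inline, or attached to a compiled pattern. The sketch below reuses the sample text from the example above; the variable names are chosen for illustration.

```python
import re

text = "Hello WORLD\nPython Programming"

# Combine several flags with the bitwise OR operator
combined = re.findall(r"^python", text, re.IGNORECASE | re.MULTILINE)

# An inline flag at the start of the pattern has the same effect as
# passing re.IGNORECASE
inline = re.findall(r"(?i)hello", text)

# A compiled pattern carries its flags with it and can be reused
line_start_word = re.compile(r"^\w+", re.MULTILINE)
first_words = line_start_word.findall(text)

print(combined)     # ['Python']
print(inline)       # ['Hello']
print(first_words)  # ['Hello', 'Python']
```

The short aliases re.I, re.M, re.S, and re.X refer to the same flags as re.IGNORECASE, re.MULTILINE, re.DOTALL, and re.VERBOSE.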