Regular expressions have a reputation for looking like an unintelligible mishmash of random symbols. While it’s true that most regular expressions tend toward this form, they don’t have to: it’s possible to write an elegant regular expression.
Background
If you’re already familiar with the basics of regular expressions, you can skip this section.
In their most basic form, regular expressions, or regexes, provide a simple way to explain to a computer how to process a piece of text. They don’t do anything too fancy on their own, but when combined with a good text editor, they can be used to extract or transform text in bulk, easily saving hours of manual editing.
If you’ve heard of wildcards, regexes are similar: they’re glorified wildcards. They can be used for find-and-replace operations in text editors, for stream-based processing and filtering of large files, and within programming languages for all sorts of text manipulation and validation.
Regexes define patterns. For example, take the following regex:
`cat`

This is a simple regex with nothing fancy in it. It will match the literal text `cat`. Nothing too complicated so far.
Let’s say I want to replace all occurrences of `cat` with `dog` in a document. Using my text editor of choice, I could perform a find-and-replace operation with a regex of `cat` and a replacement value of `dog`. That’s not particularly interesting; I could’ve done that without regexes.
Now, let’s say I have a massive text document containing phone numbers for a list of professionals. Each line follows a standard format; here’s an example:
Dr. Jane Doe: +15550001337
The phone numbers are all in E.123 format, a lovely, unambiguous format for writing phone numbers that nobody ever uses because it’s a pain to read. You’ve been tasked with making the document a bit more friendly. Management has requested that you format all 7,326 phone numbers in the document as `(###) ###-####`.
You could do it manually, but that sounds horribly tedious. So you toggle regex mode and instruct your editor to replace all occurrences of `\+1(\d{3})(\d{3})(\d{4})$` with `(\1) \2-\3`. Problem solved, off to lunch.
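The same substitution can be sketched outside an editor, too; here’s a minimal Python version using the sample line from above:

```python
import re

line = "Dr. Jane Doe: +15550001337"

# \+1 matches the literal "+1"; each (\d{...}) captures a run of digits,
# and \1 through \3 splice those captures into the replacement.
formatted = re.sub(r"\+1(\d{3})(\d{3})(\d{4})$", r"(\1) \2-\3", line)
print(formatted)  # Dr. Jane Doe: (555) 000-1337
```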
In a regex, most characters simply represent themselves. We saw that with `cat`: `c` meant `c`, `a` meant `a`, `t` meant `t`. That doesn’t seem to be the case with our new regex, though. There are a lot of odd symbols, and aside from the `+1` bit, none of them are actually part of the text we’re replacing.
Before we go into more detail, it’s important to note that there are many different regex “flavors”; much like programming languages, regex languages can vary widely. This blog uses Perl Compatible Regular Expressions (PCRE). The basic examples here will work just fine in most other regex engines, but you may encounter some quirks, and the more advanced examples may require modification.
Let’s start with a simple wildcard character: `.`. In regexes, periods don’t represent periods; instead, they act as placeholders for any single character except a line break. A regex of `.` will match every non-line-break character in a document, in the same way `a` would match every character in a document containing only the text `aaaaaaaaaaaaaaaa`. If I were to replace `.` with `m` in a document containing the text `cat`, I would end up with a document containing `mmm`.
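Sketched with Python’s `re` module standing in for the editor:

```python
import re

# Every character of "cat" matches ".", so every character is replaced.
result = re.sub(r".", "m", "cat")
print(result)  # mmm
```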
What if we want to match an arbitrary number of characters? That’s where `*` and `+` come in. `*` repeats the preceding character zero or more times, while `+` repeats it one or more times. Both are greedy: they will favor matching more characters whenever possible. For example, `mo*` would match `m`, `mo`, and `moo`. `mo+` would match `mo` and `moo`, but not `m`. In the text `mm`, `mo*` would match twice, once for each `m`, but `mo+` wouldn’t match at all.
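A quick way to see the difference, sketched in Python (`re.findall` returns every non-overlapping match):

```python
import re

# "mo*" happily matches a bare "m" (zero o's), so it fires on each m in "mm".
zero_or_more = re.findall(r"mo*", "moo mm")
# "mo+" requires at least one "o", so bare m's don't match.
one_or_more = re.findall(r"mo+", "moo mm")

print(zero_or_more)  # ['moo', 'm', 'm']
print(one_or_more)   # ['moo']
```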
We can combine these special symbols. `.*waffle.*` would match any line containing `waffle`. Unlike a plain regex of `waffle`, this would match the whole line, from start to end. It wouldn’t extend beyond the line, though, since `.` doesn’t match line breaks by default.
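Sketched in Python (the sample text is my own; `re.search` returns the first match):

```python
import re

text = "plain toast\nblueberry waffle special\nfruit salad"

# "." won't cross the \n, so the match is confined to the waffle line.
match = re.search(r".*waffle.*", text)
print(match.group())  # blueberry waffle special
```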
There are many other special symbols, but there are already a number of great, non-technical resources for learning basic regex syntax, such as Regular-Expressions.info. If you’d prefer a more programmer-oriented discussion, PHP has thorough yet concise documentation for PCRE syntax.
Taking regexes to the next level
My previous blog post presented a solution to a log analysis challenge that relied heavily on regexes in Python. The relevant code was reasonably elegant:
entry_regex = re.compile(r"""
^ (?P<ip> [0-9a-f.:]+ ) # Support both IPv4 and IPv6.
[ ] (?P<identd> - ) # This is a historical field that Nginx doesn't support; it will always be a single hyphen.
[ ] (?P<user> .*? ) # The username could contain anything, and, unfortunately, it's not quoted. We can only have one "anything goes" field like this.
[ ] \[ (?P<timestamp> [0-9a-z :/_+-]{26} ) \] # We don't need to be too strict about the date; we parse it to a proper date later.
[ ] " (?P<request> [^"]* ) " # Not all web servers escape log fields, but Nginx does. Quotes and control codes will never appear.
[ ] (?P<status> \d{3} ) # Should always be exactly 3 digits.
[ ] (?P<bytes_sent> \d+ | - ) # This field can be a hyphen instead of a number when no response is sent.
[ ] " (?P<referrer> [^"]* ) " # The header is "Referer", but that's a misspelling, and newer standards spell it correctly.
[ ] " (?P<user_agent> [^"]* ) " # This can be just about anything; fortunately, it's quoted and escaped.
$
""", re.VERBOSE | re.IGNORECASE)
request_regex = re.compile(r"""
^ (?P<method> [\w.:-]+ ) # GET
[ ] (?P<uri> .+ ) # /index.html
[ ] (?P<protocol> [a-z]+ (?:/[0-9.]+)? ) # HTTP, HTTP/1.1; some clients attempt without the version specifier
$
""", re.VERBOSE | re.IGNORECASE)
This probably isn’t what you would expect regular expressions to look like. They make judicious use of white space, are formatted nicely, and contain comments. With Visual Studio Code’s syntax highlighting, they look quite nice. The secret here is the `re.VERBOSE` flag. Many regex engines offer something similar; PCRE calls it extended mode. With such a flag enabled, white space is ignored unless escaped or enclosed in a character class, and comments are permitted with `#`. If that flag is removed, the regexes are forced to be far more dense:
entry_regex = re.compile(r"""^(?P<ip>[0-9a-f.:]+) (?P<identd>-) (?P<user>.*?) \[(?P<timestamp>[0-9a-z :/_+-]{26})\] "(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes_sent>\d+|-) "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"$""", re.IGNORECASE)
request_regex = re.compile(r"""^(?P<method>[\w.:-]+) (?P<uri>.+) (?P<protocol>[a-z]+(?:/[0-9.]+)?)$""", re.IGNORECASE)
I opted to use named captures in my initial code, but that’s uncommon. When we remove the named captures:
entry_regex = re.compile(r"""^([0-9a-f.:]+) (-) (.*?) \[([0-9a-z :/_+-]{26})\] "([^"]*)" (\d{3}) (\d+|-) "([^"]*)" "([^"]*)"$""", re.IGNORECASE)
request_regex = re.compile(r"""^([\w.:-]+) (.+) ([a-z]+(?:/[0-9.]+)?)$""", re.IGNORECASE)
That’s quite a mess.
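One thing worth keeping even in the dense form is the named captures: they pay off at the call site, where `m.group("status")` says far more than `m.group(6)`. A minimal sketch against a made-up log line in the same format:

```python
import re

# Hypothetical sample line in the format the entry regex expects:
line = ('203.0.113.7 - - [12/Feb/2018:13:37:00 +0000] '
        '"GET /index.html HTTP/1.1" 200 512 "-" "curl/7.58.0"')

entry_regex = re.compile(
    r'^(?P<ip>[0-9a-f.:]+) (?P<identd>-) (?P<user>.*?) '
    r'\[(?P<timestamp>[0-9a-z :/_+-]{26})\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes_sent>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"$',
    re.IGNORECASE,
)

m = entry_regex.match(line)
print(m.group("status"), m.group("bytes_sent"))  # 200 512
```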
We might be able to do even better than the original Python example by switching to PCRE with the `x` flag, though. While we’re at it, let’s try to improve validation.

Notice that there are a large number of captures within a `(?(DEFINE))` construct. These aren’t actually part of the pattern, but they can be referenced later, allowing us to reuse common sub-patterns.
(?(DEFINE)
(?<ipv4_octet> 2[0-4]\d | 25[0-5] | [01]?\d{1,2} )
(?<ipv4> (?:(?&ipv4_octet)\.){3}(?&ipv4_octet) )
(?<ipv6_hextet> [a-f0-9]{1,4} )
(?<ipv6_pure> (?:::)? (?:(?&ipv6_hextet)::?){0,7}(?&ipv6_hextet) (?:::)? )
(?<ipv6_ipv4> ::ffff:(?&ipv4) )
(?<ipv6> (?&ipv6_pure) | (?&ipv6_ipv4) )
(?<ip_t> (?&ipv4) | (?&ipv6) )
(?<year> \d{4} )
(?<month> Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec )
(?<day> 0[1-9] | [12]\d | 3[01] )
(?<hour> [01]\d | 2[0-3] )
(?<min> [0-5]\d )
(?<sec> [0-5]\d | 60 ) # Don't forget leap seconds! :60 does happen.
(?<tz> [+-]\d{4} )
(?<timestamp_t> (?&day)/(?&month)/(?&year):(?&hour):(?&min):(?&sec)[ ](?&tz) )
(?<empty> - )
(?<number> \d+ )
(?<optional_number> (?&number) | - )
(?<quoted_str> (?<=") [^"]* (?=") )
(?<status_t> \d{3} )
)
^ (?<ip> (?&ip_t) )
[ ] (?<identd> (?&empty) )
[ ] (?<user> .*? )
[ ] \[ (?<timestamp> (?&timestamp_t) ) \]
[ ] " (?<request>
# Ideally, we want to parse into the request line:
(?<method> [^" \\]+ )
[ ] (?<uri> [^"]+ )
[ ] (?<protocol> [^" ]+ )
# But if that fails, we can just grab the whole thing as one big string:
| (?&quoted_str)
) "
[ ] (?<status> (?&status_t) )
[ ] (?<bytes_sent> (?&optional_number) )
[ ] " (?<referrer> (?"ed_str) ) "
[ ] " (?<user_agent> (?"ed_str) ) "
$
This is verbose, and it would be easy to extend later. However, it presents a few problems:
- Regexes are meant to work with structured text, not numbers, dates, or IP addresses. For example, it’s difficult to ensure that a number is within a specific range.
- It’s possible to over-engineer a regex. There are a lot of unnecessary captures here that are only used once, and they’re not likely to ever be changed.
- As regexes get more complex, quirks can arise as a result of recursion, greediness, and precedence. For example, try defining `ipv4_octet` as `[01]?\d{1,2} | 2[0-4]\d | 25[0-5]`; it seems like it should work, but it doesn’t.
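The quirk comes from how alternation and subroutine calls interact. Alternation is ordered, and, as I understand the PCRE documentation, subroutine calls like `(?&ipv4_octet)` are atomic: once the call returns, the engine won’t re-enter it to try a different branch. Python’s `re` has no subroutines, but it can at least show the ordered-alternation half of the problem:

```python
import re

# Alternation tries branches left to right; the leftmost branch that can
# match wins, even when a later branch would match more of the input.
octet = r"[01]?\d{1,2}|2[0-4]\d|25[0-5]"

# The greedy first branch grabs "25" and the engine stops looking.
print(re.match(octet, "255").group())  # 25
```

In Python, surrounding context can force backtracking into the other branches; in a PCRE subroutine call, that “25” is final, and the enclosing pattern simply fails.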
Access logs aren’t a great fit for regexes of this verbosity. Whenever possible, IP addresses and dates shouldn’t be validated by regexes; there’s almost always a better choice. The original Python script does a better job of balancing regexes with more specialized parsers.
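For example, Python pushes this kind of validation into the standard library (the sample values here are mine):

```python
import ipaddress
from datetime import datetime

# Raises ValueError on anything that isn't a real IPv4/IPv6 address:
ip = ipaddress.ip_address("::ffff:203.0.113.9")

# Raises ValueError on impossible dates -- something no regex range
# check handles gracefully:
ts = datetime.strptime("12/Feb/2018:13:37:00 +0000", "%d/%b/%Y:%H:%M:%S %z")
print(type(ip).__name__, ts.year)
```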
There are legitimate uses for this syntax, though. When the regex is likely to change in the future, breaking the regex into smaller pieces and annotating it can transform it from a maintainability nightmare into a pleasant snippet. It’s also easier to translate to this format from notations such as BNF.
In 2018, it became public knowledge that certain combinations of characters would cause macOS and iOS devices to crash. Even after patches were released, many devices remained out-of-date and vulnerable to attacks that could masquerade as innocent text on social media. Websites and services that permitted user-generated content needed to be able to detect problematic text, and they needed to be able to adapt their methods as more information about the vulnerability became available. This was a good use case for extended PCRE with `(?(DEFINE))`:
(?#
# Adapted for PCRE by Paul Buonopane
# Based heavily on the following works:
# - https://manishearth.github.io/blog/2018/02/15/picking-apart-the-crashing-ios-string/
# - https://github.com/hackbunny/viramarama
#
# This PCRE regex will match any string that contains Indic character
# combinations known to crash many iOS applications as of 11.2.5 and certain
# macOS applications as of 10.13.3.
#
# IMPORTANT: This will only work if you use the `x` and `u` modifiers.
# Example: "/regex-goes-here/xu"
#
# COMPATIBILITY: The `u` modifier isn't technically Perl-compatible. For this
# to work without the `u` modifier, the regex would need to be
# adapted to use | instead of [], as all of these characters
# are multibyte in all popular web encodings, e.g. UTF-8.
#
# Tracked By: rdar://37458268
# https://openradar.appspot.com/37458268
# Date Originated: 2018-02-12
# CVE ID: Unknown as of writing
#)
(?(DEFINE)
(?<zwnj> (?# You can't see me, but I'm here!))
(?<devanagari_virama> ्)
(?<devanagari_consonants> [कखगघङहचछजझञयशटठडढणरषतथदधनलसपफबभमव])
(?<devanagari_sj_consonants> [र])
(?<devanagari_pj_pairs> र(?&devanagari_virama)र)
(?<devanagari_vowels> [ऺऻािीुूृॄॅॆेैॉॊोौॎॏॕॖॗॢॣ])
(?<bengali_virama> ্)
(?<bengali_consonants> [কখগঘঙচছজঝঞটঠডঢণতথদধনপফবভমযরৰলৱশষসহ])
(?<bengali_sj_consonants> [যর])
(?<bengali_pj_pairs> [রৰ](?&bengali_virama)র)
(?<bengali_vowels> [ািীুূৃৄেৈৢৣ])
(?<telugu_virama> ్)
(?<telugu_consonants> [కఖగఘఙచఛజఝఞటఠడఢణతథదధనపఫబభమయరలవళశషసహఱ])
(?<telugu_sj_consonants> (?&telugu_consonants))
(?<telugu_vowels> [ాిీుూృౄెేొోౌౢౣ])
)
(?!(?&devanagari_pj_pairs))
(?&devanagari_consonants)
(?&devanagari_virama)
(?&devanagari_sj_consonants)
(?&zwnj)
(?&devanagari_vowels)
|
(?!(?&bengali_pj_pairs))
(?&bengali_consonants)
(?&bengali_virama)
(?&bengali_sj_consonants)
(?&zwnj)
(?&bengali_vowels)
|
(?&telugu_consonants)
(?&telugu_virama)
(?&telugu_sj_consonants)
(?&zwnj)
(?&telugu_vowels)
(?#
# The license for the work on which this regex is based is provided below, with an
# additional note regarding adaptation.
#
# Copyright 2018 hackbunny <hackbunny@gmail.com>
# Adapted for PCRE by Paul Buonopane
#
# Permission is hereby granted, free of charge, to any person obtaining a copy of this
# software and associated documentation files [the "Software"], to deal in the Software
# without restriction, including without limitation the rights to use, copy, modify,
# merge, publish, distribute, sublicense, and/or sell copies of the Software, and to
# permit persons to whom the Software is furnished to do so, subject to the following
# conditions:
#
# The above copyright notice and this permission notice shall be included in all copies or
# substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,
# INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR
# PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
# LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT
# OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
# OTHER DEALINGS IN THE SOFTWARE.
#)
This was a maintainable regex. It was easy to see at a glance which combinations the regex handled, as well as the exceptions that it excluded. It handles three scenarios, each for a different script (note that “script” in this context refers to the set of characters used by a writing system, not the category of programming languages):
- Devanagari
- Bengali
- Telugu
For the first two, it excludes combinations that would otherwise match but are known not to cause crashes. The relevant groups of characters are kept at the top and helpfully labelled; no knowledge of the affected scripts is necessary to maintain the regex.