Select That Data! - Regular Expressions
I’ll start off this post by saying that I DO NOT claim to be a “RegExpert” (Regular Expression Expert). I claim to be an expert in only a couple of things in life:
- Knowing how to give good doggos belly rubs.
- Not thinking of anything in particular.
But because neither of these things pays the bills (yet), I’ve found that a little knowledge of regular expressions is a good thing to have. Better yet, having resources in your back pocket can be useful.
I’ll show a couple of small examples here, but will also give some examples of resources that can help with some of the tricky situations where regex is needed.
So regular expressions (hereafter “RE” or “regex”) are a great way to programmatically identify a pattern in text. “Programmatically” may be a little heavy though, as I’m not a software developer and usually use regex in short Python scripts, PHP, or just the Bash command line. Even with these limited uses, regex is extremely powerful, and there are lots of great resources, even for the uninitiated.
It’s important to mention that the syntax of regex is determined by the language you’re using, or the interpreter it’s being used with. For example, Python’s regex has some small differences from PCRE (Perl-Compatible Regular Expressions). PCRE is fairly common, but like I said, there are variations, so make sure you are familiar with the “flavor” of regex you are using before you start writing your expression.
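To make the “flavor” point a little more concrete, here is a minimal sketch in Python. In Python 3, a text pattern’s `\d` matches any Unicode decimal digit by default, and the `re.ASCII` flag narrows it to 0-9. Other engines may make a different default choice, so treat this as one illustration and check the docs for the flavor you are using. The sample string is made up for the demonstration.

```python
import re

# Hypothetical sample: "555" written with Arabic-Indic digits
sample = "٥٥٥"

# Default behaviour for str patterns in Python 3:
# \d matches any Unicode decimal digit
unicode_match = re.fullmatch(r"\d{3}", sample)

# With re.ASCII, \d is limited to the characters 0-9
ascii_match = re.fullmatch(r"\d{3}", sample, flags=re.ASCII)

print(unicode_match is not None)
print(ascii_match is not None)
```

The same three-character pattern gives two different answers depending on a single flag, which is exactly the kind of detail that varies between flavors.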
So here’s a small example of RE using Bash:
$ echo "John Smith: 555-555-5555" | grep -Po "\d{3}-\d{3}-\d{4}"
555-555-5555
In the above example I echoed the text “John Smith: 555-555-5555” and piped it (via “|”) to the grep utility (specifying “P” for PCRE regex, and “o” to output only the matching part of the line).
The quoted portion can be broken down like so:
- “\d” - a digit
- “{3}” - exactly 3 of these
- “-” - a literal hyphen
The pattern repeats itself until the last group, where we’re matching against 4 digits instead of the previous 3.
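The same breakdown can be sketched in Python. This version assumes the `re.VERBOSE` flag, which lets whitespace and comments sit inside the pattern so each piece can be labeled. The variable names are only for illustration.

```python
import re

# The same pattern as the grep example, spelled out with re.VERBOSE
# so each piece can carry its own comment
phone_pattern = re.compile(r"""
    \d{3}   # three digits
    -       # a literal hyphen
    \d{3}   # three more digits
    -       # another hyphen
    \d{4}   # the final four digits
""", re.VERBOSE)

text = "John Smith: 555-555-5555"
match = phone_pattern.search(text)
if match:
    print(match.group(0))
```

Run against the sample text, this prints the phone number, just like the grep one-liner.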
The cool thing about regex is that there isn’t just one way to skin this cat:
$ echo "John Smith: 555-555-5555" | grep -Po "\d.+"
555-555-5555
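This is shorter and produces the same match, right? So why would we go to the trouble of the first one? Because this is considered a “greedy” match: “.+” grabs as many characters as it can. (A “lazy” match, written “.+?”, is the opposite and grabs as few as it can.) Here’s why that matters:
$ echo "John Smith: 555-555-5555 ab" | grep -Po "\d.+"
555-555-5555 ab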
See what I mean? Referring to the regex syntax and knowing your data set here are crucial to ensuring you have the correct pattern for your match.
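Here is a small Python sketch of the difference, using the same sample line. It compares the greedy pattern with a lazy variant (`.+?`, which is an addition of mine and not one of the grep examples) and with the explicit pattern from the first example.

```python
import re

text = "John Smith: 555-555-5555 ab"

# Greedy: .+ consumes as much of the line as it can
greedy = re.search(r"\d.+", text).group(0)

# Lazy: .+? stops at the first point where the rest of the
# pattern (here, a whitespace character) can match
lazy = re.search(r"\d.+?\s", text).group(0)

# Explicit: spell out the shape of the data
strict = re.search(r"\d{3}-\d{3}-\d{4}", text).group(0)

print(repr(greedy))
print(repr(lazy))
print(repr(strict))
```

The greedy version drags the trailing “ab” along, the lazy version stops at the first space (and includes it), and the explicit version returns only the number.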
What’s the worst that can happen? On some inputs, a loosely bounded pattern can make the regex engine backtrack through an enormous number of possible matches, searching and searching without finishing in any reasonable time. This eats up resources, which in turn can lead to a type of denial of service condition (called ReDoS). More information on ReDoS can be found on the OWASP site here.
Sometimes greedy matches are the only ones that work in the situation you are in though. In that case it’s still important to know as much about the data set as possible and at least attempt to set some boundaries:
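To get a feel for how quickly this can get out of hand, here is a rough sketch in Python. The pattern `(a+)+$` is a commonly cited backtracking-prone example, not something from this post, and the input sizes are deliberately kept small so the script finishes quickly. Exact timings will vary by machine and Python version.

```python
import re
import time

# A backtracking-prone pattern: a quantifier nested inside a quantifier
risky = re.compile(r"(a+)+$")

timings = {}
for n in (10, 14, 18):
    # The trailing "b" guarantees the match fails, which forces the engine
    # to try many ways of splitting up the run of "a" characters
    subject = "a" * n + "b"
    start = time.perf_counter()
    result = risky.match(subject)
    timings[n] = time.perf_counter() - start
    print(n, result, f"{timings[n]:.4f}s")
```

Every attempt ends in no match, but the time taken tends to climb steeply as the input grows by only a few characters. Avoid trying this with much larger values of `n` unless you are happy to wait.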
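$ echo "John Smith: 555-555-5555 ab" | grep -Po "\d.+\s"
555-555-5555
The above match works because of the space after the phone number. We start matching at the first digit, and match everything between that position and the last whitespace character on the line, which here is the space right after the number (the match includes that trailing space).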
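This boundary depends on the data: if the line had more than one space after the number, the greedy pattern would run to the last one. The Python sketch below uses a made-up line to show that, alongside one possible tighter alternative (a character class that only allows digits and hyphens). The alternative is my own suggestion, not part of the grep examples.

```python
import re

# Hypothetical line with more than one space after the number
text = "John Smith: 555-555-5555 ab cd"

# Greedy .+ runs to the end of the line, then backs up
# to the last whitespace character it can find
greedy_to_space = re.search(r"\d.+\s", text).group(0)

# A character class only allows digits and hyphens,
# so the match cannot run past the number
digits_and_hyphens = re.search(r"\d[\d-]+", text).group(0)

print(repr(greedy_to_space))
print(repr(digits_and_hyphens))
```

Which boundary is right depends on what the rest of your data looks like, which is why knowing the data set matters so much.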
If scripting the above text in Python, the script could look something like the following:
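#!/usr/bin/python3
import re

# Same pattern as the grep example; the raw string keeps the backslashes intact
test = r'\d.+\s'
p = re.compile(test)

with open("test.txt", "r") as fh:
    for line in fh:
        # Drop the newline so \s cannot match it (grep also works line by line)
        for match in p.finditer(line.rstrip("\n")):
            print(match[0])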
When this script is run, the result is simply the phone number, as before. The pattern is defined in the “test” variable and is the same one used in the previous grep example. The most important part here is the “import re” line, which instructs Python to import the regex module.
Resources
Regex101 - A small space to play with values and syntax. It’s one of my go-to sites.
If you are using regex for scripting, any of the online guides for your language, such as Python’s - Make sure you’re reading the guide for the latest supported version!
Expresso - Windows-based installable tool for testing regex. I would not recommend it for large data sets, as loading them can eat up your memory and bog things down.
Regular expressions are huge time savers, especially if you are looking for that needle in a haystack of other needley-type things. They can look complicated on the surface, and can become so at times, but a basic understanding helps to demystify them and turn them into a useful tool in your arsenal.