Using Regular Expressions to Extract Useful Information in Python

Having just completed chapter 7 of Automate the Boring Stuff, I wanted to test my skills. I’m writing down some code that will carry out the following tasks.

  • Use text copied onto the clipboard
  • Go through the text and store any URLs, email addresses, phone numbers and ZIP codes.
  • Copy the information found to the clipboard, and print on the CLI as well.

The full code is available here on GitHub.

To do this, we’ll need two modules: the re module for writing and using regular expressions, and the pyperclip module for working with the clipboard. First, let’s import these modules and write down some rough formats for the things we want to find.

#! python3
import pyperclip, re
# These are our formats.
#
#   PHONE         | EMAIL           | ZIP      | URL
#   --------------|-----------------|----------|------------
#   area code*    | username        | 6 digits | protocol
#   separator*    | @               |          | server
#   1st 3 digits  | domain          |          | file name
#   separator     | .(com)          |          |
#   last 4 digits |                 |          |
#   extension*    |                 |          |
#
#   *optional

Now that we have this information, let’s make a regular expression, or regex for short, for each category.

For phone numbers, we want three optional items and three necessary items, as shown above. To make an item optional, we can group it in parentheses and follow the group with a question mark, so that the number is still read even if the contents of the group are not present. Like this: (<regex>)?. The ? operator marks a group that occurs either once or not at all. Now, let’s write our regex for phone numbers. Do note that I will use re.VERBOSE to spread the regex over multiple lines, for ease of readability.
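
Before the full pattern, here is a minimal sketch of how an optional group and re.VERBOSE behave together. The toy pattern and the name toyRegex are my own illustration, not part of the script.

```python
import re

# Hypothetical toy pattern: an optional three-digit prefix with a hyphen,
# followed by four required digits.
toyRegex = re.compile(r'''
    (\d{3}-)?   # optional group: occurs once or not at all
    (\d{4})     # required group
    ''', re.VERBOSE)

# search() reports a skipped optional group as None...
print(toyRegex.search('555-1234').groups())   # ('555-', '1234')
print(toyRegex.search('1234').groups())       # (None, '1234')

# ...while findall() reports it as an empty string.
print(toyRegex.findall('555-1234 and 9876'))  # [('555-', '1234'), ('', '9876')]
```

The empty string that findall() returns for a skipped group is what the script checks for later, when it decides whether a number has an extension.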

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?
    (\s|-|\.)?
    (\d{3})
    (\s|-|\.)
    (\d{4})
    (\s*(ext|x|ext.)\s*(\d{2,5}))?
    )''', re.VERBOSE)

Let’s go over this line by line. The first line, (\d{3}|\(\d{3}\))?, scans for any area/region/country code. We’re assuming these codes are 3 digits long. The | ensures that we can read codes whether or not they are enclosed in parentheses. Some possible codes are 761 or (342). If there is a code, there will usually be a separator between it and the rest of the number, which is why the second line is needed. We allow for a space, a hyphen or a dot. For example, these numbers (with codes) would be read: 342-345-5454, 342.345-5454, and (342) 345-5454. From then onwards, the regex is pretty simple. The third line scans three consecutive digits, which are mandatory. The fourth line repeats the second, except that here the separator is required, as the parts of a number are separated by a hyphen or a space in most cases. The fifth line scans the final four digits.

The last line could be a bit tricky. It is meant to capture any extension attached to the phone number. \s* is added as there can be an arbitrary amount of space between the number and the extension. Then ext|x|ext., where . is the wildcard character, accounts for the ways extensions are denoted. Lastly, \d{2,5} accounts for extensions that are 2 to 5 digits long. Here are some sample phone numbers with extensions: 345-6571 x 453, (423) 341-9872 ext 45, and 221-986-1034 ext*86298.
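
To check the walkthrough against the sample numbers, here is a small sketch that runs the phone pattern over them. The helper firstMatch and the last sample (a number with no separator) are my own additions for illustration.

```python
import re

# The phone pattern from this post, unchanged.
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?
    (\s|-|\.)?
    (\d{3})
    (\s|-|\.)
    (\d{4})
    (\s*(ext|x|ext.)\s*(\d{2,5}))?
    )''', re.VERBOSE)

def firstMatch(text):
    """Hypothetical helper: return the first matched phone number, or None."""
    match = phoneRegex.search(text)
    return match.group(0) if match else None

samples = [
    '342-345-5454',
    '342.345-5454',
    '(342) 345-5454',
    '345-6571 x 453',
    '(423) 341-9872 ext 45',
    '221-986-1034 ext*86298',
    '3455454',  # no separator, so the mandatory fourth line cannot match
]

for sample in samples:
    # The first six samples match in full; the last one gives None.
    print(repr(sample), '->', repr(firstMatch(sample)))
```

Since the . in ext. is not escaped, it matches any single character, which is why ext*86298 is accepted. Writing ext\. instead would restrict it to a literal period.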

Now, time to write down our regex for email addresses.

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+
    @
    [a-zA-Z0-9._-]+
    (\.[a-zA-Z]{2,4})
    )''', re.VERBOSE)
    

[a-zA-Z0-9._%+-]+ represents an expression that occurs at least once (it may occur more than once, but not zero times). The contents of the square brackets define our own class of characters. a-z, A-Z and 0-9 together cover any alphanumeric character. Email addresses may also contain periods, underscores, % signs, + signs or hyphens (terribly common), which is why they are included. Then we have the @ sign. After that, we need to identify the domain, which could be made up of any alphanumeric character along with hyphens, underscores and periods. All of these are included inside our square brackets. Lastly, \. represents a literal period, to start .com or .net. This period is followed by a string of letters that is 2 to 4 characters long. Examples: .nl, .in, .com, .gov.
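
Here is a quick sketch of the email pattern at work. The sample sentence is made up, and the domain class is written out to match the description above (alphanumerics, periods, underscores and hyphens), with the hyphen placed last so that it is read as a literal hyphen and not as a range.

```python
import re

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+      # username
    @
    [a-zA-Z0-9._-]+        # domain name (hyphen last, so it is literal)
    (\.[a-zA-Z]{2,4})      # dot-something
    )''', re.VERBOSE)

# Made-up sample text; the last address has no dot-something part.
sample = ('Write to jane.doe+news@example.com or '
          'admin@mail.example.co.in, not to jane@localhost.')

# findall() returns one tuple per match; index 0 is the whole address.
emails = [groups[0] for groups in emailRegex.findall(sample)]
print(emails)  # ['jane.doe+news@example.com', 'admin@mail.example.co.in']
```

Inside square brackets, a hyphen between two characters defines a range, so its position matters. Also note that the {2,4} limit means longer endings are cut off after four letters.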

The ZIP code is really easy. We just need six consecutive digits.

zipRegex = re.compile(r'\d{6}')
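
One thing worth knowing about this pattern is that it also matches the first six digits of any longer run of digits. The sketch below shows this on a made-up string; strictZipRegex is a hypothetical variant of my own that uses \b word boundaries, and is not part of the script.

```python
import re

zipRegex = re.compile(r'\d{6}')            # the pattern from this post
strictZipRegex = re.compile(r'\b\d{6}\b')  # hypothetical stricter variant

sample = 'ZIP 110001, order no. 12345678'

print(zipRegex.findall(sample))        # ['110001', '123456']
print(strictZipRegex.findall(sample))  # ['110001']
```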

The URL regex is also easy. We just need to find all ‘words’ starting with http, which covers https on its own, and ending with a .<something>. Note that the match stops at the first period that is followed by two to four letters.

urlRegex = re.compile(r'http.*?\.[a-zA-Z]{2,4}')
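
To see what this pattern actually captures, here is a small sketch on a made-up sentence. It works as intended for a bare domain, but cuts a longer URL short at the first period followed by letters. looseUrlRegex is a hypothetical alternative of my own (everything up to the next whitespace), shown only for comparison.

```python
import re

urlRegex = re.compile(r'http.*?\.[a-zA-Z]{2,4}')  # the pattern from this post
looseUrlRegex = re.compile(r'https?://\S+')       # hypothetical alternative

sample = 'See https://example.com and http://www.example.org/about.html for details.'

print(urlRegex.findall(sample))
# ['https://example.com', 'http://www.exam']

print(looseUrlRegex.findall(sample))
# ['https://example.com', 'http://www.example.org/about.html']
```

The looser pattern has its own trade-off: it would also swallow punctuation that directly follows a URL, such as a trailing comma.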

Using the pyperclip module, we will read our text from the clipboard. This is a matter of preference, but I want to collect all search results in one list instead of keeping separate lists for the different types of searches. Here is the code.

text = str(pyperclip.paste())

# <...
#       various Regexes
# ...>

res = []

for number in phoneRegex.findall(text):
    # Skip the area code when it is missing, to avoid a leading hyphen.
    phoneNum = '-'.join(part for part in (number[1], number[3], number[5]) if part)
    if number[8] != '':
        phoneNum += ' x' + number[8]
    res.append(phoneNum)

for email in emailRegex.findall(text):
    res.append(email[0])

for url in urlRegex.findall(text):
    res.append(url)

for ZIP in zipRegex.findall(text):
    res.append(ZIP)
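
The indices 1, 3, 5 and 8 come from the way findall() returns one tuple per match, with one entry per group in the pattern. The sketch below prints that layout for a made-up string. formatPhone is a hypothetical helper of my own; it drops empty parts before joining, so a number without an area code does not pick up a leading hyphen.

```python
import re

# The phone pattern from this post, unchanged.
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?
    (\s|-|\.)?
    (\d{3})
    (\s|-|\.)
    (\d{4})
    (\s*(ext|x|ext.)\s*(\d{2,5}))?
    )''', re.VERBOSE)

def formatPhone(groups):
    """Hypothetical helper: build 'area-first-last xEXT' from one findall tuple."""
    # Indices 1, 3 and 5: area code, first three digits, last four digits.
    parts = [groups[1], groups[3], groups[5]]
    phoneNum = '-'.join(part for part in parts if part)
    # Index 8: the extension digits, or '' when there is no extension.
    if groups[8]:
        phoneNum += ' x' + groups[8]
    return phoneNum

text = 'Call (423) 341-9872 ext 45 or 345-6571.'
matches = phoneRegex.findall(text)

# Show every group of the first match next to its index.
for index, value in enumerate(matches[0]):
    print(index, repr(value))

print([formatPhone(groups) for groups in matches])
# ['(423)-341-9872 x45', '345-6571']
```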

The phone number loop is the most involved part, as the output depends on whether the number contains an area code and/or an extension. We then copy all the information gathered to the clipboard, or print ‘no information found’ if the list of results is empty.

if len(res) > 0:
    pyperclip.copy('\n'.join(res))
    print('Copied to clipboard...')
else:
    print('no information found')
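
To try the whole pipeline without touching the clipboard, the steps can be wrapped in a function that takes a string and returns the list of results. This is only a sketch: extractAll and the sample text are my own, the email domain class is written out as described earlier, and empty phone parts are skipped when joining. In the script itself the input would come from pyperclip.paste() and the joined result would go to pyperclip.copy().

```python
import re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?
    (\s|-|\.)?
    (\d{3})
    (\s|-|\.)
    (\d{4})
    (\s*(ext|x|ext.)\s*(\d{2,5}))?
    )''', re.VERBOSE)

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+
    @
    [a-zA-Z0-9._-]+
    (\.[a-zA-Z]{2,4})
    )''', re.VERBOSE)

zipRegex = re.compile(r'\d{6}')
urlRegex = re.compile(r'http.*?\.[a-zA-Z]{2,4}')

def extractAll(text):
    """Hypothetical helper: return phones, emails, URLs and ZIP codes, in that order."""
    res = []
    for number in phoneRegex.findall(text):
        phoneNum = '-'.join(part for part in (number[1], number[3], number[5]) if part)
        if number[8] != '':
            phoneNum += ' x' + number[8]
        res.append(phoneNum)
    res.extend(email[0] for email in emailRegex.findall(text))
    res.extend(urlRegex.findall(text))
    res.extend(zipRegex.findall(text))
    return res

sample = ('Reach Jane at jane@example.com or 342-345-5454 x 12. '
          'Site: https://example.com ZIP 110001')

results = extractAll(sample)
# Print one result per line, or the fallback message when nothing was found.
print('\n'.join(results) or 'no information found')
```

For the sample above this prints the phone number, the email address, the URL and the ZIP code, one per line.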

The entire code can be found here. Also, there are some helpful links for understanding regexes below.
