Extracting Passwords Using Regular Expressions in Python.

This is my second post on regular expressions. Here, I’m attempting to solve one of the practice tasks given at the end of Chapter 7, Automate The Boring Stuff. Instead of merely identifying whether or not a given string is a strong password, I will write a program that takes a list of strings, and returns a list containing those strings which will make valid ‘strong passwords.’

  • Extract potential strong passwords from text in Clipboard
  • A strong password must have both uppercase and lowercase characters, as well as digits. It must also be at least eight characters long, and must not have any spaces.

For this piece of code, we will be using ‘lookaheads,’ a concept that was new to me. First, some basic information. A ‘lookahead’ is like an if condition. It has the following format (not many websites describe it this way, but I find it intuitive): (?=Regex1)(Regex2). Here, Regex2 is attempted if and only if (iff) Regex1 matches at the current position. This means that, as in an if ... else statement, we can write something like (?=if digit is present)(read string). Note that a lookahead does not consume any characters, so Regex1 matching doesn’t affect where Regex2 starts reading. Another thing we need to do is ensure that we scan only one password at a time, and not the entire text.

We can do this either by listing all allowed characters except spaces inside square brackets, like [a-zA-Z0-9_+$%...]*, or by simply specifying ‘all non-space characters’ with [^\s]* (equivalent to \S*), which is much simpler. One could also use \w*, but this would not match characters commonly used in passwords, like %, #, @, ! et cetera.
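
Here is a minimal sketch of the two ideas so far: a lookahead acting as a condition, and the difference between \w* and [^\s]*. The helper name read_if_has_digit and the sample strings are made up for illustration.

```python
import re

def read_if_has_digit(word):
    """Return the word only if a digit lies somewhere ahead in it."""
    # (?=\S*\d) is the condition; it consumes nothing, so \S+ still
    # starts reading from the first character.
    match = re.match(r'(?=\S*\d)\S+', word)
    return match.group() if match else None

print(read_if_has_digit('pass1word'))   # condition holds, whole word is read
print(read_if_has_digit('password'))    # no digit, so nothing is read

# \w* stops at the first symbol, while [^\s]* keeps going until a space.
sample = 'P@ssw0rd!2024 next'
print(re.match(r'\w*', sample).group())
print(re.match(r'[^\s]*', sample).group())
```

The first call returns the whole word and the second returns None, which is the "if" behaviour described above.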

Then, we need three lookaheads. One to check for a lowercase character. One to check for an uppercase character. One to check for a digit. We also need to use a {8,} quantifier to ensure that the password is at least 8 characters long. Here is the code:

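import re

strongPassword = re.compile(r'''(
    (?=\S*\d)       # a digit somewhere in this word
    (?=\S*[a-z])    # a lowercase letter somewhere in this word
    (?=\S*[A-Z])    # an uppercase letter somewhere in this word
    [^\s]{8,}       # at least 8 non-space characters
    )''', re.VERBOSE)

# text is assumed to already hold the clipboard contents
res = []
for potentialPassword in strongPassword.findall(text):
    res.append(potentialPassword)
    print(potentialPassword)

The easiest part is the last line of the regex, which accepts any run of non-space characters that is at least 8 characters long, provided the lookaheads before it succeed. Each lookahead follows the same format.

(?=\S*\d), which is equivalent to (?=\S*[0-9]), asserts that a digit appears somewhere in the current word, without consuming any characters. Note the \S* rather than .* here: since . also matches spaces, .* would let a word pass the check thanks to a digit in a later word on the same line. Similarly, the other two lookaheads check for a lowercase letter and an uppercase letter respectively.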

We then just output our results.
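
For reference, here is a self-contained sketch of the whole step that can be run without a clipboard. It writes the lookaheads with \S* so that each check stays inside one whitespace-free word, and adds a (?<!\S) lookbehind so matching only starts at the beginning of a word. The function name find_strong_passwords and the sample text are my own assumptions, not part of the original program.

```python
import re

STRONG_PASSWORD = re.compile(r'''
    (?<!\S)          # start at the beginning of a word
    (?=\S*\d)        # a digit somewhere in this word
    (?=\S*[a-z])     # a lowercase letter somewhere in this word
    (?=\S*[A-Z])     # an uppercase letter somewhere in this word
    \S{8,}           # the word itself: 8 or more non-space characters
    ''', re.VERBOSE)

def find_strong_passwords(text):
    """Return every whitespace-delimited word that qualifies as strong."""
    return STRONG_PASSWORD.findall(text)

sample = 'hello Str0ngPass! weakpass Sh0rt another1Good'
print(find_strong_passwords(sample))
```

With this sample, only Str0ngPass! and another1Good satisfy all four conditions; Sh0rt has every character class but is too short.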

You can find the full code here at Github.

Personally, I found this really difficult, mainly because I didn’t know how to chain multiple lookaheads together. Simply putting them one after the other without a leading wildcard required the characters to be in a fixed order (lowercase first etc.). Nesting them like a bad if..else statement gave similar results. I ultimately found the correct solution here on Stack Overflow, through an implementation for jQuery, although not without surfing for over 1.5 hours.

Moreover, the implementation in the official regex website was absolutely nightmarish. A long, single line that was overflowing onto the right side of their webpage. Turns out that their implementation doesn’t even work, for reasons I couldn’t comprehend.

Using Reg. Expressions to Extract Useful Information in Python.

Having just completed chapter 7 of Automate the Boring Stuff, I wanted to test my skills. I’m writing some code that will carry out the following tasks.

  • Use text copied onto the clipboard
  • Go through the text and store URLs, email addresses, phone numbers and ZIP codes.
  • Copy the information found to the clipboard, and print it on the CLI as well.

The full code is available here on Github

To do this, we’ll need two modules: the re module for writing and using regular expressions, and the pyperclip module for using our clipboard. First, let’s import these modules and write down some rough formats for the stuff we want.

#! python3
import pyperclip, re
# These are our formats.
#
#   PHONE         | EMAIL           | ZIP      | URL
#   --------------|-----------------|----------|------------
#   area code*    | username        | 6 digits | protocol
#   separator*    | @               |          | server
#   1st 3 digits  | domain          |          | file name
#   separator     | .(com)          |          |
#   last 4 digits |                 |          |
#   extension*    |                 |          |
#
#   *optional

Now that we have this information, let’s make a regular expression, or regex for short, for each category.

For phone numbers, we want three optional items and three necessary items, as shown above. To make an item optional, we can group it using parentheses and then place a question mark after the group, so that the number is read even if the contents of the group are not present. Like this: (<regex>)?. The ? operator specifies that a group occurs either once or not at all. Now, let’s write our regex for phone numbers. Do note that I will use re.VERBOSE to spread the regex over multiple lines, for ease of readability.

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?
    (\s|-|\.)?
    (\d{3})
    (\s|-|\.)
    (\d{4})
    (\s*(ext|x|ext.)\s*(\d{2,5}))?
    )''', re.VERBOSE)

Let’s go over this line by line. The first line, (\d{3}|\(\d{3}\))?, is for scanning any area/region/country codes. We’re assuming these codes to be 3 digits in length. The | is added so that we can read codes regardless of whether or not they are enclosed within parentheses. Some possible codes are 761 or (342). If there is a code, there will usually be a separator between it and the rest of the number, which is why the second line is required. We allow for either a space, a hyphen or a dot. For example, these numbers (with codes) would be read: 342-345-5454, 342.345-5454, and (342) 345-5454. From then onwards, the regex is pretty simple. The third line just scans three consecutive digits (as is mandatory in phone numbers). The fourth line is a repetition of the second without the ?, as the two halves of a number are in most cases separated by a hyphen, a space or a dot. The fifth line scans the last four digits.

The last line could be a bit tricky. It is meant to include any extension that the owner of the phone number has. \s* is added as there can be an arbitrary amount of space between the number and the extension. Then, ext|x|ext., where . is the wildcard character, accounts for the ways extensions are denoted. Lastly, \d{2,5} accounts for extensions that are 2-5 digits long. Here are some sample phone numbers with extensions: 345-6571 x 453, (423) 341-9872 ext 45, and 221-986-1034 ext*86298.
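
As a quick check, the sample numbers mentioned above can be run through the phone regex. The helper match_phone is a hypothetical name used only for this sketch; the regex itself is copied unchanged.

```python
import re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?
    (\s|-|\.)?
    (\d{3})
    (\s|-|\.)
    (\d{4})
    (\s*(ext|x|ext.)\s*(\d{2,5}))?
    )''', re.VERBOSE)

def match_phone(text):
    """Return the first full phone number found in text, or None."""
    match = phoneRegex.search(text)
    return match.group(1) if match else None

samples = [
    '342-345-5454', '342.345-5454', '(342) 345-5454',
    '345-6571 x 453', '(423) 341-9872 ext 45', '221-986-1034 ext*86298',
]
for sample in samples:
    print(sample, '->', match_phone(sample))
```

Each sample should come back whole, including the one without an area code and the ones with extensions.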

Now, time to write down our regex for email addresses.

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+
    @
    [a-zA-Z0-9._-]+
    (\.[a-zA-Z]{2,4})
    )''', re.VERBOSE)
    

[a-zA-Z0-9._%+-]+ represents an expression that occurs at least once (it may occur more than once, but not zero times). The content of the square brackets represents our own class of characters. a-z A-Z 0-9 represents any alphanumeric character. Email addresses may also contain periods, underscores, % signs, + signs or hyphens (terribly common), which is why they are included. Then we have the @ sign. After that, we need to identify the domain, which could be made up of any alphanumeric character along with hyphens, underscores and periods. All of these are included inside our square brackets. Lastly, \. represents a literal period, to start .com or .net. This period is followed by a string of letters that is 2 to 4 characters long. Examples: .nl, .in, .com, .gov.
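
One detail worth checking when writing character classes like these is where the hyphen sits. A hyphen between two characters forms a range, so it is safest to put a literal hyphen last (or first). The small sketch below compares the two spellings; the variable names are just for illustration.

```python
import re

accidental = re.compile(r'[.-_]')   # a range from '.' (0x2E) to '_' (0x5F)
intended = re.compile(r'[._-]')     # literal '.', '_' and '-'

for ch in ['-', '/', '@', '7', 'Q']:
    print(repr(ch),
          bool(accidental.fullmatch(ch)),
          bool(intended.fullmatch(ch)))
```

The range version accepts characters such as / and @ but rejects the hyphen itself, which is the opposite of what a domain pattern wants.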

The ZIP code is really easy. We just need six consecutive digits.

zipRegex = re.compile(r'\d{6}')
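
One thing to keep in mind is that a bare \d{6} also matches the first six digits of any longer number. A possible refinement, which is my own assumption and not part of the original program, is to add word boundaries:

```python
import re

text = '560001 and 9876543210'

# Plain pattern: also picks up six digits from inside the longer number.
print(re.findall(r'\d{6}', text))

# With \b on both sides, only standalone six-digit numbers are kept.
print(re.findall(r'\b\d{6}\b', text))
```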

The URL regex is also easy. We just need to find all ‘words’ starting with http, which covers https on its own, and ending with a .<something>.

urlRegex = re.compile(r'http\S*\.[a-zA-Z]{2,4}')  # \S* keeps the match inside one word
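
The choice of quantifier matters in this kind of pattern. Here is a small comparison of a lazy .*? against a greedy \S* on a made-up URL; the sample sentence is only for illustration.

```python
import re

text = 'see https://www.example.com/docs/page.html for details'

# Lazy: stops at the first dot that is followed by 2-4 letters.
lazy = re.search(r'http.*?\.[a-zA-Z]{2,4}', text).group()

# Greedy within the word: runs to the last such dot before a space.
greedy = re.search(r'http\S*\.[a-zA-Z]{2,4}', text).group()

print(lazy)
print(greedy)
```

The lazy version cuts the address off inside the host name, while the greedy \S* version returns the whole word.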

Using the pyperclip module, we will copy our text from the clipboard. Although this may vary from person to person, I want to include all search results in one list instead of different lists for different types of searches. Here is the code.

text = str(pyperclip.paste())

# <...
#       various Regexes
# ...>

res = []

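for number in phoneRegex.findall(text):
    # Join only the parts that are present, so a missing area code
    # does not leave a leading hyphen.
    phoneNum = '-'.join(part for part in (number[1], number[3], number[5]) if part)
    if number[8] != '':
        phoneNum += ' x' + number[8]
    res.append(phoneNum)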

for email in emailRegex.findall(text):
    res.append(email[0])

for url in urlRegex.findall(text):
    res.append(url)

for ZIP in zipRegex.findall(text):
    res.append(ZIP)
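
The phone-number branch of the loop above can also be sketched as a small helper that works on the tuples returned by findall. The name format_phone is hypothetical, and stripping the parentheses from the area code is my own assumption about the desired output.

```python
import re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?
    (\s|-|\.)?
    (\d{3})
    (\s|-|\.)
    (\d{4})
    (\s*(ext|x|ext.)\s*(\d{2,5}))?
    )''', re.VERBOSE)

def format_phone(groups):
    """Build 'area-first-last xEXT' from one findall() tuple."""
    area, first, last, ext = groups[1], groups[3], groups[5], groups[8]
    area = area.strip('()')  # assumption: drop parentheses around the code
    number = '-'.join(part for part in (area, first, last) if part)
    if ext:
        number += ' x' + ext
    return number

text = 'Call (423) 341-9872 ext 45 or 345-6571.'
print([format_phone(groups) for groups in phoneRegex.findall(text)])
```

Index 0 of each tuple is the whole match, which is why the helper reads the area code, the two digit groups and the extension from indices 1, 3, 5 and 8.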

The functionality pertaining to searching for and storing a valid phone number is complex as our output changes with whether or not the phone number contains the area code and/or extension. We then copy all information gathered to the Clipboard, and print ‘no information found’ if the list of results is empty.

if len(res) > 0:
    pyperclip.copy('\n'.join(res))
    print('Copied to clipboard...')
else:
    print('no information found')

The entire code can be found here. Also, there are some helpful links for understanding Regexes below.
