Using Regular Expressions to Extract Useful Information in Python

Having just completed chapter 7 of Automate the Boring Stuff, I wanted to test my skills. I’m writing down some code that will carry out the following tasks.

  • Use text copied onto the clipboard
  • Go through the text and store any URLs, email addresses, phone numbers and ZIP codes it contains
  • Copy the information found back to the clipboard, and print it on the CLI as well

The full code is available here on GitHub.

To do this, we’ll need two modules: the re module for writing and using regular expressions, and the pyperclip module for accessing the clipboard. First, let’s import these modules and write down some rough formats for the things we want to match.

import pyperclip, re

# These are our formats.
#
#   PHONE         | EMAIL        | ZIP      | URL
#   --------------|--------------|----------|------------
#   area code*    | username     | 6 digits | protocol
#   separator*    | @            |          | server
#   1st 3 digits  | domain       |          | file name
#   separator     | .<something> |          |
#   last 3 digits |              |          |
#   extension*    |              |          |
#
#   *optional

Now that we have this information, let’s make a regular expression, or regex for short, for each category.

For phone numbers, we want three optional items and three necessary items, as shown above. To make items optional, we can group them using parentheses and then place a question mark after the group, so that the number is read even if the contents of the group are not present, like this: (<regex>)?. The ? operator specifies a group that occurs either once or not at all. Now, let’s write our regex for phone numbers. Do note that I will use re.VERBOSE to spread the regex over multiple lines, for readability.
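Before tackling phone numbers, here is a tiny sketch of the ? operator in action (the (wo)?man pattern is just an illustration, not part of the project):

```python
import re

# The group (wo)? matches either once or not at all,
# so both "man" and "woman" are read by the same pattern.
optionalRegex = re.compile(r'(wo)?man')

print(optionalRegex.search('woman').group())  # woman
print(optionalRegex.search('man').group())    # man
```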

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?
    (\s|-|\.)?
    (\d{3})
    (\s|-|\.)
    (\d{4})
    (\s*(ext|x|ext.)\s*(\d{2,5}))?
    )''', re.VERBOSE)

Let’s go over this line by line. The first line, (\d{3}|\(\d{3}\))?, is for scanning any area/region/country codes. We’re assuming these codes to be 3 digits in length. The | is added so that we can read codes regardless of whether or not they are enclosed within parentheses. Some possible codes are 761 or (342). If there is a code, there will usually be a separator between it and the rest of the number, which is why the second line is required; we allow for a space, a hyphen or a dot. For example, these numbers (with codes) would be read: 342-345-5454, 342.345-5454, and (342) 345-5454. From then onwards, the regex is pretty simple. The third line just scans three consecutive digits, as is mandatory in phone numbers. The fourth line is just a repetition of the second, as elements within a number are separated by a hyphen, dot or space in most cases.

The last line could be a bit tricky. It is meant to include any extension that the owner of the phone number has. \s* is added as there can be an arbitrary amount of space between the number and the extension. Then, ext|x|ext., where . is the wildcard character, accounts for the ways extensions are denoted. Lastly, \d{2,5} accounts for extensions that are 2 to 5 digits long. Here are some sample phone numbers with extensions: 345-6571 x 453, (423) 341-9872 ext 45, and 221-986-1034 ext*86298.
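To check the regex behaves as described, here is a quick sketch using sample numbers from above:

```python
import re

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?
    (\s|-|\.)?
    (\d{3})
    (\s|-|\.)
    (\d{4})
    (\s*(ext|x|ext.)\s*(\d{2,5}))?
    )''', re.VERBOSE)

sample = 'Call 342-345-5454 or (423) 341-9872 ext 45 for details.'
for match in phoneRegex.findall(sample):
    # findall returns a tuple per match; index 0 is the whole number
    print(match[0])
```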

Now, time to write down our regex for email addresses.

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+
    @
    [a-zA-Z0-9._-]+
    (\.[a-zA-Z]{2,4})
    )''', re.VERBOSE)
    

[a-zA-Z0-9._%+-]+ represents an expression that occurs at least once (it may occur more than once, but not zero times). The content of the square brackets is our own class of characters: a-z, A-Z and 0-9 cover any alphanumeric character. Email addresses may also contain periods, underscores, % signs, + signs or hyphens (all fairly common), which is why they are included. Then we have the @ sign. After that, we need to identify the domain, which could be made up of any alphanumeric character along with hyphens, underscores and periods; all of these go inside our square brackets (when writing such a class, the hyphen should be placed last so it is read as a literal hyphen rather than a range). Lastly, \. represents a literal period, to start .com or .net. This period is followed by a string of letters that is 2 to 4 characters long. Examples: .nl, .in, .com, .gov.
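A quick check of the email regex (the sample addresses are made up; note the domain class here writes the hyphen last so it is a literal hyphen, not a range):

```python
import re

emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+
    @
    [a-zA-Z0-9._-]+
    (\.[a-zA-Z]{2,4})
    )''', re.VERBOSE)

sample = 'Contact me at john.doe+spam@my-site.co or admin@example.gov.'
# findall returns a tuple per match; index 0 is the whole address
emails = [match[0] for match in emailRegex.findall(sample)]
print(emails)
```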

The ZIP code is really easy. We just need six consecutive digits.

zipRegex = re.compile(r'\d{6}')

The URL regex is also easy: we just need to find all ‘words’ starting with http, which covers https on its own, and ending with a .<something>.

urlRegex = re.compile(r'http.*?\.[a-zA-Z]{2,4}')
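A quick check of both (the sample text is invented; note that \d{6} fires on any six consecutive digits, so long numbers in the text can produce false positives for ZIP codes):

```python
import re

zipRegex = re.compile(r'\d{6}')
urlRegex = re.compile(r'http.*?\.[a-zA-Z]{2,4}')

sample = 'Visit https://example.com for help. PIN code 110001.'
print(urlRegex.findall(sample))  # ['https://example.com']
print(zipRegex.findall(sample))  # ['110001']
```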

Using the pyperclip module, we will grab our text from the clipboard. Although this may vary from person to person, I want to include all search results in one list instead of different lists for different types of searches. Here is the code.

text = str(pyperclip.paste())

# <...
#       various Regexes
# ...>

res = []

for number in phoneRegex.findall(text):
    phoneNum = '-'.join([number[1], number[3], number[5]])
    if number[8] != '':
        phoneNum += ' x' + number[8]
    res.append(phoneNum)

for email in emailRegex.findall(text):
    res.append(email[0])

for url in urlRegex.findall(text):
    res.append(url)

for ZIP in zipRegex.findall(text):
    res.append(ZIP)

The functionality for finding and storing a valid phone number is the most complex, as our output changes depending on whether or not the phone number contains an area code and/or extension. We then copy all the information gathered to the clipboard, and print ‘no information found’ if the list of results is empty.

if len(res) > 0:
    pyperclip.copy('\n'.join(res))
    print('Copied to clipboard...')
else:
    print('no information found')
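If you want to test the whole pipeline without touching the clipboard, you can run the same loops on a hard-coded string; this is a self-contained sketch (the sample text is invented, and you can swap the string for pyperclip.paste() for the real thing):

```python
import re

# the same four regexes as above, bundled together
phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?
    (\s|-|\.)?
    (\d{3})
    (\s|-|\.)
    (\d{4})
    (\s*(ext|x|ext.)\s*(\d{2,5}))?
    )''', re.VERBOSE)
emailRegex = re.compile(r'''(
    [a-zA-Z0-9._%+-]+
    @
    [a-zA-Z0-9._-]+
    (\.[a-zA-Z]{2,4})
    )''', re.VERBOSE)
zipRegex = re.compile(r'\d{6}')
urlRegex = re.compile(r'http.*?\.[a-zA-Z]{2,4}')

text = '''Office: (342) 345-5454 ext 86
Email: support@example.org
Site: https://example.org
PIN: 560037'''

res = []
for number in phoneRegex.findall(text):
    phoneNum = '-'.join([number[1], number[3], number[5]])
    if number[8] != '':
        phoneNum += ' x' + number[8]
    res.append(phoneNum)
for email in emailRegex.findall(text):
    res.append(email[0])
for url in urlRegex.findall(text):
    res.append(url)
for ZIP in zipRegex.findall(text):
    res.append(ZIP)

print(res)
```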

The entire code can be found here. Also, there are some helpful links for understanding Regexes below.


Common Applications of Time Series Data (Python x Pandas)

Firstly, time in Python can be represented as either a string or a timedelta or datetime object; we’ll have more on those two below. When reading data from a .csv file using the pd.read_csv function, time is generally read as a string, which can be converted to a datetime object with a very simple function. But more importantly, it is imperative to understand how time is actually represented.

YYYY-MM-DD HH:mm:ss.sss (HH is hours, mm is minutes, ss is seconds, and the digits after the point are fractions of a second).

This notation is both simple and effective when dealing with time. Now, while reading, your code sees this format as a string, so it doesn’t know whether it represents time or not. pd.to_datetime is a vectorized function that converts all values in a Series (for dataframes, a specific column) to datetime objects. Intuitively, the first half is the ‘date’, and the second part the ‘time.’
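A minimal sketch of that conversion (the column values are made up):

```python
import pandas as pd

# what pd.read_csv typically hands you: a column of strings
df = pd.DataFrame({'time': ['2019-06-19 04:03:12', '2019-06-19 09:06:15']})
print(df.time.dtype)   # object, i.e. plain strings

# one vectorized call converts the whole column
df['time'] = pd.to_datetime(df.time)
print(df.time.dtype)   # datetime64[ns]
```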

Perhaps the most important element of dealing with time in Python is your purpose. Generally, one would use this data either to calculate the amount of time passed between readings, or to calculate the amount of time between a starting time and the current time. Let’s look at each separately.

Between Readings

To start off, I’ll write down an example. Say you have a pH sensor that gives you the following values: 7, 8, 9, 8. Your computer also logs the time at which these values are received: 2019-06-19 4:03:12, 2019-06-19 9:06:15, 2019-06-19 23:56:55 and 2019-06-20 1:01:04. Now, we have to find the amount of time passed between readings. Thus, you have a pandas dataframe as follows.

     pH                 time
0     7   2019-06-19 4:03:12
1     8   2019-06-19 9:06:15
2     9  2019-06-19 23:56:55
3     8   2019-06-20 1:01:04
Now, what we will do here is shift each time value down by 1 and store it in a series. This is done using the shift() function from pandas, which takes an argument, periods (default 1), that determines the number of places to shift by, and an axis argument that determines whether to shift across columns or down the rows (the default is down the rows).
 
shifted_time = df.time.shift(1)

creating this series:

0                    NaT
1     2019-06-19 4:03:12
2     2019-06-19 9:06:15
3    2019-06-19 23:56:55
Name: shifted_time

now, the missing value in the first row has nothing to be compared against, so we set its difference to zero to make sure every reading is taken into consideration. The difference can now be calculated as follows.

diff = df.time - shifted_time
diff[0] = pd.Timedelta(0)    # the initial value has nothing to be compared against
df['difference'] = diff

creating the following data frame and giving us the amount of time between readings:

     pH                 time       difference
0     7   2019-06-19 4:03:12  0 days 00:00:00
1     8   2019-06-19 9:06:15  0 days 05:03:03
2     9  2019-06-19 23:56:55  0 days 14:50:40
3     8   2019-06-20 1:01:04  0 days 01:04:09
This difference can be converted to seconds by specific functions, such as Series.dt.total_seconds(), which gives us:

     pH                 time  difference
0     7   2019-06-19 4:03:12           0
1     8   2019-06-19 9:06:15       18183
2     9  2019-06-19 23:56:55       53440
3     8   2019-06-20 1:01:04        3849
The difference is in seconds now. Note: subtracting two datetime objects automatically creates a timedelta object; there is no need to do this manually in recent versions of pandas.
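Putting the pieces together, here is a compact, self-contained version of the steps above, using the vectorized subtraction directly:

```python
import pandas as pd

df = pd.DataFrame({
    'pH':   [7, 8, 9, 8],
    'time': pd.to_datetime(['2019-06-19 04:03:12',
                            '2019-06-19 09:06:15',
                            '2019-06-19 23:56:55',
                            '2019-06-20 01:01:04']),
})

# subtracting the shifted column gives one timedelta per row (NaT in the first)
df['difference'] = df.time - df.time.shift(1)
df['seconds'] = df.difference.dt.total_seconds().fillna(0)
print(df)
```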

Now, these sorts of computations have many real-world applications; Formula 1, for example. If you’ve ever noticed, the person who comes first (Hamilton, most of the time) has his time written in absolute terms, i.e., his time is written as it is. This is done by initializing the time column of the first row to 0. Then, from second place onwards, each driver’s time is represented as the difference by which they lost to the first person. I suggest you google F1 results and have a look yourself.

Differences from starting point

This methodology isn’t so visible to end users, but is extremely useful to data scientists looking to check whether the time data they receive has breaks or not. This is done by taking the minimum time value (normally the first one) and subtracting it from each value, giving the time passed from the start to that specific point. These ‘differences’ are then plotted on the vertical axis, with the index of each reading in the entire data set on the horizontal axis.

Consequently, you get graphs like these that display the continuity of the data you have.

A log scale has been taken on the y-axis to ensure that small difference values are noticeable.
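The subtract-the-minimum step can be sketched in a couple of lines (timestamps reused from the pH example above):

```python
import pandas as pd

times = pd.to_datetime(['2019-06-19 04:03:12',
                        '2019-06-19 09:06:15',
                        '2019-06-19 23:56:55',
                        '2019-06-20 01:01:04'])

# time elapsed from the first (minimum) reading to each reading, in seconds;
# plotting this against the index (log y-axis) exposes any breaks in the data
elapsed = (times - times.min()).total_seconds()
print(list(elapsed))
```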

Now, Python, and especially pandas, has many functions and methods to deal with time series data. This link is the best place to explore them –> https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html. One functionality that I absolutely adore is the ability to convert data between timezones, adding both flexibility and adaptability to your code. This is extremely prominent on the web. For example, if you’ve ever noticed, most websites display the timings for a sports event in your timezone. That is independent of Python and pandas, but has the same functionality: using your location, the time data employed by the server is converted to your timezone.

Iterative and Recursive approaches to implementing the Fibonacci Series (Python)

So I was writing a program to implement the following task. For any value of n greater than or equal to 2, F(n) equals the sum of F(n-1) and F(n-2), where F(x) returns the x-th number in the Fibonacci series (with F(0) = 0 and F(1) = 1).

For example, F(4) would equal F(3) + F(2). Now, F(3) is 2 and F(2) is 1, so F(4) = F(3) + F(2) = 2 + 1 = 3. Below is the original algorithm I came up with. It turns out that this algorithm is perfectly applicable for small values of n, say up to 30. However, the ‘runtime cost’ of this algorithm is exponential, which is why I wanted to try out the iterative approach as well (linear runtime cost).

# fibonacci sequence --> 0 1 1 2 3 5 8 13 21

def fibonacci(n):
	# f(n) = f(n-1) + f(n-2) where n > 1
	a, b = 0, 1
	if n == 0:
		return a
	elif n == 1:
		return b
	elif n > 1 and n < 30: 
		return fibonacci(n-1) + fibonacci(n-2)
	else:
		return 'n should be between 2 and 29'

print(fibonacci(6))  # input here

# methodology, every time, f(n) breaks down further and further
# eg. f(4) = f(3) + f(2)
#		   = f(2) + f(1) + f(1)      f(0) not written because it is = 0, no effect on output
#          = f(1) + 2f(1)
#          = 3f(1)
     

The above approach, or rather the one I originally came up with, has an exponential runtime cost. How?

F(n) => F(n - 1) + F(n - 2) + c
% c is a constant. For the exercise above, it is 0.

% we can assume both terms to be approximately equal: F(n - 1) ~= F(n - 2)
F(n) => F(n - 2) + F(n - 2) + c
     => 2F(n - 2) + c        (since F(n - 2) = F(n - 3) + F(n - 4) + c ~= 2F(n - 4) + c)
     => 2(2F(n - 4) + c) + c
     => 4F(n - 4) + 3c
repeat the process:
     => 4(2F(n - 6) + c) + 3c
     => 8F(n - 6) + 7c

now, set 2^k = T, where T is the coefficient of F(n - ...). In the example above, T = 8.

F(n) ==> 2^k * F(n - 2k) + (2^k - 1)c

Now, we will set n - 2k equal to 0. This means that k should equal n/2. Thus, the equation simplifies as follows and the exponential cost is revealed.

k = n / 2
F(n) ==> 2^(n/2) * F(0) + (2^(n/2) - 1)c
This simplifies to 2^(n/2), as the other terms can be approximated as having either no cost or a constant cost that does not change with n. This means the cost of the algorithm is O(2^(n/2)), which shows that it increases exponentially with n.

To avoid an exponentially increasing cost, we can adopt an iterative approach, which avoids stacking function calls on top of each other as in a recursive approach.

def fibs(n):    # iterative approach
    seq = [0, 1, 1]
    for f in range(2, n):
        seq.append(seq[-1] + seq[-2])
    return seq[n]

Now, the cost of this algorithm is O(n), as the loop body will only be executed n times, or rather n - 2 times to be exact, as the for loop starts iterating from 2 so as to avoid touching the values that correspond to f(0), f(1) and f(2). To conclude, the iterative algorithm is simpler and much more efficient for large values of n.
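As a sanity check, here is a minimal sketch of both versions side by side (the recursive one drops the n < 30 guard for brevity); they should agree on small inputs:

```python
def fibonacci(n):      # recursive approach
    if n == 0:
        return 0
    elif n == 1:
        return 1
    return fibonacci(n - 1) + fibonacci(n - 2)

def fibs(n):           # iterative approach
    seq = [0, 1, 1]
    for f in range(2, n):
        seq.append(seq[-1] + seq[-2])
    return seq[n]

# both produce the same sequence
print([fibonacci(n) for n in range(10)])
print([fibs(n) for n in range(10)])
```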

This post on Stack Overflow dives deeper into the implementation of the iterative algorithm: https://stackoverflow.com/questions/15047116/an-iterative-algorithm-for-fibonacci-numbers