Implementing Univariate Linear Regression in Python

The objective of this post is to explain the steps I took to implementing univariate linear regression in Python. Do note that I’m not using libraries with inbuilt ML models like sklearn and sci-py here.

Here is our gradient descent function, utilising mean squared error as the cost function.

def gradientDescent(theta, alpha, iterations):
    m = ex1data.shape[0]  # finding the number of trial examples
    n = ex1data.shape[1]  # finding the number of features + 1
    for iteration in range(iterations):
        total0 = 0
        total1 = 0
        for row in range(m): # iterating over each training example
            hypothesis = 0
            for val in range(n-1):
                hypothesis += ex1data.values[row][val] * theta[val][0] 
            load = hypothesis - ex1data.values[row][n-1]
            total0 += load*ex1data.values[row][0]
            total1 += load*ex1data.values[row][1]
        temp0 = theta[0][0] - ((alpha*total0)/m)
        temp1 = theta[1][0] - ((alpha*total1)/m)
        theta = [[round(temp0, 4)], [round(temp1, 4)]]
    return theta

We carry out gradient descent 1500 times here, by setting iterations equal to 1500. Our starting values of theta are 0. Now, for those of you who don’t know how gradient descent works, here’s a short explanation that attempts to cover the crux of it. Intuitively, we subtract each of our output values as given by the hypothesis function by the target value we’re trying to predict, then square the difference. The gradient descent update rule subtracts the partial derivative of this (beyond us mortals for now) from the existing values of theta – updating them. This entire process repeats 1500 times until gradient descent converges to an optimal value of theta, or the minimum point of the cost function.

Now, we plot our data to see what it looks like.

m = ex1data.shape[0]
print ('No. of training examples --> {}'.format(m)) # outputting the number of traning examples for the user
eye = []

for i in range(0,m): 
    eye.append(1)  # creating an array of 1s and adding it to X
if len(ex1data.columns) == 2:  # to avoid an error wherein ex1data already has the column of 1s
    ex1data.insert(0, "feature1", eye)
print ('here is theta (initial)')
theta = [[0], [0]]

Now, we firstly add a column vector consisting entirely of 1s, of dimensions m by 1, to our feature matrix. Hence, we have a feature matrix of m by 2, where one column pertains to our variable data and another to a column vector of 1s. We then initialise both values of theta to 0. Our learning rate, alpha, will be set to 0.01 for gradient descent, and we will execute gradient descent 1500 times. After running gradient descent and plotting our predicted values against the actual dataset, this is what we get:

Pretty cool, right?

This entire example was based on solving the Week 1 problem set of Andrew Ng’s machine learning course on through Python. So credits to Stanford University. Stay quarantined, stay safe!

r/dailyprogrammer – Yahtzee Upper Section Scoring

Here’s the source.


The game of Yahtzee is played by rolling five 6-sided dice, and scoring the results in a number of ways. You are given a Yahtzee dice roll, represented as a sorted list of 5 integers, each of which is between 1 and 6 inclusive. Your task is to find the maximum possible score for this roll in the upper section of the Yahtzee score card. Here’s what that means.

For the purpose of this challenge, the upper section of Yahtzee gives you six possible ways to score a roll. 1 times the number of 1’s in the roll, 2 times the number of 2’s, 3 times the number of 3’s, and so on up to 6 times the number of 6’s. For instance, consider the roll [2, 3, 5, 5, 6]. If you scored this as 1’s, the score would be 0, since there are no 1’s in the roll. If you scored it as 2’s, the score would be 2, since there’s one 2 in the roll. Scoring the roll in each of the six ways gives you the six possible scores:

0 2 3 0 10 6

The maximum here is 10 (2×5), so your result should be 10.Examples

yahtzee_upper([2, 3, 5, 5, 6]) => 10
yahtzee_upper([1, 1, 1, 1, 3]) => 4
yahtzee_upper([1, 1, 1, 3, 3]) => 6
yahtzee_upper([1, 2, 3, 4, 5]) => 5
yahtzee_upper([6, 6, 6, 6, 6]) => 30

Opional Bonus

Efficiently handle inputs that are unsorted and much larger, both in the number of dice and in the number of sides per die. (For the purpose of this bonus challenge, you want the maximum value of some number k, times the number of times k appears in the input.)

    [1654, 1654, 50995, 30864, 1654, 50995, 22747, 1654, 1654, 1654, 1654, 1654, 30864, 4868, 1654, 4868, 1654, 30864, 4868, 30864]
) => 123456


Answering this problem is pretty trivial. All you need to do is a create another list that contains all numbers from 1 through the maximum number in the inputted list. Then multiply each number in the new list by the number of time it appears in the old list using the count function native to python. Here’s how it goes:

def yaht(t):
    return max([t.count(j) * j  for j in [i for i in range(max(t)+1)]])
# that's all there is to it!

Locking and Unlocking .pdf files

Here, I’ll be showing you how to set the same password for all .pdf files in a directory simultaneously using python, instead of manually setting passwords for each pdf. This comes in handy for files with over hundreds of .pdf documents. The full code for setting passwords can be found here, and the same for unlocking files is available here.

We will be using the PyPDF2 and os modules. To start off, we’ll get the user to input their password and root directory. Then, we’ll use the os.walk()function to iterate over each file in that directory and its sub-directories.

for (root, dirs, files) in os.walk(path):
	for filename in files:
		if filename.endswith('.pdf'):
		    pdfFile = open(root + '/' + filename, 'rb')
			pdfReader = pypdf2.PdfFileReader(pdfFile)

Above, we store the result of each ‘walk’ as a three-element tuple. We then access each file in the current directory, and use an if statement to check whether or not the file concerned has a .pdf extension. Then, we open the file in a read binary format, and also initialise a pdfFileReader object for the file.

Now, to make an encrypted pdf using PyPDF2, we’ll read one file and copy over its contents into another file, then encrypt the second file and delete the first. Do note that opening the first file in a write binary format won’t delete it, it’ll delete the data inside the file, but not the actual file. To delete files, we can use the os.remove() function.

# inside 'endswith('.pdf')' if

if pdfReader.isEncrypted == False:
    pdfWriter = pypdf2.PdfFileWriter()
    for pageNum in range(pdfReader.numPages):


	if filename.endswith('decrypted.pdf'):
		resultPdf = open(root + '/' + filename[:-13] + 'encrypted.pdf', 'wb')
		resultPdf = open(root + '/' + filename[:-4] + 'encrypted.pdf', 'wb')

	os.remove(root + '/' + filename)

Now, we firstly check whether the file in question is encrypted or not. If not, we initialise a PdfFileWriter object, and copy over the contents of the file to the object. Using the user-inputted password, we encrypt the new file and then save it with a suffix of ‘_encrypted.pdf.’ Then, the os.remove function is used to delete the starting file.

Unlocking files has the same methodology. Using a python program also saves a lot of time when say unlocking hundreds of files. Once again, we use os.walk() to traverse our root directory. Here, we also check for whether the file is encrypted, and proceed only if it is.

if pdfReader.decrypt(password) == 1:
    pdfWriter = pypdf2.PdfFileWriter()
    for pageNum in range(pdfReader.numPages):
	if filename.endswith('encrypted.pdf'):
	    resultPdf = open(root + '/' + filename[:-13] + 'decrypted.pdf', 'wb')
	    resultPdf = open(root + '/' + filename[:-4] + 'decrypted.pdf', 'wb')
	os.remove(root + '/' + filename)
    print('password incorrect for {}'.format(root + '/' + filename))

If the inputted password works for a file, the file is unlocked, then its contents are copied over into a new file with no password. Then, the encrypted file is deleted, and the new file replaces it. If the inputted password is incorrect, the program outputs the name of the file.

Application-wise, these programs are very efficient. For example, if there is a website with hundreds of free resources (many of which are password protected PDFs), you can download all these files and use the program above to unlock all files simultaneously. Using a brute-force attack (coming soon), we can also download hundreds of files and try to hack them using a spin-off of the program above.

Using Python to make Multiplication Tables in Excel

A really easy problem. Here’s the full code.

Now, we need to create an n by n multiplication table in excel using a python program, where n is an arbitrary, positive integer. We will do this using the openpyxl module. Firstly, we’ll initialise our notebook and set our current sheet to Sheet, which is the default active sheet.

wb = openpyxl.Workbook()
sheet = wb['Sheet'] # or
n = int(input('enter n| '))

For the purposes, a ‘bold’ font style object has also been created. I will assign this to the row and column headings later.

for i in range(2, 2+n):
	sheet['A' + str(i)].value = i-1
	sheet['A' + str(i)].font = boldFont
	sheet[get_column_letter(i) + '1'].value = i-1
	sheet[get_column_letter(i) + '1'].font = boldFont

Above, all I’ve done is create row and column headings for all integers upto and including n, with each heading being written in bold characters. Now, time to create the table.

for i in range(2, sheet.max_row + 1):
	for j in range(2, sheet.max_column+1):
		sheet[get_column_letter(j) + str(i)].value = sheet[get_column_letter(j) \ 
		        + '1'].value * sheet['A' + str(i)].value

Making the table is really simple. What I’ve done is iterate over each row heading, and then over each column heading inside the row, then find the product of these two and assign it to the required cell. You could also do this using excel formulas, but I think this solution is reasonably efficient.

Here’s another solution that I found on github.

for rowNum in range(1, number+2):
	for colNum in range(1, number+2):
		if rowNum==1 and colNum==1:
			sheet.cell(row=rowNum, column=colNum).value=''
		elif rowNum==1:
			sheet.cell(row=rowNum, column=colNum).value = colNum-1
			sheet.cell(row=rowNum, column=colNum).font = boldFont
		elif colNum==1:
			sheet.cell(row=rowNum, column=colNum).value=rowNum-1
			sheet.cell(row=rowNum, column=colNum).font = boldFont
			sheet.cell(row=rowNum, column=colNum).value = (rowNum-1)*(colNum-1)

It is shorter, but I don’t think it’s that elegant due to the usage of if/else statements. But that’s just personal bias. Instead of iterating over headings and content cells differently, the author here iterates every single cell (including 0,0) in one for loop.

Here’s a sample multiplication table for n = 18:

Extracting Passwords Using Regular Expressions in Python.

This is my second post on regular expressions. Here, I’m attempting to solve one of the practice tasks given at the end of Chapter 7, Automate The Boring Stuff. Instead of merely identifying whether or not a given string is a strong password, I will write a program that takes a list of strings, and returns a list containing those strings which will make valid ‘strong passwords.’

  • Extract potential strong passwords from text in Clipboard
  • Strong password must have both uppercase and lowercase characters, as well as digits. It must also be at least eight characters long, and must not have any spaces.

So for this piece of code, we will be using ‘lookaheads,’ a new concept for me pertaining to the usage of regexes. Firstly, here is some basic information about ‘lookaheads.’ A ‘lookahead’ is like an if condition. It follows the following format, (note that not many websites refer to it in this way, but I find this intuitive) (?=Regex1)(Regex2). Here, Regex2 is considered if and only if (iff) Regex1 is true. Now, this means that like in an if ... else conditional statement, we can write code that could mean: (?=if digit is present)(read string). Also, do note that Regex1 being true doesn’t affect how Regex2 is read. Another thing we need to do is ensure that the we scan only one password, and not the entire text.

We can do this by either specifying all possible characters except spaces inside square brackets as follows. [a-zA-Z0-9_+$%...]*. Or by simply specifying ‘all non-space characters,’ which is much simpler.[^\s]*. One could also use \w*, but this would not scan characters used commonly in passwords, like %, #, @, ! et cetera.

Then, we need three lookbacks. One to check for a lowercase character. One to check for an uppercase character. One to check for a digit. We also need to use {} brackets to ensure that the password is a least 8 characters long. Here is the code:

strongPassword = re.compile(r'''(
    )''', re.VERBOSE) 

res = []                            
for potentialPassword in strongPassword.findall(text):

The easiest one is the last statement, which says accept all non-space containing strings that follow the lookaheads above and are at least 8 characters long. Each lookahead follows the same format.

(?=.*\d), which is equivalent to (?=.*[0-9]) says match any character that is a digit. Similarly, the other two regex components say match any character that is a lowercase alphabet and uppercase alphabet respectively.

We then just output our results.

You can find the full code here at Github.

Personally, I found this really difficult mainly because I didn’t know how to chain multiple lookaheads together. Simply putting them one after the other without the .* required the characters to be ordered (lowercase first etc.). Nesting them like a bad if..else statement provided similar results. I ultimately found the correct solution here on stackoverflow, through an implementation for jquery, although not without surfing for over 1.5 hours.

Moreover, the implementation in the official regex website was absolutely nightmarish. A long, single line that was overflowing onto the right side of their webpage. Turns out that their implementation doesn’t even work, for reasons I couldn’t comprehend.

Using Reg. Expressions to Extract Useful Information in Python.

Having just completed chapter 7 of Automate the Boring Stuff. I wanted to test my skills. I’m writing down some code that will carry out the following tasks.

  • Use text copied onto the clipboard
  • Go through text and store, or save URLs, email addresses, phone numbers and ZIP codes.
  • Paste the information found into the clipboard, and print on the CLI as well.

The full code is available here on Github

Firstly, to do this, we’ll need two modules. The re module for writing and using regular expressions, as well as the pyperclip module for using our clipboard. Firstly, let’s import these modules. and write down the some rough formats for the stuff we want.

#! python3
import pyperclip, re
# These are our formats.
#   PHONE         | EMAIL           | ZIP      | URL
#   --------------|-----------------|----------|------------
#   area code*    | username        | 6 digits | protocol
#   separator*    | @               |          | server
#   1st 3 digits  | domain          |          | file name
#   separator     | .(com)          |          |
#   last 3 digits |                 |          |
#   extension*    |                 |          |
#   *optional

Now that we have this information, let’s make a regular expression, or regex for short for each category.

For phone numbers, we want three optional items and three necessary items, as shown above. To make items optional, we can group them using brackets and then place a question mark so that the number is read even if the contents of the bracket are not present. Like this: (<regex>)?. The ? operator specifies a group that either occur once or not at all. Now, let’s write our regex for phone numbers. Do note that I will use re.VERBOSE to spread the regex over multiple lines, for ease of readability.

phoneRegex = re.compile(r'''(
    )''', re.VERBOSE)

Let’s go over this line by line. The first line, (\d{3} | \(\d{3}\))?, is for scanning any area/region/country codes. We’re assuming these codes to be 3 digits in length. The | is added to ensure that we can read codes regardless of whether or not they are enclosed within brackets. Some possible extensions are 761 or (342). If there is a code, there will be a separator between it and the rest of the number, which is why the second line is required. We are allow for either a space, or a hyphen or a dot. For example, these numbers (with codes) would be read: 342-345-5454, 342.345-5454, and (342) 345-5454. From then onwards, the regex is pretty simple. The third line just scans three consecutive digits (as is mandatory in phone numbers). The fourth line is just a repetition of the second, as elements within a number are always separated by a hyphen or a space in most cases.

The last line could be a bit tricky. It is meant to include any extensions that the owner of the phone number has. \s* is added as there can an arbitrary amount of space between the number and extension. Then, ext|x\ext., where . is the wildcard character accounts for the way extensions are denoted. Lastly \d{2,5}accounts for extensions that are 2-5 characters long. Here are some sample phone numbers with extensions: 345-6571 x 453, (423) 341-9872 ext 45, and 221-986-1034 ext*86298.

Now, time to write down our regex for email addresses.

emailRegex = re.compile(r'''(
    )''', re.VERBOSE)

[a-zA-Z0-9._%+-]+ represents an expression that occurs at least once (may occur more than once but no 0 times). The content of the square brackets represents our own class of characters. a-z A-Z 0-9 represents any alphanumeric character. Email addresses may also contain periods, underscores, % signs, + signs or hyphens (terribly common), which is why they are included. Then we have the @ sign. After that, we need to identify the domain, which could be made up of any alphanumeric character along with hyphens, underscores and periods. All of these are included inside our square brackets. Lastly, \. represents a period, to start .com or .net. This period is followed by a string of alphabets that is 2 to 4 digits long. Examples – .nl, .in, .com, .gov.

The ZIP code is really easy. We just need six consecutive digits.

zipRegex = re.compile(r'\d{6}')

The URL regex is also easy, we just need to find all ‘words’ startin with http, which covers https in its own, and ending with a .<something>.

urlRegex = re.compile(r'http.*?\.[a-zA-Z]{2,4}')

Using the pyperclip module, we will copy our text from the clipboard. Although this may vary from person to person, I want to include all search results in one list instead of different lists for different types of searches. Here is the code.

text = str(pyperclip.paste())

# <...
#       various Regexes
# ...>

res = []

for number in phoneRegex.findall(text):
    phoneNum = '-'.join([number[1], number[3], number[5]])
    if number[8] != '':
        phoneNum += ' x' + number[8]

for email in emailRegex.findall(text):

for url in urlRegex.findall(text):

for ZIP in zipRegex.findall(text):

The functionality pertaining to searching for and storing a valid phone number is complex as our output changes with whether or not the phone number contains the area code and/or extension. We then copy all information gathered to the Clipboard, and print ‘no information found’ if the list of results is empty.

if len(res) > 0:
    print('Copied to clipboard...')
    print('no information found')

The entire code can be found here. Also, there are some helpful links for understanding Regexes below.

There is an issue with WordPress, links randomly open in new tabs or the same tab, so be careful.

Common Applications of Time Series Data (Python x Pandas)

Firstly, time in Python can be represented as either a string or a timedelta or datetime object. We’ll have more on those two below. Whenever reading data from a .csv file using the pd.read_csv function, time is generally read as a string, which can be converted to a datetime object with a very simply function. But more importantly, it is imperative to understand how time is actually represented.

YYYY-MM-DD HH:mm:ss.msmsms (m is minute, s is seconds, ms is milliseconds).

This notation is both simple and effective when dealing with time. Now, while reading, your code sees this format as a string, so it doesn’t whether this represents time or not. The pd.to_datetime is a vectorized function that converts all values in a Series (for dataframes, a specific column) to datetime objects. Intuitively, the first half is the ‘date’, and the second part the ‘time.’

Perhaps the most important element of dealing with time in python is your purpose. Generally, one would use this data to either calculate the amount of time passed between readings, or to calculate the amount of time between a starting time and current time. Let’s look at each separately.

Between Readings

To start off, I’ll write down an example. Say you have a pH sensor that gives you the following values: 7, 8, 9, 8. Now, your computer also logs the time at which these values are received: 2019-06-19 4:03:12, 2019-06-19 9:06:10, 2019-06-19 23:56:55 and 2019-06-20 1:01:04. Now, we have to find the amount of time passed between readings. Thus, you have a pandas dataframe as follows. I will explain why the last row is as such below.

        pH       time
0 7 2019-06-19 4:03:12
1 8 2019-06-19 9:06:10
2 9 2019-06-19 23:56:55
3 8 2019-06-20 1:01:04
Now, what we will do here is shift each time value down by 1 and store it in a series. This is done using the shift() function from pandas that takes in 1 required argument, n, that determines the number of spaces to shift by and whether to shift from left to right or top to bottom (default is top to bottom).
shifted_time = df.time.shift(1)

creating this series

0    NaN
1    2019-06-19 4:03:12
2    2019-06-19 9:06:15
3    2019-06-19 23:56:55
name: shifted_time

now, creating a last row of NaN and 0 respectively ensures that our last value is taken into consideration. The difference can now be calculated as follows.

diff = [df.datetime[i] - diff[i] for i in diff[]
diff[0] = 0    # as initial value has nothing to be compared against
df['difference'] = diff

creating the following data frame and giving us the amount of time between reading

    pH           time               difference
0 7 2019-06-19 4:03:12 ~
1 8 2019-06-19 9:06:15 0 days 05:03:03
2 9 2019-06-19 23:56:55 0 days 14:50:30
3 8 2019-06-20 1:01:04 0 days 01:05:09
This difference can be converted to seconds by specific functions, such as dt.total_seconds This just does this:

    pH           time               difference
0 7 2019-06-19 4:03:12 ~
1 8 2019-06-19 9:06:15 18183
2 9 2019-06-19 23:56:55 53430
3 8 2019-06-20 1:01:04 3909
difference is in seconds now. note: subtracting two datatime objects automatically creates a timedelta object, no need to do manually for latest version of pandas

Now, these sort of computations have many real world applications. For example, formula 1. If you’ve ever noticed, the person who comes first (Hamilton most of the time) has his time written in absolute terms, i.e, his time is written as it is. This is done by initializing the time column of the first row to 0. Then, from second place onwards, their time is represented as the difference by which they lost to the first person. I suggest you google F1 and have a look at it yourself.

Differences from starting point

This methodology isn’t so prominent to customers, but is extremely useful to data scientists looking to check whether the time data they receive has breaks or not. This is done by taking the minimum time value (normally the first one), and then subtracting it from each value – giving the time passed from the start to that specific point. These ‘differences’ are then plotted on the vertical axis, with the index of the time data in the entire data set taken on the horizontal axis.

Consequently, you get graphs like these that display the continuity of the data you have.

A log scale has been taken on the y-axis to ensure that small difference values are noticeable.

Now, Python, and especially pandas has many functions and methods to deal with time series data. This link would be the best to explore these functions –> One functionality that I absolutely adore is the ability to convert data as per timezone – adding both flexibility and adaptability to your code. This is extremely prominent on the web. For example, if you’ve ever noticed, most websites display the timings for a sport event as per your timezone. This is independent from python and pandas, but has the same functionality. Using your location, the time data employed by the server is converted to your timezone

Iterative and Recursive approaches to implementing the Fibonacci Series (Python)

So I was writing a program to implement the following task. For any value of n greater than or equal to 2, F(n) equals the sum of F(n-1) and F(n-2), where F(x) returns the sum of all elements in a fibonacci series of length x.

For example, F(4) would equal F(3) + F(2). Now, F(3) is 0, 1, 1; which sums to 2; and F(2) is 0,1; which sums to 1. Thus, F(4) = F(3) + F(2) = 2 + 1 = 3. Now, this is the original algorithm I came up with. Now, turns out that this algorithm is perfectly applicable for small values of n, say upto 30. However, the ‘runtime cost’ of this algorithm is exponential, which is why I wanted to try out the iterative approach as well (linear runtime cost).

# fibonacci sequence --> 0 1 1 2 3 5 8 13 21

def fibonacci(n):
	# f(n) = f(n-1) + f(n-2) where n > 1
	a, b = 0, 1
	if n == 0:
		return a
	elif n == 1:
		return b
	elif n > 1 and n < 30: 
		return fibonacci(n-1) + fibonacci(n-2)
		return 'n should be between 2 and 29'

print (fibonacci(6))  # input here

# methodology, every time, f(n) breaks down further and further
# eg. f(4) = f(3) + f(2)
#		   = f(2) + f(1) + f(1)      f(0) not written because it is = 0, no effect on output
#          = f(1) + 2f(1)
#          = 3f(1)

The above approach, or rather the one I originally came up with, has a runtime cost of O(2^n). How?

F(n) => F(n – 1) + F(n – 2) + c.
% c is a constant. for the exercise above, it is = 0.

% we can assume both to be approximately equal F(n – 1) ~= F(n – 2)
F(n) => F(n – 2) + F(n – 2) + c
=> 2F(n – 2) + c (F(n – 2) = F(n – 3) + F(n – 4) + c ~= 2F(n – 4) + c
=> 2( 2F(n – 4) +c) +c
=> 4F(n – 4) + 3c
repeat process
=> 4(2F(n – 6) + c) + 3c
=> 8F(n – 6) + 7c

now, set 2^k = T, where T is the coefficient of F(n-…). T = 8 above *eg.

F(n) ==> 2^k * F(n – 2k) + (2^k – 1)c

Now, we will set n – 2k equal to 0. This means that k should equal n/2. Thus, the equation simplifies as follows and the exponential cost is revealed.

k = n / 2
F(n) ==> 2 ^ (n / 2) * F(0) + 2 ^ (n / 2) * c
This simplifies to 2 ^ (n / 2), as other elements can be approximated to having either no cost, or a constant cost that does not change with n. This means that the cost of the algorithm is O(2 ^ (n / 2)), which shows that it increases exponentially with n.

To avoid an exponentially increasing cost, we can adopt an iterative approach, which avoids stacking function calls on top of each other as in a recursive approach.

def fibs(n):    # iterative approach
    fibs = [0, 1, 1]                                                                                           
    for f in range(2, n):                                                                                      
        fibs.append(fibs[-1] + fibs[-2])                                                                         
    return fibs[n]

Now, the cost of this algorithm is O(n), as it will only be executed n times, or rather n – 2 times to be exact, as the for loop starts iterating from 2, so as to avoid touching the values that correspond to f(0), f(1), and f(2). To conclude, the iterative algorithm is simpler and much more efficient for large values of n. this post on stack overflow dives deeper into the implementation of the iterative algorithm.