Modelling COVID-19 Cases for South Korea, India, and Sweden.

I recently wrote a paper that modelled the spread of COVID-19 in Italy using a logistic fit. Since writing it, I’ve been curious about how such a logistic function would behave now, so I decided to look at how a logistic fit could be applied to India, South Korea, and Sweden. I picked these three countries because they’ve adopted entirely different approaches to tackling the pandemic, and I wanted to see whether the effects of those approaches would be graphically discernible. Here, I modelled the cumulative confirmed cases for each country against the number of days since the beginning of its outbreak. I’ll go through only the results; you can find the full code here.

I’m using covid19api.com to obtain the required data. It provides a good deal of data and has excellent support, so I’d recommend this API for all things COVID-19. After obtaining the data, I cleaned it by filtering out unwanted columns and rows, converting the date string to an integer representing the number of days since the beginning of the outbreak, and splitting the data into a feature matrix and a target vector. I then fitted a logistic function to the data.
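For reference, here’s a minimal sketch of that pipeline; the endpoint path and JSON field names are from memory and the initial guesses are arbitrary, so treat them as placeholders rather than my exact notebook code.

import numpy as np
import pandas as pd
import requests
from scipy.optimize import curve_fit

def logistic(t, L, k, t0):
    # L: final epidemic size, k: growth rate, t0: inflection point (in days)
    return L / (1 + np.exp(-k * (t - t0)))

# assumed endpoint shape: one record per day of cumulative confirmed cases
resp = requests.get("https://api.covid19api.com/total/country/india/status/confirmed")
df = pd.DataFrame(resp.json())

df["Date"] = pd.to_datetime(df["Date"])
df = df[df["Cases"] > 0]                               # start from the beginning of the outbreak
df["Day"] = (df["Date"] - df["Date"].min()).dt.days    # days since the first case

X = df["Day"].to_numpy()     # feature matrix (here just one column)
y = df["Cases"].to_numpy()   # target vector: cumulative confirmed cases

# rough initial guesses help curve_fit converge
popt, _ = curve_fit(logistic, X, y, p0=[y.max(), 0.1, X.mean()], maxfev=10000)
print("fitted (L, k, t0):", popt)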

South Korea

Here are the results, graphically, for South Korea.

It’s evident that the logistic model can’t correctly explain the epidemic in South Korea, and that’s largely because of the South Korean government’s handling of its localised cluster. South Korea rolled out tests and implemented stringent social distancing policies not just more effectively, but more quickly, than almost any other country, and the graph reflects this. Real-world cases overtake the logistic fit, but then fall back down. The effect of South Korea’s strict social distancing and mass-scale testing is visible at around the 50-day mark: the curve flattens out to some extent, but then resumes an upward slope.

India

India demonstrates a nearly perfect logistic fit: the logistic model explains the real-world data with what appears to be near-perfect accuracy. I believe this is because of some inherent conditions. The first is population density; it’s hard not to come into contact with someone else while sick in India. When modelling a disease’s spread, you use an R0 value, the average number of people an infected person goes on to infect. In other countries, lower population density may limit how many people an infected person actually reaches. In India, however, any individual who is sick and not quarantined will typically come into contact with more people than the R0 number assumes.

Sweden

A logistic model also seems to describe the situation in Sweden pretty well; the fit tracks the real-world data closely. This is somewhat of a surprise, as Sweden implemented a vastly different policy regarding COVID-19. Instead of carrying out mass-scale testing or enforcing a lockdown, Sweden opted for ‘herd immunity’, which is immunity to a disease that arises when a large proportion of the population has recovered from it. On a social level, this didn’t exactly go well for Sweden.

I think it’s quite interesting to look at how accurate logistic models are in countries with different approaches. South Korea succeeded in pushing down the R0 in its cluster through testing and quarantine. India, on the other hand, has so far failed to implement any policy that successfully reduces the R0 value. Sweden is actually trying to increase its R0 so as to develop herd immunity. One inherent assumption in any logistic model, and in most epidemiological models, is that the R0 value remains largely constant. This assumption holds reasonably well for India and Sweden, but breaks down when it comes to South Korea.

Implementing Hierarchical Clustering (Python)

Clustering is an important unsupervised learning technique. It aims to split data into a number of clusters. For instance, if you input data on shoppers at a local grocery market, clustering could output age-based clusters of, say, under 12 years, 12-18, 18-60, and over 60. Another intuitive example is banking: clustering financial data for a large group of individuals could output income-based clusters, say three pertaining to the lower middle, upper middle, and upper classes.

The most basic and intuitive method of clustering is K-means, which identifies K clusters. It randomly initialises K centroids, assigns each point to its nearest centroid, recomputes each centroid as the average of its assigned points, and repeats until the K centroids settle down. Hierarchical clustering, however, is based on a very different principle.

There are two methods of hierarchical clustering: agglomerative and divisive. Agglomerative clustering starts by putting each data point into a cluster of its own; if you input 6000 points, you start out with 6000 clusters. It then merges the closest points (and then the closest clusters) and repeats this process until only one giant cluster encompassing the entire dataset is left. Divisive clustering does the exact opposite: it starts with one giant cluster and splits clusters repeatedly until each point is its own cluster. The results are plotted on a special plot called a dendrogram, and the longest vertical line segment on the dendrogram indicates the optimum number of clusters for analysis. This will be much easier to understand when I show a dendrogram below.

Here is a link to the dataset I’ve used. You can access the full code here. This tutorial on Analytics Vidhya was also immensely helpful to me in understanding how hierarchical clustering works.

Note the libraries I’ve imported. The dataset is fairly straightforward: you have an ID for each customer, and financial data corresponding to that ID. You’ll notice that the standard deviation, or range, of each feature is quite different. Where balance frequency tends to stay close to 1 for each ID, account balance varies wildly. This can cause issues during clustering, so I’ve scaled my data so that every feature sits on a comparable scale.
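A minimal sketch of that preprocessing step, assuming the usual Kaggle credit card CSV (the file and column names here may differ slightly from the actual dataset):

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("CC GENERAL.csv")       # assumed filename
df = df.drop(columns=["CUST_ID"])        # the customer ID carries no clustering signal
df = df.fillna(df.median())              # fill the handful of missing values

# scale every feature to zero mean and unit variance so that large-range
# columns (like account balance) don't dominate the distance calculations
scaled = StandardScaler().fit_transform(df)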

Here is the dendrogram for the data. The y-axis represents the ‘closeness’ of each individual data point or cluster; you’d obviously expect it to be largest when there’s only one cluster, so that’s no surprise. Now, looking at this graph, we must select the number of clusters for our model. A general rule of thumb is to take the number of clusters corresponding to the longest vertical line visible here, since the length of a vertical line represents how far apart the clusters it joins are. Hence, you want a small number of clusters (not strictly necessary, but optimal in this application), but you also want your clusters to be spaced far apart, so that they clearly represent different groups of people (in this context).
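The dendrogram itself can be produced with SciPy; here’s a sketch that continues from the scaled array above:

import matplotlib.pyplot as plt
from scipy.cluster import hierarchy

# Ward linkage merges, at each step, the pair of clusters that least increases total variance
linkage_matrix = hierarchy.linkage(scaled, method="ward")

plt.figure(figsize=(12, 6))
hierarchy.dendrogram(linkage_matrix)
plt.ylabel("merge distance")
plt.show()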

I’m taking 3 clusters, which corresponds to cutting the dendrogram at a vertical-axis value of about 23. Three clusters also makes intuitive sense to me, since any customer can broadly be classified as lower, middle, or upper class. Of course, there are subdivisions inside these three broad categories too, and you might argue that the lower class wouldn’t even be represented here, so we can say that these three clusters correspond to the lower middle, upper middle, and upper classes.

Here is a diagrammatic representation of what I’ve chosen.

After building the model, all that’s left is visualising the results. There are more than two features, so I’m arbitrarily selecting two, plotting all points using those features as my axes, and giving each point a color corresponding to its cluster.
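Here’s a sketch of the model-building and plotting step, using scikit-learn’s AgglomerativeClustering with the 3 clusters chosen above; the two plotted feature columns are arbitrary:

import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(scaled)

# pick two arbitrary (scaled) feature columns for the axes
plt.scatter(scaled[:, 0], scaled[:, 1], c=labels, cmap="viridis", s=10)
plt.xlabel("feature 1 (scaled)")
plt.ylabel("feature 2 (scaled)")
plt.show()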

You’ll notice that there is a lot of data here, but also a clear pattern. Points belonging to the purple cluster visibly tend towards the upper left corner. Similarly, points in the teal cluster tend to the bottom left corner, and points in the yellow cluster tend to the bottom right corner.

Implementing PCA and UMAP in Python

You can find the full code for PCA here, and the full code for UMAP here.

Dimensionality reduction is an important part of constructing machine learning models. It is essentially the process of combining multiple features into a smaller number of features: features that contribute more to the target value get a greater representation in the final combined features than features that contribute less. For instance, if you have 8 features, the first 6 of which have a summed contribution of around 95% and the last 2 a contribution of only about 5%, then those 6 features will dominate the final combined features. The most significant advantage is lower memory usage and hence faster modelling and processing. Other advantages include simplicity and easier visualisation; for instance, you can easily plot the contribution of two combined features to the target, compared to plotting, say, 20 original features. Another significant benefit is that features with little contribution, which would otherwise add useless ‘weight’ to the model, are removed early on.

The two methods of dimensionality reduction I will be using are PCA and UMAP. I won’t be going into how they work internally, as I’ve given a short overview of their purpose above. Instead, I’ll go through the code I implemented for each and visualise the results. For this exercise, I’m using the WHO Life Expectancy dataset that can be found on Kaggle, as it’s very small and easy to work with. My target variable will be life expectancy, and my features will be aspects like adult mortality, schooling, GDP, etc. I selected these features from the dataset more or less at random.

Here is a list of the modules we will be using. train_test_split will help us break our data into a training set and a testing set (roughly a 7:3 ratio). While this isn’t significant right now, it aids in the detection of underfitting and overfitting: underfitting shows up as poor performance on both the training set and the testing set, whereas overfitting shows up as very good performance on the training set but poor performance on the testing set. StandardScaler is used to normalise the features; feature normalisation rescales each feature so that its range, or standard deviation in layman’s terms, is comparable across features. Lastly, we’ve imported both PCA and UMAP, which we’ll be using.

Here we just load our dataset, extract the features that will be used (see the column names in the dataframe), and rename them for the sake of simplicity. As you can see, some of the original names contain stray spaces and not all use underscores, so I decided on one uniform way of writing each feature. To extract a feature matrix and a target vector, drop the life_expectancy column from the dataframe and convert the result into a NumPy array, then convert the life_expectancy column into a separate NumPy array. I won’t show the code for splitting and normalising, because it’s pretty standard.
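Roughly, that whole step looks like this sketch; the raw column names are my best guess at the Kaggle CSV headers, so adjust them as needed:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("Life Expectancy Data.csv")
df = df.rename(columns={"Life expectancy ": "life_expectancy",
                        "Adult Mortality": "adult_mortality",
                        "Schooling": "schooling",
                        "GDP": "gdp"})
df = df[["life_expectancy", "adult_mortality", "schooling", "gdp"]].dropna()

X = df.drop(columns="life_expectancy").to_numpy()   # feature matrix
y = df["life_expectancy"].to_numpy()                # target vector

# roughly 7:3 split, then scale using training statistics only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)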

Implementing PCA itself is very simple, as shown above. You’ll notice that I’ve specified n_components to be equal to 2; the number of combined features you want at the end is something you set yourself. It does matter here: without specifying n_components, scikit-learn’s PCA keeps every component by default, so setting it to 2 is what gives us a two-dimensional representation to plot. After that, I’ve fitted the training data to PCA.
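The PCA step itself is only a few lines; this mirrors what’s described above, give or take variable names:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)                 # ask for two combined features
pca_result = pca.fit_transform(X_train)
print(pca.explained_variance_ratio_)      # how much variance each component captures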

Here’s a bit of data treatment before I finally plot the results. I’ve basically converted PCA’s output, which was a numpy array, to a pandas dataframe, and then added life_expectancy as a column because that will be used for the color-bar you will see below.

Here is the code for my plot, and here is the plot:
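In rough form, the plotting code looks like this (variable names follow the earlier sketches):

import pandas as pd
import matplotlib.pyplot as plt

plot_df = pd.DataFrame(pca_result, columns=["component_1", "component_2"])
plot_df["life_expectancy"] = y_train       # drives the color bar

plt.scatter(plot_df["component_1"], plot_df["component_2"],
            c=plot_df["life_expectancy"], cmap="viridis", s=15)
plt.colorbar(label="life expectancy")
plt.xlabel("component 1")
plt.ylabel("component 2")
plt.show()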

You can see the value of each of the two components for every sample, whose target value, or life expectancy, is represented by the color of the marker. While I don’t see any patterns straight away (specific colors being clustered somewhere, etc.), the one thing that does stand out is how heavily the green dots (life expectancy around 70) are clustered towards the bottom left. Other colors appear there as well, but there don’t seem to be many green dots anywhere else.

The code for UMAP is exactly the same, except with UMAP as our reducer instead of PCA. Here’s the plot.
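For completeness, here’s what the swap looks like, assuming the umap-learn package:

import umap   # from the umap-learn package

reducer = umap.UMAP(n_components=2, random_state=42)
umap_result = reducer.fit_transform(X_train)
# the plotting code is identical to the PCA version, just swap in umap_result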

You can see straight away that the results of UMAP are quite different. Once again, there are no noticeable patterns in terms of specific colors being clustered in specific locations, but the overall structure is quite different from that of PCA; each color is distributed throughout.

There’s no way to say which method is better without modelling your target variable with respect to both sets of components and comparing accuracy on the testing set. This post just aims to illustrate how both of them work without going into specific details.

Plotting Shapefile Data Using Geopandas, Bokeh and Streamlit in Python.

I was recently introduced to geospatial data in Python. It’s stored in .shp (shapefile) files, in the same way other data might be stored in, say, .csv files. Each record in a .shp file corresponds to either a polygon, a line, or a point. A polygon represents a shape, so in the context of maps and geospatial data a polygon could act as a country, or perhaps an ocean, and so on. Lines are used to represent boundaries, roads, railway lines, etc., and points represent cities, landmarks, features of interest, and the like. This was all new to me, so I found it fascinating to see how any sort of map can be broken down into polygons, lines, and points and then played around with using code. I was also introduced to Streamlit, which provides an alternative to Jupyter notebooks with more interaction and, in my opinion, better visual appeal. One distinct advantage Jupyter Notebook has is compartmentalisation, and how good code and markdown look next to each other, whereas Streamlit seems more visually appealing overall. One big advantage Streamlit has, though, is that it’s operated from the command line, making it much more efficient for someone like me who’s far more comfortable typing out commands and running stuff than dragging a pointer around to click on things.

I used the geopandas library for dealing with shapefile data. It’s incredibly efficient and essentially extends native pandas commands to shapefile data, making everything much easier to work with. One thing I didn’t like was that Streamlit doesn’t have built-in functionality to view GeoDataFrames, so I have to output geospatial data using the st.write() method, and that just results in some ugly, green-colored output, very much unlike the clean, tabular output you get when using st.write() to display ordinary dataframes. It’s also a bit surprising that st.dataframe() doesn’t extend to a GeoDataFrame, but eh, it works for now.

Bokeh is new to me, so I decided to start out by making plots using the built-in geopandas and matplotlib functionality rather than moving straight to Bokeh. Hence, in this post, I’ll go through how I made and annotated some maps using geopandas, then extended that to Bokeh to make my workflow more efficient. A huge advantage Bokeh brings to the table is that plots can be stored, so there’s no going back to earlier cells to find a plot. Just output it to an HTML file, write some code to save the contents of that file, and you’re good to go.

The very first thing I did was create a very basic plot showing Germany filled with a blue color. Here’s a small piece of code that accomplishes that.

Plotting Germany
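A rough equivalent, using the low-resolution world dataset bundled with geopandas, looks like this:

import geopandas as gpd
import matplotlib.pyplot as plt
import streamlit as st

world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
germany = world[world["name"] == "Germany"]

fig, ax = plt.subplots()
germany.plot(ax=ax, color="blue")
ax.axis("off")          # hide the axes for a cleaner map
st.pyplot(fig)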

This simply takes the geodata for the entire world, in low resolution. I then select the data specific to Germany, plot it using geopandas, turn the plot axes off just to make it look better visually, and output the plot to my streamlit app (do I call it that?). Here’s the result.

As you can see, it’s a very simple plot showing the nation state of Germany. Now, I thought I’d extend this, make it look better, and annotate it with the names of some cities and the capital, Berlin. The very first thing to do here is to get some data for each city, specifically their latitudes and longitudes. You could import that via a shapefile, which would have each location as a point; instead, I manually inputted 6 cities and their coordinates from an internet source into a dataframe, and then used that to annotate my figure, which I’ll be talking about now.
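Sketched out, that annotation code looks something like this; the six cities and their coordinates are approximate values I’ve filled in for illustration:

import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import streamlit as st

# approximate latitude/longitude values, for illustration only
cities = pd.DataFrame({
    "Name": ["Berlin", "Hamburg", "Munich", "Cologne", "Frankfurt", "Stuttgart"],
    "Latitude": [52.52, 53.55, 48.14, 50.94, 50.11, 48.78],
    "Longitude": [13.40, 9.99, 11.58, 6.96, 8.68, 9.18],
})
city_gdf = gpd.GeoDataFrame(
    cities, geometry=gpd.points_from_xy(cities.Longitude, cities.Latitude)
)

world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
germany = world[world["name"] == "Germany"]

fig, ax = plt.subplots()
germany.plot(ax=ax, color="white", edgecolor="black")
# larger marker for the capital, smaller ones for the rest
sizes = [60 if name == "Berlin" else 20 for name in cities.Name]
city_gdf.plot(ax=ax, color="red", markersize=sizes)
for x, y, name in zip(cities.Longitude, cities.Latitude, cities.Name):
    ax.annotate(name, xy=(x, y), xytext=(3, 3), textcoords="offset points")
ax.axis("off")
st.pyplot(fig)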

The dataframe at the top contains my city position data. Down below, I’m creating another GeoDataFrame that holds each city’s data as a point object. You’ll notice that I’ve used the points_from_xy() method while creating the city dataframe. points_from_xy() wraps around the Point() constructor; you can view it as equivalent to [Point(x, y) for x, y in zip(df.Longitude, df.Latitude)]. It works really well and removes the need for an explicit for loop, making my code much cleaner. I’ve then plotted the same map as above, except with a white fill and a black outline (better looking imo). After that, I’ve gone over each point using a for loop and added a label, which is the city name (stored as just “Name”). I’ve also increased the size of the marker for Berlin, given that it’s the capital. The last step is just adding a red marker to indicate each city’s position. Note that st.pyplot() is a streamlit method that outputs any figures we might have. Here is the output of the code above.

I think this looks much better.

Now, I decided to plot a map showing all the main railway lines in India on top of a blank map of India, and output this to a new .html page as a bokeh plot.
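In rough form it looks like this; the railway shapefile name is a placeholder, and the exact plot_bokeh keyword arguments may differ slightly from my actual code:

import geopandas as gpd
import pandas_bokeh

pandas_bokeh.output_file("india_railways.html")

world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
india = world[world["name"] == "India"]

railways = gpd.read_file("india_railways.shp")   # placeholder filename; all line geometries

# draw the blank outline of India first, then overlay the railway lines
fig = india.plot_bokeh(color="white", show_figure=False)
railways.plot_bokeh(figure=fig, color="blue", show_figure=True)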

As you can see, the code for this is very simple. I’ve first plotted the map of India using the get_path functionality in geopandas. Then, for the sake of visibility, the axis lines have been turned off. Then, I’ve read the railway shapefile, which consists entirely of line objects, and plotted it using the pandas_bokeh library, outputting to an HTML file. Here’s the result.

I find working with geospatial data to be terribly useful, and very exciting. I think that’s partly because I love data science, playing around with data and modelling it, and working with geospatial data opens up an entire realm of possibilities. I’d describe the experience of making my first shapefile plot as something akin to working for the first time with time series data: much like time, it adds another dimension to what one can do. In the coming weeks, I’ll be blogging more often, hopefully once every week, on geospatial data specifically. The contents of this post are not even an introduction to what can be accomplished with geospatial data. My next point of exploration is going to be how to depict elevations, and how to use color density to indicate population density, average cost of living, GDP, etc. One immediate application that comes to mind is a map that uses color to reflect how much COVID-19 has impacted each country, based not just on cumulative confirmed cases but also on factors like healthcare expenditure, economic downturn, and unemployment. I think it’ll be interesting.

You can find the entire code here.

Solving the UEFA Champions League problem on Code Chef with OOP

So I decided to try out CodeChef, and chose to solve the UCL problem listed on the home page of the ‘practice’ section. The problem statement is detailed and easy to understand; you can view it here. It basically involves reading in the match results for a league in the following format:

<home team> <home team goals> vs. <away team goals> <away team>

for instance, fcbarca 5 vs. 1 realmadrid

Twelve match results are read in per league, and results for T leagues are processed in total. The output for each league should be the names of the winning team and the runner-up. For instance, the output for the following league results,

manutd 8 vs. 2 arsenal
lyon 1 vs. 2 manutd
fcbarca 0 vs. 0 lyon
fcbarca 5 vs. 1 arsenal
manutd 3 vs. 1 fcbarca
arsenal 6 vs. 0 lyon
arsenal 0 vs. 0 manutd
manutd 4 vs. 2 lyon
arsenal 2 vs. 2 fcbarca
lyon 0 vs. 3 fcbarca
lyon 1 vs. 0 arsenal
fcbarca 0 vs. 1 manutd

would be

manutd fcbarca

Here are the basic rules we observe for football tournaments: each victory gives you 3 points, each loss gives you 0 points, and each draw gives you 1 point. If two teams have the same point tally, the team with the higher goal difference is placed first. For simplicity’s sake, we’re assuming here that no two teams with the same point total will have the same goal difference.

I defined two classes, League and Team, right at the start. The only attribute an instance of League has is table, a dictionary whose keys are strings holding team names and whose values are Team objects containing all the necessary information. I also put five methods inside the League class, which I’ll explain in just a bit.

League Class

The does_team_exist() method just returns True or False depending on whether the inputted string is a key in the table dictionary. In my main code, if does_team_exist() returns False, the add_team() method is called and the self.table attribute is updated with a new Team object; if it returns True, the program just continues. The match-result methods are straightforward: for a decisive result, the win() method is called on the Team object for the winning team and the lose() method on the Team object for the losing team, and you’ll notice that the goal difference is passed in as well. For a drawn match, the same draw() method is called on both Team objects, and no goal difference is passed in because it equals 0 for both teams.

return_top_two() is our most important method. It creates a list of all the values in the table dictionary, then uses a lambda function to sort on the basis of two keys: points and goal difference. The first element of the tuple inside the lambda is the number of points associated with that Team object, and the second is the team’s goal difference. After sorting all the Team objects in ascending order, I return the team names of the last and second-to-last objects.
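Here’s a condensed reconstruction of the League class; the method bodies are my sketch of the behaviour described above, and match_drawn() is simply the name I’ve given the draw handler:

class League:
    def __init__(self):
        self.table = {}                  # team name -> Team object

    def does_team_exist(self, name):
        return name in self.table

    def add_team(self, name):
        self.table[name] = Team(name)

    def match_won(self, winner, loser, gd):
        # decisive result: winner takes 3 points, loser none; gd updates both
        self.table[winner].win(gd)
        self.table[loser].lose(gd)

    def match_drawn(self, team1, team2):
        # a draw gives each team 1 point and leaves goal difference unchanged
        self.table[team1].draw()
        self.table[team2].draw()

    def return_top_two(self):
        teams = list(self.table.values())
        teams.sort(key=lambda t: (t.get_points(), t.get_gd()))
        return teams[-1].get_name(), teams[-2].get_name()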

Now, I’ll show you the Team Class.

Team Class

This is much simpler. When initialising, the only argument is the team name, while points and goal difference are initialised to 0. The win() method increases points by 3 and goal difference by the gd argument, the draw() method only increments points by 1, and the lose() method decrements goal difference by gd. These are followed by three getter methods, two of which are used in the lambda function shown above, and one which is used to produce the desired output for our program.
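And a matching sketch of the Team class:

class Team:
    def __init__(self, name):
        self.name = name
        self.points = 0
        self.gd = 0          # goal difference

    def win(self, gd):
        self.points += 3
        self.gd += gd

    def lose(self, gd):
        self.gd -= gd

    def draw(self):
        self.points += 1

    def get_points(self):
        return self.points

    def get_gd(self):
        return self.gd

    def get_name(self):
        return self.name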

Now, the only part left is our main class or working code. Here it is.

Code

We take in T as our first input, as defined by the problem statement, then iterate over each league. Before processing the results for a league, I initialise league to an instance of the League class. I then iterate over each of the 12 fixtures in that league. Splitting the input and extracting certain indices (see code) gives us each team’s name and the number of goals it scored. You can see that I have two “if not” checks that initialise a new Team object inside the table attribute of the League class if the team in question isn’t there already. Then, I compare home-team goals and away-team goals and call the appropriate method. The output is the value returned by return_top_two().
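Reconstructed from the description above, the driver code looks roughly like this:

T = int(input())
for _ in range(T):
    league = League()
    for _ in range(12):
        tokens = input().split()
        # format: <home team> <home goals> vs. <away goals> <away team>
        home, home_goals = tokens[0], int(tokens[1])
        away_goals, away = int(tokens[3]), tokens[4]

        if not league.does_team_exist(home):
            league.add_team(home)
        if not league.does_team_exist(away):
            league.add_team(away)

        if home_goals > away_goals:
            league.match_won(home, away, home_goals - away_goals)
        elif away_goals > home_goals:
            league.match_won(away, home, away_goals - home_goals)
        else:
            league.match_drawn(home, away)

    winner, runner_up = league.return_top_two()
    print(winner, runner_up)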

You can check out the full code here.

Credits to CodeChef for the problem statement.

UI to mail NASA APOD image to inputted email [#2 NASA APOD Series]

I’ll be going through how to use Flask to create a two-page website with a simple form on the first page that takes in your email; clicking the submit button sends the latest NASA APOD image to the inputted email. For this, I’m using the script written and explained in the last post in this series, which can be found here. So in this post, I’ll go through the Flask templates I created and my form functionality, which is pretty simple.

The first step is to create a .html page and a main.py file for our website. Here’s the code for both.

<body> tag of .html file

So the image above shows the code I’ve written for the body tag, which is all that’s worth looking at. The form object returned by main.py is used just to render the email field and a submit button, so the functionality is very simple and I’m not going to go into much detail for this part. Clicking the submit button first checks whether or not the email field has a valid input. If it doesn’t, the field is cleared and given a red outline. If it does, an email containing the latest NASA APOD image is sent to the inputted address, and the user is redirected to another webpage.

This is the code in my main.py file. I’ve initialised my Flask app in the first line and given it a secret key (needed for forms). The send_email() function contains the code explained in the previous post. It’s called if and only if the submit button is clicked on the home page and a valid email address has been filled in. If those conditions are satisfied, send_email() is called with the inputted email as its only argument, and the code then redirects the user to the bye.html page, which is just a small HTML file displaying the word “bye.” I did that to check whether or not the send_email() function had actually been called inside the home route.
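In rough outline, the main.py logic looks like this; it’s a sketch that assumes Flask-WTF is handling the form, with names following the description above:

from flask import Flask, render_template, redirect, url_for

app = Flask(__name__)
app.config["SECRET_KEY"] = "replace-me"    # required by Flask-WTF forms

@app.route("/", methods=["GET", "POST"])
def home():
    form = emailButton()                   # the form class shown in the next snippet
    if form.validate_on_submit():          # only runs on a valid POST
        send_email(form.email.data)        # script from the previous post in this series
        return redirect(url_for("bye"))
    return render_template("home.html", form=form)

@app.route("/bye")
def bye():
    return render_template("bye.html")

if __name__ == "__main__":
    app.run(debug=True)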

My functionality for setting up the email form is very simple; I use the wtforms module. I’ve created an emailButton class wherein email is set to a StringField object and submit is set to a SubmitField object. The library provides validator methods: DataRequired() ensures that some data is inputted into the field, and Email() ensures that the given input is a properly formatted email address.
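The form class itself can be a minimal sketch like this, assuming wtforms is used via Flask-WTF as in the route above:

from flask_wtf import FlaskForm
from wtforms import StringField, SubmitField
from wtforms.validators import DataRequired, Email

class emailButton(FlaskForm):
    # Email() needs the email_validator package to be installed
    email = StringField("Email", validators=[DataRequired(), Email()])
    submit = SubmitField("Submit")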

This is what our webpage displays. After entering a valid email and clicking submit, here is what I’ve received in my inbox.

Next time I’ll have a more styled layout, and I’ll go through the countdown functionality that will keep sending emails to the inputted address until the number of inputted days reaches zero. The project should ideally be complete once that’s done.

You can check out the full code here.

Using Python to email NASA APOD API data to a custom recipient [#1 NASA APOD Series]

I’m doing a mini-project as a learning experience. The first step involved two separate pieces of functionality that I’ve now combined and will explain here. First, I had to write a few lines of code to actually call the API and save the image it returns to the current directory, which was really easy. I then had to learn how to use smtplib to send emails in Python. I used this article on Real Python to do so, and this post on Stack Overflow was also helpful in showing me how to attach image files to an email. You can find the entire code here, on my GitHub. I’ve been having some issues with WordPress plugins, so I’ve just attached screenshots of my code for this post.

Right, so the first thing we’ll be looking at is accessing the NASA APOD API and storing the latest NASA Astronomy Picture of the Day image in the current directory.

I first import the necessary modules. After that, we access the API with our key; the URL follows the official documentation for fetching the latest image. The API returns a JSON object with keys pointing to a description of the image, an HD URL, and a standard-quality URL. I’ve used the urllib module to fetch the image and store it in the current working directory (whichever directory we’re in, or have changed into, using os). Now comes the hard part.
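For reference, the fetch-and-save step described above looks roughly like this; the API key handling and the output filename are placeholders, not my exact code:

import os
import requests
import urllib.request

API_KEY = os.environ.get("NASA_API_KEY", "DEMO_KEY")   # placeholder key handling
url = "https://api.nasa.gov/planetary/apod?api_key=" + API_KEY

data = requests.get(url).json()    # keys include 'title', 'explanation', 'url' and 'hdurl'
img_path = "apod.png"
urllib.request.urlretrieve(data["url"], img_path)      # save the image to the current directory
print(data["title"])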

You can take a look at all the modules I’ve imported in the GitHub link I’ve posted above.

Above, we set our smtp_server and initialise our port. I’ve used 587 because I’m using the TLS (STARTTLS) method; I think you’ll need 465 if you use the SSL method instead. I’ll be using mehulpy@gmail.com to send this email, a throwaway account I created with all security features turned off, and I’m mailing it to mehuljangir42@gmail.com, which is one of the emails I use. We then input the password for the sending account. Our message is set to an instance of the MIMEMultipart object with “alternative” as the input parameter, and we key in some values, like the sender address, the recipient address, and the subject of our email. I’ve then specified HTML and plaintext versions of the message body.

I’ve attached both the HTML and plaintext messages to our email. The procedure for attaching a .png file as an email attachment is very simple: using Python’s built-in open() function, we read the image in binary mode, then create an instance of the MIMEImage object, passing in the img_data we just read along with the image’s filename. os.path.basename returns the last component of a path, so passing it the image path gives us just the filename to use for the attachment.

An SSL context is created after that; check out this Stack Overflow post to learn more about what an SSL context is. We then move into a try/except block, so that we can catch any errors that come our way. A server connection is established, and server.ehlo() is basically the equivalent of waving hi to your friend across the room. We then start TLS, which is a method of encryption for safe and secure communication across the internet. Finally, as a method of the server instance, I send the email to the recipient. The finally block quits the server regardless of whether or not the email was sent and whether or not an error was thrown.
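Putting the emailing pieces together, here’s a condensed sketch of the whole flow; the variable names and message text are assumptions rather than the exact screenshot contents:

import os
import smtplib
import ssl
from email.mime.image import MIMEImage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

smtp_server, port = "smtp.gmail.com", 587     # 587 for STARTTLS
sender, recipient = "mehulpy@gmail.com", "mehuljangir42@gmail.com"
password = input("password for the sender account: ")

message = MIMEMultipart("alternative")
message["From"], message["To"], message["Subject"] = sender, recipient, "NASA APOD"
message.attach(MIMEText("Here is today's Astronomy Picture of the Day.", "plain"))

with open(img_path, "rb") as f:               # img_path comes from the earlier snippet
    img_data = f.read()
message.attach(MIMEImage(img_data, name=os.path.basename(img_path)))

context = ssl.create_default_context()
server = None
try:
    server = smtplib.SMTP(smtp_server, port)
    server.ehlo()
    server.starttls(context=context)
    server.ehlo()
    server.login(sender, password)
    server.sendmail(sender, recipient, message.as_string())
except Exception as e:
    print(e)
finally:
    if server is not None:
        server.quit()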

Next, I’ll create a webpage where you can input your email and a Python script will send the current NASA APOD image to it. So stay hooked for #2!

Also, FYI, here’s what the email looks like:

Implementing Univariate Linear Regression in Python

The objective of this post is to explain the steps I took to implement univariate linear regression in Python. Do note that I’m not using libraries with built-in ML models, like scikit-learn or SciPy, here.

Here is our gradient descent function, utilising mean squared error as the cost function.

def gradientDescent(theta, alpha, iterations):
    
    m = ex1data.shape[0]  # finding the number of training examples
    n = ex1data.shape[1]  # finding the number of features + 1
    
    for iteration in range(iterations):
        
        total0 = 0
        total1 = 0
        
        for row in range(m): # iterating over each training example
        
            hypothesis = 0
            
            for val in range(n-1):
                hypothesis += ex1data.values[row][val] * theta[val][0] 
                
            load = hypothesis - ex1data.values[row][n-1]
            total0 += load*ex1data.values[row][0]
            total1 += load*ex1data.values[row][1]
            
        temp0 = theta[0][0] - ((alpha*total0)/m)
        temp1 = theta[1][0] - ((alpha*total1)/m)
        theta = [[round(temp0, 4)], [round(temp1, 4)]]
        
    return theta

We carry out gradient descent 1500 times here, by setting iterations equal to 1500, and our starting values of theta are 0. For those of you who don’t know how gradient descent works, here’s a short explanation that attempts to cover the crux of it. Intuitively, for each training example we subtract the target value we’re trying to predict from the output of the hypothesis function, and the cost is built from the squares of these differences. The gradient descent update rule subtracts the partial derivative of this cost (beyond us mortals for now), scaled by the learning rate, from the existing values of theta, updating them. This entire process repeats 1500 times, until gradient descent converges towards the optimal values of theta, i.e. the minimum point of the cost function.

Now, we set up our data and initial parameters before running gradient descent.

m = ex1data.shape[0]
print ('No. of training examples --> {}'.format(m)) # outputting the number of training examples for the user
eye = []

for i in range(0,m): 
    eye.append(1)  # creating an array of 1s and adding it to X
    
if len(ex1data.columns) == 2:  # to avoid an error wherein ex1data already has the column of 1s
    ex1data.insert(0, "feature1", eye)
    
print ('here is theta (initial)')
theta = [[0], [0]]
matrix_print(theta)

Now, we first add a column vector consisting entirely of 1s, of dimensions m by 1, to our feature matrix. Hence, we have a feature matrix of m by 2, where one column holds our variable data and the other is the column of 1s. We then initialise both values of theta to 0. Our learning rate, alpha, will be set to 0.01, and we will execute gradient descent 1500 times. After running gradient descent and plotting our predicted values against the actual dataset, this is what we get:

Pretty cool, right?
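For reference, the call to gradientDescent and the plotting step can be sketched like this, reusing the ex1data dataframe and the theta initialised above:

import matplotlib.pyplot as plt

theta = gradientDescent(theta, 0.01, 1500)

x = ex1data.values[:, 1]                        # the original feature column
y = ex1data.values[:, 2]                        # the target column
predictions = theta[0][0] + theta[1][0] * x     # hypothesis: theta0 * 1 + theta1 * x

plt.scatter(x, y, marker="x", label="training data")
plt.plot(x, predictions, color="red", label="linear fit")
plt.xlabel("feature")
plt.ylabel("target")
plt.legend()
plt.show()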

This entire example was based on solving the Week 1 problem set of Andrew Ng’s machine learning course on coursera.org through Python. So credits to Stanford University. Stay quarantined, stay safe!

r/dailyprogrammer – Yahtzee Upper Section Scoring

Here’s the source.

Description

The game of Yahtzee is played by rolling five 6-sided dice, and scoring the results in a number of ways. You are given a Yahtzee dice roll, represented as a sorted list of 5 integers, each of which is between 1 and 6 inclusive. Your task is to find the maximum possible score for this roll in the upper section of the Yahtzee score card. Here’s what that means.

For the purpose of this challenge, the upper section of Yahtzee gives you six possible ways to score a roll. 1 times the number of 1’s in the roll, 2 times the number of 2’s, 3 times the number of 3’s, and so on up to 6 times the number of 6’s. For instance, consider the roll [2, 3, 5, 5, 6]. If you scored this as 1’s, the score would be 0, since there are no 1’s in the roll. If you scored it as 2’s, the score would be 2, since there’s one 2 in the roll. Scoring the roll in each of the six ways gives you the six possible scores:

0 2 3 0 10 6

The maximum here is 10 (2×5), so your result should be 10.

Examples

yahtzee_upper([2, 3, 5, 5, 6]) => 10
yahtzee_upper([1, 1, 1, 1, 3]) => 4
yahtzee_upper([1, 1, 1, 3, 3]) => 6
yahtzee_upper([1, 2, 3, 4, 5]) => 5
yahtzee_upper([6, 6, 6, 6, 6]) => 30

Optional Bonus

Efficiently handle inputs that are unsorted and much larger, both in the number of dice and in the number of sides per die. (For the purpose of this bonus challenge, you want the maximum value of some number k, times the number of times k appears in the input.)

yahtzee_upper(
    [1654, 1654, 50995, 30864, 1654, 50995, 22747, 1654, 1654, 1654, 1654, 1654, 30864, 4868, 1654, 4868, 1654, 30864, 4868, 30864]
) => 123456

Solution

Answering this problem is pretty trivial. All you need to do is consider every candidate value up to the maximum number in the inputted list, multiply each candidate by the number of times it appears in the list using Python’s native count function, and take the maximum. Here’s how it goes:

def yaht(t):
    # for each candidate value j, the score is j multiplied by how many times j appears in the roll
    return max(t.count(j) * j for j in range(max(t) + 1))

# that's all there is to it!
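For the bonus input sizes, calling count() for every candidate value gets slow, since the whole list is rescanned once per candidate. A Counter-based variant (a small optimisation on my part, not part of the original solution) handles large, unsorted inputs comfortably:

from collections import Counter

def yaht_bonus(t):
    # tally each distinct value once, then score = value * occurrences
    return max(value * count for value, count in Counter(t).items())

print(yaht_bonus([1654, 1654, 50995, 30864, 1654, 50995, 22747, 1654, 1654, 1654,
                  1654, 1654, 30864, 4868, 1654, 4868, 1654, 30864, 4868, 30864]))   # 123456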

Locking and Unlocking .pdf files

Here, I’ll be showing you how to set the same password for all the .pdf files in a directory simultaneously using Python, instead of manually setting a password for each PDF. This comes in handy for directories with hundreds of .pdf documents. The full code for setting passwords can be found here, and the code for unlocking files is available here.

We will be using the PyPDF2 and os modules. To start off, we’ll get the user to input their password and root directory. Then, we’ll use the os.walk() function to iterate over each file in that directory and its sub-directories.

for (root, dirs, files) in os.walk(path):
    for filename in files:
        if filename.endswith('.pdf'):
            pdfFile = open(root + '/' + filename, 'rb')
            pdfReader = pypdf2.PdfFileReader(pdfFile)

Above, we store the result of each ‘walk’ as a three-element tuple. We then access each file in the current directory and use an if statement to check whether or not the file concerned has a .pdf extension. If it does, we open the file in read-binary mode and initialise a PdfFileReader object for it.

Now, to make an encrypted PDF using PyPDF2, we’ll read one file and copy its contents into another file, then encrypt the second file and delete the first. Do note that opening the first file in write-binary mode won’t delete it; it’ll wipe the data inside the file, but not the actual file. To delete files, we use the os.remove() function.

# inside the endswith('.pdf') if block

if pdfReader.isEncrypted == False:
    pdfWriter = pypdf2.PdfFileWriter()
    for pageNum in range(pdfReader.numPages):
        pdfWriter.addPage(pdfReader.getPage(pageNum))

    pdfWriter.encrypt(password)

    if filename.endswith('decrypted.pdf'):
        resultPdf = open(root + '/' + filename[:-13] + 'encrypted.pdf', 'wb')
    else:
        resultPdf = open(root + '/' + filename[:-4] + 'encrypted.pdf', 'wb')

    pdfWriter.write(resultPdf)
    os.remove(root + '/' + filename)
    resultPdf.close()
    pdfFile.close()

Now, we first check whether the file in question is already encrypted. If it isn’t, we initialise a PdfFileWriter object and copy the contents of the file over to it. Using the user-inputted password, we encrypt the new file and save it with an ‘encrypted.pdf’ suffix. Then, the os.remove() function is used to delete the original file.

Unlocking files follows the same methodology, and using a Python program saves a lot of time when, say, unlocking hundreds of files. Once again, we use os.walk() to traverse the root directory. Here, we also check whether the file is encrypted, and proceed only if it is.

if pdfReader.decrypt(password) == 1:
    pdfWriter = pypdf2.PdfFileWriter()
    for pageNum in range(pdfReader.numPages):
        pdfWriter.addPage(pdfReader.getPage(pageNum))

    if filename.endswith('encrypted.pdf'):
        resultPdf = open(root + '/' + filename[:-13] + 'decrypted.pdf', 'wb')
    else:
        resultPdf = open(root + '/' + filename[:-4] + 'decrypted.pdf', 'wb')

    pdfWriter.write(resultPdf)
    os.remove(root + '/' + filename)
    resultPdf.close()
    pdfFile.close()
else:
    print('password incorrect for {}'.format(root + '/' + filename))

If the inputted password works for a file, the file is unlocked and its contents are copied over into a new file with no password. Then the encrypted file is deleted, and the new file replaces it. If the inputted password is incorrect, the program outputs the name of the file.

Application-wise, these programs save a lot of manual effort. For example, if there is a website with hundreds of free resources, many of which are password-protected PDFs, you can download all of them and use the program above to unlock every file in one go. With a brute-force attack (coming soon), we could also download hundreds of files and try to crack them using a spin-off of the program above.