Personality Blender.

I remember sitting in the new building’s relatively large auditorium on my first day of freshman year four years ago. Starry-eyed and intoxicated with the transition from middle school (years of fun and frolic) to the four years of high school that were apparently meant to be the bridge to the rest of your life, I was both scared and excited. I remember being the only student who opted for 11 subjects in grade 9, which invited the nickname “Mr Genius.” More than that, I remember having no clue where I would end up at the end of high school. I didn’t even know about the very existence of any university other than Stanford and Harvard.

I’ve spent the 2,102,400 minutes between that first day and the present working not for some extrinsic goal, but for myself. I’ve never seen the college admissions mania as something that requires you to mould yourself to wherever you’re applying, or to the general applicant pool, or to try and resemble that one Russian kid who interned at Google, won the International Math Olympiad, and got his start-up acquired by a Fortune 500 company. On the contrary, rather than fitting myself to a college, I’ve always wanted to fit a college to me, and that’s instilled in me a variety of passions, hobbies, and goals for the future.

Looking back, there’s so much I’ve done and so much that has happened that I would’ve never imagined. It seems only yesterday that I received my first full chemistry grade, played football with my friends, fell just short of my personal goal for my IGCSEs, and took my first step in a whole new world. During this insanely long time period, I never once thought about looking back and seeing how far I’ve come. I can’t imagine going back and telling 9th grade Mehul that he would be here right now.

No matter which college I go to and where I’m accepted, I’m satisfied with the knowledge that I didn’t do what I had to or what I was instructed to; I did what I wanted to, and that should be enough to set me apart from anyone else out there. I’ve taken away something memorable from each experience, from each day, be it an academic trait I previously undervalued or a character flaw I worked to correct.

Over the course of the next few months, I’ll be applying to some of the best and definitely the most enriching universities in the world. It excites me and it scares me. I’m afraid of the countless possibilities the future holds, of that lingering doubt that I might not get into this place. At the same time, I’m proud of what I’ve done and excited to continue doing that. Over the past few years I’ve gone from having no clue what to do, to developing visible interests in engineering, computer science, and astrophysics.

While computer science and engineering are domains I feel naturally attracted to, the former being a field wherein I’ve interned, hacked, and programmed, nothing evokes as much awe as astrophysics. This leaves me with a huge decision to make before I apply: what do I want to do at college? Do I want to write code and study algorithms? Do I want to make robots and machines? Do I want to point my eyes into the deepest depths of the cosmos and maybe try exploring it? The answer, irrefutably and undeniably, is all of them. I’ve never been a person who does just one thing. It’s why I took 11 subjects in grade 9: I can’t stand not doing enough. Even right now, despite being told about the intensive, demanding, 24×7 nature of the college applications process, I want to continue exploring and learning in other domains.

I feel like the world has made it necessary for you to be one person. You’re either an engineer or a computer scientist; there’s no ‘both’ here. There’s no way to do things simultaneously. This leaves me having to choose between the things that I feel are me. It’s like being asked to throw away one interest, neglect another, and choose one as the thing that will define my life.

This is all the more important because both cricket and writing were once prominent items on this list of ‘who I am.’ Ask ninth-grade me what I wanted to become, and he would probably say “either an astrophysicist, or a player in the Indian cricket team, or an author, or a programmer.” While my list of items that define me may change, the mere fact that I have a list cannot. This certainly makes me question myself: should I let this unidirectional-ness change me? Or should I try and change it on my own personal level?

When talking about the college applications process, everyone says that it is inherently transformative: that while working on your applications and taking a deeper look at yourself than any university will when putting you up against 50k other applicants, you learn who you are and what you want to do. I find this notion quite absurd; I don’t see any point in applying somewhere if you don’t even know who you are, per se. I know I want to go into computer science, mechanical engineering, and astrophysics, all three of them collectively. There may be no option to do so, but that’s what I want, and that’s what defines me.

This all connects back to me always doing what I’ve wanted to. You may look at my resume and break it down into a person with a passion for STEM or a guy with a flair for writing, but all of it comes together: my passion for programming, my ardor for engineering, my curiosity about the universe, and my love for writing combine to make what is, irrefutably, a personality blend that is me.

Summary of Reading – August ’20

Astronomy (Andrew Fraknoi et al.) – A bit too heavy at times, and far too much extra material that is exciting but definitely can’t be covered in a day or two. It has some really nice visuals and challenging quantitative problems; I’d definitely recommend it as a free astronomy textbook available on the internet.

The Amazing Adventures of Kavalier and Clay – I took an extraordinarily long time to finish this book, and it is definitely one of the longest books I’ve ever read. Sammy’s arc, from realising his own nature and accepting it secretly to coming to terms with the fact that everyone knew about it and living with no secrets, was totally brilliant. Both Sammy and Joe started out with clear hurdles and goals: Sammy with his sexuality and Joe with his family. While I did find the book quite stretched at times, the sheer amount of characterisation and growth that went into everything was astounding. One thing that stood out was how Rosa waited twelve-plus years for Joe to come back, which ordinarily isn’t something someone would’ve done. The committee on the investigation of comic books was portrayed as quite antagonistic, and it was; to me, that stood as symbolic of how systems larger than any individual tend to ignore individual motivations and escape the truth when an alternate explanation is equally plausible.

Taking a look at UV-flash accompanied Type Ia Supernovae

This post is a report/summary/discussion of a research paper from Northwestern University, and of a ScienceDaily article covering the paper in question.

Spectacular ultraviolet flash may finally explain how white dwarfs explode

Date: July 23rd 2020

Source: Northwestern University

White dwarfs are stellar remnants that have burnt through the hydrogen they once used as fuel. With no fusion left in the core to produce outward pressure, the inward pull of gravity is instead balanced by electron degeneracy pressure. A type Ia supernova occurs when a white dwarf in a binary star system exceeds the Chandrasekhar limit, either by accreting mass from or merging with its companion star. Ultraviolet radiation is produced by very hot surfaces in space, such as the surfaces of blue supergiants.

The research article details an astronomical event wherein a type Ia supernova explosion, a relatively common phenomenon, is accompanied by a UV flash, an incredibly rare one. This is only the second observed occurrence of such an event, making it innately important and intriguing. The supernova was first observed in December 2019 using the Zwicky Transient Facility in California. The event was dubbed SN 2019yvq, and it occurred in a nearby galaxy that lies 140 million light years from Earth in the Draco constellation.

The given figure is a light curve for the observed type Ia supernova. A light curve plots the brightness of an object against time, showing how its brightness/magnitude varies; the same technique is used in other domains and applications, such as transit photometry. This light curve plots the absolute magnitude of SN 2019yvq, the event this paper looks at. The light curve of a supernova usually reaches a peak quickly, then gradually “cools off” with falling intensity and brightness, as demonstrated above.
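As a rough illustration of what a light curve encodes (entirely synthetic numbers, not the actual SN 2019yvq data), one can generate a fast-rise, slow-decline flux profile and convert it to the magnitude scale:

```python
import numpy as np

# Synthetic supernova-like light curve: fast rise, slow exponential decline.
# The timescales below are made up for illustration only.
t = np.linspace(0, 100, 500)                   # days since explosion
flux = (1 - np.exp(-t / 5)) * np.exp(-t / 40)  # arbitrary flux units

# Magnitudes are a log scale where brighter means more negative.
mag = -2.5 * np.log10(flux + 1e-12)

peak_day = t[np.argmax(flux)]
print(round(float(peak_day), 1))  # the peak comes early, around day 11
```

Plotting `mag` against `t` (with the y-axis inverted, as is conventional for magnitudes) reproduces the characteristic quick-peak, slow-decline shape described above.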

What set this particular type Ia supernova apart was the aforementioned UV flash, which matters because it indicates very hot material in the white dwarf, presumably heated by the explosion itself. The study offers four potential explanations for the event: a consumed companion star that grew so large it exploded, producing a UV flash; radioactive core material reacting with the outer layer and making it extremely hot; an outer layer of helium igniting the carbon core; and two white dwarfs colliding and exploding. Understanding how type Ia supernovae work is key to our understanding of planetary formation, as they produce iron, the most abundant element in the cores of planets like Earth. More importantly, type Ia supernovae can be used as ‘standard candles’ to measure extremely large cosmic distances: they explode with nearly the same intrinsic brightness, so how bright the explosion appears from Earth, which falls off with the square of the distance, tells us how far away it is. Further dividing type Ia supernovae on the basis of UV flashes would make these cosmic yardsticks more accurate. Determining distances more accurately feeds into bigger challenges: modelling the universe’s expansion, understanding what dark energy is, and pinning down how much of the universe’s ‘stuff’ it accounts for.

Classifying type Ia supernovae further would lead to improved cosmic distance measurements, enabling insights into the universe’s expansion and dark energy. Furthermore, the very act of attempting to classify type Ia supernovae with UV flashes could lead to the discovery of an entirely new astronomical event, which could catalyse a sub-domain of astronomical study. Type Ia supernovae add a reliable and robust method of distance measurement to a gallery of techniques: radio astronomy, stellar parallax, Cepheid variables, the Tully-Fisher relation, etc. What they add to this ‘gallery’ is the ability to measure incredibly large cosmic distances accurately, owing to how bright their explosions are in absolute terms.

Another pressing question is whether type Ia supernovae with UV flashes constitute a threat to us if, say, one occurs in our galactic neighbourhood. After observing a greater number of such events, a ‘safe distance’ could be determined, inside which such a supernova would have a sterilizing impact on life on Earth. According to the paper, the flash is of magnitude 19 as observed from Earth. At a distance 10^6 times smaller, 140 light years, it would be 10^12 times brighter, or magnitude -11.
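This back-of-the-envelope scaling can be checked directly with the inverse-square law and the definition of the magnitude scale (Δm = −2.5·log₁₀ of the flux ratio):

```python
import math

m_observed = 19        # apparent magnitude of the flash at 140 million ly (from the paper)
distance_ratio = 1e6   # moving from 140 million ly down to 140 ly

# Inverse-square law: flux scales as 1/d^2, so 1e6x closer -> 1e12x brighter.
flux_ratio = distance_ratio ** 2

# Magnitude difference for a given flux ratio (brighter = more negative).
delta_m = -2.5 * math.log10(flux_ratio)  # -30 magnitudes

m_nearby = m_observed + delta_m
print(m_nearby)  # -11.0
```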

To conclude, discovering why these UV flashes occur and what they mean for cosmic distance measurement using type Ia supernovae can provide inroads into other pressing problems, like dark matter and dark energy.

Summary of Reading – July ’20

Speaker for the Dead (Orson Scott Card) – Very beautiful and elegant. The slow pace and depth of understanding here, along with the level of detail and complexity, play well against the pace, seriousness, and action of the first book in the series. It’s great to see Card take a respite from pure action and thrill and focus on world-building, which sets up the sequel to this book pretty well. It’s amazing how similar the situation set up by the ending is to the plot of the first book: an invitation for xenocide.

I really like how the story progresses and how Card transforms the apparent murders of Pipo and Libo into something done for a very honourable and specific purpose. What seemed like a monstrous act was instead an event of transformation, of sending your best to the afterlife (which is very real for the piggies). It also exposed the inevitable gap of communication that would exist between any two species: Card made the human interpretation of Pipo’s and Libo’s deaths look horrific, and then made that interpretation itself look deeply biased. Personally, I thought Pipo was killed for stumbling onto one of the piggies’ closely guarded secrets, one they would kill to protect, and that the same fate befell Libo independently. It wasn’t so; they both were killed for refusing to kill and send their brothers to their third lives (also note how important ‘third’ is now compared to what ‘third’ meant at the start of the book).

Ender’s level of control and influence is also brilliant. How he uses his skills to unravel the situation almost perfectly and guide everyone towards a rebellion is amazing. It’s beautiful character development: over the course of two books, Ender goes from being a compassionate killer to a wonderful father and a man who has fully redeemed himself by bringing the species he destroyed back to life and nurturing a new species to protect it from the same fate the buggers met.

It’s also quite interesting to read about the ecology on the planet. How every living thing shares a plant-animal life cycle. Raises a lot of questions about our basal assumptions for any form of alien life. 

Xenocide (Orson Scott Card) – One of the best books I’ve read so far, and that’s saying a lot. While the scientific reasoning behind many things may not be sound (anything at all regarding philotes, for instance), the book more than makes up for it through world-building, action, and plot. It brings so many concepts and worldviews and beliefs together that it is just exhilarating. The concept of genetic enslavement was particularly interesting to read about. The people of Path were genetically enhanced to be more intelligent than any other human being, but had a specific gene engineered into them that led them to believe in the power of the gods, and that the gods spoke to them to keep them on their path, making them slaves to Congress. I like how the planet itself is named Path, and how the tampered gene is an attempt to make everyone fall onto the designated path, but as the story ends, the planet goes on to forge its own path. We also see genetic enslavement in the form of the descolada. I find the entire concept dictatorial and bizarre, and also worrisome, as it isn’t that hard for something like that to be done in real life.

In terms of storytelling, Peter’s return might just breathe the intense action and immersion of the first book back into the series. It was definitely a huge surprise. However, the brilliance of Peter’s return is equally matched by Novinha’s stupidity. I mean, what was Novinha even doing? Why did she become a nun? While the dependence of every single character on mysterious religious power is troubling, as any advanced future society should’ve freed itself of the shackles of religious belief by then, what’s even more disturbing is the big question: what happened to Jane? I take it she will be weakened, but what happens? The ending felt like a small calm before the storm, and I personally believe that this is great, as the next book heralds the arrival of the fleet, Peter’s bid to destroy Congress, and the end of Ender’s journey.

We are the Nerds (Christine Lagorio-Chafkin) – Not as good a read as I expected, primarily because I use reddit a lot and I expected an account of its history to focus more on the product than on the people running it or their social lives. Before reading this book, I was expecting an account of reddit’s journey as a product, not as a business. Furthermore, the book goes off on various tangents, everything from Ohanian’s relationship with Serena Williams to Aaron Swartz and net neutrality. While I feel this may have been necessary to attract a larger audience (Aaron Swartz attracts tech-savvy people and Williams is just famous), it isn’t related to reddit whatsoever. However, there were some aspects of this book I found fascinating. One was Huffman’s return, which sounded quite poetic; I was surprised by how he wasn’t able to fit in and needed a counsellor/therapist. I also found the entire “spezgiving” saga quite childish, but it really brought out how big businessmen and businesswomen are the same people as us, prone to the same mistakes. It was also interesting to read about how people are targeted online and the internet’s potential to be downright toxic (I’ve encountered a small portion of this toxicity myself), but also how that’s countered by the internet’s ability to be wholesome, helpful, and amazing. The best example of this in the book was perhaps Barack Obama’s reddit AMA, or the “Mr Splashy Pants” saga. While there were some likeable facets to the book, it doesn’t really do justice to its title.

Modelling COVID-19 Cases for South Korea, India, and Sweden.

I recently wrote a paper that modelled the spread of COVID-19 in Italy using a logistic fit. I wrote it a while back, and I was curious about how such a logistic function would behave now, so I decided to look at how a logistic fit could be applied to India, South Korea, and Sweden. I’ve taken these three countries because they’ve adopted entirely different approaches to tackling the pandemic, and I was curious whether the effects of those approaches would be graphically discernible. Here, I modelled each country’s cumulative confirmed cases against the number of days since the beginning of the outbreak. I’ll go through only the results; you can find the full code here.

I’m using an API for obtaining the required data. It gives you a good deal of data and has excellent support, so I’d recommend it for all things COVID-19. After obtaining the data, I treated it by filtering out undesired columns/rows, converting the date string to an integer representing the number of days since the beginning of the outbreak, and breaking the data into a feature matrix and a target vector. I’ve used a logistic function for fitting the data.
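The fitting step can be sketched like this; the snippet below uses scipy’s `curve_fit` on synthetic case counts standing in for the API response, so every number here is a placeholder:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, L, k, t0):
    """Logistic curve: L = final case count, k = growth rate, t0 = inflection day."""
    return L / (1 + np.exp(-k * (t - t0)))

# Synthetic 'cumulative cases' data standing in for the real download.
days = np.arange(120)
rng = np.random.default_rng(0)
cases = logistic(days, 50_000, 0.1, 60) + rng.normal(0, 200, days.size)

# Rough initial guesses (p0) help curve_fit converge.
(L_fit, k_fit, t0_fit), _ = curve_fit(
    logistic, days, cases, p0=[cases.max(), 0.05, days.mean()]
)
print(round(L_fit))  # recovers a plateau near 50000
```

The same call, pointed at a real cumulative-cases column, produces the fits plotted below.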

South Korea

Here are the results, graphically, for South Korea.

It’s evident that the logistic model can’t correctly explain the epidemic in South Korea, but that’s because of the South Korean government’s effective handling of their localised cluster. South Korea rolled out tests and implemented strict social distancing policies not just more effectively, but more quickly, than almost any other country, and the graph reflects this. Real-world cases overtake the logistic fit, but then fall back down. South Korea’s strict social distancing and mass-scale testing can be seen taking effect at around the 50-day mark: the curve flattens out, to some extent, but then resumes an upwards slope.


India demonstrates a nearly perfect logistic fit; the logistic model matches real-world data with what appears to be near-perfect accuracy. I believe this is down to some inherent conditions. First, population density: it’s hard not to come into contact with someone else while sick in India. When modelling a disease’s spread, you use an R0 value, the average number of people each infected person goes on to infect. In other countries, lower population density may keep actual contact rates below that number; in India, an infected individual who is not quarantined will typically come into contact with at least as many people as the R0 number assumes.


A logistic model also seems to describe the situation in Sweden pretty well; the fit largely matches real-world data. This is somewhat of a surprise, as Sweden implemented a vastly different policy regarding COVID-19. Instead of carrying out mass-scale testing or enforcing a lockdown, Sweden opted for ‘herd immunity’: immunity to a disease as a consequence of a large proportion of the population having recovered from it. On a social level, this didn’t exactly go well for Sweden.

I think it’s quite interesting to look at how accurate logistic models are in states with different approaches. South Korea succeeded in pushing down the R0 in their cluster through testing and quarantine. India, on the other hand, has so far failed to implement any policy that successfully reduces the R0 value. Sweden is actually trying to increase its R0 so as to develop herd immunity. One inherent assumption in any logistic model, and most epidemiological models, is that the R0 value remains largely constant. That assumption holds for India and Sweden, but breaks down for South Korea.

Summary of Reading (June ’20)

Rafa (Rafael Nadal) – I’ve never really been that engrossed with tennis, so reading this book was an insight into a world I’ve never been a part of, but it was fun nonetheless. Major parts of the book seemed devoted to more tennis-tuned readers, given that there’s a lot on game specifics and a lot of tennis terminology that went flying over my head. I found it really insightful to read about the impact his social and mental stability had on his physical stability and his game, especially the sections on the importance of family to Nadal and how his circle of stability helps him. It was also inspiring to read about having a routine so concrete that you’re at the court at 5 am no matter what and no matter how much sleep you got. He talks a lot about maintaining a very focused mental state during the game, filtering out everything but the game itself, and I relate to that not just because I’ve played cricket in the past, but because that state of mind where nothing else can impact you and all your mind is bent on a particular task is something I try to attain, and do attain, while programming or reading. The sections on humility are also wonderful, and mildly humorous at times, especially the parts where his uncle and coach, Toni, makes him perform small, seemingly irrelevant gestures like not walking in the middle of the group and not violating dress codes. I also enjoyed reading about Mallorca and the societal setup there; the amount of peace Nadal gets is incredibly valuable given that not many athletes can afford it in the modern world, and seeing him explain how important a factor that is, is wholly justified. This book deviated from my usual sci-fi adventures, and I enjoyed it, though I found the long stretches describing games a bit too much.

The Three Body Problem (Cixin Liu) – An incredible book that brings together two vastly different story-lines in a manner I am yet to see elsewhere. The book starts by throwing you straight into the action, wherein the antagonist’s father dies, forming the motivation for an act we see quite a while later. Everything that occurs at the start: the countdown, weird results from experiments, the universe flashing, seems to be pure sci-fi that the author either fails to or chooses not to explain. The beauty of this book lies in turning that around into something rational and believable. Even more so, it’s amazing to see the Three Body video game turn into something that exists in the real world. Setting aside the question of how life even came to be on such a world, it’s fascinating to see alternate theories play out as to how the environment functions, and to watch hundreds of years of scientific progress occur in the span of a few pages. The different scenarios that play out: tri-solar syzygies, triple sun days, chaotic eras and so on, make this book incredibly fascinating. Add to that the depth to which all the characters have been created: we have Wang, who is oblivious to the grand scheme of things but a good man at his core; then Da Shi, who is oblivious to the big picture and not a man of science, but the guy who always solves the problem through his core set of principles; and lastly Ye, an almost psychopathic personality with a hatred for humankind so deep that she has effectively brought an end to it. Learning about the three-body problem itself was an incredible experience for me. I also loved the author’s note about how most ideas that take off fall back to the ground because the gravity of reality is too strong.

Delta-V (Daniel Suarez) – Very new, and very relevant. While some things in the book were downright outlandish, how Suarez portrayed everything and how real he made the dangers of space travel seem made this book amazing to read. The candidate selection process had a lot of time and space devoted to it, and it paid off as one of the best pre-climax sections I’ve personally read. In fact, I’d say some of the candidate selection section was better than the rest of the book and the climax; it was really good at hooking me. Some things that stood out during the selection process: the high-CO2 puzzle-solving event, the psychological test, and most importantly, how bonds were being formed. When the actual crew went up to hotel LEO to pursue other projects, I thought the book was dying down, and I personally vouched for the chosen crew dying in their first attempt and the actual crew substituting for them, but what Suarez did was much better, although admittedly a bit rushed. I also couldn’t quite comprehend why a spaceship with 14 billion dollars invested in it had so many software errors that living aboard became unbearable and potentially fatal for the crew. I mean, come on, if you have 14 billion dollars, surely you can hire a programming team that doesn’t mess the job up. The entire concept of mining was made so much better by the sample schematics Suarez attached towards the end; they really paid off. The arrival of the Argo was also surprising, and Joyce’s downfall was kind of expected but tragic nonetheless. The worst part of the book, in my opinion, was how the new investors treated the astronauts; that seemed like pure fiction, unlike the rest of the book, which was genuinely believable and inspiring. This book also hammered into me the sort of challenges future astronauts can face, and the sacrifices they might have to make. It really drove the point home: space is hard.

Ender’s Game (Orson Scott Card) – Pretty good read. What really held me back was the sheer outlandishness of an eleven-year-old boy leading an entire army and representing Earth’s military while not even realizing he was doing it. I know it’s purely fictional, and the story is quite impressive, but the entire notion of such a young person doing all that is beyond me. Another thing I did not understand was why Mazer Rackham didn’t lead the armies. He told Ender that he wouldn’t be alive by the date of the future battle, when instead that same battle was Ender’s training. The simple explanation is that Ender is better than Mazer. That said, no matter how good Ender is, why would the authorities choose an 11-year-old boy with an incredible skill set over an experienced veteran and celebrated hero? That doesn’t make any sense to me. I love the elegance in how Ender is compassionate at heart and Peter is more hurtful, but how events reverse roles, making Ender seem hurtful and Peter seem compassionate. Throughout the book, Ender’s desire to not be like Peter holds him back and dominates him, whereas Peter’s desire to be more compassionate drives him to greater heights. It’s a stunning depiction of how events in the real world can pan out, and how roles can be reversed even when characters are not. In very cliché fashion, I’m going to say that the character I bonded with most was the main character, Ender, but only because of his tendency to look at things the way they were and not the way they were conventionally taken to be. The best example of this is how each team thought of the battleroom as horizontally aligned, but Ender was the only one who viewed it as a place where you’re going down towards your enemy, not straight at them, and that changed a lot. His tendency to tackle the rules is also brought out by the match against two teams, where he sends a man through the gate rather than eliminating both teams.
The ending is a bit heartbreaking, especially when Ender realizes that he didn’t just obliterate the opposite side which had its own culture and legacy, but also sacrificed soldiers on his side without knowing so, and seeing how the guilt of this plays out is fascinating.

Recursion (Blake Crouch) – A book that takes science and throws it out the front door. The premise that time has no linearity, that the past, the present, and the future exist as one, just makes no sense. The simplest way to gauge time is to drink tea: your cup cools down gradually, and that’s the progression of time. However, time isn’t a thing in this book. Instead, it’s a virtual construct made by our brains to make everything simpler. And guess what this means: you can travel back and forth, not in time, but in your memories. Under normal assumptions this would be time travel, but since time apparently doesn’t exist in this book, or is as traversable as a physical dimension, it’s just memory travel. The part that struck me the most was how Helena and Barry were not able to figure out how to nullify everything after the original timeline in what was about 198 years, when Slade did it on his own in a single timeline. Both of them thought about it so much that Barry, a police detective in the first timeline, became an astrophysicist/quantum physicist in the second and was trying to calculate the Schwarzschild radius of a memory (what?). Yet neither of them thought about returning to the previous timeline by activating a dead memory. It’s only logical that if you can travel through your memories, the way to nullify your existing timeline is to go back and cut out the event that birthed it. Instead, Barry and Helena tried to find a way to remove dead memories themselves rather than the events that created them; had they pursued the latter for 198 years, they would’ve succeeded. I also don’t get why people need to be killed to be sent back into their own timeline, or why the U-shaped building just appeared and caused mass FMS rather than being there all along, with people getting FMS when the cut-off date finally came.

Currently Reading

The Amazing Adventures of Kavalier and Clay (Michael Chabon)

The Brain (David Eagleman)

Implementing Hierarchical Clustering (Python)

Clustering is an important unsupervised learning technique. It aims to split data into distinct clusters. For instance, clustering data about shoppers in the local grocery market could output age-based clusters of, say, under 12, 12–18, 18–60, and over 60. Another intuitive example is banking: clustering financial data for a large group of individuals could output income-based clusters, say three pertaining to the lower middle, upper middle, and upper classes.

The most basic and intuitive method of clustering is K-means, which identifies K clusters. It randomly initialises K centroids, assigns each point to its nearest centroid, recomputes each centroid as the average of its assigned points, and repeats until the K centroids stop moving. Hierarchical clustering, however, is based on a much different principle.

There are two methods of hierarchical clustering: agglomerative and divisive. Agglomerative puts each data point into a cluster of its own. Hence, if you input 6000 points, you start out with 6000 clusters. It then merges the closest points (and then the closest clusters) and repeats this process until only one giant cluster encompassing the entire dataset is left. Divisive does the exact opposite. It starts with one giant cluster, and splits clusters repeatedly until each point is its own cluster. The results are plotted on a special plot, called a dendrogram. The longest vertical line segment on the dendrogram indicates the optimum number of clusters for analysis. This will be much easier to understand when I show a dendrogram below.
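To make the agglomerative idea concrete, here’s a minimal sketch using SciPy. The points are made up, and the variable names are my own, not from the post’s code:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# five toy 2-D points: two tight pairs plus one outlier
points = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.9], [9.0, 0.5]])

# 'ward' merges the pair of clusters that least increases total variance
Z = linkage(points, method="ward")

# Each row of Z records one merge: [cluster_i, cluster_j, distance, size].
# n points get merged n - 1 times before one giant cluster remains.
print(Z.shape)  # (4, 4)

# no_plot=True returns the dendrogram's plot coordinates without drawing it
tree = dendrogram(Z, no_plot=True)
```

The merge distances in `Z` are exactly what the dendrogram’s vertical lines will show.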

Here is a link to the dataset I’ve used. You can access the full code here. This tutorial on analyticsvidhya was also immensely helpful to me when understanding how hierarchical clustering works.

Note down the libraries I’ve imported. The dataset is fairly straightforward. You have an ID for each customer, and financial data corresponding to that ID. You’ll notice that the standard deviation or range for features is quite different. Where balance frequency tends to stay close to 1 for each ID, account balance is wildly different. This can cause issues during clustering, so that’s why I’ve scaled my data so that each feature is similar to each other feature in relative terms.
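The scaling step looks roughly like this (the numbers below are illustrative stand-ins for the balance and balance-frequency columns):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# two columns with wildly different ranges, like balance vs. balance frequency
X = np.array([
    [2500.0, 0.98],
    [  40.0, 1.00],
    [9800.0, 0.91],
    [ 300.0, 0.87],
])

# after scaling, every column has mean 0 and unit variance, so no single
# feature dominates the distance computations the clustering relies on
scaled = StandardScaler().fit_transform(X)
```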

Here is the dendrogram for the data. The y-axis represents the ‘closeness’ of each individual data-point/cluster. You’d obviously expect y to be maxed out when there’s only 1 cluster, so that’s no surprise. Now, looking at this graph, we must select the number of clusters for our model. A general rule of thumb here is to take the number of clusters pertaining to the longest vertical line visible here. The height of each vertical line represents how far apart the two clusters it joins are. Hence, you want a small number of clusters (not necessary, but in this application, optimal), but also want your clusters to be spaced far apart, so that they clearly represent different groups of people (in this context).

I’m taking 3 clusters, which corresponds to a vertical axis value of 23. 3 clusters also intuitively makes sense to me as any customer can broadly be classified into lower, middle, and upper class. Of course, there are subdivisions inside these 3 broad categories too, and you might argue that the lower class wouldn’t even be represented here, so we can say that these 3 clusters correspond to the lower middle, upper middle, and upper classes.
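Fitting the 3-cluster model with scikit-learn is a one-liner. I’m using random data here as a stand-in for the scaled customer features, just to show the shape of the call:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(200, 6))  # stand-in for the scaled dataset

# ward linkage mirrors the dendrogram's merge criterion
model = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = model.fit_predict(X_scaled)  # one cluster id (0, 1 or 2) per row
```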

Here is a diagrammatic representation of what I’ve chosen.

After building the model, all that’s left is visualising the results. There are more than two features, so I’m arbitrarily selecting two, plotting all points using those features for my axes, and giving each point a color shared by all other points in its cluster.
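That visualisation step can be sketched like this; the two column indices are arbitrary, and the data is again a random stand-in:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(1)
X_scaled = rng.normal(size=(200, 6))
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_scaled)

fig, ax = plt.subplots()
for cluster_id in np.unique(labels):
    mask = labels == cluster_id
    # plot only the two arbitrarily chosen feature columns, one colour per cluster
    ax.scatter(X_scaled[mask, 0], X_scaled[mask, 1], label=f"cluster {cluster_id}")
ax.legend()
```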

You’ll notice that there is a lot of data here, but also a clear pattern. Points belonging to the purple cluster visibly tend towards the upper left corner. Similarly, points in the teal cluster tend to the bottom left corner, and points in the yellow cluster tend to the bottom right corner.

Implementing PCA and UMAP in Python

You can find the full code for PCA here, and the full code for UMAP here.

Dimensionality reduction is an important part of constructing Machine Learning models. Dimensionality reduction is basically the process of combining multiple features into a smaller number of features. Features that have a higher contribution to the target value have a greater representation in the final combined feature than features that contribute less. For instance, if you have 8 features, the first 6 of which have a summed contribution of around 95%, and the last 2 of which have a contribution of only about 5%, then those 6 features will have a greater representation in the final combined feature. In terms of advantages, the most significant is less memory storage and hence higher modeling and processing speed. Other advantages include simplicity and easier visualization. For instance, you can easily plot the contribution of two combined features to the target, especially compared to plotting, say, 20 initial features. Another significant aspect is that features with less contribution that would otherwise add useless ‘weight’ to the model are removed early on.
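To make “contribution” concrete, here’s a tiny sketch using PCA (one of the two methods coming up). After fitting, `explained_variance_ratio_` reports what fraction of the total variance each combined feature captures; the data is synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))  # 8 made-up features
X[:, 0] *= 10                  # make one feature dominate the variance

ratios = PCA().fit(X).explained_variance_ratio_
# ratios is sorted from largest to smallest and sums to 1; dropping the
# tail of this list is exactly the "removing low-contribution features"
# idea described above
```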

The two methods of dimensionality reduction I will be using are PCA and UMAP. I won’t go through how they work in depth, as I’ve given a short overview of their purpose above. Instead, I’ll go through the code I implemented for each, and visualize the results. For this exercise, I’m using the WHO Life Expectancy Dataset that can be found on Kaggle, as it’s very small and easy to work with. My target variable will be life expectancy, and my features will be aspects like adult mortality, schooling, GDP etc. I randomly selected these features from the dataset.

Here is a list of the modules we will be using. train_test_split will help us break our data into a training set and a testing set (about a 7:3 ratio). While this isn’t significant right now, it aids in the detection of under-fitting and over-fitting. Under-fitting is detected by bad performance on both the training set and the testing set, whereas over-fitting is detected by really good performance on the training set but bad performance on the testing set. StandardScaler has been used to normalise features. Feature normalisation is a technique that reduces the range of the dataset, or the standard deviation, in layman’s terms. Lastly, we’ve imported both PCA and UMAP, which we’ll be using.
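The split-then-scale step can be sketched as below. The 7:3 ratio maps to `test_size=0.3`, and the scaler is fitted on the training set only, then applied to both (the data here is a synthetic stand-in):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # stand-in feature matrix
y = rng.normal(size=100)        # stand-in life-expectancy target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaler = StandardScaler().fit(X_train)  # learn mean/std from training data only
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```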

Here we just load our dataset, extract the features that will be used (see column names in the dataframe), and rename them for the sake of simplicity. As you can see, there are some random spaces and not all use underscores as notation, so I decided to have one uniform way of typing out each feature. Now, to extract a feature matrix and a target vector, just drop the life_expectancy column from the dataframe and convert it into a numpy array, and convert the life_expectancy column into a separate numpy array. I won’t show the code for splitting and normalising, because that’s pretty much irrelevant here.
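As a sketch of that tidy-up step: rename the messy column names, then split the frame into a feature matrix and a target vector. I’m hand-building a three-row frame here, with column names trimmed to a few (the stray spaces mimic the kind of inconsistency the real dataset has):

```python
import pandas as pd

df = pd.DataFrame({
    "Life expectancy ": [65.0, 72.3, 59.9],
    "Adult Mortality": [263.0, 74.0, 271.0],
    " BMI ": [19.1, 58.4, 18.6],
})

# one uniform snake_case name per feature
df = df.rename(columns={
    "Life expectancy ": "life_expectancy",
    "Adult Mortality": "adult_mortality",
    " BMI ": "bmi",
})

y = df["life_expectancy"].to_numpy()              # target vector
X = df.drop(columns="life_expectancy").to_numpy() # feature matrix
```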

Implementing PCA in itself is very simple, as shown above. You’ll notice that I’ve specified n_components to be equal to 2 above. This is because the number of combined features you want at the end can be set by you; if you don’t specify a number, scikit-learn’s PCA keeps every component, so setting n_components=2 is what actually limits the output to two combined features. After that, I’ve fitted the training data to PCA.
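The fit itself, sketched on stand-in data (in the post this runs on the scaled training features):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 6))  # stand-in for the scaled training features

pca = PCA(n_components=2)           # keep exactly two combined features
X_pca = pca.fit_transform(X_train)  # shape (80, 2): one row per sample
```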

Here’s a bit of data treatment before I finally plot the results. I’ve basically converted PCA’s output, which was a numpy array, to a pandas dataframe, and then added life_expectancy as a column because that will be used for the color-bar you will see below.

Here is the code for my plot, and here is the plot:
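As a sketch of what that plotting step can look like, assuming the components dataframe built above (the column names `pc1`/`pc2` and the random data are my own):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
pca_df = pd.DataFrame(rng.normal(size=(100, 2)), columns=["pc1", "pc2"])
pca_df["life_expectancy"] = rng.uniform(45, 90, size=100)

fig, ax = plt.subplots()
# colour each marker by its target value; the colorbar acts as the legend
scatter = ax.scatter(pca_df["pc1"], pca_df["pc2"],
                     c=pca_df["life_expectancy"], cmap="viridis")
fig.colorbar(scatter, label="life expectancy")
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
```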

You can see each point’s value for the two components, while its target value, or life expectancy, is represented by the color of the marker. While I don’t see any patterns straightaway (specific colors being clustered somewhere etc.), the primary thing that does stand out is how heavily green dots (~ 70 expectancy) are clustered towards the bottom left. There are other colors as well, but there don’t seem to be many green dots anywhere else.

The code for UMAP is exactly the same, except with UMAP as our decomposer instead of PCA. Here’s the plot.

You can straightaway see that the results of UMAP are quite different. Once again, there are no noticeable patterns in terms of specific colors being clustered in specific locations, but the overall structure is quite different from that of PCA. We can see that each color is distributed throughout.

There’s no way to say which method is better without modeling your target variable with respect to each method’s components and calculating the accuracy on the testing set. This post just aims to illustrate how both of them work without going into specific details.

Sustainability: Why it Matters and What I’m Trying to do about it.

There are so many facets to global warming and climate change that people are aware of, but don’t understand. This leads to a lot of focus on very specific issues, which, in turn, leads to negligence of other problems. For instance, one such facet is decreasing tree cover in urban areas and loss of forested areas. This problem is very simple to understand. People need space to live and shelter to live in; hence, they take a plot of land with trees and cut the trees down to get space and materials for building those shelters. Sustainability here would be planting more saplings elsewhere and restricting yourself to the plot you cleared out initially. If, with time, you lack space, build vertically, not horizontally. Leave room for mother nature. Instead, due to an ever-increasing need for residential and industrial areas, cities continue to expand at a breathtaking pace. Find an image comparing what cities looked like ten years ago with what they look like now, and you’ll see what I’m talking about. Urban areas constantly demand more homes, offices, power generation facilities, shopping malls, roads etc. The list goes on and on. Nature doesn’t. Due to urban expansion and hence reduced green cover, carbon dioxide levels are ramped up, with more vehicles to produce it and fewer trees to remove it. Describing the different ways through which carbon dioxide is produced is futile. More urban areas result in more airports, not just one for each city, but more than one for the large cities. Airports need vast patches of land, which means less tree cover. More airports mean more flights, and aircraft are among the biggest singular producers of greenhouse gases, so that just keeps adding on.
This is just one impact of not having enough green cover; other consequences include higher temperatures (hence more need for air conditioners in buildings and vehicles, which has a negligible unit-impact but a very large aggregate impact), soil erosion, elimination of natural flood barriers etc. It’s bad. You might say that cutting down a small patch of trees has no impact in the larger scheme. You’re right, it doesn’t. But when millions of people say that, the impact adds up, and it leads to the world we’re in today.

Now, less green cover is just one facet of the problem, as I said at the start. You might’ve realized the magnitude of this problem, but there are other problems out there that bear the same, if not heavier, consequences. There’s rising temperatures, rising sea levels, desertification etc. And these are just other aspects of the same overarching problem: global warming. There are a host of issues out there: space junk, war, plastic pollution etc.

But then, the reason one can’t solve all these issues is that they’re too large. That doesn’t just mean each one needs too much effort and time; it means that each one needs too much money, on the scale of billions of dollars. Knowing this, I set out to try and make my own small-scale, high-impact solution to the problem of reduced tree numbers: Treephillia.

Planting trees is something we’re all taught as kids. Kindergarten and elementary school are full of activities where PT teachers show you how to plant a sapling, then you go ahead and plant yours and it feels like you’re doing your bit for the world. It’s something we all read about, and I do believe that the mass media deserves a lot of credit for showing the world just how important planting trees can be. In the developed world, and in a significant proportion of the developing world, most people know about this issue, even though they may not do anything about it or even care. Awareness is there, but no one knows what to do. Take person X, for example. X wants to contribute by planting trees, but he has no idea about what to plant or how to plant it. He doesn’t know where he should plant it, and he doesn’t know which saplings he should buy from where. X is also aware of how important planting trees is, but is unsure about the impact just one plantation can have. X is representative of the majority of Earth’s population in this matter.

My application tries to remedy these issues. With inputs from experts who have field experience, I can advertise to its users what they should plant. For instance, the Eucalyptus is a 100% no-no: it might look pretty, but it stunts the growth of other trees. With a plantation site feature that enables officials within the local forestry department to mark spots ripe for public plantation campaigns, I can tell X where to go to plant his tree. With a map that enables X to see plantations, he gets to know the larger impact. If you have a city with a population of 1 million people, and 1 out of 100 plant just a single sapling, you still have 10,000 plantations. If you have a country with a population of 100 million people, then, extending the same 1-in-100 assumption, you have 1 million plantations. That’s a lot, and that would really matter. My application gives its users the personal interface they need to plant trees smartly in the modern-day era. And no, it’s not just limited to the stuff I listed out above. There is a serious lack of incentive surrounding planting trees, so I also decided to implement a voucher feature that rewards users who plant trees. While a voucher should not be the reason to want to plant trees, it serves as a pathway to doing so, and that works.

Now, how exactly does my application work in reality? It’s not just for singular users who want more information and who want to record what they actually do. For instance, businesses can use it to have each employee plant a tree in the event of an office birthday and track the trees planted on a map. People can plant a tree on the important occasions of their life, say birthdays, anniversaries etc. Hotels can get employees to plant a tree when a guest has a birthday, or maybe even plant one without any occasion to visually improve the setting their guests stay in. The government can use it to track tree plantations and mark planting sites for the public. NGOs in this sector can use it to reach people who genuinely care and want to have an impact on the world through planting trees.

Tree plantation is personally very important to me. As a 7 year old, I helped Dad plant a sapling outside our old house. For four years, I saw that sapling grow from a timid little thing to a leafy tree. Every time I return to that house, I think about just how much that one tree has grown. With this in mind, I’ve always drawn an analogy between planting a sapling and a mass tree plantation movement. Like a sapling, any plantation trend would be small at first. But, with time, it’ll grow. It’ll sprout branches and twigs, wear leaves, and most importantly, grow roots deep enough to keep it stable and strong. However, there is one difference here. Unlike a tree, a movement can’t be ripped out of its foundations or chopped off from its base. It will persist, and so will our planet.

Plotting Shapefile Data Using Geopandas, Bokeh and Streamlit in Python.

I was recently introduced to geospatial data in Python. It’s represented in .shp files, the same way any other form of data is represented in, say, .csv files. However, each line in a .shp file corresponds to either a polygon, a line, or a point. A polygon can represent certain shapes, so in the given context of maps and geospatial data, a polygon could act as a country, or perhaps an ocean, and so on. Lines are used to represent boundaries, roads, railway lines etc. Points are used to represent cities, landmarks, features of interest etc. This was very much new to me, so I found it fascinating to see how any sort of map can be broken down into polygons, lines, and points, then played around with using code. I was also introduced to streamlit, which provides an alternative to Jupyter Notebooks, but with more interaction and, in my opinion, better visual appeal. I think one distinct advantage Jupyter Notebook has is compartmentalisation, and how good code and markdown look next to each other, whereas streamlit seems to be more visually appealing. However, one big advantage streamlit has is the fact that it is driven from the command line, making it much more efficient for a person like me who’s very much comfortable with typing out commands and running stuff, rather than dragging my pointer around to click objects.

I used the geopandas library for dealing with shapefile data. It’s incredibly efficient and sort of extends native pandas commands to shapefile data, making everything much easier to work with. One thing I didn’t like was how streamlit didn’t have inbuilt functionality to view GeoDataFrames, so essentially that means I have to output geospatial data using the st.write() method, and that just results in some ugly, green-colored output, very much unlike the clean, tabular output you get when you use st.write() for displaying dataframes. It’s also a bit surprising how st.dataframe() doesn’t extend to a GeoDataFrame, but eh, it works for now.

Bokeh is new to me, so I decided to start out with making plots using inbuilt geopandas and matplotlib functionality rather than move straight to Bokeh. Hence, in this post, I’ll be going through how I made and annotated some maps using geopandas, then extended that to Bokeh to make my code much more efficient. A huge advantage Bokeh brings to the table is that it can be used to store plots, so there’s no going back to earlier cells to find a plot. Just output it to an html file, write some code to save the contents of that html file, and you’re good to go.

The very first thing I did was create a very basic plot showing Germany filled with a blue color. Here’s a small piece of code that accomplishes that.

Plotting Germany

This simply takes the geodata for the entire world, in low resolution. I then select the data specific to Germany, plot it using geopandas, turn plot axes off just to make it look better visually, and output the plot to my streamlit notebook (do I call it that?). Here’s the result.
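For reference, a self-contained sketch of those steps. I’m hand-building a toy “world” frame with crude bounding boxes so the snippet stands alone; in the post, the frame comes from geopandas’ bundled low-resolution world dataset:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import geopandas as gpd
from shapely.geometry import Polygon

# toy stand-in for the low-resolution world data: crude boxes, not real borders
world = gpd.GeoDataFrame({
    "name": ["Germany", "France"],
    "geometry": [
        Polygon([(6, 47), (15, 47), (15, 55), (6, 55)]),
        Polygon([(-5, 42), (8, 42), (8, 51), (-5, 51)]),
    ],
})

germany = world[world["name"] == "Germany"]  # select one country's rows
ax = germany.plot(color="blue")              # filled polygon, as in the post
ax.set_axis_off()                            # turn the axes off for a cleaner look
```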

As you can see, it’s a very simple plot showing the nation state of Germany. Now, I thought I’d extend this, make it look better, and annotate it with the names of some cities and the capital, Berlin. The very first thing to do here is to get some data for each city; their latitudes and longitudes, to be specific. You could import that via a shapefile, which would have each location as a point. Instead, I manually inputted 6 cities and their data from an internet source into a dataframe, and then used that to annotate my figure, which I’ll be talking about now.

The dataframe at the top contains my city position data. Down below, I’m creating another GeoDataFrame that holds each city’s data as a point object. You’ll notice that I’ve used the points_from_xy() method while creating the city dataframe. points_from_xy() wraps around the Point() constructor. You can view it as equivalent to [Point(x, y) for x, y in zip(df.Longitude, df.Latitude)]. It works really well and removes the need to have a for loop, making my code much more efficient. I’ve then plotted the same map as above, except with a white fill and a black outline (better looking imo). After that, I’ve gone over each point using a for loop and added a label, which is the city name (stored as just “Name”). I’ve also increased the size of the marker for Berlin, given that it’s the capital. The last step is just adding a red marker to indicate each city’s position. Note that st.pyplot() is a streamlit method that outputs any figures we might have. Here is the output of the code above.

I think this looks much better.
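For completeness, the points_from_xy construction on its own, as a minimal sketch; I’ve typed in three of the city coordinates by hand:

```python
import geopandas as gpd
import pandas as pd

cities = pd.DataFrame({
    "Name": ["Berlin", "Munich", "Hamburg"],
    "Latitude": [52.52, 48.14, 53.55],
    "Longitude": [13.40, 11.58, 9.99],
})

# points_from_xy builds one Point per (longitude, latitude) pair,
# which the GeoDataFrame then treats as its geometry column
city_gdf = gpd.GeoDataFrame(
    cities, geometry=gpd.points_from_xy(cities.Longitude, cities.Latitude)
)
```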

Now, I decided to plot a map showing all the main railway lines in India on top of a blank map of India, and output this to a new .html page as a bokeh plot.

As you can see, the code for this is very simple. I’ve firstly plotted the map of India using the get_path functionality in geopandas. Then, for the sake of visibility, axes lines have been turned off. Then, I’ve read the railway shapefiles, which consist entirely of line objects, and plotted them using the pandas_bokeh library, outputting to an html file. Here’s the result.
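A Bokeh sketch in the same spirit: draw a few polylines (standing in for the railway shapefile’s line records, with made-up coordinates) and save the plot to an html file:

```python
from bokeh.plotting import figure, output_file, save

p = figure(title="Railway lines (illustrative)", match_aspect=True)
p.axis.visible = False  # same axes-off treatment as the geopandas plots

# each sublist is one polyline's coordinates, like one shapefile line record
xs = [[77.2, 78.0, 79.5], [72.8, 75.0, 77.2]]
ys = [[28.6, 27.2, 26.4], [19.0, 23.5, 28.6]]
p.multi_line(xs, ys, line_color="black")

output_file("railways.html")  # direct output to an html file
save(p)                       # write the plot so it persists between runs
```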

I find working with geospatial data to be terribly useful, and very much exciting. I think that’s partly because I love data science; playing around with data and modelling it, and working with geospatial data opens up an entire realm of possibilities. I’d describe the experience of making my first shapefile plot as something akin to working for the first time with time series data. Much like time, it adds another dimension to what one can do. In the coming weeks, I’ll be blogging more often, hopefully once every week, on geospatial data specifically. The contents of this post are not even an introduction to what can be accomplished using geospatial data. The next point of exploration, for me, is going to be how to depict elevations, and use color density to indicate population density, average cost of living, GDP etc. One immediate application that comes to mind is making a map that uses color to reflect how much COVID-19 has impacted a country, based on not just cumulative confirmed cases, but also factors like healthcare expenditure, economic downturn, unemployment etc. I think it’ll be interesting.

You can find the entire code here.