Alan Zhao

Aug 28, 2017

Resources and Inspiration Online

As much of my learning has come from online resources as from Swarthmore or Yale classrooms. This is my attempt to share some of the absolute best blogs I've encountered. They run the gamut from very technical to purely practical. I'll periodically add to this as I find more.

Practical Business Python

http://pbpython.com/

The blog that started it all. Written by Chris Moffitt, an engineer by training but a finance guy by trade, this blog focuses on showcasing how Python can take on the business analysis usually left to Excel. Example topics include using pandas for pivot tables and streamlining bulk generation of Excel reports. This blog inspired me to learn Python more deeply and provided numerous insights for automating my work; at my old nonprofit we even used some of his posts as training material. Most importantly, a conversation with Chris himself led me to the idea of creating this blog.
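To give a flavor of the kind of thing his posts cover, here's a minimal pandas pivot table sketch (the data and column names here are my own invention, not taken from his blog):

```python
import pandas as pd

# Hypothetical sales data standing in for the kind of Excel export
# these posts typically work with.
df = pd.DataFrame({
    "region": ["East", "East", "West", "West"],
    "product": ["Widget", "Gadget", "Widget", "Gadget"],
    "sales": [100, 150, 200, 50],
})

# Replicate an Excel-style pivot table: sales totals by region and product.
pivot = pd.pivot_table(df, index="region", columns="product",
                       values="sales", aggfunc="sum", fill_value=0)
print(pivot)
```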

Math ∩ Programming

https://jeremykun.com/

Written by Jeremy Kun, a math PhD turned software engineer, this blog covers a variety of mathematical topics (e.g., optimization, statistics). What's interesting is that it's all done from a coder's perspective, focusing on intuition and programmatic examples rather than endless equations. His article on support vector machines is illustrative of this. Many of his posts have helped clarify concepts from the graduate-level statistics courses I took at Yale.

Own your bits

https://ownyourbits.com/

Written by a hacker going by "Narcho," this blog focuses on leveraging the Raspberry Pi as a home cloud storage solution; in other words, a DIY Dropbox. It made for a great DIY project that I did over the past winter break with a leftover portable hard drive.

Narcho himself is an ardent advocate of understanding all the technology you use in daily life, particularly on the hardware side. It's obviously impractical to do for everything, but it's an admirable mindset to drive your curiosity.

John Myles White's Blog

http://www.johnmyleswhite.com/

Another blog attempting to explain statistical concepts in a colloquially understandable but mathematically rigorous way. His optimization perspective on the mean, median, and mode was eye-opening for me.
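The gist of that unifying view, as I understand it, is that each summary statistic minimizes a different notion of distance to the data. A quick numerical sketch (not his code, just my own illustration):

```python
import numpy as np

data = np.array([1, 2, 2, 3, 10])
candidates = np.linspace(0, 12, 1201)   # a fine grid of candidate summaries

# Mean minimizes total squared error (L2 loss).
sq_err = [(np.sum((data - c) ** 2), c) for c in candidates]
print(min(sq_err)[1], data.mean())       # both ~3.6

# Median minimizes total absolute error (L1 loss).
abs_err = [(np.sum(np.abs(data - c)), c) for c in candidates]
print(min(abs_err)[1], np.median(data))  # both ~2

# Mode minimizes the count of mismatches (0-1 loss) over observed values.
zero_one = [(np.sum(data != c), c) for c in np.unique(data)]
print(min(zero_one)[1])                  # 2, the mode
```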

Jul 23, 2016

Thoughts on Coursera's Algorithms and Data Structures

Motivation

After spending the past three years largely independently learning programming with a "just make it work" mentality, I decided in January 2016 to formalize my knowledge with an actual course. Massive open online course companies like Coursera, edX, and Codecademy offer lots of programming courses, but the majority of these are introductory-level or application-specific (like data analysis with Python or web development with Ruby). I wanted something that would be a general "next level" course, but also one that I could do in Python. I talked to software engineer friends and looked at some universities' syllabi, and found that data structures and algorithms was the standard second or third course.

Coursera had an entire six-month Data Structures and Algorithms specialization, and it (mostly) fit in before my graduate school began, so I signed up. I liked that the class would go in-depth, with five separate courses plus a "capstone" applied project. It also allowed for numerous languages (Ruby, Python 2 & 3, C, C++, Java), but only officially supported Java, C, and Python 3 with starter files. To further motivate myself to actually complete the course, I paid the ~400 dollars to get the course verified (and for the nifty certificates on my LinkedIn).

Course Structure

Unsurprisingly, the course philosophy is learning by doing. You watch 1-2 hours of lectures, with some embedded quizzes, and then code solutions to the problem set. The solutions are submitted to a cold, inhuman grader that compiles your code and checks it against 15+ test cases. There's also a discussion forum for posting questions and answers. The homeworks are well designed and closely follow the lectures, though I did find that the coursework typically took double the estimated time (6-8 hours versus 3-4). The courses are pre-recorded and each one follows a set session schedule. Missing one is fine because you can restart it, but you only have one full year from payment to finish the six-month specialization.

Learnings

Testing

Learning how to design test cases and automate them was the most valuable takeaway from the course. The grader's test cases are hidden beyond the first three, so you need to become adept at writing your own test cases to pass. Running tests manually becomes incredibly annoying, so I got much more comfortable with the assert statement. The course introduced me to the idea of testing corner cases with manually created inputs, and then automating testing with random inputs (checked against correct outputs calculated by a brute-force solution).
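Here's roughly what that stress-testing pattern looks like in Python. The maximum-pairwise-product problem below is just a stand-in example, not necessarily the exact assignment:

```python
import random

def max_pairwise_product_fast(numbers):
    # Candidate solution: product of the two largest values.
    top_two = sorted(numbers)[-2:]
    return top_two[0] * top_two[1]

def max_pairwise_product_brute(numbers):
    # Slow but obviously correct reference implementation.
    return max(numbers[i] * numbers[j]
               for i in range(len(numbers))
               for j in range(i + 1, len(numbers)))

# Stress test: random inputs, compare the fast answer to brute force.
for _ in range(1000):
    nums = [random.randint(0, 100) for _ in range(random.randint(2, 20))]
    expected = max_pairwise_product_brute(nums)
    actual = max_pairwise_product_fast(nums)
    assert actual == expected, f"Mismatch on {nums}: {actual} != {expected}"
print("All stress tests passed")
```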

Once I saw the time and headaches saved by rigorous testing, I started implementing testing at work. Prior to the course, my code base relied on integration testing, not unit testing. Afterwards, I made it a team project to go back and write unit tests, and the number of blocking issues we uncovered was incredible.
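As an illustration, a minimal unit test with Python's built-in unittest module. The function being tested here is made up, but it's the kind of small data-cleaning helper we ended up covering:

```python
import unittest

def clean_phone_number(raw):
    # Hypothetical helper: keep the digits and take the last ten.
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits[-10:] if len(digits) >= 10 else None

class TestCleanPhoneNumber(unittest.TestCase):
    def test_strips_formatting(self):
        self.assertEqual(clean_phone_number("(555) 123-4567"), "5551234567")

    def test_rejects_short_numbers(self):
        self.assertIsNone(clean_phone_number("12345"))

if __name__ == "__main__":
    unittest.main()
```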

Pseudocode Literacy

I had never read formal pseudocode before - shocking, I know. Pseudocode was intimidating, and I just avoided it. You can't avoid it in this specialization though: the course is language-agnostic, so the lingua franca is pseudocode. Every lecture has pseudocode, so every week involves translating what's conceptually laid out into code. This skill greatly increased my ability to pick up technical documentation from language-agnostic places like Wikipedia.
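As a simple example of the kind of translation involved (binary search isn't tied to any particular lecture, it's just a familiar algorithm whose pseudocode appears everywhere):

```python
def binary_search(sorted_list, target):
    """Direct translation of the standard binary search pseudocode:
    keep halving the search range [low, high] until the target is found."""
    low, high = 0, len(sorted_list) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_list[mid] == target:
            return mid
        elif sorted_list[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1  # the pseudocode's "not found" sentinel

print(binary_search([1, 3, 5, 7, 9, 11], 7))   # 3
print(binary_search([1, 3, 5, 7, 9, 11], 4))   # -1
```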

Immediate Applicability

Algorithms and Data Structures sounded more like conceptual learning than something helpful in my day-to-day work. I was wrong about that. Learning "memoization" (giving your program memory of past results) as part of dynamic programming immediately gave me insight into how to speed up a database call that made redundant calculations. Implementing it took less than two days and cut a query run 10x/day down from 5-10 minutes to 1-2 minutes. It sounds simple, but I had never heard of the concept before. However, because the class is general rather than applications-focused, you're going to need to figure out the applications yourself. I still haven't figured out applications for all those graph algorithms or self-balancing trees.
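The pattern itself is tiny. Here's a sketch with a made-up slow function standing in for my redundant database calculation:

```python
from functools import lru_cache
import time

@lru_cache(maxsize=None)
def expensive_lookup(key):
    # Hypothetical stand-in for the redundant calculation the query
    # kept repeating -- imagine a slow database call here.
    time.sleep(0.5)
    return key * 2

start = time.time()
results = [expensive_lookup(k) for k in [1, 2, 3, 1, 2, 3, 1, 2, 3]]
print(results, f"{time.time() - start:.1f}s")
# Only the 3 unique keys are actually computed (~1.5s); the other
# 6 calls return instantly from the cache instead of recomputing.
```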

So What's Missing?

Declining Enrollment

The course started off with high participation that gradually declined as we advanced through the specialization. In the first course, Algorithmic Toolbox, we had several thousand students across the world, according to a live world map of student populations. The forums were active; every question I had while doing the homeworks had already been asked and answered.

It was a different story by the third course, Algorithms on Strings. From forum activity, I estimate only several hundred people were taking this course. One question that I posted got fewer than 10 views after several days, with no responses. Coincidentally, the world map showing classmate numbers is also no longer on the course site (I wish I had taken a screenshot of the original map with student populations to prove this!). The enrollment decline is to be expected given the increasing difficulty of the courses and the extended commitment required to stay on track. To be honest, I'm one of those students who's fallen behind: after completing the first two courses, I wasn't able to finish Algorithms on Strings on time and am now doing it with the second session. I hope more students regroup with me; the active discussion is key to learning.

Python's Limitations

I love Python because it's an abstracted, high-level language. Unfortunately, this makes it difficult to implement many of the data structures because they don't exist natively in the language. Take the Python list object: it's easy to understand and use to build applications because of its flexibility. The downside is that you need to constrain it or work around it to use it as a linked list or queue structure.
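Here's roughly how that plays out for queues and linked lists (collections.deque is the usual workaround; the Node class is something you end up writing yourself):

```python
from collections import deque

# A list "works" as a queue, but popping from the front is O(n)
# because every remaining element shifts over.
queue_as_list = [1, 2, 3]
first = queue_as_list.pop(0)   # correct, just slow for large lists

# deque gives O(1) appends and pops at both ends -- much closer to
# the queue that the course pseudocode assumes.
queue = deque([1, 2, 3])
queue.append(4)
first = queue.popleft()

# For a true linked list you roll your own node class, since nothing
# like it ships with the standard library.
class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node

head = Node(1, Node(2, Node(3)))
```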

This lack of native support for "low-level" structures in Python also meant fewer resources than I expected. Most supplemental resources I found outside the class were exclusively in C or Java, so I relied heavily on reading pseudocode and Stack Overflow Python questions. A great find was this free Data Structures and Algorithms textbook written in Python. Roughly 70% of the first two courses was covered in that book.

After all is said and done, I do thank Coursera for including Python as a supported language. I would never have taken on this course if it required learning a whole new language.

Hope this was helpful. Now on to finishing Algorithms on Strings!

Jun 25, 2016

National Parks Historical Attendance

About the Visualization

I've been a huge fan of national parks; one of my goals is to visit all 60+ of them (so far I'm only halfway there). When I saw that Tableau was sponsoring a visualization competition at the 2016 DoGoodData conference, doing something with national parks came to mind. The competition rules were broad: pick a social-sector data set and tell a story with it. I did a bit of digging on the National Park Service site and saw they have a significant amount of data collected in a statistics subsection.

Unfortunately, the data was split up park by park and buried in report-style Excel spreadsheets. I used Python to scrape, clean, and aggregate the different pages' data. Once I built the full dataset, I started exploring it. I was curious to see how attendance varied by park, how it grew over time, and how cyclical it was. My own experience visiting parks had given me intuition about these trends (e.g., the Grand Canyon is really popular compared to Bryce Canyon, and nobody visits Acadia in the Maine winter). These guiding questions each drove what went into a page of the visualization.
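The aggregation step looked roughly like this; the file names and column layout below are placeholders for illustration, not the actual NPS report format:

```python
import pandas as pd

# Placeholder file names standing in for the downloaded per-park reports.
park_files = ["yosemite.xlsx", "zion.xlsx", "acadia.xlsx"]

frames = []
for path in park_files:
    df = pd.read_excel(path, skiprows=3)        # skip the report header rows
    df["park"] = path.replace(".xlsx", "")      # tag rows with the park name
    frames.append(df)

# Stack the per-park reports into one tidy dataset for Tableau.
attendance = pd.concat(frames, ignore_index=True)
attendance.to_csv("nps_attendance.csv", index=False)
```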

Page One - Attendance by Park

Building the map of NPS sites with dot size determined by attendance was straightforward, but the visualization initially looked bland. I really wanted to get some kind of image of the parks - after all, people love and recognize parks for their imagery, not their data points. I wasn't sure where I could find a data source of iconic photos for each park. I considered grabbing each park's Wikipedia page, but found that the image quality varied and in some cases a page didn't have a photo yet. I ended up with a neat solution: use each park's official NPS site. Each site had a high-quality iconic image, and I could embed it directly into Tableau using Tableau's web page embed feature. A bit of tinkering with the dimensions and having Tableau load the page at the image's HTML class div, and voila: an on-demand library of curated images.

Page Two - Visitation Growth Over 40 Years

Parks have gotten much more popular since data started being collected in 1979; attendance has almost doubled. It turned out that the growth is concentrated in a few parks in particular: Zion, Yosemite, and the Grand Canyon.

Page Three - Seasonality of Park Attendance

Each park has some sort of seasonality, and here you can see what each particular park's pattern looks like. Most follow a trend of high attendance in summer (nice weather plus summer break), but several parks diverge from this. Some are obvious: Death Valley has almost no visitation in the summer. Others I'm not sure about: Great Smoky Mountains peaks again in the fall (maybe people checking out the autumn leaves?).

I was pleased with what came out, but it still could use a bit of polish. Overall, the data collection and visualization building took a few days as a side project. Maybe I'll write an in-depth analysis of the data in a future blog post with a cleaner visualization.
