In the lab session this week we looked again at the Freedom of Information act and considered a request to Cardiff Council for the list of empty properties in Cardiff. Last year we did a very similar session, but this year I carried out the simple data analysis slightly differently.
Our data is stored as a .csv file. Each row contains the name of the parish in Cardiff and the post code of the empty property, like so:
At the simplest level, all we want to do is count how many properties are empty in each parish, sort that count, and display the top X parishes. Here’s the code I used to do that last year:
1 import csv 2 import operator 3 from collections import defaultdict 4 5 inputfile = open('emptyproperties.csv') 6 csv_reader = csv.DictReader(inputfile) 7 8 parish_count = defaultdict(int) 9 10 for row in csv_reader: 11 parish = row['Parish'] 12 parish_count[parish] += 1 13 14 sorted_parishes = sorted(parish_count.items(), key=operator.itemgetter(1), reverse=True) 15 print(sorted_parishes[0:10])
This uses no extra libraries other than those included in the standard Python install. It’s fairly straightforward, although we need to loop through each row of our CSV and do data extraction ourselves, and the sorting of the dictionary is rather complicated.
However, there are many libraries that exist for doing data analysis in Python, so it’s worth looking at how this relatively simple task is completed using each of them.
1 import pandas 2 3 empty = pandas.read_csv('emptyproperties.csv') 4 by_parish = empty.groupby('parish') 5 6 parish_totals = by_parish.size() 7 print(parish_totals.nlargest(10))
This code is quite clear and straightforward. We read the csv into a data structure, group the rows of that structure by a particular column, and find the size of each group. We can then print out the top 10 groups by value. Pandas deals with the simple task rather well, and this is easy to teach to novice programmers, with little complexity. However, it’s not entirely clear from the API documentation why we need to use .size() rather than .count() when computing the totals.
1 import agate 2 3 empty = agate.Table.from_csv('lessthansix.csv') 4 by_parish = empty.group_by('parish') 5 parish_totals = by_parish.aggregate([('parish', agate.Length(), 'count')]) 6 7 sorted_totals = parish_totals.order_by('count', reverse=True) 8 sorted_totals.print_table(max_rows=10)
Although broadly similar, this is slightly more complicated code. Here we need to understand the aggregation syntax of Agate in order to count the number of occurrences of each parish within our data (Line 5). Given that this function accepts a list of tuples as its argument, it is slightly unnerving for a Python novice (and slightly ugly). Also, we need to explicitly sort the totals and print the ones we want, rather than using a helper function like ‘nlargest’. In my opinion, Agate over complicates this simple task.
For this ridiculously trivial task (counting occurrences of items in a CSV file), I’d say Pandas wins over both native Python and Agate (purely in terms of code complexity and ease of teaching). As we get into more complicated tasks with larger data sets it’ll be interesting to see whether that remains true. Agate is new to the scene, so support in terms of community documentation (stack overflow etc.) is rather thin on the ground. It may well be that this example can be simplified, but I was just working off the tutorial in the official documentation. Also, Agate does get bonus points for its nicer output, and for being able to print bar charts to the console:
9 sorted_totals.limit(10).print_bars('parish', 'count', width=80)