Computational & Data Journalism @ Cardiff

Reporting. Building. Designing. Informing


SELECT * FROM a day of SQL…

6th March 2020 by Aidan O'Donnell

This month our students survived a full-day workshop on SQL, moving from the very basics of the syntax to querying real datasets and working through some of the better tutorials.

First up was the excellent Select Star tutorial by Zi Chong Kao, which is based on a dataset of US prisoners executed since 1976.

We then looked for newslines in a sqlite database of US babynames (via the command line) and wrote queries in Carto to map a dataset of protected Welsh monuments.

There was more sqlite with a database of shooting incidents involving Dallas police officers, this time via a notebook. And we finished with the Knight Lab’s fine SQL-based murder mystery.

Enough there to get you started (or refreshed) with your SQL syntax.
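
To give a flavour of the kind of queries involved, here's a minimal sketch using Python's built-in sqlite3 module against the baby names database. The table and column names here are assumptions, so the schema used in the workshop may well differ:

import sqlite3

# a minimal sketch: the ten most popular girls' names in a given year
# assumes a 'names' table with name, sex, year and num columns
# (the schema used in the workshop may differ)
conn = sqlite3.connect('babynames.sqlite')

query = """
    SELECT name, SUM(num) AS total
    FROM names
    WHERE year = 2018 AND sex = 'F'
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10;
"""

for name, total in conn.execute(query):
    print(name, total)

conn.close()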

Filed Under: Blog Tagged With: coding, data, education, investigation, SQL, tools

Chatbots in the Classroom: Education Innovation Research

7th June 2017 by Martin Chorley

The Computational and Data Journalism team has recently been awarded research funding from the University Centre for Education Innovation to investigate the use of chatbots in the classroom.

The project “proposes the development of chat bots as part of the teaching and learning team to support learning and automate everyday issues to alleviate staff workload.

“This would essentially create an on-demand classroom assistant who can provide informational support whatever schedule students choose to keep outside of the classroom environment and increase their overall satisfaction levels as a result.”

We’ve just hired a third-year Computer Science student, Stuart Clark, to work with us on the project. He has made a swift start, working to identify sources of data within the university that such a system can plug into, designing system architectures and interfaces, and beginning work on the implementation.

We’ll follow up this development work over the summer with a live trial of the system in Autumn to see how well it works and assess whether this sort of technology can be successfully used by students and lecturers alike to improve information flow and ease administrative pressures.

We’ll continue to blog about the project as it progresses over the next few months.

Filed Under: Blog, Research, Teaching, The Lab Tagged With: ai, chatbot, coding, data, education, education innovation, interaction, oss, students, summer project, tools

Hacking VoterPower with the Bureau Local

31st May 2017 by Martin Chorley

Today we hosted one of several hackdays happening nationwide, organised by The Bureau Local. Journalists from The Bristol Cable joined up with students from the MSc in Computational and Data Journalism to analyse election data, hoping to uncover local data stories around the voters in their local constituencies.

We’re pleased to be able to support one of the first community initiatives from The Bureau Local, which, along with their project examining dark advertising on Facebook, is beginning to show how they will deliver on their mission to build a “network of journalists and tech experts across the country who will work together to find and tell stories that matter to local communities”.

It was also great to meet up again with MSc Computational Journalism 14/15 grad Charles Boutaud, here representing the Bureau Local in his new role as a developer-journalist in their team.

Here's our team in Cardiff about to get stuck into some juicy datasets. We have teams in London, Bournemouth, Glasgow and Birmingham too pic.twitter.com/FFG6JMzurE

— The Bureau Local (@bureaulocal) May 31, 2017

.@bureaulocal is hacking #ge2017 live in 5 cities across the UK: London, Bournemouth, Cardiff, Birmingham and Glasgow! #voterpower pic.twitter.com/Nac7Gjtfo3

— Megan Lucero (@Megan_Lucero) May 31, 2017

We've been at @bureaulocal hack day in Cardiff, digging into #Bristol election data, part of nationwide network. #GE2017 https://t.co/hdRWIm9G6Z

— The Bristol Cable (@TheBristolCable) May 31, 2017

 

Filed Under: Blog Tagged With: bureaulocal, coding, collaboration, data, ge2017, grad, hack, hackday, local, voterpower

Scraping the Assembly

2nd November 2016 by Martin Chorley

Glyn is currently teaching the first-semester module on Data Journalism. As part of this, students need to complete a data investigation project. One of the students is looking at the expenses of Welsh Assembly Members. These are all freely available online, but not in an easy-to-manipulate form. According to the Assembly, they’d be happy to give the data out as a spreadsheet if we submitted an FOI request.

To me, this seems quite stupid. The information is all online and freely accessible. You’ve admitted you’re willing to give it out to anyone who submits an FOI. So why not just make the raw data available to download? This does not sound like a helpful Open Government to me. Anyway, for whatever reason, they’ve chosen not to, and we can’t be bothered to wait around for an FOI to come back. It’s much quicker and easier to build a scraper! We’ll just use selenium to drive a web browser, submit a search, page through all the results collecting the details, then dump it all out to csv. Simple.


I built this as a quick hack this morning. It took about an hour or so, and it shows. The code is not robust in any way, but it works. You can ask it for data from any year (or a number of years) and it’ll happily sit there churning its way through the results and spitting them out as both .csv and .json.
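
The approach boils down to something like the sketch below. This is an illustration rather than the actual scraper on GitHub: the URL, form fields and selectors here are placeholders, and the real Assembly search page needs different ones.

import csv

from selenium import webdriver
from selenium.webdriver.common.by import By

# placeholder URL and selectors: the real search page differs
driver = webdriver.Firefox()
driver.get('https://allowances.example.org/search')

# submit a search for a given year
driver.find_element(By.NAME, 'year').send_keys('2016')
driver.find_element(By.NAME, 'submit').click()

# page through the results, collecting each row of the results table
rows = []
while True:
    for tr in driver.find_elements(By.CSS_SELECTOR, 'table.results tr'):
        rows.append([td.text for td in tr.find_elements(By.TAG_NAME, 'td')])
    next_links = driver.find_elements(By.LINK_TEXT, 'Next')
    if not next_links:
        break
    next_links[0].click()

# dump everything out to CSV
with open('expenses.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)

driver.quit()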

All the code is available on Github and it’s under an MIT Licence. Have fun 😉

Filed Under: Blog, Teaching Tagged With: coding, data, foi, investigation, oss, python, scraping

Empty Properties: simple choropleth maps with leaflet.js

27th November 2015 by Martin Chorley

We’re still working on looking at empty properties around Wales, so while we wait for the FOI request results to come in, I thought it would be interesting to do a bit of basic mapping. Normally, if I want to create a choropleth I reach straight for d3 and my collection of topojson, but we’re still very early in the course and haven’t covered d3 yet (we go into it in some detail in next semester’s visualisation course). So we need a simpler solution, and fortunately the Leaflet API makes it very easy to draw polygons on top of a map; all we need to know are the coordinates of the shapes we want to draw.

So, first we need to grab boundary files for the parishes around Wales. A quick hunt through the bafflingly obtuse ONS geoportal brings us to the generalised parish boundaries (E+W). Although it doesn’t seem immediately obvious from that page, there is a download link there that allows us to obtain shapefiles containing the boundary data for every parish in England and Wales. Unfortunately, these files are in a rather complicated shapefile format, when all we really need is a list of coordinates that we can throw into some JavaScript. We could extract and transform this data using command line tools, but as this is an early demo, we’ll use some graphical tools to do the work. So, first of all we open up the shapefile in our favourite GIS software:

England + Wales Parish boundaries in QGIS

This is all the parishes for England and Wales, and we only want the boundaries for Wales, so the next thing we’ll do is extract those. Looking at the attribute table, we see that each parish has a code connecting it to its Local Authority District (the LAD14CD). Using a simple filter on the ‘LAD14CD’, we can extract all those parishes that are in a local authority district in Wales, by selecting only those LAD14CDs that begin with a ‘W’:

Filtering based on attributes - substr(LAD14CD, 0, 2) = 'W'

This gives us our Welsh parishes:

Welsh parishes selected

Now we can save this selection as GeoJSON, which is a nicer format to work with than ESRI shapefiles and is easily handled by Leaflet. While we’re at it, we can convert the coordinates of the boundary data to WGS84 (which essentially gives us lat/lng coordinates we can use with our map):

Saving the selected parishes
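
Incidentally, the same filter, reprojection and export can be scripted rather than clicked through in QGIS. Here's a rough sketch using geopandas; it wasn't part of the workflow above, and the shapefile name is a placeholder:

import geopandas as gpd

# load the full England + Wales parish boundaries (filename is a placeholder)
parishes = gpd.read_file('Parish_Boundaries.shp')

# keep only parishes whose Local Authority District code starts with 'W' (Wales)
wales = parishes[parishes['LAD14CD'].str.startswith('W')]

# reproject from the British National Grid to WGS84 lat/lng and save as GeoJSON
wales.to_crs(epsg=4326).to_file('Wales_Parish.geojson', driver='GeoJSON')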

For this example (because we’ve only had a response from Cardiff Council so far), we only need to deal with the Cardiff parishes, so for simplicity we’ll extract the Cardiff parishes from our large geoJSON file into a smaller Cardiff specific file. A quick bit of Python looking for all the parishes with a LAD14CD of ‘W06000015’ is all that’s needed here:

import json

parishes = json.load(open('Wales_Parish.geojson', 'r'))
cardiff_parishes = {'type': parishes['type'], 'crs': parishes['crs'], 'features': []}

for feature in parishes['features']:
    if feature['properties']['LAD14CD'] == 'W06000015':
        cardiff_parishes['features'].append(feature)

json.dump(cardiff_parishes, open('Cardiff_Parish.geojson', 'w'))

This GeoJSON is all we need to display the parish boundaries on our map. In fact, if we edit the GeoJSON file to include

var parishes = {ALL_OUR_GEOJSON_DATA}

we can import it directly into a webpage and load it into a map relatively easily using Leaflet’s geoJson function:

<!DOCTYPE html>
<html lang="en">
<head>
 <meta charset="UTF-8">
 <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no" />
 
 <title>Empty Properties</title>
 <link rel="stylesheet" href="http://cdn.leafletjs.com/leaflet/v0.7.7/leaflet.css" />
 
 <style>
 html, body, #map {
     height: 100%;
     width: 100%;
 }
 </style>
</head>
<body>
 <div id="map"></div>

 <script src="http://cdn.leafletjs.com/leaflet/v0.7.7/leaflet.js"></script>
 <script src="cardiff_parish.js"></script>
 <script>
   // create map and centre on Cardiff
   var map = L.map('map').setView([51.455, -3.19], 12);

   L.geoJson(parishes).addTo(map);

   // add some mapbox tiles
   var tileLayer = L.tileLayer('http://{s}.tiles.mapbox.com/v3/' + 'YOUR_MAPBOX_API_KEY' + '/{z}/{x}/{y}.png', { 
       attribution: 'Map data &copy; <a href="http://openstreetmap.org">OpenStreetMap</a> contributors, <a href="http://creativecommons.org/licenses/by-sa/2.0/">CC-BY-SA</a>, Imagery © <a href="http://mapbox.com">Mapbox</a>',
       maxZoom: 18
   }).addTo(map);
 </script>
</body>
</html>

This gives us a nice map of Cardiff with the parish boundaries:

Cardiff parishes on a map

All we need to do now is alter the colour of each parish based on the number of empty properties within it. So, we go back to the data we extracted previously, which gave us the total number of empty properties in each parish. We can return to our code that extracts the Cardiff parishes from the large GeoJSON file, and this time, whenever we extract a Cardiff parish, we add a property to the GeoJSON feature with its value from the empty properties data. We also add the min and max values across the whole set of parishes:

import json
import pandas

parishes = json.load(open('Wales_Parish.geojson', 'r'))
cardiff_parishes = {'type': parishes['type'], 'crs': parishes['crs'], 'features': [], 'properties': {}}

parish_totals = pandas.read_csv('parish_totals.csv', index_col=0)

cardiff_parishes['properties']['min'] = parish_totals['value'].min()
cardiff_parishes['properties']['max'] = parish_totals['value'].max()

for feature in parishes['features']:
    if feature['properties']['LAD14CD'] == 'W06000015':
        parish_name = feature['properties']['PARNCP14NM'].strip().upper()
        feature['properties']['empty_total'] = parish_totals.loc[parish_name]['value']
        cardiff_parishes['features'].append(feature)

json.dump(cardiff_parishes, open('Cardiff_Parish.geojson', 'w'))

Then, we set up a colour scale in the JavaScript code that creates the map (based on a ColorBrewer scale), and style each shape according to its value by adding a style function that Leaflet calls as it draws each GeoJSON feature:

<script>
 // create map and centre on Cardiff
 var map = L.map('map').setView([51.455, -3.19], 12);

 var divisor = parishes.properties.max / 9;
 var colour_scale = ["#fff7ec", "#fee8c8", "#fdd49e", "#fdbb84", "#fc8d59", "#ef6548", "#d7301f", "#b30000", "#7f0000"];

 L.geoJson(parishes, {
   style: function(feature){
     var colour = colour_scale[Math.round((feature.properties.empty_total/divisor)-1)];
     return {color: colour, fillOpacity: 0.4, weight: 2}
   }
 }).addTo(map);

 // add some mapbox tiles
 var tileLayer = L.tileLayer('http://{s}.tiles.mapbox.com/v3/' + 'YOUR_MAPBOX_API_KEY' + '/{z}/{x}/{y}.png', {
     attribution: 'Map data &copy; <a href="http://openstreetmap.org">OpenStreetMap</a> contributors, <a href="http://creativecommons.org/licenses/by-sa/2.0/">CC-BY-SA</a>, Imagery © <a href="http://mapbox.com">Mapbox</a>',
     maxZoom: 18
 }).addTo(map);
</script>

And with a refresh of our map page, there we have a choropleth of the parishes in Cardiff, coloured by the number of empty properties:

Choropleth of empty properties in Cardiff

This is a nice quick example that has allowed us to begin thinking about mapping data, and some of the issues surrounding such mappings, before we study it in detail next semester. As more of the data is returned from our FOI requests, we can start expanding this visualisation across Wales.

Filed Under: Blog, Teaching, The Lab Tagged With: choropleth, coding, foi, leaflet.js, map, visualisation

Updating Empty Properties: Agate vs Pandas

5th November 2015 by Martin Chorley

In the lab session this week we looked again at the Freedom of Information Act and considered a request to Cardiff Council for the list of empty properties in Cardiff. Last year we did a very similar session, but this year I carried out the simple data analysis slightly differently.

[Read more…]

Filed Under: Blog, Teaching, The Lab Tagged With: agate, coding, data, foi, pandas, python, tools

NHS Hackday 2015

30th January 2015 by Martin Chorley

This weekend staff and students from the MSc in Computational Journalism took part in the Cardiff NHS Hackday 2015. Our liveblog of the event is here, and you can see our final product online, with all the code available in our GitHub repository.

We’ve created the foundation for a project (Health Explorer Wales) that can visualise all sorts of health related data across the local authorities and local health boards of Wales. We’ll look to improve this over the coming weeks, with the hope that it can find a community willing to adopt it and drive it forwards in the future.

Health Explorer Wales

I’ve written a fuller debrief of the event on my personal blog here, and photos from the event are included below. All pictures are released under a Creative Commons BY-NC 2.0 licence by Paul Clarke.

[Photo slideshow: Cardiff NHS Hackday 2015]

 

Filed Under: Blog Tagged With: coding, hackday, msccompj, visualisation, web

Empty Properties & Postcodes

5th January 2015 by Martin Chorley

As part of the course we hold a weekly session where we try to tie together Journalism and Computer Science: “The Lab”. One of the first sessions we held looked at the results of a Freedom of Information request – tying together a commonly used journalistic tool with some simple coding and data analysis.

Glyn had submitted a Freedom of Information request to Cardiff Council asking for the number of empty properties across the city. This was partially successful, as it turns out the information was already published on the Council website. Unfortunately, as is common with many council documents, the data was made available as a .pdf file. This is a terrible way to have to receive data, as .pdf files are not easily machine readable. So, our first task was to extract the data. (It’s interesting to note that the latest version of this data has been released as an .xls file. It’s still not a fully REST compliant API spitting out lovely well formed JSON, but it’s a step in the right direction at least).

There are many excellent tools for extracting data from .pdf files, such as Tabula. However, often the simplest solutions are the best, and in this case it was entirely possible to just copy and paste the contents of the .pdf into a spreadsheet. Once the data was in the spreadsheet we could save it as a Comma Separated Value (.csv) file, which is a fairly simple format to deal with using some Python code.

We now have a .csv file listing the postcode and parish of every empty property in Cardiff, along with the date when the property became unoccupied. It is therefore pretty easy to do some simple analysis of the data using Python. For example, we can count the number of occurrences of each parish name, and find the ten areas of Cardiff with the most empty properties:

import csv
import operator
from collections import defaultdict

inputfile = open('emptyproperties.csv', 'r')
csv_reader = csv.DictReader(inputfile)

parish_count = defaultdict(int)

# count the number of empty properties in each parish
for row in csv_reader:
    parish = row['Parish']
    parish_count[parish] += 1

# sort the parishes by count, largest first, and print the top ten
sorted_parishes = sorted(parish_count.items(), key=operator.itemgetter(1), reverse=True)
print(sorted_parishes[0:10])

[Screenshot: the ten parishes with the most empty properties]

Part of creating a story around this result would be to add context to this data. Anyone with local knowledge will recognise that Butetown (including Cardiff Bay) has many blocks of rental flats, which probably explains why there are so many empty properties there. Whitchurch, however, is a fairly affluent middle-class area, so its presence in the top ten is surprising and may require further investigation.

We can also use the dates within the data to find the postcode of the property that has been empty longest:

import csv
import datetime

inputfile = open('emptyproperties_correct.csv', 'r')
csv_reader = csv.DictReader(inputfile)

earliest_date = datetime.datetime.now()
earliest_postcode = ''

# find the row with the earliest occupancy period start date
for row in csv_reader:
    date = row['Occupancy Period Start Date ']

    if date != '':
        py_date = datetime.datetime.strptime(date, "%d-%b-%y")

        if py_date < earliest_date:
            earliest_date = py_date
            earliest_postcode = row['Post Code ']

print(earliest_postcode, earliest_date)

[Screenshot: the postcode and start date of the longest-empty property]

According to the data, a property in central Cardiff, near to HMP Cardiff, has been empty since 1993. Clearly, further investigation is required to find out whether the data is accurate, and if so, why the property has been empty so long.

These short examples show how you can start to use simple bits of code to dive into and analyse data quickly, finding features hidden in the data that, with some investigation, may lead on to an interesting story. In future sessions, we can go on to look at interesting ways to visualise this data and examine it further.

Filed Under: Blog, The Lab Tagged With: coding, foi, python, tools

The Importance of Owning your Toolset

19th November 2014 by Martin Chorley

This morning, upon logging in to ScraperWiki, Glyn found the following message:

ScraperWiki loses access to Twitter API

ScraperWiki can no longer access the Twitter API, meaning that tools based on Twitter data on ScraperWiki will no longer work. As it happens, I have just written a really simple Twitter API wrapper in Python, so we thought it would be worth presenting it here as a how-to. If you understand and own the code that is gathering your data, you can obviously remove your reliance on third parties.

 

QUICK DISCLAIMER: this is a quick and dirty solution to a problem, so may not represent best coding practice, and has absolutely no error checking or handling. Use with caution…

The code presented here will allow you to make any API request to Twitter that uses a GET request, so it is really only useful for getting data *from* Twitter, not sending it *to* Twitter. It also only works with the REST API, not the streaming API, so if you’re looking for realtime monitoring, this is not the API wrapper you’re looking for. This wrapper also uses a single user’s authentication (yours), so it is not set up to allow other users to use Twitter through your application.

The first step is to get some access credentials from Twitter. Head over to https://apps.twitter.com/ and register a new application. Once the application is created, you’ll be able to access its details. Under ‘Keys and Access Tokens’ are four values we’re going to need for the API – the Consumer Key and Consumer Secret, and the Access Token and Access Token Secret. Copy all four values into a new Python file, and save it as ‘_credentials.py‘. The images below walk through the process. Also – don’t try to use the credentials from these images; this app has already been deleted, so they won’t work!

Create a new Twitter Application
Give it a name, description and website
Check the App details
Get your Consumer Key and Secret
… and your Access Token and Secret
Store your credentials in _credentials.py
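
The variable names below are assumptions chosen to match the code later in this post; something along these lines is what _credentials.py ends up containing (with your own values in place of the placeholders):

# _credentials.py
# placeholder values: substitute the four strings from your own Twitter app
client_id = "YOUR_CONSUMER_KEY"
client_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_secret = "YOUR_ACCESS_TOKEN_SECRET"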

Once we have the credentials, we can write some code to make some API requests.

First, we define a Twitter API object that will carry out our API requests. We need to store the API URL, along with some details that let us throttle our requests to Twitter to stay inside their rate limits (the imports at the top of the snippet below cover the modules used throughout this post):

import base64
import hmac
import json
import threading
import time
import urllib.error
import urllib.parse
import urllib.request
import uuid
from hashlib import sha1

# the four keys saved earlier in _credentials.py
from _credentials import client_id, client_secret, access_token, access_secret


class Twitter_API:

    def __init__(self):

        # URL for accessing API
        scheme = "https://"
        api_url = "api.twitter.com"
        version = "1.1"

        self.api_base = scheme + api_url + "/" + version

        # seconds between queries to each endpoint
        # queries in this project limited to 180 per 15 minutes
        query_interval = float(15 * 60) / 175

        # rate limiting timer
        self.__monitor = {'wait': query_interval,
                          'earliest': None,
                          'timer': None}

We add a rate limiting method that will make our API sleep if we are requesting things from Twitter too fast:

    # rate_controller puts the thread to sleep
    # if we're hitting the API too fast
    def __rate_controller(self, monitor_dict):

        # join the timer thread
        if monitor_dict['timer'] is not None:
            monitor_dict['timer'].join()

        # sleep if necessary (skipped on the very first call,
        # when 'earliest' has not yet been set)
        while monitor_dict['earliest'] is not None and time.time() < monitor_dict['earliest']:
            time.sleep(monitor_dict['earliest'] - time.time())

        # work out when the next API call can be made
        earliest = time.time() + monitor_dict['wait']
        timer = threading.Timer(earliest - time.time(), lambda: None)
        monitor_dict['earliest'] = earliest
        monitor_dict['timer'] = timer
        monitor_dict['timer'].start()

The Twitter API requires us to supply authentication headers in the request. One of these headers is a signature, created by encoding details of the request. We can write a function that will take in all the details of the request (method, url, parameters) and create the signature:

    # make the signature for the API request
    def get_signature(self, method, url, params):

        # escape special characters in all parameter keys
        encoded_params = {}
        for k, v in params.items():
            encoded_k = urllib.parse.quote_plus(str(k))
            encoded_v = urllib.parse.quote_plus(str(v))
            encoded_params[encoded_k] = encoded_v

        # sort the parameters alphabetically by key
        sorted_keys = sorted(encoded_params.keys())

        # create a string from the parameters
        signing_string = ""

        count = 0
        for key in sorted_keys:
            signing_string += key
            signing_string += "="
            signing_string += encoded_params[key]
            count += 1
            if count < len(sorted_keys):
                signing_string += "&"

        # construct the base string
        base_string = method.upper()
        base_string += "&"
        base_string += urllib.parse.quote_plus(url)
        base_string += "&"
        base_string += urllib.parse.quote_plus(signing_string)

        # construct the key
        signing_key = urllib.parse.quote_plus(client_secret) + "&" + urllib.parse.quote_plus(access_secret)

        # sign the base string with the key, and base64 encode the result
        hashed = hmac.new(signing_key.encode(), base_string.encode(), sha1)
        signature = base64.b64encode(hashed.digest())
        return signature.decode("utf-8")

Finally, we can write a method to actually *make* the API request:

    def query_get(self, endpoint, aspect, get_params={}):

        # rate limiting
        self.__rate_controller(self.__monitor)

        # ensure we're dealing with strings as parameters
        str_param_data = {}
        for k, v in get_params.items():
            str_param_data[str(k)] = str(v)

        # construct the query url
        url = self.api_base + "/" + endpoint + "/" + aspect + ".json"

        # add the header parameters for authorisation
        header_parameters = {
            "oauth_consumer_key": client_id,
            "oauth_nonce": uuid.uuid4(),
            "oauth_signature_method": "HMAC-SHA1",
            "oauth_timestamp": time.time(),
            "oauth_token": access_token,
            "oauth_version": 1.0
        }

        # collect all the parameters together for creating the signature
        signing_parameters = {}
        for k, v in header_parameters.items():
            signing_parameters[k] = v
        for k, v in str_param_data.items():
            signing_parameters[k] = v

        # create the signature and add it to the header parameters
        header_parameters["oauth_signature"] = self.get_signature("GET", url, signing_parameters)

        # add the OAuth headers
        header_string = "OAuth "
        count = 0
        for k, v in header_parameters.items():
            header_string += urllib.parse.quote_plus(str(k))
            header_string += "=\""
            header_string += urllib.parse.quote_plus(str(v))
            header_string += "\""
            count += 1
            if count < 7:
                header_string += ", "

        headers = {
            "Authorization": header_string
        }

        # create the full url including parameters
        url = url + "?" + urllib.parse.urlencode(str_param_data)
        request = urllib.request.Request(url, headers=headers)

        # make the API request
        try:
            response = urllib.request.urlopen(request)
        except urllib.error.HTTPError as e:
            print(e)
            raise e
        except urllib.error.URLError as e:
            print(e)
            raise e

        # read the response and return the json
        raw_data = response.read().decode("utf-8")
        return json.loads(raw_data)

Putting this all together, we have a simple Python class that acts as an API wrapper for GET requests to the Twitter REST API, including the signing and authentication of those requests. Using it is as simple as:

ta = Twitter_API()

# retrieve tweets for a user
params = {
   "screen_name": "martinjc",
}

user_tweets = ta.query_get("statuses", "user_timeline", params)

The full code is online on Github, and is released under an Apache 2.0 Licence.


Filed Under: Blog Tagged With: api, coding, compj, python, tools, twitter
