Blog

Report from the successful DataJConf & C+J Conference

26th June 2023 by Martin Chorley

Last week we held the 4th edition of the European Data & Computational Journalism Conference (#datajconf) in Zurich, Switzerland, and this time it was held jointly with the Computation + Journalism Conference (#cplusj), which is usually held in the US.

I’ve written about DataJConf previously, when we held the first edition in Dublin in 2017. We’ve always encouraged a strong multi-disciplinary approach, aiming to attract journalists, developers, industry professionals and academics from across many fields including journalism and media studies, computer science and data science. It was really gratifying to see that this year was no different, with a great mix of all these groups, and more, in attendance. Teaming up with C+J meant that we had more attendees from the US than we’ve had before, which I think was a great opportunity to strengthen links between Europe and the US, and also helped raise the profile of our European ‘little sister’ conference alongside the more established C+J conference.

This was our first conference back since the pandemic: it had been 4 years since we last got together and discussed all things Computational + Data Journalism. It was fascinating to see the progress that has been made in the last few years, and how both academic research and industry practice have evolved.

It’s no surprise that AI dominated much of the agenda, with some very interesting discussion of the use of generative AI by news and media organisations. What was most revealing though was that widespread adoption of generative AI tools is still some way off, if it ever happens at all. Most organisations that have experimented with these technologies have found that their unreliability and ‘hallucinations’ create all sorts of integrity and trust issues when using GenAI to create content, and that these generally outweigh the benefits. Many newsrooms are instead sticking, for now, with rule-based or template-based automation for content generation, which is much more reliable and controllable. Where GenAI has found use is in transforming or summarising existing content, which again can be more controlled and reliable. On the AI front there was also some discussion of deepfakes and the potential issues arising there, though little discussion of solutions: perhaps because we don’t have them, or perhaps because existing fact-checking and verification techniques are already sufficient to deal with the problem.

The other big development that was noticeable at this conference was the increase in algorithmic accountability efforts: news organisations and others working to investigate the impacts and biases present in algorithmic decision-making processes that have a real impact on people and society. This is an increasingly important area of concern that was not really touched upon in previous editions of the conference, but which is now a focus for many teams.

It’s gratifying to see so much algorithmic accountability reporting at #cplusj #datajconf this year, including that presented by ⁦@zehnzehen⁩ this morning: pic.twitter.com/kCwDra8X3U
— Nicholas Diakopoulos (@ndiakopoulos) June 24, 2023

The position of teams was also an interesting revelation from some of the talks and discussions. Where previously the EU has perhaps lagged behind the US in terms of the development and position of data/computational teams in newsrooms and other media organisations, it does feel like there has been a change in the intervening years in some organisations, where these teams are now bigger, more established, and perhaps more central. Certainly, whereas in our first edition much of the talk was about small, niche teams on the fringes of media operations, this time it felt like we had a number of talks where the data/computational teams were key to some of the organisational strategy. A good development to see, though it is also clear from a number of talks that there is still work to do in this area.

For a really nice roundup of some of the issues I’ve touched on above, and more, see this thread from Jim Haddadin:

Just wrapped at #cplusJ #datajconf at @ETH in Zurich! Thanks to the @datajconf organizers for a super fascinating set of discussions about the future of journalism and technology. Sharing 5 takeaways… pic.twitter.com/EndCNcSFhL
— Jim Haddadin (@JimHaddadin) June 24, 2023

You can also go back and check through the hashtag to see how the conference unfolded.

Overall it was a really successful conference. The local team did a great job pulling it together, and it was great to catch up with familiar faces from previous editions and conferences, and to meet new people too. We’re looking forward to the next edition …

Finding donors to Truss leadership campaign, via Datasette

8th October 2022 by Aidan O'Donnell

Liz Truss now has the job of leading the Conservative Party and of running the country. So who gave her the money for her campaign?

MPs in Britain have to declare any money they receive, via donations or second jobs for example, and this is listed in the Register of Members’ Financial Interests.

The website explains that the Register exists to provide information about “any financial interest which a Member has, or any benefit which he or she receives” and the reason that this matters is that, according to the website, “others might reasonably consider” this money could influence the MP.

The Register is updated regularly but the data is laid out as text on a web page and only the line breaks serve to distinguish one field from another. This makes it very difficult to scrape the data or interrogate it for trends.

Datasette

This is where we turn to Datasette, a lovely tool for “exploring and publishing data”, built and maintained by Simon Willison, which lets you run SQL queries against the data. Happily, there is already an example in place for the Register of Members’ Interests.

Querying

There are four main tables in this instance of Datasette: categories, items, members, and people. A query of the members table (using “View and edit SQL”) returns two ids we can use to look up Liz Truss in the main items table: a member number (40560) in the “id” column and a person number (24941) in the “person_id” column.

SELECT * FROM members WHERE name = 'Elizabeth Truss'

The table with the crucial information, items, has close to two million entries. But, as Simon Willison explains, its members field seems to stop around 2015, so the person field is a better choice. Querying Truss in items via her “person_id”:

SELECT * FROM items WHERE person_id = 'uk.org.publicwhip/person/24941' ORDER BY date DESC

returns just over 950 entries, from 2010 to 2022.
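The same query can also be sent from a script, since Datasette serves query results as JSON. Here is a minimal sketch of building that URL, assuming the instance follows Datasette's usual pattern of a database name followed by .json and a sql parameter. The host example.com and the database name regmem are placeholders for illustration, not details taken from this post.

```python
from urllib.parse import urlencode


def datasette_query_url(base_url: str, database: str, sql: str) -> str:
    """Build a URL asking a Datasette instance to run `sql` and return JSON."""
    # _shape=array asks Datasette for a plain list of row objects
    params = urlencode({"sql": sql, "_shape": "array"})
    return f"{base_url.rstrip('/')}/{database}.json?{params}"


SQL = (
    "SELECT * FROM items "
    "WHERE person_id = 'uk.org.publicwhip/person/24941' "
    "ORDER BY date DESC"
)

# Placeholder host and database name (assumptions, swap in the real instance)
url = datasette_query_url("https://example.com", "regmem", SQL)
print(url)
```

Fetching that URL (with urllib.request or the requests package) should then give back the rows as a JSON list, ready for pandas or further cleaning.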

But if you just want the 2022 donations:

SELECT * FROM items WHERE person_id = 'uk.org.publicwhip/person/24941' AND date LIKE '2022%' ORDER BY date DESC

or more precisely again, just the donation descriptions that mention the word “campaign”:

SELECT * FROM items WHERE person_id = 'uk.org.publicwhip/person/24941' AND item LIKE '%campaign%'

This last query returns 48 donations, which you can then download as a csv or json from Datasette. Here is that data as a csv, after some further cleaning.

truss_donations2-1 Download
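To see how the two LIKE filters above behave, here is a small sketch that runs them against an in-memory SQLite table (Datasette sits on top of SQLite). The table has only the three columns used in the queries, and every row is a made-up placeholder rather than real Register data. It also assumes dates are stored as text starting with the year, which is what the '2022%' pattern implies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (person_id TEXT, date TEXT, item TEXT)")

TRUSS = "uk.org.publicwhip/person/24941"

# Made-up placeholder rows, only there to exercise the filters
rows = [
    (TRUSS, "2022-09-01", "Donation towards leadership Campaign (placeholder)"),
    (TRUSS, "2022-03-01", "Payment for a speech (placeholder)"),
    (TRUSS, "2015-06-01", "Donation to election campaign (placeholder)"),
    ("uk.org.publicwhip/person/00000", "2022-08-01", "Someone else's campaign (placeholder)"),
]
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", rows)


def query(sql, params):
    """Run a parameterised query and return all rows."""
    return conn.execute(sql, params).fetchall()


# Entries from 2022 only: the date text must start with '2022'
in_2022 = query(
    "SELECT date, item FROM items "
    "WHERE person_id = ? AND date LIKE '2022%' ORDER BY date DESC",
    (TRUSS,),
)

# Entries mentioning 'campaign' anywhere in the description
campaign = query(
    "SELECT date, item FROM items "
    "WHERE person_id = ? AND item LIKE '%campaign%' ORDER BY date DESC",
    (TRUSS,),
)

print(len(in_2022), "entries in 2022;", len(campaign), "mention 'campaign'")
```

Note that SQLite's LIKE ignores case for ASCII letters by default, so '%campaign%' also catches "Campaign" with a capital C.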

Answers

Some initial observations on the donations are that:

£120,000 came from six companies: Big Bang Films, JC Bamford Excavators, Grolar Developments, SJJ Contracts, Smoked Salmon and Tungsten West. JC Bamford is the only one to have also donated to the wider Tory party in the last two years.

A little over £700,000 came from 13 people: Natasha Barnaba, Linda Edwards, Clara Freeman, Alison Frost, Fitriani Hay, Phillip Jeans, Gary Mond, Jon Moynihan, Sheila Noakes, Gordon Phillips, Howard Shore, Michael Spencer and Barbara Yerolemou.

A further £85,000 appeared to be help with transport, from Graham Edwards, Tony Gallagher, Greville Howard, Andrew Law and Nigel Vinson.

These initial observations however are just a starting point.

Sharing Jupyter notebooks online

20th March 2021 by Aidan O'Donnell

You have a great Jupyter notebook you’ve been working on. If only you could share it with the world: here are some options for getting your notebook online.

If you just want to show a notebook to people without them running the code, nbviewer does the job by showing the cells and their output (beware long dataframes that won’t be cropped). Just put the notebook file (.ipynb) on GitHub and supply the link to nbviewer. If your visitor likes what they see, they can immediately launch a functioning version via a Binder link, or download the .ipynb file. Here’s a simple example of what the user sees.

If your notebook is in a GitHub repo you can skip nbviewer and build a working version of the notebook via https://mybinder.org/. Just supply the repository url and it will serve up all the .ipynb files, with the notebook cells ready to run.
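Both services can also be reached with a direct link rather than by pasting into their front pages. A rough sketch, assuming the URL patterns nbviewer and Binder use for GitHub-hosted notebooks at the time of writing. The user, repo and file names are hypothetical.

```python
def nbviewer_url(user: str, repo: str, branch: str, notebook_path: str) -> str:
    """Link that renders a GitHub-hosted notebook read-only on nbviewer."""
    return f"https://nbviewer.org/github/{user}/{repo}/blob/{branch}/{notebook_path}"


def binder_url(user: str, repo: str, ref: str) -> str:
    """Link that asks Binder to build and launch a GitHub repo."""
    return f"https://mybinder.org/v2/gh/{user}/{repo}/{ref}"


# Hypothetical repository, for illustration only
print(nbviewer_url("some-user", "some-repo", "main", "analysis.ipynb"))
print(binder_url("some-user", "some-repo", "main"))
```

It is worth checking a generated link in the browser before sharing it, since both services may change their URL schemes.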

jupyter{book} lets you build a complete book using notebook elements. Here’s an example with some notebooks.

The Voilà package “turns Jupyter notebooks into standalone web applications” or, if you prefer, it puts only the cell output on the webpage. Where it gets really useful is when you add widgets from ipywidgets to allow user interaction.

A GitHub repo with GitHub Pages enabled can run as a webpage using a package called nbinteract, but I’ve found it has trouble loading widgets, as seen in some of the tutorial pages.

Of course, Jupyter notebooks are not the only option: there are Kaggle, Google Colab and many more. There’s an episode of the podcast Talk Python To Me about a paper that reviewed 60 (!) different notebooks.

What we did in 12 weeks of data journalism

9th February 2021 by Aidan O'Donnell

Now that we’ve finished a first semester of data journalism work, I’ve put the module details online. It’s not an online course but it does have our Reading List, a running list of Interesting Datasets for practice (or work) and outlines of what we did each week in the Great Academic Year Of The Pandemic (some in great detail, others less so).

The course ran over 12 weeks and covered … “everything”.

If you want to see more, there are some catalogues online of what people are doing when they teach data journalism: this one started a few years ago by Dan Nguyen, this from the IJEC and this list by Jonathan Gray.

Some APIs for journalism

22nd November 2020 by Aidan O'Donnell

This month we find ourselves digging up data with the help of APIs. While there are oodles of APIs for different things (there’s a Star Wars API and an ISS API and many many others), I wondered which endpoints might be interesting for journalists. So here is a list of some of them — we’ll add to it as we find more — starting with government and moving on to business, health and … where you can charge your electric car.

* means an API key is required, ** means an API key plus extra authentication is required
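In practice, an API key usually travels with each request, often as a query parameter and sometimes as a header. A minimal sketch of the query-parameter style follows. The endpoint, the parameter name api_key and the environment variable EXAMPLE_API_KEY are all assumptions for illustration, so check each API's documentation for what it actually expects.

```python
import os
from urllib.parse import urlencode


def build_api_url(endpoint, params, api_key=None):
    """Return `endpoint` with `params` (and an optional key) as a query string."""
    query = dict(params)
    if api_key is not None:
        # Parameter name is an assumption; some APIs use a header instead
        query["api_key"] = api_key
    return f"{endpoint}?{urlencode(query)}"


# Keep keys out of the code: read them from the environment instead
key = os.environ.get("EXAMPLE_API_KEY", "demo-key")

# Hypothetical endpoint, for illustration only
url = build_api_url("https://api.example.com/v1/records", {"q": "donations", "page": 1}, api_key=key)
print(url)
```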

Government

Covid, weather etc.

US Federal Election Commission (FEC)*

Media

Tim Harford’s lethal bathtub

15th September 2020 by Aidan O'Donnell

Tim Harford’s books are on the reading list for journalism students at Jomec and we are big fans of More or Less. And this month he supplied us all with a great case of numbers going wrong, in a piece for the Financial Times.

You can listen to him explaining it on Radio 4’s The World at One (segment at 17′ 52″).

The thinking — about how dangerous UK life is during the Covid pandemic — goes like this:

Every day in the UK about 40 people out of a million get the virus (ONS).
How dangerous is it if you’re one of the forty? If you’re aged 60, you have roughly a 1% chance of dying if you catch it.
1% of ’40 in a million’ gets you to almost a 1 in 2 million chance of dying. So, if you are 60 and live in the UK at the moment (and are exposed to the typical risk in the UK) there’s a 1 in 2 million chance Covid will kill you.
Or make that a one in a million chance if you include ‘serious injury’ since another 1% of the ’40 in a million’ who catch it are left with health problems.

Everything, Tim Harford says, is fine up to here. But then he looked for other things that had a one in a million chance of death / serious injury. One of them, he explained to The World at One, was “taking a bath”.

“So when I discovered this I thought ‘oh, I wonder what else is about that risky?’ […] So when I wrote this all up for the Financial Times I just — as an afterthought, having worked so carefully to get all my Covid maths right — I just said ‘it’s a bit like riding a horse, riding a motorbike, going skiing, or taking a bath'”.

This is the error. The risk of dying in the bath is one in three million every year — not every time you take a bath. As Tim Harford remarks “Covid is no more risky than you thought. And taking a bath is much safer than you thought”.
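The arithmetic can be checked in a few lines. This sketch uses only the figures quoted above, plus one assumption the post does not state: that the yearly bath risk is spread evenly across 365 days, so that it can be compared with the daily Covid figure.

```python
from fractions import Fraction


def one_in(probability: Fraction) -> int:
    """Express a probability as the N in '1 in N' (rounded)."""
    return round(1 / probability)


# Covid: 40 in a million catch it per day, 1% of those die (age 60)
infection_per_day = Fraction(40, 1_000_000)
death_if_infected = Fraction(1, 100)
covid_death_per_day = infection_per_day * death_if_infected

# Another 1% are left with health problems, doubling the figure
covid_harm_per_day = 2 * covid_death_per_day

# Bath: one in three million per YEAR, not per bath
bath_death_per_year = Fraction(1, 3_000_000)
# Assumption: risk spread evenly over the year
bath_death_per_day = bath_death_per_year / 365

print("Covid death, per day: 1 in", one_in(covid_death_per_day))
print("Covid death or serious injury, per day: 1 in", one_in(covid_harm_per_day))
print("Bath death, per day: 1 in", one_in(bath_death_per_day))
print("Covid daily risk is", covid_death_per_day / bath_death_per_day, "times the bath's")
```

The exact Covid figures come out at 1 in 2.5 million and 1 in 1.25 million, which the piece rounds to "1 in 2 million" and "one in a million". The daily bath risk, on the even-spread assumption, is closer to 1 in a billion, which is why the comparison does not hold.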

Nonetheless, “That is the most shared thing I’ve ever said because it’s the most interesting thing I’ve ever said […] because it happens to be wrong”.

It is, as he observed, an instructive case of how mistakes happen and what newspeople pick up on.

His full account of it is on twitter.

Our course after 6 months of Covid-19

28th August 2020 by Aidan O'Donnell

The Covid-19 pandemic shut down our schools at the end of March and sent staff and students alike home to work on their laptops. This meant MSc students finished their group projects using online platforms and started dissertation projects while trying to get back to their home countries, or while stuck in Cardiff.

That said, the summer months are probably the right time to be stranded here, when we get more sun than usual, if less than in hotter parts of the world.

A new cohort of students will be arriving in Cardiff next month. Our course this year will run both online and in classrooms for the first semester. The computer science courses will be taught online, while most of the journalism work will take place in classrooms.

There is of course a huge amount of data and data-related stories that have been published in recent months because of the pandemic. And, it appears, the data and the effects of the pandemic on societies around the world will keep coming for a while yet. So it is a good time to be working on this kind of material.

And the Americans are planning an election, which should keep us busy in November and the weeks before.

The Clwstwr news projects — update

20th July 2020 by Aidan O'Donnell

Clwstwr is a five-year programme in south Wales — run from Cardiff — that was started to encourage the development of original screen-related projects. ‘Screen’ here means anything that involves creative or technological industries in a broad sense. Since it was set up in early 2019, it has allocated funding and development support to 23 different projects to allow for original research and development.

Many of the projects have been underway for close to a year at this stage (a full list of the projects is here) and a few of them are of particular interest to us since they are working on news:

Artificial Intelligence in the newsroom

This project is investigating how to put the resources of the deep web at the disposal of working journalists, by using artificial intelligence. It’s run by the Cardiff team of Amplyfi, a company that uses tech for business intelligence, and the project aims to develop technology that will identify new entities that are emerging in the deep web, and especially new relationships between those entities.

Extracting court information for the press

The team behind the Caerphilly Observer are running this project, which will deal with court information (who’s appeared in court, who’s due to appear) that is often either unwieldy or downright inaccessible for journalists. The plan is to gather all this information for Magistrates Courts in Wales and make it available to journalists through a searchable database, which would greatly aid press coverage of local courts.

New ways of telling news stories

What’s the best way to tell a news story? This project is trying to answer this question by looking firstly at how people understand and respond to stories in general, and then by designing new journalism techniques that will allow the press to tell stories in the most effective way possible. It’s a radical re-evaluation of a journalistic storytelling tradition that has long worked just on the basis of ‘that’s how we’ve always done it!’.

News in school

This project will design “a pilot for regular news service delivered to pupils within school hours”. The idea is that teachers can use this service to complement their teaching and that a new generation of young people will be introduced to the idea of staying informed.

Journalism by Numbers — 2019 [Virtual] Summer School

26th June 2020 by Aidan O'Donnell

With Cardiff University buildings closed since March because of the Coronavirus pandemic, the Summer School for the public moved online in June, and included a one-hour session on what data journalists do.

The Summer School comprised a week of workshops that ranged from radiography and earth sciences to building design and writing for business.

In our rapid run-through of the data journalism world, we touched on classic go-to number stories like A&E waiting times and party-political donations, as well as how journalists dig up the data in the first place (FOI, web scraping and so on). We looked at visuals done with colouring pencils, graphing cleaner air in Cardiff during lockdown and the ongoing questions around who keeps an eye on the algorithms.

People appeared online for our Journalism by Numbers workshop from around Wales and the UK, but also from Pakistan, Sweden and Nigeria.

Other workshops during the week covered ethics in Artificial Intelligence, copywriting and Google analytics. There was also a session on the ever-interesting Pharmabee project (which launched the Spot-a-bee app this year as part of their bee-mapping project).