Webinar 3 - AI & Investigative Journalism - Training Slidedeck

AI and investigative journalism
Josh Nicholas
Data journalist
The Guardian

Agenda
● Introduction
● What is AI
○ Different forms
○ More than a black box
● Case studies
○ Extracting useful info from text
○ Fuzzy matching between datasets
○ Finding a needle in a haystack
● Homework
● Q + A
Code for all examples is on my GitHub
More resources in HANDOUT
After the session:
● Recording
● Handout
● Homework in our LinkedIn Group
● LINK to join

Artificial intelligence is a catch-all
● Many AI terms are used interchangeably
● We are going to focus on machine learning models
● These are algorithms that can learn their own rules from data
This graphic was adapted from Build a Large Language Model by Sebastian Raschka

Learning from the data
● Machines are great at identifying patterns that aren’t obvious to humans
● Given some examples to learn from, an algorithm can find more
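
A minimal sketch of "learning rules from examples", using the spam filter mentioned later in the deck as the task. It is a tiny naive Bayes classifier written with the standard library only; the function names and the four training messages are made up for illustration and are not from the session's code.

```python
import math
from collections import Counter


def train(examples):
    """Count how often each word appears under each label.

    examples: list of (text, label) pairs. Nothing here is a hand-written
    rule about spam; the counts are the "rules" the model learns.
    """
    word_counts = {}          # label -> Counter of words seen with that label
    label_counts = Counter()  # label -> number of training examples
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label whose learned word counts best explain the text."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # prior: how common is this label?
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # ignore words never seen in training
                # Add-one smoothing so unseen word/label pairs are not zero.
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training examples, invented for this sketch.
EXAMPLES = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("draft story for the editor", "ham"),
]
```

Calling `predict(train(EXAMPLES), "claim your free prize")` labels a message the model has never seen, based only on patterns in the examples. Real models work with far more data and far richer features, but the idea is the same.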

AI and newsgathering
● Machine-learning algorithms are trained on large datasets
○ They can be fine-tuned on smaller datasets
● They are useful for “fuzzy” problems, when it’s hard to write explicit
rules/instructions
● You can access many pre-trained algorithms for free e.g.
○ Huggingface.co
○ Google, OpenAI, Mistral, Facebook etc.
and…
● If we can’t find an algorithm that fits our purpose, we can fine-tune an existing one

Examples we can borrow from
• Email spam filters
• Recommendation systems (Netflix, Spotify etc.)
• Language translation
• Audio transcription
• Facial recognition
• Object detection
• Predictive text
• Search engines
■ Google BERT etc.

1) Extraction
The problem:
● Extracting names, locations and
dollar amounts from thousands of
text documents:
○ 34k+ Facebook posts
○ 2.4k media releases
● What if we don’t know the names
they’ll use?
● What if they say something vague like “a million for x”?
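
To make the problem concrete, here is a hedged sketch of the explicit-rules approach: a regular expression for dollar amounts. The pattern and the sample sentences are assumptions written for this illustration. It handles tidy amounts, but the vague phrasing above slips straight past it.

```python
import re

# A hand-written rule: a dollar sign, digits, and an optional scale word.
# This pattern is illustrative only; real text has many more variations.
AMOUNT_PATTERN = re.compile(
    r"\$\s?\d[\d,]*(?:\.\d+)?(?:\s?(?:million|billion|bn|m|k)\b)?",
    re.IGNORECASE,
)


def find_amounts(text):
    """Return every substring that matches the dollar-amount rule."""
    return [match.group(0) for match in AMOUNT_PATTERN.finditer(text)]


# Hypothetical sentences in the style of a media release.
clear = "We are delivering $2.5 million for the new pool"
vague = "That is a million for the surf club"
```

`find_amounts(clear)` picks up the amount, while `find_amounts(vague)` returns an empty list because there is no dollar sign or digit to anchor on. Every new phrasing needs another rule, which is why a model that has learned what money references look like is attractive here.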

● We scraped thousands of Facebook
posts and media releases from official
websites
● We used a pre-trained model from spaCy, a common Python library
● The model identified names, locations
and references to money in the texts
● Since 2022 these tools have become even easier to use
● You can also achieve similar results with GenAI tools like ChatGPT
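
A rough sketch of what entity extraction with a pre-trained spaCy pipeline can look like. The slides do not say which model was used, so `en_core_web_sm` is an assumption, as are the helper names and the choice of labels (`PERSON`, `GPE`, `MONEY` are labels used by spaCy's English pipelines). spaCy and the model must be installed separately.

```python
from collections import defaultdict

# Entity labels to keep: people, places (countries, cities, states), money.
WANTED_LABELS = {"PERSON", "GPE", "MONEY"}


def extract_entities(nlp, texts):
    """Run a spaCy pipeline over texts and group entity strings by label."""
    found = defaultdict(list)
    # nlp.pipe processes the texts as a stream, which suits large batches.
    for doc in nlp.pipe(texts):
        for ent in doc.ents:
            if ent.label_ in WANTED_LABELS:
                found[ent.label_].append(ent.text)
    return dict(found)


def load_model(name="en_core_web_sm"):
    """Load a pre-trained pipeline (model name is an assumption)."""
    import spacy  # imported here so the helper above has no hard dependency

    return spacy.load(name)
```

With spaCy installed, `extract_entities(load_model(), posts)` returns a dictionary such as `{"PERSON": [...], "GPE": [...], "MONEY": [...]}` for a list of post texts. The results still need checking by hand, since the model will miss some entities and mislabel others.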

2) Fuzzy matching
The problem:
● We need to connect datasets that are
slightly different
○ Josh Nicholas vs Joshua Nicholas
● Previously we used a method called
Levenshtein Distance
○ Matching every name against every
other name
○ It took ages!!
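
For reference, a minimal sketch of the brute-force approach described above: Levenshtein distance counts the single-character insertions, deletions and substitutions needed to turn one string into another. The function names and sample names are illustrative, not taken from the session's code.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def closest_matches(names_a, names_b):
    """For each name in names_a, find the nearest name in names_b.

    Every name is compared with every other name, so the work grows with
    len(names_a) * len(names_b). That is what makes this slow at scale.
    """
    matches = {}
    for name in names_a:
        best = min(names_b, key=lambda other: levenshtein(name.lower(), other.lower()))
        matches[name] = (best, levenshtein(name.lower(), best.lower()))
    return matches
```

Here `levenshtein("Josh Nicholas", "Joshua Nicholas")` is 2, because two letters are inserted. Two lists of 10,000 names each would need 100 million such comparisons.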

Making use of the AI ecosystem
● When you input text into a chatbot it
turns the text into a series of numbers
● We can use this same technique to
match names
• Find the numbers that are most
similar
● This same technique can be scaled to
full sentences or even entire documents
● Can also be run in reverse - what things
are least similar
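
The idea of turning text into numbers and comparing them can be sketched with the standard library alone. Note the stand-in: a real embedding model learns its numbers from data, whereas this sketch simply counts three-character chunks. The comparison step, cosine similarity between two vectors, is the same idea either way. All names below are illustrative.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy 'embedding': counts of overlapping n-character chunks.

    A stand-in for a learned embedding model, used here only so the
    example runs without any downloads.
    """
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(u, v):
    """Cosine similarity: 1.0 for identical direction, 0.0 for no overlap."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query, candidates):
    """Sort candidates from most to least similar to the query."""
    query_vec = embed(query)
    return sorted(candidates, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
```

`rank_by_similarity("Josh Nicholas", names)` puts the closest name first; reading the same list from the end gives the least similar items. Swapping `embed` for a pre-trained sentence embedding model lets the same ranking work on meaning rather than spelling.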

3) Finding a needle in a haystack
The problem:
● Who poses most with dogs, babies,
hi vis etc.?
● We need to search through
thousands of images, many of them
not captioned

Training a detection model
● There are loads of models that are immediately useful
• E.g. ones for workplace safety, that can identify hard hats etc.
• Also lots of free datasets online
● We manually created a training dataset with novelty cheques and hi vis vests
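
Once a detection model has labelled each image, answering "who poses most with X?" is a counting exercise. The sketch below assumes the detector's output has already been reduced to a list of labels per image; the file names, label names and people are hypothetical, and the detection step itself is not shown.

```python
from collections import Counter, defaultdict


def tally_props(detections, image_owner):
    """Count, per person, how many of their images contain each label.

    detections: image name -> list of labels a detector found in it
    image_owner: image name -> person who posted the image
    """
    counts = defaultdict(Counter)
    for image, labels in detections.items():
        person = image_owner.get(image)
        if person is None:
            continue  # skip images we cannot attribute to anyone
        # set() so two vests in one photo still count as one image
        counts[person].update(set(labels))
    return counts


def top_poser(counts, label):
    """Person with the most images containing the given label."""
    if not counts:
        return None
    return max(counts, key=lambda person: counts[person][label])


# Hypothetical detector output and image attribution.
DETECTIONS = {
    "img1.jpg": ["hi_vis_vest", "hard_hat"],
    "img2.jpg": ["novelty_cheque"],
    "img3.jpg": ["hi_vis_vest", "hi_vis_vest"],
}
OWNERS = {"img1.jpg": "MP A", "img2.jpg": "MP B", "img3.jpg": "MP A"}
```

`top_poser(tally_props(DETECTIONS, OWNERS), "hi_vis_vest")` returns the person with the most hi vis images. Counting images rather than raw detections avoids one crowded photo skewing the result.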

Quick summary
● Machine learning models can learn their own rules from the patterns in data
● This helps us when we need to work with fuzzier/unlabelled data
○ Images, entire documents etc.
● There are thousands of models available for free online
● We can fine-tune them for specific tasks if necessary
● They can be run directly or built into interfaces for common problems
● GenAI tools can often do the same tasks, but are harder to scale

Homework
● Homework 1 (if you can code):
○ Open the Hugging Face MODELS tab and choose a model that would solve an editorial problem for you
○ Try out the tool and share your results in the LinkedIn Group
■ Why/what did you choose?
● Homework 2 (if you can't code yet):
○ Open the Hugging Face SPACES tab and choose one of the tools
○ Give it a prompt and share your results in the LinkedIn Group
■ Why/what did you choose?
● How would this help in a journalism context?

How homework works
1. Join the Closed LinkedIn Group
2. Post your work for trainer feedback within 4 weeks
3. Leave constructive feedback on at least one other person’s post - within 2 weeks
4. Follow the Group Rules!

Any questions?
Josh Nicholas
Data journalist
The Guardian
josh.nicholas@theguardian.com
