Users rarely verify screenshots of social media posts before sharing them, which can contribute to the spread of misinformation and disinformation. We are developing an automated tool to estimate the probability that a screenshot of a social media post is fake. In many cases, web archives can be used to validate the attribution of such screenshots.
Web Archives for Verifying Attribution in Twitter Screenshots
1. Web Archives for Verifying Attribution in Twitter Screenshots
Presented By:
Tarannum Zaki, PhD Student
Advisors: Dr. Michael L. Nelson & Dr. Michele C. Weigle
Department of Computer Science
Old Dominion University, Norfolk, Virginia
April 26, 2024
@tarannum_zaki @WebSciDL
2024 Web Science and Digital Libraries Research Group Expo
2. Screenshots are commonly used to annotate the social media of others
Tarannum Zaki WSDL Research Group Expo 2024 Web Archives for Verifying Attribution in Twitter Screenshots
https://twitter.com/BetteMidler/status/1541472225341198338
https://twitter.com/MahyarTousi/status/1534307163073658881
https://twitter.com/urbanachievr/status/1505944201208516612
3. Why screenshots?
To use as evidence of deleted posts
https://web.archive.org/web/20220525125749/https://twitter.com/DanielDefense/status/1526237750277681154
Controversial posts may be deleted.
https://twitter.com/ashtonpittman/status/1530243294868930560
https://twitter.com/DanielDefense/status/1526237750277681154
Other reasons: to deny cross-platform engagement, to aggregate, to mark up, etc.
4. Did they really post that?
Screenshots can also be used for humor, satire, and disinformation
https://twitter.com/Shayan86/status/1515753937139388418
https://twitter.com/paulthacker11/status/1495436489492090881
https://twitter.com/elonmusk/status/1544051155562598401
5. Creating fake tweets using Tweetgen
https://www.tweetgen.com/
https://www.tweetgen.com/create/tweet.html
6. Using the live web and web archives to validate attribution of screenshots
https://www.google.com/search
https://archive.org/web/
https://www.reuters.com/
https://www.snopes.com/
7. Motivation
➢ Fake tweets can be responsible for misinformation/disinformation spread.
➢ Fake tweets are easy to create using online tools.
➢ There are no tools currently available to evaluate the authenticity of attribution of screenshots.
8. Aim
To develop a tool that uses the live web and web archives to automatically estimate the probability that the post shown in a screenshot was actually posted by the alleged author.
9. To search for a tweet in the Wayback Machine, you must first know its URL
https://web.archive.org/web/20220323185843/https://twitter.com/annaturley/status/1506706947239817224
URL of the tweet:
https://twitter.com/annaturley/status/1506706947239817224
https://web.archive.org/
10. But the URL of a tweet is not present in most screenshots
https://twitter.com/AaronBastani/status/1507391218854117377
@annaturley
March 23, 2022
March 25, 2022
https://twitter.com/TWITTER_HANDLE/status/TWEET_ID
https://web.archive.org/web/20220323185843/https://twitter.com/annaturley/status/1506706947239817224
The tweet ID encodes the timestamp of when the tweet was created
Construction of a tweet URL
- Use the Twitter handle and approximate a time window based on the timestamp.
- Construct URL for the tweet.
- Search for the tweet in the Wayback Machine using the URL.
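The URL pattern on this slide can be sketched in Python. This is a minimal illustration rather than the authors' code: the function names `tweet_url`, `status_prefix`, and `wayback_url` are hypothetical, while the handle, tweet ID, and 14-digit capture timestamp come from the example above.

```python
TWITTER_BASE = "https://twitter.com"
WAYBACK_BASE = "https://web.archive.org/web"


def tweet_url(handle, tweet_id):
    """Fill in the https://twitter.com/TWITTER_HANDLE/status/TWEET_ID pattern."""
    return f"{TWITTER_BASE}/{handle.lstrip('@')}/status/{tweet_id}"


def status_prefix(handle):
    """URL prefix shared by all of an account's tweets (useful when the ID is unknown)."""
    return f"{TWITTER_BASE}/{handle.lstrip('@')}/status"


def wayback_url(capture_timestamp, original_url):
    """Address of one archived capture: <Wayback base>/<YYYYMMDDhhmmss>/<original URL>."""
    return f"{WAYBACK_BASE}/{capture_timestamp}/{original_url}"


url = tweet_url("@annaturley", 1506706947239817224)
print(url)
print(wayback_url("20220323185843", url))
print(status_prefix("@annaturley"))
```

When only the handle is known, as in most screenshots, `status_prefix` gives the part of the URL that can still be built, which is what a prefix search needs.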
11. Verifying if screenshot exists in the Wayback Machine
12. Creating a dataset of screenshots collected from Twitter
Fields: Shared post’s URL, Original post’s URL, Category, Reason, Content category, Structural features, Post type, Social media, Search strategy, Annotated images, Screenshot, Remarks
- Screenshot images shared on Twitter.
- 200 examples
- Examples include both real and fake screenshots
https://ws-dl.blogspot.com/2022/12/2022-12-12-disinformation-spread-on.html
https://twitter.com/rvawonk/status/1503227687917305863
https://twitter.com/RealCandaceO/status/1501576352587292673
Category: Real
Reason: Found in the live web
Content category: Politics
Post Type: Tweet
Structural features: Single author, single post
Search strategy: Searched on Twitter interface
Social media: Twitter
Original post’s URL
Shared post’s URL
Screenshot
13. OCRing screenshots: Single tweet images
Optical character recognition (OCR) extracts information as text from a digital image.
Example screenshot image OCR extracted output
Twitter Handle
Timestamp
Tweet Text
Zaki, T., Nelson, M.L., and Weigle, M.C. (2023, Jun 14). Extracting Information from Twitter Screenshots. Tech Report arXiv:2306.08236. https://doi.org/10.48550/arXiv.2306.08236
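Once OCR has produced plain text, the three fields listed above could be pulled out with simple pattern matching. The sketch below is an illustration under stated assumptions, not the method from the tech report: it assumes the OCR text is already available, that the layout is display name, handle, tweet text, then a line such as "6:41 PM · Mar 23, 2022", and that month names are in English. The sample text, the regular expressions, and the name `parse_ocr_text` are all hypothetical.

```python
import re
from datetime import datetime

HANDLE_RE = re.compile(r"@(\w{1,15})")
# Time, AM/PM, any separator (OCR often garbles the dot), then "Mon D, YYYY"
TIMESTAMP_RE = re.compile(
    r"(\d{1,2}:\d{2})\s?([AP]M)\W+([A-Z][a-z]{2}) (\d{1,2}), (\d{4})"
)


def parse_ocr_text(ocr_text):
    """Split OCR output into handle, tweet text, and timestamp (assumed layout)."""
    lines = [ln.strip() for ln in ocr_text.splitlines() if ln.strip()]
    handle, timestamp, body = None, None, []
    for line in lines:
        ts = TIMESTAMP_RE.search(line)
        if ts:
            clock, ampm, month, day, year = ts.groups()
            timestamp = datetime.strptime(
                f"{clock} {ampm} {month} {day}, {year}", "%I:%M %p %b %d, %Y"
            )
            break  # lines after the timestamp (engagement counts) are ignored
        if handle is None:
            found = HANDLE_RE.search(line)
            if found:
                handle = found.group(1)
            continue  # header lines (display name, handle) are not tweet text
        body.append(line)
    return {"handle": handle, "timestamp": timestamp, "text": " ".join(body)}


SAMPLE_OCR = """Example Name
@example_user
This is the text of the tweet,
wrapped over two lines.
6:41 PM · Mar 23, 2022
12 Retweets 34 Likes
"""

print(parse_ocr_text(SAMPLE_OCR))
```

Real OCR output is noisier than this sample, so a production version would need more tolerant patterns and fallbacks.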
14. Computing a time window based on the screenshot timestamp
The maximum difference between two time zones on Earth is 26 hours, so the time window extends 26 hours before and after the screenshot timestamp (52 hours in total).
Example screenshot image OCR extracted output
Twitter handle and computed timestamps
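The window computation can be sketched as follows. This assumes the window is symmetric (26 hours on each side of the screenshot timestamp) and that boundaries are formatted as 14-digit Wayback-style timestamps. The screenshot time used here is hypothetical, chosen so that the left boundary equals the `from` value 20220218154100 that appears in the CDX request on the next slide; the function name `time_window` is also an assumption.

```python
from datetime import datetime, timedelta

MAX_TZ_DIFFERENCE = timedelta(hours=26)  # largest gap between two time zones
WAYBACK_FORMAT = "%Y%m%d%H%M%S"          # 14-digit YYYYMMDDhhmmss timestamps


def time_window(screenshot_time):
    """Return (left, right) boundaries 26 hours either side of the screenshot time."""
    left = screenshot_time - MAX_TZ_DIFFERENCE
    right = screenshot_time + MAX_TZ_DIFFERENCE
    return left.strftime(WAYBACK_FORMAT), right.strftime(WAYBACK_FORMAT)


# Hypothetical screenshot timestamp: 5:41 PM, Feb 19, 2022 (time zone unknown)
left_boundary, right_boundary = time_window(datetime(2022, 2, 19, 17, 41))
print(left_boundary, right_boundary)
```

Widening the window in both directions means the true posting time falls inside it regardless of which time zone the screenshot was taken in.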
15. Using the CDX API to retrieve archived tweets from the left-hand boundary of the time window
urir = "https://twitter.com/" + "randyhillier" + "/status"
params = "&matchType=prefix&from=" + "20220218154100"
request = "http://web.archive.org/cdx/search/cdx?url=" + urir + params
CDX API prefix search process
Twitter handle and computed timestamps
Output: Retrieved archived tweets with the left-hand boundary (cropped).
https://archive.org/help/wayback_api.php
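A slightly more defensive way to build the same prefix query is to let `urllib.parse.urlencode` handle escaping, and to split each returned line into named fields. This is a sketch under assumptions: the helper names are hypothetical, the field order is the CDX server's default plain-text output, and the digest and length values in the sample line are invented for illustration. No network request is made here.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"
# Default field order of the CDX server's plain-text output
CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "length"]


def build_cdx_request(handle, left_boundary):
    """Prefix query for every capture under the account's /status path."""
    params = {
        "url": f"https://twitter.com/{handle}/status",
        "matchType": "prefix",
        "from": left_boundary,  # 14-digit left-hand boundary of the time window
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_line(line):
    """Map one space-separated CDX line onto the default field names."""
    return dict(zip(CDX_FIELDS, line.split()))


request = build_cdx_request("randyhillier", "20220218154100")
print(request)

# Illustrative line: URL and capture timestamp from the slides, digest/length made up
sample_line = (
    "com,twitter)/randyhillier/status/1006984708109099008 20220222163926 "
    "https://twitter.com/randyhillier/status/1006984708109099008 "
    "text/html 200 EXAMPLEDIGEST 1234"
)
record = parse_cdx_line(sample_line)
print(record["timestamp"], record["original"])
```

The response could then be fetched with any HTTP client and parsed line by line with `parse_cdx_line`.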
16. Extracting tweet IDs and determining tweet creation timestamps using TweetedAt
https://web.archive.org/web/20220222163926/https://twitter.com/randyhillier/status/1006984708109099008
https://ws-dl.blogspot.com/2019/08/2019-08-03-tweetedat-finding-tweet.html
Each tweet ID encodes its creation timestamp
An archived tweet’s URL
https://oduwsdl.github.io/tweetedat/#1006984708109099008
Tweet ID             Tweet creation date
1006984708109099008  20180613194037
…………                 …………..
Mapping between all the tweet IDs and tweet creation timestamps
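For Snowflake-style tweet IDs, the creation time can be decoded directly: the bits above the lowest 22 hold milliseconds since the Twitter epoch. The sketch below is not TweetedAt itself, only the standard Snowflake decoding; it does not handle pre-Snowflake IDs (tweets before November 2010). The function names are hypothetical.

```python
import re
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # Snowflake epoch: 2010-11-04 01:42:54.657 UTC
TWEET_ID_RE = re.compile(r"/status/(\d+)")


def extract_tweet_id(url):
    """Return the numeric tweet ID from a live or archived tweet URL, or None."""
    matches = TWEET_ID_RE.findall(url)
    return int(matches[-1]) if matches else None


def creation_timestamp(tweet_id):
    """Decode a Snowflake tweet ID into a 14-digit UTC timestamp."""
    millis = (tweet_id >> 22) + TWITTER_EPOCH_MS
    created = datetime.fromtimestamp(millis // 1000, tz=timezone.utc)
    return created.strftime("%Y%m%d%H%M%S")


archived = ("https://web.archive.org/web/20220222163926/"
            "https://twitter.com/randyhillier/status/1006984708109099008")
tweet_id = extract_tweet_id(archived)
print(tweet_id, creation_timestamp(tweet_id))  # 1006984708109099008 20180613194037
```

The printed value agrees with the row shown in the mapping table on this slide.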
17. Determining the final set of archived tweets by filtering the tweet creation timestamps within the time window
Output: 917 archived tweets with the left-hand boundary (cropped)
Mapping between tweet IDs and tweet creation timestamps
Tweets whose creation timestamps do not fall within the 52-hour time window are filtered out.
449 archived tweets
Multiple mementos are filtered out.
29 archived tweets
Output: 29 archived tweets within the 52-hour time window (cropped)
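The two filtering steps can be sketched together. This assumes that "multiple mementos" means several captures of the same tweet, that the first capture seen is kept, and that the window is expressed in UTC; the function names and the third capture timestamp in the sample data are hypothetical.

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # Snowflake epoch in milliseconds


def creation_time(tweet_id):
    """Creation time (UTC) decoded from a Snowflake tweet ID."""
    millis = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(millis // 1000, tz=timezone.utc)


def filter_mementos(mementos, window_start, window_end):
    """Keep one capture per tweet whose creation time lies inside the window.

    mementos: iterable of (capture_timestamp, tweet_id) pairs.
    Returns {tweet_id: capture_timestamp}.
    """
    selected = {}
    for capture_timestamp, tweet_id in mementos:
        if not (window_start <= creation_time(tweet_id) <= window_end):
            continue  # created outside the 52-hour window
        selected.setdefault(tweet_id, capture_timestamp)  # drop repeat mementos
    return selected


mementos = [
    ("20220222163926", 1006984708109099008),  # created in 2018: outside the window
    ("20220220024223", 1495226962058649603),
    ("20220221000000", 1495226962058649603),  # hypothetical second capture
]
start = datetime(2022, 2, 18, 15, 41, tzinfo=timezone.utc)
end = datetime(2022, 2, 20, 19, 41, tzinfo=timezone.utc)
print(filter_mementos(mementos, start, end))
```

Keying the result on tweet ID makes the de-duplication implicit, so the output size corresponds to the number of distinct candidate tweets.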
18. Extracting tweet text from archived tweets using BeautifulSoup and Selenium
https://web.archive.org/web/20220220024223/https://twitter.com/randyhillier/status/1495226962058649603
TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text
An archived tweet’s URL
Extracted text from archived tweet
HTML tag containing the tweet text
https://www.selenium.dev/
https://pypi.org/project/beautifulsoup4/
Selenium automates the browser that loads each archived page, and BeautifulSoup parses the tweet text from the resulting HTML.
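The parsing step can be illustrated without third-party packages. The sketch below deliberately swaps in Python's standard `html.parser` as a stand-in for BeautifulSoup so that it runs anywhere; with BeautifulSoup the same lookup would be a search for the element whose class list contains `tweet-text`. The sample HTML is invented (only the class string comes from the slide), and loading the page with Selenium is omitted.

```python
from html.parser import HTMLParser


class TweetTextParser(HTMLParser):
    """Collect text inside the first element whose class list has `target_class`."""

    def __init__(self, target_class="tweet-text"):
        super().__init__()
        self.target_class = target_class
        self.target_tag = None  # tag name of the matched element
        self.depth = 0          # > 0 while inside the matched element
        self.finished = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.finished:
            return
        if self.depth == 0:
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.target_tag, self.depth = tag, 1
        elif tag == self.target_tag:
            self.depth += 1  # nested element with the same tag name

    def handle_endtag(self, tag):
        if self.depth and tag == self.target_tag:
            self.depth -= 1
            self.finished = self.depth == 0

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_tweet_text(html):
    """Return the whitespace-normalised text of the tweet-text element."""
    parser = TweetTextParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


SAMPLE_HTML = """
<div class="tweet">
  <p class="TweetTextSize TweetTextSize--jumbo js-tweet-text tweet-text" lang="en">
    Archived tweets can be <b>checked</b> against screenshots.
  </p>
  <p class="other">Not part of the tweet text.</p>
</div>
"""
print(extract_tweet_text(SAMPLE_HTML))
```

Archived pages from different years use different markup, so the class name would likely need to vary with the capture date.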
19. Computing text similarity score between tweet text from screenshot and archived tweets using Python’s difflib library
https://docs.python.org/3/library/difflib.html
Example screenshot image Extracted text from archived tweet Extracted tweet text from screenshot
match_score(Archived_Tweet_Text, Screenshot_Tweet_Text) = 81.40%
Text similarity score is computed based on longest common subsequence
match_score(Archived_Tweet_Text1, Screenshot_Tweet_Text) = 81.40%
match_score(Archived_Tweet_Text2, Screenshot_Tweet_Text) = 30.78%
match_score(Archived_Tweet_Text3, Screenshot_Tweet_Text) = 5.67%
……………..
A match score of 81.40% is evidence that the tweet shown in the screenshot was posted by the alleged author.
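A score of this kind can be computed with `difflib.SequenceMatcher`. The slides name only the difflib library, so the use of `SequenceMatcher.ratio()` here is an assumption, as are the function names and the short sample strings. Note that `SequenceMatcher` works by repeatedly finding the longest contiguous matching block, and `ratio()` returns 2·M/T, where M is the number of matched characters and T is the combined length of both strings.

```python
from difflib import SequenceMatcher


def match_score(archived_text, screenshot_text):
    """Similarity between two texts as a percentage between 0 and 100."""
    return 100 * SequenceMatcher(None, archived_text, screenshot_text).ratio()


def best_match(archived_texts, screenshot_text):
    """Return (score, text) for the archived text most similar to the screenshot."""
    scored = ((match_score(text, screenshot_text), text) for text in archived_texts)
    return max(scored, key=lambda pair: pair[0])


# "abcd" vs "bcde": 3 matching characters, 8 in total, so 2 * 3 / 8 = 75%
print(match_score("abcd", "bcde"))
print(best_match(["xyz", "abcd"], "bcde"))
```

For texts of 200 characters or more, `SequenceMatcher` applies its "autojunk" heuristic by default, which can change scores for long tweets; passing `autojunk=False` to the constructor disables it.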
20. A threshold of 60% produced the highest F1 (0.69)
Threshold  Precision  Recall  F1 score
90%        1.00       0.42    0.59
80%        1.00       0.49    0.66
70%        1.00       0.51    0.67
60%        1.00       0.53    0.69
Evaluated on 108 single-tweet images from the collected dataset.
Performance of the overlap between the tweet text from the screenshot and the archived tweets.
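The F1 column follows from precision and recall as F1 = 2PR / (P + R). The sketch below recomputes it from the rounded values in the table and shows how a threshold could turn a match score into a decision. The function names are hypothetical, and treating a score equal to the threshold as a match is an assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def is_match(score_percent, threshold_percent=60.0):
    """Treat the screenshot as found when the best score reaches the threshold."""
    return score_percent >= threshold_percent


# (precision, recall) pairs as reported in the table, keyed by threshold
REPORTED = {90: (1.00, 0.42), 80: (1.00, 0.49), 70: (1.00, 0.51), 60: (1.00, 0.53)}
for threshold, (precision, recall) in REPORTED.items():
    print(f"{threshold}%: F1 = {f1_score(precision, recall):.2f}")

print(is_match(81.40), is_match(30.78))
```

Recomputing from rounded inputs reproduces the 90%, 80%, and 60% rows; the 70% row differs in the second decimal place, which is expected if the reported figures were derived from unrounded precision and recall.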
21. Limitations & Future Work
OCR on complex screenshot images: the extracted output is mostly garbage.
22. Summary
➢ Screenshots are an easy way to share content on social media.
➢ Since screenshots can be easily faked, detecting fabricated posts is a critical task.
➢ Web archive services can help verify the attribution of a screenshot by finding an archived version of the screenshot's content.
➢ Our research aims to mitigate the spread of misinformation and disinformation on social media.