Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages
Lulwah M. Alkwai, Michael L. Nelson, Michele C. Weigle (@weiglemc)
Web Sciences and Digital Libraries (WS-DL) Group (@WebSciDL)
Department of Computer Science
Old Dominion University
Norfolk, Virginia, USA
July 24, 2019 / ACM SIGIR 2019 / Paris, France
Published in ACM Transactions on Information Systems (TOIS), 36(1), July 2017
Extended version of Best Student Paper awardee at IEEE/ACM JCDL 2015
@weiglemc, @WebSciDL
Web archives are collections of web pages of the past
[Figure: archived page captures from 2000, 2007, 2009, and 2012]
ACM TOIS 36(1) 2017 / ACM SIGIR 2019 2
Web archives are essential for studying recent
history and culture
https://twitter.com/NetPreserve/status/1141321443373920256 (photo cropped and enlarged)
http://web.archive.org/web/19970222174751/http://www1.geocities.com/
The Internet Archive holds the largest web
archive
https://archive.org/web
But it’s not the only one
http://timetravel.mementoweb.org/list/19990518173206/http://geocities.com
https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives
We’ve studied recent (2010s) events in the
Middle East
• Iranian Elections and Protests - Jun 2009
– SalahEldeen and Nelson, TPDL 2012
• Egyptian Revolution - Jan 20 - Mar 1, 2011
– SalahEldeen and Nelson, TPDL 2012
– AlNoamany, Weigle, Nelson, ACM WebSci 2017
• Syrian Uprising - Mar 2012
– SalahEldeen and Nelson, TPDL 2012
• Egypt’s Presidential Election - 2012
– AlNoamany, Weigle, Nelson, TPDL 2015, IJDL 2016
But, we can only study the past Web that already
exists in the archives
11% of resources shared in social media disappear each year
Hany M. SalahEldeen, Michael L. Nelson, Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?, Proceedings of TPDL 2012.
How well is the Arabic Web archived?
• Arabic is the 4th most popular language on the Internet
• Anecdotally known that web archives and search engines
favor Western and English language pages
We investigated the state of archival of Arabic language
web pages in 2014-2015
• Gathered Arabic language web pages
• Analyzed domains, TLDs, GeoIP
• Analyzed presence in Google’s index and web archives
• Compared this to the archival and indexing of English,
Danish, and Korean language web pages
How to gather and detect Arabic
language web pages?
Gathered URIs from Arabic website directories
• 15,743 seed URIs
– 15,092 were unique
– 11,014 were live on the Web in March-May 2014
Directory   Registered Country   Year Estab.   Directory URI           URIs
DMOZ        US                   1999          Dmoz.org/world/arabic   4,086
Raddadi     Saudi Arabia         2000          Raddadi.com             3,271
Star28      Lebanon              2004          Star28.com              8,386
Total                                                                  15,743
We used four methods to determine the language
of the 11,014 URIs
• HTTP Content-Language header
• HTML Title tag
• Language Detection API
• Trigrams
https://github.com/kent37/guess-language
http://detectlanguage.com
https://github.com/decultured/Python-Language-Detector
Sample of test results evaluated by a native speaker (first author)
HTTP Content-Language classified 41% of the
seed URIs as Arabic
HTTP response header, e.g.:
Content-Language: ar
HTML tag, e.g.:
<html dir="rtl" xmlns="http://www.w3.org/1999/xhtml" xml:lang="ar" lang="ar">
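A minimal sketch of this check, assuming the response headers are available as a plain dict keyed exactly as "Content-Language" and the body as a string. The function and class names are illustrative and not taken from the study's code; only the Python standard library is used.

```python
from html.parser import HTMLParser


class HtmlLangParser(HTMLParser):
    """Records the lang / xml:lang attribute of the first <html> tag."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            attr_map = dict(attrs)
            self.lang = attr_map.get("lang") or attr_map.get("xml:lang")


def declared_languages(headers, html):
    """Return the set of primary language subtags declared by a response."""
    declared = set()
    # Content-Language may list several tags, e.g. "ar, en-US".
    for tag in headers.get("Content-Language", "").split(","):
        tag = tag.strip().lower()
        if tag:
            declared.add(tag.split("-")[0])
    parser = HtmlLangParser()
    parser.feed(html)
    if parser.lang:
        declared.add(parser.lang.strip().lower().split("-")[0])
    return declared


def is_declared_arabic(headers, html):
    """True if either the header or the <html> tag declares Arabic."""
    return "ar" in declared_languages(headers, html)
```

A declared language is only a claim made by the server or page author, which is one reason to combine this test with content-based tests.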
HTML Title tag classified 38% of the seed URIs as
Arabic
Extract text from HTML title tag, e.g.:
<TITLE>دليل العرب الشامل</TITLE>
guess-language library
https://github.com/kent37/guess-language
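The title test can be sketched as follows. This is not the guess-language library: as a stand-in, it measures the share of letters in the basic Arabic Unicode block (U+0600 to U+06FF), which is a much cruder heuristic. The names and the 0.5 threshold are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def arabic_letter_share(text):
    """Fraction of alphabetic characters in the block U+0600 to U+06FF."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06ff")
    return arabic / len(letters)


def title_looks_arabic(html, threshold=0.5):
    """Assumed rule: at least half of the title's letters are Arabic."""
    return arabic_letter_share(extract_title(html)) >= threshold
```

A script-based test like this cannot separate Arabic from other languages written in Arabic script (Persian or Urdu, for example), so a real language identifier is still needed for borderline cases.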
The Language Detection API test classified 39%
of the seed URIs as Arabic
Extract title and text
Language Detection API
http://detectlanguage.com
The Trigram test classified 36% of the seed URIs
as Arabic
Extract title and text
Test sequences of three letters (trigrams)
Python-Language-Detector tool
https://github.com/decultured/Python-Language-Detector
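To illustrate the idea behind a trigram test, here is a toy classifier. It does not reproduce the Python-Language-Detector tool: the two sample sentences, the overlap score, and all names are assumptions chosen to keep the sketch short.

```python
from collections import Counter


def trigrams(text):
    """Return a Counter of overlapping three-character sequences."""
    text = " ".join(text.lower().split())  # normalize whitespace
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def build_profiles(samples):
    """Map each language code to the trigram counts of its sample text."""
    return {lang: trigrams(sample) for lang, sample in samples.items()}


def classify(text, profiles):
    """Pick the language whose profile shares the most trigram mass with text."""
    observed = trigrams(text)
    scores = {
        lang: sum(min(count, profile[gram]) for gram, count in observed.items())
        for lang, profile in profiles.items()
    }
    best = max(scores, key=scores.get)
    # No shared trigram at all means the text matches no known profile.
    return best if scores[best] > 0 else None


# Tiny illustrative samples; a real profile would be trained on far more text.
SAMPLES = {
    "ar": "هذا دليل شامل للمواقع العربية على شبكة الإنترنت",
    "en": "this is a comprehensive directory of websites on the internet",
}
PROFILES = build_profiles(SAMPLES)
```

With profiles this small the classifier only separates clearly different scripts; trigram methods become useful for related languages once the profiles are built from large corpora.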
We took the union to obtain 7,976 Arabic seed URIs
72.4% of the 11,014 live seed URIs were determined to be in Arabic
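The union step and the reported percentage can be sketched like this. The five URIs and their per-method results are hypothetical toy data; only the counts 7,976 and 11,014 come from the slides.

```python
def union_of_detections(*detected_sets):
    """A URI counts as Arabic if at least one method flagged it."""
    combined = set()
    for detected in detected_sets:
        combined |= detected
    return combined


def percent(part, whole):
    return 100.0 * part / whole


# Hypothetical toy results for five URIs (not data from the study).
header_hits = {"u1", "u2"}
title_hits = {"u2", "u3"}
api_hits = {"u3"}
trigram_hits = {"u1", "u4"}
arabic_uris = union_of_detections(header_hits, title_hits, api_hits, trigram_hits)

# Reported counts: 7,976 Arabic URIs out of 11,014 live seed URIs.
reported_share = percent(7976, 11014)
```

Taking the union favors recall: a page missed by one method is still kept if any other method flags it, which is why the combined share exceeds each individual method's share.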
We expanded the dataset by crawling the live
Web and the past Web
• Crawled the 7,976 live Arabic seed URIs, 2 levels deep
– gathered all URIs linked from each seed URI
– then, gathered all URIs linked from those URIs
– 575,242 additional URIs
• Crawled the most recent memento of the Arabic seed URIs
– 515,821 additional URIs
• Total of 663,443 unique crawled URIs gathered
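The two-level expansion described above amounts to a depth-limited breadth-first crawl. The sketch below assumes a caller-supplied `get_links` function (hypothetical) that returns the URIs linked from a page; here it reads from an in-memory graph instead of fetching anything.

```python
from collections import deque


def crawl(seeds, get_links, max_depth=2):
    """Breadth-first crawl; returns URIs discovered beyond the seeds."""
    seen = set(seeds)
    discovered = set()
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        uri, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not follow links past the depth limit
        for link in get_links(uri):
            if link not in seen:
                seen.add(link)
                discovered.add(link)
                queue.append((link, depth + 1))
    return discovered


# Hypothetical in-memory link graph standing in for real HTTP fetches.
LINK_GRAPH = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["c", "d"],
    "c": ["e"],  # would be a third level, so it is never followed
}
found = crawl(["seed"], lambda uri: LINK_GRAPH.get(uri, []))
```

The same function could be pointed at the most recent memento of each seed by swapping in a `get_links` that reads the archived copy rather than the live page.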
482,905 were live
292,670 were live and Arabic
Total Arabic Dataset = 300,646 URIs
7,976 seed URIs
+ 292,670 crawled URIs
What are the characteristics of our
Arabic language dataset?
Dataset has 17,536 unique domains
Rank Domain URIs GeoIP Category
1 Alarab.net 284 US News
2 Aljarida.com 248 US News
3 Arabic.cnn.com 245 US News
4 Alarabiya.net 231 US News
5 Ar.wikipedia.org 230 US Encyclopedia
6 Aljazeera.net 213 US News
7 Moheet.com 142 US News
8 Facebook.com 133 US Social
9 Al-sharq.com 132 US Middle East Portal
10 Lakii.com 123 US General Portal
17 Kuwaitclub.com.kw 71 Kuwait Sport
First Arabic GeoIP is at rank 17
6 of the top 10 domains are news websites
Popular Western domains are in the Top 10
TLD Percent
com 57.97%
net 15.07%
org 6.40%
gov.sa 1.94%
info 1.68%
edu.sa 1.27%
ws 1.16%
org.sa 0.97%
com.sa 0.80%
gov.eg 0.80%
Other 11.94%
Over half have a .com TLD
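A TLD breakdown like the one above could be computed roughly as follows. The rule for two-part suffixes such as gov.sa is an assumption made for this sketch (a country-code ending preceded by one of a few common labels); a production analysis would more likely rely on the Public Suffix List.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumed second-level labels that are treated as part of the suffix,
# so that a host ending in "gov.sa" is grouped under "gov.sa".
SECOND_LEVEL_LABELS = {"com", "net", "org", "gov", "edu"}


def tld_of(uri):
    """Return the suffix used for grouping, e.g. 'com' or 'gov.sa'."""
    host = urlsplit(uri).hostname or ""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) >= 3 and len(labels[-1]) == 2 and labels[-2] in SECOND_LEVEL_LABELS:
        return ".".join(labels[-2:])
    return labels[-1]


def tld_percentages(uris):
    """Percentage of URIs per suffix."""
    counts = Counter(tld_of(uri) for uri in uris)
    total = sum(counts.values())
    return {tld: 100.0 * n / total for tld, n in counts.items()}
```

For example, `tld_of("http://www.kuwaitclub.com.kw/")` yields `com.kw`, while `tld_of("http://ar.wikipedia.org/wiki/")` yields `org`.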
Only ~10% have an Arabic ccTLD
ccTLD Country Percent
.sa Saudi Arabia 5.33%
.eg Egypt 2.00%
.jo Jordan 2.00%
.ae United Arab Emirates 1.06%
.kw Kuwait 0.82%
Most are geo-located in the US
Geo-location Percent
United States 57.97%
Arabic Countries 10.53%
Germany 9.75%
Netherlands 5.29%
France 4.37%
Canada 3.31%
United Kingdom 3.07%
Other 5.71%
Within Arabic countries, most are geo-located in
Saudi Arabia
Geo-location Percent
Saudi Arabia 4.75%
Egypt 1.97%
Jordan 1.42%
Kuwait 0.71%
United Arab Emirates 0.67%
How well are these Arabic web pages
indexed and archived?
53.77% of Arabic language web pages are
archived
• Used Memento aggregator to determine if archived
– checks multiple web archives
• 161,678 URIs were archived
• 97% of those were found in the Internet Archive
• Mementos also found in 9 other archives
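A Memento aggregator typically answers with a TimeMap, which in the link format defined by RFC 7089 lists each archived copy with a rel value containing "memento". The sketch below only parses such a response; it makes no network request, the sample TimeMap is hypothetical, and the regular expressions are a simplification that assumes every parameter value is quoted.

```python
import re

# One link-value: <URI> followed by ;-separated quoted parameters.
LINK_VALUE = re.compile(r'<([^>]+)>((?:\s*;\s*[\w-]+="[^"]*")*)')
REL_PARAM = re.compile(r'rel="([^"]*)"')


def memento_uris(timemap_text):
    """Return URIs whose rel value contains the token 'memento'."""
    uris = []
    for uri, params in LINK_VALUE.findall(timemap_text):
        rel = REL_PARAM.search(params)
        if rel and "memento" in rel.group(1).split():
            uris.append(uri)
    return uris


def is_archived(timemap_text):
    """A URI counts as archived if its TimeMap lists at least one memento."""
    return len(memento_uris(timemap_text)) > 0


# Hypothetical TimeMap body for illustration only.
SAMPLE_TIMEMAP = (
    '<http://example.com/>; rel="original",\n'
    '<http://archive.example/timemap/link/http://example.com/>; rel="self"; type="application/link-format",\n'
    '<http://archive.example/web/20140301000000/http://example.com/>; rel="first memento"; datetime="Sat, 01 Mar 2014 00:00:00 GMT",\n'
    '<http://archive.example/web/20150301000000/http://example.com/>; rel="last memento"; datetime="Sun, 01 Mar 2015 00:00:00 GMT"'
)
```

Counting the hosts of the returned memento URIs would give a per-archive breakdown of the kind summarized above.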
6 out of the 10 most archived are news websites
URI-Rs Mementos Category
gulfup.com 10,987 File Sharing
masrawy.com 9,144 Egyptian portal
arabic.cnn.com 9,022 News
aljazeera.net 8,906 News
maktoob.yahoo.com 8,478 Search Engine
shorooknews.com 7,548 News
arabnews.com 6,274 News
bbc.co.uk/arabic 6,268 News
ahram.org.eg 5,347 News
google.com.sa 4,968 Search Engine
Many are Arabic versions of globally popular
sites
Most mementos are from recent years
Analysis done in 2015
We looked at indexing of Arabic seed URIs
• Used Google Custom Search API to query
– limited to 1,000 queries/day
– tested only seeds
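The daily quota explains why only the seeds were tested. A rough sketch of the scheduling arithmetic, assuming one query per URI (an assumption, since the slides do not state the number of queries per URI):

```python
import math


def daily_batches(uris, per_day=1000):
    """Yield successive slices of at most per_day URIs, one slice per day."""
    for start in range(0, len(uris), per_day):
        yield uris[start:start + per_day]


def days_needed(n_queries, per_day=1000):
    """Number of days required to issue n_queries under the daily limit."""
    return math.ceil(n_queries / per_day)
```

Under that assumption, `days_needed(7976)` gives 8 days for the Arabic seed URIs, whereas `days_needed(300646)` gives 301 days for the full Arabic dataset.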
69% of Arabic language seed URIs were indexed
by Google
Arabic Seed Dataset
(Live, Indexed, Archived) Percent
(1, 1, 1) 43.34%
(1, 1, 0) 25.59%
(1, 0, 1) 15.27%
(1, 0, 0) 15.76%
All seeds were live
• 82% of those listed in DMOZ were indexed
• 74.6% of the top-level seeds (path depth = 0) were indexed
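The headline figures can be derived from the (Live, Indexed, Archived) table by summing the rows that share a flag. A small sketch using the reported percentages, with illustrative names:

```python
# Percentages reported for the Arabic seed dataset,
# keyed by (live, indexed, archived) flags.
SEED_STATUS = {
    (1, 1, 1): 43.34,
    (1, 1, 0): 25.59,
    (1, 0, 1): 15.27,
    (1, 0, 0): 15.76,
}


def share_where(status, position, value):
    """Sum the percentages of all tuples whose flag at position equals value."""
    return round(sum(p for flags, p in status.items() if flags[position] == value), 2)


indexed = share_where(SEED_STATUS, 1, 1)      # rows (1, 1, 1) and (1, 1, 0)
archived = share_where(SEED_STATUS, 2, 1)     # rows (1, 1, 1) and (1, 0, 1)
not_indexed = share_where(SEED_STATUS, 1, 0)  # rows (1, 0, 1) and (1, 0, 0)
```

The indexed share sums to 68.93, in line with the 69% in the slide title; the four rows add up to 99.96 rather than exactly 100, presumably because of rounding in the reported values.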
58% of seeds were archived
But, 42% were not archived
43% were indexed and archived
Good! (discovered and saved)
31% were not indexed by Google
16% were neither indexed nor archived
Bad! (undiscovered and not saved)
How does this compare to other
languages?
We chose English, Danish, and Korean
• English
– most popular language on the Internet
– over 65 countries have English as an official language
• Danish
– European language
– 96% of population in Denmark uses the Internet
– Denmark has government initiative to archive Danish cultural heritage on the Web
• Korean
– Asian language
– 92% of population in South Korea uses the Internet (highest among Asian countries)
We gathered seed URIs from DMOZ
• English
– sample of size 10,000
• Danish
– sample of size 10,000
• Korean
– all 2,347 URIs in DMOZ
Crawled the seed URIs to expand the dataset
                       Arabic    English   Danish    Korean
Seeds                  15,743    10,000    10,000    2,347
Seeds, live            11,014    9,384     9,245     2,070
Seeds, in language     7,976     8,576     6,331     1,517
Crawled                663,443   224,249   174,369   16,016
Crawled, live          482,905   176,261   131,484   11,099
Crawled, in language   292,670   137,950   99,019    7,965
Total                  300,646   146,526   105,350   9,482
December 2015 - March 2016
562,004 total
English and Arabic web pages were .com, Danish
and Korean were ccTLD
[Figure: TLD distribution by language; labeled values: 47.37%, 84.64%, 66.34%, 60.28%, 37.50%]
English and Arabic web pages were located in the
US, Danish and Korean were in their countries
[Figure: geo-location by language; labeled values: 89.54%, 83.12%, 57.97%]
Arabic is archived less than English, but more
than Danish and Korean
                    Arabic    English   Danish    Korean
Seeds and Crawled   300,646   146,526   105,350   9,482
Archived            161,678   107,398   41,703    3,972
Percent             53.77%    73.30%    39.59%    41.89%*
* Danish government archive is dark (not publicly available)
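The percentages in this table follow directly from the two count rows. A short sketch that recomputes them and ranks the languages (names are illustrative; the counts are those reported above):

```python
DATASET = {
    "Arabic": {"total": 300646, "archived": 161678},
    "English": {"total": 146526, "archived": 107398},
    "Danish": {"total": 105350, "archived": 41703},
    "Korean": {"total": 9482, "archived": 3972},
}


def archival_rate(entry):
    """Percentage of URIs with at least one memento."""
    return 100.0 * entry["archived"] / entry["total"]


rates = {lang: archival_rate(entry) for lang, entry in DATASET.items()}
# Languages ordered from most to least archived.
ranking = sorted(rates, key=rates.get, reverse=True)
```

The recomputed values agree with the reported ones to within 0.01 percentage points; any last-digit difference comes down to rounding versus truncation.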
>90% of pages listed in DMOZ were archived,
regardless of language
             Arabic   English   Danish   Korean
DMOZ Seeds   2,904    8,576     6,331    1,517
Archived     2,774    8,014     6,164    1,358
Percent      95.52%   93.44%    97.36%   89.52%
* DMOZ closed in 2017, but curlie.org began in 2018 with DMOZ data
There are a few caveats
• DMOZ is no longer operational
• Some web pages are multilingual
• Not easy to characterize “the Arabic Web”
– 67% of Arabic dataset had neither Arabic ccTLD nor Arabic geo-location
• The language of a web page may shift over time
How well are Arabic language web
pages archived?
Only about half of Arabic language web pages are
archived
• Analyzed over 500,000 web pages (2014-2016)
• Arabic is archived less than English, but more than Danish and Korean
• Most Arabic and English pages are geo-located in the US, Danish in
Denmark, and Korean in South Korea
• Web pages present in DMOZ are highly likely to be archived, regardless
of language
Old Dominion University Web Science and Digital Libraries (WS-DL)
Department of Computer Science https://ws-dl.cs.odu.edu/ @WebSciDL

 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 

Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages

  • 1. Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages Lulwah M. Alkwai, Michael L. Nelson, Michele C. Weigle (@weiglemc) Web Sciences and Digital Libraries (WS-DL) Group (@WebSciDL) Department of Computer Science Old Dominion University Norfolk, Virginia, USA July 24, 2019 / ACM SIGIR 2019 / Paris, France Published in ACM Transactions on Information Systems (TOIS), 36(1), July 2017 Extended version of Best Student Paper awardee at IEEE/ACM JCDL 2015
  • 2. @weiglemc, @WebSciDL Web archives are collections of web pages of the past (years shown: 2000, 2007, 2009, 2012) ACM TOIS 36(1) 2017 / ACM SIGIR 2019
  • 3. Web archives are essential for studying recent history and culture
      https://twitter.com/NetPreserve/status/1141321443373920256 (photo cropped and enlarged)
      http://web.archive.org/web/19970222174751/http://www1.geocities.com/
  • 4. The Internet Archive holds the largest web archive
      https://archive.org/web
  • 5. But it’s not the only one
      http://timetravel.mementoweb.org/list/19990518173206/http://geocities.com
      https://en.wikipedia.org/wiki/List_of_Web_archiving_initiatives
  • 6. We’ve studied recent (2010s) events in the Middle East
      • Iranian Elections and Protests - Jun 2009
          – SalahEldeen and Nelson, TPDL 2012
      • Egyptian Revolution - Jan 20 - Mar 1, 2011
          – SalahEldeen and Nelson, TPDL 2012
          – AlNoamany, Weigle, Nelson, ACM WebSci 2017
      • Syrian Uprising - Mar 2012
          – SalahEldeen and Nelson, TPDL 2012
      • Egypt’s Presidential Election - 2012
          – AlNoamany, Weigle, Nelson, TPDL 2015, IJDL 2016
  • 7. But, we can only study the past Web that already exists in the archives
      • 11% of resources shared in social media disappear each year
      Hany M. SalahEldeen, Michael L. Nelson, Losing My Revolution: How Many Resources Shared on Social Media Have Been Lost?, Proceedings of TPDL 2012.
  • 8. How well is the Arabic Web archived?
      • Arabic is the 4th most popular language on the Internet
      • Anecdotally known that web archives and search engines favor Western and English language pages
  • 9. We investigated the state of archival of Arabic language web pages in 2014-2015
      • Gathered Arabic language web pages
      • Analyzed domains, TLDs, GeoIP
      • Analyzed presence in Google’s index and web archives
      • Compared this to the archival and indexing of English, Danish, and Korean language web pages
  • 10. How to gather and detect Arabic language web pages?
  • 11. Gathered URIs from Arabic website directories
      • 15,743 seed URIs
          – 15,092 were unique
          – 11,014 were live on the Web in March-May 2014
      Directory | Registered Country | Year Estab. | Directory URI | URIs
      DMOZ | US | 1999 | Dmoz.org/world/arabic | 4,086
      Raddadi | Saudi Arabia | 2000 | Raddadi.com | 3,271
      Star28 | Lebanon | 2004 | Star28.com | 8,386
      Total | | | | 15,743
  • 12. We used four methods to determine the language of the 11,014 URIs
      • HTTP Content-Language header
      • HTML Title tag (https://github.com/kent37/guess-language)
      • Language Detection API (http://detectlanguage.com)
      • Trigrams (https://github.com/decultured/Python-Language-Detector)
      Sample of test results evaluated by a native speaker (first author)
  • 13. HTTP Content-Language classified 41% of the seed URIs as Arabic
      HTTP response header ex: Content-Language: ar
      HTML tag ex: <html dir="rtl" xmlns="http://www.w3.org/1999/xhtml" xml:lang="ar" lang="ar">
  • 14. HTML Title tag classified 38% of the seed URIs as Arabic
      Extract text from HTML title tag, ex: <TITLE>دليل العرب الشامل</TITLE>
      guess-language library: https://github.com/kent37/guess-language
  • 15. The Language Detection API test classified 39% of the seed URIs as Arabic
      Extract title and text
      Language Detection API: http://detectlanguage.com
  • 16. The Trigram test classified 36% of the seed URIs as Arabic
      Extract title and text
      Test sequences of three letters (trigrams)
      Python-Language-Detector tool: https://github.com/decultured/Python-Language-Detector
  • 17. We took the union to obtain 7,976 Arabic seed URIs
      72.4% of the seed URIs were determined to be in Arabic
  • 18. We expanded the dataset by crawling the live Web and the past Web
      • Crawled the 7,976 live Arabic seed URIs, 2 levels deep
          – gathered all URIs linked from each seed URI
          – then, gathered all URIs linked from those URIs
          – 575,242 additional URIs
      • Crawled the most recent memento of the Arabic seed URIs
          – 515,821 additional URIs
      • Total of 663,443 unique crawled URIs gathered
  • 19. 482,905 were live
  • 20. 292,670 were live and Arabic
      Total Arabic Dataset = 300,646 URIs (7,976 seed URIs + 292,670 crawled URIs)
  • 21. What are the characteristics of our Arabic language dataset?
  • 22. Dataset has 17,536 unique domains
      Rank | Domain | URIs | GeoIP | Category
      1 | Alarab.net | 284 | US | News
      2 | Aljarida.com | 248 | US | News
      3 | Arabic.cnn.com | 245 | US | News
      4 | Alarabiya.net | 231 | US | News
      5 | Ar.wikipedia.org | 230 | US | Encyclopedia
      6 | Aljazeera.net | 213 | US | News
      7 | Moheet.com | 142 | US | News
      8 | Facebook.com | 133 | US | Social
      9 | Al-sharq.com | 132 | US | Middle East Portal
      10 | Lakii.com | 123 | US | General Portal
      17 | Kuwaitclub.com.kw | 71 | Kuwait | Sport
  • 23. (same table) First Arabic GeoIP is at rank 17
  • 24. (same table) 6 top domains are news websites
  • 25. (same table) Popular Western domains are in the Top 10
  • 26. Over half have a .com TLD
      TLD | Percent
      com | 57.97%
      net | 15.07%
      org | 6.40%
      gov.sa | 1.94%
      info | 1.68%
      edu.sa | 1.27%
      ws | 1.16%
      org.sa | 0.97%
      com.sa | 0.80%
      gov.eg | 0.80%
      Other | 11.94%
  • 27. (same table) Only ~10% have an Arabic ccTLD
      ccTLD | Country | Percent
      .sa | Saudi Arabia | 5.33%
      .eg | Egypt | 2.00%
      .jo | Jordan | 2.00%
      .ae | United Arab Emirates | 1.06%
      .kw | Kuwait | 0.82%
  • 28. Most are geo-located in the US
      Geo-location | Percent
      United States | 57.97%
      Arabic Countries | 10.53%
      Germany | 9.75%
      Netherlands | 5.29%
      France | 4.37%
      Canada | 3.31%
      United Kingdom | 3.07%
      Other | 5.71%
  • 29. (same table) Within Arabic countries, most are geo-located in Saudi Arabia
      Geo-location | Percent
      Saudi Arabia | 4.75%
      Egypt | 1.97%
      Jordan | 1.42%
      Kuwait | 0.71%
      United Arab Emirates | 0.67%
  • 30. How well are these Arabic web pages indexed and archived?
  • 31. 53.77% of Arabic language web pages are archived
      • Used Memento aggregator to determine if archived
          – checks multiple web archives
      • 161,678 URIs were archived
      • 97% of those were found in the Internet Archive
      • Mementos also found in 9 other archives
  • 32. 6 out of the 10 most archived are news websites
      URI-R | Mementos | Category
      gulfup.com | 10,987 | File Sharing
      masrawy.com | 9,144 | Egyptian portal
      arabic.cnn.com | 9,022 | News
      aljazeera.net | 8,906 | News
      maktoob.yahoo.com | 8,478 | Search Engine
      shorooknews.com | 7,548 | News
      arabnews.com | 6,274 | News
      bbc.co.uk/arabic | 6,268 | News
      ahram.org.eg | 5,347 | News
      google.com.sa | 4,968 | Search Engine
  • 33. (same table) Many are Arabic versions of globally popular sites
  • 34. Most mementos are from recent years
      Analysis done in 2015
  • 35. We looked at indexing of Arabic seed URIs
      • Used Google Custom Search API to query
          – limited to 1000 queries/day
          – tested only seeds
  • 36. 69% of Arabic language seed URIs were indexed by Google
      Arabic Seed Dataset (Live, Indexed, Archived) | Percent
      (1, 1, 1) | 43.34%
      (1, 1, 0) | 25.59%
      (1, 0, 1) | 15.27%
      (1, 0, 0) | 15.76%
      All seeds were live
      • 82% of those listed in DMOZ were indexed
      • 74.6% of the top-level seeds (path depth = 0) were indexed
  • 37. (same table) 58% of seeds were archived
      But, 42% were not archived
  • 38. (same table) 43% were indexed and archived
      Good! (discovered and saved)
  • 39. (same table) 31% were not indexed by Google
  • 40. (same table) 16% were neither indexed nor archived
      Bad! (undiscovered and not saved)
  • 41. How does this compare to other languages?
  • 42. We chose English, Danish, and Korean
      • English
          – most popular language on the Internet
          – over 65 countries have English as an official language
      • Danish
          – European language
          – 96% of population in Denmark uses the Internet
          – Denmark has government initiative to archive Danish cultural heritage on the Web
      • Korean
          – Asian language
          – 92% of population in South Korea uses the Internet (highest in Asian countries)
  • 43. We gathered seed URIs from DMOZ
      • English – sample of size 10,000
      • Danish – sample of size 10,000
      • Korean – all 2,347 URIs in DMOZ
  • 44. Crawled the seed URIs to expand the dataset
      | Arabic | English | Danish | Korean
      Seeds | 15,742 | 10,000 | 10,000 | 2,347
      Seeds: Live | 11,014 | 9,384 | 9,245 | 2,070
      Seeds: Language | 7,976 | 8,576 | 6,331 | 1,157
      Crawled | 663,443 | 224,249 | 174,369 | 16,016
      Crawled: Live | 482,905 | 176,261 | 131,484 | 11,099
      Crawled: Language | 292,670 | 137,950 | 99,019 | 7,965
      Total | 300,646 | 146,526 | 105,350 | 9,482
      December 2015 - March 2016; 562,004 total
  • 45. English and Arabic web pages were .com, Danish and Korean were ccTLD
      [Chart; values shown: 47.37%, 84.64%, 66.34%, 60.28%, 37.50%]
  • 46. English and Arabic web pages were located in the US, Danish and Korean were in their countries
      [Chart; values shown: 89.54%, 83.12%, 57.97%]
  • 47. Arabic is archived less than English, but more than Danish and Korean
      | Arabic | English | Danish | Korean
      Seeds and Crawled | 300,646 | 146,526 | 105,350 | 9,482
      Archived | 161,678 | 107,398 | 41,703 | 3,972
      Percent | 53.77% | 73.30% | 39.59% | 41.89%
      * Danish government archive is dark (not publicly available)
  • 48. >90% of pages listed in DMOZ were archived, regardless of language
      | Arabic | English | Danish | Korean
      DMOZ Seeds | 2,904 | 8,576 | 6,331 | 1,157
      Archived | 2,774 | 8,014 | 6,164 | 1,358
      Percent | 95.52% | 93.44% | 97.36% | 89.52%
      * DMOZ closed in 2017, but curlie.org began in 2018 with DMOZ data
  • 49. There are a few caveats
      • DMOZ is no longer operational
      • Some web pages are multilingual
      • Not easy to characterize “the Arabic Web”
          – 67% of Arabic dataset had neither Arabic ccTLD nor Arabic geo-location
      • The language of a web page may shift over time
  • 50. How well are Arabic language web pages archived?
  • 51. Only about half of Arabic language web pages are archived
      • Analyzed over 500,000 web pages (2014-2016)
      • Arabic is archived less than English, but more than Danish and Korean
      • Most Arabic and English pages are geo-located in the US, Danish in Denmark, and Korean in South Korea
      • Web pages present in DMOZ are highly likely to be archived, regardless of language
      Old Dominion University, Web Science and Digital Libraries (WS-DL), Department of Computer Science
      https://ws-dl.cs.odu.edu/ @WebSciDL
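
The language tests on slides 12-17 (header check, title check, then taking the union) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names are hypothetical, only two of the four tests are approximated, and the title test swaps in a simple Arabic Unicode-range heuristic in place of the guess-language library. That heuristic would also accept other languages written in Arabic script (for example Persian or Urdu), so it is only a rough stand-in.

```python
import re


def _is_arabic_tag(tag):
    """True for language tags such as 'ar' or 'ar-SA'."""
    tag = tag.strip().lower()
    return tag == "ar" or tag.startswith("ar-")


def header_test(headers):
    """Check the HTTP Content-Language response header (names are case-insensitive)."""
    lowered = {k.lower(): v for k, v in headers.items()}
    value = lowered.get("content-language", "")
    return any(_is_arabic_tag(t) for t in value.split(","))


def html_lang_test(html):
    """Check the lang / xml:lang attribute on the <html> tag."""
    m = re.search(r"<html\b[^>]*?\blang\s*=\s*[\"']?([A-Za-z-]+)", html, re.IGNORECASE)
    return bool(m) and _is_arabic_tag(m.group(1))


def title_test(html):
    """Assumed heuristic: more than half of the title's letters are in the Arabic block."""
    m = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    if not m:
        return False
    letters = [c for c in m.group(1) if c.isalpha()]
    if not letters:
        return False
    arabic = sum(1 for c in letters if "\u0600" <= c <= "\u06FF")
    return arabic / len(letters) > 0.5


def is_arabic(headers, html):
    """Union of the tests: a page counts as Arabic if any single test says so."""
    return header_test(headers) or html_lang_test(html) or title_test(html)
```

Taking the union favors recall: a page that only one test flags is still kept, which matches the slide 17 step but may admit some false positives.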

Editor's Notes

  1. Since the four tests gave broadly similar results, we included in our collection of Arabic websites any URI that at least one of the tests classified as Arabic. The reliability of the tests to determine if a web page is in Arabic was measured by having a native reader (the first author) quickly evaluate a sample of pages. Next, we measured the number of URIs reported as Arabic. Figure 1 shows the intersection between the four language tests.
  2. 3.3 Crawling Arabic Seed URIs. To increase the size of our dataset, we crawled the Arabic seed URIs between January-March 2014. Our first pass was to gather additional URIs linked from the live Web versions of our seed URIs. This resulted in collecting 575,242 URIs, all of which were available on the live Web. To gather even more URIs, we crawled the Arabic seed URIs that had at least one archived version (or, memento). We crawled the most recent memento and gathered 515,821 URIs. Of these, only 335,283 were available on the live Web. Combining the three sets (original URIs, crawled live, and crawled archived), we obtained a total of 663,443 unique URIs. The Sankey diagram figures summarize the collection of Arabic URIs for seed URIs and for crawled URIs. Combining the seed URIs and crawled URIs, we collected 300,646 Arabic URIs that we analyze in the remainder of the paper.
  3. 4.1 Unique Domains. First, we investigate the number of unique domains in our dataset. Out of the 300,646 Arabic URIs, there are 17,536 unique domains. The most frequent domains are shown in Table 3. We also tested the GeoIP location of the top-level webpage of each of these domains and found that the top 16 are all located in the US.
  4. The first domain we find located in an Arabic country is the 17th most frequent.
  5. We note that several of these top domains are popular Western sites, such as cnn.com and wikipedia.org. This indicates that the Arabic language community is already using services on Western sites that are likely to be archived.
  6. We investigate the top-level domain (TLD) and country code TLD (ccTLD), together termed the effective TLD, of the unique Arabic language domains. Generic TLDs such as .com, .net, and .org are open for any registrant. In addition to TLDs, many sites also use the two-letter ccTLD of their home country. Although only a small percentage of the websites add the ccTLD, it may be a good indication of the source of the website. Table 4 shows the distribution of the top 10 effective TLDs. We also checked if the ccTLD was from a country where Arabic is an official language (listed in Table 1). .ws is the Internet country code top-level domain (ccTLD) for Samoa. The .ws country code has been marketed as a domain hack, with the .ws purportedly standing for "World Site" or "Web site", providing a "global" Internet presence to registrants, as it supports all internationalized domain names.
  7. We note that the top Arabic ccTLD, .sa for Saudi Arabia, is used in fewer URIs than the generic TLDs .com, .net, and .org.
  8. 4.6 GeoIP Location. Earlier we looked at the ccTLD of the URIs to help determine where the hosts of the webpages might be located. Now we want to look at the GeoIP location of the IP address of the unique hostnames. Steps: First, we obtained the IP addresses of the hostnames using nslookup, which uses DNS to convert the hostname to its IP address. Then we used the MaxMind GeoLite2 database, which tests at 99.8% accuracy at the country level, to determine location from the IP address. We used this method to determine GeoIP for the Arabic URI dataset (300,646 URIs). Table 10 shows the top GeoIP locations, with Arabic countries grouped together.
  9. We found that less than 11% of the URIs are hosted in Arabic countries. Table 11 shows the top 5 GeoIP locations from Arabic countries.
  10. Table 7 lists the top 10 archived URI-Rs with the most mementos.
  11. Figure 5 shows the number of URI-Ms with Memento-Datetimes in each year. A URI-M is an archived snapshot of the URI-R at a specific date and time, which is called the Memento-Datetime, e.g., URI-M_i = URI-R@t_i. Example: www.AlJazeera.net @ Jan-1-2015 14:24:27
  12. 4.7 Search Engine Indexing. We are also interested in discovering how well the Arabic seed URIs are indexed in search engines such as Google. We used the Google Custom Search API to determine if the Arabic seed URIs are indexed by Google. We tested only the seed URIs because we were limited by the restriction of 1000 requests per day in the API. However, we note that the Google user web interface may produce different results than the Custom Search API. For the Arabic seed URIs, we can indicate if they were present on the live Web, in the Google index, and present in an archive, creating a (live, indexed, archived) tuple. In the table, we show the percentage of our Arabic seed URI dataset (7,976 URIs) that fell into each permutation of the tuple. We note that all of our Arabic seeds were present on the live Web at the time of our analysis.
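
The headline indexing and archival figures follow from summing rows of the (live, indexed, archived) table. A minimal Python sketch of that arithmetic, using the four percentages reported on the slides; the variable names are illustrative only and are not taken from the paper:

```python
# Share of the 7,976 Arabic seed URIs in each (live, indexed, archived) state,
# as reported in the slide table.
shares = {
    (1, 1, 1): 43.34,
    (1, 1, 0): 25.59,
    (1, 0, 1): 15.27,
    (1, 0, 0): 15.76,
}

# Marginal totals: add every row where the relevant flag is set (or unset).
indexed = sum(p for (live, idx, arch), p in shares.items() if idx)
archived = sum(p for (live, idx, arch), p in shares.items() if arch)
not_indexed = sum(p for (live, idx, arch), p in shares.items() if not idx)
neither = shares[(1, 0, 0)]

for label, value in [
    ("indexed", indexed),
    ("archived", archived),
    ("not indexed", not_indexed),
    ("neither indexed nor archived", neither),
]:
    print(f"{label}: {value:.2f}%")
```

The indexed rows sum to 68.93% and the archived rows to 58.61%, in line with the roughly 69% and 58% quoted on the slides. The four rows total 99.96%, so small rounding differences in the source table are expected.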