1. How Well Are Arabic
Websites Archived?
Lulwah M. Alkwai, Michael L. Nelson, and Michele C. Weigle
Old Dominion University
Department of Computer Science
Norfolk, Virginia 23529 USA
JCDL 2015
Knoxville, TN
June 21-25, 2015
11. Top ten languages in the Internet
Source: Internet World Stats - http://www.internetworldstats.com/stats7.htm
World Language Map
Source: Quick Maps of the World immigration - http://www.allcountries.org/maps/world_language_maps.html
15. Why are we doing this?
Ø The number of Arabic-speaking Internet users has grown rapidly
Ø There has been previous work on the coverage of web archives
Ø Little has been done in terms of Arabic-language content
16. How Much of the Web Is Archived?
Ø Sample of URIs from four different sources (DMOZ, Delicious, Bitly, search engine indexes)
Ø The archival percentages ranged from 16% to 79%
2013, a follow-on study:
Ø Archival percentages had increased, ranging from 33% to 95%
Ø These studies were not focused on content from specific countries or content in specific languages
17. A fair history of the Web? Examining country balance in the Internet Archive
Ø Examined country balance in the Internet Archive:
    Country     Domain    Archived
    US          .com      92%
    Taiwan      .com.tw   73%
    China       .com.cn   58%
    Singapore   .com.sg   73%
Ø This work focused on TLD rather than content language or location
18. Characterization of National Web Domains
Ø Used 10 national web domains
  § 120 million pages
  § 24 countries
  § They studied page sizes, degrees, link-based scores, etc.
  § They found that depth and response codes were similar
Ø In this work, additional methods are required to determine if a site belongs to a particular country
19. Characterizing a National Community Web
Ø Used Portuguese dataset:
  § (.pt) ccTLD
  § (.com, .net, .org, .tv) in Portuguese language that has at least one incoming link from (.pt) ccTLD
Ø They identify, collect, and characterize the Portuguese Web
20. How do we classify Arabic websites?
Ø GeoIP only
  ² News: al-watan.com
  ² ccTLD: Not Arabic (.com)
  ² GeoIP: Arabic country (Qatar)
Ø ccTLD only
  ² E-Marketing: haraj.com.sa
  ² ccTLD: Arabic (.sa)
  ² GeoIP: Not an Arabic country (Ireland)
Ø Both
  ² Educational: uoh.edu.sa
  ² ccTLD: Arabic (.sa)
  ² GeoIP: Arabic country (SA)
Ø Neither
  ² News: alarabiya.net
  ² ccTLD: Not Arabic (.net)
  ² GeoIP: Not an Arabic country (US)
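The four classes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sets ARABIC_CCTLDS and ARABIC_COUNTRY_CODES are small hypothetical subsets (the slides do not list the full sets used), and the GeoIP country code is passed in rather than looked up.

```python
from urllib.parse import urlsplit

# Hypothetical, partial sets for illustration; the slides do not give the full lists.
ARABIC_CCTLDS = {"sa", "eg", "jo", "ae", "kw", "qa"}
ARABIC_COUNTRY_CODES = {"SA", "EG", "JO", "AE", "KW", "QA"}


def has_arabic_cctld(uri):
    """Return True if the last label of the hostname is in ARABIC_CCTLDS."""
    # Prefix scheme-less URIs with '//' so urlsplit treats the start as a host.
    host = urlsplit(uri if "//" in uri else "//" + uri).hostname or ""
    return host.rsplit(".", 1)[-1].lower() in ARABIC_CCTLDS


def classify(uri, geoip_country_code):
    """Place a URI in one of the four classes used on the slide."""
    cctld = has_arabic_cctld(uri)
    geoip = geoip_country_code.upper() in ARABIC_COUNTRY_CODES
    if cctld and geoip:
        return "Both"
    if cctld:
        return "ccTLD only"
    if geoip:
        return "GeoIP only"
    return "Neither"


# The four examples from the slide, with the GeoIP country given as an ISO code.
for uri, country in [("al-watan.com", "QA"), ("haraj.com.sa", "IE"),
                     ("uoh.edu.sa", "SA"), ("alarabiya.net", "US")]:
    print(uri, "->", classify(uri, country))
```

Keeping the GeoIP lookup outside the function makes the classification itself easy to check against the slide's examples.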
24. Selecting seed URIs
    Name      Registered     Year   URI                     URI count
    DMOZ      US             1999   Dmoz.org/world/arabic   4,086
    Raddadi   Saudi Arabia   2000   Raddadi.com             3,271
    Star28    Lebanon        2004   Star28.com              8,386
    Total                                                   15,743
• 15,092 unique seed URIs
• 11,014 URIs that existed in the live web
25. Determining a webpage language
• HTTP header Content-Language
• HTML title tag language
• Trigram method
• Language detection API client
26. >
curl
–I
www.alquds.com
HTTP/1.1
200
OK
Server:
nginx/1.6.2
Date:
Wed,
03
Jun
2015
19:11:31
GMT
Content-‐Type:
text/html;
charset=utf-‐8
Connection:
keep-‐alive
X-‐Powered-‐By:
PHP/5.3.3
X-‐Drupal-‐Cache:
HIT
Etag:
"1433361507-‐0"
Content-‐Language:
ar
…
HTTP header Content-Language
example#1
26
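The same header check can be done from Python. The sketch below is a minimal illustration, not the authors' code: the function names are hypothetical, and treating any "ar" or "ar-*" tag as Arabic is an assumption about how the header would be interpreted.

```python
import urllib.request


def content_language_tags(header_value):
    """Split a Content-Language header value into lowercase language tags."""
    return [tag.strip().lower()
            for tag in (header_value or "").split(",") if tag.strip()]


def declares_arabic(header_value):
    """True if any tag is 'ar' or a regional variant such as 'ar-SA' (assumed rule)."""
    return any(tag == "ar" or tag.startswith("ar-")
               for tag in content_language_tags(header_value))


def fetch_content_language(url):
    """Send a HEAD request (like curl -I) and return the header value, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get("Content-Language")


# Offline check using the header value shown in the curl output above.
print(declares_arabic("ar"))
```

A live check would be declares_arabic(fetch_content_language("http://www.alquds.com")); many servers omit the header, which is presumably why other methods are also listed.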
37. HTML title tag language: example #2
https://code.google.com/p/guess-language/
    > curl -s www.cnn.com | grep -io "<title>[^<]*" | tail -c+8 > cnn_title.txt
    > python
    >>> myfile = open("cnn_title.txt", "r")
    >>> data = myfile.read()
    >>> from guess_language import guess_language
    >>> guess_language(data)
    'en'
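The grep and tail step can also be replaced by a small parser. The following is a sketch using only the Python standard library; TitleExtractor and extract_title are hypothetical names, and the extracted string would then be handed to a language guesser as in the session above.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html):
    """Return the page title with surrounding whitespace removed ('' if none)."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.title_parts).strip()


# Hypothetical page used only to exercise the function.
sample = "<html><head><TITLE> Breaking News </TITLE></head><body>x</body></html>"
print(extract_title(sample))
```

Unlike the regular expression, the parser is not thrown off by attributes on the title tag or by mixed-case tag names.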
39. Trigram method
https://github.com/decultured/Python-Language-Detector
§ Built in C++ and wrapped as a Python module
§ Identification is performed through basic trigram lookups paired with Unicode character set recognition
§ Accuracy is high for even short sample texts
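To make the idea of trigram lookups concrete, here is a toy sketch in pure Python. It is not the Python-Language-Detector implementation (which is C++ and also uses Unicode character set recognition): it simply compares character-trigram counts by cosine similarity, and the two tiny training sentences are invented for illustration.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text):
    """Count overlapping character trigrams in whitespace-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(p, q):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm = (sqrt(sum(c * c for c in p.values())) *
            sqrt(sum(c * c for c in q.values())))
    return dot / norm if norm else 0.0


def identify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    sample = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine_similarity(sample, profiles[lang]))


# Toy profiles built from one invented sentence per language; a real
# detector would be trained on far larger corpora.
profiles = {
    "en": trigram_profile("the quick brown fox jumps over the lazy dog and the cat"),
    "ar": trigram_profile("مرحبا بكم في موقع الأخبار العربية اليوم"),
}
print(identify("the dog and the fox", profiles))
print(identify("الأخبار العربية", profiles))
```

Because Arabic and English share no characters, even these tiny profiles separate the two; distinguishing languages that share a script is where larger trigram tables matter.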
40. Trigram method: example #1
https://github.com/decultured/Python-Language-Detector
    > curl www.raddadi.com > raddadi.txt
    > python
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(open("raddadi.txt"))
    >>> # Remove script and style elements so only visible text remains
    >>> for script in soup(["script", "style"]):
    ...     script.extract()
    ...
    >>> text = soup.get_text()
    >>> # Trim each line, split on double spaces, and drop empty chunks
    >>> lines = (line.strip() for line in text.splitlines())
    >>> chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    >>> text = '\n'.join(chunk for chunk in chunks if chunk)
    >>> import sys
    >>> sys.path.append('languageDetector')
    >>> import languageIdentifier
    >>> languageIdentifier.load("languageDetector/trigrams/")
    >>> print(languageIdentifier.identify(text, 300, 300))
    ar
43. Trigram method: example #2
https://github.com/decultured/Python-Language-Detector
    > curl www.cnn.com > cnn.txt
    > python
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(open("cnn.txt"))
    >>> # Remove script and style elements so only visible text remains
    >>> for script in soup(["script", "style"]):
    ...     script.extract()
    ...
    >>> text = soup.get_text()
    >>> # Trim each line, split on double spaces, and drop empty chunks
    >>> lines = (line.strip() for line in text.splitlines())
    >>> chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    >>> text = '\n'.join(chunk for chunk in chunks if chunk)
    >>> import sys
    >>> sys.path.append('languageDetector')
    >>> import languageIdentifier
    >>> languageIdentifier.load("languageDetector/trigrams/")
    >>> print(languageIdentifier.identify(text, 300, 300))
    en
46. Language detection API client
https://detectlanguage.com
• Returns detected language codes and scores
• You have to set up your personal API key (http://detectlanguage.com)
• Example of output:
    {"data":{"detections":[{"language":"ar","isReliable":true,"confidence":9.54}]}}
• language: the detected language code
• isReliable: false means that the confidence is low
• confidence: depends on how much text you pass and how well it is identified
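A response in this shape can be reduced to a single language code with a few lines of Python. This is a sketch that assumes the raw JSON text shown above; best_reliable_language is a hypothetical helper, and choosing the highest-confidence reliable detection is an assumption about how the output would be used.

```python
import json


def best_reliable_language(response_text):
    """Return the language of the most confident reliable detection, or None."""
    detections = json.loads(response_text)["data"]["detections"]
    reliable = [d for d in detections if d.get("isReliable")]
    if not reliable:
        return None
    return max(reliable, key=lambda d: d["confidence"])["language"]


# Sample response in the same shape as the example output on the slide.
sample = ('{"data":{"detections":['
          '{"language":"ar","isReliable":true,"confidence":9.54}]}}')
print(best_reliable_language(sample))
```

Returning None when nothing is reliable lets the caller fall back to one of the other detection methods.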
48. Language detection API client: example #1
https://detectlanguage.com
    > curl www.raddadi.com > raddadi.txt
    > python
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(open("raddadi.txt"))
    >>> # Remove script and style elements so only visible text remains
    >>> for script in soup(["script", "style"]):
    ...     script.extract()
    ...
    >>> text = soup.get_text()
    >>> # Trim each line, split on double spaces, and drop empty chunks
    >>> lines = (line.strip() for line in text.splitlines())
    >>> chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    >>> text = '\n'.join(chunk for chunk in chunks if chunk)
    >>> import detectlanguage
    >>> detectlanguage.configuration.api_key = "YOUR API KEY"
    >>> detectlanguage.detect(text)
    {"data":{"detections":[{"language":"ar","isReliable":true,"confidence":8.32},{"language":"tk","isReliable":false,"confidence":0.01}]}}
51. Language detection API client: example #2
https://detectlanguage.com
    > curl www.cnn.com > cnn.txt
    > python
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(open("cnn.txt"))
    >>> # Remove script and style elements so only visible text remains
    >>> for script in soup(["script", "style"]):
    ...     script.extract()
    ...
    >>> text = soup.get_text()
    >>> # Trim each line, split on double spaces, and drop empty chunks
    >>> lines = (line.strip() for line in text.splitlines())
    >>> chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    >>> text = '\n'.join(chunk for chunk in chunks if chunk)
    >>> import detectlanguage
    >>> detectlanguage.configuration.api_key = "YOUR API KEY"
    >>> detectlanguage.detect(text)
    {"data":{"detections":[{"language":"en","isReliable":true,"confidence":6.14}]}}
64. 17,536 unique domains
    Rank   Domain              URIs   GeoIP    Category
    1      Alarab.net          284    US       News
    2      Aljarida.com        248    US       News
    3      Arabic.cnn.com      245    US       News
    4      Alarabiya.net       231    US       News
    5      Ar.wikipedia.org    230    US       Encyclopedia
    6      Aljazeera.net       213    US       News
    7      Moheet.com          142    US       News
    8      Facebook.com        133    US       Social
    9      Al-sharq.com        132    US       Middle East Portal
    10     Lakii.com           123    US       General Portal
    17     Kuwaitclub.com.kw   71     Kuwait   Sport
• First Arabic GeoIP location is at rank 17
• 6 out of 10 top unique domains are news websites
• Popular Western pages are in the top unique domains
68. Top TLDs
    TLD      Percent
    com      57.97%
    net      15.07%
    org      6.40%
    gov.sa   1.94%
    info     1.68%
    edu.sa   1.27%
    ws       1.16%
    org.sa   0.97%
    com.sa   0.80%
    gov.eg   0.80%
    Other    11.94%
• Almost 58% are .com
• Small percentage of Arabic TLD
71. Small percentage of Arabic TLD
    TLD   Country                Percent
    .sa   Saudi Arabia           5.33%
    .eg   Egypt                  2.00%
    .jo   Jordan                 2.00%
    .ae   United Arab Emirates   1.06%
    .kw   Kuwait                 0.82%
73. More than 57% are of depth 0 and 1
    Path Depth   Example               Percent
    0            Example.com           17.30%
    1            Example.com/a         40.42%
    2            Example.com/a/b       24.45%
    3            Example.com/a/b/c     10.81%
    4+           Example.com/a/b/c/d   7.02%
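Path depth as used in the table can be computed with the standard library. This is a minimal sketch: path_depth is a hypothetical helper, and ignoring trailing slashes and query strings is an assumption, since the slides do not say how those cases were counted.

```python
from urllib.parse import urlsplit


def path_depth(uri):
    """Count the non-empty path segments of a URI (Example.com/a/b -> 2)."""
    # Prefix scheme-less URIs with '//' so the host is not mistaken for a path.
    path = urlsplit(uri if "//" in uri else "//" + uri).path
    return len([segment for segment in path.split("/") if segment])


for uri in ["Example.com", "Example.com/a", "Example.com/a/b",
            "Example.com/a/b/c", "Example.com/a/b/c/d"]:
    print(uri, path_depth(uri))
```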
75. 53.77% of Arabic URIs are archived
• January-March 2015
• ODU CS Memento Aggregator
Median=16
76. Most of the top archived URI-Rs are news websites
    URI-Rs              Mementos   Category
    gulfup.com          10,987     File Sharing
    masrawy.com         9,144      Egyptian portal
    arabic.cnn.com      9,022      News
    aljazeera.net       8,906      News
    maktoob.yahoo.com   8,478      Search Engine
    shorooknews.com     7,548      News
    arabnews.com        6,274      News
    bbc.co.uk/arabic    6,268      News
    ahram.org.eg        5,347      News
    google.com.sa       4,968      Search Engine
80. Two methods to determine the presence in each archive
1. Percent of URI-Rs present in each archive
   e.g. http://aljazeera.net
2. Percent of URI-Ms present in each archive
   e.g. http://wayback.archive-it.org/all/20070727215420/http://www.aljazeera.net/
   e.g. http://web.archive.org/web/20150618104846/http://aljazeera.net/
81. Presence in each archive example
Mementos per URI-R in each archive:
              Internet Archive   Archive.today   Webcitation   Total
    URI-R1    2                  0               0             2
    URI-R2    2                  0               0             2
    URI-R3    1                  1               0             2
    URI-R4    1                  1               0             2
    URI-R5    0                  1               1             2
    Total     6                  3               1             10
1- Percent of URI-Rs present in each archive
    Archive            Total     Percentage
    Internet Archive   4/5=0.8   80%
    Archive.today      3/5=0.6   60%
    Webcitation        1/5=0.2   20%
    Total                        160%
2- Percent of URI-Ms present in each archive
    Archive            Total      Percentage
    Internet Archive   6/10=0.6   60%
    Archive.today      3/10=0.3   30%
    Webcitation        1/10=0.1   10%
    Total                         100%
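Both calculations can be reproduced from the example counts. The sketch below is only an illustration of the arithmetic: the dictionary layout and function names are hypothetical, while the numbers are the ones in the example table.

```python
# Mementos per URI-R in each archive, copied from the example table.
counts = {
    "URI-R1": {"Internet Archive": 2, "Archive.today": 0, "Webcitation": 0},
    "URI-R2": {"Internet Archive": 2, "Archive.today": 0, "Webcitation": 0},
    "URI-R3": {"Internet Archive": 1, "Archive.today": 1, "Webcitation": 0},
    "URI-R4": {"Internet Archive": 1, "Archive.today": 1, "Webcitation": 0},
    "URI-R5": {"Internet Archive": 0, "Archive.today": 1, "Webcitation": 1},
}
ARCHIVES = ["Internet Archive", "Archive.today", "Webcitation"]


def percent_uri_rs(counts):
    """Method 1: share of URI-Rs with at least one memento in each archive."""
    total = len(counts)
    return {a: 100 * sum(1 for row in counts.values() if row[a] > 0) / total
            for a in ARCHIVES}


def percent_uri_ms(counts):
    """Method 2: share of all mementos (URI-Ms) held by each archive."""
    total = sum(sum(row.values()) for row in counts.values())
    return {a: 100 * sum(row[a] for row in counts.values()) / total
            for a in ARCHIVES}


print(percent_uri_rs(counts))
print(percent_uri_ms(counts))
```

Method 1 can sum to more than 100% because a URI-R may be held by several archives at once, whereas each memento belongs to exactly one archive, so Method 2 always sums to 100%.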
84. Presence in each archive
1- Percent of URI-Rs present in each archive
    Archive                     Percent
    Internet Archive            97.04%
    Archive.today               6.58%
    Webcitation                 6.00%
    Archive-It                  5.49%
    British Library Archive     1.06%
    UK Parliament Web Archive   0.88%
    Icelandic Web Archive       0.87%
    UK National Archives        0.62%
    Proni                       0.21%
    Stanford                    0.11%
    Total                       118.86%
2- Percent of URI-Ms present in each archive
    Archive                     Percent
    Internet Archive            72.87%
    Archive-It                  21.26%
    Archive.today               2.14%
    Webcitation                 2.08%
    Icelandic Web Archive       1.17%
    British Library Archive     0.29%
    UK Parliament Web Archive   0.10%
    Proni                       0.05%
    UK National Archives        0.04%
    Stanford                    <0.01%
    Total                       100%
87. Average archiving period (days)
Average archiving period = (LM-FM) / number of mementos
• 16,732 URIs have only one memento
• Median=48 days
• Values less than 1 indicate that the URI is archived multiple times per day
• The larger the period, the more irregularly the URI was captured by the archives
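The formula is easy to check on a small example. The sketch below assumes that LM and FM stand for the datetimes of the last and first memento (the slides do not spell the abbreviations out); the function name and the sample dates are invented for illustration.

```python
from datetime import datetime


def average_archiving_period_days(memento_datetimes):
    """Return (last memento - first memento) in days, divided by the memento count."""
    if not memento_datetimes:
        raise ValueError("at least one memento is required")
    first, last = min(memento_datetimes), max(memento_datetimes)
    span_days = (last - first).total_seconds() / 86400
    return span_days / len(memento_datetimes)


# Three hypothetical mementos spanning 90 days.
mementos = [datetime(2015, 1, 1), datetime(2015, 1, 31), datetime(2015, 4, 1)]
print(average_archiving_period_days(mementos))
```

With a single memento the span is zero, so the result is 0 regardless of when the capture happened; that is worth keeping in mind given how many URIs have only one memento.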
89. Creation date for archived Arabic URIs
We used CarbonDate for creation date estimates
Source: http://ws-dl.blogspot.com/2014/11/2014-11-14-carbon-dating-web-version-20.html
92. Top GeoIP locations
    Location           Percent
    United States      57.97%
    Arabic Countries   10.53%
    Germany            9.75%
    Netherlands        5.29%
    France             4.37%
    Canada             3.31%
    United Kingdom     3.07%
    Other              5.71%
Top Arabic countries:
    Location               Percent
    Saudi Arabia           4.75%
    Egypt                  1.97%
    Jordan                 1.42%
    Kuwait                 0.71%
    United Arab Emirates   0.67%
96. Status of Arabic seed URIs
Seed Data Set
    (Live, Indexed, Archived)   Percent
    (1, 1, 1)                   43.34%    (Good) discovered and saved
    (1, 1, 0)                   25.59%
    (1, 0, 1)                   15.27%
    (1, 0, 0)                   15.76%    (Bad) undiscovered and not saved
• 31% were not indexed by Google
100. Difference between creation date and first memento
• 18% have creation dates over 1 year before the first memento was archived
• 19.48% of the URIs have an estimated creation date that is the same as the first memento date
101. DMOZ URIs are more likely to be found and archived
Seed Data Set
    Source    Arabic   Archived   Indexed
    DMOZ      34.43%   95.52%     82.13%
    Raddadi   19.88%   45.44%     65.83%
    Star28    45.69%   41.54%     65.23%
104. Arabic webpages hosted in Western countries are more likely to be archived
Full Data Set
    Location   Total    Archived
    Arabic     33.18%   33.56%
    Neither    66.82%   65.22%

    Category   Total    Archived
    AR ccTLD   14.84%   28.09%
    AR GeoIP   10.53%   13.11%
    AR both    7.81%    59.50%
    Neither    66.82%   65.22%
106. URIs that had some Arabic location had a higher indexing rate
Seed Data Set
    Location   Total    Indexed
    Arabic     15.01%   78.29%
    Neither    84.99%   65.22%

    Category   Total    Indexed
    AR ccTLD   6.61%    76.09%
    AR GeoIP   2.37%    73.54%
    AR both    6.03%    85.24%
    Neither    84.99%   67.09%
108. The spread of mementos was not affected by location or ccTLD
Ø Kolmogorov-Smirnov test
    Category   Mean
    Ar GeoIP   0.5010
    Ar ccTLD   0.5013
    Both       0.5016
    Neither    0.5005

    Comparison             D-Value   P-Value
    Ar ccTLD vs. neither   0.017     <0.002
    Ar GeoIP vs. neither   0.014     <0.002
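For readers unfamiliar with the D-value, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the two empirical cumulative distributions. The sketch below computes it in pure Python on invented numbers; it does not reproduce the slide's values, since the slides do not say how the per-URI spread scores were derived, and it leaves out the p-value calculation.

```python
from bisect import bisect_right


def ks_d_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all observed x."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    for x in a + b:
        # bisect_right counts values <= x, which gives the ECDF at x.
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical spread scores for two groups of URIs.
print(ks_d_statistic([0.1, 0.4, 0.5, 0.9], [0.2, 0.5, 0.6, 0.8]))
```

A D-value near 0, as in the table, means the two distributions are almost indistinguishable in shape.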
109. Just because a webpage is older does not mean that it is archived more
• Because of low historical archiving rates
• We look at the last three years
112. In the last three years, the older the resource is, the more mementos it has
113. Top level URIs are more likely to be archived and indexed
                 Full Data Set        Seed Data Set
    Path Depth   Total    Archived    Total    Indexed
    0            17.30%   86.29%      86.05%   74.60%
    1            40.42%   53.49%      9.77%    38.91%
    2            24.45%   45.57%      3.72%    17.85%
    3+           17.83%   34.24%      0.50%    57.50%
116. Summary of collection methods
• Collected URIs from three Arabic directories (7,976):
  Ø DMOZ
  Ø Raddadi.com
  Ø Star28.com
• Crawl seed dataset (1,299,671)
• Check if they are unique (663,443)
• Check if they are live (482,905)
• Check for Arabic language (300,646)
117. Findings
§ Our Arabic language dataset was not largely located in Arabic countries
  Ø Only 14.84% had an Arabic ccTLD
  Ø Only 10.53% had a GeoIP in an Arabic country
  Ø Popular Western domains (e.g., cnn.com, wikipedia.org) appeared in the top 10
§ Arabic webpages are not particularly well archived or indexed
  Ø 46% were not archived
  Ø 31% were not indexed by Google
§ An Arabic webpage is more likely to be...
  Ø indexed if it is present in a directory
  Ø archived if it is present in DMOZ
  Ø archived if it has neither Arabic GeoIP nor Arabic ccTLD
For right now, if you want your Arabic language webpage to be archived, host it outside of an Arabic country and get it listed in DMOZ
120. GeoIP Location
• We obtained the IP addresses of the hostnames using nslookup (which uses DNS to convert the hostname to its IP address)
• We used the MaxMind GeoLite2 database to determine location from the IP address (which tests at 99.8% accuracy at the country level)
http://dev.maxmind.com/geoip/geoip2/geolite2/
http://dev.maxmind.com/faq/how-accurate-are-the-geoip-databases/
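The two steps above (hostname to IP, then IP to country) might look like the following in Python. This is a sketch under assumptions rather than the authors' tooling: socket.gethostbyname stands in for nslookup, country_from_geolite2 relies on the third-party geoip2 package and a locally downloaded GeoLite2 country database whose path is supplied by the caller, and geoip_country is a hypothetical wrapper that accepts both steps as parameters so it can be tried without network access.

```python
import socket


def resolve_ip(hostname):
    """Resolve a hostname to an IPv4 address via DNS (stand-in for nslookup)."""
    return socket.gethostbyname(hostname)


def country_from_geolite2(ip_address, database_path):
    """Look up the ISO country code in a local GeoLite2 country database.

    Requires the third-party 'geoip2' package; database_path is whatever
    location the database file was downloaded to (an assumption here).
    """
    import geoip2.database
    with geoip2.database.Reader(database_path) as reader:
        return reader.country(ip_address).country.iso_code


def geoip_country(hostname, country_lookup, resolver=resolve_ip):
    """Resolve the hostname, then map the resulting IP to a country code."""
    return country_lookup(resolver(hostname))


# Offline demonstration with stub functions and a documentation-range IP.
print(geoip_country("uoh.edu.sa",
                    country_lookup=lambda ip: "SA",
                    resolver=lambda host: "192.0.2.10"))
```

In real use, country_lookup would be something like lambda ip: country_from_geolite2(ip, path_to_database), with the default resolver doing the DNS query.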