How Well Are Arabic Websites Archived?
Lulwah M. Alkwai, Michael L. Nelson, and Michele C. Weigle
Old Dominion University
Department of Computer Science
Norfolk, Virginia 23529 USA
JCDL 2015
Knoxville, TN
June 21-25, 2015
Archived events on English sites vs. Arabic sites
Search: Baltimore (one week old)
http://www.foxnews.com/us/2015/05/26/2-shot-dead-in-bloody-memorial-day-weekend-in-baltimore-capping-off-deadliest/
Search: Yemen Houthis (one week old)
http://www.yemenakhbar.com/yemen-news/178683.html
English sports websites are archived more than Arabic ones
www.espn.go.com vs. www.kooora.com
English e-Marketing websites are archived more than Arabic ones
www.amazon.com vs. www.haraj.com.sa
English encyclopedia websites are archived more than Arabic ones
en.wikipedia.org vs. ar.wikipedia.org
Top ten languages in the Internet
[Figure: World Language Map. Source: Quick Maps of the World, http://www.allcountries.org/maps/world_language_maps.html]
[Figure: Top ten languages in the Internet. Source: Internet World Stats, http://www.internetworldstats.com/stats7.htm]
Arabic speaking Internet users
# | Country | Population (2009) | Internet Users (2009) | Penetration (2009) | Population (2013) | Internet Users (2013) | Penetration (2013)
1 | Algeria | 34,178,188 | 4,100,000 | 12.00% | 38,813,722 | 6,404,264 | 16.50%
2 | Bahrain | 728,709 | 402,900 | 55.30% | 1,314,089 | 1,182,680 | 90.00%
3 | Comoros | 752,438 | 23,000 | 3.10% | 766,865 | 49,846 | 6.50%
4 | Djibouti | 724,622 | 19,200 | 2.60% | 810,179 | 76,967 | 9.50%
5 | Egypt | 78,866,635 | 16,636,000 | 21.10% | 86,895,099 | 43,065,211 | 49.60%
6 | Iraq | 28,945,569 | 300,000 | 1.00% | 32,585,692 | 2,997,884 | 9.20%
7 | Jordan | 6,269,285 | 1,595,200 | 25.40% | 6,528,061 | 2,885,403 | 44.20%
8 | Kuwait | 2,692,526 | 1,000,000 | 37.10% | 2,742,711 | 2,069,650 | 75.50%
9 | Lebanon | 4,017,095 | 945,000 | 23.50% | 4,136,895 | 2,916,511 | 70.50%
10 | Libya | 6,324,357 | 323,000 | 5.10% | 6,244,174 | 1,030,289 | 16.50%
11 | Mauritania | 3,129,486 | 60,000 | 1.90% | 3,516,806 | 218,042 | 6.20%
12 | Morocco | 31,285,174 | 10,442,500 | 33.40% | 32,987,206 | 18,472,835 | 56.00%
13 | Oman | 3,418,085 | 557,000 | 16.30% | 3,219,775 | 2,139,540 | 66.40%
14 | Qatar | 833,285 | 436,000 | 52.30% | 2,123,160 | 1,811,055 | 85.30%
15 | Saudi Arabia | 28,686,633 | 7,761,800 | 27.10% | 27,345,986 | 16,544,322 | 60.50%
16 | Somalia | 9,832,017 | 102,000 | 1.00% | 10,428,043 | 156,420 | 1.50%
17 | South Sudan | - | - | - | 11,562,695 | 100 | 0.00%
18 | Sudan | 41,087,825 | 4,200,000 | 10.20% | 35,482,233 | 8,054,467 | 22.70%
19 | Syria | 21,762,978 | 3,565,000 | 16.40% | 22,597,531 | 5,920,553 | 26.20%
20 | Tunisia | 10,486,339 | 3,500,000 | 33.40% | 10,937,521 | 4,790,634 | 43.80%
21 | UAE | 4,798,491 | 3,558,000 | 74.10% | 9,206,000 | 8,101,280 | 88.00%
22 | Palestine | 2,461,267 | 355,500 | 14.40% | 2,731,052 | 1,512,273 | 55.40%
23 | Yemen | 22,858,238 | 370,000 | 1.60% | 26,052,966 | 5,210,593 | 20.00%
  | Arabic Total | 344,139,242 | 60,252,100 | 17.50% | 379,028,461 | 135,610,819 | 35.80%
  | World Total | 6,767,805,208 | 1,802,330,457 | 26.60% | 7,181,858,619 | 2,802,478,934 | 39.00%
Source: http://www.internetworldstats.com/stats19.htm
2009
Arabic Total=17.5%
World Total=26.6%
2013
Arabic Total=35.8%
World Total=39.0%
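The penetration figures quoted above are Internet users divided by population. As a quick check, the sketch below recomputes the four headline percentages from the totals in the table; the dictionary layout and the function name are illustrative choices, not part of the original slides.

```python
# Totals copied from the Internet World Stats table above:
# (population, Internet users) per region and year.
totals = {
    ("Arabic", 2009): (344_139_242, 60_252_100),
    ("Arabic", 2013): (379_028_461, 135_610_819),
    ("World", 2009): (6_767_805_208, 1_802_330_457),
    ("World", 2013): (7_181_858_619, 2_802_478_934),
}


def penetration(population, users):
    """Internet penetration as a percentage of the population."""
    return 100.0 * users / population


# Round to one decimal place, as on the slides.
rates = {key: round(penetration(*value), 1) for key, value in totals.items()}

for (region, year), rate in sorted(rates.items()):
    print(f"{region} {year}: {rate}%")
```

The same function can be applied to any single country row of the table.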
Why are we doing this?
• The number of Arabic-speaking Internet users has grown rapidly
• There has been previous work on the coverage of web archives
• Little has been done in terms of Arabic-language content
How Much of the Web Is Archived?
• Sample of URIs from four different sources (DMOZ, Delicious, Bitly, search engine indexes)
• The archival percentages ranged from 16% to 79%
2013, a follow-on study:
• Archival percentages had increased, ranging from 33% to 95%
• These studies were not focused on content from specific countries or content in specific languages
A fair history of the Web?
Examining country balance in the Internet Archive
• Examined country balance in the Internet Archive:
Country | Domain | Archived
US | .com | 92%
Taiwan | .com.tw | 73%
China | .com.cn | 58%
Singapore | .com.sg | 73%
• This work focused on TLD rather than content language or location
Characterization of National Web Domains
• Used 10 national web domains
  - 120 million pages
  - 24 countries
  - They studied page sizes, degrees, link-based scores, etc.
  - They found that depth and response codes were similar
• In this work, additional methods are required to determine if a site belongs to a particular country
Characterizing a National Community Web
• Used Portuguese dataset:
  - (.pt) ccTLD
  - (.com, .net, .org, .tv) in Portuguese language that has at least one incoming link from (.pt) ccTLD
• They identify, collect, and characterize the Portuguese Web
How do we classify Arabic websites?
• GeoIP only: News, al-watan.com. ccTLD: not Arabic (.com); GeoIP: Arabic country (Qatar)
• ccTLD only: E-Marketing, haraj.com.sa. ccTLD: Arabic (.sa); GeoIP: not an Arabic country (Ireland)
• Both: Educational, uoh.edu.sa. ccTLD: Arabic (.sa); GeoIP: Arabic country (SA)
• Neither: News, alarabiya.net. ccTLD: not Arabic (.net); GeoIP: not an Arabic country (US)
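The four categories can be expressed as two independent checks, one on the ccTLD and one on the GeoIP country. The sketch below is a minimal illustration, assuming the GeoIP lookup has already been done elsewhere and its two-letter country code is passed in; the `ARAB_CC` set and the `classify` function are hypothetical names, and the set simply lists the country codes of the 23 countries in the Internet-usage table.

```python
# Two-letter codes of the 23 countries in the usage table above.
# For these countries the ccTLD and the ISO country code coincide.
ARAB_CC = {
    "dz", "bh", "km", "dj", "eg", "iq", "jo", "kw", "lb", "ly", "mr", "ma",
    "om", "qa", "sa", "so", "ss", "sd", "sy", "tn", "ae", "ps", "ye",
}


def classify(hostname, geoip_country):
    """Return 'Both', 'ccTLD only', 'GeoIP only', or 'Neither'.

    geoip_country is the country code returned by a GeoIP lookup that the
    caller performs separately (assumed, not shown here).
    """
    tld = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    has_cctld = tld in ARAB_CC
    has_geoip = geoip_country.lower() in ARAB_CC
    if has_cctld and has_geoip:
        return "Both"
    if has_cctld:
        return "ccTLD only"
    if has_geoip:
        return "GeoIP only"
    return "Neither"


# The four examples from the slide.
for host, country in [("al-watan.com", "QA"), ("haraj.com.sa", "IE"),
                      ("uoh.edu.sa", "SA"), ("alarabiya.net", "US")]:
    print(host, "->", classify(host, country))
```

Sites that land in "Neither" are the reason a language test on the page content is still needed.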
Selecting seed URIs
Name | Registered | Year | URI | Count
DMOZ | US | 1999 | Dmoz.org/world/arabic | 4,086
Raddadi | Saudi Arabia | 2000 | Raddadi.com | 3,271
Star28 | Lebanon | 2004 | Star28.com | 8,386
Total | | | | 15,743
•  15,092 unique seed URIs
•  11,014 URIs that existed in the live web
Determining a webpage language
•  HTTP header Content-Language
•  HTML title tag language
•  Trigram method
•  Language detection API client
HTTP header Content-Language, example #1
> curl -I www.alquds.com
HTTP/1.1 200 OK
Server: nginx/1.6.2
Date: Wed, 03 Jun 2015 19:11:31 GMT
Content-Type: text/html; charset=utf-8
Connection: keep-alive
X-Powered-By: PHP/5.3.3
X-Drupal-Cache: HIT
Etag: "1433361507-0"
Content-Language: ar
…
HTTP header Content-Language, example #2
> curl -I www.raddadi.com
HTTP/1.1 200 OK
Server: nginx/1.8.0
Date: Sat, 06 Jun 2015 22:47:09 GMT
Content-Type: text/html
Connection: keep-alive
…
> curl www.raddadi.com
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html dir="rtl" xmlns="http://www.w3.org/1999/xhtml" xml:lang="ar" lang="ar" >
<head>
Here the response carries no Content-Language header, although the HTML itself declares lang="ar".
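The two examples suggest a simple order of checks: use the Content-Language header when it is present, and otherwise look at the lang attribute of the html element. The following is a rough sketch of that idea using only the Python standard library; `declared_language` and `_HtmlLangParser` are illustrative names, the fallback to the lang attribute is an assumption rather than a step confirmed by the slides, and the headers and HTML are assumed to have been fetched already (for example with curl, as above).

```python
from html.parser import HTMLParser


class _HtmlLangParser(HTMLParser):
    """Records the lang attribute of the first <html> start tag."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.lang is None:
            self.lang = dict(attrs).get("lang")


def declared_language(headers, html):
    """Return (language, source), preferring the HTTP header."""
    for name, value in headers.items():
        if name.lower() == "content-language" and value.strip():
            # The header may list several languages; keep the first.
            return value.split(",")[0].strip().lower(), "header"
    parser = _HtmlLangParser()
    parser.feed(html)
    if parser.lang:
        return parser.lang.lower(), "html"
    return None, None


# Values taken from the two curl examples above.
alquds_headers = {"Content-Type": "text/html; charset=utf-8",
                  "Content-Language": "ar"}
raddadi_headers = {"Content-Type": "text/html"}
raddadi_html = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
                '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n'
                '<html dir="rtl" xmlns="http://www.w3.org/1999/xhtml" '
                'xml:lang="ar" lang="ar" >\n<head>')

print(declared_language(alquds_headers, ""))
print(declared_language(raddadi_headers, raddadi_html))
```

A declared language is only a hint, since many servers omit or misconfigure it, which is why the content-based tests follow.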
HTML title tag language
https://code.google.com/p/guess-language/
> curl www.star28.com
…
<META name="Copyright" content="© 2011 www.star28.com">
<META name="DISTRIBUTION" content="GLOBAL">
<META name="REVISIT-AFTER" content="1 DAYS">
<TITLE> دليل العرب الشامل </TITLE>
<META name="description" content=" دليل للمواقع العربية و أفضل المواقع العالمية, يحدث باستمرار ">
<META name="keywords" content=" دليل مواقع, تجارة, العاب, جافا سكربت, رياضة, منتديات, علوم, كومبيوتر, اسلام, اخبار, صحف, تلفزيون, سياحة, تعليم, زواج, توظيف ">
…
Then we use the guess-language Python library to determine the language
HTML title tag language, example #1
https://code.google.com/p/guess-language/
> curl -s www.gulfup.com | grep -io "<title>[^<]*" | tail -c+8 > gulfup_title.txt
> python
>>> myfile = open("gulfup_title.txt", "r")
>>> data = myfile.read()
>>> from guess_language import guess_language
>>> guess_language(data)
'ar'
HTML title tag language, example #2
https://code.google.com/p/guess-language/
> curl -s www.cnn.com | grep -io "<title>[^<]*" | tail -c+8 > cnn_title.txt
> python
>>> myfile = open("cnn_title.txt", "r")
>>> data = myfile.read()
>>> from guess_language import guess_language
>>> guess_language(data)
'en'
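The grep and tail pipeline above can also be done in Python. The sketch below extracts the title with the standard library's HTML parser and then applies a crude stand-in for the language guess: the share of letters that fall in the basic Arabic Unicode block. This heuristic is not the guess-language library and is only an assumption-laden illustration; `extract_title` and `arabic_letter_ratio` are hypothetical helper names.

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html):
    parser = _TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


def arabic_letter_ratio(text):
    """Fraction of alphabetic characters in the Arabic block U+0600-U+06FF."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for c in letters if "\u0600" <= c <= "\u06FF")
    return arabic / len(letters)


title = extract_title("<TITLE> دليل العرب الشامل </TITLE>")
print(title, arabic_letter_ratio(title))
```

A script-based ratio cannot separate Arabic from other languages written in Arabic script (Persian or Urdu, for instance), which is one reason a real language-identification library is preferable.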
Trigram method
https://github.com/decultured/Python-Language-Detector
• Built in C++ and wrapped as a Python module
• Identification is performed through basic trigram lookups paired with Unicode character set recognition
• Accuracy is high even for short sample texts
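To make the idea of trigram lookups concrete, here is a toy sketch in pure Python. It is not the Python-Language-Detector implementation: the two training samples are tiny, invented for illustration, and the cosine similarity between trigram counts is an assumed scoring choice. It only shows the principle that a text is assigned to the language whose character-trigram profile it most resembles.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text):
    """Count overlapping character trigrams of the lower-cased text."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = (sqrt(sum(v * v for v in a.values()))
            * sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Tiny illustrative training samples (assumptions, far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then "
          "the news of the day is read by the people",
    "ar": "دليل المواقع العربية و أفضل المواقع العالمية يحدث باستمرار "
          "اخبار رياضة علوم تعليم",
}
PROFILES = {lang: trigram_profile(sample) for lang, sample in SAMPLES.items()}


def identify(text):
    """Return the language whose profile is most similar to the text."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))


print(identify("دليل العرب الشامل"))
print(identify("the latest news of the world"))
```

A production detector would train the profiles on large corpora and, as the slide notes, combine them with character set recognition.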
Trigram method, example #1
https://github.com/decultured/Python-Language-Detector
> curl www.raddadi.com > raddadi.txt
> python
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open("raddadi.txt"))
>>> # drop script and style elements so only visible text remains
>>> for script in soup(["script", "style"]):
...     script.extract()
...
>>> text = soup.get_text()
>>> # strip each line, split on double spaces, and drop empty chunks
>>> lines = (line.strip() for line in text.splitlines())
>>> chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
>>> text = '\n'.join(chunk for chunk in chunks if chunk)
>>> import sys
>>> sys.path.append('languageDetector')
>>> import languageIdentifier
>>> languageIdentifier.load("languageDetector/trigrams/")
>>> print(languageIdentifier.identify(text, 300, 300))
ar
Trigram method, example #2
https://github.com/decultured/Python-Language-Detector
> curl www.cnn.com > cnn.txt
> python
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open("cnn.txt"))
>>> # drop script and style elements so only visible text remains
>>> for script in soup(["script", "style"]):
...     script.extract()
...
>>> text = soup.get_text()
>>> # strip each line, split on double spaces, and drop empty chunks
>>> lines = (line.strip() for line in text.splitlines())
>>> chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
>>> text = '\n'.join(chunk for chunk in chunks if chunk)
>>> import sys
>>> sys.path.append('languageDetector')
>>> import languageIdentifier
>>> languageIdentifier.load("languageDetector/trigrams/")
>>> print(languageIdentifier.identify(text, 300, 300))
en
Language detection API client
https://detectlanguage.com
•  Returns detected language codes and scores
•  You have to set up your personal API key (http://detectlanguage.com)
•  Example of output:
{"data":{"detections":[{"language":"ar","isReliable":true,"confidence":9.54}]}}
•  language: the language code
•  isReliable: false means that the confidence is low
•  confidence: depends on how much text you pass and how well it is identified
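A response in this shape can be reduced to a single language code by keeping only detections flagged as reliable and taking the one with the highest confidence. The sketch below works on the JSON text alone and makes no API call; `best_detection` is an illustrative helper name, and the selection rule is an assumption rather than the authors' documented procedure. The sample strings follow the response format shown on the slides.

```python
import json


def best_detection(response_text):
    """Return the most confident reliable language code, or None."""
    detections = json.loads(response_text)["data"]["detections"]
    reliable = [d for d in detections if d.get("isReliable")]
    if not reliable:
        return None
    return max(reliable, key=lambda d: d["confidence"])["language"]


single = ('{"data":{"detections":'
          '[{"language":"ar","isReliable":true,"confidence":9.54}]}}')
mixed = ('{"data":{"detections":'
         '[{"language":"ar","isReliable":true,"confidence":8.32},'
         '{"language":"tk","isReliable":false,"confidence":0.01}]}}')

print(best_detection(single))
print(best_detection(mixed))
```

Returning None when nothing is reliable leaves the decision to the other language tests instead of forcing a guess.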
Language detection API client, example #1
https://detectlanguage.com
> curl www.raddadi.com > raddadi.txt
> python
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open("raddadi.txt"))
>>> # drop script and style elements so only visible text remains
>>> for script in soup(["script", "style"]):
...     script.extract()
...
>>> text = soup.get_text()
>>> # strip each line, split on double spaces, and drop empty chunks
>>> lines = (line.strip() for line in text.splitlines())
>>> chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
>>> text = '\n'.join(chunk for chunk in chunks if chunk)
>>> import detectlanguage
>>> detectlanguage.configuration.api_key = "YOUR API KEY"
>>> detectlanguage.detect(text)
{"data":{"detections":[{"language":"ar","isReliable":true,"confidence":8.32},{"language":"tk","isReliable":false,"confidence":0.01}]}}
Language detection API client, example #2
https://detectlanguage.com
> curl www.cnn.com > cnn.txt
> python
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open("cnn.txt"))
>>> # drop script and style elements so only visible text remains
>>> for script in soup(["script", "style"]):
...     script.extract()
...
>>> text = soup.get_text()
>>> # strip each line, split on double spaces, and drop empty chunks
>>> lines = (line.strip() for line in text.splitlines())
>>> chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
>>> text = '\n'.join(chunk for chunk in chunks if chunk)
>>> import detectlanguage
>>> detectlanguage.configuration.api_key = "YOUR API KEY"
>>> detectlanguage.detect(text)
{"data":{"detections":[{"language":"en","isReliable":true,"confidence":6.14}]}}
Language test intersection testing for Arabic language
[Figure: overlap of the four language tests, with regions labeled ~41%, ~38%, ~36%, ~39%, and 872 (~8%)]
Total Arabic = 7,976
Crawling Arabic seed URIs
Unique: 663,443
Total Arabic URIs Dataset = (7,976 + 292,670) = 300,646
17,536 Unique domains
Rank | Domain | URIs | GeoIP | Category
1 | Alarab.net | 284 | US | News
2 | Aljarida.com | 248 | US | News
3 | Arabic.cnn.com | 245 | US | News
4 | Alarabiya.net | 231 | US | News
5 | Ar.wikipedia.org | 230 | US | Encyclopedia
6 | Aljazeera.net | 213 | US | News
7 | Moheet.com | 142 | US | News
8 | Facebook.com | 133 | US | Social
9 | Al-sharq.com | 132 | US | Middle East Portal
10 | Lakii.com | 123 | US | General Portal
17 | Kuwaitclub.com.kw | 71 | Kuwait | Sport
• First Arabic GeoIP location is at rank 17
• 6 out of 10 top unique domains are news websites
• Popular western pages are in the top unique domains
17,536 Unique domains, by TLD
TLD | Percent
com | 57.97%
net | 15.07%
org | 6.40%
gov.sa | 1.94%
info | 1.68%
edu.sa | 1.27%
ws | 1.16%
org.sa | 0.97%
com.sa | 0.80%
gov.eg | 0.80%
Other | 11.94%
• Almost 58% are .com
• Only a small percentage use an Arabic country TLD:
TLD | Country | Percent
.sa | Saudi Arabia | 5.33%
.eg | Egypt | 2.00%
.jo | Jordan | 2.00%
.ae | United Arab Emirates | 1.06%
.kw | Kuwait | 0.82%
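A distribution like the one above can be produced by reducing each hostname to a TLD label and counting. The sketch below is one possible way to do that, assuming a simple rule for two-level labels such as com.sa or gov.eg (a two-letter country code preceded by a common second-level label); the `SECOND_LEVEL` set and the `tld_label` function are hypothetical, and the slides do not say how the authors grouped TLDs. The hostnames are taken from elsewhere in this deck.

```python
from collections import Counter

# Second-level labels treated as part of the TLD when followed by a
# two-letter country code (an assumption made for this sketch).
SECOND_LEVEL = {"com", "net", "org", "gov", "edu"}


def tld_label(hostname):
    """Return 'com.sa'-style labels for country domains, else the last label."""
    labels = hostname.lower().rstrip(".").split(".")
    if (len(labels) >= 3 and len(labels[-1]) == 2
            and labels[-2] in SECOND_LEVEL):
        return ".".join(labels[-2:])
    return labels[-1]


hosts = ["www.haraj.com.sa", "alarabiya.net", "ahram.org.eg",
         "ar.wikipedia.org", "uoh.edu.sa", "arabic.cnn.com"]
distribution = Counter(tld_label(h) for h in hosts)

for label, count in distribution.most_common():
    print(f"{label}: {100.0 * count / len(hosts):.2f}%")
```

A public suffix list would be a more robust basis for a full analysis than this hand-written rule.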
Path Depth | Example | Percent
0 | Example.com | 17.30%
1 | Example.com/a | 40.42%
2 | Example.com/a/b | 24.45%
3 | Example.com/a/b/c | 10.81%
4+ | Example.com/a/b/c/d | 7.02%
More than 57% are of depth 0 and 1
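Path depth here is the number of non-empty segments in the URI path. A minimal sketch of that count using the standard library follows; `path_depth` is an illustrative name, and prepending a scheme to bare hostnames is an assumption made so that `urlsplit` parses them as expected.

```python
from urllib.parse import urlsplit


def path_depth(uri):
    """Number of non-empty path segments in the URI."""
    if "://" not in uri:
        # Bare hostnames such as "Example.com/a" need a scheme to parse.
        uri = "http://" + uri
    segments = [s for s in urlsplit(uri).path.split("/") if s]
    return len(segments)


for example in ["Example.com", "Example.com/a", "Example.com/a/b",
                "Example.com/a/b/c", "Example.com/a/b/c/d"]:
    print(example, path_depth(example))
```

Depths of 4 or more would be grouped into the "4+" row of the table.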
53.77% of Arabic URIs are archived
•  January-March 2015
•  ODU CS Memento Aggregator
Median=16
URI-Rs Memento Category
gulfup.com 10,987 File Sharing
masrawy.com 9,144 Egyptian portal
arabic.cnn.com 9,022 News
aljazeera.net 8,906 News
maktoob.yahoo.com 8,478 Search Engine
shorooknews.com 7,548 News
arabnews.com 6,274 News
bbc.co.uk/arabic 6,268 News
ahram.org.eg 5,347 News
google.com.sa 4,968 Search Engine
Most of the top archived URI-Rs are news
websites
Archiving has accelerated since 2011
(Figure annotation: March 2015)
Two methods to determine the presence in each archive
1.  Percent of URI-Rs present in each archive
e.g. http://aljazeera.net
2.  Percent of URI-Ms present in each archive
e.g. http://wayback.archive-it.org/all/20070727215420/http://www.aljazeera.net/
e.g. http://web.archive.org/web/20150618104846/http://aljazeera.net/
Presence in each archive example
URI-R    Internet Archive  Archive.today  Webcitation  Total
URI-R1   2                 0              0            2
URI-R2   2                 0              0            2
URI-R3   1                 1              0            2
URI-R4   1                 1              0            2
URI-R5   0                 1              1            2
Total    6                 3              1            10
1- Percent of URI-Rs present in each archive
Archive           Total     Percentage
Internet Archive  4/5=0.8   80%
Archive.today     3/5=0.6   60%
Webcitation       1/5=0.2   20%
Total                       160%
2- Percent of URI-Ms present in each archive
Archive           Total     Percentage
Internet Archive  6/10=0.6  60%
Archive.today     3/10=0.3  30%
Webcitation       1/10=0.1  10%
Total                       100%
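The two measures in the example can be checked with a short script. This is a sketch only: the memento counts are copied from the example table, while the function names `percent_of_uri_rs` and `percent_of_uri_ms` are hypothetical and not taken from the study.

```python
# Memento counts per URI-R and archive, copied from the example table.
COUNTS = {
    "URI-R1": {"Internet Archive": 2, "Archive.today": 0, "Webcitation": 0},
    "URI-R2": {"Internet Archive": 2, "Archive.today": 0, "Webcitation": 0},
    "URI-R3": {"Internet Archive": 1, "Archive.today": 1, "Webcitation": 0},
    "URI-R4": {"Internet Archive": 1, "Archive.today": 1, "Webcitation": 0},
    "URI-R5": {"Internet Archive": 0, "Archive.today": 1, "Webcitation": 1},
}


def percent_of_uri_rs(counts):
    """Method 1: share of URI-Rs with at least one memento in each archive."""
    archives = {archive for row in counts.values() for archive in row}
    return {
        archive: 100.0
        * sum(1 for row in counts.values() if row.get(archive, 0) > 0)
        / len(counts)
        for archive in archives
    }


def percent_of_uri_ms(counts):
    """Method 2: share of all mementos (URI-Ms) held by each archive."""
    totals = {}
    for row in counts.values():
        for archive, n in row.items():
            totals[archive] = totals.get(archive, 0) + n
    grand_total = sum(totals.values())
    return {archive: 100.0 * n / grand_total for archive, n in totals.items()}
```

Method 1 can sum to more than 100% because one URI-R may be held by several archives, which is why the example totals 160%. Method 2 partitions the mementos, so it always sums to 100%.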
Presence in each archive
1- Percent of URI-Rs present in each archive
Archive                    Percent
Internet Archive           97.04%
Archive.today              6.58%
Webcitation                6.00%
Archive-It                 5.49%
British Library Archive    1.06%
UK Parliament Web Archive  0.88%
Icelandic Web Archive      0.87%
UK National Archives       0.62%
Proni                      0.21%
Stanford                   0.11%
Total                      118.86%
2- Percent of URI-Ms present in each archive
Archive                    Percent
Internet Archive           72.87%
Archive-It                 21.26%
Archive.today              2.14%
Webcitation                2.08%
Icelandic Web Archive      1.17%
British Library Archive    0.29%
UK Parliament Web Archive  0.10%
Proni                      0.05%
UK National Archives       0.04%
Stanford                   <0.01%
Total                      100%
Average archiving period (days)
Average archiving period = (LM-FM) / number of mementos
16,732 URIs have only one memento
Median=48 days
Values less than 1 indicate that the URI is archived multiple times per day
The larger the period, the more irregularly the URI was captured by the archives
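The formula can be sketched in Python as follows. The sketch assumes that LM and FM denote the datetimes of the last and first memento of a URI, and that a URI with a single memento yields a period of 0; the slides do not say how single-memento URIs were treated. The function name `average_archiving_period` is hypothetical.

```python
from datetime import datetime


def average_archiving_period(memento_datetimes):
    """Return (LM - FM) / number of mementos, expressed in days."""
    if not memento_datetimes:
        raise ValueError("at least one memento is required")
    first = min(memento_datetimes)  # FM: earliest capture (assumed meaning)
    last = max(memento_datetimes)   # LM: latest capture (assumed meaning)
    span_days = (last - first).total_seconds() / 86400
    return span_days / len(memento_datetimes)


# Three captures spread over 60 days.
sparse = [datetime(2015, 1, 1), datetime(2015, 1, 31), datetime(2015, 3, 2)]
# Four captures within a single day.
dense = [datetime(2015, 1, 1, hour) for hour in (0, 6, 12, 18)]
```

For `sparse` the period is 60 / 3 = 20 days, while `dense` gives a value below 1, matching the slide's note that values under 1 indicate several captures per day.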
Creation date for archived Arabic URIs
We used CarbonDate for creation date estimate
Source: http://ws-dl.blogspot.com/2014/11/2014-11-14-carbon-dating-web-version-20.html
2013 is the most frequent year
(Figure annotation: 18 years)
Top GeoIP locations
Country               Percent
United States         57.97%
Arabic Countries      10.53%
Germany               9.75%
Netherlands           5.29%
France                4.37%
Canada                3.31%
United Kingdom        3.07%
Other                 5.71%
Arabic country        Percent
Saudi Arabia          4.75%
Egypt                 1.97%
Jordan                1.42%
Kuwait                0.71%
United Arab Emirates  0.67%
Status of Arabic seed URIs
Seed Data Set
(Live, Indexed, Archived)  Percent
(1, 1, 1)                  43.34%   (Good) discovered and saved
(1, 1, 0)                  25.59%
(1, 0, 1)                  15.27%
(1, 0, 0)                  15.76%   (Bad) undiscovered and not saved
31% were not indexed by Google
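A tally like the one in the status table can be produced from per-URI flags. The sketch below is illustrative rather than the authors' tooling: `status_percentages` and `percent_not_indexed` are hypothetical names, and `SLIDE_PERCENTAGES` simply restates the numbers from the table.

```python
from collections import Counter


def status_percentages(records):
    """Tally (live, indexed, archived) flags and return percentages."""
    counts = Counter(
        (int(live), int(indexed), int(archived))
        for live, indexed, archived in records
    )
    total = sum(counts.values())
    return {status: 100.0 * n / total for status, n in counts.items()}


def percent_not_indexed(percentages):
    """Sum the shares of every status whose 'indexed' flag is 0."""
    return sum(
        share
        for (_live, indexed, _archived), share in percentages.items()
        if indexed == 0
    )


# Percentages as reported in the status table.
SLIDE_PERCENTAGES = {
    (1, 1, 1): 43.34,
    (1, 1, 0): 25.59,
    (1, 0, 1): 15.27,
    (1, 0, 0): 15.76,
}
```

Applying `percent_not_indexed` to `SLIDE_PERCENTAGES` adds the (1, 0, 1) and (1, 0, 0) rows, which is where the slide's figure of roughly 31% not indexed by Google comes from.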
Difference between creation date and first memento
18% have creation dates over 1 year before the first memento was archived
19.48% of the URIs have an estimated creation date that is the same as first memento date
Seed Data Set
Directory  Arabic   Archived  Indexed
DMOZ       34.43%   95.52%    82.13%
Raddadi    19.88%   45.44%    65.83%
Star28     45.69%   41.54%    65.23%
DMOZ URIs are more likely to be found and archived
Full Data Set
          Total    Archived   Category   Total    Archived
Arabic    33.18%   33.56%     AR ccTLD   14.84%   28.09%
                              AR GeoIP   10.53%   13.11%
                              AR both    7.81%    59.50%
Neither   66.82%   65.22%     Neither    66.82%   65.22%
URIs hosted in Western countries are more likely to be archived
Seed Data Set
          Total    Indexed    Category   Total    Indexed
Arabic    15.01%   78.29%     AR ccTLD   6.61%    76.09%
                              AR GeoIP   2.37%    73.54%
                              AR both    6.03%    85.24%
Neither   84.99%   65.22%     Neither    84.99%   67.09%
URIs that had some Arabic location had a higher indexing rate
The spread of mementos was not affected by location or ccTLD
-  Kolmogorov-Smirnov test
Category   Mean
Ar GeoIP   0.5010
Ar ccTLD   0.5013
Both       0.5016
Neither    0.5005
Category              D-Value  P-Value
Ar ccTLD vs. neither  0.017    <0.002
Ar GeoIP vs. neither  0.014    <0.002
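The D-value reported by a two-sample Kolmogorov-Smirnov test is the largest gap between the empirical distribution functions of the two samples. The pure-Python sketch below computes only that statistic, for whatever per-URI values are being compared; it does not compute the p-value, and the function name `ks_statistic` is hypothetical rather than the authors' code.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: max gap between the empirical CDFs."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # The gap can only change at observed values, so checking those suffices.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

A D-value near 0, such as the 0.017 and 0.014 in the table, means the two distributions are almost indistinguishable in shape. If SciPy is available, `scipy.stats.ks_2samp` returns both the statistic and a p-value.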
Just because a webpage is older does not mean that it is archived more
Because of low historical archiving rates
We look at the last three years
In the last three years, the older the resource is, the more mementos it has
             Full Data Set       Seed Data Set
Path Depth   Total    Archived   Total    Indexed
0            17.30%   86.29%     86.05%   74.60%
1            40.42%   53.49%     9.77%    38.91%
2            24.45%   45.57%     3.72%    17.85%
3+           17.83%   34.24%     0.50%    57.50%
Top level URIs are more likely to be archived and indexed
Summary of collection methods
•  Collected URIs from three Arabic directories (7,976):
   -  DMOZ
   -  Raddadi.com
   -  Star28.com
•  Crawl seed dataset (1,299,671)
•  Check if they are unique (663,443)
•  Check if they are live (482,905)
•  Check for Arabic Language (300,646)
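The counts in the collection summary form a filtering funnel, and the share surviving each step follows directly from them. This is a small illustrative calculation: the stage counts are copied from the slide, while the stage labels and the function name `retention` are hypothetical.

```python
# Stage counts as reported in the collection summary.
STAGES = [
    ("crawled", 1_299_671),
    ("unique", 663_443),
    ("live", 482_905),
    ("Arabic", 300_646),
]


def retention(stages):
    """Return (stage, count, % of previous stage, % of first stage) rows."""
    first = stages[0][1]
    previous = first
    rows = []
    for name, count in stages:
        rows.append((name, count, 100.0 * count / previous, 100.0 * count / first))
        previous = count
    return rows


for name, count, of_previous, of_first in retention(STAGES):
    print(f"{name:8s} {count:>9,d}  {of_previous:6.2f}% of previous  {of_first:6.2f}% of crawl")
```

Roughly half of the crawled URIs were unique, and a little under a quarter of the crawl remained after the liveness and Arabic-language checks.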
Findings
•  Our Arabic language dataset was not largely located in Arabic countries
   -  Only 14.84% had an Arabic ccTLD
   -  Only 10.53% had a GeoIP in an Arabic country
   -  Popular Western domains (e.g., cnn.com, wikipedia.org) appeared in the top 10
•  Arabic webpages are not particularly well archived or indexed
   -  46% were not archived
   -  31% were not indexed by Google
•  An Arabic webpage is more likely to be...
   -  indexed if it is present in a directory
   -  archived if it is present in DMOZ
   -  archived if it has neither Arabic GeoIP nor Arabic ccTLD
For right now, if you want your Arabic language webpage to be archived, host it outside of an Arabic country and get it listed in DMOZ
Backup Slides
GeoIP Location
•  We obtained the IP addresses of the hostnames using nslookup (which uses DNS to convert the hostname to its IP address)
•  We used the MaxMind GeoLite2 database to determine location from the IP address (which tests at 99.8% accuracy at the country level)
http://dev.maxmind.com/geoip/geoip2/geolite2/
http://dev.maxmind.com/faq/how-accurate-are-the-geoip-databases/
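The lookup described above, combined with the ccTLD check used in the earlier tables, could be sketched as follows. Several things here are assumptions rather than the authors' code: the set `ARABIC_COUNTRY_CODES` is an illustrative list of ISO country codes, `country_of_ip` is a caller-supplied function standing in for the GeoLite2 lookup, and `socket.gethostbyname` stands in for nslookup.

```python
import socket

# Illustrative (assumed) set of ISO 3166 codes for Arabic countries; for
# these countries the ccTLD is the same two letters as the ISO code.
ARABIC_COUNTRY_CODES = {
    "DZ", "BH", "KM", "DJ", "EG", "IQ", "JO", "KW", "LB", "LY", "MR",
    "MA", "OM", "QA", "SA", "SO", "SD", "SY", "TN", "AE", "PS", "YE",
}


def has_arabic_cctld(hostname: str) -> bool:
    """True when the last label of the hostname is an Arabic ccTLD."""
    tld = hostname.rstrip(".").rsplit(".", 1)[-1]
    return tld.upper() in ARABIC_COUNTRY_CODES


def location_category(hostname, country_of_ip, resolve=socket.gethostbyname):
    """Classify a hostname as AR both, AR ccTLD, AR GeoIP, or Neither.

    country_of_ip: hypothetical callable mapping an IP string to an ISO code.
    resolve: DNS lookup; defaults to the standard library resolver.
    """
    cctld = has_arabic_cctld(hostname)
    geoip = country_of_ip(resolve(hostname)) in ARABIC_COUNTRY_CODES
    if cctld and geoip:
        return "AR both"
    if cctld:
        return "AR ccTLD"
    if geoip:
        return "AR GeoIP"
    return "Neither"
```

With MaxMind's `geoip2` Python package installed and a GeoLite2 country database on disk, `country_of_ip` could wrap `geoip2.database.Reader(path).country(ip).country.iso_code`. Passing stub functions for `country_of_ip` and `resolve` keeps the classification testable without network access.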

JCDL2015: How Well are Arabic Websites Archived?

  • 1. How Well Are Arabic Websites Archived? Lulwah M. Alkwai, Michael L. Nelson, and Michele C. Weigle Old Dominion University Department of Computer Science Norfolk, Virginia 23529 USA JCDL 2015 Knoxville, TN June 21-25, 2015
  • 2. Archived events on English sites vs. Arabic sites 2
  • 4. http://www.foxnews.com/us/2015/05/26/2-shot-dead-in- bloody-memorial-day-weekend-in-baltimore-capping-off- deadliest/ Search: Baltimore (one week old) Search: Yemen Houthis (one week old) http://www.yemenakhbar.com/yemen-news/178683.html Archived events on English sites vs. Arabic sites 4
  • 5. Search: Baltimore (one week old) http://www.foxnews.com/us/2015/05/26/2-shot-dead-in- bloody-memorial-day-weekend-in-baltimore-capping-off- deadliest/ Search: Yemen Houthis (one week old) Archived events on English sites vs. Arabic sites 5 http://www.yemenakhbar.com/yemen-news/178683.html
  • 6. Search: Baltimore (one week old) http://www.foxnews.com/us/2015/05/26/2-shot-dead-in- bloody-memorial-day-weekend-in-baltimore-capping-off- deadliest/ Search: Yemen Houthis (one week old) Archived events on English sites vs. Arabic sites 6 http://www.yemenakhbar.com/yemen-news/178683.html
  • 7. Search: Baltimore (one week old) http://www.foxnews.com/us/2015/05/26/2-shot-dead-in- bloody-memorial-day-weekend-in-baltimore-capping-off- deadliest/ Search: Yemen Houthis (one week old) Archived events on English sites vs. Arabic sites 7 http://www.yemenakhbar.com/yemen-news/178683.html
  • 8. English sports websites are more archived than Arabic www.espn.go.com www.kooora.com 8
  • 9. English e-Marketing websites are more archived than Arabic www.amazon.com www.haraj.com.sa 9
  • 10. English encyclopedia websites are more archived than Arabic en.wikipedia.org ar.wikipedia.org 10
  • 11. Top ten languages in the Internet World Language Map Source: Quick Maps of the World immigration - http://www.allcountries.org/maps/world_language_maps.html                                                                                                                                                                                                 Source: Internet World Stats - http://www.internetworldstats.com/stats7.htm 11
  • 12-14. Arabic speaking Internet users (Source: http://www.internetworldstats.com/stats19.htm)
      No. Country | 2009 Population | 2009 Internet Users | 2009 Penetration | 2013 Population | 2013 Internet Users | 2013 Penetration
      1 Algeria | 34,178,188 | 4,100,000 | 12.00% | 38,813,722 | 6,404,264 | 16.50%
      2 Bahrain | 728,709 | 402,900 | 55.30% | 1,314,089 | 1,182,680 | 90.00%
      3 Comoros | 752,438 | 23,000 | 3.10% | 766,865 | 49,846 | 6.50%
      4 Djibouti | 724,622 | 19,200 | 2.60% | 810,179 | 76,967 | 9.50%
      5 Egypt | 78,866,635 | 16,636,000 | 21.10% | 86,895,099 | 43,065,211 | 49.60%
      6 Iraq | 28,945,569 | 300,000 | 1.00% | 32,585,692 | 2,997,884 | 9.20%
      7 Jordan | 6,269,285 | 1,595,200 | 25.40% | 6,528,061 | 2,885,403 | 44.20%
      8 Kuwait | 2,692,526 | 1,000,000 | 37.10% | 2,742,711 | 2,069,650 | 75.50%
      9 Lebanon | 4,017,095 | 945,000 | 23.50% | 4,136,895 | 2,916,511 | 70.50%
      10 Libya | 6,324,357 | 323,000 | 5.10% | 6,244,174 | 1,030,289 | 16.50%
      11 Mauritania | 73,129,486 | 60,000 | 1.90% | 3,516,806 | 218,042 | 6.20%
      12 Morocco | 31,285,174 | 10,442,500 | 33.40% | 32,987,206 | 18,472,835 | 56.00%
      13 Oman | 3,418,085 | 557,000 | 16.30% | 3,219,775 | 2,139,540 | 66.40%
      14 Qatar | 833,285 | 436,000 | 52.30% | 2,123,160 | 1,811,055 | 85.30%
      15 Saudi Arabia | 28,686,633 | 7,761,800 | 27.10% | 27,345,986 | 16,544,322 | 60.50%
      16 Somalia | 9,832,017 | 102,000 | 1.00% | 10,428,043 | 156,420 | 1.50%
      17 South Sudan | - | - | - | 11,562,695 | 100 | 0.00%
      18 Sudan | 41,087,825 | 4,200,000 | 10.20% | 35,482,233 | 8,054,467 | 22.70%
      19 Syria | 21,762,978 | 3,565,000 | 16.40% | 22,597,531 | 5,920,553 | 26.20%
      20 Tunisia | 10,486,339 | 3,500,000 | 33.40% | 10,937,521 | 4,790,634 | 43.80%
      21 UAE | 4,798,491 | 3,558,000 | 74.10% | 9,206,000 | 8,101,280 | 88.00%
      22 Palestine | 2,461,267 | 355,500 | 14.40% | 2,731,052 | 1,512,273 | 55.40%
      23 Yemen | 22,858,238 | 370,000 | 1.60% | 26,052,966 | 5,210,593 | 20.00%
      Arabic Total | 344,139,242 | 60,252,100 | 17.50% | 379,028,461 | 135,610,819 | 35.8%
      World Total | 6,767,805,208 | 1,802,330,457 | 26.6% | 7,181,858,619 | 2,802,478,934 | 39.0%
      - 2009: Arabic Total = 17.5%, World Total = 26.6%
      - 2013: Arabic Total = 35.8%, World Total = 39.0%
  • 15. Why are we doing this?
      - The number of Arabic speaking Internet users has grown rapidly
      - There has been previous work on the coverage of web archives
      - Little has been done in terms of Arabic language content
  • 16. How Much of the Web Is Archived?
      - Sampled URIs from four different sources (DMOZ, Delicious, Bitly, search engine indexes)
      - The archival percentages ranged from 16% to 79%
      - A 2013 follow-on study found that archival percentages had increased, ranging from 33% to 95%
      - These studies did not focus on content from specific countries or content in specific languages
  • 17. A fair history of the Web? Examining country balance in the Internet Archive
      - Examined country balance in the Internet Archive:
        Country | Domain | Archived
        US | .com | 92%
        Taiwan | .com.tw | 73%
        China | .com.cn | 58%
        Singapore | .com.sg | 73%
      - This work focused on TLD rather than content language or location
  • 18. Characterization of National Web Domains
      - Used 10 national web domains
        - 120 million pages
        - 24 countries
        - They studied page sizes, degrees, link-based scores, etc.
        - They found that depth and response codes were similar
      - In this work, additional methods are required to determine if a site belongs to a particular country
  • 19. Characterizing a National Community Web
      - Used a Portuguese dataset:
        - the (.pt) ccTLD
        - (.com, .net, .org, .tv) sites in the Portuguese language that have at least one incoming link from the (.pt) ccTLD
      - They identify, collect, and characterize the Portuguese Web
  • 20-23. How do we classify Arabic websites? Four categories: GeoIP only, ccTLD only, Both, Neither
      - GeoIP only. News: al-watan.com; ccTLD: not Arabic (.com); GeoIP: Arabic country (Qatar)
      - ccTLD only. E-Marketing: haraj.com.sa; ccTLD: Arabic (.sa); GeoIP: not an Arabic country (Ireland)
      - Both. Educational: uoh.edu.sa; ccTLD: Arabic (.sa); GeoIP: Arabic country (SA)
      - Neither. News: alarabiya.net; ccTLD: not Arabic (.net); GeoIP: not an Arabic country (US)
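The four-way classification above can be sketched in a few lines of Python. This is only an illustration: the sets `ARABIC_CCTLDS` and `ARABIC_COUNTRIES` are hypothetical, abbreviated lists, and the GeoIP country code is assumed to have been looked up already by some external service.

```python
# Hypothetical, abbreviated lists for illustration; the slides do not
# enumerate the exact ccTLDs or country codes that were used.
ARABIC_CCTLDS = {"sa", "eg", "jo", "ae", "kw", "qa"}
ARABIC_COUNTRIES = {"SA", "EG", "JO", "AE", "KW", "QA"}


def classify(hostname: str, geoip_country: str) -> str:
    """Return 'Both', 'ccTLD only', 'GeoIP only', or 'Neither'.

    geoip_country is the ISO country code where the host's IP address
    is located (assumed to come from a separate GeoIP lookup).
    """
    # The last dot-separated label of the hostname is its TLD.
    tld = hostname.lower().rstrip(".").rsplit(".", 1)[-1]
    has_cctld = tld in ARABIC_CCTLDS
    has_geoip = geoip_country.upper() in ARABIC_COUNTRIES
    if has_cctld and has_geoip:
        return "Both"
    if has_cctld:
        return "ccTLD only"
    if has_geoip:
        return "GeoIP only"
    return "Neither"


for host, country in [("al-watan.com", "QA"), ("haraj.com.sa", "IE"),
                      ("uoh.edu.sa", "SA"), ("alarabiya.net", "US")]:
    print(host, "->", classify(host, country))
```

The four hosts are the examples from the slides; the country codes stand in for the GeoIP results reported there.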
  • 24. Selecting seed URIs
      Name | Registered | Year | URI | Count
      DMOZ | US | 1999 | Dmoz.org/world/arabic | 4,086
      Raddadi | Saudi Arabia | 2000 | Raddadi.com | 3,271
      Star28 | Lebanon | 2004 | Star28.com | 8,386
      Total | | | | 15,743
      - 15,092 unique seed URIs
      - 11,014 URIs that existed in the live web
  • 25. Determining a webpage language
      - HTTP header Content-Language
      - HTML title tag language
      - Trigram method
      - Language detection API client
  • 26-27. HTTP header Content-Language example#1
      > curl -I www.alquds.com
      HTTP/1.1 200 OK
      Server: nginx/1.6.2
      Date: Wed, 03 Jun 2015 19:11:31 GMT
      Content-Type: text/html; charset=utf-8
      Connection: keep-alive
      X-Powered-By: PHP/5.3.3
      X-Drupal-Cache: HIT
      Etag: "1433361507-0"
      Content-Language: ar
      …
  • 28-30. HTTP header Content-Language example#2
      > curl -I www.raddadi.com
      HTTP/1.1 200 OK
      Server: nginx/1.8.0
      Date: Sat, 06 Jun 2015 22:47:09 GMT
      Content-Type: text/html
      Connection: keep-alive
      …
      > curl www.raddadi.com
      <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
      <html dir="rtl" xmlns="http://www.w3.org/1999/xhtml" xml:lang="ar" lang="ar" >
      <head>
      No Content-Language header appears in the response shown, but the page's <html> tag declares lang="ar".
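The two examples above suggest a simple check: read Content-Language from the response headers, and fall back to the lang attribute of the <html> tag when the header is absent. The sketch below works on headers and HTML that have already been fetched; the function name `declared_language` and the fallback order are assumptions for illustration, not the authors' code.

```python
import re


def declared_language(headers: dict, html: str):
    """Return the declared language tag, or None if nothing is declared."""
    # HTTP header names are case-insensitive, so compare in lower case.
    for name, value in headers.items():
        if name.lower() == "content-language" and value.strip():
            # The header may list several tags ("ar, en"); take the first.
            return value.split(",")[0].strip().lower()
    # Fallback: the lang (or xml:lang) attribute on the <html> element.
    match = re.search(r'<html\b[^>]*?\blang\s*=\s*["\']?([A-Za-z-]+)',
                      html, re.IGNORECASE)
    return match.group(1).lower() if match else None


print(declared_language({"Content-Language": "ar"}, ""))
print(declared_language({"Content-Type": "text/html"},
                        '<html dir="rtl" xml:lang="ar" lang="ar" >'))
```

Both calls mirror the slides: the first case has the header, the second relies on the markup.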
  • 31-32. HTML title tag language (https://code.google.com/p/guess-language/)
      > curl www.star28.com
      …
      <META name="Copyright" content="© 2011 www.star28.com">
      <META name="DISTRIBUTION" content="GLOBAL">
      <META name="REVISIT-AFTER" content="1 DAYS">
      <TITLE> دليل العرب الشامل </TITLE>
      <META name="description" content=" دليل للمواقع العربية و أفضل المواقع العالمية, يحدث باستمرار ">
      <META name="keywords" content=" دليل مواقع, تجارة, العاب, جافا سكربت, رياضة, منتديات, علوم, كومبيوتر, اسلام, اخبار, صحف, تلفزيون, سياحة, تعليم, زواج, توظيف ">
      …
      We then use the guess-language Python library to determine the language of the title.
  • 33-35. HTML title tag language example#1 (https://code.google.com/p/guess-language/)
      > curl -s www.gulfup.com | grep -io "<title>[^<]*" | tail -c+8 > gulfup_title.txt
      > python
      >>> myfile = open("gulfup_title.txt", "r")
      >>> data = myfile.read()
      >>> from guess_language import guess_language
      >>> guess_language(data)
      'ar'
  • 36-38. HTML title tag language example#2 (https://code.google.com/p/guess-language/)
      > curl -s www.cnn.com | grep -io "<title>[^<]*" | tail -c+8 > cnn_title.txt
      > python
      >>> myfile = open("cnn_title.txt", "r")
      >>> data = myfile.read()
      >>> from guess_language import guess_language
      >>> guess_language(data)
      'en'
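The grep and tail pipeline above pulls the title text out of the page before it is handed to the language guesser. As an alternative sketch, the same extraction can be done with Python's standard html.parser module; the class name `TitleParser` and the helper `extract_title` are made up for this illustration, and the language-guessing step itself is left out.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so <TITLE> also matches.
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_title(html: str) -> str:
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip()


print(extract_title("<head><TITLE> دليل العرب الشامل </TITLE></head>"))
```

The returned string could then be passed to a detector such as guess_language, as in the slides.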
  • 39. Trigram method (https://github.com/decultured/Python-Language-Detector)
      - Built in C++ and wrapped as a Python module
      - Identification is performed through basic trigram lookups paired with Unicode character set recognition
      - Accuracy is high even for short sample texts
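To make the idea of trigram lookups paired with character set recognition concrete, here is a toy sketch. It is not the Python-Language-Detector implementation: the function names, the tiny training sentences, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from collections import Counter


def trigrams(text: str) -> Counter:
    """Count overlapping character trigrams after whitespace normalisation."""
    t = " ".join(text.lower().split())
    return Counter(t[i:i + 3] for i in range(len(t) - 2))


def arabic_ratio(text: str) -> float:
    """Share of letters that fall in the basic Arabic Unicode block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0600" <= c <= "\u06FF" for c in letters) / len(letters)


def identify(text: str, profiles: dict) -> str:
    # Character set recognition: mostly Arabic script means "ar".
    if arabic_ratio(text) > 0.5:  # threshold is an arbitrary assumption
        return "ar"
    # Trigram lookup: pick the profile sharing the most trigrams.
    sample = trigrams(text)

    def score(profile: Counter) -> int:
        return sum(min(n, profile[g]) for g, n in sample.items())

    return max(profiles, key=lambda lang: score(profiles[lang]))


# Toy profiles built from one sentence each; real profiles use large corpora.
profiles = {
    "en": trigrams("the quick brown fox jumps over the lazy dog and the cat"),
    "fr": trigrams("le renard brun saute par dessus le chien paresseux et le chat"),
}
print(identify("the dog and the fox", profiles))
print(identify("دليل العرب الشامل", profiles))
```

A script check alone cannot separate Arabic from other languages written in Arabic script, such as Persian or Urdu, which is one reason real detectors combine it with trigram profiles.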
  • 40-42. Trigram method example#1 (https://github.com/decultured/Python-Language-Detector)
      > curl www.raddadi.com > raddadi.txt
      > python
      >>> from bs4 import BeautifulSoup
      >>> soup = BeautifulSoup(open("raddadi.txt"))
      >>> for script in soup(["script", "style"]):  # drop script and style elements
      ...     script.extract()
      >>> text = soup.get_text()
      >>> lines = (line.strip() for line in text.splitlines())
      >>> chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
      >>> text = '\n'.join(chunk for chunk in chunks if chunk)
      >>> import sys
      >>> sys.path.append('languageDetector')
      >>> import languageIdentifier
      >>> languageIdentifier.load("languageDetector/trigrams/")
      >>> print(languageIdentifier.identify(text, 300, 300))
      ar
  • 43-45. Trigram method example#2 (https://github.com/decultured/Python-Language-Detector)
      > curl www.cnn.com > cnn.txt
      > python
      >>> from bs4 import BeautifulSoup
      >>> soup = BeautifulSoup(open("cnn.txt"))
      >>> for script in soup(["script", "style"]):  # drop script and style elements
      ...     script.extract()
      >>> text = soup.get_text()
      >>> lines = (line.strip() for line in text.splitlines())
      >>> chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
      >>> text = '\n'.join(chunk for chunk in chunks if chunk)
      >>> import sys
      >>> sys.path.append('languageDetector')
      >>> import languageIdentifier
      >>> languageIdentifier.load("languageDetector/trigrams/")
      >>> print(languageIdentifier.identify(text, 300, 300))
      en
  • 46-47. Language detection API client (https://detectlanguage.com)
      - Returns detected language codes and scores
      - You need to set up a personal API key (http://detectlanguage.com)
      - Example of output:
        {"data":{"detections":[{"language":"ar","isReliable":true,"confidence":9.54}]}}
      - "language": the language code
      - "isReliable": false means that the confidence is low
      - "confidence": depends on how much text you pass and how well it is identified
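A response in the JSON shape shown above can be reduced to a single language code by keeping only the reliable detections and taking the one with the highest confidence. The helper name `best_detection` and that selection rule are assumptions for illustration; the slides do not say how multiple detections were resolved.

```python
import json


def best_detection(response_json: str):
    """Return the most confident reliable language code, or None."""
    detections = json.loads(response_json)["data"]["detections"]
    reliable = [d for d in detections if d.get("isReliable")]
    if not reliable:
        return None
    return max(reliable, key=lambda d: d["confidence"])["language"]


sample = ('{"data":{"detections":['
          '{"language":"ar","isReliable":true,"confidence":8.32},'
          '{"language":"tk","isReliable":false,"confidence":0.01}]}}')
print(best_detection(sample))
```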
  • 48-50. Language detection API client example#1 (https://detectlanguage.com)
      > curl www.raddadi.com > raddadi.txt
      > python
      >>> from bs4 import BeautifulSoup
      >>> soup = BeautifulSoup(open("raddadi.txt"))
      >>> for script in soup(["script", "style"]):  # drop script and style elements
      ...     script.extract()
      >>> text = soup.get_text()
      >>> lines = (line.strip() for line in text.splitlines())
      >>> chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
      >>> text = '\n'.join(chunk for chunk in chunks if chunk)
      >>> import detectlanguage
      >>> detectlanguage.configuration.api_key = "YOUR API KEY"
      >>> detectlanguage.detect(text)
      {"data":{"detections":[{"language":"ar","isReliable":true,"confidence":8.32},{"language":"tk","isReliable":false,"confidence":0.01}]}}
  • 51-53. Language detection API client example#2 (https://detectlanguage.com)
      > curl www.cnn.com > cnn.txt
      > python
      >>> from bs4 import BeautifulSoup
      >>> soup = BeautifulSoup(open("cnn.txt"))
      >>> for script in soup(["script", "style"]):  # drop script and style elements
      ...     script.extract()
      >>> text = soup.get_text()
      >>> lines = (line.strip() for line in text.splitlines())
      >>> chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
      >>> text = '\n'.join(chunk for chunk in chunks if chunk)
      >>> import detectlanguage
      >>> detectlanguage.configuration.api_key = "YOUR API KEY"
      >>> detectlanguage.detect(text)
      {"data":{"detections":[{"language":"en","isReliable":true,"confidence":6.14}]}}
  • 54-59. Language test intersection testing for Arabic language
      [Diagram of the four language tests; labeled shares: ~41%, ~38%, ~36%, ~39%]
      - Total Arabic = 7,976
  • 60. Crawling Arabic seed URIs. Unique: 663,443
  • 63. Crawling Arabic seed URIs. Total Arabic URIs Dataset = (7,976 + 292,670) = 300,646
  • 64-67. 17,536 unique domains
      Rank | Domain | URIs | GeoIP | Category
      1 | Alarab.net | 284 | US | News
      2 | Aljarida.com | 248 | US | News
      3 | Arabic.cnn.com | 245 | US | News
      4 | Alarabiya.net | 231 | US | News
      5 | Ar.wikipedia.org | 230 | US | Encyclopedia
      6 | Aljazeera.net | 213 | US | News
      7 | Moheet.com | 142 | US | News
      8 | Facebook.com | 133 | US | Social
      9 | Al-sharq.com | 132 | US | Middle East Portal
      10 | Lakii.com | 123 | US | General Portal
      17 | Kuwaitclub.com.kw | 71 | Kuwait | Sport
      - The first Arabic GeoIP location is at rank 17
      - 6 out of the 10 top unique domains are news websites
      - Popular Western pages are in the top unique domains
  • 68-70. TLD distribution
      TLD | Percent
      com | 57.97%
      net | 15.07%
      org | 6.40%
      gov.sa | 1.94%
      info | 1.68%
      edu.sa | 1.27%
      ws | 1.16%
      org.sa | 0.97%
      com.sa | 0.80%
      gov.eg | 0.80%
      Other | 11.94%
      - Almost 58% are .com
      - Small percentage of Arabic TLD
  • 71-72. Small percentage of Arabic TLD
      TLD | Country | Percent
      .sa | Saudi Arabia | 5.33%
      .eg | Egypt | 2.00%
      .jo | Jordan | 2.00%
      .ae | United Arab Emirates | 1.06%
      .kw | Kuwait | 0.82%
  • 73-74. More than 57% are of depth 0 and 1
      Path Depth | Example | Percent
      0 | Example.com | 17.30%
      1 | Example.com/a | 40.42%
      2 | Example.com/a/b | 24.45%
      3 | Example.com/a/b/c | 10.81%
      4+ | Example.com/a/b/c/d | 7.02%
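Path depth as used in the table above is the number of non-empty path segments after the host name. A minimal sketch with the standard library follows; the function name `path_depth` and the choice to ignore trailing slashes are assumptions.

```python
from urllib.parse import urlsplit


def path_depth(uri: str) -> int:
    """Count non-empty path segments, e.g. Example.com/a/b -> 2."""
    # urlsplit only recognises the host when a scheme is present.
    if "://" not in uri:
        uri = "http://" + uri
    return len([seg for seg in urlsplit(uri).path.split("/") if seg])


for uri in ["Example.com", "Example.com/a", "Example.com/a/b",
            "Example.com/a/b/c", "Example.com/a/b/c/d"]:
    print(uri, path_depth(uri))
```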
  • 75. 53.77% of Arabic URIs are archived
      - January-March 2015
      - ODU CS Memento Aggregator
      - Median=16
  • 76-77. Most of the top archived URI-Rs are news websites
      URI-R | Mementos | Category
      gulfup.com | 10,987 | File Sharing
      masrawy.com | 9,144 | Egyptian portal
      arabic.cnn.com | 9,022 | News
      aljazeera.net | 8,906 | News
      maktoob.yahoo.com | 8,478 | Search Engine
      shorooknews.com | 7,548 | News
      arabnews.com | 6,274 | News
      bbc.co.uk/arabic | 6,268 | News
      ahram.org.eg | 5,347 | News
      google.com.sa | 4,968 | Search Engine
  • 78. Archiving has accelerated since 2011
  • 80. Two methods to determine the presence in each archive
      1. Percent of URI-Rs present in each archive
         e.g. http://aljazeera.net
      2. Percent of URI-Ms present in each archive
         e.g. http://wayback.archive-it.org/all/20070727215420/http://www.aljazeera.net/
         e.g. http://web.archive.org/web/20150618104846/http://aljazeera.net/
  • 81-83. Presence in each archive example
      URI-R | Internet Archive | Archive.today | Webcitation | Total
      URI-R1 | 2 | 0 | 0 | 2
      URI-R2 | 2 | 0 | 0 | 2
      URI-R3 | 1 | 1 | 0 | 2
      URI-R4 | 1 | 1 | 0 | 2
      URI-R5 | 0 | 1 | 1 | 2
      Total | 6 | 3 | 1 | 10
      1- Percent of URI-Rs present in each archive
      Archive | Total | Percentage
      Internet Archive | 4/5=0.8 | 80%
      Archive.today | 3/5=0.6 | 60%
      Webcitation | 1/5=0.2 | 20%
      Total | | 160%
      2- Percent of URI-Ms present in each archive
      Archive | Total | Percentage
      Internet Archive | 6/10=0.6 | 60%
      Archive.today | 3/10=0.3 | 30%
      Webcitation | 1/10=0.1 | 10%
      Total | | 100%
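The two calculations in the example above can be reproduced with a short script. The dictionary `counts` holds the memento counts from the example table; the function names are made up for this sketch.

```python
ARCHIVES = ["Internet Archive", "Archive.today", "Webcitation"]

# Memento counts per URI-R and archive, copied from the example table.
counts = {
    "URI-R1": {"Internet Archive": 2, "Archive.today": 0, "Webcitation": 0},
    "URI-R2": {"Internet Archive": 2, "Archive.today": 0, "Webcitation": 0},
    "URI-R3": {"Internet Archive": 1, "Archive.today": 1, "Webcitation": 0},
    "URI-R4": {"Internet Archive": 1, "Archive.today": 1, "Webcitation": 0},
    "URI-R5": {"Internet Archive": 0, "Archive.today": 1, "Webcitation": 1},
}


def urir_presence(counts):
    """Method 1: fraction of URI-Rs with at least one memento per archive."""
    return {a: sum(1 for row in counts.values() if row[a] > 0) / len(counts)
            for a in ARCHIVES}


def urim_share(counts):
    """Method 2: fraction of all mementos (URI-Ms) held by each archive."""
    total = sum(sum(row.values()) for row in counts.values())
    return {a: sum(row[a] for row in counts.values()) / total
            for a in ARCHIVES}


print(urir_presence(counts))
print(urim_share(counts))
```

Method 1 can sum to more than 100% because one URI-R may be held by several archives, while method 2 always sums to 100%.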
  • 84-86. Presence in each archive
      1- Percent of URI-Rs present in each archive
      Archive | Percent
      Internet Archive | 97.04%
      Archive.today | 6.58%
      Webcitation | 6.00%
      Archive-It | 5.49%
      British Library Archive | 1.06%
      UK Parliament Web Archive | 0.88%
      Icelandic Web Archive | 0.87%
      UK National Archives | 0.62%
      Proni | 0.21%
      Stanford | 0.11%
      Total | 118.86%
      2- Percent of URI-Ms present in each archive
      Archive | Percent
      Internet Archive | 72.87%
      Archive-It | 21.26%
      Archive.today | 2.14%
      Webcitation | 2.08%
      Icelandic Web Archive | 1.17%
      British Library Archive | 0.29%
      UK Parliament Web Archive | 0.10%
      Proni | 0.05%
      UK National Archives | 0.04%
      Stanford | <0.01%
      Total | 100%
  • 87-88. Average archiving period (days)
      - Average archiving period = (LM-FM) / number of mementos, where FM and LM are the dates of the first and last mementos of a URI
      - 16,732 URIs have only one memento
      - Median=48 days
      - Values less than 1 indicate that the URI is archived multiple times per day
      - The larger the period, the more irregularly the URI was captured by the archives
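The formula above can be written out directly. In this sketch the function name `average_archiving_period` and the sample dates are invented, and returning None for an empty list is an assumption; a URI with a single memento gets a period of 0 because LM equals FM.

```python
from datetime import datetime


def average_archiving_period(memento_datetimes):
    """(LM - FM) / number of mementos, expressed in days."""
    if not memento_datetimes:
        return None  # assumption: a URI with no mementos has no period
    first = min(memento_datetimes)   # FM
    last = max(memento_datetimes)    # LM
    span_days = (last - first).total_seconds() / 86400
    return span_days / len(memento_datetimes)


# Invented capture dates: four mementos spread over 40 days.
mementos = [datetime(2015, 1, 1), datetime(2015, 1, 11),
            datetime(2015, 1, 21), datetime(2015, 2, 10)]
print(average_archiving_period(mementos))
```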
  • 89-91. Creation date for archived Arabic URIs
      - We used CarbonDate for the creation date estimate (Source: http://ws-dl.blogspot.com/2014/11/2014-11-14-carbon-dating-web-version-20.html)
      - 18 years
      - 2013 is the most frequent year
  • 92-95. Top GeoIP locations
      Location | Percent
      United States | 57.97%
      Arabic Countries | 10.53%
      Germany | 9.75%
      Netherlands | 5.29%
      France | 4.37%
      Canada | 3.31%
      United Kingdom | 3.07%
      Other | 5.71%
      Top Arabic countries:
      Location | Percent
      Saudi Arabia | 4.75%
      Egypt | 1.97%
      Jordan | 1.42%
      Kuwait | 0.71%
      United Arab Emirates | 0.67%
  • 96-99. Status of Arabic seed URIs
      Seed Data Set (Live, Indexed, Archived) | Percent
      (1, 1, 1) | 43.34% | (Good) discovered and saved
      (1, 1, 0) | 25.59%
      (1, 0, 1) | 15.27%
      (1, 0, 0) | 15.76% | (Bad) undiscovered and not saved
      - 31% were not indexed by Google
  • 100. Difference between creation date and first memento: 18% of the URIs have creation dates more than 1 year before the first memento was archived; 19.48% of the URIs have an estimated creation date that is the same as the first memento date.
  • 101. DMOZ URIs are more likely to be found and archived. Seed data set (Arabic, Archived, Indexed): DMOZ 34.43%, 95.52%, 82.13%; Raddadi 19.88%, 45.44%, 65.83%; Star28 45.69%, 41.54%, 65.23%.
  • 104. URIs hosted in Western countries are more likely to be archived. Full data set, by category (Total, Archived): Arabic 33.18%, 33.56%; AR ccTLD 14.84%, 28.09%; AR GeoIP 10.53%, 13.11%; AR both 7.81%, 59.50%; Neither 66.82%, 65.22%.
  • 106. URIs that had some Arabic location had a higher indexing rate. Seed data set, by category (Total, Indexed): Arabic 15.01%, 78.29%; AR ccTLD 6.61%, 76.09%; AR GeoIP 2.37%, 73.54%; AR both 6.03%, 85.24%; Neither 84.99%, 67.09%.
  • 108. The spread of mementos was not affected by location or ccTLD (Kolmogorov-Smirnov test). Mean by category: Ar GeoIP 0.5010; Ar ccTLD 0.5013; Both 0.5016; Neither 0.5005. Test results (D-value, p-value): Ar ccTLD vs. neither 0.017, <0.002; Ar GeoIP vs. neither 0.014, <0.002.
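The D-value reported by a two-sample Kolmogorov-Smirnov test is the largest vertical gap between the two samples' empirical distribution functions. A minimal pure-Python sketch of that statistic follows. The function name `ks_statistic` and the sample values are hypothetical, the slide does not say how the spread values were computed, and the p-value calculation is omitted:

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic D: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical spread values in [0, 1] for two categories of URIs.
ar_cctld = [0.48, 0.50, 0.52, 0.55]
neither = [0.47, 0.50, 0.51, 0.56]
print(ks_statistic(ar_cctld, neither))
```

A D-value near 0, like the 0.017 and 0.014 on the slide, means the two distributions are nearly indistinguishable in shape.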
  • 109. Just because a webpage is older does not mean that it is archived more, because of low historical archiving rates.
  • 110. Just because a webpage is older does not mean that it is archived more. We look at the last three years.
  • 112. In the last three years, the older the resource is, the more mementos it has.
  • 113. Top-level URIs are more likely to be archived and indexed. By path depth (full data set: Total, Archived | seed data set: Total, Indexed): depth 0: 17.30%, 86.29% | 86.05%, 74.60%; depth 1: 40.42%, 53.49% | 9.77%, 38.91%; depth 2: 24.45%, 45.57% | 3.72%, 17.85%; depth 3+: 17.83%, 34.24% | 0.50%, 57.50%.
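One way to bucket URIs by path depth is to count the non-empty segments of the URI path, so that a bare hostname has depth 0. The sketch below assumes that definition; the slides do not state the exact counting rule, and the function names are illustrative:

```python
from urllib.parse import urlparse

def path_depth(uri):
    """Count non-empty path segments; 'http://example.com/' has depth 0."""
    path = urlparse(uri).path
    return len([segment for segment in path.split("/") if segment])

def depth_bucket(uri):
    """Map a URI to the buckets used in the table: '0', '1', '2', or '3+'."""
    depth = path_depth(uri)
    return "3+" if depth >= 3 else str(depth)

print(depth_bucket("http://www.kooora.com"))
print(depth_bucket("http://www.yemenakhbar.com/yemen-news/178683.html"))
```

Under this rule, query strings and fragments do not add depth, because only the path component is inspected.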
  • 116. Summary of collection methods: collected URIs from three Arabic directories, DMOZ, Raddadi.com, and Star28.com (7,976); crawled the seed dataset (1,299,671); checked that the URIs are unique (663,443); checked that they are live (482,905); checked for Arabic language (300,646).
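The counts above form a filtering funnel, and the share of URIs surviving each step can be read off by dividing each count by the one before it. A small sketch of that arithmetic, using only the counts on the slide (the stage labels and the `retention` helper are illustrative):

```python
# URI counts after each filtering step, as listed on the slide.
STAGES = [
    ("crawled from seeds", 1_299_671),
    ("unique", 663_443),
    ("live", 482_905),
    ("Arabic language", 300_646),
]

def retention(stages):
    """Return (stage name, fraction kept relative to the previous stage)."""
    return [
        (name, count / previous_count)
        for (_, previous_count), (name, count) in zip(stages, stages[1:])
    ]

for name, kept in retention(STAGES):
    print(f"{name}: {kept:.1%} of the previous stage")
```

The 7,976 directory URIs are left out because crawling expands the set rather than filtering it.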
  • 117. Findings: (1) Our Arabic-language dataset was not largely located in Arabic countries: only 14.84% had an Arabic ccTLD; only 10.53% had a GeoIP in an Arabic country; popular Western domains (e.g., cnn.com, wikipedia.org) appeared in the top 10. (2) Arabic webpages are not particularly well archived or indexed: 46% were not archived; 31% were not indexed by Google. (3) An Arabic webpage is more likely to be indexed if it is present in a directory, archived if it is present in DMOZ, and archived if it has neither an Arabic GeoIP nor an Arabic ccTLD. For right now, if you want your Arabic-language webpage to be archived, host it outside of an Arabic country and get it listed in DMOZ.
  • 120. GeoIP Location: We obtained the IP addresses of the hostnames using nslookup, which uses DNS to convert a hostname to its IP address. We used the MaxMind GeoLite2 database to determine location from the IP address; it tests at 99.8% accuracy at the country level. Sources: http://dev.maxmind.com/geoip/geoip2/geolite2/ and http://dev.maxmind.com/faq/how-accurate-are-the-geoip-databases/
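The two location signals used in the deck, ccTLD and GeoIP country, can be combined into the categories from the earlier tables. The sketch below is an assumption-laden illustration, not the authors' code. `ARABIC_COUNTRY_CODES` holds only the five countries named on the GeoIP slide, `country_of_ip` is a caller-supplied lookup (for example, a thin wrapper around a GeoLite2 database reader), and the categories are treated as mutually exclusive:

```python
import socket

# Illustrative subset only: the five Arabic countries named on the GeoIP slide.
ARABIC_COUNTRY_CODES = {"SA", "EG", "JO", "KW", "AE"}

def cc_tld(hostname):
    """Return the last DNS label upper-cased, e.g. 'SA' for 'www.haraj.com.sa'."""
    return hostname.rstrip(".").rsplit(".", 1)[-1].upper()

def classify(hostname, country_of_ip, resolve=socket.gethostbyname):
    """Label a host 'AR both', 'AR ccTLD', 'AR GeoIP', or 'Neither'.

    country_of_ip: hypothetical callable mapping an IP string to an ISO country code.
    resolve: callable mapping a hostname to an IP string (DNS lookup by default).
    """
    has_cctld = cc_tld(hostname) in ARABIC_COUNTRY_CODES
    has_geoip = country_of_ip(resolve(hostname)) in ARABIC_COUNTRY_CODES
    if has_cctld and has_geoip:
        return "AR both"
    if has_cctld:
        return "AR ccTLD"
    if has_geoip:
        return "AR GeoIP"
    return "Neither"

# Offline usage with stubbed lookups (192.0.2.1 is a documentation-only address).
print(classify("www.haraj.com.sa", lambda ip: "US", resolve=lambda host: "192.0.2.1"))
```

Passing the resolver and the country lookup as parameters keeps the classification logic testable without network access or a database file.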