Media news is increasingly used in academic research as data for social science studies. News makes it possible to detect and monitor events ranging from social movements to natural catastrophes. In recent decades, several scholars have worked on the definition and identification of media events (Dayan and Katz, 1992; McCombs and Shaw, 1972). Among them, some investigated cross-national media coverage of different types of events (Galtung and Ruge, 1965; Koopmans and Vliegenthart, 2011) and focused on mechanisms that may explain the diffusion of media attention. One of the main issues with this type of study is that media data can be retrieved only from commercial databases such as DowJones Factiva. Using these databases is not only expensive, it also raises several technical problems (e.g. it is not possible to extract more than 100 items simultaneously) and methodological problems (e.g. the lack of transparency concerning keywords and the inhomogeneous coverage of sources).
Yet, recently, with the emergence of Web 2.0, media outlets (especially newspapers) have begun publishing news directly on the Web, mostly free of charge, and they often provide free push services such as RSS feeds that deliver real-time information to the reader. Our hypothesis is that newspapers’ RSS feeds can be an alternative source of information for media studies. RSS feeds are expected to have three main advantages: they are free; they may be archived and tagged without limits; and they are generally published as soon as the news is ready, which makes them suitable for real-time analysis. However, RSS feeds are still little studied. While several studies address their technical aspects, their informational value is rarely analyzed (Marty et al., 2012). Several exploratory studies on our corpus have already shown that, despite their short format, RSS items allow both qualitative and quantitative analysis (Beauguitte and Severo, 2014).
This paper presents the results of the ANR Corpus Geomedia project (2013-2016, http://geomedia.hypotheses.org). In this project, we build a database storing RSS feeds associated with articles published in one hundred newspapers in different parts of the world in order to extract two types of information: flows among countries (which spaces are of interest, depending on where the media outlet is located?) and international events (can we distinguish between inter-national, regional and global events in our corpus?). Geographical structures possibly revealed by international news flows are our main domain of investigation, as we plan to compare them with other global interaction patterns (trade, migration, finance, etc.). At the end of the project, the archive will be freely available to researchers. This paper presents the main features of this Web archive and possible uses of RSS data for studying international events from a multi-dimensional viewpoint.
Archiving news on the Web through RSS flows. A new tool for studying international events (with L. Beauguitte and H. Pecout).
1. Archiving news on the Web through RSS flows
A new tool for studying international events
M. Severo (Université Lille 3/GIS-CIST)
L. Beauguitte (CNRS/UMR IDEES)
H. Pecout (CNRS/GIS-CIST)
2. Studies on media news
• News values
(Galtung and Ruge, 1965; Koopmans and Vliegenthart, 2011)
• Definition of media events (Dayan and Katz, 1992)
• Agenda setting (McCombs and Shaw, 1972)
• Event detection from social movements to
natural events
(Herkenrath and Knoll, 2011; Koopmans and Vliegenthart, 2011)
3. Which data?
• Commercial databases: DowJones Factiva,
LexisNexis, Europresse…
• Issues:
– Costs
– Data extraction is very limited
– These databases are usually neither very accurate nor
very transparent
4. RSS feeds
Our hypothesis is that newspapers’ RSS feeds can
be an alternative source of information for media
studies
5. Advantages of RSS feeds
• freely accessible
• homogeneous structure
• real-time
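The homogeneous structure is what makes feeds straightforward to archive. A minimal sketch, assuming a standard RSS 2.0 document; the sample feed, its URL and the function name `parse_items` are illustrative and are not taken from the Geomedia pipeline:

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Invented sample feed (assumption): any RSS 2.0 feed exposes the same fields.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Daily - International</title>
    <item>
      <title>Ebola outbreak spreads</title>
      <link>http://example.org/news/1</link>
      <description>Short summary of the article.</description>
      <pubDate>Mon, 04 May 2015 10:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(xml_text):
    """Return one dict per <item>, with the publication date parsed."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
            # RSS 2.0 dates follow the RFC 822 format, which the email
            # utilities of the standard library can parse.
            "published": parsedate_to_datetime(pub) if pub else None,
        })
    return items


for entry in parse_items(SAMPLE_FEED):
    print(entry["published"], entry["title"])
```

Because every item carries the same few fields, the same function can in principle be applied to any newspaper's feed before storage.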
6. Limits of RSS
• Copyright status is not clear
• Is RSS dying?
7. Geomedia project
• ANR CORPUS (2013-2016)
• Partners: GIS-CIST, Density Design Lab
(Politecnico de Milan), INED, Laboratoire
d’informatique de Grenoble, Laboratoire GERiiCO
(Lille 3), UMR Géographie-Cités, UMR IDEES,
UMS Riate
• http://geomedia.hypotheses.org
• A database storing RSS feeds associated with
articles published in newspapers in different parts
of the World
8. 300 RSS feeds
8 languages
- English = 165 feeds (55%)
- Spanish = 52 feeds (17%)
- French = 41 feeds (14%)
- Portuguese = 17 feeds (6%)
- German = 14 feeds (5%)
- Italian = 7 feeds (2%)
- Polish = 2 feeds
- Catalan = 1 feed
• Distribution by language and by type:
RSS feeds by language
9. Unique General Une Breaking news International National
One RSS feed is offered
by the newspaper
"Home" "Top Head Lines" "Ultimas Noticias" "International" "National"
"Home page" "Portada" "Dernières News" "Internacional" Name of the country
"All News" "Top stories" "Latest News" "Monde"
"Todas las noticias" "Une" "Hot News" "Mundo"
"Titulares" "News"
"Actu" "Fil info"
"Actualidad"
Categorisation according to the name used on the website:
10. 300 RSS feeds
6 "types"
- International = 114 feeds (38%)
- (A la) Une (front page) = 56 feeds (19%)
- General = 45 feeds (15%)
- Breaking News = 40 feeds (13%)
- Unique = 30 feeds (10%)
- National = 14 feeds (4.7%)
RSS feeds by type
11. 300 RSS feeds
6 continents
59 countries
Feeds by region
Europe = 82
Latin America = 52
North America = 31
Sub-Saharan Africa = 29
South-East Asia = 28
Oceania = 17
Middle East = 16
South Asia = 15
North Africa = 10
North-East Asia = 7
Caribbean = 6
Central Asia = 6
Feeds by country
12. Total stored items: 6,185,200 (May 2015)
Average items per hour: 980
Average items per day: 23,500
Stored items by language
13. Objects of analysis
• Flows among countries:
– hierarchy of places
– Co-occurrences among places
• International events:
– can we distinguish between inter-national,
regional and global events?
– time persistence and geographical diffusion of
media events
15. Issues related to tagging
• Countries: country names could be identified
with a low margin of error. Our objective is a
margin of error below 5%.
• Events: the lexical spectrum of an event needs to
be as narrow as possible for the event to be
identified in our corpus (e.g. Ebola, Snowden, Charlie-Hebdo)
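A minimal sketch of dictionary-based tagging, assuming a simple lookup of name variants. The dictionaries, the ISO-style codes and the function `build_tagger` are hypothetical and deliberately tiny; they are not the project's actual tagging rules:

```python
import re

# Hypothetical dictionaries (assumption): tag -> name variants to look for.
COUNTRY_NAMES = {
    "FRA": ["France"],
    "USA": ["United States", "USA"],
    "GIN": ["Guinea"],
    "GNB": ["Guinea-Bissau"],
}
EVENT_KEYWORDS = {
    "ebola": ["Ebola"],
    "charlie_hebdo": ["Charlie Hebdo", "Charlie-Hebdo"],
}


def build_tagger(dictionary):
    """Build a function that returns the sorted tags found in a text."""
    lookup = {v.lower(): tag
              for tag, variants in dictionary.items()
              for v in variants}
    # Longest variants first, so that "Guinea-Bissau" is not tagged as "Guinea".
    variants = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(v) for v in variants) + r")(?!\w)",
        re.IGNORECASE,
    )

    def tag(text):
        return sorted({lookup[m.group(1).lower()]
                       for m in pattern.finditer(text)})

    return tag


tag_countries = build_tagger(COUNTRY_NAMES)
tag_events = build_tagger(EVENT_KEYWORDS)

title = "Ebola: France sends doctors to Guinea-Bissau"
print(tag_countries(title), tag_events(title))
```

Overlapping names such as Guinea and Guinea-Bissau are one source of tagging errors. An event with a narrow lexical spectrum needs only one or two variants, which is why it is easier to identify.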
16. Research directions (1):
time persistence and diffusion of given events
Wukan protest – Severo et al., 2012
https://jitsociology.wordpress.com/2012/12/02/the-wukans-protests-just-in-time-identification-of-international-media-events-revised/
17. Research directions (2):
hierarchy of places
Top 38 European Cities based on a sample of
international RSS flows
18. Research directions (3):
co-occurrences of places
2 months of RSS items regarding international news (Jan. - Feb. 2014)
2 French newspapers - 2 Australian newspapers
2 reference newspapers & 2 popular newspapers
The Australian : 1160 items
The Daily Telegraph : 1103 items
Le Figaro : 608 items
Le Parisien : 643 items
4 newspapers: 6 to 7 states =
50% of all state occurrences
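Both measures on this slide, co-occurrences of places and the number of states that account for half of all occurrences, can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, each item is assumed to be already tagged with country codes, and the sample items are invented, not drawn from the corpus:

```python
from collections import Counter
from itertools import combinations


def state_counts(tagged_items):
    """Count how many items mention each state (once per item)."""
    return Counter(s for states in tagged_items for s in set(states))


def cooccurrences(tagged_items):
    """Count unordered pairs of states mentioned in the same item."""
    pairs = Counter()
    for states in tagged_items:
        for a, b in combinations(sorted(set(states)), 2):
            pairs[(a, b)] += 1
    return pairs


def states_for_share(counts, share=0.5):
    """Smallest number of top-ranked states reaching `share` of occurrences."""
    total = sum(counts.values())
    running = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        running += n
        if running >= share * total:
            return rank
    return 0


# Invented sample (assumption): one list of country codes per RSS item.
items = [["USA"], ["USA", "RUS"], ["USA", "CHN"],
         ["FRA", "MLI"], ["USA"], ["RUS", "UKR"]]

counts = state_counts(items)
print(states_for_share(counts))   # number of states covering 50% of mentions
print(cooccurrences(items).most_common(3))
```

Items that mention a single state contribute to the hierarchy of places but to no pair, so the two counters can be compared to see which states tend to be quoted alone.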
19. Beauguitte et al., 2014
http://fr.slideshare.net/LaurentBeauguitte/do-international-news-reflect-world-hierarchy-a-network-approach
20.
21. Further research
• Technical issues:
(1) limited coverage of the feeds (not the entire
news content)
(2) heterogeneity of the feeds
(3) tagging
• Our database will be made available to
researchers, who will be able to test their own
hypotheses.
22. Policy of use of the data
• http://www.gis-cist.fr/en/
Issues related to data storage
Technological issues
Data issues (jet lag)
Changes of sources (broken links, etc.)
Selection of newspapers
Selection of Flows (international, homepage, breaking news…)
The first direction tries to model news flows from a thematic and a spatial perspective:
when and how does a given event spread across newspapers on an international scale?
Is it possible to identify patterns and trends of diffusion?
What are the barriers (cultural, linguistic, political) to these diffusion processes?
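The questions above can be approached with two simple indicators per newspaper: the delay before the first mention of an event and the number of days on which it is mentioned. A minimal sketch, assuming items are available as (newspaper, date, title) tuples and that an event is matched by a single keyword; the function names, newspaper names and dates are invented for illustration:

```python
from datetime import date


def diffusion_profile(items, keyword):
    """For each newspaper: first and last mention, and number of active days."""
    days = {}
    for paper, day, title in items:
        if keyword.lower() in title.lower():
            days.setdefault(paper, set()).add(day)
    return {
        paper: {"first": min(d), "last": max(d), "active_days": len(d)}
        for paper, d in days.items()
    }


def adoption_lags(profile):
    """Days between the earliest mention overall and each paper's first mention."""
    if not profile:
        return {}
    start = min(p["first"] for p in profile.values())
    return {paper: (p["first"] - start).days for paper, p in profile.items()}


# Invented sample (assumption): placeholder newspapers and dates.
items = [
    ("Paper A", date(2011, 12, 14), "Wukan villagers protest"),
    ("Paper A", date(2011, 12, 15), "Wukan standoff continues"),
    ("Paper B", date(2011, 12, 16), "China: Wukan protest"),
    ("Paper B", date(2011, 12, 16), "Markets fall"),
]

profile = diffusion_profile(items, "Wukan")
print(adoption_lags(profile))
```

Grouping the lags by language or country of the newspaper would be one way to look for the cultural, linguistic or political barriers mentioned above.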
4 newspapers: 6 to 7 states = 50% of all state occurrences
State co-occurrences have, to the best of our knowledge, barely been examined in previous studies. However, our first tests (Beauguitte and Severo, 2014) seem to indicate that such an investigation could yield several insights into world structure. For instance, while major powers are often mentioned alone in international news (an American event often becomes a world event), least developed countries are never mentioned alone, except in cases of major crisis.
International news is firstly a product targeting a national audience
Hierarchy differs according to nationality rather than type of newspaper
Really different types of information (many more human-interest stories, or faits divers, in the Daily
Telegraph) but same places & same concentration on a few states