In this talk, I survey a series of research works on which I have collaborated over the last 5 years, dealing with data science applications on digital platforms for citizen participation. In particular, I describe solutions that we have developed for the analysis of discussion and controversy in web forums, the classification of the intention of messages in microblogging systems, the search and recommendation of citizen proposals, the extraction of argumentative information, and access to the contents of electronic participatory budgets through a conversational agent. For some of them, I give a brief introduction to related research areas, such as web crawling and scraping, argument mining, and recommender systems.
Data science in practice: Case studies in e-participation
1. Universidad del Bío-Bío, Chile
Facultad de Ciencias Empresariales
Iván Cantador, ivan.cantador@uam.es
January 13, 2023
About me
• Iván Cantador
• Associate Professor at the Computer Science and Engineering Department
of Universidad Autónoma de Madrid, Spain
http://www.eps.uam.es/~cantador
• Research interests
- Recommender systems
- Information retrieval
- Machine learning
- Natural language processing
- Semantic technologies
- E-government
Contents
1. E-participation
2. Decide Madrid
3. Data acquisition and processing
4. Data mining applications
5. Information retrieval applications
6. Recommendation applications
7. Conclusions
Contents
1. E-participation
• Open government
• Citizen participation
• Digital platforms for citizen participation
2. Decide Madrid
3. Data acquisition and processing
4. Data mining applications
5. Information retrieval applications
6. Recommendation applications
7. Conclusions
Open government
• Open Government (Oszlak, 2013) – A public management paradigm that arises in
a context characterized by:
• Citizen disaffection, caused by numerous crises that call into question
the capacity of the Public Administration to deal with them
• The rise of ubiquitous technologies, which have transformed communications
and interactions between individuals, and have promoted the emergence of open,
participatory and collaborative practices
• The opening of the government, among other institutions, to citizens, aiming to end
the existing disaffection
Open government
• Goals of the open government model (Ramírez-Alujas, 2014):
• Increasing the transparency (accountability) and access to
government information through Open Data
- These open data should allow citizens to have access to information
and should promote innovation and economic development in the public sector
• Facilitating the collaboration between distinct actors, particularly between public
administrations, civil society, and the private sector, in order to codesign and generate
public value
• Promoting citizen participation in the design and implementation of public policies,
i.e., in decision and policy making
Open government
• Background – Memorandum on Transparency and Open Government, USA.
Barack Obama’s Administration, 2009
• Transparency: providing information about the government activity, its performance, etc.
This encourages and promotes accountability and social control.
• Participation: promoting the right of citizens to actively participate in policy making.
• Collaboration: involving citizens and other actors in scenarios of cooperation and
coordinated work.
• Technology: using technology as an instrument to promote openness in government,
facing the challenges of the new millennium.
Open government
• The Open Government Partnership emerged in 2011 in order to promote open
government in different administrations
• It seeks specific commitments from governments to promote transparency, empower
citizens, fight corruption, and take advantage of new technologies to
strengthen governance
- Founded by 8 countries: Brazil, Mexico, Indonesia, Philippines, Norway, USA, South Africa, UK
- Composed of 70 member states and numerous government organizations
• Principal commitments:
- Improvement of public services
- Increased public integrity
- Effective management of public resources
- Safer communities
- Increased corporate responsibility
https://www.opengovpartnership.org
Citizen participation
• Citizen participation is a process that gives private
individuals an opportunity to influence public decisions,
and has been a component of democratic decision making
• A community-based process in which citizens may organize themselves
and their goals, and may work together through non-governmental organizations
to influence public policies and plans
• Benefits
• Governance: reducing conflicts, strengthening democratic legitimacy, encouraging active
citizenship → government transparency and accountability, and trust between citizens and
political institutions
• Increasing the quality of public decisions and services
• Learning and training to build stronger societies
• Promoting social cohesion, mutual understanding and social justice
Citizen participation
• Ladder of citizen participation (Arnstein, 1969)
• 8 levels in 3 groups
- No participation
- Symbolic participation
- ‘Real’ participation
• Simplified by the OECD model into 3 levels
Citizen participation
• Barriers to citizen participation
• Incompatibilities
- Political, legal, cultural, socioeconomic, organizational
• Intrinsic problems
- Complex, expensive, under-representative, non-plural, poorly informed, conflictive,
non-deliberative, non-scalable, etc.
• Extrinsic problems
- Arbitrary and manipulable
- Inefficient and non-self-sustaining
- Irrelevant issues and lack of effect
- Citizen saturation
- Monopoly of participation, etc.
Citizen participation
• Tools for citizen participation
• Non-ICT-based
- Questionnaires and surveys
- Seminars, talks, and meetings
- Discussion and work groups
- Cultural, artistic and leisure events
• ICT-based
- E-mail, RSS, SMS, multimedia sharing
- Social media, web portals and e-platforms
- Mobile apps
- Open data, IoT (crowdsensing)
- Augmented/virtual reality
Citizen participation
• Participedia.net
• Anyone can join the Participedia community
and help crowdsource, catalogue, and
compare participatory political processes
around the world
• Cases (2259)
• Methods (360)
• Organizations (841)
• Teaching resources
Digital platforms for citizen participation
• With the advent of social media and mobile computing, nowadays there is a plethora
of digital citizen participation channels
• general-purpose online social networks
• ad hoc e-consultation, e-voting and e-participation platforms
• The huge, ever-increasing volume of citizen-generated content leads to an information
overload problem for both citizens and government stakeholders in decision and
policy making tasks
• Users may feel overwhelmed by the large amount of data, whose exploration and
understanding can be challenging and frustrating
• Citizens may feel thwarted if their proposals do not reach sufficient visibility and impact
Digital platforms for citizen participation
• E-participation refers to ICT-supported citizen
participation in governance processes
• administration
• service delivery
• decision making
• policy making
• It aims to improve the relations among stakeholders
in civil society (e.g., local government, citizens, firms),
putting citizens at the center of the processes
• It has given rise to novel consultation and deliberation
initiatives
Digital platforms for citizen participation
• E-participation tools by type of engagement and role of ICT/level of participation
Aichholzer, G., & Allhutter, D. (2011).
Online forms of political participation and their
impact on democracy. Institute of Technology
Assessment (ITA).
Digital platforms for citizen participation
• Most current e-participation
platforms are based on web forums
• Citizens make proposals and provide
comments and opinions, forming
large conversation threads
Example of a web forum-based e-participation platform: a citizen proposal and its discussions
Digital platforms for citizen participation
• Conventional web forums promote social interaction
• Pros
- Easy and fast content generation (through free text posts)
- Smooth, large-scale interaction (via comment threads)
• Cons
- No or very limited functionalities for content organization,
filtering and analysis
- Dispersed and redundant content, since it is structured
by time
- Challenging processing of discussions
Contents
1. E-participation
2. Decide Madrid
• Participatory budgeting
• E-participatory budgeting
• The ‘Decide Madrid’ platform
3. Data acquisition and processing
4. Data mining applications
5. Information retrieval applications
6. Recommendation applications
7. Conclusions
Participatory budgeting
• Participatory budgeting (PB) is a democratic
deliberation and decision-making process in
which citizens decide how to spend certain
municipal or public budgets
• informing about issues and problems on a wide range
of subject areas in a city, e.g., housing, public safety,
education, health, transportation and environment
• proposing, debating and supporting/voting for
spending ideas and projects aimed to address such
problems
Participatory budgeting
• Pros
• Increased government transparency and trust
• Citizens’ empowerment and change of democratic attitude
• Better allocation of resources (in general)
• Increased voter turnout
• Cons
• Lack of diverse representation
• Time consuming
• Resource intensive
• Lack of interest or political will
Participatory budgeting
• Since its original invention in Porto Alegre,
Brazil, in 1988, PB has gained much
popularity
• As of 2022, PB had spread to over 4,500 cities
around the world (source: Participatory
Budgeting World Atlas, https://www.pbatlas.net)
• Tools of citizen participation
• Meetings
• Committees
• Consultations
• …
• Electronic participatory platforms
http://www.participatorybudgeting.org
Participatory budgeting
• PB in Europe (https://www.euractiv.com/section/participatory-democracy/infographic/participatory-budgeting-europes-bet-to-increase-trust-in-government)
• While residents’ demands in European cities are often similar, the percentage of budget can
vary widely from one place to another: Paris dedicates 25% of the investment budget to PB,
while smaller cities usually invest 2 to 5% of their resources.
Participatory budgeting
• PB in Chile (https://www.pbatlas.net/chile.html)
• 37 local government initiatives + 1 regional government initiative
• Although PB initiatives in the country were born in 2002 out of the political will of mayors at
the local level, the region of Los Ríos has run its own process since 2014, driven by:
- the high value placed on citizen participation in the region
- the historical roots of the region's creation in 2007, preceded by a social
movement that demanded regional status for more than 30 years
• Proposals are presented mainly through social leaders
- the selection of the projects is carried out in neighborhood or territorial assemblies,
which are mostly formed by representatives of social organizations and institutions
• Regarding voting and prioritizing proposals, the predominant model is the people’s
direct and universal vote
E-participatory budgeting
• In addition to ad hoc PB digital applications
and platforms, there are several software
frameworks to build online PB platforms
• CONSUL, http://consulproject.org: tens of cities
in Spain, Italy, France and South America
• Stanford Participatory Budgeting,
http://pbstanford.org: major cities in the USA,
e.g., New York, Chicago, Seattle, Oakland and
Boston
• EU Open Budgets, http://openbudgets.eu/tools
Proposal fields: title, location, category, author, description, supports, comments
E-participatory budgeting
• Motivations for data science applications
• Limitations of current ePB platforms of large
cities
- very limited search and filtering functionalities
- unable to facilitate the analysis of hundreds,
even thousands, of citizen proposals and
associated comments and discussions
• When creating a budgeting proposal, a citizen should
be aware of similar or related ideas or projects, so that
she can better define the proposal or find the
opportunity to collaborate with others
The ‘Decide Madrid’ platform
• A web system designed to allow Madrid
residents to make, discuss and vote on
proposals for the city
• Used since September 2015
• With a €100M budget in 2017
• Consisting of a 3-phase process
The ‘Decide Madrid’ platform
• ~6,000 citizen proposals per year
• Keyword-based search
• No use of (structured) metadata
• No data analysis
• No personalization
• No recommendation
The ‘Decide Madrid’ platform
• Available data for a proposal
• Title
• Author
• Date
• Summary
• Description
• Freely-chosen tags
• Number of user votes
• User comment threads
The ‘Decide Madrid’ platform
Why consider Decide Madrid as a representative case study?
• Participatory budgeting is one of the most widely used citizen participation methods
worldwide:
• Represented in more than 400 cases from a total of 2,000 cases analyzed in Participedia
(https://participedia.net)
• Used in more than 3,000 cities and municipalities worldwide according to the Participatory
Budgeting Project (https://www.participatorybudgeting.org/white-paper)
• Decide Madrid is implemented upon CONSUL (https://consulproject.org), an open-source
framework to develop citizen participation platforms:
• Used in more than 135 institutions in 35 countries
• With a structure similar to other popular frameworks, such as Stanford Participatory
Budgeting (https://pbstanford.org) and EU Open Budgets (http://openbudgets.eu/tools)
Contents
1. E-participation
2. Decide Madrid
3. Data acquisition and processing
• The data mining pipeline
• Data crawling
• Data scraping
• Data processing
4. Data mining applications
5. Information retrieval applications
6. Recommendation applications
7. Conclusions
The data mining pipeline
• Data: pure and simple facts with no particular organization
• Information: processed, filtered, calculated, structured, categorized, contextualized data
• Knowledge: understanding, experience, insights, intuitions to use information
The data mining pipeline
• Unstructured data
- No structure
- Non-restricted vocabulary, no predefined schema
- E.g.: free text
• Semi-structured data
- Simple and flexible structure, no strict format
- Limited vocabulary, schema mixed with data values
- E.g.: taxonomies (categories), folksonomies (tags)
• Structured data
- Rigid structure, strict format
- Well-defined vocabularies and representation
- E.g.: databases, ontologies
The data mining pipeline
• Open Government Data (OGD) promote transparency, accountability and
public value creation
• By making datasets publicly available, institutions become more transparent and
accountable to citizens
• By facilitating the use, reuse and free distribution of datasets, governments foster business
creation and innovative, citizen-centered digital applications and services
• OGD portals enable the general public to access the open data collections
• allowing the search of data files, but not the search of information within the files
The data mining pipeline
• Open data portals are web sites to access sets
of OGD collections
• Search engine
- Retrieving collections via keyword-based queries
• Collection metadata
- Title, description, date, size, etc.
• Data files
- Formats: CSV, XLS, XML, RDF, etc.
- To be downloaded and opened with specific
applications, e.g., Microsoft Excel
• Documentation
- Inner structure of the data files
Example: open data portal of Madrid City Council
The data mining pipeline
• Open data are commonly provided as tables:
• Rows = data records (instances, individuals)
• Columns = data attributes (features, fields)
Example: records of traffic accidents that occurred in Madrid in 2020
The data mining pipeline
Methodology
• Text processing on titles, tags,
descriptions and comments of citizen
proposals
• Semantic annotation of proposals:
topics and districts
• Computing discussion and
controversy metrics on the
comments of each proposal
• Exploiting open data as statistical
indicators about districts: economic,
sociocultural, ideology, employment,
education, health, housing, etc.
The data mining pipeline
• 2 complex processes for:
• crawling and scraping the ‘Decide Madrid’ web pages
• mapping tags to places and topics
• 22 districts & hundreds of places
• 30 topics
• urbanism, transport, environment,
health care, education, social rights,
culture, economy, job,
politics, security, housing, family,
old age, religion, animals, etc.
Assumption: a comment = a (positive, unary) rating
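The assumption above can be illustrated with a minimal sketch: every (user, proposal) pair with at least one comment is treated as a single positive rating. The `Comment` class, the user names and the proposal ids are hypothetical; the slides do not show how the ratings were actually stored.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class UnaryRatingsSketch {

    /** One comment event: who commented on which proposal (hypothetical structure). */
    static final class Comment {
        final String user;
        final int proposalId;

        Comment(String user, int proposalId) {
            this.user = user;
            this.proposalId = proposalId;
        }
    }

    /** Maps each user to the set of proposals they commented on (rating = 1 if present). */
    static Map<String, Set<Integer>> unaryRatings(List<Comment> comments) {
        Map<String, Set<Integer>> ratings = new TreeMap<>();
        for (Comment c : comments) {
            // Several comments by the same user on the same proposal collapse into one rating.
            ratings.computeIfAbsent(c.user, u -> new TreeSet<>()).add(c.proposalId);
        }
        return ratings;
    }

    public static void main(String[] args) {
        List<Comment> comments = List.of(
                new Comment("ana", 34239), new Comment("ana", 34239),
                new Comment("ana", 101), new Comment("luis", 101));
        System.out.println(unaryRatings(comments)); // {ana=[101, 34239], luis=[101]}
    }
}
```

Under this reading, the absence of a comment is "unknown" rather than a negative rating.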
The data mining pipeline
Dataset
• Participatory budgeting of
4 editions: 2015-2018
• Around 29,000 proposals
• More than 86,000
comments
• 30 categories and 325
topics
• 21 districts + “city scope”
Data crawling
• A (web) crawler is a computer program that browses the Web in a methodical
(orderly), automated manner
• Applications
• Web search/indexing
• Vertical (specialized) search engines, e.g., news, shopping, recipes, reviews, papers
• Monitoring web sites and pages of interest
• Business intelligence: collecting information about company competitors and potential
collaborators
• Malicious applications: collecting personal information
Data crawling
• A crawler within a web search engine
Data crawling
• A crawler within a web application
Data crawling
• Generic web crawling process
• Seeds
- A list of starting URLs
• Visiting order
- Frontier = unvisited URLs
- Deciding which (lower-priority) URLs should be discarded
so as not to fill up the frontier
• Stop criterion
- Empty frontier or maximum number
of pages crawled
Data crawling
• Best First
• The simplest topical crawler
• The frontier is a priority queue based on text (or keyword) similarity between topic and
parent page
bestFirst(topic, seed_urls) {
    foreach link in seed_urls {
        enqueue(frontier, link);              // seeds enter the frontier first
    }
    visited := 0;
    while (frontier.size() > 0 and visited < MAX_PAGES) {
        link := dequeueMax(frontier);         // dequeue the link with MAX similarity
        page := fetch(link);
        visited := visited + 1;
        score := sim(topic, page);            // similarity between topic and parent page
        foreach outlink in extract_links(page) {
            enqueue(frontier, outlink, score);
        }
    }
}
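As a runnable illustration of the Best First strategy, here is a minimal sketch that crawls a toy in-memory "web" instead of fetching real pages. The toy pages, the word-overlap `sim` function, the FIFO tie-breaking and the `seen` set that avoids re-enqueuing URLs are all assumptions made for this example; they are not taken from the slides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.Set;

class BestFirstSketch {

    /** A toy page: its text and the URLs it links to. */
    static final class Page {
        final String text;
        final List<String> outlinks;

        Page(String text, List<String> outlinks) {
            this.text = text;
            this.outlinks = outlinks;
        }
    }

    /** A frontier entry: URL, score inherited from the parent page, insertion order. */
    static final class Candidate {
        final String url;
        final double score;
        final long order;

        Candidate(String url, double score, long order) {
            this.url = url;
            this.score = score;
            this.order = order;
        }
    }

    /** Keyword similarity: fraction of topic words that occur in the page text. */
    static double sim(String topic, String text) {
        Set<String> pageWords = new HashSet<>(Arrays.asList(text.toLowerCase().split("\\W+")));
        String[] topicWords = topic.toLowerCase().split("\\W+");
        int hits = 0;
        for (String w : topicWords) {
            if (pageWords.contains(w)) hits++;
        }
        return (double) hits / topicWords.length;
    }

    /** Returns the URLs in the order in which they were visited. */
    static List<String> crawl(String topic, List<String> seeds, Map<String, Page> web, int maxPages) {
        // Highest score first; ties are broken by insertion order (FIFO).
        PriorityQueue<Candidate> frontier = new PriorityQueue<>((a, b) -> {
            int c = Double.compare(b.score, a.score);
            return c != 0 ? c : Long.compare(a.order, b.order);
        });
        Set<String> seen = new HashSet<>(seeds);
        long next = 0;
        for (String seed : seeds) frontier.add(new Candidate(seed, 1.0, next++));

        List<String> visited = new ArrayList<>();
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll().url;
            Page page = web.get(url);           // stands in for fetch(link)
            if (page == null) continue;         // dead link in the toy web
            visited.add(url);
            double score = sim(topic, page.text);
            for (String out : page.outlinks) {
                if (seen.add(out)) frontier.add(new Candidate(out, score, next++));
            }
        }
        return visited;
    }

    /** A five-page toy web used by main. */
    static Map<String, Page> toyWeb() {
        return Map.of(
                "/", new Page("home", List.of("/a", "/b")),
                "/a", new Page("led street lights proposal", List.of("/a1")),
                "/b", new Page("sports news", List.of("/b1")),
                "/a1", new Page("more street lights", List.of()),
                "/b1", new Page("football results", List.of()));
    }

    public static void main(String[] args) {
        System.out.println(crawl("street lights", List.of("/"), toyWeb(), 4)); // [/, /a, /a1, /b]
    }
}
```

In this toy run, `/a1` is visited before `/b` because it inherits the high score of its on-topic parent `/a`.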
Data crawling
<div class="proposal-content">
<h3><a href="/proposals/34239-luces-led-barrio-concepcion-y-san-pascual">
Luces LED Barrio Concepción y San Pascual </a></h3>
<p class="proposal-info">
<span class="icon-comments"></span>
<a href="/proposals/34239-luces-led-barrio-concepcion-y-san-pascual#comments">
Sin comentarios</a>
<span class="bullet"> • </span>01/12/2022
Data crawling
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.net.URI;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// RAND (a java.util.Random) and USER_AGENTS (a String[]) are fields of the enclosing class, not shown on the slide
public static void downloadProposalsURLs(String url, String file, int firstPage, int lastPage, boolean append) throws Exception {
    // try-with-resources closes the writer even if a page fails to download
    try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file, append), "UTF-8"))) {
        for (int p = firstPage; p <= lastPage; p++) {
            // Pick a random user agent
            String userAgent = USER_AGENTS[RAND.nextInt(USER_AGENTS.length)];
            // Open the connection and read the web document (listing page number p)
            URI uri = new URI(url + p);
            Document doc = Jsoup.connect(uri.toASCIIString()).userAgent(userAgent).get();
            // Read the proposal URLs: the first <a> link within each <div class="proposal-content"> element
            Elements proposals = doc.getElementsByClass("proposal-content");
            for (Element proposal : proposals) {
                String linkURL = proposal.getElementsByTag("a").get(0).attr("href");
                writer.write(linkURL + "\n");
            }
        }
    }
}
Data scraping
<img alt="Armando Cuesta" class="initialjs-avatar author-photo"
data-char-count="1" data-font-size="19" data-height="32"
data-name="Armando Cuesta"
data-radius="4" data-seed="460897"
data-text-color="#ffffff" data-width="32" src="data:image/…">
Data scraping
public static Proposal getProposal(String proposalFile, boolean isClosed) throws Exception {
    Proposal proposal = new Proposal();
    Document doc = Jsoup.parse(new File(proposalFile), "UTF-8");

    // URL
    Elements elems = doc.select("meta[property=og:url]");
    String url = elems.attr("content").trim();
    proposal.setUrl(url);

    // Id
    String id = url.substring(url.lastIndexOf("/") + 1);
    id = id.substring(0, id.indexOf("-"));
    proposal.setId(Integer.valueOf(id));

    // Title
    elems = doc.select("meta[property=og:title]");
    String title = elems.attr("content").trim();
    proposal.setTitle(title);

    // Summary
    String summary = doc.select("div.proposal-show").get(0).getElementsByTag("blockquote").text().trim();
    if (summary.equals("Resumen de la propuesta")) {
        summary = "";
    }
    proposal.setSummary(summary);
    ...
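The id extraction step in the scraper relies on the last URL segment having the form `<id>-<slug>`. A slightly more defensive variant, using only the Java standard library, might look like the sketch below. The class and method names are hypothetical, and the handling of `#fragment` suffixes and of slugs without a hyphen is an assumption, not something the original scraper is known to need.

```java
class ProposalIdSketch {

    /** Extracts the numeric id from a URL such as "/proposals/34239-luces-led-...". */
    static int proposalId(String url) {
        // Drop any "#fragment" (e.g. "#comments") before looking at the last path segment.
        int hash = url.indexOf('#');
        String path = hash >= 0 ? url.substring(0, hash) : url;
        String slug = path.substring(path.lastIndexOf('/') + 1);
        // Keep only the part before the first hyphen; tolerate slugs with no hyphen at all.
        int dash = slug.indexOf('-');
        String digits = dash >= 0 ? slug.substring(0, dash) : slug;
        return Integer.parseInt(digits);
    }

    public static void main(String[] args) {
        String url = "/proposals/34239-luces-led-barrio-concepcion-y-san-pascual#comments";
        System.out.println(proposalId(url)); // 34239
    }
}
```

A URL whose last segment does not start with digits still fails, here with a `NumberFormatException`, which is probably the desired behavior for a scraper that should not silently store wrong ids.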
Data processing
• Database tables
• Created by the crawler and scraper
- proposals (code, title, author, date, summary, description, supports,…)
- users
- proposal_tags
- proposal_comments (id, author, text, parent_comment, pos_votes, neg_votes, …)
• Created from proposal_tags
- proposal_categories: text processing + clustering
- proposal_topics: text processing + clustering
- proposal_districts: text processing
- proposal_locations: text processing + mapping to a street directory + geolocation
Data processing
• Graph building
• Its nodes are the whole set of proposal tags
• Each of its (weighted) edges links a pair of “related” tags, according to:
- Syntactic similarity
- Semantic similarity
- Co-occurrences within proposals
• Graph clustering method proposed by Newman and Girvan (2004)
• It has a criterion to automatically set an optimal number of clusters
• Each cluster represents a topic, which is composed of a set of tags
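As an illustration of the co-occurrence criterion only, the following sketch derives edge weights between tags from the sets of tags attached to each proposal. The Jaccard normalization, the class name `TagGraphSketch` and the `"tagA|tagB"` key format are assumptions made for this example; the slides do not specify how the three similarity criteria were computed or combined.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class TagGraphSketch {
    /**
     * Builds weighted edges between tags that co-occur in proposals.
     * Keys are "tagA|tagB" with tagA < tagB; values are Jaccard overlaps in (0, 1].
     * Assumption: tags never contain the '|' character.
     */
    public static Map<String, Double> cooccurrenceWeights(List<Set<String>> proposalTags) {
        Map<String, Integer> tagCount = new HashMap<>();   // number of proposals per tag
        Map<String, Integer> pairCount = new TreeMap<>();  // number of proposals per tag pair
        for (Set<String> tags : proposalTags) {
            List<String> sorted = new ArrayList<>(new TreeSet<>(tags));
            for (int i = 0; i < sorted.size(); i++) {
                tagCount.merge(sorted.get(i), 1, Integer::sum);
                for (int j = i + 1; j < sorted.size(); j++) {
                    pairCount.merge(sorted.get(i) + "|" + sorted.get(j), 1, Integer::sum);
                }
            }
        }
        Map<String, Double> weights = new TreeMap<>();
        for (Map.Entry<String, Integer> e : pairCount.entrySet()) {
            String[] pair = e.getKey().split("\\|");
            int together = e.getValue();
            // Proposals that use at least one of the two tags
            int either = tagCount.get(pair[0]) + tagCount.get(pair[1]) - together;
            weights.put(e.getKey(), together / (double) either);
        }
        return weights;
    }
}
```

Pairs of tags that never appear together get no edge, so the resulting graph is sparse and can be passed to a graph clustering method.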
Data processing
• 2-level taxonomy: 30 categories + 325 topics
Category (English gloss): example tags, as written on the platform
Accesibilidad (Accessibility): accesibilidad, accesibilidad metro, aparcamiento para discapacitados, ...
Animales (Animals): adiestramiento canino, águilas, animales, animales de compañía, antitaurino, ...
Asociaciones (Associations): asociaciones, asociaciones de vecinos, asociaciones juveniles, ...
Ayuntamiento y administración pública (City council and public administration): administracion, alcaldesa, atencion al ciudadano, ...
Civismo (Civility): acoger, bioetica, bullying, cinismo, civico, civismo, colaboracion social, ...
Cultura (Culture): arqueologia, arte, arte callejero, arte urbano, artesania, artistas, ...
Delincuencia (Crime): anti corrupcion, atraco, carteristas, corrupcion, delincuencia, delitos, ...
Deportes (Sports): actividad fisica, anillo ciclista, area de deportes, atletas, atleti, atletismo, ...
Derechos sociales (Social rights): abuso, acoso, albergue, altermundialismo, apoyo emocional, apoyo social, ...
Economía (Economy): actividad económica, ahorro, bancos, bbva, comerciantes, comercio, ...
Educación (Education): acoso escolar, alumnos, bachillerato, bibiotecas, brecha cultural, ...
Empleo (Employment): autoempleo, autónomos, comerciales, conciliacion laboral, contratacion municipal, ...
Equidad e integración (Equity and integration): chabolas, cie, derechos lgtbi, inmigración, desigualdad de genero, ...
Familia e infancia (Family and childhood): actividades infantiles, ayuda embarazo, bebes, carricoche, ...
Jóvenes (Youth): acoso escolar, adolescencia, adolescentes, asociaciones juveniles, ...
Justicia (Justice): constitucion, cumplimiento de las leyes, dictadura, fiscal, franquismo, ...
Medio ambiente (Environment): acusticas, agroecologia, agua, aire, aire acondicionado, ajardinamiento, ...
Movilidad (Mobility): abono transportes, adif, agentes de movilidad, aparamiento regulado, ...
Ocio y entretenimiento (Leisure and entertainment): baile, bares, celebraciones, centro comercial, cines, conciertos, ...
Participación ciudadana (Citizen participation): accion social, avisos madrid, decide madrid, decidemadrid, ...
Política (Politics): 15m, ahora madrid, ayuntamiento, ayuntamiento de madrid, democracia, ...
Religión (Religion): españa laica, estado aconfesional, iglesia, islam, laicismo, religion, ...
Salud y sanidad (Health and healthcare): acoholismo, acustica, acusticas, aire libre, aire puro, alcohol, ...
Seguridad y emergencias (Safety and emergencies): accidentes, app emergencias, aviso, avisos madrid, bomberos, ...
Sostenibilidad (Sustainability): agroecologia, ahorro de energia, autogestion, ciudad amable, ...
Tercera edad (Old age): abuelos, ancianos, centros de dia, desempleo mayores, jubilacion, ...
Transparencia (Transparency): anti corrupcion, datos abiertos, derecho a la informacion, ...
Turismo (Tourism): oferta turistica, puntos de informacion turistica, puntos de interes, ...
Urbanismo (Urban planning): aceras, adoquinado, ajardinamiento, alumbrado, apariencia edificios, ...
Vivienda (Housing): alquileres, alquiler vacacional, alquiler vivienda, derecho a un vivienda, ...
Contents
1. E-participation
2. Decide Madrid
3. Data acquisition and processing
4. Data mining applications
• Discussion and controversy analysis
• Clustering and visualization
• Intent-based classification
5. Information retrieval applications
6. Recommendation applications
7. Conclusions
Discussion and controversy analysis
• In the literature, there is a predominance of online tools implemented ad hoc to
facilitate citizen participation at scale and to reduce costs
• To analyze in depth how participation takes place in such tools, we conduct
a study of one particular tool
• The chosen tool is Decide Madrid (https://decide.madrid.es), the participatory budgeting
e-platform of Madrid City Council, in operation since 2015
• The study makes use of diverse data:
• Topics, districts and support levels of citizen proposals
• Controversy level of the comment threads originating from the proposals
• Indicators about economic, sociocultural and ideological aspects of the districts
Discussion and controversy analysis
Motivation
• Government institutions lack understanding of the content generated by
citizens in electronic tools
• Institutions may fail to meet the citizens’ demands
- Certain relevant demands may go unmet, not because they are unfeasible, but
because of their controversial nature
• Decreased quality of decision making
• Loss of confidence on the part of the citizenry
Discussion and controversy analysis
Decide Madrid
• Operational since 2015
• With more than 6,000 citizen
proposals a year
• With more than 400,000
registered users in 2019
• With a structure of
discussion threads
(comments) for each citizen
proposal
Figure: Example of a citizen proposal in Decide Madrid (annotated fields: title; author, date; description; tags; votes; comments).
Discussion and controversy analysis
Controversy metrics
• To measure the controversy of a citizen proposal, we consider the aggregation of 3 metrics applied to
discussion threads (comments)
• Controversy based on the content (length) of discussions
• Controversy based on the opinion polarization (of votes)
• Controversy based on the structure of the conversations
Discussion and controversy analysis
Controversy metrics
• Discussion content-based controversy
• The length of the proposal’s discussion, measured as the sum of the length of its comments
• Opinion polarization-based controversy
• A weighted ratio measuring the difference of positive and negative votes for the proposal’s comments
• Conversation structure-based controversy
• An adaptation of the H-index for measuring discussion diversification
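The three signals above could be sketched as follows. The slides do not give the exact formulas, so every form used here is an assumption: length is counted in characters, polarization is taken as 1 minus the normalized difference between positive and negative votes, and the thread H-index is the largest nesting depth h that contains at least h comments. The class and field names are hypothetical, and the weighting and aggregation of the three metrics are left out.

```java
import java.util.List;

class ControversySketch {
    /** One comment: its text, nesting depth (1 = direct reply to the proposal) and votes. */
    static class Comment {
        final String text;
        final int depth;
        final int posVotes;
        final int negVotes;

        Comment(String text, int depth, int posVotes, int negVotes) {
            this.text = text;
            this.depth = depth;
            this.posVotes = posVotes;
            this.negVotes = negVotes;
        }
    }

    /** Content-based signal: total length, in characters, of all comments. */
    public static int discussionLength(List<Comment> thread) {
        int total = 0;
        for (Comment c : thread) total += c.text.length();
        return total;
    }

    /** Polarization signal (assumed form): 1 for an even split of votes, 0 for unanimity. */
    public static double polarization(List<Comment> thread) {
        int pos = 0, neg = 0;
        for (Comment c : thread) {
            pos += c.posVotes;
            neg += c.negVotes;
        }
        int total = pos + neg;
        return total == 0 ? 0.0 : 1.0 - Math.abs(pos - neg) / (double) total;
    }

    /** Structure signal (assumed form): largest depth h holding at least h comments. */
    public static int threadHIndex(List<Comment> thread) {
        int maxDepth = 0;
        for (Comment c : thread) maxDepth = Math.max(maxDepth, c.depth);
        int[] perDepth = new int[maxDepth + 1];
        for (Comment c : thread) perDepth[c.depth]++;
        int h = 0;
        for (int d = 1; d <= maxDepth; d++) {
            if (perDepth[d] >= d) h = d;
        }
        return h;
    }
}
```

Under these assumptions, a long thread with evenly split votes and deep, well-populated reply chains scores high on all three signals.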
Discussion and controversy analysis
Some results of the study (I)
• The controversy values follow a heavy-tailed distribution, in which the majority of the proposals have
low controversy
• Highly supported proposals are not necessarily the most controversial
“In Decide Madrid, proposals with a low level of support are currently discarded and archived, regardless of the level of
discussion and controversy they have. However, from a decision-making perspective, it would be interesting to delve deeper
into the controversial proposals and understand the problems of the city and the citizens they are affected by”.
Discussion and controversy analysis
Some results of the study (II)
• Most controversial and supported topics
• Religion: inclusion of LGTBI+ groups in Cabalgata de Reyes,
public funding and tax benefits for Catholic institutions
• Housing: creation of social housing, annual property taxes
• Culture: prohibition of bullfighting
• Topics having a low-to-moderate number of proposals with a
low level of support and high controversy
• Governance: transparency, citizen participation, public
administration, laws and legislation
• Rights and social movements: social rights, civility, equity,
migration, integration, crime, NIMBY
“In Decide Madrid, citizens’ ideological differences play
an important role in the group of controversial categories”.
“In Decide Madrid, political and social issues reach a
low-moderate relevance (final attention)”.
Discussion and controversy analysis
Some results of the study (II)
• Topics having a large number of proposals with a high level
of support and controversy
• Domestic animals, mainly dogs (e.g., cleaning and fines for excrement on
public roads, creation of "pipicans", compulsory leash, etc.)
• Topics having a low-to-moderate number of proposals with a
low-to-moderate level of support and controversy
• Education, health, family, childhood, old age, employment, accessibility,
youth
“In Decide Madrid, proposals aimed at some vulnerable groups
(for example, people with disabilities, the elderly, unemployed)
tend to generate less citizen participation”.
Discussion and controversy analysis
Some results of the study (III)
• Study of factors external to participation. Calculation of the correlation between levels of
support/controversy and district “statistical indicators” published as open data
• The districts in which the greatest number of proposals are generated are those with:
• A high number of groups, neighborhood associations, and consumer organizations
• A more progressive position, that is, in which the majority voted for PSOE and Unidas Podemos
• A greater environmental commitment, that is, with more ecological associations
• The districts in which the most controversial proposals are generated are those with:
• A higher percentage of young people
• A greater number of citizens belonging to vulnerable groups, such as the elderly, young people and people
with some type of disability
• A higher birth rate and number of associations related to childhood
Discussion and controversy analysis
Limitations of the study
• Only the votes (for or against) given to the comments have been considered
• The “polarity” (positive or negative) of the comments themselves should be analyzed. To do this, natural
language processing techniques would have to be applied
• Decide Madrid, which is a tool restricted and adjusted to a specific participation procedure,
has been analyzed
• More open tools such as online social networks (e.g., Twitter) should be considered
• Proposals and discussions motivated by political and ideological cleavages that traditionally
confront Spanish society have been observed (ideological positioning on the left-right scale,
religious versus secular values, traditional versus progressive, etc.)
• Tools from other countries should be analyzed to obtain more generalizable conclusions
• Possible biases (e.g., digital divide, political program) among the users of Decide
Madrid, and similar tools, have been omitted
Clustering and visualization
• Citizen collaboration through current digital participation platforms can entail the
generation of large amounts of complex content, which may hide relevant citizens’
concerns, requests and initiatives, diluted in isolated individual proposals
• We present an interactive data mining tool for citizen participation data
visualization and analysis
• Applying natural language processing, text similarity, and graph clustering techniques
• Grouping proposals with common objectives
• Identifying trends and recurrent topics of interest
• Filtering and presenting information according to several criteria
• The tool is flexible, able to process different sources of data, and lightweight as it
uses simple data structures and dynamic HTML-based visualization and interaction
Clustering and visualization
• The tool is built upon the Tableau data visualization software
https://www.tableau.com/resource/data-visualization
• Lightweight
• Easy to configure
• Several visualization functionalities
- Bar charts
- Heat maps
- Time series graphs
Clustering and visualization
• Distribution of proposals,
categories and topics,
according to:
• Time (year, month) and
location (district)
• Support, discussion and
controversy levels
• Diverse temporal and
geographical analysis
• Better and easier extraction of patterns and insights when
analyzing the published citizen-generated content
Clustering and visualization
• Text processing
• Spelling correction
- Dictionary
- Levenshtein distance
• Special character removal
• Stopword removal
• Word lemmatization
• Document similarity
• Word Mover’s Distance
(WMD) similarity, which
treats text documents as
weighted point clouds of
word embeddings
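The dictionary-plus-Levenshtein correction step can be sketched as below. This is a minimal illustration, not the tool's implementation: the class name `SpellingCorrector`, the `maxDistance` cutoff and the tie-breaking rule (first closest dictionary entry wins) are assumptions.

```java
import java.util.List;

class SpellingCorrector {
    /** Levenshtein (edit) distance between two strings, using two rolling rows. */
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the closest dictionary word within maxDistance, or the word unchanged. */
    public static String correct(String word, List<String> dictionary, int maxDistance) {
        String best = word;
        int bestDistance = maxDistance + 1;
        for (String candidate : dictionary) {
            int d = levenshtein(word, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }
}
```

For instance, with a dictionary containing "bibliotecas", the misspelled tag "bibiotecas" is one insertion away and would be corrected when `maxDistance` is at least 1.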
Clustering and visualization
• Document clustering
• Weighted graph
- Nodes: citizen proposal
documents
- Edges: document
similarity values
- Removal of edges with
“low” weights
• Louvain clustering
method
- Optimizes the
modularity of the graph,
associating nodes to
clusters until
convergence
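To make the optimization target concrete, the sketch below computes the modularity Q of a given partition of a weighted, undirected graph, which is the quantity that Louvain-style methods try to increase. It is not an implementation of the Louvain method itself; the class name `ModularitySketch` and the dense symmetric matrix representation are assumptions chosen for brevity.

```java
class ModularitySketch {
    /**
     * Modularity Q = (1 / 2m) * sum over i, j in the same cluster of
     * (w[i][j] - k_i * k_j / 2m), where k_i is the weighted degree of node i
     * and 2m is the sum of all entries of the symmetric weight matrix w.
     */
    public static double modularity(double[][] w, int[] cluster) {
        int n = w.length;
        double[] strength = new double[n];
        double twoM = 0.0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                strength[i] += w[i][j];
                twoM += w[i][j];
            }
        }
        if (twoM == 0.0) return 0.0; // graph without edges
        double q = 0.0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                if (cluster[i] == cluster[j]) {
                    q += w[i][j] - strength[i] * strength[j] / twoM;
                }
            }
        }
        return q / twoM;
    }
}
```

Edges with "low" weights can be dropped by setting the corresponding matrix entries to 0 before calling `modularity`; a partition that groups densely connected documents scores higher than one that puts every node in a single cluster.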
Clustering and visualization
• A coproduction functionality
based on the retrieval of
existing similar proposals
• A citizen who is interested in
submitting a new proposal can
first enter it into the tool and
check whether related ones already exist
Intent-based classification
• Social networks represent a prominent bidirectional communication
channel between citizens and government
• Citizens are…
- content consumers who receive the government announcements, to which they
react and freely respond according to personal ideology, interests and needs, and
- content providers who generate a wide range of messages targeted to government
and political stakeholders
• The amount of social media content generated daily by citizens is huge and
diverse, and its processing by human actors may be too costly and
overwhelming
Intent-based classification
• There is an increasing interest and need to use computer-assisted
solutions capable of automatically gathering, processing and analyzing
the underlying information in the citizens’ messages (a.k.a. posts) on social
networks
• The research literature reports extensive work on:
• analyzing social phenomena produced through the online network structures
(e.g., information spreading, fake news, and opinion polarity), and mainly originated
by particular events (e.g., natural disasters, elections, and trending news)
• extracting the most popular topics addressed by citizens’ posts in social networks,
as well as the general dynamics (i.e., temporal evolution) and opinions on such topics
Intent-based classification
• Unlike previous work, we go beyond the extraction of topics by
attempting to automatically classify citizens’ posts (tweets) according
to their intents or purposes
1. Complaint: stating something that is unsatisfactory or unacceptable
- “@MADRID after 1 week of calling, the city is yet not clean, and the rats are taking over!!
http://t.co/IiIDuaPFG9”
2. Announcement: making a public statement about a fact, occurrence or event
- “The date, place and schedule of the Festival activities in La Latina have already been
confirmed http://t.co/U0tRwKAC @madrid @madridiario”
3. News item: objectively informing about current events
- “#oladecalor #aemet @Madrid has suffered its warmest night within the latest 100 years
http://t.co/ZSjeqK6m”
4. Personal fact: publicizing self issues and experiences
- “I also support the candidature from @Madrid2020ES @MADRID #aporella”
5. Opinion: expressing subjective opinions about the city, its events, activities, etc.
- “The activity of #emprendeenmadrid is amazing. Congratulations @MADRID and greetings
from an entrepreneur”
6. Request: explicitly asking for something specific
- “Very nice but impossible to ride a bike at normal speed #MadridRio. Please @MADRID
create a bike lane with cyclist priority”
7. Notification: reporting or giving notice of urban, citizenship- or government-related
issues, so that government can quickly act on them and help other citizens
- “@MADRID can you fix this gap in San Bernardino street 8-10 before someone gets hurt?
http://lockerz.com/s/117566458”
8. Question: explicitly asking for information
- “@MADRID could you please give me the telephone number of the press office of the
Madrid city hall”
9. Proposal: suggesting an initiative or project
- “There is a collection of used oil in the center of Alicante. It would be fantastic to have
something similar @MADRID”
Intent-based classification
• To automatically categorize a tweet into one of the previous intents
(labels), it is first transformed into a vector of features
• We consider 37 domain- and language-independent features to
describe the content of a tweet
• Lexical features
- number of characters
- number of words
- number of exclamation marks
- number of question marks
- existence of a positive emoticon
- existence of a negative emoticon
- existence of a vowel (or “y”) consecutively repeated 3 or more times in a word
• Grammatical features
- number of nouns
- number of proper nouns
- number of adjectives
- number of verbs
- number of adverbs
- number of personal/possessive pronouns
- number of time references (entities)
- number of money-related references
• Social network-based features
- number of followers
- number of friends (a.k.a. followees)
- number of posts
- number of active days in Twitter
- number of hashtags (#)
- number of user mentions (@)
- number of hyperlinks
- number of multimedia
- maximum hashtag length
- existence of an explicit retweet request (i.e., "RT" abbreviation)
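A few of the listed features that can be read directly from the tweet text might be extracted as in the following sketch. The regular expressions, the class name `TweetFeaturesSketch` and the order of the vector are assumptions for illustration; grammatical features (which need a part-of-speech tagger), emoticon detection and account-level features such as the number of followers are not covered.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TweetFeaturesSketch {
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");
    private static final Pattern MENTION = Pattern.compile("@\\w+");
    private static final Pattern LINK = Pattern.compile("https?://\\S+");
    // A vowel (or "y") followed by at least two more copies of itself
    private static final Pattern REPEATED_VOWEL =
            Pattern.compile("([aeiouy])\\1{2,}", Pattern.CASE_INSENSITIVE);
    private static final Pattern RETWEET = Pattern.compile("\\bRT\\b");

    private static int count(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    /** Returns a fixed-order numeric feature vector for one tweet. */
    public static double[] extract(String tweet) {
        String trimmed = tweet.trim();
        int words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        int maxHashtagLength = 0;
        Matcher m = HASHTAG.matcher(tweet);
        while (m.find()) {
            maxHashtagLength = Math.max(maxHashtagLength, m.group().length() - 1); // skip '#'
        }
        return new double[] {
            tweet.length(),                                 // 0: number of characters
            words,                                          // 1: number of words
            tweet.chars().filter(c -> c == '!').count(),    // 2: exclamation marks
            tweet.chars().filter(c -> c == '?').count(),    // 3: question marks
            REPEATED_VOWEL.matcher(tweet).find() ? 1 : 0,   // 4: repeated vowel present
            count(HASHTAG, tweet),                          // 5: hashtags
            count(MENTION, tweet),                          // 6: user mentions
            count(LINK, tweet),                             // 7: hyperlinks
            maxHashtagLength,                               // 8: maximum hashtag length
            RETWEET.matcher(tweet).find() ? 1 : 0           // 9: explicit "RT" request
        };
    }
}
```

Because none of these patterns depends on vocabulary, the same extractor applies to tweets in any language, in line with the language-independent design of the features.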
Intent-based classification
• To validate the proposed approach, we evaluated several machine learning
algorithms on a labeled dataset:
• K-Nearest Neighbors (KNN)
• Logistic Regression (LR)
• Quadratic Discriminant Analysis (QDA)
• Decision Tree (DT)
- run alone, and in combination with
feature selection (RFECV DT) and
tree pruning (AP DT)
to avoid overfitting
• Gaussian Process (GP)
• Support Vector Machine (SVM)
• Bagging Ensemble (BE)
Intent-based classification
• Dataset: a random sample of 666 tweets mentioning the @Madrid account, each of
them manually labeled by 3 researchers (almost perfect agreement: Fleiss' kappa = 0.98)
• 9 binary classification problems: one-against-all (i.e., training a single classifier
per label)
• Classification metrics
• acc (accuracy)
• acc+ (minority class acc)
• acc– (majority class acc)
(very) unbalanced classification problems
(misleading) high classification accuracies
reasonably good accuracy balance for the two labels
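The reason overall accuracy can mislead on unbalanced one-against-all problems is easy to see in code. The sketch below assumes that acc+ and acc– are the fractions of correctly classified examples within the minority (target intent) and majority classes respectively; the class name `ClassAccuracySketch` and the boolean encoding are hypothetical.

```java
class ClassAccuracySketch {
    /**
     * Returns {acc, acc+, acc-} for a binary problem, where true marks the
     * minority class (the target intent) and false marks all other tweets.
     */
    public static double[] accuracies(boolean[] actual, boolean[] predicted) {
        int truePos = 0, trueNeg = 0, pos = 0, neg = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i]) {
                pos++;
                if (predicted[i]) truePos++;
            } else {
                neg++;
                if (!predicted[i]) trueNeg++;
            }
        }
        double acc = (truePos + trueNeg) / (double) (pos + neg);
        double accPos = pos == 0 ? Double.NaN : truePos / (double) pos;
        double accNeg = neg == 0 ? Double.NaN : trueNeg / (double) neg;
        return new double[] { acc, accPos, accNeg };
    }
}
```

With 1 positive tweet out of 10, a classifier that always answers "not this intent" reaches acc = 0.9 and acc– = 1.0 but acc+ = 0.0, which is why the three values are reported together.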
Intent-based classification
• Most discriminating words and features for each of the considered intents
COM = complaint
ANN = announcement
REQ = request
NEW = news item
FAC = personal fact
OPI = personal opinion
Intent-based classification
• The proposed intent-based classification represents a task prior to the
extraction of topics and opinions, and may help filter and prioritize citizens’
messages, and further automate processes for more efficient and effective
decision and policy making
• There is room for improvement:
• More sophisticated NLP techniques, such as language models and word embeddings,
could be used to exploit the semantics of words and word sequences
- e.g., “opinion is” and “really think that” could be identified as informative bigram and
trigram of the personal opinion intent
• Features from other sources of information, such as the user who creates a post and the
user(s) who are mentioned in a post
- e.g., by considering their types: citizens, neighborhood associations, organizations, or
political actors
Contents
1. E-participation
2. Decide Madrid
3. Data acquisition and processing
4. Data mining applications
5. Information retrieval applications
• Argument mining in a nutshell
• Argument-based document search
• Argument-based conversational information access
• Neural network-based argument extraction
6. Recommendation applications
7. Conclusions
Argument mining in a nutshell
• Tasks
• Detection of argument text fragments
• Identification of argument components
• Extraction of argument relations
• Algorithmic foundations
• Natural Language Processing (NLP)
• Machine/deep learning
• Linguistic features
• Sentence-level (e.g., sentence length, argument linkers, etc.),
grammatical (e.g., number of nouns, adjectives, modal verbs, etc.), syntactic (e.g., patterns,
constituency tree depth, etc.), semantic (e.g., named entities, word embeddings, etc.)
Source: ACL’16 tutorial “NLP Approaches to Computational Argumentation”
Argument mining in a nutshell
• Tasks
1. Detection of arguments
2. Identification of argument components and structures
3. Extraction of argument relations
Source: ACL’16 tutorial
“NLP Approaches to Computational Argumentation”
Argument mining in a nutshell
• Example: Categorization of argumentative components via machine learning
• Classes
- “Major claim”, “Claim”, “Premise”
• Features
- Lexical: lemmatized unigrams, including previous tokens
- Syntactic: number of nested phrases, depth of the syntactic tree, POS distribution,
tense of the principal verb, modal verbs
- Structural: first or last sentence of a paragraph, presence in introduction or conclusion,
relative position, number of tokens, etc.
- Indicators: connectors such as “because”, “however”, “as a result”, etc.
- Contextual: contextualized connectors, number of words shared by introduction and conclusion
- Probabilistic: conditional probability P(category | previous tokens)
- Discourse: discourse relations based on the Penn Discourse Treebank
- Embeddings: vectors with 300 dimensions trained with the Google News corpus
Argument mining in a nutshell
• Example: Categorization of argumentative components via machine learning
• Using all features results in the best F1 values
• The classification of claims is the most difficult task
• The structural features are the most valuable
• The discourse features are informative for the identification of claims
• The word embeddings achieve results similar to lexical features
Source: ACL’16 tutorial
“NLP Approaches to Computational Argumentation”
Argument mining in a nutshell
• Corpus
• AIFdb: repository of databases, following the
Argument Interchange Format, AIF
- AraucariaDB: news editorials, parliamentary records,
court summaries and panel discussions
- MM2012: transcriptions of BBC Radio 4
- …
• The Internet Argument Corpus, IAC: set of political
debates in internet forums
• The ECHR Corpus: collection of documents extracted
from legal texts of the European Court of Human Rights
• The Argument Annotated Essays Corpus, AAEC:
collection of persuasive essays
• …
Argument mining in a nutshell
• Tools
• Collaborative editors of argumentative graphs
- Agora, http://agora.gatech.edu
- Argunet, http://www.argunet.org
- DebateGraph, http://debategraph.org
- Rationale Online, https://www.rationaleonline.com
• Argumentative annotation platforms
- Araucaria, http://araucaria.arg.tech
- OVA, http://ova.arg-tech.org
Argument mining in a nutshell
• Events
• International Conference on Computational Models of Argument (COMMA),
https://comma2020.dmi.unipg.it
• Workshop on Argument Mining (ArgMining), https://2021.argmining.org
• Workshop on Computational Models of Natural Argument (CMNA),
http://cmna.csc.liv.ac.uk/CMNA20
• Summer School on Argumentation (SSA), https://ssa2020.dmi.unipg.it
• ACL’19 tutorial “Advances in Argument Mining”, http://arg.tech/~chris/acl2019tut/index.html
• ACL’16 tutorial “NLP Approaches to Computational Argumentation”, http://acl2016tutorial.arg.tech
• Online Seminars on Computational Models of Argument,
https://sites.google.com/view/argumentation-seminar
• Dagstuhl’16 seminar “Natural Language Argumentation: Mining, Processing, and Reasoning over
Textual Arguments”, https://www.dagstuhl.de/16161
• BiCi’14 seminar “Frontiers and Connections between Argumentation Theory and Natural
Language Processing”, http://www-sop.inria.fr/members/Serena.Villata/BiCi2014/frontiersARG-NLP.html
Contents
1. E-participation
2. Decide Madrid
3. Data acquisition and processing
4. Data mining applications
5. Information retrieval applications
• Argument mining in a nutshell
• Argument-based document search
• Argument-based conversational information access
• Neural network-based argument extraction
6. Recommendation applications
7. Conclusions
Argument-based document search
• Proposed framework
Argument-based document search
• Argument model
• Premise → Claim → Major claim
• Types and subtypes of argument relations
• Cause: linking an argument that reflects the reason or condition for another argument
• Clarification: introducing a conclusion, exemplification, restatement or summary of an argument
• Consequence: evidencing an explanation, goal or result of a previous argument
• Contrast: attacking arguments, distinguishing between giving alternatives, doing comparisons,
making concessions, and providing oppositions
• Elaboration: introducing an argument that provides details about another one, entailing addition,
precision or similarity issues about the target argument
• Argument mining methods
• Syntactic pattern matching
• Feature-based machine learning classification
• Embedding-based deep neural network
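The argument model above can be represented with a few small types. The following is a sketch under stated assumptions: the class names, the field names and the exact subtype labels are illustrative choices derived from the wording of the slide, not the authors' actual schema, and the subtype list may be incomplete.

```python
from dataclasses import dataclass
from enum import Enum


class ComponentType(Enum):
    PREMISE = "premise"
    CLAIM = "claim"
    MAJOR_CLAIM = "major claim"


class RelationType(Enum):
    CAUSE = "cause"
    CLARIFICATION = "clarification"
    CONSEQUENCE = "consequence"
    CONTRAST = "contrast"
    ELABORATION = "elaboration"


# Subtype labels paraphrased from the slide text (assumed spelling).
SUBTYPES = {
    RelationType.CAUSE: ("reason", "condition"),
    RelationType.CLARIFICATION: ("conclusion", "exemplification", "restatement", "summary"),
    RelationType.CONSEQUENCE: ("explanation", "goal", "result"),
    RelationType.CONTRAST: ("alternative", "comparison", "concession", "opposition"),
    RelationType.ELABORATION: ("addition", "precision", "similarity"),
}


@dataclass
class Component:
    text: str
    type: ComponentType


@dataclass
class Relation:
    source: Component  # e.g. a premise
    target: Component  # e.g. the claim it relates to
    type: RelationType
    subtype: str

    def __post_init__(self):
        # Reject subtypes that do not belong to the given relation type.
        if self.subtype not in SUBTYPES[self.type]:
            raise ValueError(f"{self.subtype!r} is not a subtype of {self.type.value}")
```

Validating the subtype at construction time keeps the two-level taxonomy consistent across whichever extraction method produces the relations.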
Argument-based document search
• Heuristic algorithm
• For each sentence of an
input text: looking for certain
syntactic patterns that
introduce argumentative
expressions
• 1,744 arguments extracted
from 5,633 comments
• Contrast: 54.1%
• Consequence: 12.1%
• Cause: 3.6%
• Elaboration: 0.1%
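A heuristic of this kind can be sketched with regular expressions. The example below is a simplified illustration, not the original algorithm: the linker lexicon is a hypothetical English one (the Decide Madrid comments are in Spanish), and treating the text before the linker as the claim and the text after it as the premise is a simplifying assumption.

```python
import re

# Hypothetical English linkers; the original system worked on Spanish text
# and its actual linker lexicon is not reproduced here.
LINKERS = {
    "contrast": ("but", "however", "although"),
    "consequence": ("therefore", "so that", "as a result"),
    "cause": ("because", "since"),
    "elaboration": ("in addition", "moreover"),
}


def extract_arguments(comment):
    """Return (relation, claim, linker, premise) tuples found in a comment."""
    found = []
    # Naive sentence splitting on end-of-sentence punctuation.
    for sentence in re.split(r"(?<=[.!?])\s+", comment.strip()):
        for relation, linkers in LINKERS.items():
            for linker in linkers:
                # The linker must sit between two non-empty text segments.
                pattern = r"^(.+?)[,;]?\s+" + re.escape(linker) + r"\b\s+(.+)$"
                match = re.search(pattern, sentence, flags=re.IGNORECASE)
                if match:
                    claim = match.group(1).strip()
                    premise = match.group(2).strip().rstrip(".!?")
                    found.append((relation, claim, linker, premise))
    return found
```

For instance, `extract_arguments("Pets should travel on the metro, but only outside rush hours.")` yields one contrast argument whose linker is "but".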
Argument-based document search
• Argument linkers
Argument-based document search
• Information retrieval
• Text processing
• NLP for linguistic feature extraction
• Indexing based on keywords, topics, categories, entities and other metadata
• Search engine based on the vector space model
• Argument-based reranking according to controversy metrics
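The last two steps can be illustrated with a small vector space model followed by a controversy-aware reranking. This is a sketch under assumptions: the `controversy` formula (balance between supporting and attacking arguments), the linear mixing weight `alpha` and all function names are hypothetical, since the slides do not specify the controversy metrics used.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) for a list of tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def controversy(n_support, n_attack):
    """Hypothetical score in [0, 1]: 1 when support and attack are balanced."""
    total = n_support + n_attack
    return 0.0 if total == 0 else 1.0 - abs(n_support - n_attack) / total


def search(query, docs, arg_counts, alpha=0.5):
    """Rank matching documents by a mix of cosine relevance and controversy."""
    vectors, idf = tf_idf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}
    scored = []
    for i, vec in enumerate(vectors):
        relevance = cosine(q, vec)
        if relevance > 0:
            score = (1 - alpha) * relevance + alpha * controversy(*arg_counts[i])
            scored.append((score, i))
    return [i for score, i in sorted(scored, reverse=True)]
```

Here `arg_counts[i]` is a `(supporting, attacking)` pair for document `i`; with `alpha = 0` the ranking falls back to plain vector space retrieval.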
Argument-based document search
• Outcomes – arguments
• JSON object created for an argument that evidences a contrast premise on a proposal in
favor of using Madrid public transport with pets
Argument-based document search
• Outcomes – documents, topics and arguments
Argument-based document search
• Dataset
• 80 proposals (covering 10 categories and having high controversy) and 5,633 comments
• Experiment setting
• 3 evaluators
• 3 queries
• Topical relevance – accuracy of an argument with respect to the major claim of the
discussion
• 14.6% of the arguments were labeled as very relevant
• 39.9% as relevant
• 36.9% as not relevant
• 8.6% as incorrect
• Rhetoric quality – effectiveness of an argument in persuading an audience
• 17.1% of the arguments were of high quality
• 40.6% of sufficient quality
• 42.3% of low quality
Argument-based document search
• We have presented a general and flexible argument-based search framework
• Preliminary implementation and evaluation on a dataset with citizen proposals and discussions generated in an
online participatory platform
• Its current implementation includes:
• Various argument extraction methods (heuristic pattern matching, feature-based machine learning, embedding-based deep learning)
• A document retrieval engine built upon vector space-based models
• A reranking strategy that exploits certain controversy metrics
• We envision several open research lines:
• Development of ad hoc argument-based document retrieval methods (so far, we have used a reranking technique)
• Consideration of alternative controversy notions
• Increase in the size and quality of the generated corpus
• Evaluation on other datasets and domains
• Measurement of additional argument quality metrics, e.g., based on diversity, fairness, persuasiveness, etc.
Argument-based conversational information access
• E-participation, understood as
computer-assisted support for citizen
participation, has given rise to novel
consultation and deliberation processes
• Most current e-participation platforms are
based on web forums
• Citizens make proposals and provide comments
and opinions, forming
large conversation threads
• Recent attention has shifted to social media,
especially social networks
(e.g., Facebook and Twitter) and
instant messaging tools
(e.g., Telegram and WhatsApp)
Argument-based conversational information access
• Conventional web forums promote social interaction
• Pros
- Easy and fast content generation (through free text
posts)
- Smooth, large-scale interaction (via comment threads)
• Cons
- No or very limited functionalities for content
organization, filtering and analysis
- Dispersed and redundant content, since it is structured
by time
- Challenging processing of discussions
• Argument-driven tools promote the production and
reuse of collective knowledge
Argument-based conversational information access
• Our work on e-participation…
• addresses 2 promising research lines
- The exploitation of argument mining techniques to automatically
extract and present argumentative information from
citizen-generated content
- The use of conversational agents or chatbots as citizen-to-government
communication channels in instant messaging applications
• targets a final goal
- Helping to identify and understand city problems and
citizens’ concerns, and consequently to form well-formed opinions
for making better decisions in participatory processes
Argument-based conversational information access
• The ‘Decide Madrid’ e-participation platform
• A web system designed to allow Madrid residents to
make, debate and vote on proposals for the city
• Available data from a citizen proposal
• Title
• Author, date
• Summary, description
• Freely-chosen tags
• User comment threads
• Heterogeneous topics and discussions
• urbanism, transport, environment, health care,
education, social rights, culture, economy,
employment, politics, security, housing, family, old age,
religion, animals, etc.
https://decide.madrid.es
Argument-based conversational information access
• Argument model
• Premise → Claim → Major claim
• Types and subtypes of argument relations
• Cause: linking an argument that reflects the reason or condition for another argument
• Clarification: introducing a conclusion, exemplification, restatement or summary of an argument
• Consequence: evidencing an explanation, goal or result of a previous argument
• Contrast: attacking arguments, distinguishing between giving alternatives,
doing comparisons, making concessions, and providing oppositions
• Elaboration: introducing an argument that provides details about another one,
entailing addition, precision or similarity issues about the target argument
Argument-based document search
• Example of an extracted argument tree
C = claim
L = linker
P = premise
Argument-based conversational information access
• Through a natural language
conversation with the chatbot,
the user can:
1. explore citizen proposals and
comments, organized by
categories, topics and districts
2. access categorized citizens’
arguments given
in the debates around a
proposal
3. provide feedback and
votes for proposals
Argument-based conversational information access
• The chatbot is built upon the Google DialogFlow framework, which links external web services
with a variety of instant messaging and social networking services, e.g., Google Assistant,
Facebook Messenger, WhatsApp, Telegram and Skype
Argument-based conversational information access
• The chatbot handles several conversation intents, each of them with triggering sentence
patterns and associated functionalities
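A web service behind such an agent typically receives the matched intent and its parameters and returns the text to display. The sketch below handles one request as a plain dictionary. The keys `queryResult`, `intent.displayName`, `parameters` and `fulfillmentText` follow the Dialogflow ES webhook format; the intent name `list_proposals`, the `category` parameter and the in-memory data are hypothetical, not taken from the actual chatbot.

```python
# Hypothetical in-memory data; a real service would query the proposal index.
PROPOSALS = {
    "transport": ["Allow pets on public transport"],
}


def handle_webhook(request_json):
    """Map a Dialogflow ES webhook request (dict) to a fulfillment response."""
    query = request_json.get("queryResult", {})
    intent = query.get("intent", {}).get("displayName", "")
    params = query.get("parameters", {})

    if intent == "list_proposals":  # hypothetical intent name
        category = params.get("category", "")
        titles = PROPOSALS.get(category, [])
        text = "; ".join(titles) if titles else "No proposals found for that category."
    else:
        text = "Sorry, I did not understand that."
    return {"fulfillmentText": text}
```

In a deployment this function would be wrapped in an HTTP endpoint registered as the agent's fulfillment URL; each additional intent would add one branch or one entry in a dispatch table.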
Argument-based conversational information access
User study: empirical evaluation of the chatbot in terms of:
1. The feasibility of exploring e-participation content via a conversational interface
2. The potential benefits of argument-driven information in e-participation
• Uncontrolled, realistic scenario
• Without external supervision, participants freely tested the chatbot via Telegram during a period of one
week, using their own Telegram accounts and mobile devices
• 32 participants → 2 groups
• Control group: the chatbot’s argument-driven browsing functionalities were disabled
• Experimental group: the chatbot’s argument-driven browsing functionalities were enabled
Argument-based conversational information access
Study questionnaire
• 33 items
• 10 evaluation criteria
• Citizen participation
• Decision making
• Public values
Argument-based conversational information access
32 participants
• Gender: 22 male, 10 female
• Ages: 18-29 years old (12), 30-39 years old (9), 40-49 years old (5), 50-59 years old (4), more
than 59 years old (2)
• Education levels: secondary education (3), vocational education (1), Bachelor’s degree (20),
Master’s degree (6), Doctoral degree (2)
• Those with higher education had studied Sciences (3), Social Sciences (10),
Arts and Humanities (4), and Engineering (11) degrees
• Diverse levels of knowledge/expertise on chatbots: no knowledge or expertise (5),
no expertise (5), low expertise (20), medium expertise (2)
• Diverse levels of knowledge on citizen participation: none (7), low (16), medium (9)
Argument-based conversational information access
• Objective metrics
• Subjective questionnaires
Argument-based conversational information access
• More user activity
• No significant difference in the avg. number of sessions per user (between groups)
• Longer sessions in the experimental group
- Increase of 45.6% in the avg. session duration (from 16.0 to 23.3 minutes)
- Increase of 14.3% (from 56.8 to 64.9) in the avg. number of actions per user
• Higher user engagement and persuasiveness
• Increase of 23.5% (from 1.7 to 2.1) in the avg. number of feedback actions per user
• Meaningful exploration of arguments (avg. 7.4 actions per user)
• Better user opinions
• About the chatbot: highly efficient, quite effective, moderately easy to use
• About the argumentative information: higher perception of transparency and fairness
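The relative increases quoted above follow directly from the control and experimental group averages. A minimal check, using only the figures on the slide (the function and dictionary names are illustrative):

```python
def percent_increase(before, after):
    """Relative change from before to after, as a percentage."""
    return (after - before) / before * 100.0


# (control group, experimental group) averages reported on the slide
reported = {
    "session duration (min)": (16.0, 23.3),
    "actions per user": (56.8, 64.9),
    "feedback actions per user": (1.7, 2.1),
}

increases = {
    name: round(percent_increase(control, experimental), 1)
    for name, (control, experimental) in reported.items()
}
```

Rounded to one decimal, the three values match the 45.6%, 14.3% and 23.5% stated on the slide.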
Argument-based conversational information access
• Participants’ suggestions
• A more “natural” conversation with the chatbot
• A more fluent transition between browsed proposals
• Facilities to read proposals with large descriptions
• Future research directions
• Personalized recommendation mechanisms to proactively present relevant content to the user, thus
mitigating the information overload problem
• Richer data structures, analysis and visualizations for facilitating decision making
• Functionalities oriented to citizen collaboration
• Integration of external data sources, such as open government data and news items
Neural network-based argument extraction
• Argument retrieval aims at automatically extracting structured argumentative
information existing in a text corpus
• It has been commonly modeled as a pipeline of three tasks, namely argument
segmentation, argument component classification, and argument relation recognition
• We investigate the application of transformer-based deep learning to jointly
address the above tasks as a single end-to-end sequence tagging problem
Neural network-based argument extraction
Deep neural network architecture
• 1st block: BETO Language model
• A BERT-based model trained on a Spanish corpus with Wikipedia articles, legal texts,
and TED Talks transcripts
- 12 encoder layers with a hidden size of 768 units, and 12 self-attention heads
• 2nd block: generic layers of feed-forward neural networks
• 3rd block: task-specific layers that address the following argument mining tasks
• Identification of argumentative units (BIO tagging task)
• Classification of argumentative components: premise, claim, major claim, empty
• Recognition of argumentative relations: 17 subtypes of the 2-level taxonomy
• Classification of argumentative relation intents: support, attack, empty
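To show what the first two task-specific outputs look like as sequence tagging targets, the sketch below converts annotated token spans into per-token labels. It is an assumption-laden illustration: the function name, the span format and the English example are hypothetical, and the actual training data format of the network is not described in the slides.

```python
def tag_sequences(tokens, spans):
    """Build per-token targets for two of the tagging tasks.

    spans is a list of (start, end, component) token ranges, end exclusive.
    Returns (unit_tags, component_tags): unit_tags uses B/I/O marks for the
    boundaries of argumentative units, component_tags holds the component
    class of each token, with "empty" outside any unit.
    """
    unit_tags = ["O"] * len(tokens)
    component_tags = ["empty"] * len(tokens)
    for start, end, component in spans:
        for i in range(start, end):
            unit_tags[i] = "B" if i == start else "I"
            component_tags[i] = component
    return unit_tags, component_tags


# Hypothetical English example (the original corpus is in Spanish).
tokens = "Pets are family , so they should travel".split()
spans = [(0, 3, "premise"), (5, 8, "claim")]
unit_tags, component_tags = tag_sequences(tokens, spans)
```

Because every task assigns one label per token, the same encoder output can feed all task-specific layers at once, which is what makes a joint end-to-end formulation possible.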
Neural network-based argument extraction
• Input
• Annotated sentences from citizen comments
• Deep neural network configuration
Neural network-based argument extraction
• ARGAEL: ARGument Annotation and Evaluation tooL
• Simple annotation view: the user identifies argument components and relations (and their types)
Neural network-based argument extraction
• ARGAEL: ARGument Annotation and Evaluation tooL
• Assisted annotation view: the user has access to others’ argument annotations
Neural network-based argument extraction
• ARGAEL: ARGument Annotation and Evaluation tooL
• Evaluation view: the user evaluates others’ argument annotations
Neural network-based argument extraction
• ARGAEL: ARGument Annotation and Evaluation tooL
• Argument component (AC) annotations and evaluations
• Argument relation (AR) annotations and evaluations
Neural network-based argument extraction
• ARGAEL: ARGument Annotation and Evaluation tooL
• Some results of the argument annotation process on the Decide Madrid dataset
Neural network-based argument extraction
• Some preliminary results
• Argument identification
• Argument component classification
Neural network-based argument extraction
• Some preliminary results
• Relation type classification
• Relation intent classification
Contents
1. E-participation
2. Decide Madrid
3. Data acquisition and processing
4. Data mining applications
5. Information retrieval applications
6. Recommendation applications
• Recommender systems in a nutshell
• Personalized recommendations
• Context-aware recommendations
7. Conclusions
Disclaimer: some of the materials of this subsection have been created by
Prof. Pablo Castells for his information retrieval master course at EPS-UAM.
Recommender systems in a nutshell
Is it possible to help the user to find
information without asking for it?
How to customize the process?
Recommender systems in a nutshell
• Personalized recommendations
Recommender systems in a nutshell
• Many ways to make recommendations
• Spotify: https://www.music-tomorrow.com/blog/how-spotify-recommendation-system-works-a-complete-guide-2022
• Instagram: https://ai.facebook.com/blog/powered-by-ai-instagrams-explore-recommender-system
• Netflix: https://research.netflix.com/research-area/recommendations
https://scale.com/blog/Netflix-Recommendation-Personalization-TransformX-Scale-AI-Insights
• Google Play: https://deepmind.com/blog/article/Advanced-machine-learning-helps-Play-Store-users-discover-personalised-apps
Recommender systems in a nutshell
• It is estimated that recommendations generate…
• 20% of sales on Amazon
• 60% of streaming on YouTube
• 80% of streaming on Netflix
• ∼10% of electronic commerce
• Recommendation has a large market to tap into
• It seems possible to target beyond ∼10% of engagement
• Many companies aim to exploit such potential
Recommender systems in a nutshell
• Situations with option overload
• 1994 → 0.5 million different products on sale in the USA
• 2010 → 24 million products on Amazon alone
• Recommendation = Personalized IR without explicit query
• First initiatives published in 1992 (Tapestry at Xerox PARC)
• Precedents: user models based on stereotypes (late 70s)
• Conferences: RecSys, SIGIR, ECIR, UMAP
• Confluence with other areas: Machine Learning (ICML, ECML, IJML, etc.), Data Mining
(KDD, etc.), Artificial Intelligence (IJCAI, AAAI), Human Computer Interaction (IUI)
Recommender systems in a nutshell
• Non-personalized recommendations
Recommender systems in a nutshell
• Contextualized recommendations
Recommender systems in a nutshell
• Utility of recommender systems
Jannach, D. and Adomavicius, G. 2016. Recommendations with a purpose. In Proceedings of the 10th
ACM Conference on Recommender Systems (RecSys ’16), pp. 7-10.
Recommender systems in a nutshell
• User preferences
Ratings
Reviews
Categorical
Thumbs up / down
Recommender systems in a nutshell
• Personalized recommendations: problem formulation
Recommender systems in a nutshell
• Problem formulation
• Input
- A set U of users
- A set I of items
- A sorted set R of values, e.g., R = { 1, 2, 3, 4, 5 }
- A functional relation r : U × I → R
- Typically, r(u,i) is a “rating”, and represents user u’s assessment of item i on scale R
- This input can be seen as a matrix of ratings
- Most of its values (95% and more in general) are unknown
• Goal
- Predicting the values r(u,x) of items x for a user u who has not evaluated such items
- The unknown values r(u,x) are considered for recommending x to u
- In general, generating a sorted list of items that can be of interest for the user
- This goal is commonly referred to as generating the “top n” recommendations
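The formulation can be made concrete with a sparse rating matrix stored as a dictionary and a function that ranks the items a user has not rated. This is a sketch: the item-mean predictor is a deliberately naive baseline chosen for illustration (an assumption, not a method from the slides), and all names and sample ratings are hypothetical.

```python
def item_mean_predictor(ratings):
    """Baseline predictor: the mean known rating of each item."""
    sums, counts = {}, {}
    for (_, item), value in ratings.items():
        sums[item] = sums.get(item, 0) + value
        counts[item] = counts.get(item, 0) + 1
    return lambda user, item: sums[item] / counts[item] if item in counts else 0.0


def top_n(predict, ratings, user, items, n=3):
    """Rank the items the user has not rated by predicted rating r(u, x)."""
    unseen = [i for i in items if (user, i) not in ratings]
    ranked = sorted(unseen, key=lambda i: predict(user, i), reverse=True)
    return ranked[:n]


# Sparse rating matrix: only known values of r(u, i) are stored.
ratings = {
    ("ana", "a"): 5, ("ana", "b"): 3,
    ("luis", "b"): 5, ("luis", "c"): 2,
    ("eva", "c"): 4, ("eva", "d"): 1,
}
items = ["a", "b", "c", "d"]
predict = item_mean_predictor(ratings)
recommended = top_n(predict, ratings, "ana", items, n=2)
```

Any recommendation strategy can be plugged in by replacing `predict` with a different estimate of the unknown values.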
Recommender systems in a nutshell
• Problem formulation
• Implicit user feedback (preferences)
- No need for asking the user
- r : U × I → {0, 1}, binary, e.g., “u buys i”
- It can be treated as the particular case R = {0, 1}
- r : U × I → R, measuring how often user u accesses item i, e.g., listening to music
- Binarized to 1 if the frequency is > 0
- Or converted with a function frequency → rating (e.g., percentiles)
- r : U × I → P(T), for users u annotating (tagging) items i, where T is a set of tags
- It can be treated as “1 tag, 1 vote”, but more elaborate techniques can be
applied to graphs of tags, items, users…
- Timestamps
- Frequency data: r(u,i) is a set of timestamps
- Rating data: r(u,i) is a [rating, timestamp] pair
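The two treatments of frequency data mentioned above, binarization and conversion to ratings by percentiles, can be sketched as follows. The per-user percentile mapping shown here is one possible conversion function chosen for illustration; the function names and the sample data are assumptions.

```python
def binarize(freqs):
    """Map access frequencies to {0, 1}: 1 if the frequency is > 0."""
    return {key: 1 if count > 0 else 0 for key, count in freqs.items()}


def percentile_ratings(freqs, user, scale=(1, 2, 3, 4, 5)):
    """Map one user's positive frequencies to ratings by percentile rank."""
    counts = {i: c for (u, i), c in freqs.items() if u == user and c > 0}
    values = sorted(counts.values())
    ratings = {}
    for item, count in counts.items():
        # Number of this user's frequencies that are <= the item's frequency.
        n_le = sum(1 for v in values if v <= count)
        # Integer ceiling of n_le * len(scale) / len(values), minus 1.
        index = (n_le * len(scale) + len(values) - 1) // len(values) - 1
        ratings[item] = scale[index]
    return ratings


# Hypothetical listening counts keyed by (user, item).
freqs = {("u", "a"): 1, ("u", "b"): 10, ("u", "c"): 0, ("v", "a"): 3}
binary = binarize(freqs)
converted = percentile_ratings(freqs, "u")
```

Computing percentiles per user compensates for the fact that some users are much more active than others, so the same raw count can mean different levels of interest.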
Recommender systems in a nutshell
• Types of recommendation strategies
• Content-based filtering (CB)
- Item features are considered: words (text case), descriptors (metadata), etc.
- Items are compared with user information collected in a preference profile
- A user profile is long-term; it can be acquired through decision trees, neural networks, etc.
• Collaborative filtering (CF)
- Items are opaque
- The profiles of other users with similar traits (tastes, behavior patterns, demographic data,
etc.) are used to recommend items
• Hybrid filtering: combining different recommendation strategies
- Combining the output of CB and CF
- Inserting CB elements into CF or vice versa
- Unified models
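The simplest of these hybrid options, combining the output of CB and CF, can be sketched as a weighted sum of normalized scores. The min-max normalization, the `weight` parameter and the function names are illustrative assumptions; both score dictionaries are assumed to be non-empty.

```python
def min_max(scores):
    """Rescale a non-empty score dict to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    return {k: (v - lo) / (hi - lo) if hi > lo else 0.0 for k, v in scores.items()}


def hybrid_scores(cb, cf, weight=0.5):
    """Weighted combination of content-based and collaborative scores."""
    cb_n, cf_n = min_max(cb), min_max(cf)
    items = set(cb_n) | set(cf_n)
    # An item missing from one recommender contributes 0 for that part.
    return {i: weight * cb_n.get(i, 0.0) + (1 - weight) * cf_n.get(i, 0.0) for i in items}
```

Normalizing first matters because the two recommenders usually produce scores on different scales (e.g. similarities in [0, 1] versus predicted ratings in [1, 5]).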
Recommender systems in a nutshell
• Content-based filtering
Recommender systems in a nutshell
• Content-based filtering
• Recommendations for each user are generated without looking at other users
• A feature space for the items is needed → items are represented as vectors in such space
- “Data” that describe the items, structured or unstructured, e.g., item metadata (author,
place, language, categories, tags), words in the text associated with items, etc.
- Binary, integer or real values
• A similarity function on the feature space, e.g.,
- Cosine similarity for numerical features
- Jaccard similarity for binary features
• Two very common methods: kNN- and centroid-based
- but many others based on classification can be used
(where users essentially play the role of class)
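The two similarity functions and the centroid-based method can be sketched in a few lines. This is a minimal illustration with sparse feature vectors stored as dictionaries; the function names and sample features are assumptions, and the user is assumed to have at least one liked item.

```python
import math


def cosine(a, b):
    """Cosine similarity for numerical feature vectors stored as dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard similarity for binary features stored as sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def centroid(vectors):
    """Average of a non-empty list of sparse vectors: the user profile."""
    total = {}
    for vec in vectors:
        for f, w in vec.items():
            total[f] = total.get(f, 0.0) + w
    return {f: w / len(vectors) for f, w in total.items()}


def recommend_by_centroid(liked, candidates, n=3):
    """Rank candidate items by similarity to the centroid of liked items."""
    profile = centroid(liked)
    return sorted(candidates, key=lambda i: cosine(profile, candidates[i]), reverse=True)[:n]
```

For example, a user who liked a proposal tagged with "transport" and "pets" would see other transport proposals ranked above culture proposals.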
Recommender systems in a nutshell
• Content-based filtering: kNN-based
• Adaptation of the kNN classification algorithm
- In classification, r(u,i) would be binary
- Ranking of “instances” (items) for each “class” (user), rather than the opposite
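One common way to write this adaptation scores a candidate item by the ratings the user gave to its k most similar rated items, weighted by similarity. The sketch below follows that formulation as an assumption; the slides do not give the exact scoring formula, and the names and sample data are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse feature vectors (dicts)."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def knn_score(candidate, profile, features, k=2):
    """Similarity-weighted sum of the ratings of the k nearest rated items."""
    neighbours = sorted(
        ((cosine(features[candidate], features[j]), rating) for j, rating in profile.items()),
        reverse=True,
    )[:k]
    return sum(sim * rating for sim, rating in neighbours)


def recommend(profile, features, k=2, n=3):
    """Rank the items absent from the user's profile by their kNN score."""
    candidates = [i for i in features if i not in profile]
    return sorted(candidates, key=lambda i: knn_score(i, profile, features, k), reverse=True)[:n]
```

Here `profile` maps the items a user has rated to their ratings and `features` maps every item to its feature vector, so the ranking is produced per user without consulting other users' data.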