Data science in practice: Case studies in e-participation

Universidad del Bío-Bío, Chile
Facultad de Ciencias Empresariales
Iván Cantador, ivan.cantador@uam.es
January 13, 2023
About me
• Iván Cantador
• Associate Professor at the Computer Science and Engineering Department
of Universidad Autónoma de Madrid, Spain
http://www.eps.uam.es/~cantador
• Research interests
- Recommender systems
- Information retrieval
- Machine learning
- Natural language processing
- Semantic technologies
- E-government

Contents
1. E-participation
2. Decide Madrid
3. Data acquisition and processing
4. Data mining applications
5. Information retrieval applications
6. Recommendation applications
7. Conclusions

Contents
1. E-participation
• Open government
• Citizen participation
• Digital platforms for citizen participation
2. Decide Madrid
7. Conclusions

Open government
• Open Government (Oszlak, 2013) – A public management paradigm that arises in
a context characterized by:
• Citizen disaffection caused by the numerous crises that call into question
the capacity of the Public Administration to deal with them
• The rise of ubiquitous technologies, which have transformed communications
and interactions between individuals, and have promoted the emergence of open,
participatory and collaborative practices
• The opening of the government, among other institutions, to the citizens, aiming to end
the existing disaffection

Open government
• Goals of the open government model (Ramírez-Alujas, 2014):
• Increasing transparency (accountability) and access to
government information through Open Data
- Open data should give citizens access to information
and should promote innovation and economic development in the public sector
• Facilitating collaboration between different actors, particularly between public
administrations, civil society, and the private sector, in order to co-design and generate
public value
• Promoting citizen participation in the design and implementation of public policies,
i.e., in decision and policy making

Open government
• Background – Memorandum on Transparency and Open Government, USA.
Barack Obama’s Administration, 2009
• Transparency: providing information about the government activity, its performance, etc.
This encourages and promotes accountability and social control.
• Participation: promoting the right of citizens to actively participate in policy making.
• Collaboration: involving citizens and other actors in scenarios of cooperation and
coordinated work.
• Technology: using technology as an instrument to promote openness in government,
facing the challenges of the new millennium.

Open government
• The Open Government Partnership emerged in 2011 in order to promote open
government in different administrations
• It seeks specific commitments from governments to promote transparency,
empower citizens, fight corruption, and take advantage of new technologies to
strengthen governance
- Founded by 8 countries: Brazil, Mexico, Indonesia, Philippines, Norway, USA, South Africa, UK
- Composed of 70 member states and numerous government organizations
• Principal commitments:
- Improvement of public services
- Increased public integrity
- Effective management of public resources
- Safer communities
- Increased corporate responsibility
https://www.opengovpartnership.org

Citizen participation
• Citizen participation is a process that gives private
individuals an opportunity to influence public decisions,
and has been a component of democratic decision making
• A community-based process in which citizens may organize themselves
and their goals, and may work together through non-governmental organizations
to influence public policies and plans
• Benefits
• Governance: reducing conflicts, strengthening democratic legitimacy, encouraging active
citizenship → government transparency and accountability, and trust between citizens and
political institutions
• Increasing the quality of public decisions and services
• Learning and training to build stronger societies
• Promoting social cohesion, mutual understanding and social justice

• Ladder of citizen participation (Arnstein, 1969)
• 8 levels in 3 groups
- No participation
- Symbolic participation
- ‘Real’ participation
• Simplified by the OECD model into 3 levels

• Barriers to citizen participation
• Incompatibilities
- Political, legal, cultural, socioeconomic, organizational
• Intrinsic problems
- Complex, expensive, under-representative, non-plural, poorly informed, conflictive,
non-deliberative, non-scalable, etc.
• Extrinsic problems
- Arbitrary and manipulable
- Inefficient and non-self-sustaining
- Irrelevant issues and lack of effect
- Citizen saturation
- Monopoly of participation, etc.

• Tools for citizen participation
• Non-ICT-based
- Questionnaires and surveys
- Seminars, talks, and meetings
- Discussion and work groups
- Cultural, artistic and leisure events
• ICT-based
- E-mail, RSS, SMS, multimedia sharing
- Social media, web portals and e-platforms
- Mobile apps
- Open data, IoT (crowdsensing)
- Augmented/virtual reality

• Participedia.net
• Anyone can join the Participedia community
and help crowdsource, catalogue, and
compare participatory political processes
around the world
• Cases (2259)
• Methods (360)
• Organizations (841)
• Teaching resources

Digital platforms for citizen participation
• With the advent of social media and mobile computing, nowadays there is a plethora
of digital citizen participation channels
• general-purpose online social networks
• ad hoc e-consultation, e-voting and e-participation platforms
• The huge, ever-increasing volume of citizen-generated content leads to an information
overload problem for both citizens and government stakeholders in decision and
policy making tasks
• Users may feel overwhelmed by the large amount of data, whose exploration and
understanding can be challenging and frustrating
• Citizens may feel thwarted if their proposals do not reach sufficient visibility and impact

• E-participation refers to ICT-supported citizen
participation in governance processes
• administration
• service delivery
• decision making
• policy making
• It aims to improve the relations among stakeholders
in civil society (e.g., local government, citizens, firms),
putting the citizens at the center of the processes
• It has given rise to novel consultation and deliberation
initiatives

• E-participation tools by type of engagement and role of ICT/level of participation
Aichholzer, G., & Allhutter, D. (2011).
Online forms of political participation and their
impact on democracy. Institute of Technology
Assessment (ITA).

• Most current e-participation
platforms are based on web forums
• Citizens make proposals and provide
comments and opinions, forming
large conversation threads
Figure: example of a web forum-based e-participation platform, showing a citizen proposal and its discussions

• Conventional web forums promote social interaction
• Pros
- Easy and fast content generation (through free text posts)
- Smooth, large-scale interaction (via comment threads)
• Cons
- No or very limited functionalities for content organization,
filtering and analysis
- Dispersed and redundant content, since it is structured
by time
- Challenging processing of discussions

Contents
1. E-participation
2. Decide Madrid
• Participatory budgeting
• E-participatory budgeting
• The ‘Decide Madrid’ platform
7. Conclusions

Participatory budgeting
• Participatory budgeting (PB) is a democratic
deliberation and decision-making process in
which citizens decide how to spend certain
municipal or public budgets
• informing about issues and problems on a wide range
of subject areas in a city, e.g., housing, public safety,
education, health, transportation and environment
• proposing, debating and supporting/voting for
spending ideas and projects aimed to address such
problems

• Pros
• Increased government transparency and trust
• Citizens’ empowerment and change of democratic attitude
• Better allocation of resources (in general)
• Increased voter turnout
• Cons
• Lack of diverse representation
• Time consuming
• Resource intensive
• Lack of interest or political will

• Since its invention in Porto Alegre,
Brazil, in 1989, PB has gained much
popularity
• As of 2022, PB had spread to over 4,500 cities
around the world (source: Participatory
Budgeting World Atlas, https://www.pbatlas.net)
• Tools of citizen participation
• Meetings
• Committees
• Consultations
• …
• Electronic participatory platforms
http://www.participatorybudgeting.org

• PB in Europe (https://www.euractiv.com/section/participatory-democracy/infographic/participatory-budgeting-europes-bet-to-increase-trust-in-government)
• While residents’ demands in European cities are often similar, the share of the budget allocated to PB
varies widely from one place to another: Paris dedicates 25% of the investment budget to PB,
while smaller cities usually invest 2 to 5% of their resources.

• PB in Chile (https://www.pbatlas.net/chile.html)
• 37 local government initiatives + 1 regional government initiative
• Although PB initiatives in the country began in 2002 through the political will of mayors at
the local level, in 2014 the region of Los Ríos started its own process, reflecting:
- the high value placed on citizen participation in the region
- the historical roots of the region’s creation in 2007, preceded by a social
movement that had demanded regional status for more than 30 years
• Proposals are presented mainly through social leaders
- projects are selected in neighborhood or territorial assemblies,
which are mostly formed by representatives of social organizations and institutions
• For voting on and prioritizing proposals, the predominant model is the people’s
direct and universal vote

E-participatory budgeting
• In addition to ad hoc PB digital applications
and platforms, there are several software
frameworks to build online PB platforms
• CONSUL, http://consulproject.org: tens of cities
in Spain, Italy, France and South America
• Stanford Participatory Budgeting,
http://pbstanford.org: major cities in the USA,
e.g., New York, Chicago, Seattle, Oakland and
Boston
• EU Open Budgets, http://openbudgets.eu/tools
Figure: structure of a citizen proposal (title, location, category, author, description, supports, comments)

E-participatory budgeting
• Motivations for data science applications
• Limitations of current ePB platforms of large
cities
- very limited search and filtering functionalities
- unable to facilitate the analysis of hundreds,
even thousands, of citizen proposals and
associated comments and discussions
• When creating a budgeting proposal, a citizen should
be aware of similar or related ideas or projects, in order to
better define the proposal or find
opportunities to collaborate with others

The ‘Decide Madrid’ platform
• A web system designed to allow Madrid
residents to make, discuss and vote on
proposals for the city
• Used since September 2015
• With a €100M budget in 2017
• Consisting of a 3-phase process

• ~6,000 citizen proposals per year
• Keyword-based search
• No use of (structured) metadata
• No data analysis
• No personalization
• No recommendation

• Available data for a proposal
• Title
• Author
• Date
• Summary
• Description
• Freely-chosen tags
• Number of user votes
• User comment threads

Why consider Decide Madrid as a representative case study?
• Participatory budgeting is one of the most widely used citizen participation methods
worldwide:
• Represented in more than 400 of the 2,000 cases analyzed in Participedia
(https://participedia.net)
• Used in more than 3,000 cities and municipalities worldwide according to the Participatory
Budgeting Project (https://www.participatorybudgeting.org/white-paper)
• Decide Madrid is built on CONSUL (https://consulproject.org), an open-source
framework for developing citizen participation platforms:
• Used by more than 135 institutions in 35 countries
• With a structure similar to other popular frameworks, such as Stanford Participatory
Budgeting (https://pbstanford.org) and EU Open Budgets (http://openbudgets.eu/tools)

Contents
1. E-participation
2. Decide Madrid
3. Data acquisition and processing
• The data mining pipeline
• Data crawling
• Data scraping
• Data processing
7. Conclusions

The data mining pipeline
• Data: pure and simple facts with no particular
organization
• Information: processed, filtered, calculated, structured,
categorized, contextualized data
• Knowledge: understanding, experience, insights,
intuitions to use information

• Unstructured data
- No structure
- Non-restricted vocabulary, no predefined schema
- E.g.: free text
• Semi-structured data
- Simple and flexible structure, no strict format
- Limited vocabulary, schema mixed with data values
- E.g.: taxonomies (categories), folksonomies (tags)
• Structured data
- Rigid structure, strict format
- Well defined vocabularies and representation
- E.g.: databases, ontologies

• Open Government Data (OGD) promote transparency, accountability and
public value creation
• By making datasets publicly available, institutions become more transparent and
accountable to citizens
• By facilitating the use, reuse and free distribution of datasets, governments foster business
creation and innovative, citizen-centered digital applications and services
• OGD portals enable the general public to access the open data collections
• allowing the search of data files, but not the search of information within the files

• Open data portals are web sites to access sets
of OGD collections
• Search engine
- Retrieving collections via keyword-based queries
• Collection metadata
- Title, description, date, size, etc.
• Data files
- Formats: CSV, XLS, XML, RDF, etc.
- To be downloaded and opened with specific
applications, e.g., Microsoft Excel
• Documentation
- Inner structure of the data files
Example: open data portal of Madrid City Council

• Open data are commonly provided as tables:
• Rows = data records (instances, individuals)
• Columns = data attributes (features, fields)
Example: records of traffic accidents occurred in Madrid in 2020

Methodology
• Text processing on titles, tags,
descriptions and comments of citizen
proposals
• Semantic annotation of proposals:
topics and districts
• Computing discussion and
controversy metrics on the
comments of each proposal
• Exploiting open data as statistical
indicators about districts: economic,
sociocultural, ideology, employment,
education, health, housing, etc.

• 2 complex processes for:
• crawling and scraping the ‘Decide Madrid’ web pages
• mapping tags to places and topics
• 22 districts (21 districts + “city scope”) & hundreds of places
• 30 topics
• urbanism, transport, environment,
health care, education, social rights,
culture, economy, job,
politics, security, housing, family,
old age, religion, animals, etc.
Assumption: a comment = a (positive, unary) rating

Dataset
• Participatory budgeting of
4 editions: 2015-2018
• Around 29,000 proposals
• More than 86,000
comments
• 30 categories and 325
topics
• 21 districts + “city scope”

Data crawling
• A (web) crawler is a computer program that browses the Web in a methodical
(orderly), automated manner
• Applications
• Web search/indexing
• Vertical (specialized) search engines, e.g., news, shopping, recipes, reviews, papers
• Monitoring web sites and pages of interest
• Business intelligence: collecting information about company competitors and potential
collaborators
• Malicious applications: collecting personal information

Data crawling
• A crawler within a web search engine

Data crawling
• A crawler within a web application

Data crawling
• Generic web crawling process
• Seeds
- A list of starting URLs
• Visiting order
- Frontier = unvisited URLs
- Deciding which lower-priority URLs should be discarded
so that the frontier does not fill up
• Stop criterion
- Empty frontier or maximum number
of pages crawled
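
The generic process above can be sketched as follows. This is a minimal illustration, not the crawler used for Decide Madrid: the Web is replaced by a hypothetical in-memory link map (`LINKS`) so that no network access is needed, the visiting order is plain breadth-first, and the names `FrontierCrawlerSketch` and `crawl` are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FrontierCrawlerSketch {

    // Hypothetical link structure standing in for real pages: URL -> outlinks.
    static final Map<String, List<String>> LINKS = Map.of(
        "/proposals?page=1", List.of("/proposals/1", "/proposals/2", "/proposals?page=2"),
        "/proposals?page=2", List.of("/proposals/3", "/proposals?page=1"),
        "/proposals/1", List.of(),
        "/proposals/2", List.of("/proposals/1"),
        "/proposals/3", List.of());

    // Breadth-first crawl: the frontier is a FIFO queue of unvisited URLs.
    static List<String> crawl(List<String> seeds, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>(seeds);
        Set<String> seen = new HashSet<>(seeds);    // URLs already queued or visited
        List<String> visited = new ArrayList<>();

        // Stop criterion: empty frontier or maximum number of pages crawled.
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            visited.add(url);                        // stands in for fetching the page
            for (String outlink : LINKS.getOrDefault(url, List.of())) {
                if (seen.add(outlink)) {             // discard URLs seen before
                    frontier.add(outlink);
                }
            }
        }
        return visited;
    }
}
```

Calling `FrontierCrawlerSketch.crawl(List.of("/proposals?page=1"), 10)` visits the five hypothetical pages once each; a smaller `maxPages` stops the crawl early.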

Data crawling
• Best First
• The simplest topical crawler
• The frontier is a priority queue based on text (or keyword) similarity between topic and
parent page
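bestFirst(topic, seed_urls) {
    foreach link in seed_urls {
        enqueue(frontier, link, 1.0); // assumption: seeds start with the maximum score
    }
    visited := 0;
    while (frontier.size() > 0 and visited < MAX_PAGES) {
        link := dequeueMax(frontier); // dequeue the URL with MAX similarity
        page := fetch(link);
        visited := visited + 1;
        score := sim(topic, page); // similarity between topic and parent page
        foreach outlink in extract_links(page) { // outlinks
            enqueue(frontier, outlink, score);
        }
    }
}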

Data crawling
<div class="proposal-content">
<h3><a href="/proposals/34239-luces-led-barrio-concepcion-y-san-pascual">
Luces LED Barrio Concepción y San Pascual </a></h3>
<p class="proposal-info">
<span class="icon-comments"></span> 
<a href="/proposals/34239-luces-led-barrio-concepcion-y-san-pascual#comments">
Sin comentarios</a>
<span class="bullet"> • </span>01/12/2022

Data crawling
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// RAND (presumably a java.util.Random) and USER_AGENTS (a String[]) are class-level fields not shown on the slide
public static void downloadProposalsURLs(String url, String file, int firstPage, int lastPage, boolean append) throws Exception {
    // try-with-resources closes the writer even if a page request fails
    try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file, append), StandardCharsets.UTF_8))) {
        for (int p = firstPage; p <= lastPage; p++) {
            // Pick a random user agent
            String userAgent = USER_AGENTS[RAND.nextInt(USER_AGENTS.length)];
            // Open the connection and read the web document
            URI uri = new URI(url + p);
            Document doc = Jsoup.connect(uri.toASCIIString()).userAgent(userAgent).get();
            // Read the proposal URLs: the first <a> link within each <div class="proposal-content"> element
            for (Element div : doc.getElementsByClass("proposal-content")) {
                String linkURL = div.getElementsByTag("a").get(0).attr("href");
                writer.write(linkURL + "\n"); // one URL per line
            }
        }
    }
}

Data scraping
<img alt="Armando Cuesta" class="initialjs-avatar author-photo"
data-char-count="1" data-font-size="19" data-height="32"
data-name="Armando Cuesta"
data-radius="4" data-seed="460897"
data-text-color="#ffffff" data-width="32" src="data:image/…">

Data scraping
public static Proposal getProposal(String proposalFile, boolean isClosed) throws Exception {
    Proposal proposal = new Proposal();
    Document doc = Jsoup.parse(new File(proposalFile), "UTF-8");
    // URL
    Elements elems = doc.select("meta[property=og:url]");
    String url = elems.attr("content").trim();
    proposal.setUrl(url);
    // Id: the numeric prefix of the last URL segment, e.g. 34239 in /proposals/34239-luces-led-...
    String id = url.substring(url.lastIndexOf("/") + 1);
    id = id.substring(0, id.indexOf("-"));
    proposal.setId(Integer.valueOf(id));
    // Title
    elems = doc.select("meta[property=og:title]");
    String title = elems.attr("content").trim();
    proposal.setTitle(title);
    // Summary (the placeholder text means that no summary was provided)
    String summary = doc.select("div.proposal-show").get(0).getElementsByTag("blockquote").text().trim();
    if (summary.equals("Resumen de la propuesta")) {
        summary = "";
    }
    proposal.setSummary(summary);
    ...

Data processing
• Database tables
• Created by the crawler and scraper
- proposals (code, title, author, date, summary, description, supports,…)
- users
- proposal_tags
- proposal_comments (id, author, text, parent_comment, pos_votes, neg_votes, …)
• Created from proposal_tags
- proposal_categories: text processing + clustering
- proposal_topics: text processing + clustering
- proposal_districts: text processing
- proposal_locations: text processing + mapping to a street directory + geolocation

Data processing
• Graph building
• Its nodes are the whole set of proposal tags
• Each of its (weighted) edges links
a pair of “related” tags, according to:
- Syntactic similarity
- Semantic similarity
- Co-occurrences within proposals
• Graph clustering method proposed by
Newman and Girvan (2004)
• It has a criterion to automatically set an
optimal number of clusters
• Each cluster represents a topic, which is
composed of a set of tags
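
As an illustration of the co-occurrence criterion only, the sketch below counts, for every pair of tags, the number of proposals in which both appear. It is a simplified example under stated assumptions: the class and method names are hypothetical, the syntactic and semantic similarities are not covered, and how the three criteria are combined into a single edge weight is not specified on the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class TagCooccurrenceSketch {

    // Returns edge weights keyed by "tagA|tagB" (tagA before tagB alphabetically),
    // where the weight is the number of proposals containing both tags.
    static Map<String, Integer> cooccurrences(List<List<String>> proposalTags) {
        Map<String, Integer> edges = new TreeMap<>();
        for (List<String> tags : proposalTags) {
            // Sort and de-duplicate the tags of one proposal.
            List<String> unique = new ArrayList<>(new TreeSet<>(tags));
            for (int i = 0; i < unique.size(); i++) {
                for (int j = i + 1; j < unique.size(); j++) {
                    edges.merge(unique.get(i) + "|" + unique.get(j), 1, Integer::sum);
                }
            }
        }
        return edges;
    }
}
```

The resulting weighted edges could then be passed to a graph clustering method; pairs that never co-occur simply have no edge.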

Data processing
• 2-level taxonomy: 30 categories + 325 topics
Accesibilidad (Accessibility): accesibilidad, accesibilidad metro, aparcamiento para discapacitados, ...
Animales (Animals): adiestramiento canino, águilas, animales, animales de compañía, antitaurino, ...
Asociaciones (Associations): asociaciones, asociaciones de vecinos, asociaciones juveniles, ...
Ayuntamiento y administración pública (City council and public administration): administracion, alcaldesa, atencion al ciudadano, ...
Civismo (Civility): acoger, bioetica, bullying, cinismo, civico, civismo, colaboracion social, ...
Cultura (Culture): arqueologia, arte, arte callejero, arte urbano, artesania, artistas, ...
Delincuencia (Crime): anti corrupcion, atraco, carteristas, corrupcion, delincuencia, delitos, ...
Deportes (Sports): actividad fisica, anillo ciclista, area de deportes, atletas, atleti, atletismo, ...
Derechos sociales (Social rights): abuso, acoso, albergue, altermundialismo, apoyo emocional, apoyo social, ...
Economía (Economy): actividad económica, ahorro, bancos, bbva, comerciantes, comercio, ...
Educación (Education): acoso escolar, alumnos, bachillerato, bibiotecas, brecha cultural, ...
Empleo (Employment): autoempleo, autónomos, comerciales, conciliacion laboral, contratacion municipal, ...
Equidad e integración (Equity and integration): chabolas, cie, derechos lgtbi, inmigración, desigualdad de genero, ...
Familia e infancia (Family and childhood): actividades infantiles, ayuda embarazo, bebes, carricoche, ...
Jóvenes (Youth): acoso escolar, adolescencia, adolescentes, asociaciones juveniles, ...
Justicia (Justice): constitucion, cumplimiento de las leyes, dictadura, fiscal, franquismo, ...
Medio ambiente (Environment): acusticas, agroecologia, agua, aire, aire acondicionado, ajardinamiento, ...
Movilidad (Mobility): abono transportes, adif, agentes de movilidad, aparamiento regulado, ...
Ocio y entretenimiento (Leisure and entertainment): baile, bares, celebraciones, centro comercial, cines, conciertos, ...
Participación ciudadana (Citizen participation): accion social, avisos madrid, decide madrid, decidemadrid, ...
Política (Politics): 15m, ahora madrid, ayuntamiento, ayuntamiento de madrid, democracia, ...
Religión (Religion): españa laica, estado aconfesional, iglesia, islam, laicismo, religion, ...
Salud y sanidad (Health and healthcare): acoholismo, acustica, acusticas, aire libre, aire puro, alcohol, ...
Seguridad y emergencias (Security and emergencies): accidentes, app emergencias, aviso, avisos madrid, bomberos, ...
Sostenibilidad (Sustainability): agroecologia, ahorro de energia, autogestion, ciudad amable, ...
Tercera edad (Old age): abuelos, ancianos, centros de dia, desempleo mayores, jubilacion, ...
Transparencia (Transparency): anti corrupcion, datos abiertos, derecho a la informacion, ...
Turismo (Tourism): oferta turistica, puntos de informacion turistica, puntos de interes, ...
Urbanismo (Urban planning): aceras, adoquinado, ajardinamiento, alumbrado, apariencia edificios, ...
Vivienda (Housing): alquileres, alquiler vacacional, alquiler vivienda, derecho a un vivienda, ...

Contents
1. E-participation
2. Decide Madrid
4. Data mining applications
• Discussion and controversy analysis
• Clustering and visualization
• Intent-based classification
7. Conclusions

Discussion and controversy analysis
• In the literature, there is a predominance of online tools implemented ad hoc to
facilitate citizen participation at scale and to reduce costs
• Aiming to analyze in depth how participation is performed in such tools, we conduct
a study about a particular tool
• The chosen tool is Decide Madrid (https://decide.madrid.es), the participatory budgeting
e-platform of Madrid City Council since 2015
• The study makes use of diverse data:
• Topics, districts and support levels of citizen proposals
• Controversy level of comment threads originated over the proposals
• Indicators about economic, sociocultural and ideological aspects of the districts

Motivation
• Government institutions’ lack of understanding of the content generated by
citizens in electronic tools
• Possibility that institutions fail to meet the citizens’ demands
- Certain relevant demands may go unmet, not because they are unfeasible, but
because of their controversial nature
• Decreased quality of decision making
• Loss of confidence on the part of the citizenry

Decide Madrid
• Operational since 2015
• With more than 6,000 citizen
proposals a year
• With more than 400,000
registered users in 2019
• With a structure of
discussion threads
(comments) for each citizen
proposal
Example of a citizen proposal in Decide Madrid, with its title, author, date, description, tags, votes and comments.

Controversy metrics
• To measure the controversy of a citizen proposal, we consider the aggregation of 3 metrics applied to
discussion threads (comments)
• Controversy based on the content
(length) of discussions
• Controversy based on the opinion
polarization (of votes)
• Controversy based on the structure of
the conversations

Controversy metrics
• Discussion content-based controversy
• The length of the proposal’s discussion, measured as the sum of the length of its comments
• Opinion polarization-based controversy
• A weighted ratio measuring the difference of positive and negative votes for the proposal’s comments
• Conversation structure-based controversy
• An adaptation of the H-index for measuring discussion diversification
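
The three metrics could be sketched roughly as below. Only the first one follows the slide literally (sum of comment lengths). The other two are assumptions made for illustration, since the exact formulas are not given here: `polarization` is an unweighted balance ratio rather than the study's weighted ratio, and `hIndex` is the classic h-index applied to a list of counts (for instance, comments per nesting level), not necessarily the study's adaptation. All names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class ControversySketch {

    // Content-based: total length of the discussion, as the sum of comment lengths.
    static int contentLength(List<String> comments) {
        int total = 0;
        for (String comment : comments) {
            total += comment.length();
        }
        return total;
    }

    // Polarization (assumed, unweighted form): 1 when positive and negative votes
    // are balanced, 0 when all votes go the same way or there are no votes.
    static double polarization(int posVotes, int negVotes) {
        int total = posVotes + negVotes;
        if (total == 0) {
            return 0.0;
        }
        return 1.0 - Math.abs(posVotes - negVotes) / (double) total;
    }

    // Structure-based (assumed form): classic h-index over a list of counts,
    // i.e. the largest h such that at least h entries are >= h.
    static int hIndex(int[] counts) {
        int[] sorted = counts.clone();
        Arrays.sort(sorted);                        // ascending order
        int h = 0;
        for (int i = sorted.length - 1; i >= 0; i--) {
            int rank = sorted.length - i;           // 1-based rank in descending order
            if (sorted[i] >= rank) {
                h = rank;
            } else {
                break;
            }
        }
        return h;
    }
}
```

How the three values are normalized and aggregated into a single controversy score is left open in this sketch.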

Some results of the study (I)
• The controversy values follow a heavy-tailed distribution, in which the majority of the proposals have
low controversy
• Highly supported proposals are not necessarily the most controversial
“In Decide Madrid, proposals with a low level of support are currently discarded and archived, regardless of the level of
discussion and controversy they have. However, from a decision-making perspective, it would be interesting to delve deeper
into the controversial proposals and understand the problems of the city and the citizens they are affected by”.

Some results of the study (II)
• Most controversial and supported topics
• Religion: inclusion of LGTBI+ groups in Cabalgata de Reyes,
public funding and tax benefits for Catholic institutions
• Housing: creation of social housing, annual property taxes
• Culture: prohibition of bullfighting
• Topics having low-moderate number of proposals with
low level of support and high controversy
• Governance: transparency, citizen participation, public
administration, laws and legislation
• Rights and social movements: social rights, civility, equity,
migration, integration, crime, NIMBY
“In Decide Madrid, citizens’ ideological differences play
an important role in the group of controversial categories”.
“In Decide Madrid, political and social issues reach a
low-moderate relevance (final attention)”.

Some results of the study (II)
• Topics having a large number of proposals with a high level
of support and controversy
• Domestic animals, mainly dogs (e.g., cleaning and fines for excrements on
public roads, creation of "pipicans", compulsory leash, etc.)
• Topics having low-moderate number of proposals with
low-moderate level of support and controversy
• Education, health, family, childhood, old age, employment, accessibility,
youth.
“In Decide Madrid, proposals aimed at some vulnerable groups
(for example, people with disabilities, the elderly, unemployed)
tend to generate less citizen participation”.

Some results of the study (III)
• Study of factors external to participation. Calculation of the correlation between levels of
support/controversy and district “statistical indicators” published as open data
• The districts in which the greatest number of proposals are generated are those with:
• A high number of groups, neighborhood associations, and consumer organizations
• A more progressive position, that is, in which the majority voted for PSOE and Unidas Podemos
• A greater environmental commitment, that is, with more ecological associations
• The districts in which the most controversial proposals are generated are those with:
• A higher percentage of young people
• A greater number of citizens belonging to vulnerable groups, such as the elderly, young people and people
with some type of disability
• A higher birth rate and number of associations related to childhood
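
A correlation of this kind can be computed as in the sketch below, which assumes the Pearson coefficient (the slides do not say which correlation measure was used). The class and method names are hypothetical: `x` would hold one statistical indicator per district and `y` the corresponding support or controversy level.

```java
class CorrelationSketch {

    // Pearson correlation between two equally long series (one value per district).
    // Returns NaN if either series has zero variance.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 2) {
            throw new IllegalArgumentException("series must have the same length (>= 2)");
        }
        int n = x.length;
        double meanX = 0.0, meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0.0, varX = 0.0, varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }
}
```

Values near +1 or -1 indicate a strong linear relationship between the indicator and the participation level; values near 0 indicate none.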

Limitations of the study
• Only the votes (for or against) given to the comments have been considered
• The “polarity” (positive or negative) of the comments themselves should be analyzed. To do this, natural
language processing techniques would have to be applied
• Only Decide Madrid, a tool restricted and adjusted to a specific participation procedure,
has been analyzed
• More open tools such as online social networks (e.g., Twitter) should be considered
• The proposals and discussions observed are motivated by political and ideological cleavages that traditionally
divide Spanish society (ideological positioning on the left-right scale,
religious versus secular values, traditional versus progressive, etc.)
• Tools from other countries should be analyzed to obtain more generalizable conclusions
• Possible biases (e.g., digital divide, political program) among the users of Decide
Madrid, and similar tools, have not been addressed

Clustering and visualization
• Citizen collaboration through current digital participation platforms can entail the
generation of large amounts of complex content, which may hide relevant citizens’
concerns, requests and initiatives, diluted in isolated individual proposals
• We present an interactive data mining tool for citizen participation data
visualization and analysis
• Applying natural language processing, text similarity, and graph clustering techniques
• Grouping proposals with common objectives
• Identifying trends and recurrent topics of interest
• Filtering and presenting information according to several criteria
• The tool is flexible, able to process different sources of data, and lightweight as it
uses simple data structures and dynamic HTML-based visualization and interaction
71

• The tool is built upon the
Tableau data visualization
software
https://www.tableau.com/resource/data-visualization
• Lightweight
• Easy to configure
• Several visualization
functionalities
- Bar charts
- Heat maps
- Time series graphs
72

• Distribution of proposals,
categories and topics,
according to:
• Time (year, month) and
location (district)
• Support, discussion and
controversy levels
• Diverse temporal and
geographical analysis
• Better and easier extraction of
patterns and insights when
analyzing the published citizen
generated content
73

• Text processing
• Mistake correction
- Dictionary
- Levenshtein distance
• Special characters removal
• Stopwords removal
• Word lemmatization
• Document similarity
• Word Mover’s Distance
(WMD) similarity, which
treats text documents as
weighted point clouds of
word embeddings
74
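The dictionary-based mistake correction step listed above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: the mini-dictionary, the `max_distance` threshold and the function names are assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def correct(word, dictionary, max_distance=2):
    """Return the closest dictionary word, or the word itself if none is close enough."""
    if word in dictionary:
        return word
    best = min(dictionary, key=lambda w: levenshtein(word, w))
    return best if levenshtein(word, best) <= max_distance else word


# Hypothetical mini-dictionary, for illustration only.
DICTIONARY = ["transport", "bicycle", "park", "street"]
print(correct("trasnport", DICTIONARY))  # swapped letters cost two substitutions
print(correct("xyz", DICTIONARY))        # too far from every entry, left unchanged
```

A distance threshold keeps rare but valid words (names, neighborhoods) from being "corrected" into unrelated dictionary entries.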

• Document clustering
• Weighted graph
- Nodes: citizen proposal
documents
- Edges: document
similarity values
- Removal of edges with
“low” weights
• Louvain clustering
method
- Optimizes the
modularity of the graph,
associating nodes to
clusters until
convergence
75
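The graph construction above can be sketched in a few lines. The example below only covers edge pruning and the modularity score that Louvain optimizes; it does not implement the Louvain method itself (library implementations exist, for instance in NetworkX). The similarity values, the 0.5 threshold and the function names are assumptions for illustration.

```python
from collections import defaultdict


def prune_edges(similarities, threshold):
    """Keep only edges whose similarity weight reaches the threshold."""
    return {pair: w for pair, w in similarities.items() if w >= threshold}


def modularity(edges, cluster_of):
    """Weighted modularity Q = sum_c [ L_c / m - (d_c / 2m)^2 ] of a partition."""
    total = sum(edges.values())  # m: total edge weight
    if total == 0:
        return 0.0
    degree = defaultdict(float)
    internal = defaultdict(float)  # L_c: weight of edges inside cluster c
    for (u, v), w in edges.items():
        degree[u] += w
        degree[v] += w
        if cluster_of[u] == cluster_of[v]:
            internal[cluster_of[u]] += w
    cluster_degree = defaultdict(float)  # d_c: sum of node degrees in cluster c
    for node, d in degree.items():
        cluster_degree[cluster_of[node]] += d
    return sum(internal[c] / total - (cluster_degree[c] / (2 * total)) ** 2
               for c in cluster_degree)


# Hypothetical proposal similarities: two tight groups joined by one weak edge.
similarities = {
    ("p1", "p2"): 0.90, ("p1", "p3"): 0.80, ("p2", "p3"): 0.85,
    ("p4", "p5"): 0.90, ("p4", "p6"): 0.80, ("p5", "p6"): 0.85,
    ("p3", "p4"): 0.20,
}
edges = prune_edges(similarities, threshold=0.5)
two_clusters = {"p1": 0, "p2": 0, "p3": 0, "p4": 1, "p5": 1, "p6": 1}
one_cluster = {node: 0 for node in two_clusters}
print(modularity(edges, two_clusters), modularity(edges, one_cluster))
```

Louvain repeatedly moves nodes between clusters whenever the move increases this score, then merges clusters into super-nodes and repeats until no move improves it.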

• A coproduction functionality
based on the retrieval of
existing similar proposals
• A citizen who is interested in
submitting a new proposal can
first bring it into the tool, and
check if there are related ones
76

Contents
1. E-participation
2. Decide Madrid
7. Conclusions
77

Intent-based classification
• Social networks represent a prominent bidirectional communication
channel between citizens and government
• Citizens are…
- content consumers who receive the government announcements, to which they
react and freely respond according to personal ideology, interests and needs, and
- content providers who generate a wide range of messages targeted to government
and political stakeholders
• The amount of social media content generated daily by citizens is huge and
diverse, and processing it manually can be too costly and
overwhelming
78

• There is an increasing interest and need to use computer-assisted
solutions capable of automatically gathering, processing and analyzing
the underlying information in the citizens’ messages (a.k.a. posts) on social
networks
• The research literature reports extensive work on:
• analyzing social phenomena produced through the online network structures
(e.g., information spreading, fake news, and opinion polarity), and mainly originated
by particular events (e.g., natural disasters, elections, and trending news)
• extracting the most popular topics addressed by citizens’ posts in social networks,
as well as the general dynamics (i.e., temporal evolution) and opinions on such topics
79

• Unlike previous work, we go beyond the extraction of topics by
attempting to automatically classify citizens’ posts (tweets) according
to their intents or purposes
1. Complaint: stating something that is unsatisfactory or unacceptable
- “@MADRID after 1 week of calling, the city is yet not clean, and the rats are taking over!!
http://t.co/IiIDuaPFG9”
2. Announcement: making a public statement about a fact, occurrence or event
- “The date, place and schedule of the Festival activities in La Latina have already been
confirmed http://t.co/U0tRwKAC @madrid @madridiario”
3. News item: objectively informing about current events
- “#oladecalor #aemet @Madrid has suffered its warmest night within the latest 100 years
http://t.co/ZSjeqK6m”
80

4. Personal fact: publicizing self issues and experiences
- “I also support the candidature from @Madrid2020ES @MADRID #aporella”
5. Opinion: expressing subjective opinions about the city, its events, activities, etc.
- “The activity of #emprendeenmadrid is amazing. Congratulations @MADRID and greetings
from an entrepreneur”
6. Request: explicitly asking for something specific
- “Very nice but impossible to ride a bike at normal speed #MadridRio. Please @MADRID
create a bike lane with cyclist priority”
81

7. Notification: reporting or giving notice of urban, citizenship- or government-related
issues, so that government can quickly act on them and help other citizens
- “@MADRID can you fix this gap in San Bernardino street 8-10 before someone gets hurt?
http://lockerz.com/s/117566458”
8. Question: explicitly asking for information
- “@MADRID could you please give me the telephone number of the press office of the
Madrid city hall”
9. Proposal: suggesting an initiative or project
- “There is a collection of used oil in the center of Alicante. It would be fantastic to have
something similar @MADRID”
82

• To automatically categorize a tweet into one of the previous intents
(labels), it is first transformed into a vector of features
• We consider 37 domain- and language-independent features to
describe the content of a tweet
83
Lexical features
• number of characters
• number of words
• number of exclamation marks
• number of question marks
• existence of a positive emoticon
• existence of a negative emoticon
• existence of a vowel (or “y”) consecutively repeated 3 or more times in a word
Grammatical features
• number of nouns
• number of proper nouns
• number of adjectives
• number of verbs
• number of adverbs
• number of personal/possessive pronouns
• number of time references (entities)
• number of money-related references
Social network-based features
• number of followers
• number of friends (a.k.a. followees)
• number of posts
• number of active days in Twitter
• number of hashtags (#)
• number of user mentions (@)
• number of hyperlinks
• number of multimedia
• maximum hashtag length
• existence of an explicit retweet request (i.e., "RT" abbreviation)
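
A few of the listed features can be computed directly from the tweet text, as in the sketch below. It is only an illustration: the regular expressions, the feature names and the choice to ignore URLs in the lexical checks are assumptions, and the grammatical features (which need a part-of-speech tagger) and the account-level features (followers, friends, posts) are left out.

```python
import re

URL = re.compile(r"https?://\S+")
POSITIVE_EMOTICON = re.compile(r"[:;=]-?[)D]")
NEGATIVE_EMOTICON = re.compile(r"[:=]-?\(")
REPEATED_VOWEL = re.compile(r"([aeiouy])\1{2,}", re.IGNORECASE)


def tweet_features(tweet: str) -> dict:
    """Map a tweet to a small dictionary of text-derived features."""
    text = URL.sub(" ", tweet)  # keep URL characters out of the lexical checks
    hashtags = re.findall(r"#\w+", text)
    return {
        "n_chars": len(tweet),
        "n_words": len(tweet.split()),
        "n_exclamations": text.count("!"),
        "n_questions": text.count("?"),
        "has_positive_emoticon": bool(POSITIVE_EMOTICON.search(text)),
        "has_negative_emoticon": bool(NEGATIVE_EMOTICON.search(text)),
        "has_repeated_vowel": bool(REPEATED_VOWEL.search(text)),
        "n_hashtags": len(hashtags),
        "n_mentions": len(re.findall(r"@\w+", text)),
        "n_links": len(URL.findall(tweet)),
        "max_hashtag_length": max((len(h) - 1 for h in hashtags), default=0),
        "has_rt_request": bool(re.search(r"\bRT\b", text)),
    }


# The complaint example quoted earlier in these slides.
complaint = ("@MADRID after 1 week of calling, the city is yet not clean, "
             "and the rats are taking over!! http://t.co/IiIDuaPFG9")
print(tweet_features(complaint))
```

Because none of these features depends on specific words, the same extractor could be applied to tweets in other languages or domains.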

• To validate the proposed approach, we evaluated several machine learning
algorithms on a labeled dataset:
• K-Nearest Neighbors (KNN)
• Logistic Regression (LR)
• Quadratic Discriminant Analysis (QDA)
• Decision Tree (DT)
- executed alone, and in combination with
feature selection (RFECV DT) and
tree pruning (AP DT)
to avoid overfitting
• Gaussian Process (GP)
• Support Vector Machine (SVM)
• Bagging Ensemble (BE)
84

• Dataset: a random sample of 666 tweets mentioning the @Madrid account, each
manually labeled by 3 researchers (almost perfect agreement: Fleiss' kappa = 0.98)
• 9 binary classification problems: one-against-all (i.e., training a single classifier
per label)
• Classification metrics
• acc (accuracy)
• acc+ (minority class acc)
• acc– (majority class acc)
85
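The three accuracy metrics can be computed as below for one of the binary (one-against-all) problems. The sketch assumes that label 1 marks the intent of interest and is the minority class; the label vectors are synthetic and only illustrate why overall accuracy alone can mislead on unbalanced data.

```python
def class_accuracies(y_true, y_pred):
    """Return (acc, acc+, acc-) for binary labels, where 1 is the minority class."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    tp = sum(1 for p in positives if p == 1)
    tn = sum(1 for p in negatives if p == 0)
    acc = (tp + tn) / len(y_true)
    acc_pos = tp / len(positives) if positives else 0.0
    acc_neg = tn / len(negatives) if negatives else 0.0
    return acc, acc_pos, acc_neg


# Synthetic example: 5 tweets with the target intent out of 100.
y_true = [1] * 5 + [0] * 95

# A classifier that always answers "not this intent".
always_negative = [0] * 100
print(class_accuracies(y_true, always_negative))  # high acc, but acc+ is zero

# A classifier that finds 4 of 5 positives and 76 of 95 negatives.
balanced = [1] * 4 + [0] * 1 + [0] * 76 + [1] * 19
print(class_accuracies(y_true, balanced))         # lower acc, balanced acc+ / acc-
```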

• Observations on the classification results
• (very) unbalanced classification problems
• (misleading) high classification accuracies
• reasonably good accuracy balance for the two labels
88

• Most discriminating words and features for each of the considered intents
89
COM = complaint
ANN = announcement
REQ = request
NEW = news item
FAC = personal fact
OPI = personal opinion

• The proposed intent-based classification is a task prior to the
extraction of topics and opinions, and may help filter and prioritize citizens’
messages, and further automate processes for more efficient and effective
decision and policy making
• There is room for improvement:
• More sophisticated NLP techniques, such as language models and word embeddings,
could be used to exploit the semantics of words and word sequences
- e.g., “opinion is” and “really think that” could be identified as informative bigram and
trigram of the personal opinion intent
• Features from other sources of information, such as the user who creates a post and the
user(s) who are mentioned in a post
- e.g., by considering their types: citizens, neighborhood associations, organizations, or
political actors
90

Contents
1. E-participation
2. Decide Madrid
• Argument mining in a nutshell
• Argument-based document search
• Argument-based conversational information access
• Neural network-based argument extraction
7. Conclusions
91

Argument mining in a nutshell
• Tasks
• Detection of argument text fragments
• Identification of argument components
• Extraction of argument relations
• Algorithmic foundations
• Natural Language Processing (NLP)
• Machine/deep learning
• Linguistic features
• Sentence-level (e.g., sentence length, argument linkers, etc.),
grammatical (e.g., number of nouns, adjectives, modal verbs, etc.), syntactic (e.g., patterns,
constituency tree depth, etc.), semantic (e.g., named entities, word embeddings, etc.)
92
Source: ACL’16 tutorial “NLP Approaches to Computational Argumentation”

• Tasks
1. Detection of arguments
2. Identification of argument components and structures
3. Extraction of argument relations
93
Source: ACL’16 tutorial
“NLP Approaches to Computational Argumentation”

• Example: Categorization of argumentative components via machine learning
• Classes
- “Major claim”, “Claim”, “Premise”
• Features
- Lexical: lemmatized unigrams, including previous tokens
- Syntactic: number of nested phrases, depth of the syntactic tree, POS distribution,
tense of the principal verb, modal verbs
- Structural: first or last sentence of a paragraph, presence in introduction or conclusion,
relative position, number of tokens, etc.
- Indicators: connectors such as “because”, “however”, “as a result”, etc.
- Contextual: contextualized connectors, number of words shared by introduction and conclusion
- Probabilistic: conditional probability P(category | previous tokens)
- Discourse: discourse relation based on the Penn Discourse Treebank
- Embeddings: 300-dimensional vectors trained on the Google News corpus
98

• Example: Categorization of argumentative components via machine learning
• Using all features results in the best F1 values
• The classification of claims is the most difficult task
• The structural features are the most valuable
• The discourse features are informative for the identification of claims
• The word embeddings achieve results similar to lexical features
99

• Corpus
• AIFdb: repository of databases, following the
Argument Interchange Format, AIF
- AraucariaDB: news editorials, parliamentary records,
court summaries and panel discussions
- MM2012: transcriptions of BBC Radio 4
- …
• The Internet Argument Corpus, IAC: set of political
debates in internet forums
• The ECHR Corpus: collection of documents extracted
from legal texts of the European Court of Human Rights
• The Argument Annotated Essays Corpus, AAEC:
collection of persuasive essays
• …
100

• Tools
• Collaborative editors of argumentative graphs
- Agora, http://agora.gatech.edu
- Argunet, http://www.argunet.org
- DebateGraph, http://debategraph.org
- Rationale Online, https://www.rationaleonline.com
• Argumentative annotation platforms
- Araucaria, http://araucaria.arg.tech
- OVA, http://ova.arg-tech.org
101

• Events
• International Conference on Computational Models of Argument (COMMA),
https://comma2020.dmi.unipg.it
• Workshop on Argument Mining (ArgMining), https://2021.argmining.org
• Workshop on Computational Models of Natural Argument (CMNA),
http://cmna.csc.liv.ac.uk/CMNA20
• Summer School on Argumentation (SSA), https://ssa2020.dmi.unipg.it
• ACL’19 tutorial “Advances in Argument Mining”, http://arg.tech/~chris/acl2019tut/index.html
• ACL’16 tutorial “NLP Approaches to Computational Argumentation”, http://acl2016tutorial.arg.tech
• Online Seminars on Computational Models of Argument,
https://sites.google.com/view/argumentation-seminar
• Dagstuhl’16 seminar “Natural Language Argumentation: Mining, Processing, and Reasoning over
Textual Arguments”, https://www.dagstuhl.de/16161
• BiCi’14 seminar “Frontiers and Connections between Argumentation Theory and Natural
Language Processing”, http://www-sop.inria.fr/members/Serena.Villata/BiCi2014/frontiersARG-NLP.html
102

Contents
1. E-participation
2. Decide Madrid
7. Conclusions
103

Argument-based document search
• Proposed framework
104

• Argument model
• Premise → Claim → Major claim
• Types and subtypes of argument relations
• Cause: linking an argument that reflects the reason or condition for another argument
• Clarification: introducing a conclusion, exemplification, restatement or summary of an argument
• Consequence: evidencing an explanation, goal or result of a previous argument
• Contrast: attacking arguments, distinguishing between giving alternatives, doing comparisons,
making concessions, and providing oppositions
• Elaboration: introducing an argument that provides details about another one, entailing addition,
precision or similarity issues about the target argument
• Argument mining methods
• Syntactic pattern matching
• Feature-based machine learning classification
• Embedding-based deep neural network
105

• Heuristic algorithm
• For each sentence of an
input text: looking for certain
syntactic patterns that
introduce argumentative
expressions
• 1,744 arguments extracted
from 5,633 comments
• Contrast: 54.1%
• Consequence: 12.1%
• Cause: 3.6%
• Elaboration: 0.1%
106
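The heuristic can be illustrated with a small linker lexicon. The sketch below is an assumption-laden simplification: the English linkers are hypothetical stand-ins (the pattern lists actually used on the Decide Madrid comments are not reproduced in these slides), and the text on each side of a linker is simply reported as `left` and `right`, without deciding which side is the claim and which the premise.

```python
import re

# Hypothetical English linkers grouped by relation type (illustration only).
LINKERS = {
    "cause": ["because", "since", "due to"],
    "consequence": ["therefore", "so that", "as a result"],
    "contrast": ["but", "however", "although"],
    "clarification": ["in other words", "for example"],
    "elaboration": ["in addition", "moreover"],
}


def extract_arguments(sentence: str) -> list:
    """Find linkers in a sentence and split it into the text before and after them."""
    arguments = []
    for relation, linkers in LINKERS.items():
        for linker in linkers:
            match = re.search(rf"\b{re.escape(linker)}\b", sentence, flags=re.IGNORECASE)
            if not match:
                continue
            left = sentence[:match.start()].strip(" ,;.")
            right = sentence[match.end():].strip(" ,;.")
            if left and right:  # a linker needs text on both sides
                arguments.append({"relation": relation, "linker": linker,
                                  "left": left, "right": right})
    return arguments


print(extract_arguments("Pets should be allowed on the metro, but only outside rush hours."))
print(extract_arguments("Please build more bike lanes."))
```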

• Argument linkers
107

• Information retrieval
• Text processing
• NLP for linguistic feature extraction
• Indexing based on keywords, topics, categories, entities and other metadata
• Search engine based on the vector space model
• Argument-based reranking according to controversy metrics
108
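A vector space search followed by a controversy-aware reranking might look like the sketch below. The TF-IDF weighting, the linear combination controlled by `alpha`, and the controversy scores are all assumptions for illustration; the slides do not specify the controversy metrics or how they are combined with relevance.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight) and the IDF table."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(len(docs) / count) + 1.0 for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, docs, controversy, alpha=0.7):
    """Rank matching documents by alpha * relevance + (1 - alpha) * controversy."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query.lower().split()).items()}
    scored = []
    for index, vector in enumerate(vectors):
        relevance = cosine(q, vector)
        if relevance > 0:  # only documents sharing a query term are candidates
            scored.append((alpha * relevance + (1 - alpha) * controversy[index], index))
    return [index for _, index in sorted(scored, reverse=True)]


# Hypothetical proposals with made-up controversy scores in [0, 1].
docs = ["allow pets on public transport",
        "more bike lanes in the city center",
        "pets in public parks without leash"]
controversy = [0.2, 0.9, 0.8]
print(search("pets public", docs, controversy, alpha=1.0))  # relevance only
print(search("pets public", docs, controversy, alpha=0.7))  # controversy-aware
```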

• Outcomes – arguments
• JSON object created for an argument that evidences a contrast premise on a proposal in
favor of using Madrid public transport with pets
109

• Outcomes – documents, topics and arguments
110

• Dataset
• 80 proposals (covering 10 categories and having high controversy) and 5,633 comments
• Experiment setting
• 3 evaluators
• 3 queries
• Topical relevance – accuracy of an argument with respect to the major claim of the
discussion
• 14.6% of the arguments were labeled as very relevant
• 39.9% as relevant
• 36.9% as not relevant
• 8.6% as incorrect
• Rhetoric quality – effectiveness of an argument in persuading an audience
• 17.1% of the arguments were of high quality
• 40.6% of sufficient quality
• 42.3% of low quality
111

• We have presented a general and flexible argument-based search framework
• Preliminary implementation and evaluation on a dataset with citizen proposals and discussions generated in an
online participatory platform
• Its current implementation includes:
• Various argument extraction methods (heuristic pattern matching, feature-based machine learning, embedding-based deep learning)
• A document retrieval engine built upon vector space-based models
• A reranking strategy that exploits certain controversy metrics
• We envision several open research lines:
• Development of ad hoc argument-based document retrieval methods (so far, we have used a reranking technique)
• Consideration of alternative controversy notions
• Increment of the size and quality of the generated corpus
• Evaluation on other datasets and domains
• Measurement of additional argument quality metrics, e.g., based on diversity, fairness, persuasiveness, etc.
112

Contents
1. E-participation
2. Decide Madrid
7. Conclusions
113

Argument-based conversational information access
• E-participation, understood as
computer-assisted support for citizen
participation, has given rise to novel
consultation and deliberation processes
• Most current e-participation platforms are
based on web forums
• Citizens make proposals and provide comments
and opinions, forming
large conversation threads
• Recent attention has shifted to social media,
especially social networks
(e.g., Facebook and Twitter) and
instant messaging tools
(e.g., Telegram and WhatsApp)
114

• Conventional web forums promote social interaction
• Pros
- Easy and fast content generation (through free text
posts)
- Smooth, large-scale interaction (via comment threads)
• Cons
- No or very limited functionalities for content
organization, filtering and analysis
- Dispersed and redundant content, since it is structured
by time
- Challenging processing of discussions
• Argument-driven tools promote the production and
reuse of collective knowledge
115

• Our work on e-participation…
• addresses 2 promising research lines
- The exploitation of argument mining techniques to automatically
extract and present argumentative information from
citizen-generated content
- The use of conversational agents or chatbots as citizen-to-government
communication channels in instant messaging applications
• targets a final goal
- Helping to discover and understand city problems and
citizens’ concerns, and consequently to reach well-formed opinions
for making better decisions in participatory processes
116

• The ‘Decide Madrid’ e-participation platform
• A web system designed to allow Madrid residents to
make, debate and vote on proposals for the city
• Available data from a citizen proposal
• Title
• Author, date
• Summary, description
• Freely-chosen tags
• User comment threads
• Heterogeneous topics and discussions
• urbanism, transport, environment, health care,
education, social rights, culture, economy,
employment, politics, security, housing, family, old age,
religion, animals, etc.
117
https://decide.madrid.es

• Argument model
• Premise → Claim → Major claim
• Types and subtypes of argument relations
• Cause: linking an argument that reflects the reason or condition for another argument
• Clarification: introducing a conclusion, exemplification, restatement or summary of an argument
• Consequence: evidencing an explanation, goal or result of a previous argument
• Contrast: attacking arguments, distinguishing between giving alternatives,
doing comparisons, making concessions, and providing oppositions
• Elaboration: introducing an argument that provides details about another one,
entailing addition, precision or similarity issues about the target argument
118

• Example of an extracted argument tree
119
C = claim
L = linker
P = premise

• Through a natural language
conversation with the chatbot,
the user can:
1. explore citizen proposals and
comments, organized by
categories, topics and districts
2. access categorized citizens’
arguments given
in the debates around a
proposal
3. provide feedback and
votes for proposals
120

• The chatbot is built upon the Google DialogFlow framework, which links external web services
with a variety of instant messaging and social networking services, e.g., Google Assistant,
Facebook Messenger, WhatsApp, Telegram and Skype
121

• The chatbot handles several conversation intents, each of them with triggering sentence
patterns and associated functionalities
122

User study: empirical evaluation of the chatbot in terms of:
1. The feasibility of exploring e-participation content via a conversational interface
2. The potential benefits of argument-driven information in e-participation
• Uncontrolled, realistic scenario
• Without external supervision, participants freely tested the chatbot via Telegram during a period of one
week, using their own Telegram accounts and mobile devices
• 32 participants → 2 groups
• Control group: having disabled the chatbot’s argument-driven browsing functionalities
• Experimental group: having enabled the chatbot’s argument-driven browsing functionalities
123

Study questionnaire
• 33 items
• 10 evaluation criteria
• Decision making
• Public values
124

32 participants
• Gender: 22 male, 10 female
• Ages: 18-29 years old (12), 30-39 years old (9), 40-49 years old (5), 50-59 years old (4), more
than 59 years old (2)
• Education levels: secondary education (3), vocational education (1), Bachelor’s degree (20),
Master’s degree (6), Doctoral degree (2)
• Those with higher education had studied Sciences (3), Social Sciences (10),
Arts and Humanities (4), and Engineering (11)
• Diverse levels of knowledge/expertise on chatbots –null knowledge and expertise (5),
null expertise (5), low expertise (20), medium expertise (2)
• Diverse levels of knowledge on citizen participation –null (7), low (16), medium (9)
125

• Objective metrics
• Subjective questionnaires
126

• More user activity
• No significant difference in the avg. number of sessions per user (between groups)
• Longer sessions in the experimental group
- Increase of 45.6% in the avg. session duration (from 16.0 to 23.3 minutes)
- Increase of 14.3% (from 56.8 to 64.9) in the avg. number of actions per user
• Higher user engagement and persuasiveness
• Increase of 23.5% (from 1.7 to 2.1) in the avg. number of feedback actions per user
• Meaningful exploration of arguments (avg. 7.4 actions per user)
• Better user opinions
• About the chatbot: highly efficient, quite effective, moderately easy to use
• About the argumentative information: higher perception of transparency and fairness
127

• Participants’ suggestions
• A more “natural” conversation with the chatbot
• A more fluent transition between browsed proposals
• Facilities to read proposals with large descriptions
• Future research directions
• Personalized recommendation mechanisms to proactively present relevant content to the user, thus
mitigating the information overload problem
• Richer data structures, analysis and visualizations for facilitating decision making
• Functionalities oriented to citizen collaboration
• Integration of external data sources, such as open government data and news items
128

Contents
1. E-participation
2. Decide Madrid
7. Conclusions
129

Neural network-based argument extraction
• Argument retrieval aims at automatically extracting structured argumentative
information existing in a text corpus
• It has been commonly modeled as a pipeline of three tasks, namely argument
segmentation, argument component classification, and argument relation recognition
• We investigate the application of transformer-based deep learning to jointly
address the above tasks as a single end-to-end sequence tagging problem
130

Deep neural network architecture
• 1st block: BETO Language model
• A BERT-based model trained on a Spanish corpus of Wikipedia articles, legal texts,
and TED Talks transcripts
- 12 encoders with a hidden layer size of 768 units, and 12 self-attention heads
• 2nd block: generic layers of feed-forward neural networks
• 3rd block: task-specific layers that address the following argument mining tasks
• Identification of argumentative units (BIO tagging task)
• Classification of argumentative components: premise, claim, major claim, empty
• Recognition of argumentative relations: 17 subtypes of the 2-level taxonomy
• Classification of argumentative relation intents: support, attack, empty
131
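The BIO tagging task used to identify argumentative units can be illustrated independently of the neural network. The sketch below only shows how token spans map to B/I/O tags and back; the example sentence and span boundaries are made up, and the multi-task architecture itself is not reproduced.

```python
def bio_tags(tokens, spans):
    """Encode argumentative units, given as (start, end) token ranges, as B/I/O tags."""
    tags = ["O"] * len(tokens)
    for start, end in spans:  # end is exclusive
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
    return tags


def decode_spans(tags):
    """Recover (start, end) token ranges from a sequence of B/I/O tags."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            if start is not None:
                spans.append((start, i))  # a new B closes the previous unit
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


# Hypothetical annotated sentence: a claim, a linker left outside, and a premise.
tokens = "Pets should travel on the metro because they are family".split()
tags = bio_tags(tokens, [(0, 6), (7, 10)])
print(list(zip(tokens, tags)))
print(decode_spans(tags))
```

In a sequence tagging model, each token receives one such tag per task, so the same token-level output layout can carry component types and relation labels alongside the B/I/O boundaries.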

• Input
• Annotated sentences from citizen comments
• Deep neural network configuration
132

• ARGAEL: ARGument Annotation and Evaluation tooL
• Simple annotation view: the user identifies argument components and relations (and their types)
133

• Assisted annotation view: the user has access to others’ argument annotations
134

• Evaluation view: the user evaluates others’ argument annotations
135

• Argument component (AC) annotations and evaluations
• Argument relation (AR) annotations and evaluations
136

• Some results of the argument annotation process on the Decide Madrid dataset
137

• Some preliminary results
• Argument identification
• Argument component classification
138

• Some preliminary results
• Relation type classification
• Relation intent classification
139

Contents
1. E-participation
2. Decide Madrid
• Recommender systems in a nutshell
• Personalized recommendations
• Context-aware recommendations
7. Conclusions
140
Disclaimer: some of the materials of this subsection have been created by
Prof. Pablo Castells for his information retrieval master course at EPS-UAM.

Recommender systems in a nutshell
141
Is it possible to help the user to find
information without asking for it?
How to customize the process?

• Personalized recommendations
142

• Many ways to make recommendations
• Spotify: https://www.music-tomorrow.com/blog/how-spotify-recommendation-system-works-a-complete-guide-2022
• Instagram: https://ai.facebook.com/blog/powered-by-ai-instagrams-explore-recommender-system
• Netflix: https://research.netflix.com/research-area/recommendations
https://scale.com/blog/Netflix-Recommendation-Personalization-TransformX-Scale-AI-Insights
• Google Play: https://deepmind.com/blog/article/Advanced-machine-learning-helps-Play-Store-users-discover-personalised-apps
145

• It is estimated that recommendations produce…
• 20% of sales on Amazon
• 60% of streaming on YouTube
• 80% of streaming on Netflix
• ∼10% of electronic commerce
• Recommendation has a large market to tap into
• It seems possible to target beyond ∼10% of engagement
• Many companies aim to exploit such potential
146

• Situations with option overload
• 1994 → 0.5 million different products on sale in the USA
• 2010 → 24 million products on Amazon alone
• Recommendation = Personalized IR without explicit query
• First initiatives published in 1992 (Tapestry at Xerox Parc)
• Precedents: user models based on stereotypes (late 70s)
• Conferences: RecSys, SIGIR, ECIR, UMAP
• Confluence with other areas: Machine Learning (ICML, ECML, IJML, etc.), Data Mining
(KDD, etc.), Artificial Intelligence (IJCAI, AAAI), Human Computer Interaction (IUI)
147

• Non-personalized recommendations
148

• Contextualized recommendations
149

• Utility of recommender systems
150
Jannach, D. and Adomavicius, G. 2016. Recommendations with a purpose. In Proceedings of the 10th
ACM Conference on Recommender Systems (RecSys ’16), pp. 7-10.

• User preferences
151
Ratings
Reviews
Categorical
Thumbs up / down

• Personalized recommendations: problem formulation
152

• Problem formulation
• Input
- A set U of users
- A set I of items
- A sorted set R of values, e.g., R = { 1, 2, 3, 4, 5 }
- A functional relation r : U x I → R
- Typically, r(u,i) is a “rating”, and represents user u’s assessment of item i on scale R
- This input can be seen as a matrix of ratings
- Most of its values (95% and more in general) are unknown
• Goal
- Predicting the values r(u,x) of items x for a user u who has not evaluated such items
- The unknown values r(u,x) are considered for recommending x to u
- In general, generating a sorted list of items that can be of interest for the user
- This goal is commonly referred to as generating the “top n” recommendations
153
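The formulation can be made concrete with a sparse rating store and a top-n routine. Everything below is illustrative: the users, items and ratings are made up, and `item_mean` is a placeholder, non-personalized predictor standing in for whichever recommendation model estimates the unknown r(u,x).

```python
# Known ratings r(u, i), stored sparsely; all names and values are made up.
ratings = {
    ("ana", "i1"): 5, ("ana", "i2"): 3,
    ("luis", "i1"): 4, ("luis", "i3"): 2, ("luis", "i4"): 5,
    ("eva", "i2"): 4, ("eva", "i3"): 1,
}
users = ["ana", "luis", "eva"]
items = ["i1", "i2", "i3", "i4"]


def sparsity(ratings, users, items):
    """Fraction of the |U| x |I| rating matrix that is unknown."""
    return 1.0 - len(ratings) / (len(users) * len(items))


def item_mean(user, item):
    """Placeholder predictor: mean known rating of the item (ignores the user)."""
    values = [r for (_, i), r in ratings.items() if i == item]
    return sum(values) / len(values) if values else 0.0


def top_n(user, predict, n=2):
    """Rank the items the user has not rated by their predicted rating."""
    candidates = [i for i in items if (user, i) not in ratings]
    return sorted(candidates, key=lambda i: predict(user, i), reverse=True)[:n]


print(round(sparsity(ratings, users, items), 3))
print(top_n("ana", item_mean))
```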

• Problem formulation
• Implicit user feedback (preferences)
- No need for asking the user
- r : U x I → {0, 1} binary, e.g., “u buys i”
- It can be treated as a particular case R = {0, 1}
- r : U x I → R measuring how often user u accesses item i, e.g., listening to music
- Binarized to 1 if frequencies > 0
- Applying a conversion function frequency → rating (e.g., percentiles)
- r : U x I → P(T) for users u annotating (tagging) items x, where T is a set of tags
- It can be treated as “1 tag 1 vote”, but more elaborate techniques can be
applied on graphs of tags, items, users…
- Timestamps
- Frequency data: r(u,i) is a set of timestamps
- Rating data: r(u,i) is a [rating, timestamp] pair
154
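The two treatments of frequency data mentioned above, binarization and a frequency → rating conversion, can be sketched as follows. The percentile-rank mapping onto a 1 to 5 scale is one possible conversion function chosen for illustration, and the play counts are made up.

```python
def binarize(frequencies):
    """Implicit feedback as r(u, i) = 1 for every item accessed at least once."""
    return {item: 1 for item, f in frequencies.items() if f > 0}


def percentile_ratings(frequencies, scale=5):
    """Map each positive frequency to a rating in 1..scale by its percentile rank."""
    positive = [f for f in frequencies.values() if f > 0]
    n = len(positive)
    ratings = {}
    for item, f in frequencies.items():
        if f <= 0:
            continue  # never accessed: the rating stays unknown
        rank = sum(1 for x in positive if x <= f)  # how many values are <= f
        ratings[item] = -(-scale * rank // n)      # integer ceiling of scale * rank / n
    return ratings


# Hypothetical play counts of one user.
plays = {"a": 1, "b": 2, "c": 5, "d": 10, "e": 50, "f": 0}
print(binarize(plays))
print(percentile_ratings(plays))
```

Using ranks rather than raw counts keeps a single heavily played item from compressing all other ratings toward the bottom of the scale.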

• Types of recommendation strategies
• Content-based filtering (CB)
- Item features are considered: words (text case), descriptors (metadata), etc.
- Items are compared with user information collected in a preference profile
- A user profile is long-term; it can be acquired through decision trees, neural networks, etc.
• Collaborative filtering (CF)
- Items are opaque
- The profiles of other users with similar traits (tastes, behavior patterns, demographic data,
etc.) are used to recommend items
• Hybrid filtering: combining different recommendation strategies
- Combining the output of CB and CF
- Inserting CB elements into CF or vice versa
- Unified models
155

• Content-based filtering
156

• Content-based filtering
• Each user is recommended without looking at others
• A feature space for the items is needed → items are represented as vectors in such space
- “Data” that describe the items, structured or unstructured, e.g., item metadata (author,
place, language, categories, tags), words in the text associated with items, etc.
- Binary, integer or real values
• A similarity function on the feature space, e.g.,
- Cosine similarity for numerical features
- Jaccard similarity for binary features
• Two very common methods: kNN- and centroid-based
- but many others based on classification can be used
(where users essentially play the role of class)
157
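A centroid-based content-based recommender can be sketched as below. The slides name the method without giving its formula, so the rating-weighted centroid, the three-dimensional feature space (transport, environment, culture) and the example data are assumptions for illustration.

```python
import math


def cosine(x, y):
    """Cosine similarity between two numerical feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def centroid_profile(rated, item_vectors):
    """User profile: rating-weighted mean of the vectors of the rated items."""
    dim = len(next(iter(item_vectors.values())))
    total = sum(rated.values())
    return [sum(r * item_vectors[i][d] for i, r in rated.items()) / total
            for d in range(dim)]


def recommend(rated, item_vectors, n=2):
    """Rank unrated items by similarity to the user's centroid profile."""
    profile = centroid_profile(rated, item_vectors)
    candidates = [i for i in item_vectors if i not in rated]
    return sorted(candidates, key=lambda i: cosine(profile, item_vectors[i]),
                  reverse=True)[:n]


# Hypothetical proposals described over (transport, environment, culture).
item_vectors = {"p1": [1, 0, 0], "p2": [1, 1, 0], "p3": [0, 0, 1], "p4": [0, 1, 0]}
rated = {"p1": 5, "p3": 1}  # made-up ratings of one user
print(recommend(rated, item_vectors))
```

Only the target user's own ratings are used, which matches the point above that each user is recommended without looking at others.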

• Content-based filtering: kNN-based
• Adaptation of the kNN classification algorithm
- In classification, r(u,i) would be binary
- Ranking of “instances” (items) for each “class” (user), rather than the opposite
158
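One common way to write the kNN-based variant is to score a candidate item by its k most similar items among those the user has rated, weighting each rating by the similarity. The slide's exact formula is not available in this transcript, so the scoring rule below, the Jaccard similarity on tag sets and the example data are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of binary features (here, tags)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_score(item, rated, item_tags, k=2):
    """Sum of sim(item, j) * r(u, j) over the k rated items j most similar to item."""
    neighbours = sorted(rated, key=lambda j: jaccard(item_tags[item], item_tags[j]),
                        reverse=True)[:k]
    return sum(jaccard(item_tags[item], item_tags[j]) * rated[j] for j in neighbours)


def recommend(rated, item_tags, k=2, n=2):
    """Rank unrated items by their kNN-based score for this user."""
    candidates = [i for i in item_tags if i not in rated]
    return sorted(candidates, key=lambda i: knn_score(i, rated, item_tags, k),
                  reverse=True)[:n]


# Hypothetical proposals described by tags, and one user's made-up ratings.
item_tags = {
    "p1": {"bike", "transport"},
    "p2": {"bike", "lanes", "transport"},
    "p3": {"museum", "culture"},
    "p4": {"culture", "festival"},
    "p5": {"bus", "transport"},
}
rated = {"p1": 5, "p3": 2}
print(recommend(rated, item_tags))
```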
