Crowdsourcing Research Opportunities: Lessons from Natural Language Processing

How is crowdsourcing used in science? How did it impact the field of NLP? A presentation of the key points described in: Marta Sabou, Kalina Bontcheva, Arno Scharl (2012) Crowdsourcing Research Opportunities: Lessons from Natural Language Processing. In 12th International Conference on Knowledge Management and Knowledge Technologies (i-KNOW), Special Track on Research 2.0.

Crowdsourcing Research Opportunities:
Lessons from Natural Language Processing
Marta Sabou, Kalina Bontcheva, Arno Scharl

Crowdsourcing

Undefined and generally large group

Crowdsourcing in Science
Crowdsourcing for NLP
Challenges

Crowdsourcing in science is not new

Sir Francis Galton, “VOX POPULI”

Citizen science, from the early 20th century, 60,000 – 80,000 yearly volunteers

Genre 1: Mechanised Labour
•Participants (workers) are paid a small amount of money to complete easy tasks (HIT = Human Intelligence Task)

Genre 2: Games with a purpose
From 2008
240k players

Crowdsourcing via Facebook

Genre 3: Altruistic Crowdsourcing

>250K players

>670K players

Crowdsourcing in Science - Typical Use
•Harness human intuition to prune solution space

[Diagram: Input, Algorithm, Output, Process/Evaluation]

•Form based data collection
•Labeling, Classification
•Surveys

Crowdsourcing in NLP
Papers relying on crowdsourcing in major NLP venues

Crowdsourcing Genres in NLP

Benefit 1: Affordable, Large-Scale Resources
•A variety of small to medium-sized resources can be obtained for as little as $100 using Amazon Mechanical Turk (AMT)
•Crowdsourcing is also cost-effective for large resources (Poesio, 2012)

Approach                     $/label    1M labels ($)
Traditional, high quality    1.00       1,000,000
Mechanical Turk              0.38       380,000 (<40%)
Game                         0.19       217,000 (20%)
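
The percentages in the last column relate each total to the traditional one. A minimal sketch of that arithmetic, using only the totals quoted on the slide; the dictionary and function names are illustrative assumptions, not taken from the paper:

```python
# Totals for 1M labels as quoted on the slide (US dollars).
TOTAL_COST_1M = {
    "traditional": 1_000_000,
    "mechanical_turk": 380_000,
    "game": 217_000,
}


def share_of_traditional(costs):
    """Return each approach's total as a fraction of the traditional total."""
    baseline = costs["traditional"]
    return {name: total / baseline for name, total in costs.items()}


shares = share_of_traditional(TOTAL_COST_1M)
for name, share in shares.items():
    print(f"{name}: {share:.1%} of the traditional cost")
```

With these totals, Mechanical Turk comes out at 38% and the game at roughly 22% of the traditional cost, which the slide rounds to 20%.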

Benefit 2: Diversification of research

Challenge 1: Contributor Selection and Training
•From: prior to resource creation
•To: during the resource creation

Challenge 2: Aggregation and Quality Control

•From: a few experts’ annotations
•To: multiple, noisy annotations from non-experts
•Approach 1: Statistical techniques
  •Simplest (and most popular): majority voting
  •More complex: machine learning model trained on various features
•Approach 2: Crowdsourcing the QC process itself
  •HIT1 (Create): Translate the following sentence:
  •HIT2 (Verify): Which of these 5 sentences is the best translation?
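
Majority voting, the simplest aggregation technique named above, can be sketched in a few lines. This is an illustration under assumptions rather than the procedure of any particular study: the item ids, labels, and the choice to return no label on a tie are all hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label, or None when the top count is tied."""
    if not labels:
        return None
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie: leave the item for review or further annotation
    return counts[0][0]


# Hypothetical annotations: item id -> labels from different workers.
annotations = {
    "sentence-1": ["positive", "positive", "negative"],
    "sentence-2": ["neutral", "negative", "negative", "negative"],
    "sentence-3": ["positive", "negative"],
}

aggregated = {item: majority_vote(labels) for item, labels in annotations.items()}
print(aggregated)
```

Returning no label on a tie is one possible design choice; collecting an odd number of judgements per item is another common way to avoid ties for binary labels.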

Conclusions (What have we learned from NLP?)

•Crowdsourcing is revolutionising NLP research
  •Cheaper resource acquisition
  •Diversification of research agenda
•But requires more complex methodologies
  •For contributor management
  •For quality control and data aggregation
•Other findings: most popular
  •Genre: mechanised labour
  •Task: acquiring input data
  •Problem: solving subjective tasks

User Motivation

•Motivating users
  •Motivations for scientific projects might differ
  •Task granularity might impact motivation
•Promoting learning and science
  •Advertise STEM research to young people
  •Support learning and self-improvement through participation in crowdsourcing

Legal and Ethical Issues
•Acknowledging the crowd’s contribution
  •S. Cooper, [other authors], and Foldit players: Predicting protein structures with a multiplayer online game. Nature, 466(7307):756-760, 2010.
•Ensuring privacy and wellbeing
  •Mechanised labour criticised for low wages (<$2/hour), lack of worker rights
  •Prevent addiction, prolonged use & user exploitation
•Licensing and consent
  •Some clearly state the use of Creative Commons licenses
  •General failure to provide informed consent information

Technical Issues
•Scaling up to large resources
•Preventing bias
•Increasing repeatability
  •Through reuse of crowdsourcing elements (e.g., HIT templates)
•uComp - Embedded Human Computation for Knowledge Extraction and Evaluation
  •3-year project, starting November 2012
  •Develops a scalable and generic HC framework for knowledge creation
  •Provides reusable HC elements

Editor's Notes

How does crowdsourcing relate to Research 2.0? My talk will illustrate how certain web technologies can narrow the gap between scientists on the one hand and ordinary citizens on the other, thus enabling a certain form of Research 2.0. If Web 2.0 is often associated with “user-generated content”, Research 2.0, at least the kind enabled by crowdsourcing, is “user-generated/supported science”. Taking the field of NLP as an example, I will discuss how crowdsourcing is changing research practices and how it affects this scientific discipline. Research 2.0 deals with the involvement of the web in science. It spans from the use of Web 2.0 tools and technologies in research to a more open and sharing approach to science. Some definitions of Research 2.0 even include notions of a methodological change due to the abundance of data and the nature of the socio-technical systems on the web. The focus here is the change in scientific practices due to the involvement of Research 2.0 tools and technologies in the research process, and the effects this has on science itself.
But not projects that:
•Do not have the creation of scientific data as their main goal (e.g., Wikipedia)
•Use crowds to support auxiliary scientific processes (e.g., Mendeley)
•Recruit online but experiment in the lab
•Recruit processing power and NOT human effort (SETI@home)
•Have scientific staff alone as contributors, e.g., collaboratories
In fact, as early as 1907, Sir Francis Galton (Darwin’s cousin and a brilliant Victorian scientist) published a Nature article entitled “Vox Populi” (the voice of the people, the voice of the crowd), in which he describes his experiment at a livestock fair: 787 people were asked to estimate the weight of an ox and, while none came close to the real value, the mean of the guesses was almost spot-on. Meanwhile, other societies were using the crowd differently, namely to support them in gathering scientific data. From the early 20th century, the Audubon Society has relied on volunteers to count species of local birds. Their campaigns continue to this day, and in 2012 volunteers submitted over 100,000 checklists, leading to observations of 623 species and over 17 million individual birds. These activities are often termed citizen science.
•This is not a novel phenomenon
•Citizen science projects have been around since the beginning of the last century (at least)
•There is a vast landscape and variety of citizen science projects where scientists call on the public for help: some examples, including from Lora’s paper (her talk might have some mentions as well)
•IT enables virtual citizen science projects, and this upsurge is a direct consequence of new and improved ways to involve the public in scientific processes
Participants contribute while having fun.
13 Apr 2012 | 16:35 EDT | Posted by Rebecca Hersher: “Two years ago, FoldIt made headlines, lots of them, when players of the online protein-folding video game took three weeks to solve the three dimensional structure of a simian retroviral protein that is used in animal models of HIV, but whose structure had eluded biochemists for more than a decade.” (http://blogs.nature.com/spoonful/2012/04/foldit-games-next-play-crowdsourcing-better-drug-design.html)
Phylo is an experimental video game about multiple sequence alignment optimisation. “Since the launch in November 2010, we received more than 350,000 solutions submitted from more than 12,000 registered users. Our results show that solutions submitted contributed to improving the accuracy of up to 70% of the alignment blocks considered.” It is about showing that humans can aid algorithms rather than comparing human and machine performance.
In 2008, the group built a Facebook game that required players to rate the sentiment associated with a sentence on a five-value scale, then used the ratings as a training corpus for the sentiment detection module. Over 800 players played the game. In 2009 the game was released in a slightly different form, with the aim of gathering sentiment lexicons, i.e., associations between words and their sentiment polarity (ratings from as many as 12 players were averaged to get the final value). The game ran in 7 different languages and attracted over 4,000 players. Let this be an introductory example of a crowdsourcing project; however, crowdsourcing is not a new phenomenon.
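The averaging step for the sentiment lexicon can be sketched as follows. This is a minimal illustration, not the game's actual code: the -2..+2 polarity scale, the example words, and all names are assumptions; only the cap of 12 ratings per word comes from the note above.

```python
from statistics import mean

MAX_RATINGS_PER_WORD = 12  # the note mentions "as many as 12 players"


def build_lexicon(ratings_by_word, max_ratings=MAX_RATINGS_PER_WORD):
    """Average up to max_ratings player ratings per word; skip unrated words."""
    return {
        word: mean(ratings[:max_ratings])
        for word, ratings in ratings_by_word.items()
        if ratings
    }


# Hypothetical player ratings on an assumed -2..+2 polarity scale.
ratings_by_word = {
    "excellent": [2, 2, 1, 2],
    "awful": [-2, -1, -2],
    "table": [0, 0, 1, -1],
}

lexicon = build_lexicon(ratings_by_word)
print(lexicon)
```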
Volunteers contribute because they are interested in a domain or support a cause.
More languages
•E.g., Urdu, Arabic, Haitian Creole
•Irvine and Klementiev create lexicons between English and 37 low-resource languages
Diverse types of text (besides newswire)
•Emails, Twitter feeds, augmentative and alternative communication texts
Speech: transcription, accent rating, assessment of dialog systems
Subjective tasks
•Sentiment detection, translation, word sense disambiguation, anaphora resolution, question answering, textual entailment, text summarization …
Niche language phenomena
Lab experiments reproduced at a fraction of their cost
•E.g., contextual predictivity (Cloze task), corpus trends
Crowdsourcing the QC process:
•Completely new with respect to traditional approaches
•Uses “create-verify” workflows
•Widespread technique for translation tasks, less so for labeling
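A create-verify workflow can be sketched as two stages: one group of workers produces candidates, a second group votes on them. The sketch below assumes the Create HIT has already returned its candidates and each Verify worker returns the index of their preferred one; the function name and example data are hypothetical.

```python
from collections import Counter


def pick_best(candidates, verify_votes):
    """Return (candidate, vote_count) for the most-voted candidate.

    candidates: outputs of the Create HIT.
    verify_votes: one index into candidates per Verify worker.
    Ties go to the candidate that received its first vote earliest.
    """
    if not candidates or not verify_votes:
        return None
    index, count = Counter(verify_votes).most_common(1)[0]
    return candidates[index], count


# Hypothetical Create HIT output: five candidate translations of one sentence.
candidates = [
    "Good morning",
    "Good tomorrow",
    "Morning good",
    "Nice morning",
    "Good morning to you",
]
# Hypothetical Verify HIT output: each worker's choice of the best candidate.
verify_votes = [0, 4, 0, 1, 0]

best, votes = pick_best(candidates, verify_votes)
print(best, votes)
```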
STEM (Science, Technology, Engineering, Mathematics): harness increased visibility and ease of engagement in social networks to make STEM research more attractive and understandable => more young people to study STEM
