2. EasyBib is an automatic
bibliography composer.
Students use it to cite
sources for their research.
3. We teach information
literacy.
18% of all student papers include plagiarism (1)
50% likelihood of using a credible vs. non-credible source (1)
4% increase in the use of paper mills and cheating sites (1)
~16% of students are adequately prepared for college (2)
Source: (1) TurnItIn; (2) Both Sides Now: Librarians Looking at Information Literacy from High School and College.
6. Unprepared students
make for unprepared
adults.
It’s not just students who
plagiarize:
•Pal Schmitt, former president
of Hungary
•German education minister
•Jayson Blair (former New
York Times writer)
•Jonah Lehrer, journalist and
author
•Fareed Zakaria (reporter,
author, host)
7. We are in the right place
to figure it out.
Over half of all
students in the
US (40M)
Over half a billion
citations
8. We asked ourselves the
following questions:
•What are students using in their
research?
•How good are their sources?
•How can we help them?
9. We started with the
basics.
_gaq.push([
'citations._trackEvent',
citationTitle,
citationPublisher,
citationId
]);
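As a rough sketch of the tracking call above: each citation becomes one Google Analytics event on a tracker named "citations", with the title as the event category, the publisher as the action, and the citation id as the label. The helper name buildCitationEvent and the sample citation are hypothetical, and the plain array stands in for the queue that ga.js reads once it loads.

```javascript
// Stand-in for the async queue; on a real page ga.js drains this array.
var _gaq = _gaq || [];

// Hypothetical helper: builds the command array for one citation event.
// Order follows _trackEvent(category, action, opt_label).
function buildCitationEvent(citation) {
  return [
    'citations._trackEvent',
    citation.title,      // event category
    citation.publisher,  // event action
    String(citation.id)  // event label (kept as a string)
  ];
}

// Illustrative citation; the values are made up.
_gaq.push(buildCitationEvent({
  title: 'Sample Article',
  publisher: 'Sample Publisher',
  id: 12345
}));

console.log(_gaq);
```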
10. Here’s what we found.
Top sources 2010
•Wikipedia
•Google
1. The New York Times
2. CIA World Factbook
3. Oracle Thinkquest
4. Buzzle
5. US BLS
6. Dictionary.com
7. CDC
8. PBS
9. eHow
Source: EasyBib Google Analytics Oct 2010-Nov 2010 data.
11. What could we do?
•Warn them when their source’s
credibility is in question
•Analyze the quality of their full
bibliography
•Make it easier to not plagiarize
•Suggest better sources
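The first two ideas above could be sketched roughly as follows. This is an illustration only, not how EasyBib rates sources: the CREDIBILITY table, its host names and labels, and the function names are all hypothetical.

```javascript
// Hypothetical credibility labels keyed by host; entries are illustrative.
const CREDIBILITY = new Map([
  ["example-journal.org", "credible"],
  ["example-agency.gov", "credible"],
  ["example-contentfarm.com", "questionable"]
]);

// Extract the host from a URL, dropping a leading "www.".
function hostOf(url) {
  return new URL(url).hostname.replace(/^www\./, "");
}

// Returns the share of credible sources plus the URLs that deserve a warning.
function reviewBibliography(urls) {
  const warnings = [];
  let credible = 0;
  for (const url of urls) {
    const label = CREDIBILITY.get(hostOf(url)) || "unknown";
    if (label === "credible") credible += 1;
    if (label === "questionable") warnings.push(url);
  }
  const credibleShare = urls.length ? credible / urls.length : 0;
  return { credibleShare, warnings };
}

console.log(reviewBibliography([
  "http://example-journal.org/article/1",
  "http://www.example-contentfarm.com/how-to",
  "http://example-agency.gov/report",
  "http://example-unknown.net/page"
]));
```

Sources missing from the table are treated as "unknown" rather than flagged, so a warning only appears when there is an explicit reason for it.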
16. So after all this...
Does it blend (tm) ?
1. Wikipedia
2. Bio.com
3. History.com
4. PBS
5. Mayo Clinic
6. CDC
7. The New York Times
8. BBC
9. CNN
10. WebMD
11. US BLS
• Wikipedia still on top,
but ...
• No content farms, no
Google.
• WebMD is questionable,
but its credibility can be
argued for.
Source: Apr-May 2013 Google Analytics data
17. We have to admit, it’s getting
better...
19. How does the Research
engine currently work?
Cloudant (CouchDB)
MySQL
Lucene/Solr
Slow, asynchronous, lots of moving
parts.
20. Starting to do a bit more
StatsD::increment($metrics);
$response = $rediska->publish(
array('realtime'),
$citation
);
21. There’s a lot more we can
do, and data will help us.
22. Cloudant Search
•Full-text search integrated into Cloudant
•Lucene syntax
•Indexing is easy
function(doc){
index("title", doc.title, {"store": "yes"});
}
•Grouping of sources via chained map-reduce
map: function(doc){
if (doc.title){ emit({"title": doc.title}, 1); }
}
reduce: _sum
dbcopy: citationGroup
------
map: function(doc){
if (doc.key && doc.key.title){ emit(doc.value, doc.key.title); }
}
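To make the chaining concrete, here is a plain-JavaScript simulation of the two stages, with no Cloudant involved. It assumes the first view's reduced rows are copied into the second database as documents with key and value fields; the function names and sample data are hypothetical.

```javascript
// Stage 1: map emits ({title}, 1); the reduce sums per key, like _sum.
function countByTitle(docs) {
  const counts = new Map();
  for (const doc of docs) {
    if (doc.title) {
      counts.set(doc.title, (counts.get(doc.title) || 0) + 1);
    }
  }
  // One output document per reduced row, shaped as { key, value }.
  return Array.from(counts, ([title, value]) => ({ key: { title }, value }));
}

// Stage 2: map over the copied rows, emitting (count, title).
// View rows come back ordered by key, so sort ascending by count.
function titlesByCount(groupDocs) {
  return groupDocs
    .filter((doc) => doc.key && doc.key.title)
    .map((doc) => ({ key: doc.value, value: doc.key.title }))
    .sort((a, b) => a.key - b.key);
}

// Illustrative citations; one has no title and is skipped by stage 1.
const sampleCitations = [
  { title: "Wikipedia" },
  { title: "CDC" },
  { title: "Wikipedia" },
  { publisher: "No title here" },
  { title: "PBS" },
  { title: "Wikipedia" },
  { title: "CDC" }
];

console.log(titlesByCount(countByTitle(sampleCitations)));
```

Re-keying by count in the second stage is what lets a query ask for the most-cited titles directly, instead of sorting the grouped results on the client.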
23. Live data analysis.
Crowdsourcing.
•Use Cloudant Search to power
feedback on sources (# of times
cited in real time, quality of
bibliographies derived from)
•Allow users to submit their own
credibility evaluations and aggregate
results
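Aggregating user-submitted evaluations could start as simply as a per-source average. The sketch below assumes each evaluation is a record with a source and a numeric score; the field names, the 1 to 5 scale, and the function name are hypothetical.

```javascript
// Averages user-submitted credibility scores per source.
function aggregateEvaluations(evaluations) {
  const totals = new Map();
  for (const { source, score } of evaluations) {
    const t = totals.get(source) || { sum: 0, count: 0 };
    t.sum += score;
    t.count += 1;
    totals.set(source, t);
  }
  const result = {};
  for (const [source, { sum, count }] of totals) {
    // Keep the vote count so thinly rated sources can be treated with caution.
    result[source] = { mean: sum / count, votes: count };
  }
  return result;
}

// Illustrative evaluations on an assumed 1-5 scale.
console.log(aggregateEvaluations([
  { source: "example-journal.org", score: 4 },
  { source: "example-journal.org", score: 5 },
  { source: "example-contentfarm.com", score: 2 }
]));
```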
24. SourceRank!
Credibility weighting + crowdsourcing
Synchronous & realtime via Cloudant Search
Value nodes based on nearest neighbors
And other things...
25. Driving growth
We have the largest UGC citation
set. Making this searchable
creates a “moat.”
The more people that use EasyBib,
the better the tool becomes.
26. What about other data
analytics tools?
Too stretched to learn more complex tools
(looking for easy answers)
Costs (GA is free!)
EMR, Hadoop, Redshift, Cloudant Search:
This is what’s next.