Recommendation Engines for Scientific Literature Search

       Kris Jack, PhD
    Data Mining Team Lead

    2 recommendation use cases

    literature search with Mendeley

    use case 1: related research

    use case 2: personalised recommendations
Use Cases

Two types of recommendation use cases:

    1) Related Research
       given 1 research article
       find other related articles

    2) Personalised Recommendations
       given a user's profile (e.g. interests)
       find articles of interest to them
Use Cases

My secondment (Dec-Feb):

    1) Related Research
       given 1 research article
       find other related articles

    2) Personalised Recommendations
       given a user's profile (e.g. interests)
       find articles of interest to them
Literature Search
 Using Mendeley

     Use only Mendeley to perform literature search for:
       Related research
       Personalised recommendations

     Eating your own dog food...
Queries: “content similarity”, “semantic similarity”, “semantic relatedness”,
         “PubMed related articles”, “Google Scholar related articles”

Found: 0
Found: 1

Literature Search
    Using Mendeley

    Summary of Results

   Strategy                Num Docs   Comment
   Catalogue Search        19         9 from “Related Research”
   Group Search            0          Needs work
   Perso Recommendations   45         Led to a group with 37 docs!


Eating your own dog food... Tastes good!

Found: 64 => 31 docs, read 14 so far,
   so what do they say...?
Use Cases

            1) Related Research
              given 1 research article
              find other related articles
Use Case 1: Related Research

 7 highly relevant papers (related research for scientific articles)

 Q1/4: How are the systems evaluated?

     User study (e.g. Likert scale to rate relatedness
     between documents). (Beel & Gipp, 2010)

     TREC collections with hand classified 'related
     articles' (e.g. TREC 2005 genomics track). (Lin &
     Wilbur, 2007)

     Try to reconstruct a document's reference list
     (Pohl, Radlinski, & Joachims, 2007; Vellino, 2009)
Use Case 1: Related Research

 7 highly relevant papers (related research for scientific articles)

 Q2/4: How are the systems trained?

     Paper reference lists (Pohl et al., 2007; Vellino, 2009)

     Usage data (e.g. PubMed, arXiv) (Lin & Wilbur, 2007)

     Document content (e.g. metadata, co-citation,
     bibliographic coupling) (Gipp, Beel, & Hentschel, 2009)

     Collocation in mind maps (Beel & Gipp, 2010)
Use Case 1: Related Research

 7 highly relevant papers (related research for scientific articles)

 Q3/4: Which techniques are applied?

     BM25 (Lin & Wilbur, 2007)

     Topic modelling (Lin & Wilbur, 2007)

     Collaborative filtering (Pohl et al., 2007)

     Bespoke heuristics for feature extraction (e.g. in-text
     citation metrics for same sentence, paragraph). (Pohl et
     al., 2007; Gipp et al., 2009)
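
To make the BM25 entry above concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is an illustration only, not the scoring used in any of the cited papers: the function name `bm25_scores`, the pre-tokenised input, the smoothed idf form and the parameter values `k1=1.2` and `b=0.75` are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenised document against the query with Okapi BM25.

    k1 and b are commonly used defaults, assumed here for illustration.
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            # Smoothed idf; the +1 inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturation with document length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

For related research, the "query" would be the tokens of the seed article and the documents would be the candidate articles, ranked by descending score.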
Use Case 1: Related Research

 7 highly relevant papers (related research for scientific articles)

 Q4/4: Which techniques have most success?

     Topic modelling slightly improves on BM25 (MEDLINE
     abstracts) (Lin & Wilbur, 2007):
     - BM25 = 0.383 precision @ 5
     - PMRA = 0.399 precision @ 5

     Seeding CF with usage data from arXiv won out over
     using citation lists (Pohl et al., 2007)

     Not yet found significant results that show content-
     based or CF methods are better for this task
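
The "precision @ 5" figures above follow the usual definition: the fraction of the top 5 returned articles that are actually relevant. A minimal sketch, where the function name and argument names are hypothetical:

```python
def precision_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the top-k ranked items that appear in the relevant set."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Divide by k (not by len(top_k)) so short result lists are penalised.
    return hits / k
```

For example, a ranking with 2 relevant articles in its top 5 scores 0.4.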
Use Case 1: Related Research

 Progress so far...

 Q1/2 How do we evaluate our system?

     Construct a non-complex data set of related research:
       include groups with 10-20 documents (i.e. topics)
       no overlaps between groups (i.e. documents in common)
       only take documents that are recognised as being in English
       document metadata must be 'complete' (i.e. has title, year, author,
     published in, abstract, filehash, tags/keywords/MeSH terms)

     → 4,382 groups
     → mean size = 14
     → 60,715 individual documents

     Given a doc, aim to retrieve the other docs from its group
       tf-idf with Lucene implementation
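
The group-based evaluation above can be sketched as follows. This is a plain-Python stand-in using tf-idf weights with cosine similarity; it is not Lucene's scoring, and the function names, the pre-tokenised input and the `group_of` list (group id per document) are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs_tokens):
    """Build a sparse tf-idf vector (dict term -> weight) per document."""
    n_docs = len(docs_tokens)
    df = Counter(t for d in docs_tokens for t in set(d))
    return [
        {t: c * math.log(n_docs / df[t]) for t, c in Counter(doc).items()}
        for doc in docs_tokens
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_precision_at_k(docs_tokens, group_of, k=5):
    """Use each doc as a query; count top-k results from the same group."""
    vectors = tfidf_vectors(docs_tokens)
    total = 0.0
    for i, query in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        ranked = sorted(others, key=lambda j: cosine(query, vectors[j]), reverse=True)
        hits = sum(1 for j in ranked[:k] if group_of[j] == group_of[i])
        total += hits / k
    return total / len(vectors)
```

Running the same function on tokens drawn from a single metadata field (title only, abstract only, and so on) would give one score per field.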
Use Case 1: Related Research

 Progress so far...

 Q1/2 How do we evaluate our system?

     [Figure: Metadata Presence in Documents. y-axis: % of documents that
     field appears in; x-axis: metadata field; series: Evaluation Data Set,
     Group, Catalogue]
Use Case 1: Related Research

 Progress so far...

 Q2/2 What are our results?

     [Figure: tf-idf Precision per Field for Complete Data Set. y-axis:
     Precision @ 5; x-axis: metadata field (abstract, title, generalKeyword,
     mesh-term, author, keyword, tag)]
Use Case 1: Related Research

 Progress so far...

 Q2/2 What are our results?

     [Figure: tf-idf Precision per Field when Field is Available. y-axis:
     Precision @ 5; x-axis: metadata field (tag, abstract, mesh-term, title,
     general-keyword, author, keyword)]
Use Case 1: Related Research

 Progress so far...

 Q2/2 What are our results?

     [Figure: tf-idf Precision for Field Combos for Complete Data Set. y-axis:
     Precision @ 5; x-axis: metadata field(s) (bestCombo, abstract, title,
     generalKeyword, mesh-term, author, keyword, tag)]

     BestCombo = abstract+author+general-keyword+tag+title
Use Case 1: Related Research

 Progress so far...

 Q2/2 What are our results?

     [Figure: tf-idf Precision for Field Combos when Field is Available.
     y-axis: Precision @ 5; x-axis: metadata field(s) (tag, bestCombo,
     abstract, mesh-term, title, general-keyword, author, keyword)]

     BestCombo = abstract+author+general-keyword+tag+title
Use Case 1: Related Research

 Future directions...?

 Evaluate multiple techniques on same data set

    Construct public data set
      similar to current one but with data from only public groups
      analyse composition of data set in detail

      content-based filtering
      collaborative filtering

    Evaluate the different systems on same data set

    ...and let's brainstorm!
Use Cases

            2) Personalised Recommendations
              given a user's profile (e.g. interests)
              find articles of interest to them
Use Case 2: Perso Recommendations

 7 highly relevant papers (perso recs for scientific articles)

 Q1/4: How are the systems evaluated?

     Cross validation on user libraries (Bogers & van
     Den Bosch, 2009; Wang & Blei, 2011)

     User studies (McNee, Kapoor, & Konstan, 2006;
     Parra-Santander & Brusilovsky, 2009)
Use Case 2: Perso Recommendations

 7 highly relevant papers (perso recs for scientific articles)

 Q2/4: How are the systems trained?

     CiteULike libraries (Bogers & van Den Bosch,
     2009; Parra-Santander & Brusilovsky, 2009;
     Wang & Blei, 2011)

     Documents stand in for users, and their citations for
     documents of interest (McNee et al., 2006)

     User search history (Kapoor et al., 2007)
Use Case 2: Perso Recommendations

 7 highly relevant papers (perso recs for scientific articles)

 Q3/4: Which techniques are applied?

     CF (Parra-Santander & Brusilovsky, 2009; Wang & Blei, 2011)

     LDA (Wang & Blei, 2011)

     Hybrid of CF + LDA (Wang & Blei, 2011)

     BM25 over tags to form user neighbourhood (Parra-Santander &
     Brusilovsky, 2009)

     Item-based and content-based CF (Bogers & van Den Bosch, 2009)

     User-based CF, Naïve Bayes classifier, Probabilistic Latent Semantic
     Indexing, textual TF-IDF-based algorithm (uses document abstracts)
     (McNee et al., 2006)
Use Case 2: Perso Recommendations

 7 highly relevant papers (perso recs for scientific articles)

 Q4/4: Which techniques have most success?

     CF is much better than topic modelling (Wang & Blei, 2011)

     CF-topic modelling hybrid slightly outperforms CF alone (Wang &
     Blei, 2011)

     Content-based filtering performed slightly better than item-based
     filtering on a test set with 1,322 CiteULike users (Bogers & van Den
     Bosch, 2009)

     User-based CF and tf-idf outperformed Naïve Bayes and Probabilistic
     Latent Semantic Indexing significantly (McNee et al., 2006)

     BM25 gave better results than CF, but the study was small scale, with
     just 7 CiteULike users (Parra-Santander & Brusilovsky, 2009)
Use Case 2: Perso Recommendations

 7 highly relevant papers (perso recs for scientific articles)

 Q4/4: Which techniques have most success?

            Advantage                                             Disadvantage
 Content-   Human readable form of their profile                  Tends to over-specialise
 based      Quickly absorb new content without need for ratings

 CF         Works on an abstract item-user level so you don't     Requires a lot of data
            need to 'understand' the content
            Tends to give more novel and creative
Use Case 2: Perso Recommendations

 Our progress so far...

 Q1/2 How do we evaluate our system?

    Construct an evaluation data set from user libraries
      50,000 user libraries
      10-fold cross validation
      libraries vary from 20-500 documents
      preference values are binary (in library = 1; 0 otherwise)

      item-based collaborative filtering recommender

      train recommender and test how well it can reconstruct the users'
    hidden testing libraries
      multiple similarity metrics (e.g. cooccurrence, loglikelihood)
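
As a rough illustration of the setup above, here is a minimal item-based recommender over binary library data that uses raw co-occurrence counts as the item similarity. It is a sketch under stated assumptions, not the system evaluated in these slides: the function names and the `libraries` mapping (user id to set of article ids) are hypothetical, and the log-likelihood metric is not shown.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(libraries):
    """Count, for every pair of articles, how many libraries hold both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in libraries.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user_items, counts, n=10):
    """Rank unseen articles by summed co-occurrence with the user's library."""
    scores = defaultdict(int)
    for item in user_items:
        for other, c in counts.get(item, {}).items():
            if other not in user_items:
                scores[other] += c
    # Highest score first; ties broken by article id for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))[:n]
```

To test reconstruction, one would build the counts from the visible part of each library, call `recommend` on it, and compare the top 10 against the hidden part.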
Use Case 2: Perso Recommendations

 Our progress so far...

 Q2/2 What are our results?

    Cross validation:
      0.1 precision @ 10 articles

    Usage logs:
      0.4 precision @ 10 articles
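
The cross validation figure relies on hiding part of each library and trying to recover it. A minimal sketch of a 10-fold split of one library is below. The slides do not say exactly how folds were drawn, so splitting within a single user's library, the function name `kfold_splits` and the fixed seed are all assumptions.

```python
import random


def kfold_splits(items, n_folds=10, seed=0):
    """Yield (visible, hidden) sets, hiding one fold of the library at a time."""
    items = sorted(items)
    random.Random(seed).shuffle(items)
    # Deal the shuffled items round-robin into n_folds folds.
    folds = [items[i::n_folds] for i in range(n_folds)]
    for held_out in folds:
        hidden = set(held_out)
        visible = set(items) - hidden
        yield visible, hidden
```

Each `visible` set would be used for training and each `hidden` set for measuring precision @ 10, with the final figure averaged over folds and users.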
Use Case 2: Perso Recommendations

 Our progress so far...

 Q2/2 What are our results?
     [Figure: y-axis: Precision at 10 articles; x-axis: Number of articles
     in user library]
Use Case 2: Perso Recommendations

 Future directions...?

 Evaluate multiple techniques on same data set

    Construct data set
      similar to current one but with more up-to-date data
      analyse composition of data set in detail

      content-based filtering
      collaborative filtering (user-based and item-based)

    Evaluate the different systems on same data set

    ...and let's brainstorm!

    2 recommendation use cases

    similar problems and techniques

    good results so far

    combining CF with content would likely improve both

Beel, J., & Gipp, B. (2010). Link Analysis in Mind Maps: A New Approach to Determining Document Relatedness. Mind, (January). Citeseer.
Bogers, T., & van Den Bosch, A. (2009). Collaborative and Content-based Filtering for Item Recommendation on Social Bookmarking Websites. ACM RecSys ’09 Workshop on Recommender Systems and the Social Web. New York, USA.
Gipp, B., Beel, J., & Hentschel, C. (2009). Scienstein: A research paper recommender system. Proceedings of the International Conference on Emerging Trends in Computing (ICETiC’09) (pp. 309–315).
Kapoor, N., Chen, J., Butler, J. T., Fouty, G. C., Stemper, J. A., Riedl, J., & Konstan, J. A. (2007). Techlens: a researcher’s desktop. Proceedings of the 2007 ACM conference on Recommender systems (pp. 183–184). ACM.
Lin, J., & Wilbur, W. J. (2007). PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinformatics, 8(1), 423. BioMed Central.
McNee, S. M., Kapoor, N., & Konstan, J. A. (2006). Don’t look stupid: avoiding pitfalls when recommending research papers. Proceedings of the 2006 20th anniversary conference on Computer supported cooperative work (p. 180). ACM.
Parra-Santander, D., & Brusilovsky, P. (2009). Evaluation of Collaborative Filtering Algorithms for Recommending Articles. Web 3.0: Merging Semantic Web and Social Web at HyperText ’09 (pp. 3–6). Torino, Italy. Retrieved from http://ceur-
Pohl, S., Radlinski, F., & Joachims, T. (2007). Recommending related papers based on digital library access records. Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries (pp. 418–419). ACM.
Vellino, A. (2009). The Effect of PageRank on the Collaborative Filtering Recommendation of Journal Articles.
Wang, C., & Blei, D. M. (2011). Collaborative topic modeling for recommending scientific articles. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 448–456). ACM.

More Related Content

What's hot

NISO-NFAIS Supplemental Journal Article Materials Working Group: An Update o...
NISO-NFAIS Supplemental Journal Article Materials Working Group: An Update o...NISO-NFAIS Supplemental Journal Article Materials Working Group: An Update o...
NISO-NFAIS Supplemental Journal Article Materials Working Group: An Update o...aschwarzman
Workshop 2 using nvivo 12 for qualitative data analysis
Workshop 2 using nvivo 12 for qualitative data analysisWorkshop 2 using nvivo 12 for qualitative data analysis
Workshop 2 using nvivo 12 for qualitative data analysisDr. Yaar Muhammad
Scott Edmunds: Data publication in the data deluge
Scott Edmunds: Data publication in the data delugeScott Edmunds: Data publication in the data deluge
Scott Edmunds: Data publication in the data delugeGigaScience, BGI Hong Kong
Scientific Recommender Systems - PG PUSHPIN
Scientific Recommender Systems - PG PUSHPINScientific Recommender Systems - PG PUSHPIN
Scientific Recommender Systems - PG PUSHPINDermitder
Setting Up a Qualitative or Mixed Methods Research Project in NVivo 10 to Cod...
Setting Up a Qualitative or Mixed Methods Research Project in NVivo 10 to Cod...Setting Up a Qualitative or Mixed Methods Research Project in NVivo 10 to Cod...
Setting Up a Qualitative or Mixed Methods Research Project in NVivo 10 to Cod...Shalin Hai-Jew
Introduction to NVivo
Introduction to NVivoIntroduction to NVivo
Introduction to NVivoMarieke Guy

What's hot (7)

NISO-NFAIS Supplemental Journal Article Materials Working Group: An Update o...
NISO-NFAIS Supplemental Journal Article Materials Working Group: An Update o...NISO-NFAIS Supplemental Journal Article Materials Working Group: An Update o...
NISO-NFAIS Supplemental Journal Article Materials Working Group: An Update o...
From federated to aggregated search
From federated to aggregated searchFrom federated to aggregated search
From federated to aggregated search
Workshop 2 using nvivo 12 for qualitative data analysis
Workshop 2 using nvivo 12 for qualitative data analysisWorkshop 2 using nvivo 12 for qualitative data analysis
Workshop 2 using nvivo 12 for qualitative data analysis
Scott Edmunds: Data publication in the data deluge
Scott Edmunds: Data publication in the data delugeScott Edmunds: Data publication in the data deluge
Scott Edmunds: Data publication in the data deluge
Scientific Recommender Systems - PG PUSHPIN
Scientific Recommender Systems - PG PUSHPINScientific Recommender Systems - PG PUSHPIN
Scientific Recommender Systems - PG PUSHPIN
Setting Up a Qualitative or Mixed Methods Research Project in NVivo 10 to Cod...
Setting Up a Qualitative or Mixed Methods Research Project in NVivo 10 to Cod...Setting Up a Qualitative or Mixed Methods Research Project in NVivo 10 to Cod...
Setting Up a Qualitative or Mixed Methods Research Project in NVivo 10 to Cod...
Introduction to NVivo
Introduction to NVivoIntroduction to NVivo
Introduction to NVivo

Viewers also liked

Social Media Mining - Chapter 9 (Recommendation in Social Media)
Social Media Mining - Chapter 9 (Recommendation in Social Media)Social Media Mining - Chapter 9 (Recommendation in Social Media)
Social Media Mining - Chapter 9 (Recommendation in Social Media)SocialMediaMining
A Hybrid Recommendation system
A Hybrid Recommendation systemA Hybrid Recommendation system
A Hybrid Recommendation systemPranav Prakash
Data Mining and Recommendation Systems
Data Mining and Recommendation SystemsData Mining and Recommendation Systems
Data Mining and Recommendation SystemsSalil Navgire
Recommender Systems
Recommender SystemsRecommender Systems
Recommender SystemsT212
Content Recommendation Based on Data Mining in Adaptive Social Networks
Content Recommendation Based on Data Mining  in Adaptive Social NetworksContent Recommendation Based on Data Mining  in Adaptive Social Networks
Content Recommendation Based on Data Mining in Adaptive Social NetworksMarcel Caraciolo
Recommendation system
Recommendation system Recommendation system
Recommendation system Vikrant Arya
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Xavier Amatriain

Viewers also liked (9)

Music data mining
Music  data miningMusic  data mining
Music data mining
Social Media Mining - Chapter 9 (Recommendation in Social Media)
Social Media Mining - Chapter 9 (Recommendation in Social Media)Social Media Mining - Chapter 9 (Recommendation in Social Media)
Social Media Mining - Chapter 9 (Recommendation in Social Media)
A Hybrid Recommendation system
A Hybrid Recommendation systemA Hybrid Recommendation system
A Hybrid Recommendation system
Data Mining and Recommendation Systems
Data Mining and Recommendation SystemsData Mining and Recommendation Systems
Data Mining and Recommendation Systems
Recommender Systems
Recommender SystemsRecommender Systems
Recommender Systems
Content Recommendation Based on Data Mining in Adaptive Social Networks
Content Recommendation Based on Data Mining  in Adaptive Social NetworksContent Recommendation Based on Data Mining  in Adaptive Social Networks
Content Recommendation Based on Data Mining in Adaptive Social Networks
Recommendation system
Recommendation system Recommendation system
Recommendation system
Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)Recommender Systems (Machine Learning Summer School 2014 @ CMU)
Recommender Systems (Machine Learning Summer School 2014 @ CMU)

Similar to Recommendation Engines for Scientific Literature Search

W13 libr250 databases___sources1
W13 libr250 databases___sources1W13 libr250 databases___sources1
W13 libr250 databases___sources1lterrones
Research report writing
Research report writingResearch report writing
Research report writingMichele Knobel
W13 libr250 databases_scholarlyvs_popular
W13 libr250 databases_scholarlyvs_popularW13 libr250 databases_scholarlyvs_popular
W13 libr250 databases_scholarlyvs_popularlterrones
An initial analysis of topic-based similarity among scientific documents base...
An initial analysis of topic-based similarity among scientific documents base...An initial analysis of topic-based similarity among scientific documents base...
An initial analysis of topic-based similarity among scientific documents base...Oscar Corcho
British Library
British LibraryBritish Library
British Libraryclarivate
Lit Reviews for the Health Sciences
Lit Reviews for the Health SciencesLit Reviews for the Health Sciences
Lit Reviews for the Health SciencesRobin Featherstone
كيفية كتابة المسح الأدبي
كيفية كتابة المسح الأدبيكيفية كتابة المسح الأدبي
كيفية كتابة المسح الأدبيresearchcenterm
W13 libr250 databases_scholarlyvs_popular
W13 libr250 databases_scholarlyvs_popularW13 libr250 databases_scholarlyvs_popular
W13 libr250 databases_scholarlyvs_popularlterrones
Literature Review Assignment A literature review is an a.docx
Literature Review Assignment A literature review is an a.docxLiterature Review Assignment A literature review is an a.docx
Literature Review Assignment A literature review is an a.docxwashingtonrosy
PSYC 3401
PSYC 3401PSYC 3401
PSYC 3401Traciwm
UKSG webinar - Introduction to Text-Mining Research Papers with Petr Knoth an...
UKSG webinar - Introduction to Text-Mining Research Papers with Petr Knoth an...UKSG webinar - Introduction to Text-Mining Research Papers with Petr Knoth an...
UKSG webinar - Introduction to Text-Mining Research Papers with Petr Knoth an...UKSG: connecting the knowledge community
Guidelines review article
Guidelines review articleGuidelines review article
Guidelines review articlePreethiT4
Opportunities: Improve Interoperability ... from a library viewpoint.
Opportunities: Improve Interoperability ... from a library viewpoint. Opportunities: Improve Interoperability ... from a library viewpoint.
Opportunities: Improve Interoperability ... from a library viewpoint. TIB Hannover
IntroductionTypically, a review of the literature is extensi
IntroductionTypically, a review of the literature is extensiIntroductionTypically, a review of the literature is extensi
IntroductionTypically, a review of the literature is extensimariuse18nolet

Similar to Recommendation Engines for Scientific Literature Search (20)

W13 libr250 databases___sources1
W13 libr250 databases___sources1W13 libr250 databases___sources1
W13 libr250 databases___sources1
ENS/OCN 3911 Preparation for Field Projects
ENS/OCN 3911 Preparation for Field ProjectsENS/OCN 3911 Preparation for Field Projects
ENS/OCN 3911 Preparation for Field Projects
Research report writing
Research report writingResearch report writing
Research report writing
W13 libr250 databases_scholarlyvs_popular
W13 libr250 databases_scholarlyvs_popularW13 libr250 databases_scholarlyvs_popular
W13 libr250 databases_scholarlyvs_popular
An initial analysis of topic-based similarity among scientific documents base...
An initial analysis of topic-based similarity among scientific documents base...An initial analysis of topic-based similarity among scientific documents base...
An initial analysis of topic-based similarity among scientific documents base...
British Library
British LibraryBritish Library
British Library
Lit Reviews for the Health Sciences
Lit Reviews for the Health SciencesLit Reviews for the Health Sciences
Lit Reviews for the Health Sciences
محاضرة 2
محاضرة 2محاضرة 2
محاضرة 2
كيفية كتابة المسح الأدبي
كيفية كتابة المسح الأدبيكيفية كتابة المسح الأدبي
كيفية كتابة المسح الأدبي
W13 libr250 databases_scholarlyvs_popular
W13 libr250 databases_scholarlyvs_popularW13 libr250 databases_scholarlyvs_popular
W13 libr250 databases_scholarlyvs_popular
PPT on literature review.pdf
PPT on literature review.pdfPPT on literature review.pdf
PPT on literature review.pdf
Literature Review Assignment A literature review is an a.docx
Literature Review Assignment A literature review is an a.docxLiterature Review Assignment A literature review is an a.docx
Literature Review Assignment A literature review is an a.docx
PSYC 3401
PSYC 3401PSYC 3401
PSYC 3401
UKSG webinar - Introduction to Text-Mining Research Papers with Petr Knoth an...
UKSG webinar - Introduction to Text-Mining Research Papers with Petr Knoth an...UKSG webinar - Introduction to Text-Mining Research Papers with Petr Knoth an...
UKSG webinar - Introduction to Text-Mining Research Papers with Petr Knoth an...
Guidelines review article
Guidelines review articleGuidelines review article
Guidelines review article
Guidelines review article
Guidelines review articleGuidelines review article
Guidelines review article
Guidelines review article
Guidelines review articleGuidelines review article
Guidelines review article
PsycInfo from ProQuest
PsycInfo from ProQuestPsycInfo from ProQuest
PsycInfo from ProQuest
Opportunities: Improve Interoperability ... from a library viewpoint.
Opportunities: Improve Interoperability ... from a library viewpoint. Opportunities: Improve Interoperability ... from a library viewpoint.
Opportunities: Improve Interoperability ... from a library viewpoint.
IntroductionTypically, a review of the literature is extensi
IntroductionTypically, a review of the literature is extensiIntroductionTypically, a review of the literature is extensi
IntroductionTypically, a review of the literature is extensi

More from Kris Jack

Modern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in MendeleyModern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in MendeleyKris Jack
Machine Learning @ Mendeley
Machine Learning @ MendeleyMachine Learning @ Mendeley
Machine Learning @ MendeleyKris Jack
Mendeley’s Research Catalogue: building it, opening it up and making it even ...
Mendeley’s Research Catalogue: building it, opening it up and making it even ...Mendeley’s Research Catalogue: building it, opening it up and making it even ...
Mendeley’s Research Catalogue: building it, opening it up and making it even ...Kris Jack
Mendeley Suggest: What will you read next?
Mendeley Suggest: What will you read next?Mendeley Suggest: What will you read next?
Mendeley Suggest: What will you read next?Kris Jack
Mendeley Suggest: Engineering a Personalised Article Recommender System
Mendeley Suggest: Engineering a Personalised Article Recommender SystemMendeley Suggest: Engineering a Personalised Article Recommender System
Mendeley Suggest: Engineering a Personalised Article Recommender SystemKris Jack
Mendeley's Data and Perspectives on Data Challenges
Mendeley's Data and Perspectives on Data ChallengesMendeley's Data and Perspectives on Data Challenges
Mendeley's Data and Perspectives on Data ChallengesKris Jack
Scientific Article Recommendation with Mahout
Scientific Article Recommendation with MahoutScientific Article Recommendation with Mahout
Scientific Article Recommendation with MahoutKris Jack
Mahout Becomes a Researcher: Large Scale Recommendations at Mendeley
Mahout Becomes a Researcher: Large Scale Recommendations at MendeleyMahout Becomes a Researcher: Large Scale Recommendations at Mendeley
Mahout Becomes a Researcher: Large Scale Recommendations at MendeleyKris Jack
improving explicit preference entry by visualising data similarities
improving explicit preference entry by visualising data similaritiesimproving explicit preference entry by visualising data similarities
improving explicit preference entry by visualising data similaritiesKris Jack
Etude de la pertinence de critères de recherche en recherche d'informations s...
Etude de la pertinence de critères de recherche en recherche d'informations s...Etude de la pertinence de critères de recherche en recherche d'informations s...
Etude de la pertinence de critères de recherche en recherche d'informations s...Kris Jack
A Computational Model of Staged Language Acquisition
A Computational Model of Staged Language AcquisitionA Computational Model of Staged Language Acquisition
A Computational Model of Staged Language AcquisitionKris Jack
From Syllables to Syntax: Investigating Staged Linguistic Development through...
From Syllables to Syntax: Investigating Staged Linguistic Development through...From Syllables to Syntax: Investigating Staged Linguistic Development through...
From Syllables to Syntax: Investigating Staged Linguistic Development through...Kris Jack
A Collaborative Tool for the Computational Modelling of Child Language Acquis...
A Collaborative Tool for the Computational Modelling of Child Language Acquis...A Collaborative Tool for the Computational Modelling of Child Language Acquis...
A Collaborative Tool for the Computational Modelling of Child Language Acquis...Kris Jack
Mendeley: crowdsourcing and recommending research on a large scale
Mendeley: crowdsourcing and recommending research on a large scaleMendeley: crowdsourcing and recommending research on a large scale
Mendeley: crowdsourcing and recommending research on a large scaleKris Jack
Mendeley, putting data into the hands of researchers
Mendeley, putting data into the hands of researchersMendeley, putting data into the hands of researchers
Mendeley, putting data into the hands of researchersKris Jack
Cloud Elephants and Witches: A Big Data Tale from Mendeley
Cloud Elephants and Witches: A Big Data Tale from MendeleyCloud Elephants and Witches: A Big Data Tale from Mendeley
Cloud Elephants and Witches: A Big Data Tale from MendeleyKris Jack

Recommendation Engines for Scientific Literature Search

  • 1. Recommendation Engines for Scientific Literature Kris Jack, PhD Data Mining Team Lead
  • 2. Summary ➔ 2 recommendation use cases ➔ literature search with Mendeley ➔ use case 1: related research ➔ use case 2: personalised recommendations
  • 3. Use Cases. Two types of recommendation use cases:
      1) Related Research
         ● given 1 research article
         ● find other related articles
      2) Personalised Recommendations
         ● given a user's profile (e.g. interests)
         ● find articles of interest to them
  • 6. Use Cases. My secondment (Dec-Feb):
      1) Related Research
         ● given 1 research article
         ● find other related articles
      2) Personalised Recommendations
         ● given a user's profile (e.g. interests)
         ● find articles of interest to them
  • 7. Literature Search Using Mendeley. Challenge!
      ● Use only Mendeley to perform literature search for:
         ● Related research
         ● Personalised recommendations
      Eating your own dog food...
  • 8. Queries: “content similarity”, “semantic similarity”, “semantic relatedness”, “PubMed related articles”, “Google Scholar related articles”. Found: 0
  • 9. Queries: “content similarity”, “semantic similarity”, “semantic relatedness”, “PubMed related articles”, “Google Scholar related articles”. Found: 1
  • 14. Literature Search Using Mendeley: Summary of Results
      Strategy                Num Docs Found   Comment
      Catalogue Search        19               9 from “Related Research”
      Group Search            0                Needs work
      Perso Recommendations   45               Led to a group with 37 docs!
      Found: 64
  • 15. Literature Search Using Mendeley: Summary of Results (table as on slide 14; Found: 64). Eating your own dog food... Tastes good!
  • 16. 64 => 31 docs, read 14 so far, so what do they say...?
  • 17. Use Cases 1) Related Research ● given 1 research article ● find other related articles
  • 18. Use Case 1: Related Research. 7 highly relevant papers (related research for scientific articles)
      Q1/4: How are the systems evaluated?
      ● User study (e.g. Likert scale to rate relatedness between documents) (Beel & Gipp, 2010)
      ● TREC collections with hand-classified 'related articles' (e.g. TREC 2005 genomics track) (Lin & Wilbur, 2007)
      ● Try to reconstruct a document's reference list (Pohl, Radlinski, & Joachims, 2007; Vellino, 2009)
  • 19. Use Case 1: Related Research. 7 highly relevant papers (related research for scientific articles)
      Q2/4: How are the systems trained?
      ● Paper reference lists (Pohl et al., 2007; Vellino, 2009)
      ● Usage data (e.g. PubMed, arXiv) (Lin & Wilbur, 2007)
      ● Document content (e.g. metadata, co-citation, bibliographic coupling) (Gipp, Beel, & Hentschel, 2009)
      ● Collocation in mind maps (Beel & Gipp, 2010)
  • 20. Use Case 1: Related Research. 7 highly relevant papers (related research for scientific articles)
      Q3/4: Which techniques are applied?
      ● BM25 (Lin & Wilbur, 2007)
      ● Topic modelling (Lin & Wilbur, 2007)
      ● Collaborative filtering (Pohl et al., 2007)
      ● Bespoke heuristics for feature extraction (e.g. in-text citation metrics for same sentence, paragraph) (Pohl et al., 2007; Gipp et al., 2009)
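BM25, the first technique listed on slide 20, can be sketched in a few lines. The version below is a minimal illustration of the common Okapi BM25 formula, not the exact configuration used by Lin & Wilbur (2007). The function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, and the smoothed idf variant are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score every tokenised document against a tokenised query with Okapi BM25.

    k1 and b are conventional defaults, assumed here for illustration.
    """
    if not docs_tokens:
        raise ValueError("need at least one document")
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for doc in docs_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed idf that never goes negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates with k1; b controls length normalisation.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    ["semantic", "similarity", "of", "abstracts"],
    ["protein", "folding", "dynamics"],
    ["semantic", "relatedness", "and", "similarity"],
]
scores = bm25_scores(["semantic", "similarity"], docs)
```

In this toy corpus the second document shares no query terms and scores zero, while the first and third contain both terms once at the same length and so score equally.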
  • 21. Use Case 1: Related Research. 7 highly relevant papers (related research for scientific articles)
      Q4/4: Which techniques have most success?
      ● Topic modelling slightly improves on BM25 (MEDLINE abstracts) (Lin & Wilbur, 2007):
         - BM25 = 0.383 precision @ 5
         - PMRA = 0.399 precision @ 5
      ● Seeding CF with usage data from arXiv won out over using citation lists (Pohl et al., 2007)
      ● Not yet found significant results that show whether content-based or CF methods are better for this task
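The precision @ 5 figures quoted on slide 21 measure what fraction of the top five returned articles are judged related. A minimal sketch of that metric follows; the function name `precision_at_k` and the toy document identifiers are hypothetical.

```python
def precision_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the top-k ranked items that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Divide by k, not by len(top_k), so short result lists are penalised.
    return hits / k


ranked = ["d3", "d7", "d1", "d9", "d4"]   # system output, best first
relevant = {"d1", "d3"}                    # hand-labelled related articles
score = precision_at_k(ranked, relevant, k=5)  # 2 hits out of 5
```

On this reading, a score of 0.399 means that, on average, roughly two of the top five suggestions were related articles.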
  • 22. Use Case 1: Related Research. Progress so far...
      Q1/2: How do we evaluate our system?
      Construct a non-complex data set of related research:
      ● include groups with 10-20 documents (i.e. topics)
      ● no overlaps between groups (i.e. no documents in common)
      ● only take documents that are recognised as being in English
      ● document metadata must be 'complete' (i.e. has title, year, author, published in, abstract, filehash, tags/keywords/MeSH terms)
      → 4,382 groups
      → mean size = 14
      → 60,715 individual documents
      Given a doc, aim to retrieve the other docs from its group:
      ● tf-idf with Lucene implementation
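The selection criteria on slide 22 can be read as a filter over groups and their documents. The sketch below is one possible interpretation, not the actual pipeline: the field names (`published_in`, `filehash`, `language`, and so on), the rule that a group is dropped if it shares any document with another group, and the choice to count only usable documents towards the 10-20 size limit are all assumptions.

```python
from collections import Counter

# Hypothetical field names standing in for the metadata listed on the slide.
REQUIRED_FIELDS = ("title", "year", "author", "published_in", "abstract", "filehash")
TERM_FIELDS = ("tags", "keywords", "mesh_terms")  # at least one must be present


def is_usable(doc):
    """English document with 'complete' metadata."""
    has_required = all(doc.get(field) for field in REQUIRED_FIELDS)
    has_terms = any(doc.get(field) for field in TERM_FIELDS)
    return doc.get("language") == "en" and has_required and has_terms


def build_evaluation_groups(groups, docs, min_size=10, max_size=20):
    """Keep non-overlapping groups whose usable documents number min_size..max_size.

    groups: {group_id: [doc_id, ...]}, docs: {doc_id: metadata dict}.
    """
    # Count how many groups each document belongs to.
    membership = Counter(d for ids in groups.values() for d in set(ids))
    selected = {}
    for group_id, ids in groups.items():
        unique = set(ids)
        if any(membership[d] > 1 for d in unique):
            continue  # shares a document with another group
        usable = sorted(d for d in unique if d in docs and is_usable(docs[d]))
        if min_size <= len(usable) <= max_size:
            selected[group_id] = usable
    return selected
```

Each surviving group then acts as a topic: any document in it can serve as a query whose correct answers are the remaining members of the same group.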
  • 24. Use Case 1: Related Research. Progress so far... Q1/2: How do we evaluate our system? (evaluation data set criteria as on slide 22)
      [Chart: Metadata Presence in Documents. y-axis: % of documents that field appears in (0% to 100%). x-axis: metadata field (title, year, author, publishedIn, fileHash, abstract, generalKeyword, meshTerms, keywords, tags). Series: Evaluation Data Set, Group, Catalogue.]
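The evaluation described on these slides (given a document, retrieve the other documents from its group, scored by precision @ 5) can be sketched end to end. The slides report using a Lucene tf-idf implementation; the code below is a pure-Python stand-in with a simple tf-idf weighting and cosine similarity, so its scores would not match Lucene's. All function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs_tokens):
    """Return one L2-normalised sparse tf-idf vector (a dict) per document."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        vec = {t: c * math.log(1 + n_docs / doc_freq[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else {})
    return vectors


def cosine(a, b):
    """Dot product of two normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def mean_precision_at_k(docs_tokens, group_of, k=5):
    """Use each document as a query; a hit is a top-k result from the same group."""
    vectors = tfidf_vectors(docs_tokens)
    total = 0.0
    for i, query in enumerate(vectors):
        ranked = sorted(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: cosine(query, vectors[j]),
            reverse=True,
        )
        hits = sum(1 for j in ranked[:k] if group_of[j] == group_of[i])
        total += hits / k
    return total / len(vectors)


corpus = [
    ["gene", "protein", "cell"], ["gene", "cell", "dna"], ["protein", "dna", "cell"],
    ["galaxy", "star", "orbit"], ["star", "orbit", "planet"], ["galaxy", "planet", "star"],
]
groups = ["bio", "bio", "bio", "astro", "astro", "astro"]
result = mean_precision_at_k(corpus, groups, k=2)
```

To evaluate a single metadata field, as in the per-field charts, `docs_tokens` would hold only the tokens from that field; a field combination would concatenate the tokens of several fields.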
  • 25. Use Case 1: Related Research. Progress so far... Q2/2: What are our results?
      [Chart: tf-idf Precision per Field for Complete Data Set. y-axis: Precision @ 5 (0 to 0.3). x-axis: metadata field (title, mesh-term, keyword, abstract, generalKeyword, author, tag).]
  • 26. Use Case 1: Related Research. Progress so far... Q2/2: What are our results?
      [Chart: tf-idf Precision per Field when Field is Available. y-axis: Precision @ 5 (0 to 0.5). x-axis: metadata field (tag, abstract, mesh-term, title, general-keyword, author, keyword).]
  • 27. Use Case 1: Related Research. Progress so far... Q2/2: What are our results?
      [Chart: tf-idf Precision for Field Combos for Complete Data Set. y-axis: Precision @ 5 (0 to 0.4). x-axis: metadata field(s) (abstract, generalKeyword, author, tag, bestCombo, title, mesh-term, keyword).]
      BestCombo = abstract+author+general-keyword+tag+title
  • 28. Use Case 1: Related Research. Progress so far... Q2/2: What are our results?
      [Chart: tf-idf Precision for Field Combos when Field is Available. y-axis: Precision @ 5 (0 to 0.5). x-axis: metadata field(s) (bestCombo, mesh-term, general-keyword, keyword, tag, abstract, title, author).]
      BestCombo = abstract+author+general-keyword+tag+title
  • 29. Use Case 1: Related Research. Future directions...?
      Evaluate multiple techniques on same data set
      Construct public data set:
      ● similar to current one but with data from only public groups
      ● analyse composition of data set in detail
      Train:
      ● content-based filtering
      ● collaborative filtering
      ● hybrid
      Evaluate the different systems on same data set
      ...and let's brainstorm!
  • 30. Use Cases 2) Personalised Recommendations ● given a user's profile (e.g. interests) ● find articles of interest to them
  • 31. Use Case 2: Perso Recommendations. 7 highly relevant papers (perso recs for scientific articles)
      Q1/4: How are the systems evaluated?
      ● Cross validation on user libraries (Bogers & van Den Bosch, 2009; Wang & Blei, 2011)
      ● User studies (McNee, Kapoor, & Konstan, 2006; Parra-Santander & Brusilovsky, 2009)
  • 32. Use Case 2: Perso Recommendations. 7 highly relevant papers (perso recs for scientific articles)
      Q2/4: How are the systems trained?
      ● CiteULike libraries (Bogers & van Den Bosch, 2009; Parra-Santander & Brusilovsky, 2009; Wang & Blei, 2011)
      ● Documents stand in for users, and the documents they cite stand in for documents of interest (McNee et al., 2006)
      ● User search history (Kapoor et al., 2007)
  • 33. Use Case 2: Perso Recommendations. 7 highly relevant papers (perso recs for scientific articles)
      Q3/4: Which techniques are applied?
      ● CF (Parra-Santander & Brusilovsky, 2009; Wang & Blei, 2011)
      ● LDA (Wang & Blei, 2011)
      ● Hybrid of CF + LDA (Wang & Blei, 2011)
      ● BM25 over tags to form user neighbourhood (Parra-Santander & Brusilovsky, 2009)
      ● Item-based and content-based CF (Bogers & van Den Bosch, 2009)
      ● User-based CF, Naïve Bayes classifier, Probabilistic Latent Semantic Indexing, textual TF-IDF-based algorithm (uses document abstracts) (McNee et al., 2006)
  • 34. Use Case 2: Perso Recommendations. 7 highly relevant papers (perso recs for scientific articles)
      Q4/4: Which techniques have most success?
      ● CF is much better than topic modelling (Wang & Blei, 2011)
      ● CF-topic modelling hybrid slightly outperforms CF alone (Wang & Blei, 2011)
      ● Content-based filtering performed slightly better than item-based filtering on a test set with 1,322 CiteULike users (Bogers & van Den Bosch, 2009)
      ● User-based CF and tf-idf significantly outperformed Naïve Bayes and Probabilistic Latent Semantic Indexing (McNee et al., 2006)
      ● BM25 gave better results than CF, but the study involved just 7 CiteULike users, so it was small scale (Parra-Santander & Brusilovsky, 2009)
  • 35. Use Case 2: Perso Recommendations. 7 highly relevant papers (perso recs for scientific articles)
      Q4/4: Which techniques have most success?
      Approach        Advantages                                               Disadvantages
      Content-based   Human readable form of their profile                     Tends to over-specialise
                      Quickly absorb new content without need for ratings
      CF              Works on an abstract item-user level so you don't        Requires a lot of data
                      need to 'understand' the content
                      Tends to give more novel and creative recommendations
  • 36. Use Case 2: Perso Recommendations. Our progress so far...
      Q1/2: How do we evaluate our system?
      Construct an evaluation data set from user libraries:
      ● 50,000 user libraries
      ● 10-fold cross validation
      ● libraries vary from 20-500 documents
      ● preference values are binary (in library = 1; 0 otherwise)
      Train:
      ● item-based collaborative filtering recommender
      Evaluate:
      ● train recommender and test how well it can reconstruct the users' hidden testing libraries
      ● multiple similarity metrics (e.g. cooccurrence, loglikelihood)
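The setup on slide 36 (binary preferences, item-based collaborative filtering, and a test of how well hidden library items are recovered) can be illustrated with a small sketch. It uses raw co-occurrence counts as the item similarity and a single train/hidden split rather than 10-fold cross validation; it is not the system behind the reported results, and every name in it is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_counts(libraries):
    """counts[a][b] = number of user libraries that contain both a and b."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in libraries.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def recommend(user_items, counts, n=10):
    """Rank unseen items by summed co-occurrence with the user's library."""
    owned = set(user_items)
    scores = defaultdict(int)
    for item in owned:
        for other, count in counts.get(item, {}).items():
            if other not in owned:
                scores[other] += count
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:n]]


def holdout_precision(train, hidden, n=10):
    """Mean precision @ n over users whose hidden items were withheld from train."""
    counts = cooccurrence_counts(train)
    per_user = []
    for user, hidden_items in hidden.items():
        recs = recommend(train.get(user, []), counts, n)
        hits = sum(1 for item in recs if item in hidden_items)
        per_user.append(hits / n)
    return sum(per_user) / len(per_user)


train = {"u1": ["a", "b"], "u2": ["a", "b", "c"], "u3": ["b", "c"], "u4": ["a"]}
hidden = {"u4": {"b"}}  # article withheld from u4's library
recs = recommend(train["u4"], cooccurrence_counts(train), n=2)
precision = holdout_precision(train, hidden, n=2)
```

A log-likelihood similarity, the other metric named on the slide, would replace the raw count with a significance score for each item pair, which should damp the pull of articles that are simply popular.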
  • 37. Use Case 2: Perso Recommendations Our progress so far... Q2/2 What are our results? Cross validation: ● 0.1 precision @ 10 articles Usage logs: ● 0.4 precision @ 10 articles
  • 38. Use Case 2: Perso Recommendations Our progress so far... Q2/2 What are our results?
  • 39. Use Case 2: Perso Recommendations. Our progress so far... Q2/2: What are our results?
      [Chart; axis labels: Precision at 10 articles, Number of articles in user library.]
  • 40. Use Case 2: Perso Recommendations. Future directions...?
      Evaluate multiple techniques on same data set
      Construct data set:
      ● similar to current one but with more up-to-date data
      ● analyse composition of data set in detail
      Train:
      ● content-based filtering
      ● collaborative filtering (user-based and item-based)
      ● hybrid
      Evaluate the different systems on same data set
      ...and let's brainstorm!
  • 41. Conclusion ➔ 2 recommendation use cases ➔ similar problems and techniques ➔ good results so far ➔ combining CF with content would likely improve both
  • 43. References
      Beel, J., & Gipp, B. (2010). Link Analysis in Mind Maps: A New Approach to Determining Document Relatedness. Mind, (January). Citeseer.
      Bogers, T., & van Den Bosch, A. (2009). Collaborative and Content-based Filtering for Item Recommendation on Social Bookmarking Websites. ACM RecSys ’09 Workshop on Recommender Systems and the Social Web. New York, USA.
      Gipp, B., Beel, J., & Hentschel, C. (2009). Scienstein: A research paper recommender system. Proceedings of the International Conference on Emerging Trends in Computing (ICETiC’09) (pp. 309–315).
      Kapoor, N., Chen, J., Butler, J. T., Fouty, G. C., Stemper, J. A., Riedl, J., & Konstan, J. A. (2007). Techlens: a researcher’s desktop. Proceedings of the 2007 ACM conference on Recommender systems (pp. 183–184). ACM. doi:10.1145/1297231.1297268
      Lin, J., & Wilbur, W. J. (2007). PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinformatics, 8(1), 423. BioMed Central.
      McNee, S. M., Kapoor, N., & Konstan, J. A. (2006). Don’t look stupid: avoiding pitfalls when recommending research papers. Proceedings of the 2006 20th anniversary conference on Computer supported cooperative work (p. 180). ACM.
      Parra-Santander, D., & Brusilovsky, P. (2009). Evaluation of Collaborative Filtering Algorithms for Recommending Articles. Web 3.0: Merging Semantic Web and Social Web at HyperText ’09 (pp. 3–6). Torino, Italy. Retrieved from http://ceur-
      Pohl, S., Radlinski, F., & Joachims, T. (2007). Recommending related papers based on digital library access records. Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries (pp. 418–419). ACM.
      Vellino, A. (2009). The Effect of PageRank on the Collaborative Filtering Recommendation of Journal Articles.
      Wang, C., & Blei, D. M. (2011). Collaborative topic modeling for recommending scientific articles. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 448–456). ACM.