In this project, we investigated the use of association rules to extract useful knowledge from raw ontological data. To this end, we proposed an approach for converting a graph representation into transactional data. We then used several technologies, such as the FP-Growth algorithm and Hadoop, to improve the performance of frequent item-set extraction. The code is available on GitHub: https://github.com/8-chems/OntologyMiner
1. Presented by:
ChemsEddine Berbague STIC May. 2015
Supervisor: Pr. Seridi Hassina
Co-supervisor: Dr. Beldjoudi Samia
Jury members:
Dr. Hariati Mehdi
Dr. Mendjel Mehdi
Master Project Presentation
«Association Rules Mining: Ontological Approach»
University of Badji Mokhtar-Annaba
Computer Science Department
2. 2
• The work presented in the following slides partially builds on
and extends the work of:
▫ Claudia Marinica. 2010 (Association Rule Interactive Post-processing
using Rule Schemas and Ontologies - ARIPSO).
Note
5. 5
Context and project focus
This project addresses two main tasks:
• Knowledge extraction.
• Ontology enrichment.
• Focus: mining ontologies using association rules to extract useful
knowledge.
6. 6
Knowledge extraction: general scheme and definition
"...extracting from data original information, previously
unknown, and potentially useful."
[fayyad et al.,1996]
8. 8
STEP 1: generate frequent item-sets
STEP 2: generate association rules
Reduce the number of item-sets using support threshold
Reduce the number of comparisons
FP-Growth algorithm
Calculate support
Hashsets
Using advanced data structure
• APRIORI is one of the best-known algorithms for association rule
extraction. It identifies frequent item-sets in transactional datasets.
Data mining: association rules algorithm
11. 11
The steps of the APRIORI algorithm
Step 1
Scan the transactional dataset to get the support of
each 1-item-set.
Step 2
Join 𝐿𝑘−1 with 𝐿𝑘−1 to generate the set of candidate
k-item-sets.
Step 3
Scan the transactional dataset to get the support of
each candidate k-item-set, then filter the candidates against
min_sup to obtain the set 𝑳𝑘 of frequent k-item-sets.
Step 4
Is the set of candidates empty?
No: return to Step 2 with the next k.
Yes: continue with Step 5.
Step 5
For every frequent item-set E, generate
all non-empty proper subsets s.
Step 6
For each non-empty proper subset s of E,
generate the rule "s => (E-s)" if its
confidence C = support(E) / support(s) exceeds
min_conf.
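The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function names `apriori` and `generate_rules`, the toy transactions, and the thresholds are assumptions made for the example. Support is counted as an absolute number of transactions.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {item-set: support count} for every frequent item-set."""
    transactions = [frozenset(t) for t in transactions]

    # Step 1: count every 1-item-set and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: n for s, n in counts.items() if n >= min_sup}
    frequent = dict(level)

    k = 2
    while level:
        # Step 2: join L(k-1) with itself to build candidate k-item-sets,
        # keeping only candidates whose (k-1)-subsets are all frequent.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        # Step 4: an empty candidate set ends the search.
        if not candidates:
            break
        # Step 3: count the support of each candidate and filter by min_sup.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {s: n for s, n in counts.items() if n >= min_sup}
        frequent.update(level)
        k += 1
    return frequent


def generate_rules(frequent, min_conf):
    """Steps 5 and 6: emit (antecedent, consequent, confidence) triples."""
    rules = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for subset in combinations(itemset, size):
                s = frozenset(subset)
                conf = sup / frequent[s]  # support(E) / support(s)
                if conf > min_conf:
                    rules.append((s, itemset - s, conf))
    return rules


# Toy data, assumed for illustration only.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
]
frequent = apriori(transactions, min_sup=2)
rules = generate_rules(frequent, min_conf=0.6)
```

Every subset of a frequent item-set is itself frequent, which is why `frequent[s]` can be looked up directly when computing the confidence.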
12. 12
• APRIORI has limitations at several steps of the extraction process:
Large number of rules.
Confidence and support measures carry no semantic meaning.
User help is required to extract the targeted rules.
The complexity of the algorithm is O(b).
APRIORI: limitations
13. 13
• Advantages: unsupervised technique, readable results, complete rule sets.
• Limits: large volume and low quality of the extracted rules:
• Statistically invalid:
▫ Onions => pain
• Redundant:
▫ R1: X, Y => Z [c]; X => Y [c1]; X => Z [c2]
▫ c1 > c or c2 > c => R1 is redundant
• Already known by the expert:
▫ X => Y (the rule can be acquired from the context)
• Useless for the expert:
▫ X => Y (the rule is semantically meaningless, such as apple implies skirt)
• Difficulty of manual analysis
• The complexity of the algorithm
▫ Complexity O(b)
• Need:
▫ Eliminate the uninteresting rules.
▫ Target the high-quality rules.
Data mining: association rules problematic
14. "An ontology is an explicit and formal
specification of a shared
conceptualization" [Gruber,1993]
Knowledge engineering:
the ontologies
"Introducing an ontology in an
information system reduces the
conceptual and terminological confusion
and offers a shared understanding that
enhances communication,
sharing, interpretation, and
possible reuse" [gandon,2006]
Formal definition:
O={C,G,I,P}
C = Concepts: elements of the domain.
G = Graph of concepts: the is-a relation.
I = Instances: individuals of the concepts.
P = Properties: relations between concepts.
Food product
  Fruit
    grape
      red grape
      green grape
    apple
    pear
  Dairy product
    milk
    cheese
    butter
  Meat
    chicken
    beef
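The formal definition O={C,G,I,P} can be made concrete with a small data structure. The sketch below is only one possible encoding, assuming a single parent per concept; the class name `Ontology` and its methods are hypothetical, and the instance and property sets are left empty because the slide gives no example for them.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    concepts: set = field(default_factory=set)     # C: elements of the domain
    is_a: dict = field(default_factory=dict)       # G: child concept -> parent concept
    instances: dict = field(default_factory=dict)  # I: individual -> concept
    properties: set = field(default_factory=set)   # P: (name, domain, range) triples

    def add_concept(self, name, parent=None):
        """Register a concept and, optionally, its is-a parent."""
        self.concepts.add(name)
        if parent is not None:
            self.concepts.add(parent)
            self.is_a[name] = parent

    def ancestors(self, concept):
        """Return the chain of parents from the concept up to the root."""
        chain = []
        while concept in self.is_a:
            concept = self.is_a[concept]
            chain.append(concept)
        return chain


# The food taxonomy from the slide.
food = Ontology()
for child, parent in [
    ("Fruit", "Food product"), ("Dairy product", "Food product"), ("Meat", "Food product"),
    ("grape", "Fruit"), ("apple", "Fruit"), ("pear", "Fruit"),
    ("red grape", "grape"), ("green grape", "grape"),
    ("milk", "Dairy product"), ("cheese", "Dairy product"), ("butter", "Dairy product"),
    ("chicken", "Meat"), ("beef", "Meat"),
]:
    food.add_concept(child, parent)

red_grape_chain = food.ancestors("red grape")
```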
15. 15
"The semantic web is a part of the current web in which information is represented
semantically, allowing machines and users to work better
together."
[berners-lee et al.,2001]
• Knowledge representation languages:
▫ RDF, OWL, ...
▫ OWL-DL is based on description logic and provides a precise and
decidable formalism.
• Reasoning engines:
▫ Tasks: concept classification, consistency testing, and instance checking.
▫ FaCT, Racer, Pellet, ...
▫ Query language: SPARQL.
Knowledge engineering: semantic web
16. 16
• Increase the use of ontologies in the process of association rule
extraction:
• Convert the ontologies into a transactional dataset.
▫ Benefit from the semantic richness to improve the quality of association
rules.
▫ Reduce the complexity of the classical association rule algorithms.
Objectives
17. 17
I. A new method to extract transactional information from
ontologies.
II. An application to extract, validate, and
visualize the association rules.
III. Use of the HADOOP framework to extract frequent item-sets.
IV. Experiments on the NiceTag ontology.
Contribution
19. 19
• FP-Growth identifies all frequent item-sets without generating candidate item-sets.
• Two-step approach:
▫ Step 1: Build a compact data structure called the FP-tree. This step requires scanning the
dataset.
▫ Step 2: Extract the frequent item-sets directly from the FP-tree.
Complexity problem: FP-Growth algorithm
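The two steps can be sketched as follows. This is a compact illustration under stated assumptions, not the project's code: the names `Node`, `build_tree`, and `fp_growth` are hypothetical, transactions are passed as (items, count) pairs so that the same function can process conditional pattern bases, and support is an absolute count.

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, a link to its parent, a count, and children."""

    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_tree(weighted_transactions, min_sup):
    """Step 1: build the FP-tree and a header table (item -> list of nodes)."""
    freq = defaultdict(int)
    for items, n in weighted_transactions:
        for item in set(items):
            freq[item] += n
    freq = {item: c for item, c in freq.items() if c >= min_sup}

    root = Node(None, None)
    header = defaultdict(list)
    for items, n in weighted_transactions:
        # Keep frequent items only, most frequent first (ties broken by name).
        ordered = sorted((i for i in set(items) if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += n
            node = child
    return header, freq


def fp_growth(weighted_transactions, min_sup, suffix=frozenset()):
    """Step 2: mine frequent item-sets from the tree, without candidates."""
    header, freq = build_tree(weighted_transactions, min_sup)
    result = {}
    for item, sup in freq.items():
        itemset = suffix | {item}
        result[itemset] = sup
        # Conditional pattern base: the prefix path of every node holding item.
        base = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        result.update(fp_growth(base, min_sup, itemset))
    return result


# Toy data, assumed for illustration only.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
]
frequent = fp_growth([(t, 1) for t in transactions], min_sup=2)
```

Each recursive call works on the prefix paths of a single item, so the item-sets are grown from the tree itself rather than from a generated candidate list.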
22. 22
• Features of the selected rules [Silberschats et Tuzhilin,1995]:
▫ Novelty: rules unexpected by the expert.
▫ Actionability: useful rules that allow an expert to make decisions.
• Quality metrics [Freitas,1999]:
▫ Objective measures.
▫ Subjective measures.
• Objective metrics (data-based)
[Piatetsky-shapiro,1991;Guillet and Hamilton,2007]
• Data-based statistical indicators of the significance of association rules.
• Advantage: unsupervised quality metrics are easy to apply.
• Disadvantage: not suited to user-specific criteria.
Quality problem: metrics
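As an illustration of objective, data-based measures, the sketch below computes support and confidence, plus lift as one further commonly used measure. Lift is not named on the slide and is included here only as an example; the basket data and function names are assumptions.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of the item-set."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """confidence / support(consequent); above 1 suggests a positive association."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


# Toy baskets, assumed for illustration only.
baskets = [
    {"apple", "milk", "chicken"},
    {"apple", "milk"},
    {"pear", "milk", "beef"},
    {"apple", "chicken"},
]
sup = support({"apple", "milk"}, baskets)
conf = confidence({"apple"}, {"chicken"}, baskets)
lft = lift({"apple"}, {"chicken"}, baskets)
```

These measures depend only on the data, which is what makes them easy to apply and also what keeps them blind to what a particular expert already knows or needs.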
23. 23
• Models [klementtinen et al., 1994]
• Principle: the expert defines expectations against which the association rules are selected.
• Representing the expert's expectations:
• inclusive pattern (IP) and restrictive pattern (RP)
• Selection technique: syntactic.
• Example:
▫ (IP) Fruit, Dairy products => Meat
▫ (RP) Pear, Dairy products => Meat
▫ R1: Pear, Milk => Pork
▫ R2: Apple, Milk => Chicken
▫ R3: Beef, Milk => Grape
• R2 is selected: it matches the inclusive pattern and does not match the restrictive one.
Quality problem: models
Food product
  Fruit
    grape
      red grape
      green grape
    apple
    pear
  Dairy product
    milk
    cheese
    butter
  Meat
    chicken
    beef
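The selection in the example can be reproduced with a short sketch. The matching semantics used here are an assumption: a rule side matches a pattern side when every item falls under some pattern concept and every pattern concept covers some item. "Pork" does not appear in the slide's hierarchy and is placed under Meat as a further assumption so that R1 can be evaluated. All names are hypothetical.

```python
# Child -> parent links of the concept hierarchy shown on the slide.
PARENT = {
    "Fruit": "Food product", "Dairy product": "Food product", "Meat": "Food product",
    "Grape": "Fruit", "Apple": "Fruit", "Pear": "Fruit",
    "Milk": "Dairy product", "Cheese": "Dairy product", "Butter": "Dairy product",
    "Chicken": "Meat", "Beef": "Meat",
    "Pork": "Meat",  # assumption: not in the slide's hierarchy
}


def is_a(item, concept):
    """True if item equals concept or lies below it in the hierarchy."""
    while item is not None:
        if item == concept:
            return True
        item = PARENT.get(item)
    return False


def side_matches(items, pattern):
    """Every item is covered by a pattern concept, and every concept is used."""
    return (all(any(is_a(i, c) for c in pattern) for i in items)
            and all(any(is_a(i, c) for i in items) for c in pattern))


def matches(rule, pattern):
    """Compare antecedent with antecedent and consequent with consequent."""
    return side_matches(rule[0], pattern[0]) and side_matches(rule[1], pattern[1])


INCLUSIVE = (("Fruit", "Dairy product"), ("Meat",))
RESTRICTIVE = (("Pear", "Dairy product"), ("Meat",))

RULES = {
    "R1": (("Pear", "Milk"), ("Pork",)),
    "R2": (("Apple", "Milk"), ("Chicken",)),
    "R3": (("Beef", "Milk"), ("Grape",)),
}

# Keep rules that match the inclusive pattern but not the restrictive one.
selected = [name for name, rule in RULES.items()
            if matches(rule, INCLUSIVE) and not matches(rule, RESTRICTIVE)]
```

Under these assumptions R1 is rejected by the restrictive pattern and R3 fails the inclusive pattern because Beef is neither a Fruit nor a Dairy product.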
24. 24
Quality problem: post-processing technique
I. Association rule extraction using the classical method.
II. Knowledge model: enrichment of a model by an expert.
III. Post-processing phase, ARIPSO [Claudia Marinica. 2010]: apply
pruning/selection models.
25. 25
• Previous approaches make limited use of ontologies.
• Using filtering models is a difficult process that depends on the availability
of an expert.
Conclusion
26. 26
Outline
Ontological approach
Introduction
Existing approaches
Proposed approach
Application
Conclusion
Description logic
Semantic web and ontologies
Conclusion
27. 27
Ontological layers: T-Box & A-Box
• Attributes assertion
• Concepts assertion
• Associations assertion
• Consistency verification
• Satisfiability verification
T-Box A-Box
• Get/ search
• Instance verification
• training
• Coherence testing
Identity
evaluation
homonymy
Search
the text
• Define axioms
• Infer and classify concepts
• Infer associations
• Test the equivalence
• Test the implication
• Test the satisfiability
Reasoning
"Extracting a
knowledge base means
uncovering hidden
information"
29. 29
• Many syntaxes exist for representing an ontology, including:
▫ Manchester OWL Syntax
▫ OWL/XML
▫ OWL Functional Syntax
▫ RDF/XML
▫ Turtle
▫ LaTeX
• The OWL API allows the ontology to be queried in different ways.
Ontologies: representation syntaxes
<owl:Class rdf:ID="Lait">
  <rdfs:subClassOf rdf:resource="&food;PotableLiquid"/>
  <rdfs:label xml:lang="en">wine</rdfs:label>
  <rdfs:label xml:lang="fr">vin</rdfs:label>
</owl:Class>
30. 30
• Exploit the semantic richness to:
▫ Extract transactions:
Step 1 : extract a T-Box model.
Step 2 : extract an A-Box model.
▫ Apply an extraction algorithm to generate the association rules.
How can this task be achieved?
Association rules extraction: ontological approach
31. 31
Ontological approach : general scheme
[Figure: general scheme. The ontology manager reads the ontology and performs (1) T-Box extraction with concepts-based filtering, producing the T-Box table, and (2) A-Box extraction with instances-based filtering, producing the A-Box table. Transactions extraction then produces the transactions. After the algorithm choice (APRIORI, FP-tree, or HADOOP), association rules extraction yields the association rules and frequent item-sets, followed by user filtering, validation, and visualisation.]
32. 32
T-Box layer
ID           | Item-sets
patient      | <p1, disease>, <p2, drug>, <p3, cardiologist>, <p4, gynecologist>, <p5, person>
disease      | <p6-, symptom>, <p1-, patient>, <p7, drug>
doctor       | <p8-, cardiologist>, <p9-, gynecologist>, <p10, person>
symptom      | <p6, disease>
drug         | <p2-, patient>, <p7-, disease>
cardiologist | <p8, doctor>, <p3-, patient>
gynecologist | <p9, doctor>, <p4-, patient>
person       | <p5-, patient>, <p10-, doctor>
[Figure: graph of the concepts patient, drug, doctor, disease, symptom, person, gynecologist, and cardiologist, linked by the properties p1 to p10.]
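The T-Box table can be derived mechanically from a list of property declarations. The sketch below is a hedged reconstruction, not the project's extraction code: the (property, domain, range) triples are inferred from the table, the reading of "pN-" as the inverse direction of property pN is an assumption drawn from the same table, and the function name `tbox_transactions` is hypothetical.

```python
from collections import defaultdict

# (property, domain concept, range concept), inferred from the T-Box table.
PROPERTIES = [
    ("p1", "patient", "disease"),
    ("p2", "patient", "drug"),
    ("p3", "patient", "cardiologist"),
    ("p4", "patient", "gynecologist"),
    ("p5", "patient", "person"),
    ("p6", "symptom", "disease"),
    ("p7", "disease", "drug"),
    ("p8", "cardiologist", "doctor"),
    ("p9", "gynecologist", "doctor"),
    ("p10", "doctor", "person"),
]


def tbox_transactions(properties):
    """Build one transaction per concept; items are (property, neighbour) pairs."""
    table = defaultdict(set)
    for name, domain, rng in properties:
        table[domain].add((name, rng))        # direct direction
        table[rng].add((name + "-", domain))  # inverse direction, marked with "-"
    return dict(table)


tbox = tbox_transactions(PROPERTIES)
```

The same function could be applied to instance-level triples to obtain A-Box transactions, with individuals taking the place of concepts.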
33. 33
A-Box layer
ID         | Item-sets
Pat 10     | <p1, disease 12>, <p2, drug 23>, <p3, cardiolo x>, <p5, person>
Pat 12     | <p1, disease 12>, <p2, drug 24>, <p3, cardiolo x>, <p5, person>
doct 23    | <p8-, cardiolo>, <p10, person>
symptom 45 | <p6, disease 12>
[Figure: graph of the concepts patient, drug, doctor, disease, symptom, person, gynecologist, and cardiologist, linked by the properties p1 to p10.]
34. 34
Frequent item-sets extraction using HADOOP
[Figure: HADOOP context. Input: files of the ontology. MAP: identify all possible k-item-sets. REDUCE: calculate the support of all k-item-sets. Output: result files.]
Using HADOOP to extract frequent item-sets
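The MAP and REDUCE roles described on this slide can be imitated in plain Python. The sketch below does not use the Hadoop API; it only simulates the map, shuffle, and reduce phases in memory, and every function name and the toy data are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(transaction, k):
    """MAP: emit (k-item-set, 1) for every k-item-set in one transaction."""
    for itemset in combinations(sorted(transaction), k):
        yield itemset, 1


def shuffle(pairs):
    """Group the emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """REDUCE: the support of an item-set is the sum of its emitted counts."""
    return key, sum(values)


def frequent_k_itemsets(transactions, k, min_sup):
    pairs = (pair for t in transactions for pair in map_phase(t, k))
    counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
    return {itemset: n for itemset, n in counts.items() if n >= min_sup}


# Toy data, assumed for illustration only.
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "butter"},
]
pairs_2 = frequent_k_itemsets(transactions, k=2, min_sup=2)
```

Sorting the items before emitting keeps the keys canonical, so the same item-set produced by different mappers ends up in the same reduce group.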
35. 35
Ontological approach : steps of association rules extraction
[Figure: a support threshold drives the generation of frequent item-sets (Apriori, FP-tree, or Hadoop), giving the set of frequent item-sets. Association rules are then generated using a multi-threaded process, giving the set of association rules, from which a sub-set of rules is selected.]
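The multi-threaded rule generation step could look like the sketch below, which hands each frequent item-set to a worker from a thread pool. This is an assumption about how the work might be split, not the project's implementation; the function names, the supports table, and the thresholds are illustrative. In CPython, threads mainly help when the work involves waiting on I/O, so a process pool may be the better choice for CPU-bound rule generation.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def rules_for(itemset, supports, min_conf):
    """Generate every rule s => (itemset - s) whose confidence exceeds min_conf."""
    rules = []
    for size in range(1, len(itemset)):
        for subset in combinations(sorted(itemset), size):
            s = frozenset(subset)
            conf = supports[itemset] / supports[s]
            if conf > min_conf:
                rules.append((s, itemset - s, conf))
    return rules


def generate_rules_parallel(supports, min_conf, workers=4):
    """Process each frequent item-set of size two or more in a worker thread."""
    targets = [s for s in supports if len(s) > 1]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        chunks = list(pool.map(lambda s: rules_for(s, supports, min_conf), targets))
    return [rule for chunk in chunks for rule in chunk]


# Support counts of frequent item-sets, assumed for illustration only.
supports = {
    frozenset({"milk"}): 3,
    frozenset({"bread"}): 3,
    frozenset({"butter"}): 3,
    frozenset({"milk", "bread"}): 2,
    frozenset({"milk", "butter"}): 2,
    frozenset({"bread", "butter"}): 2,
}
rules = generate_rules_parallel(supports, min_conf=0.6)
```

The item-sets are independent of one another at this stage, which is what makes the step easy to distribute across workers.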
44. 44
• Association rule mining suffers from two main issues:
▫ The complexity of data processing.
▫ The quality of the association rules.
• The semantic web can be exploited successfully to improve the quality of
association rules.
• In this project:
• We extracted a transactional dataset from an ontology.
• We applied different frequent item-set extraction techniques.
• We implemented a visual application to mine ontologies for association
rules.
Conclusion
46. 46
• [Claudia Marinica. 2010]: Claudia Marinica. Association Rule Interactive Post-processing using Rule Schemas and Ontologies -
ARIPSO. 2010.
• [fayyad et al.,1996]: Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to
knowledge discovery in databases. AI Magazine, 17:37–54, 1996.
• [gandon,2006]: Fabien Gandon. Ontologies informatiques, May 2006.
• [gruber,1993]: Thomas R. Gruber. Toward principles for the design of ontologies used for knowledge sharing. In
Nicola Guarino and Roberto Poli, editors, Formal Ontology in Conceptual Analysis and Knowledge
Representation. Kluwer Academic Publishers, 1993.
• [berners-lee et al.,2001]: Tim Berners-Lee, James Hendler, and Ora Lassila. The semantic web - a new form of
web content that is meaningful to computers will unleash a revolution of new possibilities. Scientific American,
2001.
• [bayardo et al.,1999]: Roberto J. Bayardo Jr., Rakesh Agrawal, and Dimitrios Gunopulos. Constraint-based rule
mining in large, dense databases. ICDE ’99: Proceedings of the 15th International Conference on Data
Engineering, pages 188–197, 1999.
• [liu et al.,1999]: Bing Liu, Wynne Hsu, and Yiming Ma. Pruning and summarizing the discovered associations. In
KDD ’99: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data
mining, pages 125–134. ACM, 1999.
REFERENCES
47. 47
• [Srikant et agrwal,1996]: Ramakrishnan Srikant and Rakesh Agrawal. Mining quantitative association rules in
large relational tables. In Proceedings of the 1996 ACM SIGMOD international conference on Management of
data, pages 1–12, 1996.
• [Silberschats et Tuzhilin,1995]: Abraham Silberschatz and Alexander Tuzhilin. On subjective measures of
interestingness in knowledge discovery. Knowledge Discovery and Data Mining (KDD), pages 275–281, 1995.
• [Piatetsky-shapiro,1991]: G. Piatetsky-Shapiro. Knowledge Discovery in Databases, chapter Discovery, Analysis,
and Presentation of Strong Rules, pages 229–248. AAAI/MIT Press, 1991.
• [Guillet and Hamilton,2007]: F. Guillet and H. Hamilton. Quality Measures in Data Mining. Studies in
Computational Intelligence, 2007.
• [klementtinen et al., 1994]: Mika Klemettinen, Heikki Mannila, Pirjo Ronkainen, Hannu Toivonen, and A. Inkeri
Verkamo. Finding interesting rules from large sets of discovered association rules. International Conference on
Information and Knowledge Management (CIKM), pages 401–407, 1994.
• [Burdick, 2005]: Doug Burdick, Manuel Calimlim, Jason Flannick, Johannes Gehrke, and Tomi Yiu. Mafia: A
maximal frequent itemset algorithm. IEEE Transactions on Knowledge and Data Engineering, 17(11):1490–1504,
2005.
REFERENCES
48. 48
• [J. Zaki et al, 2002]: Mohammed J. Zaki and Ching J. Hsiao. Charm: An efficient algorithm for
closed itemset mining. In Proceedings of SIAM’02, 2002.
• [R. Agrawal et al, 1994]: Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules.
Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, pages 487–499, 1994.
• [J. Han et al, 2000]: Jiawei Han and Jian Pei. Mining frequent patterns by pattern-growth: methodology and
implications. ACM SIGKDD Explorations Newsletter, Special issue on Scalable data mining algorithms,
2(2):14–20, 2000.
• [Hadoop]: Apache Software Foundation. (2010). Hadoop. Retrieved from https://hadoop.apache.org
References