Ontologies mining using association rules

Presented by:
ChemsEddine Berbague STIC May. 2015
Supervisor: Pr.Seridi Hassina
Co-supervisor: Dr. Beldjoudi Samia
Jury members:
Dr. Hariati Mehdi
Dr. Mendjel Mehdi
Master Project Presentation
«Association Rules Mining: Ontological Approach»
University of Badji Mokhtar-Annaba
Computer Science Department

2
• The work presented in the following slides is partly adapted from,
and extends, the work of:
▫ Claudia Marinica. 2010 (Association Rule Interactive Post-processing
using Rule Schemas and Ontologies - ARIPSO).
Note

3
Timeline
Introduction
Existing Approaches
Proposed Approach
Application
Conclusion

4
Introduction
• Context
• Problem statement

5
Context and project focus
This project addresses two main tasks:
• Knowledge extraction.
• Ontology enrichment.
• Focus: mining ontologies with association rules to extract useful
knowledge.

6
Knowledge extraction: general scheme and definition
"...extracting from data original information, previously
unknown, and potentially useful."
[fayyad et al.,1996]

7
Data mining: association rules

8
STEP 1: generate frequent item-sets
STEP 2: generate association rules
Reduce the number of item-sets using support threshold
Reduce the number of comparisons
FP-Growth algorithm
Calculate support
Hashsets
Using advanced data structure
• APRIORI is one of the best-known algorithms for association rule
extraction. It identifies frequent item-sets in transactional datasets.
Data mining: association rules algorithm

9
Step 1: generating frequent item-sets

10
Step 2: generating association rules

11
The steps of the APRIORI algorithm
Step 1
Scan the transactional dataset to get the support of
each 1-item-set, and keep those that reach min_sup.
Step 2
Join 𝐿𝑘−1 with 𝐿𝑘−1 to generate the set of candidate
k-item-sets.
Step 3
Scan the transactional dataset to get the support of
each candidate k-item-set, then filter the candidates by
min_sup to get the set 𝐿𝑘 of frequent k-item-sets.
Step 4
Is the set of candidates empty?
No: return to Step 2 with the next k.
Yes: continue to Step 5.
Step 5
For every frequent item-set E, generate
all of its non-empty proper subsets s.
Step 6
For each such subset s of E,
generate the rule "s => (E - s)" if its
confidence C = support(E) / support(s) >
min_conf
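The six steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the project: the function names `apriori` and `generate_rules`, the toy transactions, and the thresholds are all assumptions made for the example. Supports are expressed as fractions of the dataset.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {item-set: support} for every frequent item-set (Steps 1 to 4)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    def keep_frequent(candidates):
        scored = {c: support(c) for c in candidates}
        return {c: s for c, s in scored.items() if s >= min_sup}

    # Step 1: support of every 1-item-set.
    level = keep_frequent({frozenset([i]) for t in transactions for i in t})
    frequent = dict(level)
    k = 2
    while True:
        # Step 2: join L(k-1) with itself, then drop candidates that have
        # an infrequent (k-1)-subset.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # Step 4: stop when no candidate is left.
        if not candidates:
            break
        # Step 3: scan the dataset and keep the frequent k-item-sets.
        level = keep_frequent(candidates)
        frequent.update(level)
        k += 1
    return frequent

def generate_rules(frequent, min_conf):
    """Steps 5 and 6: emit s => (E - s) when support(E) / support(s) > min_conf."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[antecedent]
                if conf > min_conf:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules

# Toy dataset (hypothetical).
transactions = [{"milk", "bread"}, {"milk", "pear"},
                {"milk", "bread", "pear"}, {"bread"}]
frequent = apriori(transactions, min_sup=0.5)
rules = generate_rules(frequent, min_conf=0.7)
```

On this toy dataset the only rule that passes the confidence threshold is pear => milk.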

12
• APRIORI is limited at several steps of the extraction process:
 Large number of rules.
 Confidence and support measures carry no semantic meaning.
 User help is required to extract the targeted rules.
 The complexity of the algorithm is O(b).
APRIORI: limitations

13
• Advantages: unsupervised technique, readable results, complete result sets.
• Limits: large volume and low quality of the extracted rules:
• Statistically invalid:
▫ Onions => bread
• Redundant:
▫ R1: X, Y => Z [c]; Y => Z [c1]; X => Z [c2]
▫ c1 > c or c2 > c => R1 is redundant
• Already known by the expert:
▫ X => Y (the rule can be derived from the context)
• Useless for the expert:
▫ X => Y (the rule is semantically meaningless, such as apple => skirt)
• Difficulty of manual analysis
• Complexity of the algorithm:
▫ Complexity O(b)
• Need:
▫ Eliminate the uninteresting rules.
▫ Target the high-quality rules.
Data mining: association rules problematic

« An ontology is an explicit and formal
specification of a shared
conceptualization. » [Gruber,1993]
14
Knowledge engineering: ontologies
« Introducing an ontology into an
information system reduces
conceptual and terminological confusion
and offers a shared understanding that
enhances communication,
sharing, interpretation, and
possible reuse. » [gandon,2006]
Formal definition:
O = {C, G, I, P}
C = Concepts: elements of the domain.
G = Graph of concepts: the is-a relation.
I = Instances: individuals of the concepts.
P = Properties: relations between concepts.
Food product
  Fruit
    grape
      red grape
      green grape
    apple
    pear
  Dairy product
    milk
    cheese
    butter
  Meat
    chicken
    beef
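As an illustration of the definition O = {C, G, I, P}, the food taxonomy can be held in a small Python structure. This is only a sketch: the class `Ontology`, its field names, and the `ancestors` helper are hypothetical and are not taken from the project or from any ontology library.

```python
from dataclasses import dataclass, field

@dataclass
class Ontology:
    concepts: set = field(default_factory=set)       # C: elements of the domain
    is_a: dict = field(default_factory=dict)         # G: child concept -> parent concept
    instances: dict = field(default_factory=dict)    # I: individual -> concept
    properties: dict = field(default_factory=dict)   # P: name -> (domain, range)

    def add_concept(self, name, parent=None):
        self.concepts.add(name)
        if parent is not None:
            self.is_a[name] = parent

    def ancestors(self, concept):
        """Walk the is-a graph from a concept up to the root."""
        chain = []
        while concept in self.is_a:
            concept = self.is_a[concept]
            chain.append(concept)
        return chain

onto = Ontology()
onto.add_concept("Food product")
for parent, children in {
    "Food product": ["Fruit", "Dairy product", "Meat"],
    "Fruit": ["grape", "apple", "pear"],
    "grape": ["red grape", "green grape"],
    "Dairy product": ["milk", "cheese", "butter"],
    "Meat": ["chicken", "beef"],
}.items():
    for child in children:
        onto.add_concept(child, parent)

print(onto.ancestors("red grape"))
```

The `ancestors` walk is what later makes it possible to generalize an item such as "pear" to "Fruit" when rules are filtered or summarized.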

15
« The semantic web is a part of the current web in which information is represented
semantically, allowing machines and users to work better
together. »
[berners-lee et al.,2001]
• Knowledge representation languages:
▫ RDF, OWL, ...
▫ OWL DL is based on description logic, a precise and
decidable formalism.
• Reasoning engines:
▫ Tasks: concept classification, consistency checking, and instance checking.
▫ FaCT, Racer, Pellet, ...
▫ Query language: SPARQL.
Knowledge engineering: semantic web

16
• Increase the use of ontologies in the association rule
extraction process:
• Convert ontologies into a transactional dataset.
▫ Benefit from their semantic richness to improve the quality of association
rules.
▫ Reduce the complexity of the classical association rule algorithms.
Objectives

17
I. A new method to extract transactional information from
ontologies.
II. An application to extract, validate, and
visualize association rules.
III. Use of the HADOOP framework to extract frequent item-sets.
IV. Experiments on the NiceTag ontology.
Contribution

18
Timeline
Existing approaches
• Complexity problem
• Quality problem
• Conclusion

19
• FP-Growth identifies all frequent item-sets without generating candidate item-sets.
• Two-step approach:
▫ Step 1: Build a compact data structure named the FP-tree. This step requires two passes over the
dataset.
▫ Step 2: Extract the frequent item-sets directly from the FP-tree.
Complexity problem: FP-Growth algorithm
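The first step, building the FP-tree, might look like the sketch below. It covers tree construction only, not the mining of Step 2, and it is not the project's code: `FPNode`, `build_fp_tree`, the toy transactions, and the tie-breaking rule (alphabetical order among equally frequent items) are assumptions made for the example.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # Pass 1: count each item and keep the frequent ones.
    counts = Counter(i for t in transactions for i in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}

    # Pass 2: insert each transaction with its items sorted by decreasing
    # frequency, so that transactions share prefixes in the tree.
    root = FPNode(None)
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for i in items:
            node = node.children.setdefault(i, FPNode(i, node))
            node.count += 1
    return root, frequent

transactions = [["milk", "bread"], ["milk", "pear"],
                ["milk", "bread", "pear"], ["bread"]]
root, frequent_items = build_fp_tree(transactions, min_count=2)
```

Three of the four transactions start with "bread" after sorting, so they share a single branch of the tree; this sharing is what keeps the structure compact.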

20
• MAFIA algorithm [Burdick, 2005]:
▫ Extracts maximal frequent item-sets (frequent item-sets with no frequent superset).
• CHARM algorithm [J. Zaki et al, 2002]:
▫ Extracts closed item-sets (item-sets with no superset of equal support).
Complexity problem: more algorithms
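To make the two notions concrete, the sketch below derives the closed and the maximal item-sets from an already computed table of frequent item-sets. It is a brute-force illustration of the definitions, not the MAFIA or CHARM algorithms themselves, which avoid enumerating every frequent item-set; the function name and the toy support counts are hypothetical.

```python
def closed_and_maximal(frequent):
    """frequent maps each frequent item-set (frozenset) to its support count."""
    closed, maximal = set(), set()
    for itemset, sup in frequent.items():
        supersets = [other for other in frequent if itemset < other]
        # Maximal: no frequent proper superset at all.
        if not supersets:
            maximal.add(itemset)
        # Closed: no proper superset with the same support.
        if all(frequent[other] < sup for other in supersets):
            closed.add(itemset)
    return closed, maximal

frequent = {
    frozenset({"milk"}): 3,
    frozenset({"bread"}): 3,
    frozenset({"pear"}): 2,
    frozenset({"milk", "bread"}): 2,
    frozenset({"milk", "pear"}): 2,
}
closed, maximal = closed_and_maximal(frequent)
```

Here {pear} is not closed because {milk, pear} has the same support, and every maximal item-set is also closed.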

21
• Pruning: minimum improvement constraint filter (MICF) [bayardo et al.1999]
▫ R1: milk, pork => pear [S=20%, C=71%]
▫ R2: milk => pear [S=25%, C=70%]
▫ R3: pork => pear [S=30%, C=72%]
▫ R1 is redundant: the simpler rule R3 already has a higher confidence.
• Summarizing rules [liu et al.,1999;Srikant et agrwal,1996]
▫ Apple => pork
▫ Pear => pork
▫ Summary: Fruit => pork
Quality problem: post-processing technique
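The pruning example can be reproduced with a short sketch of a minimum-improvement filter. The rule representation, the function names `improvement` and `micf`, and the default threshold of 0 are assumptions for illustration; they are not taken from ARIPSO.

```python
def improvement(rule, rules):
    """Confidence gain of a rule over its best simpler rule, or None if it has none.

    A simpler rule has the same consequent and a proper subset of the antecedent.
    """
    simpler = [r["conf"] for r in rules
               if r["consequent"] == rule["consequent"]
               and r["antecedent"] < rule["antecedent"]]
    return rule["conf"] - max(simpler) if simpler else None

def micf(rules, min_improvement=0.0):
    """Keep the rules whose improvement reaches min_improvement."""
    kept = []
    for rule in rules:
        gain = improvement(rule, rules)
        if gain is None or gain >= min_improvement:
            kept.append(rule)
    return kept

rules = [
    {"name": "R1", "antecedent": frozenset({"milk", "pork"}),
     "consequent": frozenset({"pear"}), "conf": 0.71},
    {"name": "R2", "antecedent": frozenset({"milk"}),
     "consequent": frozenset({"pear"}), "conf": 0.70},
    {"name": "R3", "antecedent": frozenset({"pork"}),
     "consequent": frozenset({"pear"}), "conf": 0.72},
]
kept_names = [r["name"] for r in micf(rules)]
```

R1 is dropped because its confidence (71%) is lower than that of R3 (72%), which has a shorter antecedent and the same consequent.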

22
• Features of the selected rules [Silberschats et Tuzhilin,1995]:
▫ Novelty: rules that are unexpected for the expert.
▫ Actionability: useful rules that allow an expert to make decisions.
• Quality metrics [Freitas,1999]:
▫ Objective measures.
▫ Subjective measures.
• Objective metrics (data-based)
[Piatetsky-shapiro,1991;Guillet and Hamilton,2007]
• Data-based statistical indicators of the significance of association rules.
• Advantage: unsupervised quality metrics are easy to apply.
• Disadvantage: not suited to user-specific criteria.
Quality problem: metrics
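As an example of objective, data-based measures, the sketch below computes support and confidence for a single rule, plus lift. Lift is not named on the slide; it is included here as an assumption, as one common objective measure. The function name and the toy transactions are hypothetical, and the code assumes that both the antecedent and the consequent occur at least once in the data.

```python
def objective_measures(transactions, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent => consequent."""
    n = len(transactions)
    n_ante = sum(antecedent <= t for t in transactions)
    n_cons = sum(consequent <= t for t in transactions)
    n_both = sum((antecedent | consequent) <= t for t in transactions)
    support = n_both / n
    confidence = n_both / n_ante
    # Lift compares the confidence with the baseline frequency of the consequent;
    # a value above 1 means the two sides co-occur more often than expected by chance.
    lift = confidence / (n_cons / n)
    return support, confidence, lift

transactions = [frozenset(t) for t in (
    {"milk", "bread"}, {"milk", "pear"}, {"milk", "bread", "pear"}, {"bread"})]
support, confidence, lift = objective_measures(
    transactions, frozenset({"pear"}), frozenset({"milk"}))
```

None of these numbers uses any knowledge of what "pear" or "milk" mean, which is why such measures are easy to apply but cannot express user-specific criteria.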

23
• Models [klementtinen et al., 1994]
• Principle: the expert defines expectations against which the association rules are selected.
• Representing the expert's expectations:
• Inclusive pattern (PI) and restrictive pattern (PE).
• Selection technique: syntactic.
• Example:
▫ (PI) Fruit, Dairy products => Meat
▫ (PE) Pear, Dairy products => Meat
▫ R1: Pear, Milk => Pork
▫ R2: Apple, Milk => Chicken
▫ R3: Beef, Milk => Grape
• R2 is selected: it matches PI and does not match PE, whereas R1 matches PE and R3 does not match PI.
Quality problem: models
Food product
  Fruit
    grape
      red grape
      green grape
    apple
    pear
  Dairy product
    milk
    cheese
    butter
  Meat
    chicken
    beef
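The selection in this example can be checked with a small sketch that matches rule items against pattern concepts through the taxonomy. Everything here is illustrative: the `PARENT` table, the helper names, and the matching strategy (one rule item per pattern concept, in any order) are assumptions. Note that "pork" does not appear in the taxonomy shown on the slide; it is placed under "Meat" here as an assumption so that R1 can be evaluated.

```python
from itertools import permutations

# child -> parent, following the taxonomy on the slide ("pork" is an added assumption).
PARENT = {
    "Fruit": "Food product", "Dairy product": "Food product", "Meat": "Food product",
    "grape": "Fruit", "apple": "Fruit", "pear": "Fruit",
    "red grape": "grape", "green grape": "grape",
    "milk": "Dairy product", "cheese": "Dairy product", "butter": "Dairy product",
    "chicken": "Meat", "beef": "Meat", "pork": "Meat",
}

def is_a(item, concept):
    """True if item equals concept or is one of its descendants."""
    while item is not None:
        if item == concept:
            return True
        item = PARENT.get(item)
    return False

def side_matches(items, concepts):
    """Each item must specialize a distinct concept of the pattern side."""
    return len(items) == len(concepts) and any(
        all(is_a(i, c) for i, c in zip(perm, concepts))
        for perm in permutations(items))

def matches(rule, pattern):
    return side_matches(rule[0], pattern[0]) and side_matches(rule[1], pattern[1])

PI = (("Fruit", "Dairy product"), ("Meat",))   # inclusive pattern
PE = (("pear", "Dairy product"), ("Meat",))    # restrictive pattern
rules = {
    "R1": (("pear", "milk"), ("pork",)),
    "R2": (("apple", "milk"), ("chicken",)),
    "R3": (("beef", "milk"), ("grape",)),
}
selected = [name for name, rule in rules.items()
            if matches(rule, PI) and not matches(rule, PE)]
```

Only R2 passes both tests, which agrees with the result stated on the slide.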

24
Quality problem: post-processing technique
I. Association rule extraction using the classical method.
II. Knowledge model: enrichment of a model by an expert.
III. Post-processing phase, ARIPSO [Claudia Marinica. 2010]: apply
pruning/selection models.

25
• Previous approaches make limited use of ontologies.
• Using filtering models is a hard process that depends on the availability
of an expert.
Conclusion

26
Timeline
Proposed approach: ontological approach
• Description logic
• Semantic web and ontologies
• Conclusion

27
Ontological layers: T-Box & A-Box
• Attribute assertion
• Concept assertion
• Association assertion
• Consistency verification
• Satisfiability verification
T-Box A-Box
• Get / search
• Instance verification
• Training
• Coherence testing
Identity evaluation, homonymy, text search
• Define axioms
• Infer and classify concepts
• Infer associations
• Test the equivalence
• Test the implication
• Test the satisfiability
Reasoning
« Extracting a knowledge base means uncovering hidden information. »

28
Semantic web
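PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
# The mnply: prefix must also be declared; its namespace is not given on the slide.
SELECT ?player
WHERE {
  ?player rdf:type mnply:MonopolyPlayer .
}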

29
• Many syntaxes exist for representing an ontology, including:
 Manchester OWL Syntax
 OWL/XML
 OWL Functional Syntax
 RDF/XML
 Turtle
 LaTeX
• The OWL API makes it possible to query the ontology.
Ontologies: representation syntaxes
<owl:Class rdf:ID="Wine">
  <rdfs:subClassOf rdf:resource="&food;PotableLiquid"/>
  <rdfs:label xml:lang="en">wine</rdfs:label>
  <rdfs:label xml:lang="fr">vin</rdfs:label>
</owl:Class>

30
• Exploit the semantic richness to:
▫ Extract transactions:
 Step 1 : extract a T-Box model.
 Step 2 : extract an A-Box model.
▫ Apply an extraction algorithm to generate the association rules.
 How can this be achieved?
Association rules extraction: ontological approach

31
Ontological approach: general scheme
[Figure: general scheme of the ontological approach]
• Input: the ontology, loaded by the ontology manager.
• Step 1: T-Box extraction into the T-Box table, with concept-based user filtering.
• Step 2: A-Box extraction into the A-Box table, with instance-based user filtering.
• Transaction extraction produces the transactions.
• Algorithm choice: APRIORI, FP-Tree, or HADOOP.
• Association rule extraction produces the association rules and frequent item-sets.
• Validation and visualisation.

32
T-Box layer (step 1)
ID            Item-sets
patient       <p1, disease>, <p2, drug>, <p3, cardiologist>, <p4, gynecologist>, <p5, person>
disease       <p6-, symptom>, <p1-, patient>, <p7, drug>
doctor        <p8-, cardiologist>, <p9-, gynecologist>, <p10, person>
symptom       <p6, disease>
drug          <p2-, patient>, <p7-, disease>
cardiologist  <p8, doctor>, <p3-, patient>
gynecologist  <p9, doctor>, <p4-, patient>
person        <p5-, patient>, <p10-, doctor>
[Figure: concept graph connecting patient, drug, doctor, disease, symptom, person, gynecologist, and cardiologist through the properties p1 to p10]
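One way to obtain such a table is sketched below: each property contributes an item <p, range> to the row of its domain concept and an inverse item <p-, domain> to the row of its range concept. The property directions are read off the table above; the function name `tbox_transactions` and the dictionary layout are hypothetical and do not come from the project's code.

```python
# property -> (domain concept, range concept), as implied by the T-Box table.
PROPERTIES = {
    "p1": ("patient", "disease"),
    "p2": ("patient", "drug"),
    "p3": ("patient", "cardiologist"),
    "p4": ("patient", "gynecologist"),
    "p5": ("patient", "person"),
    "p6": ("symptom", "disease"),
    "p7": ("disease", "drug"),
    "p8": ("cardiologist", "doctor"),
    "p9": ("gynecologist", "doctor"),
    "p10": ("doctor", "person"),
}

def tbox_transactions(properties):
    """Build one item-set per concept from the property definitions."""
    table = {}
    for name, (domain, range_) in properties.items():
        # Direct item on the domain side.
        table.setdefault(domain, []).append((name, range_))
        # Inverse item (marked with "-") on the range side.
        table.setdefault(range_, []).append((name + "-", domain))
    return table

table = tbox_transactions(PROPERTIES)
for concept, items in table.items():
    print(concept, items)
```

Each row of the result can then be treated as one transaction by a frequent item-set algorithm.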

33
A-Box layer (step 2)
ID          Item-sets
Pat 10      <p1, disease 12>, <p2, drug 23>, <p3, cardiolo x>, <p5, person>
Pat 12      <p1, disease 12>, <p2, drug 24>, <p3, cardiolo x>, <p5, person>
doct 23     <p8-, cardiolo>, <p10, person>
symptom 45  <p6, disease 12>
[Figure: concept graph connecting patient, drug, doctor, disease, symptom, person, gynecologist, and cardiologist through the properties p1 to p10]

34
Frequent item-sets extraction using HADOOP
[Figure: the ontology files are the input of the HADOOP job, which writes the result files]
• MAP: identify all possible k-item-sets.
• REDUCE: calculate the support of all k-item-sets.
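The division of work between MAP and REDUCE can be imitated in plain Python. The sketch below only simulates the two phases in a single process; it does not use the Hadoop API, and the function names, the item encoding ("property:value"), and the two sample transactions (loosely based on the A-Box rows for Pat 10 and Pat 12) are assumptions.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(transaction, k):
    """MAP: emit (k-item-set, 1) for every k-item-set contained in a transaction."""
    for itemset in combinations(sorted(transaction), k):
        yield itemset, 1

def reduce_phase(pairs):
    """REDUCE: sum the emitted counts per k-item-set to obtain its support."""
    counts = defaultdict(int)
    for itemset, value in pairs:
        counts[itemset] += value
    return dict(counts)

def frequent_k_itemsets(transactions, k, min_count):
    emitted = (pair for t in transactions for pair in map_phase(t, k))
    counts = reduce_phase(emitted)
    return {s: c for s, c in counts.items() if c >= min_count}

transactions = [
    {"p1:disease 12", "p2:drug 23", "p5:person"},
    {"p1:disease 12", "p2:drug 24", "p5:person"},
]
pairs = frequent_k_itemsets(transactions, k=2, min_count=2)
```

Sorting the items before emitting guarantees that the same item-set always produces the same key, which is what lets the reducer group the counts correctly.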

35
Ontological approach: steps of association rules extraction
[Figure: extraction pipeline]
• Input: a support threshold and the choice of Apriori, FP-Tree, or Hadoop.
• Generate the frequent item-sets.
• Output: set of frequent item-sets.
• Generate the association rules using a multi-threaded process.
• Output: set of association rules, of which a sub-set of rules is kept.
36
Ontological approach : running flow
Ontology loading from a set of files

37
T-Box extraction to text file

38
T-Box filtering using GUI filter

39
A-Box extraction to text file

40
Association rules extraction
We used three techniques to extract
association rules:
• APRIORI [R. Agrawal et al, 1994]
• FP-Growth [J. Han et al, 2000]
• The HADOOP framework [Hadoop]

41
Validation and visualization of rules

42
Association rules storing

44
• Association rule mining suffers from two main issues:
▫ The complexity of data processing.
▫ The quality of the association rules.
• The semantic web can be exploited successfully to improve the quality of
association rules.
• In this project:
• We extracted a transactional dataset from an ontology.
• We applied different frequent item-set extraction techniques.
• We implemented a visual application to mine ontologies for association
rules.
Conclusion

46
• [Claudia Marinica. 2010] Association Rule Interactive Post-processing using Rule Schemas and Ontologies -
ARIPSO.
• [fayyad et al.,1996]: Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to
knowledge discovery in databases. AI Magazine, 17:37 – 54, 1996.
• [gandon,2006]: Fabien Gandon. Ontologies informatiques, May 2006.
• [gruber,1993]: Thomas R. Gruber. Toward principles for the design of ontologies used for knowledge sharing. In
Nicola Guarino and Roberto Poli, editors, Formal Ontology in Conceptual Analysis and Knowledge
Representation. Kluwer Academic Publishers, 1993.
• [berners-lee et al.,2001]: Tim Berners-Lee, James Hendler, and Ora Lassila. The semantic web - a new form of
web content that is meaningful to computers will unleash a revolution of new possibilities. Scientific American,
2001.
• [bayardo et al.1999]: Roberto J. Bayardo Jr., Rakesh Agrawal, and Dimitrios Gunopulos. Constraint-based rule
mining in large, dense databases. ICDE ’99: Proceedings of the 15th International Conference on Data
Engineering, pages 188–197, 1999.
• [liu et al.,1999]: Bing Liu, Wynne Hsu, and Yiming Ma. Pruning and summarizing the discovered associations. In
KDD ’99: Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data
mining, pages 125–134.ACM, 1999.
REFERENCES

47
• [Srikant et agrwal,1996]: Ramakrishnan Srikant and Rakesh Agrawal. Mining quantitative association rules in
large relational tables. In Proceedings of the 1996 ACM SIGMOD international conference on Management of
data, pages 1–12, 1996.
• [Silberschats et Tuzhilin,1995] : Abraham Silberschatz and Alexander Tuzhilin. On subjective measures of
interestingness in knowledge discovery. Knowledge Discovery and Data Mining (KDD), pages 275–281, 1995.
• [Piatetsky-shapiro,1991]: G. Piatetsky-Shapiro. Knowledge Discovery in Databases, chapter Discovery, Analysis,
and Presentation of Strong Rules, pages 229–248. AAAI/MIT Press, 1991.
• [Guillet and Hamilton,2007]: F. Guillet and H. Hamilton. Quality Measures in Data Mining. Studies in
Computational Intelligence, 2007
• [klementtinen et al., 1994]: Mika Klemettinen, Heikki Mannila, Pirjo Ronkainen, Hannu Toivonen, and A. Inkeri
Verkamo. Finding interesting rules from large sets of discovered association rules. International Conference on
Information and Knowledge Management (CIKM), pages 401–407, 1994
• [Burdick, 2005]: Doug Burdick, Manuel Calimlim, Jason Flannick, Johannes Gehrke, and Tomi Yiu. Mafia: A
maximal frequent itemset algorithm. IEEE Transactions on Knowledge and Data Engineering, 17(11):1490–1504,
2005
REFERENCES

48
• [J. Zaki et al, 2002]: Mohammed J. Zaki and Ching J. Hsiao. Charm: An efficient algorithm for
closed itemset mining. In Proceedings of SIAM’02, 2002.
• [R. Agrawal et al, 1994]: Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules.
Proceedings of the 20th International Conference on Very Large Data Bases, VLDB, pages 487–499, 1994.
• [J. Han et al, 2000]: Jiawei Han and Jian Pei. Mining frequent patterns by pattern-growth: methodology and
implications. ACM SIGKDD Explorations Newsletter, Special issue on Scalable data mining algorithms,
2(2):14–20, 2000.
• [Hadoop]: Apache Software Foundation. (2010). Hadoop. Retrieved from https://hadoop.apache.org
References
