Automatically Enriching a Thesaurus with Information from Dictionaries

Hugo Gonçalo Oliveira (supported by FCT scholarship grant SFRH/BD/44955/2008)
Paulo Gomes
{hroliv,pgomes}@dei.uc.pt
Cognitive & Media Systems Group
CISUC, Universidade de Coimbra
October 11, 2011
Gonçalo Oliveira & Gomes (CISUC)
KDBI, EPIA 2011
Index
1. Introduction
2. Proposed approach
3. Enriching TeP with synonymy in PAPEL
4. Evaluation
5. Concluding remarks
Introduction
Lexical knowledge bases
Thesauri, lexical networks, lexical ontologies, ...
Structured on words and their meanings
Try to cover the whole language
No specific domain
Essential for developing NLP tools for a language
  - Useful for NLP tasks (e.g. word-sense disambiguation, question answering, determining similarities, ...)
  - See Princeton WordNet [Fellbaum, 1998]
Introduction
Free lexical knowledge bases for Portuguese
Public domain thesauri:
  - TeP [Maziero et al., 2008]
  - OpenThesaurus.PT (http://openthesaurus.caixamagica.pt/)
Collaborative dictionary:
  - Portuguese Wiktionary (http://pt.wiktionary.org/)
Public domain lexical network:
  - PAPEL [Gonçalo Oliveira et al., 2010]
Lexical ontology [coming soon]:
  - Onto.PT [Gonçalo Oliveira and Gomes, 2010]
More complementary than overlapping
Fruitful to merge some of them into a single broader resource
Introduction
This work
Integrate synonymy information from dictionaries in a thesaurus
1. Extraction of synpairs from dictionaries
2. Assigning synpairs to synsets
3. Clustering remaining pairs
Apply the procedure in the enrichment of TeP with PAPEL
Proposed approach
Extracting synpairs from dictionaries
mente, n: cérebro, cabeça, intelecto
[mind, n: brain, head, intellect]
  - (cérebro, mente) (cabeça, mente) (intelecto, mente)
    [(brain, mind) (head, mind) (intellect, mind)]
máquina, n: o mesmo que computador
[machine, n: the same as computer]
  - (computador, máquina)
    [(computer, machine)]
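The two definition patterns above can be turned into synpairs with a few lines of code. This is a minimal sketch, not the grammars actually used to build PAPEL: the function name `extract_synpairs`, the regular expression, and the restriction to single-word synonyms are assumptions made for illustration.

```python
import re

# Hypothetical pattern for definitions of the form "o mesmo que X" ("the same as X").
SAME_AS = re.compile(r"^o mesmo que\s+(.+)$", re.IGNORECASE)


def extract_synpairs(headword, definition):
    """Return (synonym, headword) pairs found in one dictionary definition.

    Assumption: a definition is either "o mesmo que X" or a comma-separated
    enumeration of single-word synonyms.
    """
    text = definition.strip().rstrip(".")
    match = SAME_AS.match(text)
    candidates = [match.group(1)] if match else text.split(",")
    pairs = []
    for candidate in candidates:
        word = candidate.strip()
        # Multi-word candidates are skipped: they are more likely a gloss
        # than a synonym (a simplification of this sketch).
        if word and " " not in word and word != headword:
            pairs.append((word, headword))
    return pairs


print(extract_synpairs("mente", "cérebro, cabeça, intelecto"))
print(extract_synpairs("máquina", "o mesmo que computador"))
```

Real dictionary definitions mix synonyms with genus-differentia glosses, so a production extractor would need richer patterns than the two handled here.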
Proposed approach
Assigning synpairs to synsets
$p = (w_x, w_y)$ + $S_a = (w_1, w_2, \ldots, w_n)$ → $S_a = (w_1, w_2, \ldots, w_n, w_x, w_y)$
Synonymy graph $G$
  - All the extracted synpairs
  - Nodes represent words (e.g. $w_x$, $w_y$)
  - $p = (w_x, w_y)$ establishes an edge between $w_x$ and $w_y$
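As an illustration of the synonymy graph, the sketch below builds a word index and a symmetric 0/1 adjacency matrix from a list of synpairs, and reads the adjacency vector of a word as a column of that matrix. The helper names `build_adjacency` and `adjacency_vector` are hypothetical, and the dense list-of-lists representation is chosen only for readability; a sparse structure would be needed at dictionary scale.

```python
def build_adjacency(synpairs):
    """Build the synonymy graph as (word -> index, square 0/1 matrix)."""
    words = sorted({w for pair in synpairs for w in pair})
    index = {w: i for i, w in enumerate(words)}
    n = len(words)
    matrix = [[0] * n for _ in range(n)]
    for wx, wy in synpairs:
        i, j = index[wx], index[wy]
        matrix[i][j] = 1
        matrix[j][i] = 1  # synonymy is treated as symmetric
    return index, matrix


def adjacency_vector(word, index, matrix):
    """Adjacency vector of a word: the matching column of the matrix."""
    j = index[word]
    return [row[j] for row in matrix]


pairs = [("cérebro", "mente"), ("cabeça", "mente"),
         ("intelecto", "mente"), ("computador", "máquina")]
index, matrix = build_adjacency(pairs)
vec = adjacency_vector("mente", index, matrix)
print(sum(vec))  # number of neighbours of "mente" in the graph
```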
Proposed approach
Assigning synpairs to synsets
For each synpair $p = (w_x, w_y)$:
1. If $\exists S_i \in T : w_x \in S_i \wedge w_y \in S_i$, nothing is done.
2. Select all candidate synsets $C = \{C_1, C_2, \ldots, C_n\}$, $C \subset T$, such that $\forall C_j \in C : w_x \in C_j \vee w_y \in C_j$.
3. If $|C| = 1$, $p + C_1$.
4. Compute the adjacency vector $[p] = [w_x] + [w_y]$. The adjacency vector of a word is a column of the matrix $M$, $[w_j] = [M_j]$.
5. Compute the adjacency vector of each $C_j \in C$: $[C_j] = \sum_{k=1}^{|C_j|} [w_k],\ w_k \in C_j$.
6. Select the most similar synset $C_{best}$: $sim(p, C_{best}) = \max_j sim(p, C_j)$ (a).
7. $p + C_{best}$.
(a) Any measure for computing the similarity of two vectors can be used.
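The seven steps can be sketched as follows. This is an illustrative reading of the procedure, not the authors' implementation: adjacency vectors are stored as sparse `Counter` objects, cosine is used as the similarity measure, and all function names and the toy data are assumptions. Pairs with no candidate synset return `None` and are left for the clustering step.

```python
from collections import Counter
from math import sqrt


def adjacency(synpairs):
    """Sparse adjacency vectors of the synonymy graph: word -> Counter."""
    adj = {}
    for wx, wy in synpairs:
        adj.setdefault(wx, Counter())[wy] = 1
        adj.setdefault(wy, Counter())[wx] = 1
    return adj


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def vector_sum(words, adj):
    """Sum of the adjacency vectors of the given words (steps 4 and 5)."""
    total = Counter()
    for w in words:
        total.update(adj.get(w, Counter()))
    return total


def assign(pair, thesaurus, adj):
    """Add the pair to its best synset; return that synset's index or None."""
    wx, wy = pair
    # Step 1: some synset already contains both words.
    if any(wx in s and wy in s for s in thesaurus):
        return None
    # Step 2: candidate synsets contain at least one of the two words.
    candidates = [i for i, s in enumerate(thesaurus) if wx in s or wy in s]
    if not candidates:
        return None  # |C| = 0: left for clustering
    if len(candidates) == 1:
        best = candidates[0]  # step 3
    else:
        p_vec = vector_sum(pair, adj)  # step 4
        # Steps 5 and 6: most similar candidate synset.
        best = max(candidates,
                   key=lambda i: cosine(p_vec, vector_sum(thesaurus[i], adj)))
    thesaurus[best] |= {wx, wy}  # step 7
    return best


# Toy data (hypothetical): "plano" is ambiguous between two synsets.
pairs = [("plano", "gizamento"), ("gizamento", "projecto"),
         ("planície", "chã"), ("plano", "projecto")]
thesaurus = [{"plano", "planície"}, {"plano", "projecto"}]
print(assign(("plano", "gizamento"), thesaurus, adjacency(pairs)))
```

In the toy run the pair is attached to the second synset, because "gizamento" shares the neighbour "projecto" with it in the synonymy graph.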
Proposed approach
Clustering remaining pairs
$G'$ is established by the remaining pairs
1. Sparse matrix $M'$ ($|N| \times |N|$)
2. $M'_{ij} = sim([w_i], [w_j])$
3. Normalise the columns of $M$, so that $\sum_{k=1}^{|M_j|} M_{jk} = 1$
4. Extract cluster $S_i$ from each row $M'_i$, with the words $w_j$ where $M'_{ij} > \theta$
5. For each $S_i$: if $S_i \cup S_j = S_j$ and $S_i \cap S_j = S_i$, $S_i$ is discarded.
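A small sketch of the clustering steps follows, under several assumptions that the slides do not state: each word is treated as adjacent to itself, cosine is the similarity measure, the normalisation is applied to the columns of the similarity matrix, identical clusters are merged, and clusters with fewer than two words are dropped. The function name `cluster` and the toy pairs are hypothetical.

```python
from math import sqrt


def cluster(synpairs, theta):
    """Cluster the words of the remaining synpairs; return a list of sets."""
    # Adjacency sets of G', with self-loops (assumption of this sketch).
    adj = {}
    for wx, wy in synpairs:
        adj.setdefault(wx, {wx}).add(wy)
        adj.setdefault(wy, {wy}).add(wx)
    words = sorted(adj)
    n = len(words)
    # Step 2: cosine similarity between binary adjacency vectors.
    sim = [[len(adj[a] & adj[b]) / sqrt(len(adj[a]) * len(adj[b]))
            for b in words] for a in words]
    # Step 3: normalise each column so that it sums to 1.
    col_sums = [sum(sim[i][j] for i in range(n)) for j in range(n)]
    norm = [[sim[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    # Step 4: one cluster per row, keeping the words above the threshold.
    clusters = {frozenset(words[j] for j in range(n) if norm[i][j] > theta)
                for i in range(n)}
    # Step 5: discard clusters contained in another cluster.
    return [set(c) for c in clusters
            if len(c) > 1 and not any(c < other for other in clusters)]


remaining = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e")]
print(cluster(remaining, theta=0.2))
```

Because the column normalisation rescales similarities by how connected each word is, a suitable value of the threshold depends on the graph; the value 0.2 here only fits the toy data.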
Enriching TeP with synonymy in PAPEL
Coverage of the synpairs by TeP
POS         Synpairs  In TeP   |C| = 0  |C| = 1  |C| > 1  Avg. |C|
Nouns       37,452    27.38%   14.98%   12.01%   45.63%   3.86
Verbs       21,465    43.01%    1.34%    4.04%   51.66%   6.64
Adjectives  19,073    37.60%    5.58%    8.22%   48.60%   4.26
|C| = number of candidate synsets
Experimentation was performed using the cosine similarity
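The columns of the coverage table correspond to the cases distinguished when a synpair is compared with the thesaurus. As a hedged illustration, the sketch below computes the same kind of breakdown for toy data; the function name `coverage`, the category labels, and the data are assumptions, and the percentages it prints are unrelated to the figures reported for TeP and PAPEL.

```python
from collections import Counter


def coverage(synpairs, thesaurus):
    """Percentage of synpairs per case: covered, |C| = 0, |C| = 1, |C| > 1."""
    counts = Counter()
    for wx, wy in synpairs:
        if any(wx in s and wy in s for s in thesaurus):
            counts["in thesaurus"] += 1
            continue
        # Candidate synsets contain at least one word of the pair.
        n = sum(1 for s in thesaurus if wx in s or wy in s)
        key = "|C| = 0" if n == 0 else "|C| = 1" if n == 1 else "|C| > 1"
        counts[key] += 1
    total = len(synpairs)
    return {case: 100.0 * count / total for case, count in counts.items()}


toy_thesaurus = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
toy_pairs = [("a", "b"), ("x", "y"), ("a", "x"), ("b", "x")]
print(coverage(toy_pairs, toy_thesaurus))
```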
Enriching TeP with synonymy in PAPEL
Results – words
                               Words
Thesaurus          POS         Total   Ambiguous  Avg(senses)  Most ambig.
TeP 2.0            Nouns       17,158   5,805     1.71         20
                   Verbs       10,827   4,905     2.08         41
                   Adjectives  14,586   3,735     1.46         19
After assignments  Nouns       23,775  10,418     2.09         37
                   Verbs       12,818   7,094     2.64         42
                   Adjectives  17,158   6,294     1.83         22
Clusters           Nouns        8,546     701     1.15          8
                   Verbs          502       8     1.02          3
                   Adjectives   1,858      39     1.03          4
Final thesaurus    Nouns       30,369  12,045     1.96         38
                   Verbs       13,090   7,221     2.62         42
                   Adjectives  18,525   6,550     1.80         23
Enriching TeP with synonymy in PAPEL
Results – synsets
                               Synsets
Thesaurus          POS         Total   Avg(size)  size = 2  size > 25  max(size)
TeP 2.0            Nouns        8,254  3.56       3,079       0         21
                   Verbs        3,978  5.67         939      48         53
                   Adjectives   6,066  3.50       3,033      19         43
After assignments  Nouns        8,254  6.01       1,930     179        150
                   Verbs        3,978  8.50         702     217        148
                   Adjectives   6,066  5.17       2,369     120        110
Clusters           Nouns        3,524  2.78       2,247       0         13
                   Verbs          220  2.34         174       0          6
                   Adjectives     820  2.33         656       0         10
Final thesaurus    Nouns       11,778  5.05       4,177     179        150
                   Verbs        4,198  8.18         876     217        148
                   Adjectives   6,886  4.84       3,025     120        110
Evaluation
Assignments evaluation
Manual evaluation of sample assignments
Two judges for each assignment
POS         Sample            Correct       Incorrect    Agreement
Nouns       100 assigns. × 2  153 (76.50%)  47 (23.50%)  77.00%
Verbs       100 assigns. × 2  142 (71.00%)  58 (29.00%)  74.00%
Adjectives  100 assigns. × 2  151 (75.50%)  49 (24.50%)  75.00%
Evaluation
Assignments evaluation
POS         Synpair                 Synset                                      Judge 1  Judge 2
Nouns       (escrutínio,votação)    votação;voto;sufrágio                       1        1
            (decisão,desempate)     resolução;objetivação;tenção;intenção       0        1
            (plano,gizamento)       planície;chã;chanura;plaino;plano;planura   0        0
Verbs       (venerar,homenagear)    venerar;cultuar;adorar;idolatrar            1        1
            (atacar,combater)       atacar;inciar                               0        1
            (obter,rapar)           depilar;despelar;pelar;raspar;rapar;rascar  0        0
Adjectives  (grandioso,épico)       admirável;fabuloso;grandioso                1        1
            (delicado,requintado)   difícil;complicado;delicado                 0        1
            (falido,queimado)       queimado;incendiado                         0        0
Evaluation
Clustering
Manual evaluation of clusters
Two judges for each cluster
Cluster is correct if, in some context, all its words might have the same meaning
Table: Evaluation of clustering
POS         Sample   Correct       Incorrect    Agreement
Nouns       105 × 2  179 (85.24%)  31 (14.76%)  91.43%
Verbs       105 × 2  193 (91.90%)  17 (8.10%)   87.62%
Adjectives  105 × 2  189 (90.00%)  21 (10.00%)  85.71%
Figure: Examples of connected subgraphs and resulting clusters.
Concluding remarks
Update: computing similarity
Sum the adjacencies
  - One vector per synset: $[C_j] = \sum_{k=1}^{|C_j|} [w_k],\ w_k \in C_j$
  - $sim(p, C_j) = sim([p], [C_j])$
Average similarity of the pair with each synset element
  - One vector per synset element: $[C_j] = ([w_1], \ldots, [w_n])$, $n = |C_j|$
  - $sim(p, C_j) = \frac{\sum_{k=1}^{|C_j|} \cos([p], [M_{w_k}])}{|C_j|},\ w_k \in C_j$
Gold resource of 220 synpairs and possible assignments
  - Variable cut point $\theta$ on similarity
  - Possible to assign the same synpair to $0 \le n \le |C|$ synsets
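The two ways of computing $sim(p, C_j)$ and the cut point can be compared on a toy example. The sketch below is an assumption-laden illustration: vectors are sparse `Counter` objects, cosine is used in both variants, a synpair is assigned to every candidate whose similarity is strictly above $\theta$, and the names `sim_sum`, `sim_avg`, and `assign_with_cut` are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def sim_sum(p_vec, member_vecs):
    """Variant 1: compare [p] with the sum of the members' adjacency vectors."""
    total = Counter()
    for vec in member_vecs:
        total.update(vec)
    return cosine(p_vec, total)


def sim_avg(p_vec, member_vecs):
    """Variant 2: average cosine between [p] and each member's vector."""
    return sum(cosine(p_vec, vec) for vec in member_vecs) / len(member_vecs)


def assign_with_cut(p_vec, candidates, theta, sim=sim_avg):
    """Return the ids of all candidate synsets whose similarity exceeds theta.

    `candidates` maps a synset id to the adjacency vectors of its words.
    The result may be empty or contain several synsets.
    """
    return [cid for cid, vecs in candidates.items() if sim(p_vec, vecs) > theta]


p = Counter({"a": 1, "b": 1})
candidates = {
    "C1": [Counter({"a": 1}), Counter({"b": 1})],
    "C2": [Counter({"c": 1})],
}
print(assign_with_cut(p, candidates, theta=0.5))
print(assign_with_cut(p, candidates, theta=0.9))
print(assign_with_cut(p, candidates, theta=0.9, sim=sim_sum))
```

On this data the averaged variant scores C1 lower than the summed variant, so the same cut point can accept a synset under one measure and reject it under the other.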
Concluding remarks
Final remarks
Flexible method for enriching a thesaurus with synonymy from dictionaries
Applied to the enrichment of a Portuguese thesaurus
This work was carried out in the scope of Onto.PT
  - Automatic creation of a lexical ontology for Portuguese
  - Extraction + integration of lexical information from textual sources
  - Soon freely available!
  - Check http://ontopt.dei.uc.pt
Concluding remarks
Thank you!
References
[Fellbaum, 1998] Fellbaum, C., editor (1998). WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press.
[Gonçalo Oliveira and Gomes, 2010] Gonçalo Oliveira, H. and Gomes, P. (2010). Onto.PT: Automatic Construction of a Lexical Ontology for Portuguese. In Proc. 5th European Starting AI Researcher Symposium (STAIRS 2010). IOS Press.
[Gonçalo Oliveira et al., 2010] Gonçalo Oliveira, H., Santos, D., and Gomes, P. (2010). Extracção de relações semânticas entre palavras a partir de um dicionário: o PAPEL e sua avaliação. Linguamática, 2(1):77–93.
[Maziero et al., 2008] Maziero, E. G., Pardo, T. A. S., Felippo, A. D., and Dias-da-Silva, B. C. (2008). A Base de Dados Lexical e a Interface Web do TeP 2.0 - Thesaurus Eletrônico para o Português do Brasil. In VI Workshop em Tecnologia da Informação e da Linguagem Humana (TIL), pages 390–392.