Verbetes v3

Task 3 – Web Community Sensing & Task 6 – Query and Visualization
REACTION Workshop – January 31st, 2013
31 January 2013
Luis Rei
@lmrei
http://luisrei.com
Jorge Teixeira
Context: Verbetes
• Associate office or profession (ergo) to a person’s name
• Temporal information
• Applied to Portuguese news
Example: "Portuguese prime minister Pedro Passos Coelho ..."
Name: Pedro Passos Coelho
Ergo: Portuguese prime minister
Context: Voxx
Context: O Mundo Visto Daqui (MVDI)
Verbetes v3: Motivation
• Extract new types/categories of entities
(Organizations, Products, Locations)
• Enrich entities with new descriptors:
• Corporate title
• Dates of Birth/Death
• Photos
• Company foundation date
• ...
Verbetes v3: Plan
1. NER (Identification, Classification)
2. Disambiguation
3. Descriptor Extraction
4. Fusion & Automatic Cross-Validation
(Wikipedia, Freebase, ...)
5. Web Service
1. Named Entity Recognition
Apple’s CEO, Tim Cook, said the iPhone has cannibalized some iPod business.
[Apple]’s CEO, [Tim Cook], said the [iPhone] has cannibalized some [iPod] business.
• ORG: Apple
• Person: Tim Cook
• Product: iPhone, iPod
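The bracket notation on this slide can be read back mechanically. A minimal sketch, assuming entities are marked with plain square brackets as above; the function name `bracketed_entities` is hypothetical and is not part of Verbetes:

```python
import re


def bracketed_entities(annotated: str) -> list[str]:
    """Return the text of every [bracketed] span, in order of appearance."""
    # Assumption: brackets are never nested and only mark entities.
    return re.findall(r"\[([^\[\]]+)\]", annotated)


sentence = (
    "[Apple]'s CEO, [Tim Cook], said the [iPhone] has cannibalized "
    "some [iPod] business."
)
print(bracketed_entities(sentence))
```

This only recovers the identification half of NER; assigning ORG, Person or Product to each span is the classification half and needs a trained model.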
2. Disambiguation
Same category: Vitor Pereira (Person)
• Coach
• President of the Referee Commission
Different category: Francisco Sá Carneiro
• Person
• Location (Airport)
• Organization (Institute)
3. Descriptor Extraction
[Apple]’s CEO, [Tim Cook], said the [iPhone] has cannibalized some [iPod] business.
• ERGO: CEO
4. Fusion
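Entity: Muammar Gaddafi
• First-name variants: Muammar, Moammar, Mu'ammar, Moamar, ...
• Surname variants: Gaddafi, Gathafi, Kadafi, Qaddafi, ...
Descriptor: Presidente da Liga Portuguesa de Futebol (President of the Portuguese Football League)
• Variant: Presidente da Liga de Futebol (President of the Football League)
• Variant: Presidente da Liga Portuguesa de Futebol Profissional (President of the Portuguese Professional Football League)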
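One way to picture fusion of spelling variants is to map each name to a crude key and group names that share it. The sketch below is an illustration only, not the method used in Verbetes: the `skeleton` key and its rewrite rules are hypothetical and were chosen just to cover the Gaddafi example on this slide.

```python
import re
from collections import defaultdict


def skeleton(name: str) -> str:
    """Hypothetical normalization key: a rough consonant skeleton."""
    s = name.lower()
    s = re.sub(r"[^a-z ]", "", s)    # drop apostrophes and other marks
    s = s.replace("th", "d")         # Gathafi -> Gadafi
    s = re.sub(r"[kq]", "g", s)      # Kadafi, Qaddafi -> Gadafi, Gaddafi
    s = re.sub(r"[aeiou]", "", s)    # ignore vowel differences (u/o)
    s = re.sub(r"(.)\1+", r"\1", s)  # collapse doubled letters
    return s


def fuse(names):
    """Group names that share the same skeleton key."""
    groups = defaultdict(list)
    for name in names:
        groups[skeleton(name)].append(name)
    return dict(groups)


names = [
    "Muammar Gaddafi",
    "Moammar Gathafi",
    "Mu'ammar Qaddafi",
    "Moamar Kadafi",
    "Barack Obama",
]
for key, variants in fuse(names).items():
    print(key, variants)
```

A key this aggressive will also merge unrelated names, which is one reason the plan pairs fusion with cross-validation against external sources such as Wikipedia and Freebase.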
5. Web Services
NER: State-of-the-Art Approaches
1. Get an annotated corpus
2. Train a model
3. Use it to extract entities
Research Problems
• How to annotate the corpus
• Corpus age effect on training data
• Language dependency of all the tools
• Feature extraction, Tokenization, ...
Our Approach: Bootstrapping CRFs
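Inputs: a seed list of names (Barack Obama, Hillary Clinton, ...) plus un-annotated news
1. Annotate: mark the seed names in the news to get annotated news, e.g. "[Barack Obama] nomeou..." ("[Barack Obama] appointed...")
2. Extract features: build the train set, e.g. "nomeou verbo singular nomear ..." (token, part of speech "verb", number "singular", lemma "nomear")
3. Training: train the model on the train set
4. Test, extract, add: run the model on the un-annotated test set, extract names, add them to the list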
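Steps 1 and 2 of the loop can be sketched without any CRF library. This is a minimal illustration under stated assumptions: whitespace tokenization, a BIO tagging scheme with a single `PER` label, and a small feature set. None of these is confirmed by the slides, and the part-of-speech and lemma features shown above would need a Portuguese tagger that is omitted here. The function names are hypothetical.

```python
def annotate(tokens, seed_names):
    """Step 1: tag every occurrence of a seed name with B-PER / I-PER."""
    tags = ["O"] * len(tokens)
    # Longest names first, so a full name wins over a shorter prefix of it.
    seeds = sorted(
        (name.split() for name in seed_names if name.split()),
        key=len,
        reverse=True,
    )
    i = 0
    while i < len(tokens):
        for seed in seeds:
            if tokens[i:i + len(seed)] == seed:
                tags[i] = "B-PER"
                for j in range(i + 1, i + len(seed)):
                    tags[j] = "I-PER"
                i += len(seed)
                break
        else:
            i += 1  # no seed name starts at this token
    return tags


def token_features(tokens, i):
    """Step 2: a few surface features for token i (assumed feature set)."""
    return {
        "lower": tokens[i].lower(),
        "is_title": tokens[i].istitle(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }


seed_list = ["Barack Obama", "Hillary Clinton"]
tokens = "Barack Obama nomeou Hillary Clinton".split()
print(list(zip(tokens, annotate(tokens, seed_list))))
print(token_features(tokens, 2))
```

In a full bootstrap iteration, the tag sequences and feature dictionaries would be handed to a CRF trainer (step 3), and names the model extracts from unseen news would be appended to `seed_list` before the next round (step 4).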
Corpus & Evaluation
• News articles from Sapo (~3M)
• Select 20,000 that have names in seed list
• Precision - 100 articles/test
• For each name extracted from each news article:
correct or not.
• Recall - 40 articles/test
• For each article, calculate recall
context | extracted entity | correct
... Cristiano Ronaldo ... | Cristiano Ronaldo | TRUE
... Pedro Passos Coelho ... | Pedro Passos | FALSE
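The evaluation protocol above can be written down as two small functions. A minimal sketch: the gold-name lists and the second article are made-up illustration data, not the workshop's test sets, and using the population standard deviation for the recall spread is an assumption.

```python
from statistics import mean, pstdev


def precision(judgments):
    """Share of extracted names judged correct (expects a non-empty list)."""
    return sum(judgments) / len(judgments)


def article_recall(extracted, gold):
    """Share of an article's gold names that were extracted exactly."""
    gold = set(gold)
    return len(gold & set(extracted)) / len(gold)


# The two judgments from the table above: one TRUE, one FALSE.
print(precision([True, False]))

# Hypothetical articles as (extracted names, gold names) pairs.
articles = [
    (["Cristiano Ronaldo", "Pedro Passos"],
     ["Cristiano Ronaldo", "Pedro Passos Coelho"]),
    (["Barack Obama"], ["Barack Obama"]),
]
recalls = [article_recall(found, gold) for found, gold in articles]
print(mean(recalls), pstdev(recalls))
```

A truncated name such as "Pedro Passos" counts as wrong for precision and as a miss for recall, matching the FALSE row in the table.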
Preliminary Results
Metric | Value | Notes
Precision | 0.97 | avg of 3 tests, 1 bootstrap iteration
Recall | 0.56 (std = 0.34) | single test, 1 bootstrap iteration
F | 0.71 |
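The reported F value is consistent with the balanced F-measure (harmonic mean of precision and recall). A short check, assuming that is the F intended on the slide:

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f_measure(0.97, 0.56), 2))  # 0.71
```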
Possible Applications: O Mundo Numa Rede
Q&A
Thank You,
Luis Rei