Foundations of Information Retrieval
Transcrição
Foundations of Information Retrieval
Text Analytics Introduction to Information Retrieval Ulf Leser Content of this Lecture • Information Retrieval – Introduction – Documents – Queries • A first idea: Boolean queries and vector space model Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 2 Information Retrieval • Information Retrieval is about helping a user in finding information • IR is about finding information, not data [BYRN99] • IR builds systems for end users, not for programmers – No SQL – IR today is used by almost everybody, databases are not • IR searches text – >90% of all information is available in text, not in databases Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 3 History of IR • ~300 ad. Library of Alexandria , ~700.000 „documents“ • 1450: Bookprint • 19th century: Indices and concordances – Extremely expensive and laborious • • • • • • • KWIC and concordances: H.P. Luhn, IBM (1958) Probabilistic models: Maron & Kuhns (1960) Boolean queries: Lockheed (~1960) Vector Space Model: Salton, Cornell (1965) Digital libraries, SGML, metadata standards Mid 90s: The web, web search engines End 90s: Personalized search engines, enterprise search etc. • 90% of all information is believed to be in unstructured form Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 4 Information Retrieval: The Problem • A bit more concrete: Help user in quickly finding information from a given set of documents – A corpus (linguistics) or document collection – Quickly means: The right documents and possibly the right passages in documents Query IR System (Ranked) Results Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 Document base Corpus 5 Why is it hard? • Homonyms and context – Fenster (Glas, Computer), Berlin (BRD, USA), Boden (Dach, Fussboden), … • Synonyms – Computer, PC, Rechner, Server, … • Specific queries – What was the score of Bayern München versus Stuttgart im DFB Pokal in 1998? Who scored the first goal for Stuttgart? – How many hours of sunshine on average has a day in Krete in May? – What are the last lines of “To be or not to be” in the third stage? • Typical web queries have 1,6 terms • “Information broker” is a profession Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 6 … to Quickly Find Information • Speed – Time to execute a query (should be fast) • Indexing, parallelization, compression, cluster, … – Time to answer the query (should be fast too) • Reducing the time to “answer the query” is harder – – – – – Success of search engines: Better results (and fast!) Guess users need and return (only) relevant information Process-orientation: User feedback, query history, … Reliable information extraction Ranking • Information overload – “We are drowing in data, but starving for knowledge” – In the corpus is large, ranking is top priority – Different search modes: What’s new? What’s certain? Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 7 Document or Passage • Searching only metadata or searching within documents • Returning only documents or returning relevant passages in document Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 8 KWIC: KeyWord In Context Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 9 Documents • Natural language text • Might be grammatical correct (books, newspapers) or not (BLOGs, newsgroups, spoken language) • May have structure (abstract, summary, chapters, …) or not (plain text) • May have (explicit or in-text) metadata (title, author, year, …) or not • May refer to other documents (hyperlinks) • Various formats (ASCII, PDF, DOC, XML, …) Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 10 IR and XML Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 11 XML and IR • XML has two flavors – Data oriented: Highly structured, except from database, short documents, high volume, no inline tags, structure-oriented search – Document oriented: Less structured, large documents, inline tags, text-oriented search • This was the original purpose of XML (and SGML!) • XQuery/XPath is strictly targeting data – Focus on structural parts of a query – No full-text search, few search options within a value – Semantics: Exact matching Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 12 Queries • Users formulate queries – Keywords – Phrases – Logical operations (AND, OR, NOT, ...) • Web search: “-ulf +leser” – True questions (e.g. MS-Word help) – Structured queries (author=… and title contains …) • Querying with document – Find documents similar to this one • Query refinement – Refine my query based on first results – Find documents within the result set of the previous search Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 13 IR: An Iterative, Multi-Stage, Complex Process Q2 Q1 Q4 Q3 Q5 Q0 • “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” – “Berry-picking”, Bates 89 Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 14 Searching with Metadata (PubMed/Medline) Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 15 Query Refinement Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 16 Metadata Standard: Dublin Core • • • Dublin Core Metadata Initiative (W3C), 1995 15 attributes – – – – – – – – – – – – – – – identifier: format: type: language title subject: coverage: description: creator: publisher: contributor: rights: source: relation: date: E.g. ISBN/ISSN, URL/PURL, DOI, … MIME-Typ, media type, collection, image, text, … Keywords Scope of doc in space and/or time free text Last person manipulating the doc Copyright, licenses, … Other doc to other docs Date or period Plus refinements Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 17 Example • Usage in HTML • <head profile="http://dublincore.org/documents/dcq-html/"> <title>Dublin Core</title> <link rel="schema.DC" href="http://purl.org/dc/..." /> <link rel="schema.DCTERMS" href="http://purl.org/..." /> <meta name="DC.format" scheme=„…" content="text/htm" /> <meta name="DC.type" scheme=„…" content="Text" /> <meta name="DC.publisher" content="Jimmy Whales" /> <meta name="DC.subject" content="Dublin Core Metadata" /> <meta name="DC.creator" content="Björn G. Kulms" /> <meta name="DCTERMS.license" scheme="DCTERMS.URI„ content="http://www.gnu.org/copyleft/fdl.html" /> </head> Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 18 Historic Texts • Sachsenspiegel, ~1250 – ”Swerlenrecht kůnnen wil•d~ volge dis buches lere.alrest sul wi mer ken, daz ...” • Multiple representations – Facsimile – Digitalization / diplomacy • How well can the facsimile be reproduced? – Differences in individual writers (proliferating errors) – Different translations – Different editions Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 19 Text Encoding Initiative • www.tei-c.org: – „The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics.“ • Very complex schema for representing text and annotation – Graphical features, linguistic annotation, editions, metadata, … • Widely used in linguistics • Based on XML stand-off architecture, extensible Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 20 Note • This lecture is on analyzing / searching in pure text – No structure, no metadata etc. • See lectures on databases, XML, semi-structured data, semantic web, … Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 21 Components of an IR System Corpus Parsing Document IDs Normalization Term Weighting IR Database Normalization Filtering Parsing Ranking Query-Interface Feedback User Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 Frakes et al. 1992 22 Other Buzzwords • Document management systems (DMS) – Large commercial market, links to OCR, workflow systems, print management systems, mail management, etc. – Legal issues – Every DMS includes an IR system • Knowledge management – “More sophisticated” DMS with semantic searching • Ontologies, thesauri, topic maps, … – Social aspects: incentives, standard procedures, enterprise vocabulary • “if Siemens knew what Siemens knows” • Digital libraries – Somewhat broader and less technical • Includes social aspects, archiving, multimedia, … Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 23 Enterprise Content Management • „The technologies used to capture, manage, store, deliver, and preserve information to support business processes. „ • Authorization and authentication • Business process management and document flow • Compliance: legal requirements – Record management – Pharma, Finance, … • Collaboration and sharing – Inter and intra organizations – Transactions, locks, … • Publishing: What, when, where Quelle: AIIM International – Web, catalogues, mail push, … • … Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 24 Technique versus Content • IR is about techniques for searching document collections • Creating doc collections is a business: Content provider – Pre-selected collections: classified business news, new patents, … – Example for content • Medline: >5000 Journals, 16 Mill citations, >500K additions every year • Journals not in Medline don’t exist • Thompson: patent search and indexing – Example for classifying content • Institute for Scientific Information (ISI) • Impact factors: which journals count? – Added value: Categorization, classification, filtering of docs • Example: MeSH terms, ACM keywords, … • Sometimes, content provider buy tech companies and vice versa Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 25 Content of this Lecture • Information Retrieval • A first idea: Boolean queries and vector space model Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 26 Logical View • The logical view on a document determines what we query – E.g.: Only metadata, only title, only abstract, full text, … • Logical view on the raw text – Stemming – Stopword removal – Transformation of special characters (Umlaute, greek letters, XML/HTML encodings, …) – Removal of formatting information (HTML), tags (XML), … • Bag of words (BoW) – View on the text which ignores order of words – Extremely popular – Used form now on when not indicated otherwise Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 27 Preprocessing Quelle: [BYRN99] • • • • • Tokenization can be a real issue Special characters can be a real issue Format conversion (PDF, OCR, PostScript, …) can be a real issue Effect of normalization is not clear Must be applied to all docs and to query (partially) Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 28 A First Approach: Boolean Queries (rough idea) • Search all docs which “match” a logical expression of words – – – – “bono AND u2” “bono AND (Arkansas OR Indiana)” “bono AND NOT u2” Note: Search engines don’t support “OR” • Evaluation on BoW b – An atom (term) w in the expression evaluates to true if w∈b – Compute values of all atoms – Compute value of expression • Advantages: Fast, simple • Disadvantages – No ranking of documents: Either relevant or not – Logic is hard to understand for ordinary users – Typical queries are too simple – too many results (without ranking) Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 29 Vector Space Model - Concept Quelle: Henrich, Information Retrieval 1 Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 30 Vector Space Model – Maths (rough idea) • Let d by a doc, D be a collection of docs with d∈D • Let w() map a doc or a collection to its bag of words • Let vd by a vector for d with – vd[i]=0 if the i’th word in w(D) is not contained in w(d) – vd[i]=1 if the i’th word in w(DL) is contained in w(d) • Let vq by the corresponding vector for query q • Compute a relevance score for each document by | w ( D )| rel (d , q ) = v d • vq | vd | * | v q | = ∑ v [i] * v [i] i =1 d 2 [ ] v i ∑d * Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 q 2 [ ] v i ∑q 31 Angel between Doc and Query Star • The relevance is the angle Astronomy between the doc and the query in |w(D)| - dimensional space • Compute relevance score for all d∈D • Return docs in order of decreasing relevance • Advantages – – – – Movie stars Mammals Diet Still quite simple “Meaningful” ranking – better results than Boolean model Allows approximate answers Dominant model today • Disadvantages – Words are assumed to be statistically independent Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 32 Outlook: Latent Semantic Indexing • We usually also want to find documents which don‘t even contain the words we are searching for, but which contain the concepts we are trying to express by our query – But what is a concept? • LSI approach – Map query and docs into a concept space – Compare and rank in concept space – Technical: Dimensionality reduction (like PCA etc.) • Other approach – Find relevant docs using VSM with score above threshold – Find consensus (e.g. words in all docs) – Use consensus as new query and iterate Ulf Leser: Text Analytics, Vorlesung, Sommersemester 2008 33