Text clustering and classification (Klastrování a klasifikace textů)

Gabašová, Evelina

Klastrování a klasifikace textů

bachelor thesis (DEFENDED)


Permanent link

http://hdl.handle.net/20.500.11956/13039

Identifiers

Study Information System: 46458

Referee

Hric, Jan

Faculty / Institute

Faculty of Mathematics and Physics

Discipline

General Computer Science

Department

Department of Theoretical Computer Science and Mathematical Logic

Date of defense

10. 9. 2007

Publisher

Univerzita Karlova, Matematicko-fyzikální fakulta

Language

English

Grade

Excellent

Klastrování a klasifikace textů jsou důležitými úlohami strojového učení. V této práci je prezentována kombinace jejich přístupů. Hlavním účelem bylo automaticky připravit množinu klastrů (nebo obecně konceptů), které by následně sloužily jako trénovací data pro naučení klasifikátoru. Tato práce zahrnuje teoretické pozadí, detaily implementace a výsledky experimentů pro klastrování a klasifikaci textových dokumentů. Trénovací soubor dokumentů je nejprve hierarchicky klastrován algoritmem bisecting k-means. Výsledek tohoto procesu je možné upravovat a vylepšovat s využitím expertní znalosti. Tímto způsobem vytvořená hierarchická struktura je použita pro naučení naivního bayesovského klasifikátoru, který je následně využit k roztřídění testovací množiny dokumentů. Pro tyto účely byl vyvinut program, jehož výsledky jsou zhodnoceny a porovnány při zpracování českých a anglických dokumentů.

Abstract (English)

Text clustering and classification are important machine learning tasks. This work presents a combination of the two approaches. The main purpose was to automatically prepare a set of clusters (or, more generally, concepts) that would subsequently serve as training data for a classifier. The work comprises theoretical background, implementation details, and experimental results for clustering and classification of text documents. A training set of documents is first hierarchically clustered by the bisecting k-means algorithm. The result is offered to an expert, who can modify and possibly improve the hierarchy. The resulting structure is then used to train a naive Bayes classifier, which classifies a test set of documents. A program was developed to perform these tasks, and its results are evaluated and compared on document collections written in English and in Czech.
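
The pipeline described in the abstract (bisecting k-means clustering followed by naive Bayes training on the resulting clusters) might be sketched roughly as follows. This is an illustrative sketch only, not the program developed in the thesis. Every function name, the whitespace tokenizer, the cosine similarity measure, the deterministic seeding rule, the Laplace smoothing, and the toy documents are assumptions made for the example. The sketch also returns flat clusters rather than a stored hierarchy, and it omits the expert revision step.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: plain lowercase whitespace tokenization.
    return text.lower().split()

def cosine(a, b):
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    # Summed term counts; cosine similarity ignores the scale.
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def two_means(indices, vectors, iterations=10):
    # Assumed deterministic seeding: the first document and the one least similar to it.
    first = vectors[indices[0]]
    second = min((vectors[i] for i in indices), key=lambda v: cosine(first, v))
    centers = [first, second]
    groups = [list(indices), []]
    for _ in range(iterations):
        groups = [[], []]
        for i in indices:
            nearer = 0 if cosine(vectors[i], centers[0]) >= cosine(vectors[i], centers[1]) else 1
            groups[nearer].append(i)
        if not groups[0] or not groups[1]:
            break
        centers = [centroid(vectors[i] for i in g) for g in groups]
    return groups

def bisecting_kmeans(vectors, k):
    # Repeatedly split the largest cluster in two until k clusters exist.
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        left, right = two_means(largest, vectors)
        if not left or not right:  # the cluster could not be split further
            clusters.append(largest)
            break
        clusters += [left, right]
    return clusters

def train_naive_bayes(vectors, clusters):
    # Multinomial naive Bayes with cluster membership used as the class label.
    vocab_size = len({t for v in vectors for t in v})
    model = []
    for members in clusters:
        counts = centroid(vectors[i] for i in members)
        prior = math.log(len(members) / len(vectors))
        model.append((prior, counts, sum(counts.values())))
    return model, vocab_size

def classify(text, model, vocab_size):
    tokens = tokenize(text)
    def score(entry):
        prior, counts, total = entry
        # Laplace (add-one) smoothing handles unseen words.
        return prior + sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)
    return max(range(len(model)), key=lambda c: score(model[c]))

# Toy training documents (invented for illustration).
docs = [
    "cat dog pet animal",
    "dog pet animal vet",
    "cat pet vet animal",
    "stock market bank money",
    "bank money stock trade",
    "market trade money bank",
]
vectors = [Counter(tokenize(d)) for d in docs]
clusters = bisecting_kmeans(vectors, 2)
model, vocab_size = train_naive_bayes(vectors, clusters)
label = classify("my dog visits the vet", model, vocab_size)
print(clusters, label)
```

In this sketch `classify` returns an index into `clusters`, so a new document is assigned to one of the automatically discovered groups rather than to a hand-labeled category.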
