Automatické čištění HTML dokumentů

Marek, Michal

Automatické čištění HTML dokumentů

dc.contributor.advisor	Pecina, Pavel
dc.creator	Marek, Michal
dc.date.accessioned	2017-04-06T10:42:11Z
dc.date.available	2017-04-06T10:42:11Z
dc.date.issued	2007
dc.identifier.uri	http://hdl.handle.net/20.500.11956/13024
dc.description.abstract	Tato práce popisuje systém pro automatické čištění HTML dokumentů, který byl použit při účasti Univerzity Karlovy v soutěži CLEAN- EVAL 2007. CLEANEVAL je sdílená úloha (shared task) a soutěž automatických systémů pro čištění libovolných stránek s cílem použít webová data jako korpus v počítačové lingvistice a zpracování přirozeného jazyka. Tuto úlohu řešíme jako problém značkování sekvencí (sequence labeling) a náš experimentální systém je založen na algoritmu Conditional Random Fields, používajícím vlastnosti (features) bloků textu odvozené z textového obsahu a HTML struktury analyzovaných webových stránek.	cs_CZ
dc.description.abstract	This paper describes a system for automatic cleaning of HTML documents, which was used in the participation of the Charles University in CLEANEVAL 2007. CLEANEVAL is a shared task and competitive evaluation of automatic systems for cleaning arbitrary web pages with the goal of preparing web data for use as a corpus in the area of computational linguistics and natural language processing. We try to solve this task as a sequence-labeling problem and our experimental system is based on Conditional Random Fields exploiting a set of features extracted from textual content and HTML structure of analyzed web pages for each block of text.	en_US
dc.language	English	cs_CZ
dc.language.iso	en_US
dc.publisher	Univerzita Karlova, Matematicko-fyzikální fakulta	cs_CZ
dc.title	Automatické čištění HTML dokumentů	en_US
dc.type	bakalářská práce	cs_CZ
dcterms.created	2007
dcterms.dateAccepted	2007-09-11
dc.description.department	Ústav formální a aplikované lingvistiky	cs_CZ
dc.description.department	Institute of Formal and Applied Linguistics	en_US
dc.description.faculty	Faculty of Mathematics and Physics	en_US
dc.description.faculty	Matematicko-fyzikální fakulta	cs_CZ
dc.identifier.repId	45804
dc.title.translated	Automatické čištění HTML dokumentů	cs_CZ
dc.contributor.referee	Straňák, Pavel
dc.identifier.aleph	000831340
thesis.degree.name	Bc.
thesis.degree.level	bakalářské	cs_CZ
thesis.degree.discipline	Programování	cs_CZ
thesis.degree.discipline	Programming	en_US
thesis.degree.program	Computer Science	en_US
thesis.degree.program	Informatika	cs_CZ
uk.thesis.type	bakalářská práce	cs_CZ
uk.taxonomy.organization-cs	Matematicko-fyzikální fakulta::Ústav formální a aplikované lingvistiky	cs_CZ
uk.taxonomy.organization-en	Faculty of Mathematics and Physics::Institute of Formal and Applied Linguistics	en_US
uk.faculty-name.cs	Matematicko-fyzikální fakulta	cs_CZ
uk.faculty-name.en	Faculty of Mathematics and Physics	en_US
uk.faculty-abbr.cs	MFF	cs_CZ
uk.degree-discipline.cs	Programování	cs_CZ
uk.degree-discipline.en	Programming	en_US
uk.degree-program.cs	Informatika	cs_CZ
uk.degree-program.en	Computer Science	en_US
thesis.grade.cs	Výborně	cs_CZ
thesis.grade.en	Excellent	en_US
uk.abstract.cs	Tato práce popisuje systém pro automatické čištění HTML dokumentů, který byl použit při účasti Univerzity Karlovy v soutěži CLEAN- EVAL 2007. CLEANEVAL je sdílená úloha (shared task) a soutěž automatických systémů pro čištění libovolných stránek s cílem použít webová data jako korpus v počítačové lingvistice a zpracování přirozeného jazyka. Tuto úlohu řešíme jako problém značkování sekvencí (sequence labeling) a náš experimentální systém je založen na algoritmu Conditional Random Fields, používajícím vlastnosti (features) bloků textu odvozené z textového obsahu a HTML struktury analyzovaných webových stránek.	cs_CZ
uk.abstract.en	This paper describes a system for automatic cleaning of HTML documents, which was used in the participation of the Charles University in CLEANEVAL 2007. CLEANEVAL is a shared task and competitive evaluation of automatic systems for cleaning arbitrary web pages with the goal of preparing web data for use as a corpus in the area of computational linguistics and natural language processing. We try to solve this task as a sequence-labeling problem and our experimental system is based on Conditional Random Fields exploiting a set of features extracted from textual content and HTML structure of analyzed web pages for each block of text.	en_US
uk.file-availability	V
uk.publication.place	Praha	cs_CZ
uk.grantor	Univerzita Karlova, Matematicko-fyzikální fakulta, Ústav formální a aplikované lingvistiky	cs_CZ
dc.identifier.lisID	990008313400106986