Oprava gramatiky v češtině

Pechman, Petr

Czech Grammar Error Correction

dc.contributor.advisor	Straka, Milan
dc.creator	Pechman, Petr
dc.date.accessioned	2024-11-28T18:57:58Z
dc.date.available	2024-11-28T18:57:58Z
dc.date.issued	2024
dc.identifier.uri	http://hdl.handle.net/20.500.11956/190601
dc.description.abstract	Představujeme systém na opravu gramatických chyb v českém jazyce. Systém je založen na přístupu neuronového strojového překladu. Používáme architekturu Transformer, která je závislá na velkém množství anotovaných dat. Vzhledem k tomu, že pro většinu jazyků včetně češtiny není k dispozici dostatek anotovaných dat, volíme syntetické generování dat. Do syntetických chyb zavádíme jak chyby jednoduché, tak i složitější - typické české chyby. Pro usnadnění experimentování vyvíjíme systém schopný generovat data v reálném čase a rovnou na těchto datech trénovat model. Následně navrhujeme několik vylepšení, jako je převzorkování jazykových domén nebo výběr zdroje dat pro syntetické generování. Náš nejvýkonnější model dosahuje nejlepších výsledků v českém jazyce vůči modelům, které jsou srovnatelně velké. Implementace je zveřejněna na GitHub pod adresou: https://github.com/petrpechman/czech_gec/tree/MasterThesis_PechmanPetr_2024.	cs_CZ
dc.description.abstract	We present a grammatical error correction system for Czech. The system is based on the neural machine translation approach. We use the Transformer architecture, which depends on a large amount of annotated data. Because not enough annotated data is available for most languages, including Czech, we generate synthetic data with artificial errors. We introduce not only simple language-independent errors but also more complex errors typical of Czech. To facilitate quick experimentation, we develop a flexible training pipeline capable of real-time data generation. Subsequently, we evaluate the effect of several proposed improvements, such as oversampling of language domains and the choice of data source for synthetic generation. Our best-performing model achieves state-of-the-art results for Czech among models of comparable size. The implementation is released on GitHub at https://github.com/petrpechman/czech_gec/tree/MasterThesis_PechmanPetr_2024.	en_US
dc.language	Čeština	cs_CZ
dc.language.iso	cs_CZ
dc.publisher	Univerzita Karlova, Matematicko-fyzikální fakulta	cs_CZ
dc.subject	grammar error correction|GECCC|Czech	en_US
dc.subject	oprava gramatiky|GECCC|čeština	cs_CZ
dc.title	Oprava gramatiky v češtině	cs_CZ
dc.type	diplomová práce	cs_CZ
dcterms.created	2024
dcterms.dateAccepted	2024-06-10
dc.description.department	Institute of Formal and Applied Linguistics	en_US
dc.description.department	Ústav formální a aplikované lingvistiky	cs_CZ
dc.description.faculty	Matematicko-fyzikální fakulta	cs_CZ
dc.description.faculty	Faculty of Mathematics and Physics	en_US
dc.identifier.repId	254605
dc.title.translated	Czech Grammar Error Correction	en_US
dc.contributor.referee	Rosen, Alexandr
thesis.degree.name	Mgr.
thesis.degree.level	navazující magisterské	cs_CZ
thesis.degree.discipline	Computer Science - Artificial Intelligence	en_US
thesis.degree.discipline	Informatika - Umělá inteligence	cs_CZ
thesis.degree.program	Computer Science - Artificial Intelligence	en_US
thesis.degree.program	Informatika - Umělá inteligence	cs_CZ
uk.thesis.type	diplomová práce	cs_CZ
uk.taxonomy.organization-cs	Matematicko-fyzikální fakulta::Ústav formální a aplikované lingvistiky	cs_CZ
uk.taxonomy.organization-en	Faculty of Mathematics and Physics::Institute of Formal and Applied Linguistics	en_US
uk.faculty-name.cs	Matematicko-fyzikální fakulta	cs_CZ
uk.faculty-name.en	Faculty of Mathematics and Physics	en_US
uk.faculty-abbr.cs	MFF	cs_CZ
uk.degree-discipline.cs	Informatika - Umělá inteligence	cs_CZ
uk.degree-discipline.en	Computer Science - Artificial Intelligence	en_US
uk.degree-program.cs	Informatika - Umělá inteligence	cs_CZ
uk.degree-program.en	Computer Science - Artificial Intelligence	en_US
thesis.grade.cs	Výborně	cs_CZ
thesis.grade.en	Excellent	en_US
uk.abstract.cs	Představujeme systém na opravu gramatických chyb v českém jazyce. Systém je založen na přístupu neuronového strojového překladu. Používáme architekturu Transformer, která je závislá na velkém množství anotovaných dat. Vzhledem k tomu, že pro většinu jazyků včetně češtiny není k dispozici dostatek anotovaných dat, volíme syntetické generování dat. Do syntetických chyb zavádíme jak chyby jednoduché, tak i složitější - typické české chyby. Pro usnadnění experimentování vyvíjíme systém schopný generovat data v reálném čase a rovnou na těchto datech trénovat model. Následně navrhujeme několik vylepšení, jako je převzorkování jazykových domén nebo výběr zdroje dat pro syntetické generování. Náš nejvýkonnější model dosahuje nejlepších výsledků v českém jazyce vůči modelům, které jsou srovnatelně velké. Implementace je zveřejněna na GitHub pod adresou: https://github.com/petrpechman/czech_gec/tree/MasterThesis_PechmanPetr_2024.	cs_CZ
uk.abstract.en	We present a grammatical error correction system for Czech. The system is based on the neural machine translation approach. We use the Transformer architecture, which depends on a large amount of annotated data. Because not enough annotated data is available for most languages, including Czech, we generate synthetic data with artificial errors. We introduce not only simple language-independent errors but also more complex errors typical of Czech. To facilitate quick experimentation, we develop a flexible training pipeline capable of real-time data generation. Subsequently, we evaluate the effect of several proposed improvements, such as oversampling of language domains and the choice of data source for synthetic generation. Our best-performing model achieves state-of-the-art results for Czech among models of comparable size. The implementation is released on GitHub at https://github.com/petrpechman/czech_gec/tree/MasterThesis_PechmanPetr_2024.	en_US
uk.file-availability	V
uk.grantor	Univerzita Karlova, Matematicko-fyzikální fakulta, Ústav formální a aplikované lingvistiky	cs_CZ
thesis.grade.code	1
dc.contributor.consultant	Náplava, Jakub
uk.publication-place	Praha	cs_CZ
uk.thesis.defenceStatus	O