Corpora as Data Sources for Modifying Tools of Automatic Morphological Analysis (Word-Formation Variants of Adjectives in [(ou)|í]cí from the Perspective of Morphological Tagging)

Osolsobě, Klára; Čermák, Petr

Corpora as Data Sources for the Up-Grading of Morphological Tagging

Research article

Permanent link

http://hdl.handle.net/20.500.11956/96413

Collection

Issue 2 [8]

Authors

Osolsobě, Klára

Čermák, Petr

Date of publication

2015

Publisher

Univerzita Karlova, Filozofická fakulta
Praha

Source document

Časopis pro moderní filologii (Journal for Modern Philology) (web)
ISSN: 2336-6591
Year of issue: 2015
Volume: 2015
Issue: 2

Link to license terms

https://creativecommons.org/licenses/by-nc-nd/2.0/

Keywords (Czech)

verbální adjektivum, morfologické značkování, automatická morfologická analýza, varianta, slovotvorba

Keywords (English)

gerund/deverbal adjective, POS tagging, automatic morphological analysis, variant, derivational morphology

Adjectives ending in -oucí/-ící are regularly derived from verbs and hence are not usually listed in Czech monolingual dictionaries. In the dictionary used for automatic morphological analysis of Czech, they should be generated from verbal roots and tagged as verbal adjectives (POS tag: AG.*). Data from Czech corpora show (a) inconsistencies in tagging and (b) gaps in the dictionary. The main cause of both kinds of deficiency is the existence of variants among the verbal forms from which the verbal adjectives are potentially derived. Consequently, text corpora are a significant source of knowledge about the formation and use of adjectives ending in -oucí/-ící, which can be important for both (a) automatic morphological analysis of Czech and (b) the theoretical description of Czech grammar (derivational morphology). Our goal is to present a corpus-based study of the Czech gerund, i.e. verbal adjectives in -oucí/-ící. The link between the inflectional and the word-formation variants will be demonstrated using material from the SYN corpus (2.6 billion tokens of written Czech) and the large web corpus czTenTen12 (5.2 billion tokens of Czech text from the Internet, cleaned and deduplicated).
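The kind of tagging inconsistency the abstract describes can be pictured as a simple filter: collect tokens whose lemma ends in -oucí/-ící but whose tag does not match AG.*. The Python sketch below is only an illustration under stated assumptions. The `Token` structure, the function name `find_untagged_candidates`, and the three sample tokens with their tags are hypothetical and are not taken from the SYN or czTenTen12 corpora or from the article's method.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical container for one tagged corpus token."""
    word: str
    lemma: str
    tag: str


# Lemma ends in -oucí or -ící (the two endings named in the abstract).
ENDING = re.compile(r"(?:ou|í)cí$")
# Tag pattern AG.* from the abstract: the tag starts with "AG".
VERBAL_ADJ_TAG = re.compile(r"AG")


def find_untagged_candidates(tokens: List[Token]) -> List[Token]:
    """Return tokens whose lemma has the ending but whose tag is not AG.*."""
    return [
        t for t in tokens
        if ENDING.search(t.lemma) and not VERBAL_ADJ_TAG.match(t.tag)
    ]


# Illustrative data only; the tags are assumptions made for this example.
SAMPLE = [
    Token("kupující", "kupující", "NNMS1-----A----"),
    Token("vedoucí", "vedoucí", "AGFS1-----A----"),
    Token("domácí", "domácí", "AAFS1----1A----"),
]

if __name__ == "__main__":
    for token in find_untagged_candidates(SAMPLE):
        print(token.lemma, token.tag)
```

A hit from such a filter is only a candidate for manual inspection: a lemma in -oucí/-ící may legitimately carry another tag, for example when it is used as a noun.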
