dc.contributor.author | Giger, Markus | |
dc.contributor.author | Kocková, Jana | |
dc.date.accessioned | 2025-07-02T08:24:14Z | |
dc.date.available | 2025-07-02T08:24:14Z | |
dc.date.issued | 2025 | |
dc.identifier.issn | 2336-6591 | |
dc.identifier.uri | http://hdl.handle.net/20.500.11956/199840 | |
dc.language.iso | cs_CZ | cs |
dc.publisher | Univerzita Karlova, Filozofická fakulta | cs |
dc.subject | corpora | cs |
dc.subject | comparative linguistics | cs |
dc.subject | tagging | cs |
dc.subject | data comparability | cs |
dc.subject | corpus balance | cs |
dc.title | Pasti dat: srovnatelnost dat jazykových korpusů | cs |
dc.type | Vědecký článek | cs |
dcterms.accessRights | openAccess | |
dcterms.license | http://creativecommons.org/licenses/by-nc-nd/2.0/ | |
uk.abstract.cs | Although corpus data appear unambiguous, they reflect differences in corpus composition, in how the synchronic period of a given language is conceived, in linguistic traditions, in orthography, and in other factors. We focus on the most common factors affecting the comparability of data in parallel corpora, such as inconsistent lemmatization, tagging, and tokenization, and illustrate them with examples from Czech, German, and Russian. For example, corpus data on Russian and Czech verb forms and lemmas are not comparable: in Russian, unlike in Czech, reflexive and non-reflexive forms are assigned to different lemmas, and the verb lemma includes participles, whereas the corresponding Czech forms are tagged as adjectives, in accordance with Czech philological tradition. Differing approaches to tokenization also change the overall size of a corpus and thereby indirectly affect the comparability of relative frequencies. | cs |
dc.publisher.publicationPlace | Praha | cs |
uk.internal-type | uk_publication | |
dc.identifier.doi | https://doi.org/10.14712/23366591.2025.1.1 | |
dc.description.startPage | 7 | cs |
dc.description.endPage | 18 | cs |
dcterms.isPartOf.name | Časopis pro moderní filologii | cs |
dcterms.isPartOf.journalYear | 2025 | |
dcterms.isPartOf.journalVolume | 2025 | |
dcterms.isPartOf.journalIssue | 1 | |
dcterms.isPartOf.issn | 2336-6591 | |
dc.relation.isPartOfUrl | https://casopispromodernifilologii.ff.cuni.cz | |