Identification of typical features of machine translation

Glazyrina, Natalia

Identifikace typických rysů strojového překladu

diploma thesis (DEFENDED)

Permanent link

http://hdl.handle.net/20.500.11956/199274

Identifiers

Study Information System: 253036

Referee

Popel, Martin

Faculty / Institute

Faculty of Mathematics and Physics

Discipline

Computer Science - Language Technologies and Computational Linguistics

Department

Institute of Formal and Applied Linguistics

Date of defense

3 June 2025

Publisher

Univerzita Karlova, Matematicko-fyzikální fakulta

Language

English

Grade

Very good

Keywords (English)

machine translation, neural networks, deep learning, machine learning, natural language processing

Abstract (English)

Modern machine translation (MT) systems have advanced to the point where their outputs are often indistinguishable from human translations. Even so, classifiers can be trained to tell human translations from machine-generated ones. In this work I develop a classifier based on the pre-trained multilingual language model XLM-R that distinguishes human from machine translations, and I benchmark its performance against a baseline classifier. The training data come from the WMT conference datasets from 2020 to 2022. To interpret the trained classifier, I apply an axiomatic attribution method. I annotate words in the test sentences with linguistic features, such as Universal Part-of-Speech (UPOS) tags, grammatical case, and syntactic dependency relations, and analyze correlations between word-level attribution scores and these features.
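
The last step of the abstract, correlating word-level attribution scores with linguistic features, can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes each feature is reduced to a binary indicator (for example "the word's UPOS tag is NOUN") and uses the point-biserial correlation, which is one possible choice of statistic. The function names, the example sentence, and the attribution scores are all hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev
from typing import Optional


def point_biserial(scores: list[float], flags: list[int]) -> Optional[float]:
    """Correlation between real-valued scores and a 0/1 indicator.

    Returns None when the correlation is undefined (all flags equal,
    or the scores have zero variance).
    """
    if len(scores) != len(flags):
        raise ValueError("scores and flags must have the same length")
    with_feature = [s for s, f in zip(scores, flags) if f]
    without_feature = [s for s, f in zip(scores, flags) if not f]
    spread = pstdev(scores) if scores else 0.0
    if not with_feature or not without_feature or spread == 0:
        return None
    share = len(with_feature) / len(scores)
    return (mean(with_feature) - mean(without_feature)) / spread * sqrt(share * (1 - share))


def correlations_by_tag(words: list[tuple[str, str, float]]) -> dict[str, Optional[float]]:
    """Map each UPOS tag to the correlation of 'word has this tag' with the score."""
    scores = [score for _, _, score in words]
    tags = sorted({tag for _, tag, _ in words})
    return {
        tag: point_biserial(scores, [int(t == tag) for _, t, _ in words])
        for tag in tags
    }


# Hypothetical (token, UPOS tag, attribution score) triples, invented for illustration.
example_words = [
    ("The", "DET", 0.01),
    ("cat", "NOUN", 0.40),
    ("sat", "VERB", 0.10),
    ("on", "ADP", 0.02),
    ("the", "DET", 0.03),
    ("mat", "NOUN", 0.35),
]

if __name__ == "__main__":
    for tag, r in correlations_by_tag(example_words).items():
        print(tag, r)
```

In this toy input the nouns carry the largest scores, so the NOUN indicator correlates positively with attribution. With real data the same loop could be repeated for grammatical case or dependency relations, each treated as its own indicator.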
