Information Extraction from structured business documents by learning from similarity

Holeček, Martin

Extrakce informací ze strukturovaných dokumentů pomocí metod učení a podobnosti

dissertation thesis (DEFENDED)


Permanent link

http://hdl.handle.net/20.500.11956/188250

Identifiers

SIS: 249706

Thesis opponents

Liwicki, Marcus

Mesiti, Marco

Faculty / unit

Faculty of Mathematics and Physics

Field of study

Numerical and Computational Mathematics

Department / institute / clinic

Department of Numerical Mathematics

Date of defence

21 September 2023

Publisher

Charles University, Faculty of Mathematics and Physics

Language

English

Grade

Pass

Keywords (Czech)

one-shot learning|information extraction|siamese networks|similarity|table detection

Keywords (English)

one-shot learning|information extraction|siamese networks|similarity|table detection

Document processing automation has recently been attracting attention because of its great potential to ease manual work through improved computational methods and hardware. Neural networks have been applied in this area before, although so far they have been trained only on relatively small datasets of hundreds of documents. To make it possible to explore deep learning techniques and improve information extraction results, a dataset of more than twenty-five thousand documents (pro forma invoices, invoices and debit notes) was compiled, anonymized and published. In the first part of the research, we examine the documents from the point of view of table detection, present a survey of table detection methods, and finally reformulate table detection as a text box labelling problem in order to optimize the micro F1 score on individual words. We show that specific information can be extracted from structurally different tables or table-like structures with a single trainable model that builds a comprehensive representation of the page using a graph of words, positional embeddings and trainable word embeddings. The first part is successfully solved by a new neural network architecture that achieves high accuracy on the examined dataset. Further, an analysis of the model's performance is presented, and it is verified that convolutions, graph convolutions and...

Abstract (English)

The automation of document processing has recently been gaining attention because of its great potential to reduce manual work through improved methods and hardware. Neural networks have been applied in this area before, although so far they have been trained only on relatively small datasets of hundreds of documents. To successfully explore deep learning techniques and improve information extraction results, a dataset of more than twenty-five thousand documents (pro forma invoices, invoices and debit notes) has been compiled, anonymized and published as a complement to this work. In the first part of the research, we examine the documents from the point of view of table detection, present a survey of table detection methods, and ultimately rephrase table detection as a text box labelling problem in order to optimize the micro F1 score of per-word classification. We show that specific information can be extracted from structurally different tables or table-like structures with one trainable model that features a comprehensive representation of a page using a graph over word-boxes, positional embeddings and trainable textual features. The first part concludes with a novel neural network model that beats multiple baselines and achieves strong, practical results on the presented dataset....
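
To make the evaluation target named in the abstract concrete, the following is a minimal sketch of a micro-averaged F1 score for per-word binary labels. It is an illustration only, not the thesis's evaluation code: the function name `micro_f1`, the binary "word belongs to a table" labelling, and the toy pages are all assumptions made here, and the thesis may use a multi-class setup that this record does not describe.

```python
from typing import Sequence


def micro_f1(gold: Sequence[Sequence[int]], pred: Sequence[Sequence[int]]) -> float:
    """Micro-averaged F1 over per-word binary labels, pooled across all pages.

    gold[p][w] and pred[p][w] are 1 if word w on page p carries the label
    (hypothetically: "this word-box belongs to a table"), else 0.
    """
    tp = fp = fn = 0
    for gold_page, pred_page in zip(gold, pred):
        if len(gold_page) != len(pred_page):
            raise ValueError("each page needs exactly one prediction per word")
        for g, p in zip(gold_page, pred_page):
            if p == 1 and g == 1:
                tp += 1  # word correctly labelled
            elif p == 1 and g == 0:
                fp += 1  # word labelled but should not be
            elif p == 0 and g == 1:
                fn += 1  # word missed
    # Counts are pooled over all words before computing F1 (the "micro" part),
    # so pages with many words weigh more than pages with few.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Invented toy data: two pages with four and three word-boxes.
gold = [[1, 1, 0, 0], [0, 1, 1]]
pred = [[1, 0, 0, 1], [0, 1, 1]]
score = micro_f1(gold, pred)
```

Pooling the counts over every word, rather than averaging a per-page score, is what makes the metric "micro": a long invoice with hundreds of word-boxes influences the result more than a short one.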
