Information Extraction from structured business documents by learning from similarity

Holeček, Martin

Extrakce informací ze strukturovaných dokumentů pomocí metod učení a podobnosti

dc.contributor.advisor	Maršík, František
dc.creator	Holeček, Martin
dc.date.accessioned	2024-04-08T08:35:25Z
dc.date.available	2024-04-08T08:35:25Z
dc.date.issued	2023
dc.identifier.uri	http://hdl.handle.net/20.500.11956/188250
dc.description.abstract	The automation of document processing is gaining recent attention due to the great potential to reduce manual work through improved methods and hardware. In this area, neural networks have been applied before - even though they have been trained only on relatively small datasets with hundreds of documents so far. To successfully explore deep learning techniques and improve the information extraction results, a dataset with more than twenty-five thousand documents (pro forma invoices, invoices and debit note documents) has been compiled, anonymized and is published as a complement of this work. In the first part of the research, we will examine the documents from the point of view of table detection, present a survey on table detection methods and ultimately rephrase the table detection as a text box labelling problem to optimize micro F1 score of per-word classification. We will show that we can extract specific information from structurally different tables or table-like structures with one trainable model that features a comprehensive representation of a page using graph over word-boxes, positional embeddings and trainable textual features. The first part is concluded with a novel neural network model that beats multiple baselines and achieves strong, practical results on the presented dataset....	en_US
dc.description.abstract	Automatizace zpracování dokumentů si v poslední době získává pozornost kvůli velkému potenciálu usnadnění manuální práce prostřednictvím vylepšených výpočetních metod a hardwaru. V této oblasti se neuronové sítě uplatňovaly již dříve - i když byly dosud trénovány pouze na relativně malých datasetech se stovkami dokumentů. Aby bylo možné úspěšně prozkoumat techniky hlubokého učení a zlepšit výsledky extrakce informací, byl sestaven, anonymizován a publikován dataset s více než dvaceti pěti tisíci dokumenty (proforma fakturami, fakturami a vrubopisy). V první části výzkumu prozkoumáme dokumenty z hlediska detekce tabulek, představíme přehled metod detekce tabulek a nakonec přeformulujeme detekci tabulek jako problém označování textových polí, abychom optimalizovali mikro F1 skóre na jednotlivých slovech. Ukážeme, že můžeme extrahovat specifické informace ze strukturálně odlišných tabulek nebo struktur podobných tabulkám pomocí jednoho trénovatelného modelu, který představuje komplexní reprezentaci stránky pomocí grafu slov, pozičního embeddingu a trénovatelného embeddingu slov. První část je úspěšně vyřešena novou architekturou neuronové sítě, která dosahuje vysoké úspěšnosti na zkoumaném datasetu. Dále je prezentována analýza výkonnosti modelu a je ověřeno, že konvoluce, grafové konvoluce a...	cs_CZ
dc.language	English	cs_CZ
dc.language.iso	en_US
dc.publisher	Univerzita Karlova, Matematicko-fyzikální fakulta	cs_CZ
dc.subject	one-shot learning|information extraction|siamese networks|similarity|table detection	cs_CZ
dc.subject	one-shot learning|information extraction|siamese networks|similarity|table detection	en_US
dc.title	Information Extraction from structured business documents by learning from similarity	en_US
dc.type	dizertační práce	cs_CZ
dcterms.created	2023
dcterms.dateAccepted	2023-09-21
dc.description.department	Department of Numerical Mathematics	en_US
dc.description.department	Katedra numerické matematiky	cs_CZ
dc.description.faculty	Faculty of Mathematics and Physics	en_US
dc.description.faculty	Matematicko-fyzikální fakulta	cs_CZ
dc.identifier.repId	249706
dc.title.translated	Extrakce informací ze strukturovaných dokumentů pomocí metod učení a podobnosti	cs_CZ
dc.contributor.referee	Liwicki, Marcus
dc.contributor.referee	Mesiti, Marco
thesis.degree.name	Ph.D.
thesis.degree.level	doktorské	cs_CZ
thesis.degree.discipline	Computational mathematics	en_US
thesis.degree.discipline	Numerická a výpočtová matematika	cs_CZ
thesis.degree.program	Computational mathematics	en_US
thesis.degree.program	Numerická a výpočtová matematika	cs_CZ
uk.thesis.type	dizertační práce	cs_CZ
uk.taxonomy.organization-cs	Matematicko-fyzikální fakulta::Katedra numerické matematiky	cs_CZ
uk.taxonomy.organization-en	Faculty of Mathematics and Physics::Department of Numerical Mathematics	en_US
uk.faculty-name.cs	Matematicko-fyzikální fakulta	cs_CZ
uk.faculty-name.en	Faculty of Mathematics and Physics	en_US
uk.faculty-abbr.cs	MFF	cs_CZ
uk.degree-discipline.cs	Numerická a výpočtová matematika	cs_CZ
uk.degree-discipline.en	Computational mathematics	en_US
uk.degree-program.cs	Numerická a výpočtová matematika	cs_CZ
uk.degree-program.en	Computational mathematics	en_US
thesis.grade.cs	Prospěl/a	cs_CZ
thesis.grade.en	Pass	en_US
uk.abstract.cs	Automatizace zpracování dokumentů si v poslední době získává pozornost kvůli velkému potenciálu usnadnění manuální práce prostřednictvím vylepšených výpočetních metod a hardwaru. V této oblasti se neuronové sítě uplatňovaly již dříve - i když byly dosud trénovány pouze na relativně malých datasetech se stovkami dokumentů. Aby bylo možné úspěšně prozkoumat techniky hlubokého učení a zlepšit výsledky extrakce informací, byl sestaven, anonymizován a publikován dataset s více než dvaceti pěti tisíci dokumenty (proforma fakturami, fakturami a vrubopisy). V první části výzkumu prozkoumáme dokumenty z hlediska detekce tabulek, představíme přehled metod detekce tabulek a nakonec přeformulujeme detekci tabulek jako problém označování textových polí, abychom optimalizovali mikro F1 skóre na jednotlivých slovech. Ukážeme, že můžeme extrahovat specifické informace ze strukturálně odlišných tabulek nebo struktur podobných tabulkám pomocí jednoho trénovatelného modelu, který představuje komplexní reprezentaci stránky pomocí grafu slov, pozičního embeddingu a trénovatelného embeddingu slov. První část je úspěšně vyřešena novou architekturou neuronové sítě, která dosahuje vysoké úspěšnosti na zkoumaném datasetu. Dále je prezentována analýza výkonnosti modelu a je ověřeno, že konvoluce, grafové konvoluce a...	cs_CZ
uk.abstract.en	The automation of document processing is gaining recent attention due to the great potential to reduce manual work through improved methods and hardware. In this area, neural networks have been applied before - even though they have been trained only on relatively small datasets with hundreds of documents so far. To successfully explore deep learning techniques and improve the information extraction results, a dataset with more than twenty-five thousand documents (pro forma invoices, invoices and debit note documents) has been compiled, anonymized and is published as a complement of this work. In the first part of the research, we will examine the documents from the point of view of table detection, present a survey on table detection methods and ultimately rephrase the table detection as a text box labelling problem to optimize micro F1 score of per-word classification. We will show that we can extract specific information from structurally different tables or table-like structures with one trainable model that features a comprehensive representation of a page using graph over word-boxes, positional embeddings and trainable textual features. The first part is concluded with a novel neural network model that beats multiple baselines and achieves strong, practical results on the presented dataset....	en_US
uk.file-availability	V
uk.grantor	Univerzita Karlova, Matematicko-fyzikální fakulta, Katedra numerické matematiky	cs_CZ
thesis.grade.code	P
uk.publication-place	Praha	cs_CZ
uk.thesis.defenceStatus	O