Data Lineage Analysis for PySpark and Python ORM Libraries

Jurčo, Andrej

Analýza datových toků pro PySpark a ORM knihovny jazyka Python

dc.contributor.advisor	Parízek, Pavel
dc.creator	Jurčo, Andrej
dc.date.accessioned	2023-07-24T23:27:05Z
dc.date.available	2023-07-24T23:27:05Z
dc.date.issued	2023
dc.identifier.uri	http://hdl.handle.net/20.500.11956/181592
dc.description.abstract	In the world of ETL tools and data processing, Python is one of the main languages used in practice. Python scripts that define data manipulations usually use the same Python framework, PySpark, which is the Python API for the Spark framework, alongside database libraries, using their ORM features. These ORM features usually work in a similar way in most of the relevant libraries. Recently, MANTA Flow, a highly automated data lineage analysis tool, was extended with a Python language scanner, and it is now being extended to support more commonly used frameworks. In this work, we analyzed the PySpark library and the SQLAlchemy ORM technology in order to extend MANTA's Python scanner with support for these two frequently used tools. In the case of the PySpark library, we designed and implemented the core of a plugin for the Python scanner that supports elementary functionality. The plugin is capable of analyzing various DataFrame input and output options available in PySpark for both file and database data sources, and it is able to propagate data flows during transformations with a reasonable level of overapproximation, as demonstrated in the work. In the case of the SQLAlchemy ORM, we designed a solution that would allow the scanner to analyze the ORM source code and its core could be used to...	en_US
dc.description.abstract	Vo svete ETL nástrojov a spracovania dát je Python jedným z najčastejšie používaných jazykov. Skripty napísané v jazyku Python, ktoré definujú manipuláciu s dátami, zvyčajne používajú rovnakú knižnicu, PySpark, čo je Python API pre framework Spark, spoločne s databázovými knižnicami, využívajúc ich ORM funkcionalitu. Táto funkcionalita zvyčajne funguje podobným spôsobom vo väčšine relevantných knižníc. Nedávno bol MANTA Flow, vysoko automatizovaný nástroj na analýzu data lineage, rozšírený o skener jazyka Python a teraz je vo fáze rozširovania o podporu bežných frameworkov. V tejto práci sme analyzovali knižnicu PySpark a technológiu SQLAlchemy ORM s cieľom rozšíriť Python skener firmy MANTA o podporu týchto dvoch často používaných nástrojov. V prípade knižnice PySpark sme navrhli a implementovali jadro pluginu pre skener jazyka Python, ktorý podporuje elementárnu funkcionalitu. Plugin je schopný analyzovať rôzne vstupné a výstupné možnosti DataFramov dostupné v PySparku pre súborové aj databázové dátové zdroje a je schopný propagácie dátových tokov počas transformácií s primeranou úrovňou overaproximácie, ako sme v práci demonštrovali. V prípade SQLAlchemy ORM sme navrhli riešenie, ktoré umožní skeneru analyzovať zdrojový kód využívajúci funkcionalitu ORM a jeho jadro by bolo možné použiť aj pre...	cs_CZ
dc.language	English	cs_CZ
dc.language.iso	en_US
dc.publisher	Univerzita Karlova, Matematicko-fyzikální fakulta	cs_CZ
dc.subject	data lineage|data flow|python|symbolic analysis	en_US
dc.subject	data lineage|python|symbolická analýza|dátové toky	cs_CZ
dc.title	Data Lineage Analysis for PySpark and Python ORM Libraries	en_US
dc.type	diplomová práce	cs_CZ
dcterms.created	2023
dcterms.dateAccepted	2023-06-06
dc.description.department	Katedra distribuovaných a spolehlivých systémů	cs_CZ
dc.description.department	Department of Distributed and Dependable Systems	en_US
dc.description.faculty	Faculty of Mathematics and Physics	en_US
dc.description.faculty	Matematicko-fyzikální fakulta	cs_CZ
dc.identifier.repId	247480
dc.title.translated	Analýza datových toků pro PySpark a ORM knihovny jazyka Python	cs_CZ
dc.contributor.referee	Škoda, Petr
thesis.degree.name	Mgr.
thesis.degree.level	navazující magisterské	cs_CZ
thesis.degree.discipline	Informatika - Softwarové a datové inženýrství	cs_CZ
thesis.degree.discipline	Computer Science - Software and Data Engineering	en_US
thesis.degree.program	Informatika - Softwarové a datové inženýrství	cs_CZ
thesis.degree.program	Computer Science - Software and Data Engineering	en_US
uk.thesis.type	diplomová práce	cs_CZ
uk.taxonomy.organization-cs	Matematicko-fyzikální fakulta::Katedra distribuovaných a spolehlivých systémů	cs_CZ
uk.taxonomy.organization-en	Faculty of Mathematics and Physics::Department of Distributed and Dependable Systems	en_US
uk.faculty-name.cs	Matematicko-fyzikální fakulta	cs_CZ
uk.faculty-name.en	Faculty of Mathematics and Physics	en_US
uk.faculty-abbr.cs	MFF	cs_CZ
uk.degree-discipline.cs	Informatika - Softwarové a datové inženýrství	cs_CZ
uk.degree-discipline.en	Computer Science - Software and Data Engineering	en_US
uk.degree-program.cs	Informatika - Softwarové a datové inženýrství	cs_CZ
uk.degree-program.en	Computer Science - Software and Data Engineering	en_US
thesis.grade.cs	Výborně	cs_CZ
thesis.grade.en	Excellent	en_US
uk.abstract.cs	Vo svete ETL nástrojov a spracovania dát je Python jedným z najčastejšie používaných jazykov. Skripty napísané v jazyku Python, ktoré definujú manipuláciu s dátami, zvyčajne používajú rovnakú knižnicu, PySpark, čo je Python API pre framework Spark, spoločne s databázovými knižnicami, využívajúc ich ORM funkcionalitu. Táto funkcionalita zvyčajne funguje podobným spôsobom vo väčšine relevantných knižníc. Nedávno bol MANTA Flow, vysoko automatizovaný nástroj na analýzu data lineage, rozšírený o skener jazyka Python a teraz je vo fáze rozširovania o podporu bežných frameworkov. V tejto práci sme analyzovali knižnicu PySpark a technológiu SQLAlchemy ORM s cieľom rozšíriť Python skener firmy MANTA o podporu týchto dvoch často používaných nástrojov. V prípade knižnice PySpark sme navrhli a implementovali jadro pluginu pre skener jazyka Python, ktorý podporuje elementárnu funkcionalitu. Plugin je schopný analyzovať rôzne vstupné a výstupné možnosti DataFramov dostupné v PySparku pre súborové aj databázové dátové zdroje a je schopný propagácie dátových tokov počas transformácií s primeranou úrovňou overaproximácie, ako sme v práci demonštrovali. V prípade SQLAlchemy ORM sme navrhli riešenie, ktoré umožní skeneru analyzovať zdrojový kód využívajúci funkcionalitu ORM a jeho jadro by bolo možné použiť aj pre...	cs_CZ
uk.abstract.en	In the world of ETL tools and data processing, Python is one of the main languages used in practice. Python scripts that define data manipulations usually use the same Python framework, PySpark, which is the Python API for the Spark framework, alongside database libraries, using their ORM features. These ORM features usually work in a similar way in most of the relevant libraries. Recently, MANTA Flow, a highly automated data lineage analysis tool, was extended with a Python language scanner, and it is now being extended to support more commonly used frameworks. In this work, we analyzed the PySpark library and the SQLAlchemy ORM technology in order to extend MANTA's Python scanner with support for these two frequently used tools. In the case of the PySpark library, we designed and implemented the core of a plugin for the Python scanner that supports elementary functionality. The plugin is capable of analyzing various DataFrame input and output options available in PySpark for both file and database data sources, and it is able to propagate data flows during transformations with a reasonable level of overapproximation, as demonstrated in the work. In the case of the SQLAlchemy ORM, we designed a solution that would allow the scanner to analyze the ORM source code and its core could be used to...	en_US
uk.file-availability	V
uk.grantor	Univerzita Karlova, Matematicko-fyzikální fakulta, Katedra distribuovaných a spolehlivých systémů	cs_CZ
thesis.grade.code	1
uk.publication-place	Praha	cs_CZ
uk.thesis.defenceStatus	O