Proudové algoritmy pro Lp vzorkování velkých dat

Adámek, Jan

Streaming Algorithms for Lp Sampling from Large Datasets

dc.contributor.advisor	Veselý, Pavel
dc.creator	Adámek, Jan
dc.date.accessioned	2024-11-28T20:19:45Z
dc.date.available	2024-11-28T20:19:45Z
dc.date.issued	2024
dc.identifier.uri	http://hdl.handle.net/20.500.11956/193058
dc.description.abstract	Rozsáhlé výpočty často vyžadují práci s daty daleko většími, než kolik máme k dispozici paměti. To vytváří potřebu umět shrnout velká data v malém prostoru. Jeden z možných postupů je Lp vzorkování. Jeho cílem je z proudu dat budujícího vektor frekvencí náhodně vybrat vzorek indexu s pravděpodobností úměrnou p-té mocnině jeho frekvence. V této práci popíšeme hlavní existující algoritmy pro Lp vzorkování s p = 0 a p = 2. Při tom představíme drobné vylepšení algoritmu pro Distinct sampling a doplníme odhad frekvence pro algoritmus Truly perfect sampler. Poté tyto algoritmy implementujeme a experimentálně vyhodnotíme jejich efektivitu.	cs_CZ
dc.description.abstract	Large-scale computations often require working with datasets far larger than the available memory, which creates the need to summarise large data in small space. One possible technique is Lp sampling: given a data stream that defines a vector of frequencies, it randomly samples an index with probability proportional to the p-th power of its frequency. In this work we describe the main existing algorithms for Lp sampling with p = 0 and p = 2. Along the way we introduce a slight algorithmic improvement for Distinct Sampling and extend the Truly Perfect Sampler algorithm with frequency estimation. We then implement these algorithms and experimentally evaluate their efficiency.	en_US
dc.language	Čeština	cs_CZ
dc.language.iso	cs_CZ
dc.publisher	Univerzita Karlova, Matematicko-fyzikální fakulta	cs_CZ
dc.subject	sampling|linear sketching|streaming algorithms|data summaries|precision sampling algorithm|distinct sampling	en_US
dc.subject	vzorkování|lineární sketching|proudové algoritmy|souhrny dat|algoritmus precision sampling|vzorkování nezávislé na frekvenci	cs_CZ
dc.title	Proudové algoritmy pro Lp vzorkování velkých dat	cs_CZ
dc.type	bakalářská práce	cs_CZ
dcterms.created	2024
dcterms.dateAccepted	2024-09-05
dc.description.department	Computer Science Institute of Charles University	en_US
dc.description.department	Informatický ústav Univerzity Karlovy	cs_CZ
dc.description.faculty	Matematicko-fyzikální fakulta	cs_CZ
dc.description.faculty	Faculty of Mathematics and Physics	en_US
dc.identifier.repId	270271
dc.title.translated	Streaming Algorithms for Lp Sampling from Large Datasets	en_US
dc.contributor.referee	Vu, Tung Anh
thesis.degree.name	Bc.
thesis.degree.level	bakalářské	cs_CZ
thesis.degree.discipline	Computer Science with specialisation in Programming and Software Development	en_US
thesis.degree.discipline	Informatika se specializací Programování a vývoj software	cs_CZ
thesis.degree.program	Computer Science	en_US
thesis.degree.program	Informatika	cs_CZ
uk.thesis.type	bakalářská práce	cs_CZ
uk.taxonomy.organization-cs	Matematicko-fyzikální fakulta::Informatický ústav Univerzity Karlovy	cs_CZ
uk.taxonomy.organization-en	Faculty of Mathematics and Physics::Computer Science Institute of Charles University	en_US
uk.faculty-name.cs	Matematicko-fyzikální fakulta	cs_CZ
uk.faculty-name.en	Faculty of Mathematics and Physics	en_US
uk.faculty-abbr.cs	MFF	cs_CZ
uk.degree-discipline.cs	Informatika se specializací Programování a vývoj software	cs_CZ
uk.degree-discipline.en	Computer Science with specialisation in Programming and Software Development	en_US
uk.degree-program.cs	Informatika	cs_CZ
uk.degree-program.en	Computer Science	en_US
thesis.grade.cs	Velmi dobře	cs_CZ
thesis.grade.en	Very good	en_US
uk.abstract.cs	Rozsáhlé výpočty často vyžadují práci s daty daleko většími, než kolik máme k dispozici paměti. To vytváří potřebu umět shrnout velká data v malém prostoru. Jeden z možných postupů je Lp vzorkování. Jeho cílem je z proudu dat budujícího vektor frekvencí náhodně vybrat vzorek indexu s pravděpodobností úměrnou p-té mocnině jeho frekvence. V této práci popíšeme hlavní existující algoritmy pro Lp vzorkování s p = 0 a p = 2. Při tom představíme drobné vylepšení algoritmu pro Distinct sampling a doplníme odhad frekvence pro algoritmus Truly perfect sampler. Poté tyto algoritmy implementujeme a experimentálně vyhodnotíme jejich efektivitu.	cs_CZ
uk.abstract.en	Large-scale computations often require working with datasets far larger than the available memory, which creates the need to summarise large data in small space. One possible technique is Lp sampling: given a data stream that defines a vector of frequencies, it randomly samples an index with probability proportional to the p-th power of its frequency. In this work we describe the main existing algorithms for Lp sampling with p = 0 and p = 2. Along the way we introduce a slight algorithmic improvement for Distinct Sampling and extend the Truly Perfect Sampler algorithm with frequency estimation. We then implement these algorithms and experimentally evaluate their efficiency.	en_US
uk.file-availability	V
uk.grantor	Univerzita Karlova, Matematicko-fyzikální fakulta, Informatický ústav Univerzity Karlovy	cs_CZ
thesis.grade.code	2
uk.publication-place	Praha	cs_CZ
uk.thesis.defenceStatus	O
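
The abstract defines the target of Lp sampling: an index is returned with probability proportional to the p-th power of its frequency. The sketch below only illustrates that definition with an exact, offline reference that stores the whole frequency vector. It is not one of the small-space streaming algorithms studied in the thesis. The function names, the `(index, delta)` update format, the use of absolute values, and the convention that zero-frequency indices are never sampled are assumptions made for illustration.

```python
import random
from collections import Counter
from fractions import Fraction


def frequency_vector(stream):
    """Accumulate (index, delta) updates into a frequency vector."""
    freq = Counter()
    for index, delta in stream:
        freq[index] += delta
    return freq


def lp_distribution(stream, p):
    """Exact Lp sampling distribution: P[i] = |f_i|^p / sum_j |f_j|^p.

    Assumption: p is a non-negative integer and indices whose final
    frequency is zero are excluded, so p = 0 gives the uniform
    distribution over indices with non-zero frequency.
    """
    freq = frequency_vector(stream)
    weights = {i: Fraction(abs(f)) ** p for i, f in freq.items() if f != 0}
    if not weights:
        raise ValueError("frequency vector is all zeros; nothing to sample")
    total = sum(weights.values())
    return {i: w / total for i, w in weights.items()}


def lp_sample(stream, p, rng=random):
    """Draw one index from the exact Lp distribution (offline reference)."""
    dist = lp_distribution(stream, p)
    indices = list(dist)
    return rng.choices(indices, weights=[float(dist[i]) for i in indices])[0]


# Hypothetical stream: "c" is inserted and then fully deleted.
example_stream = [("a", 1), ("b", 3), ("a", 1), ("c", 2), ("c", -2)]
l0 = lp_distribution(example_stream, 0)  # uniform over {"a", "b"}
l2 = lp_distribution(example_stream, 2)  # weights 2^2 and 3^2
```

With final frequencies a = 2, b = 3, c = 0, the L0 distribution assigns 1/2 to each of `a` and `b`, while the L2 distribution assigns 4/13 to `a` and 9/13 to `b`. A streaming sampler has to approximate these distributions without storing the full vector.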