Parallel Algorithms for Clustering High-Dimensional Data

Skýpala, Daniel

Paralelní algoritmy pro shlukování dat ve vysoké dimenzi

bachelor thesis (DEFENDED)

View/Open

Záznam o průběhu obhajoby (347.3Kb)

Permanent link

http://hdl.handle.net/20.500.11956/200783

Identifiers

Study Information System: 273460

Referee

Rozhoň, Václav

Faculty / Institute

Faculty of Mathematics and Physics

Discipline

Computer Science with specialisation in Foundations of Computer Science

Department

Computer Science Institute of Charles University

Date of defense

20. 6. 2025

Publisher

Univerzita Karlova, Matematicko-fyzikální fakulta

Language

English

Grade

Excellent

Keywords (Czech)

shlukování|masivně paralelní výpočty|paralelní algoritmy|k-medián|k-means

Keywords (English)

clustering|massively parallel computation|parallel algorithms|k-median|k-means

Shlukování je v Informatice široce studovaný problém s mnoha aplikacemi v růz- ných oblastech. Kvůli značné velikosti instancí vyžaduje shlukování efektivní paralelní algoritmy. V této práci se zaměřujeme na skupinu masivně paralelních algorithmů pro shlukávání postavených na konzistentním geometrickém hešování, kterou vyvinuli Czu- maj a další. Představujeme nové algoritmus pro rychlé konzistení hešování. Dále tyto algoritmy implementujeme a porovnáváme je se současnými protějšky. Pro problém Fa- cility Location naše algoritmy dosahují značného zrychlení nad předchozími algoritmy s drobnou penalizací v ceně řešení. Dále naše algoritmy srovnáváme s algoritmy pro k- medián a k-means z Pythoní knihovny scikit-learn. V případě k-mediánu naše algoritmy je překnovávají a jsou narozdíl od Pythoních schopné počítat velké vstupy. Široce po- užívány a značně zoptimalizovaný algoritmus k-means++, se kterým se srovnáváme, se ukázal jako rychlejší a vydal lepší výsledky.

Abstract (English)

Clustering is a widely studied problem in computer science with many applications across fields. Due to the large size of clustering instances, this problem calls for effi- cient parallel algorithms. In this thesis, we focus on a set of massively parallel clustering algorithms based on consistent geometric hashing designed by Czumaj et al. We intro- duce a new algorithm for fast consistent geometric hashing. Next, we implement these algorithms and compare them to current state-of-the-art algorithms. For Facility Loca- tion, our algorithms greatly outperform the speed of previous algorithms while producing solutions with small or no cost increase. For k-median and k-means, we compare our al- gorithms to Python's scikit-learn ones. For k-median, we provide solutions outclassing current ones and being able to compute larger inputs, unlike baselines. For k-means, scikit's widely used and highly optimised k-means++ proved better than our algorithms.

Citace dokumentu

Metadata

Show full item record