K-means web clustering amb Hadoop MapReduce

This paper proposes a solution to the problem of clustering large collections of web documents. The Hadoop framework, an implementation of the MapReduce distributed programming paradigm developed by Google, plays an important role in this field because of its scalability and the ease with which it parallelizes software, which is why it is used in this project. The K-means clustering algorithm, in turn, adapts easily to the MapReduce programming model and gives suitable results for web documents. The documents are represented as frequency vectors of terms and keywords, which is the input the algorithm needs. The developed software uses Hadoop to perform both tasks that make up the overall process: document processing and clustering. Web documents are written in HTML, which is not suitable for K-means, so they must first be preprocessed to extract descriptors that are then passed to the clustering algorithm. This is the first part of the process. The second part, K-means on Hadoop, goes beyond a typical Hadoop execution, using most of the tools that Hadoop provides to build document clusters from the descriptors obtained in the first part.
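The two stages described in the abstract can be illustrated with a minimal single-machine sketch. It is not the thesis software and does not use Hadoop: the names `TextExtractor`, `term_vector`, `map_phase`, `reduce_phase` and `kmeans_iteration`, the fixed vocabulary and the sample pages are all assumptions made for illustration. The sketch only shows how an HTML page might become a term-frequency vector, and how one K-means iteration splits into a map step (assign each vector to its nearest centroid) and a reduce step (average the vectors assigned to each centroid). The text extraction is deliberately naive and does not filter out script or style content.

```python
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def term_vector(html, vocabulary):
    """Turn one HTML page into a term-frequency vector over `vocabulary`."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.chunks).lower())
    counts = Counter(words)
    return [counts[term] for term in vocabulary]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_phase(vectors, centroids):
    """Map: emit (index of nearest centroid, vector) for every document."""
    for vec in vectors:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(vec, centroids[i]))
        yield nearest, vec


def reduce_phase(pairs, centroids):
    """Reduce: average the vectors grouped under each centroid index."""
    groups = defaultdict(list)
    for key, vec in pairs:
        groups[key].append(vec)
    updated = list(centroids)  # a centroid with no documents is left unchanged
    for key, members in groups.items():
        updated[key] = [sum(col) / len(members) for col in zip(*members)]
    return updated


def kmeans_iteration(vectors, centroids):
    """One full K-means step: map, then reduce."""
    return reduce_phase(map_phase(vectors, centroids), centroids)


# Hypothetical sample data, chosen only to make the example concrete.
vocabulary = ["hadoop", "cluster", "music"]
pages = [
    "<p>Hadoop cluster, Hadoop jobs</p>",
    "<p>A cluster of Hadoop nodes</p>",
    "<p>Music, music and more music</p>",
]
vectors = [term_vector(page, vocabulary) for page in pages]
centroids = kmeans_iteration(vectors, [vectors[0], vectors[2]])
```

In a real MapReduce job the grouping by centroid index would be done by the framework's shuffle phase between map and reduce, and the iteration would be repeated until the centroids stop moving; here both are collapsed into plain function calls to keep the sketch short.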

Description

Bachelor's thesis in Computer Engineering, Faculty of Mathematics, Universitat de Barcelona, Year: 2013, Advisor: Eloi Puertas i Prats

Subjects

Cluster analysis, Distributed processing in electronic data processing, Computer software, Bachelor's theses

Collections

Bachelor's Theses (TFG) - Computer Engineering
Software - Student works

Citation

HUÉLAMO SEGURA, Alberto. K-means web clustering amb Hadoop MapReduce. [accessed: 25 February 2026]. [Available at: https://hdl.handle.net/2445/47608]
