K-means web clustering amb Hadoop MapReduce

Huélamo Segura, Alberto

Please use this identifier to cite or link to this item: http://hdl.handle.net/2445/47608

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Puertas i Prats, Eloi	-
dc.contributor.author	Huélamo Segura, Alberto	-
dc.date.accessioned	2013-11-08T10:12:47Z	-
dc.date.available	2013-11-08T10:12:47Z	-
dc.date.issued	2013-06	-
dc.identifier.uri	http://hdl.handle.net/2445/47608	-
dc.description	Bachelor's Theses (Treballs Finals de Grau) in Computer Engineering, Facultat de Matemàtiques, Universitat de Barcelona, Year: 2013, Advisor: Eloi Puertas i Prats	ca
dc.description.abstract	This paper proposes a solution to the problem of clustering large numbers of web documents. The Hadoop framework, an implementation of the MapReduce distributed programming paradigm, which was developed by Google, plays a very important role in this field because of its scalability and the ease with which it parallelizes software. This is why it is used in this project. The K-means clustering algorithm, in turn, adapts easily to the MapReduce programming model and gives suitable results for web documents. Documents are represented as frequency vectors of terms and keywords, which is the input the algorithm needs. The developed software uses Hadoop to perform both tasks that make up the overall process: document processing and clustering. Web documents are in HTML, which is not suitable for K-means, so they must first be preprocessed to extract descriptors that are then passed to the clustering algorithm. This is the first part of the process. The second part, K-means on Hadoop, goes beyond a typical Hadoop execution, using most of the tools Hadoop provides to build document clusters from the descriptors obtained in the first part.	ca
dc.format.extent	70 p.	-
dc.format.mimetype	application/pdf	-
dc.language.iso	cat	ca
dc.rights	thesis report: cc-by-nc-sa (c) Alberto Huélamo Segura, 2013	-
dc.rights	code: GPL (c) Alberto Huélamo Segura, 2013	-
dc.rights.uri	http://creativecommons.org/licenses/by-sa/3.0/es	-
dc.rights.uri	http://www.gnu.org/licenses/gpl-3.0.ca.html	-
dc.source	Treballs Finals de Grau (TFG) - Enginyeria Informàtica	-
dc.subject.classification	Anàlisi de conglomerats	cat
dc.subject.classification	Processament distribuït de dades	cat
dc.subject.classification	Programari	cat
dc.subject.classification	Treballs de fi de grau	cat
dc.subject.other	Cluster analysis	eng
dc.subject.other	Distributed processing in electronic data processing	eng
dc.subject.other	Computer software	eng
dc.subject.other	Bachelor's theses	eng
dc.title	K-means web clustering amb Hadoop MapReduce	ca
dc.type	info:eu-repo/semantics/bachelorThesis	ca
dc.rights.accessRights	info:eu-repo/semantics/openAccess	ca
Appears in Collections:	Programari - Treballs de l'alumnat Treballs Finals de Grau (TFG) - Enginyeria Informàtica

Files in This Item:

File	Description	Size	Format
memoria.pdf	Thesis report	1.72 MB	Adobe PDF
src.zip	Source code	6.32 MB	zip

This item is licensed under a Creative Commons License