Please use this identifier to cite or link to this item: http://hdl.handle.net/2445/47608
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | Puertas i Prats, Eloi | -
dc.contributor.author | Huélamo Segura, Alberto | -
dc.date.accessioned | 2013-11-08T10:12:47Z | -
dc.date.available | 2013-11-08T10:12:47Z | -
dc.date.issued | 2013-06 | -
dc.identifier.uri | http://hdl.handle.net/2445/47608 | -
dc.description | Bachelor's degree final projects in Computer Engineering, Facultat de Matemàtiques, Universitat de Barcelona, Year: 2013, Advisor: Eloi Puertas i Prats | ca
dc.description.abstract | This paper proposes a solution to the problem of clustering large numbers of web documents. The Hadoop framework, an implementation of Google's MapReduce distributed programming paradigm, plays a very important role in this field because of its scalability and the ease with which it parallelizes software, which is why it is used in this project. The K-means clustering algorithm, in turn, adapts easily to the MapReduce programming model and gives suitable results for web documents. Documents are represented as frequency vectors of terms and keywords, which is the input the algorithm requires. The developed software uses Hadoop for both tasks that make up the overall process: document processing and clustering. Web documents are in HTML, which is not suitable for K-means, so they must be preprocessed to extract descriptors that are then passed to the clustering algorithm. This is the first part of the process. The second part, K-means on Hadoop, goes beyond a typical Hadoop execution, using most of the tools Hadoop provides to build document clusters from the descriptors obtained in the first part. | ca
dc.format.extent | 70 p. | -
dc.format.mimetype | application/pdf | -
dc.language.iso | cat | ca
dc.rights | report: cc-by-nc-sa (c) Alberto Huélamo Segura, 2013 | -
dc.rights | code: GPL (c) Alberto Huélamo Segura, 2013 | -
dc.rights.uri | http://creativecommons.org/licenses/by-sa/3.0/es | -
dc.rights.uri | http://www.gnu.org/licenses/gpl-3.0.ca.html | -
dc.source | Treballs Finals de Grau (TFG) - Enginyeria Informàtica | -
dc.subject.classification | Anàlisi de conglomerats | cat
dc.subject.classification | Processament distribuït de dades | cat
dc.subject.classification | Programari | cat
dc.subject.classification | Treballs de fi de grau | cat
dc.subject.other | Cluster analysis | eng
dc.subject.other | Distributed processing in electronic data processing | eng
dc.subject.other | Computer software | eng
dc.subject.other | Bachelor's theses | eng
dc.title | K-means web clustering amb Hadoop MapReduce | ca
dc.type | info:eu-repo/semantics/bachelorThesis | ca
dc.rights.accessRights | info:eu-repo/semantics/openAccess | ca
Appears in Collections: Programari - Treballs de l'alumnat
Treballs Finals de Grau (TFG) - Enginyeria Informàtica

Files in This Item:
File | Description | Size | Format
memoria.pdf | Report | 1.72 MB | Adobe PDF
src.zip | Source code | 6.32 MB | zip


This item is licensed under a Creative Commons License.
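
Illustrative sketch (not taken from the thesis or from src.zip): the abstract describes representing documents as term-frequency vectors and running K-means as a MapReduce job. The Python below simulates one plausible map/reduce split in a single process, assuming the HTML has already been reduced to plain text. All function names (`term_frequencies`, `map_phase`, `reduce_phase`, `kmeans`) are hypothetical, and no Hadoop API is used.

```python
from collections import Counter, defaultdict
import re


def term_frequencies(text):
    """Turn plain text into a sparse term-frequency vector (term -> count)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def squared_distance(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(t, 0) - b.get(t, 0)) ** 2 for t in set(a) | set(b))


def map_phase(vectors, centroids):
    """Map step: emit (index of nearest centroid, vector) for each document."""
    for vec in vectors:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(vec, centroids[i]))
        yield nearest, vec


def reduce_phase(pairs, centroids):
    """Reduce step: average the vectors assigned to each centroid."""
    groups = defaultdict(list)
    for idx, vec in pairs:
        groups[idx].append(vec)
    new_centroids = list(centroids)  # an empty cluster keeps its old centroid
    for idx, members in groups.items():
        total = Counter()
        for vec in members:
            total.update(vec)  # adds term counts
        new_centroids[idx] = {t: c / len(members) for t, c in total.items()}
    return new_centroids


def kmeans(vectors, centroids, iterations=10):
    """Run a fixed number of map/reduce rounds (assumption: no convergence test)."""
    for _ in range(iterations):
        centroids = reduce_phase(map_phase(vectors, centroids), centroids)
    return centroids


docs = ["hadoop hadoop mapreduce", "hadoop mapreduce mapreduce",
        "football goal goal", "football football goal"]
vectors = [term_frequencies(d) for d in docs]
# Assumption: seed the two centroids with the first and third documents.
centroids = kmeans(vectors, [vectors[0], vectors[2]])
```

On a real cluster each round would be a separate job, with the current centroids distributed to every mapper; how the thesis handles that is not stated in this record.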