Probabilistic topic modeling with latent Dirichlet allocation on Apache Spark

dc.contributor.advisor: Cerquides Bueno, Jesús
dc.contributor.author: Cortés Hinojosa, Carlos Omar
dc.date.accessioned: 2016-10-19T07:31:35Z
dc.date.available: 2016-10-19T07:31:35Z
dc.date.issued: 2016-06-05
dc.description: Bachelor's degree final projects (Treballs Finals de Grau) in Computer Engineering, Faculty of Mathematics, Universitat de Barcelona, Year: 2016, Advisor: Jesús Cerquides Bueno [ca]
dc.description.abstract: In a world in which we have access to vast amounts of data, it is important to develop new tools that allow us to navigate through it. Probabilistic topic models are statistical methods that analyse text corpora and discover the themes that best explain their documents. In this work, we introduce probabilistic topic models, with a special focus on one of the most common models, Latent Dirichlet Allocation (LDA). To learn an LDA model from data, we present two variational inference algorithms, one for batch learning and one for online learning. Both algorithms are implemented on Apache Spark, a popular Big Data computing framework. We introduce this framework and study the algorithms' scalability and topic coherence on two news data sets, from the New York Times and BBC News. The results point to the need to tune Apache Spark in order to boost its performance, and to the good quality of the resulting topics on the BBC News data set. [ca]
dc.format.extent: 59 p.
dc.format.mimetype: application/pdf
dc.identifier.uri: https://hdl.handle.net/2445/102724
dc.language.iso: eng [ca]
dc.rights: report: cc-by-nc-sa (c) Carlos Omar Cortés Hinojosa, 2016
dc.rights: code: GPL (c) Carlos Omar Cortés Hinojosa, 2016
dc.rights.accessRights: info:eu-repo/semantics/openAccess [ca]
dc.rights.uri: http://creativecommons.org/licenses/by-sa/3.0/es
dc.rights.uri: http://www.gnu.org/licenses/gpl-3.0.ca.html
dc.source: Treballs Finals de Grau (TFG) - Enginyeria Informàtica
dc.subject.classification: Mètodes estadístics [cat]
dc.subject.classification: Tractament del llenguatge natural (Informàtica) [cat]
dc.subject.classification: Programari [cat]
dc.subject.classification: Treballs de fi de grau [cat]
dc.subject.classification: Algorismes computacionals [ca]
dc.subject.classification: Dades massives [ca]
dc.subject.other: Statistical methods [eng]
dc.subject.other: Natural language processing (Computer science) [eng]
dc.subject.other: Computer software [eng]
dc.subject.other: Bachelor's theses [eng]
dc.subject.other: Computer algorithms [eng]
dc.subject.other: Big data [eng]
dc.title: Probabilistic topic modeling with latent Dirichlet allocation on Apache Spark [ca]
dc.type: info:eu-repo/semantics/bachelorThesis [ca]

Files

Original bundle

Name: memoria.pdf
Size: 2.58 MB
Format: Adobe Portable Document Format
Description: Report

Name: codi_font.zip
Size: 2.15 MB
Format: ZIP file
Description: Source code