Document type

Bachelor's thesis

Publication date

Publication license

Thesis report: cc-by-nc-sa (c) Carlos Omar Cortés Hinojosa, 2016
Please always use this identifier to cite or link to this document: https://hdl.handle.net/2445/102724

Probabilistic topic modeling with latent Dirichlet allocation on Apache Spark

Abstract

In a world in which we have access to vast amounts of data, it is important to develop new tools that allow us to navigate through it. Probabilistic topic models are statistical methods that analyse text corpora and discover the themes that best explain their documents. In this work, we introduce probabilistic topic models, with a special focus on one of the most common models, Latent Dirichlet Allocation (LDA). To learn an LDA model from data, we present two variational inference algorithms, one for batch learning and one for online learning. Both algorithms are implemented on Apache Spark, a popular Big Data computing framework. We introduce this framework and study the scalability of the algorithms and the coherence of the resulting topics on two news data sets, from the New York Times and BBC News. The results point to the need to tune Apache Spark in order to boost its performance, and to the good quality of the resulting topics on the BBC News data set.

Description

Bachelor's Degree Final Projects in Computer Engineering, Faculty of Mathematics, Universitat de Barcelona, Year: 2016, Advisor: Jesús Cerquides Bueno

Citation

CORTÉS HINOJOSA, Carlos Omar. Probabilistic topic modeling with latent Dirichlet allocation on Apache Spark. [accessed: 2 February 2026]. [Available at: https://hdl.handle.net/2445/102724]
