Efficient transformers applied to video classification

dc.contributor.advisor: Escalera Guerrero, Sergio
dc.contributor.advisor: Clapés i Sintes, Albert
dc.contributor.advisor: Pujol, David
dc.contributor.author: Martínez Pérez, Oriol
dc.date.accessioned: 2023-10-25T08:33:32Z
dc.date.available: 2023-10-25T08:33:32Z
dc.date.issued: 2023-06-12
dc.description: Bachelor's Degree Final Project in Mathematics, Faculty of Mathematics, Universitat de Barcelona. Year: 2023. Advisors: Sergio Escalera Guerrero, Albert Clapés i Sintes and David Pujol
dc.description.abstract: [en] Transformers, with the self-attention mechanism at their core, have shown great performance in several Machine Learning areas, such as NLP and Computer Vision, since their appearance in 2017 [1]. However, their quadratic time and memory complexity in the input length makes their application prohibitive when dealing with long input sequences. This motivated the appearance of several self-attention reformulations that lower this complexity and make development less costly. We focus on three of these self-attention mechanisms applied to video classification: Cosformer [2], Nyströmformer [3] and Linformer [4]. Concretely, our goal in this project is to suggest which of them is best suited for this task. To evaluate the performance of each model, we design a customizable Transformer with interchangeable self-attention mechanisms and train it on a simplified dataset derived from EpicKitchens-100 [5]. We carefully describe the Transformer architecture, explaining the purpose of each of its modules, and provide an overall description of how it works internally. Preliminary results indicate that Nyströmformer is the best option, being the model that converged fastest and achieved the best trade-off between computational cost and classification metrics. Linformer obtained similar results, while Cosformer apparently failed to perform the classification. The theoretical formalization of the aforementioned self-attention mechanisms is essential for interpreting their results. Hence, we also provide an in-depth mathematical description of both the original self-attention mechanism presented by Vaswani [1] and the three efficient mechanisms. We carry out a complexity analysis of all the mechanisms and present their main properties, linking the theoretical basis with the results.
dc.format.extent: 48 p.
dc.format.mimetype: application/pdf
dc.identifier.uri: https://hdl.handle.net/2445/203127
dc.language.iso: eng
dc.rights: cc-by-nc-nd (c) Oriol Martínez Pérez, 2023
dc.rights.accessRights: info:eu-repo/semantics/openAccess
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/es/
dc.source: Treballs Finals de Grau (TFG) - Matemàtiques
dc.subject.classification: Visió per ordinador
dc.subject.classification: Aprenentatge automàtic
dc.subject.classification: Tractament del llenguatge natural (Informàtica)
dc.subject.classification: Treballs de fi de grau
dc.subject.other: Computer vision
dc.subject.other: Machine learning
dc.subject.other: Natural language processing (Computer science)
dc.subject.other: Bachelor's theses
dc.title: Efficient transformers applied to video classification
dc.type: info:eu-repo/semantics/bachelorThesis
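
For reference, the quadratic bottleneck described in the abstract above can be made concrete with a minimal NumPy sketch of the standard scaled dot-product attention of Vaswani et al. [1], followed by a generic kernelized linearization in the spirit of (but not identical to) mechanisms such as Cosformer. This is an illustrative toy, not the thesis's implementation; the function names, the feature map phi and the toy dimensions are our assumptions.

    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax.
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(Q, K, V):
        # Scaled dot-product attention from Vaswani et al. [1].
        # Q, K, V: (n, d) arrays for a sequence of n tokens.
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)  # (n, n): quadratic in n, the bottleneck
        return softmax(scores, axis=-1) @ V

    def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
        # Generic kernelized linearization (an illustrative assumption, not
        # Cosformer's exact cos-based reweighting): reassociating the product
        # as phi(Q) (phi(K)^T V) never materializes an (n, n) matrix.
        Qp, Kp = phi(Q), phi(K)
        KV = Kp.T @ V                             # (d, d): cost linear in n
        Z = Qp @ Kp.sum(axis=0, keepdims=True).T  # (n, 1) normalizer
        return (Qp @ KV) / Z

    # Toy usage: n = 4 tokens with d = 8 features each.
    rng = np.random.default_rng(0)
    Q, K, V = rng.normal(size=(3, 4, 8))
    out_quadratic = self_attention(Q, K, V)  # (4, 8)
    out_linear = linear_attention(Q, K, V)   # (4, 8), linear cost in n

The point of the second function is only the reassociation: with a non-negative feature map phi, the O(n^2 d) time and O(n^2) memory of the score matrix become O(n d^2) time and O(d^2) memory. The three mechanisms studied in the thesis each achieve this kind of saving differently, e.g. Linformer by projecting K and V to a fixed sequence length and Nyströmformer by a Nyström approximation of the softmax attention matrix.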

Files

Original bundle

Name: tfg_oriol_martinez_perez.pdf
Size: 1.88 MB
Format: Adobe Portable Document Format
Description: Report