Document type

Thesis

Version

Published version

Publication date

Publication licence

cc by-nc (c) Zhang, Zejian, 2025
Please always use this identifier to cite or link this document: https://hdl.handle.net/2445/223497

Deep Learning Approaches for Human Activity Understanding

Abstract

Understanding human activities is crucial for developing practical applications that benefit society. Temporal action localization (TAL) in untrimmed videos is one of the most challenging tasks in this field. While significant progress has been made over the years, the methods developed are still far from being suitable for real-world use, and TAL remains an open challenge. This thesis addresses this challenging task through three contributions. First, we propose a dual hierarchical model capable of extracting and fusing both local, fine-grained boundary details and broader, high-level semantic contexts for TAL. In this method, the second hierarchy enables the model to uncover actions of varying durations by leveraging the features learned from the first hierarchy. Our findings show that fusing temporal contexts at different scales is essential for precise TAL. This approach relies on the self-attention mechanism of Transformer encoders; however, due to the quadratic complexity of self-attention, methods built on it may struggle to handle videos of real-world length. Next, we present a comprehensive experimental comparison to determine which temporal feature encoder should be selected under different conditions. We analyzed 12 models equipped with pure Transformer encoders, pure Mamba blocks, or combinations of both in a unified encoder for TAL. The experimental results suggest that the choice of encoder depends heavily on the specific dataset. Nevertheless, the pure Mamba block emerges as the preferred option for unknown datasets due to its performance and lower complexity. Finally, we introduce UDIVA-HHOI, a novel large-scale audio-visual dyadic human-human-object interaction dataset. This dataset provides rich, extremely short-duration, and concurrent actions, featuring both low-level physical actions and high-level goal-oriented actions, as well as the objects involved in these actions, elements not typically represented in commonly used TAL benchmarks. UDIVA-HHOI opens up new possibilities for addressing the detection of complex interactive actions in real-world scenarios. Our preliminary study confirms its potential, and our analysis also offers recommendations for selecting an appropriate feature encoder for future research on this new benchmark, with the Mamba block being the preferred choice.
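The quadratic cost of self-attention mentioned in the abstract can be seen in a minimal NumPy sketch of scaled dot-product self-attention. This is an illustrative toy implementation, not code from the thesis; all names and dimensions here are hypothetical:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over T frame features.

    x: (T, d) input sequence; w_q, w_k, w_v: (d, d) projections.
    The intermediate score matrix has shape (T, T), so memory and
    compute grow quadratically with the sequence length T, which is
    why long untrimmed videos are problematic for pure Transformers.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(x.shape[1])           # (T, T): the quadratic term
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ v                               # (T, d)

rng = np.random.default_rng(0)
T, d = 512, 64                                       # e.g. 512 video frames
x = rng.standard_normal((T, d))
w = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]
out = self_attention(x, *w)
print(out.shape)  # (512, 64), via a (512, 512) attention matrix
```

Doubling T doubles the output size but quadruples the attention matrix; linear-time state-space blocks such as Mamba avoid materializing that (T, T) matrix altogether.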

Citation

ZHANG, Zejian. Deep Learning Approaches for Human Activity Understanding. [accessed: 30 November 2025]. [Available at: https://hdl.handle.net/2445/223497]
