Document type: Thesis
Version: Published version
Publication date:
Publication license:
Please always use this identifier to cite or link to this document: https://hdl.handle.net/2445/223497
Deep Learning Approaches for Human Activity Understanding
Authors
Zhang, Zejian
Abstract
[eng] Understanding human activities is crucial for developing practical applications that benefit society. Temporal action localization (TAL) in untrimmed videos is one of the most challenging tasks in this field. While significant progress has been made over the years, existing methods are still far from suitable for real-world use, and TAL remains an open problem. This thesis addresses this challenging task through three contributions.
First, we propose a dual hierarchical model that extracts and fuses both local, fine-grained boundary details and broader, high-level semantic context for TAL. The second hierarchy enables the model to uncover actions of varying durations by leveraging the features learned from the first. Our findings show that fusing temporal context at multiple scales is essential for precise TAL. This approach relies on the self-attention mechanism of Transformer encoders; however, because self-attention has quadratic complexity in sequence length, methods built on it may struggle with real-world-length videos. Next, we present a comprehensive experimental comparison to determine which temporal feature encoder should be selected under different conditions. We analyzed 12 models equipped with pure Transformer encoders, pure Mamba blocks, and combinations of both in a unified encoder for TAL. The results suggest that the best choice of encoder depends heavily on the specific dataset. Nevertheless, the pure Mamba block emerges as the preferred option for unknown datasets due to its strong performance and lower computational complexity. Finally, we introduce UDIVA-HHOI, a novel large-scale audio-visual dyadic human-human-object interaction dataset. It provides rich annotations of extremely short-duration and concurrent actions, covering both low-level physical actions and high-level goal-oriented actions, together with the objects involved in them — elements not typically represented in commonly used TAL benchmarks. UDIVA-HHOI opens up new possibilities for detecting complex interactive actions in real-world scenarios. Our preliminary study confirms its potential, and our analysis offers recommendations for selecting an appropriate feature encoder for future research on this new benchmark, with the Mamba block being the preferred choice.
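The complexity argument above can be illustrated with a minimal operation-count sketch. This is not the thesis's model: the function names are illustrative assumptions, and the sketch only counts pairwise comparisons (quadratic, as in self-attention) versus per-step recurrent updates (linear, as in Mamba-style state-space scans) to show why the gap matters for long untrimmed videos.

```python
# Illustrative operation counts only; no real model or weights involved.

def attention_score_count(seq_len: int) -> int:
    """Self-attention compares every time step with every other: T x T scores."""
    return seq_len * seq_len

def scan_step_count(seq_len: int) -> int:
    """A recurrent scan touches each time step once: T state updates."""
    return seq_len

# For a 10,000-frame untrimmed video, the pairwise score matrix alone has
# 100 million entries, while a linear scan performs only 10,000 updates.
for t in (100, 1_000, 10_000):
    print(f"T={t}: attention scores={attention_score_count(t)}, scan steps={scan_step_count(t)}")
```

The 10,000x ratio at T = 10,000 is why encoders with linear-time scans are attractive for real-world-length video, even when attention remains competitive on shorter clips.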
Citation
ZHANG, Zejian. Deep Learning Approaches for Human Activity Understanding. [accessed: 30 November 2025]. [Available at: https://hdl.handle.net/2445/223497]