In video action recognition, the Dollar detector has been widely used to extract Spatio-Temporal Interest Points (STIPs) from action video sequences. It produces two kinds of information: the STIP position and the response value. In many cases, however, the detector response, which measures the strength of local motion changes, is ignored. Using this information, we propose a Hierarchical STIP Saliency (HSS) framework that provides different types of motion information. We put forward a novel local feature, the Mixed Neighborhood Feature (MNF), which integrates the similarity and positional relationships between local features and is encoded by locality-constrained linear coding. Then, by partitioning the video sequence along the temporal direction, a group of sub-STVs is produced, and their corresponding descriptors are obtained with a max-pooling-on-absolute-value technique. In the classification stage, Locality-constrained Group Sparse Representation (LGSR) is adopted as the classifier to exploit the intrinsic group information of these sub-STV features. Experiments on the KTH and UCF Sports datasets show that our recognition system based on HSS and MNF performs well compared with recently published recognition systems.
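
The temporal partitioning and pooling step can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pool_sub_stvs`, the equal-length segmentation, and the choice to keep the signed coefficient of largest magnitude (rather than its absolute value) are all assumptions made for illustration.

```python
import numpy as np

def pool_sub_stvs(codes, frames, n_frames, n_segments):
    """Pool coded local features into one descriptor per temporal segment.

    codes:  (N, K) array, one coded feature (e.g. an LLC code) per STIP.
    frames: (N,) array, frame index of each STIP.
    Assumption: segments are equal-length and pooling keeps, per dimension,
    the signed coefficient with the largest absolute value.
    """
    codes = np.asarray(codes, dtype=float)
    frames = np.asarray(frames)
    # Equal-width segment boundaries along the temporal axis.
    edges = np.linspace(0, n_frames, n_segments + 1)
    seg = np.searchsorted(edges, frames, side="right") - 1
    seg = np.clip(seg, 0, n_segments - 1)

    pooled = np.zeros((n_segments, codes.shape[1]))
    cols = np.arange(codes.shape[1])
    for s in range(n_segments):
        block = codes[seg == s]
        if block.shape[0] == 0:
            continue  # no STIPs in this segment: descriptor stays zero
        # Row index of the largest-magnitude entry in each column.
        idx = np.abs(block).argmax(axis=0)
        pooled[s] = block[idx, cols]
    return pooled

# Hypothetical toy input: four coded features over a 10-frame clip.
codes = [[0.2, -0.9], [-0.5, 0.1], [0.3, 0.4], [-0.1, -0.6]]
frames = [0, 1, 5, 9]
descriptors = pool_sub_stvs(codes, frames, n_frames=10, n_segments=2)
```

Each row of `descriptors` corresponds to one sub-STV; under the assumptions above, these rows would form the group of features passed to the classifier.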


