1. Deep High-Resolution Representation Learning for Human Pose Estimation (HRNetV1)

2. High-Resolution Representations for Labeling Pixels and Regions (HRNetV2, HRNetV2p)

1. Introduction

Human pose estimation (also known as keypoint detection) aims to detect the locations of K keypoints or parts (for example, elbow, wrist, etc.) in an image I of size W*H*3. State-of-the-art methods recast this problem as estimating K heatmaps of size W*H, {H1, H2, …, HK}, where each heatmap Hk gives the location confidence of the k-th keypoint.
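The heatmap formulation above can be illustrated with a minimal sketch. It assumes the simplest possible decoding, taking the argmax of each heatmap as the keypoint location; the helper names `gaussian_heatmap` and `decode_keypoints`, the Gaussian shape of the toy heatmaps, and the grid size are hypothetical choices for illustration, not something the papers prescribe.

```python
import math

def gaussian_heatmap(width, height, cx, cy, sigma=1.0):
    """Build an H x W heatmap (list of rows) that peaks at (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def decode_keypoints(heatmaps):
    """Return one (x, y, confidence) triple per heatmap via a plain argmax."""
    keypoints = []
    for hm in heatmaps:
        best = max(
            ((value, x, y) for y, row in enumerate(hm) for x, value in enumerate(row)),
            key=lambda t: t[0],
        )
        keypoints.append((best[1], best[2], best[0]))
    return keypoints

# K = 2 hypothetical keypoints on an 8 x 6 (W x H) grid.
heatmaps = [gaussian_heatmap(8, 6, 2, 3), gaussian_heatmap(8, 6, 5, 1)]
keypoints = decode_keypoints(heatmaps)
```

Here `keypoints` holds one location and confidence per heatmap Hk; practical systems usually refine the argmax to sub-pixel precision, which this sketch leaves out.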

Typical pose estimation networks

Hourglass: a symmetric encoder-decoder network

Cascaded pyramid networks: RefineNet convolves feature maps of different scales and fuses them

SimpleBaseline: recovers resolution with transposed convolutions in the decoder

Combination with dilated convolutions: uses dilated (atrous) convolutions to enlarge the receptive field in the encoder

Characteristics of these networks:

Two processes: a high-to-low process (producing low-resolution, high-level feature representations) and a low-to-high process (recovering high resolution), and the two processes are connected in series.

Some networks fuse high-level feature maps with low-level feature maps.

2. Network architecture

Advantages: 1. A high-resolution feature representation is maintained throughout the whole process; high-to-low subnetworks are added gradually, and the multi-resolution subnetworks are connected in parallel.

2. Information is exchanged repeatedly between the parallel multi-resolution subnetworks (multi-scale fusion), so that high-resolution and low-resolution features enhance each other.

The network is divided into 4 stages, and each stage has one more branch than the previous stage. The new branch is obtained by applying strided convolutions to all feature maps of the previous stage and fusing the results; its resolution is half that of the previous branch and its number of channels is doubled. Each stage is built from multi-resolution blocks.
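The stage layout described above can be written down as a small schedule. This is a sketch under simplifying assumptions: the function name `hrnet_branch_schedule` is hypothetical, the base resolution (64, 48) is an arbitrary example, and the stem and any stage-specific width differences found in real implementations are ignored.

```python
def hrnet_branch_schedule(base_channels, base_resolution, num_stages=4):
    """Return, per stage, a list of (channels, (height, width)) for each branch."""
    height, width = base_resolution
    schedule = []
    for stage in range(1, num_stages + 1):
        branches = [
            # Branch b has half the resolution and twice the channels of branch b - 1.
            (base_channels * 2 ** b, (height // 2 ** b, width // 2 ** b))
            for b in range(stage)  # stage s runs s branches in parallel
        ]
        schedule.append(branches)
    return schedule

schedule = hrnet_branch_schedule(base_channels=32, base_resolution=(64, 48))
for stage_index, branches in enumerate(schedule, start=1):
    print(f"stage {stage_index}: {branches}")
```

With a base width of 32, the last stage carries 32, 64, 128, and 256 channels on its four branches.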

Each multi-resolution block is divided into two parts:

(a) multi-resolution group convolution: several parallel branches, each containing 4 residual units

(b) multi-resolution convolution (exchange unit): performs multi-scale feature fusion

Schematic of the exchange unit in stage 3:

The high-, medium-, and low-resolution feature maps are fused with one another: higher-resolution maps are downsampled with strided convolutions, and lower-resolution maps are upsampled and passed through a 1*1 convolution. Because the fusion strategy is element-wise addition, the channel counts of the feature maps at different resolutions must be adjusted to match.
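The fusion rule above can be made concrete by listing, for every pair of input and output branches, which transform is applied before the element-wise sum. This is only a planning sketch: `exchange_unit_plan` is a hypothetical helper, the 3x3 kernel for the strided convolutions and the choice to change the channel count only in the last strided convolution are assumptions, and the upsample-then-1x1 order simply follows the text.

```python
def exchange_unit_plan(channels):
    """Map (input branch j, output branch i) to the ops applied before summation.

    channels[b] is the channel count of branch b; branch 0 has the highest resolution.
    """
    plan = {}
    for i, out_ch in enumerate(channels):        # output branch i
        for j, in_ch in enumerate(channels):     # input branch j
            if j == i:
                ops = ["identity"]
            elif j < i:
                # Higher -> lower resolution: (i - j) stride-2 convolutions,
                # with the last one matching the output channel count (assumption).
                steps = i - j
                ops = [f"conv3x3 stride2 {in_ch}->{in_ch}"] * (steps - 1)
                ops.append(f"conv3x3 stride2 {in_ch}->{out_ch}")
            else:
                # Lower -> higher resolution: upsample, then 1x1 conv to match channels.
                ops = [f"upsample x{2 ** (j - i)}", f"conv1x1 {in_ch}->{out_ch}"]
            plan[(j, i)] = ops
    return plan

plan = exchange_unit_plan([32, 64, 128])   # three branches, as in stage 3
for (j, i), ops in sorted(plan.items()):
    print(f"branch {j} -> branch {i}: {ops}")
```

Each output branch i then sums its three transformed inputs, which is why every path must end with the same channel count and resolution as branch i.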

3. Human pose estimation

HRNetV1: outputs only the high-resolution feature map

Experimental results:

Experiment: COCO Keypoint Detection

Results on the validation set

Compared with SimpleBaseline, the best method at the time, HRNet-W32 (channels: 32, 64, 128, 256) and HRNet-W48 (channels: 48, 96, 192, 384) use fewer parameters and less computation while achieving higher performance.

4. Semantic segmentation, facial landmark detection

HRNetV2: uses the feature maps of all resolutions. The low-resolution feature maps are upsampled and concatenated with the high-resolution feature map; a 1*1 convolution followed by a softmax layer then produces the segmentation prediction map.
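The HRNetV2 head described above (upsample, concatenate, 1*1 convolution, softmax) can be traced on toy data. The sketch below uses nested Python lists in [C][H][W] layout and swaps in nearest-neighbour upsampling for simplicity; the function names, the toy inputs, and the weight values are all hypothetical.

```python
import math

def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a [C][H][W] nested list."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [value for value in row for _ in range(factor)]
            rows.extend([list(wide) for _ in range(factor)])
        out.append(rows)
    return out

def conv1x1(fmap, weights):
    """Per-pixel linear map; weights[o][c] maps input channel c to output channel o."""
    height, width = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(w * fmap[c][y][x] for c, w in enumerate(w_out)) for x in range(width)]
         for y in range(height)]
        for w_out in weights
    ]

def softmax_over_channels(fmap):
    """Softmax across the channel axis at every pixel."""
    height, width = len(fmap[0]), len(fmap[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in fmap]
    for y in range(height):
        for x in range(width):
            logits = [channel[y][x] for channel in fmap]
            peak = max(logits)                       # subtract max for stability
            exps = [math.exp(v - peak) for v in logits]
            total = sum(exps)
            for k, e in enumerate(exps):
                out[k][y][x] = e / total
    return out

def hrnetv2_head(branches, weights):
    """branches[b] has shape [C_b][H / 2^b][W / 2^b], highest resolution first."""
    fused = []
    for b, fmap in enumerate(branches):
        fused.extend(upsample_nearest(fmap, 2 ** b))  # concatenate along channels
    return softmax_over_channels(conv1x1(fused, weights))

# Toy input: two branches with 1 and 2 channels at 4x4 and 2x2.
high = [[[float(x + y) for x in range(4)] for y in range(4)]]
low = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
weights = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]   # 3 fused channels -> 2 classes
probs = hrnetv2_head([high, low], weights)
```

The output `probs` has one channel per class at the highest resolution, and the class probabilities at each pixel sum to 1.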

Experiment: Cityscapes Segmentation

Validation set:

HRNetV2-W40 achieves a higher mIoU than UNet++ and DeepLabv3 with fewer parameters.

HRNetV2-W48 achieves a higher mIoU with a parameter count comparable to PSPNet.
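For readers unfamiliar with the metric, mIoU (mean intersection over union) can be computed from a confusion matrix as sketched below. The function name `mean_iou` and the toy matrix are hypothetical; benchmark evaluation scripts add details such as ignored labels that are not shown here.

```python
def mean_iou(confusion):
    """confusion[t][p] = number of pixels with true class t predicted as class p."""
    num_classes = len(confusion)
    ious = []
    for k in range(num_classes):
        true_positive = confusion[k][k]
        false_negative = sum(confusion[k]) - true_positive
        false_positive = sum(confusion[t][k] for t in range(num_classes)) - true_positive
        union = true_positive + false_positive + false_negative
        if union > 0:   # skip classes absent from both labels and predictions
            ious.append(true_positive / union)
    return sum(ious) / len(ious)

# Toy 2-class confusion matrix (made-up counts, not results from the papers).
confusion = [[8, 2], [1, 9]]
miou = mean_iou(confusion)
```

In this toy case the per-class IoUs are 8/11 and 9/12, and `miou` is their average.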

Test set:

Experiment: PASCAL Context Segmentation

Experimental results

Under both evaluation schemes, without and with the background label, HRNetV2-W48 shows better performance.

5. Image classification

HRNet-Wx-C: the feature maps at the 4 resolutions each pass through a bottleneck layer. Then, starting from the highest resolution, each map is downsampled by a strided convolution that doubles its channel count and is added element-wise to the next lower-resolution map. Finally, a 1*1 convolution doubles the channels (1024->2048), and after global average pooling the result is sent to the classifier.
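The shape bookkeeping of this classification head can be traced step by step. The sketch below is an assumption-laden illustration: the bottleneck output widths 128, 256, 512, 1024 are chosen only to be consistent with the doubling and the 1024->2048 step in the text, the 56x56 base resolution is an arbitrary example, and `classification_head_trace` is a hypothetical helper.

```python
def classification_head_trace(bottleneck_channels, base_resolution, final_channels=2048):
    """Return (step, channels, (height, width)) after each step of the head."""
    height, width = base_resolution
    current = bottleneck_channels[0]
    trace = [("branch 0 after bottleneck", current, (height, width))]
    for b in range(1, len(bottleneck_channels)):
        height, width = height // 2, width // 2   # stride-2 convolution halves H and W
        current = bottleneck_channels[b]           # channels must match for the addition
        trace.append((f"strided conv + add with branch {b}", current, (height, width)))
    trace.append(("1x1 conv", final_channels, (height, width)))
    trace.append(("global average pool", final_channels, (1, 1)))
    return trace

# Assumed bottleneck widths and base resolution, for illustration only.
trace = classification_head_trace([128, 256, 512, 1024], base_resolution=(56, 56))
for step, channels, size in trace:
    print(f"{step}: {channels} channels at {size}")
```

The final entry is a 2048-dimensional vector, which is what the classifier receives.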

Experiment: ImageNet Classification

Comparison with ResNet

With a similar number of parameters and amount of computation, HRNets achieve results comparable to, and slightly better than, ResNets.

6. Object detection

HRNetV2p: the concatenated feature map from HRNetV2 is downsampled by average pooling at different scales to produce feature representations at several levels; after a 1*1 convolution, these form the feature pyramid.
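The pyramid construction above can be sketched with plain average pooling. The example assumes non-overlapping pooling with kernel sizes 1, 2, 4, and 8 and four levels, and it omits the 1*1 convolutions; `average_pool` and `build_pyramid` are hypothetical helper names.

```python
def average_pool(channel, kernel):
    """Non-overlapping kernel x kernel average pooling on one [H][W] channel."""
    height, width = len(channel) // kernel, len(channel[0]) // kernel
    return [
        [sum(channel[y * kernel + dy][x * kernel + dx]
             for dy in range(kernel) for dx in range(kernel)) / kernel ** 2
         for x in range(width)]
        for y in range(height)
    ]

def build_pyramid(fmap, num_levels=4):
    """Pool a [C][H][W] map with kernels 1, 2, 4, ... to get one map per level."""
    return [
        [average_pool(channel, 2 ** level) for channel in fmap]
        for level in range(num_levels)
    ]

# Toy single-channel 8x8 feature map standing in for the concatenated HRNetV2 output.
fmap = [[[float(y * 8 + x) for x in range(8)] for y in range(8)]]
pyramid = build_pyramid(fmap)
sizes = [(len(level[0]), len(level[0][0])) for level in pyramid]
```

Each level keeps the channel count of the input and halves the spatial size of the level before it, which is the multi-level structure a detector consumes.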
