[Paper reading] HRNetV1, HRNetV2, HRNetV2p


1. Deep High-Resolution Representation Learning for Human Pose Estimation (HRNetV1)

2. High-Resolution Representations for Labeling Pixels and Regions (HRNetV2, HRNetV2p)

1. Introduction

Human pose estimation (also known as keypoint detection) aims to detect the locations of K keypoints or parts (for example, elbow, wrist, etc.) in an image I of size W × H × 3. State-of-the-art methods transform this problem into estimating K heat maps of size W × H, {H1, H2, …, HK}, where each heat map Hk indicates the location confidence of the k-th keypoint.
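To make the heat map formulation concrete, here is a minimal sketch of how keypoint locations could be read back from K heat maps by taking the highest-confidence position in each map. The function name `decode_heatmaps` and the nested-list representation are assumptions made for illustration; real implementations usually work on tensors and often refine the peak position.

```python
def decode_heatmaps(heatmaps):
    """Return one (x, y, confidence) triple per keypoint.

    `heatmaps` is a list of K maps; each map is a list of H rows of W floats.
    The keypoint location is taken to be the position of the maximum value.
    """
    keypoints = []
    for hm in heatmaps:
        best = (0, 0, hm[0][0])
        for y, row in enumerate(hm):
            for x, value in enumerate(row):
                if value > best[2]:
                    best = (x, y, value)
        keypoints.append(best)
    return keypoints


# Two toy 2 x 3 heat maps (K = 2), purely illustrative values.
toy_heatmaps = [
    [[0.0, 0.1, 0.0],
     [0.2, 0.9, 0.1]],
    [[0.7, 0.0, 0.0],
     [0.0, 0.3, 0.0]],
]
print(decode_heatmaps(toy_heatmaps))
```

Each returned triple gives the column, the row, and the confidence of the peak for one keypoint.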

Typical pose estimation networks:

Hourglass: a symmetric encoder-decoder network

Cascaded pyramid networks: the RefineNet stage convolves feature maps of different scales and fuses them

SimpleBaseline: recovers resolution with transposed convolutions in the decoder

Combination with dilated convolutions: uses dilated (atrous) convolutions in the encoder to enlarge the receptive field

Characteristics of these networks:

They consist of two processes: a high-to-low process (producing low-resolution, high-level feature representations) and a low-to-high process (recovering high resolution), and the two processes run in series.

Some networks also fuse high-level feature maps with low-level feature maps.

2. Network architecture

Advantages: 1. A high-resolution feature representation is maintained throughout the whole process. High-to-low subnetworks are added gradually, and the multi-resolution subnetworks are connected in parallel.

2. Information is exchanged repeatedly between the parallel multi-resolution subnetworks (multi-scale fusion), so that high-resolution and low-resolution features enhance each other.

The network is divided into 4 stages. Each stage has one more branch than the previous stage. The new branch is obtained by applying strided convolutions to all feature maps of the previous stage and fusing the results; its resolution is half that of the previous branch, and its number of channels is doubled. Each stage is made up of multi-resolution blocks.
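The halving and doubling rule can be tabulated with a few lines of code. The sketch below is only bookkeeping, not a network: `hrnet_branch_layout` is a hypothetical helper, `base_size` is assumed to be the resolution of the highest-resolution branch, and the first branch is assumed to carry `width` channels in every stage, which ignores any stem or first-stage specifics.

```python
def hrnet_branch_layout(width, base_size, num_stages=4):
    """List (channels, height, width) for every branch of every stage.

    Stage s has s branches; branch i has width * 2**i channels and a
    resolution of base_size divided by 2**i.
    """
    base_h, base_w = base_size
    layout = []
    for stage in range(1, num_stages + 1):
        branches = [
            (width * 2 ** i, base_h // 2 ** i, base_w // 2 ** i)
            for i in range(stage)
        ]
        layout.append(branches)
    return layout


for stage_index, branches in enumerate(hrnet_branch_layout(32, (64, 48)), start=1):
    print(f"stage {stage_index}: {branches}")
```

With `width=32` the last stage has channel counts 32, 64, 128, 256, which matches the HRNet-W32 configuration quoted in the experiments.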

Each multi-resolution block is divided into two parts:

(a) multi-resolution group convolution: several parallel branches, each containing 4 residual units

(b) multi-resolution convolution (exchange unit): performs multi-scale feature fusion

Schematic of the exchange unit in stage 3:

The high-, medium-, and low-resolution feature maps are fused with one another: a higher-resolution map is downsampled with strided convolution, and a lower-resolution map is upsampled and then passed through a 1×1 convolution. Because the fusion strategy is element-wise addition, the feature maps of different resolutions must be adjusted to the same number of channels.
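The fusion rule can be summarized as a lookup of which operation connects each source branch to each target branch. The sketch below only produces a textual plan; `fusion_plan` is a hypothetical helper, and the 3x3 kernel and stride 2 for downsampling are assumptions not stated in the text above.

```python
def fusion_plan(num_branches):
    """Map (source, target) branch indices to the operation used for fusion.

    Branch 0 has the highest resolution; each later branch has half the
    resolution of the one before it.
    """
    plan = {}
    for target in range(num_branches):
        for source in range(num_branches):
            if source == target:
                op = "identity"
            elif source < target:
                # Source has higher resolution: downsample step by step.
                op = f"{target - source} x strided 3x3 conv (stride 2)"
            else:
                # Source has lower resolution: upsample, then match channels.
                op = f"upsample x{2 ** (source - target)} + 1x1 conv"
            plan[(source, target)] = op
    return plan


for (source, target), op in sorted(fusion_plan(3).items()):
    print(f"branch {source} -> branch {target}: {op}")
```

Every target branch then sums its incoming maps element-wise, which is why each path has to end with the channel count of the target.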

3. Human pose estimation

HRNetV1: outputs only the high-resolution feature map

Experimental results:

Experiment: COCO Keypoint Detection

Results on the validation set

Compared with SimpleBaseline, the best method at the time, HRNet-W32 (channels: 32, 64, 128, 256) and HRNet-W48 (channels: 48, 96, 192, 384) use fewer parameters and less computation while achieving higher performance.

4. Semantic segmentation and facial landmark detection

HRNetV2: uses the feature maps of all resolutions. The low-resolution feature maps are upsampled and concatenated with the high-resolution feature map; a 1×1 convolution followed by a softmax layer then produces the segmentation prediction map.
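The upsample-and-concatenate step can be illustrated on toy data. This is a rough sketch under stated assumptions: feature maps are nested lists, `upsample_nearest` and `concat_multires` are hypothetical helpers, and nearest-neighbour upsampling is used only as a simple stand-in for whatever interpolation the network actually applies. The 1×1 convolution and softmax are omitted.

```python
def upsample_nearest(fmap, factor):
    """Upsample a 2D map (list of rows) by an integer factor."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def concat_multires(branches):
    """Bring every branch to the highest resolution and stack the channels.

    `branches` is ordered from highest to lowest resolution; each branch is
    a list of 2D channel maps.
    """
    target_height = len(branches[0][0])
    fused = []
    for branch in branches:
        factor = target_height // len(branch[0])
        fused.extend(upsample_nearest(channel, factor) for channel in branch)
    return fused


high = [[[1, 2], [3, 4]]]   # 1 channel at 2 x 2
low = [[[5]], [[6]]]        # 2 channels at 1 x 1
fused = concat_multires([high, low])
print(len(fused), "channels at", len(fused[0]), "x", len(fused[0][0]))
```

The output channel count is the sum of the branch channel counts, so four branches with C, 2C, 4C, and 8C channels would give 15C channels before the 1×1 convolution.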

Experiment: Cityscapes Segmentation

Validation set:

HRNetV2-W40 achieves a higher mIoU than UNet++ and DeepLabv3 with fewer parameters.

HRNetV2-W48 achieves a higher mIoU than PSPNet with a comparable number of parameters.

Test set:

Experiment: PASCAL Context Segmentation

Experimental results:

HRNetV2-W48 performs better under both evaluation schemes, without and with the background label.

5. Image classification

HRNet-Wx-C: the 4 feature maps of different resolutions are each fed into a bottleneck layer that increases the number of channels. Starting from the high-resolution map, each map is then downsampled with strided convolution and added element-wise to the next lower-resolution map. A 1×1 convolution then doubles the channels (1024 -> 2048), and after global average pooling the result is sent to the classifier.

Experiment: ImageNet Classification

Comparison with ResNet

With a similar number of parameters and amount of computation, HRNets achieve results comparable to, and slightly better than, ResNets.

6. Object detection

HRNetV2p: the concatenated feature map of HRNetV2 is downsampled by average pooling at different scales to produce feature representations at several levels; after a 1×1 convolution, these form the feature pyramid.
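The pooling step can be sketched on a single-channel map. The helpers `avg_pool2d` and `build_pyramid` are hypothetical, the pooling factors (1, 2, 4) are chosen only for this toy example, and the 1×1 convolution applied to each level is left out.

```python
def avg_pool2d(fmap, k):
    """Non-overlapping k x k average pooling on a 2D map (list of rows)."""
    height, width = len(fmap), len(fmap[0])
    pooled = []
    for y in range(0, height - k + 1, k):
        row = []
        for x in range(0, width - k + 1, k):
            window = [fmap[y + dy][x + dx] for dy in range(k) for dx in range(k)]
            row.append(sum(window) / (k * k))
        pooled.append(row)
    return pooled


def build_pyramid(fmap, factors=(1, 2, 4)):
    """Return one pooled copy of `fmap` per downsampling factor."""
    return [avg_pool2d(fmap, k) for k in factors]


toy_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
for level in build_pyramid(toy_map):
    print(len(level), "x", len(level[0]), level)
```

Each level keeps the same channels as the concatenated map and differs only in spatial resolution, which is what lets the levels serve as a feature pyramid.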
