Deep Fashion
Extracting and interpreting information about humans from visual data is an active area of research in Computer Vision. Traditional models map information about humans to 2D representations (stick-like figures), leading to incomplete representation. We implement an optimized model for mapping the pixels of a human in a 2D image to a generalized 3D human model. The model builds upon the recent state-of-the-art method DensePose-RCNN [1] and substitutes its backbone architecture with a combination of fire modules and depthwise convolutions to reduce the parameter count and inference latency. We have applied this optimized model to texture transfer and extraction from pictures of people. Specifically, using the 3D representation (IUV maps) of a person, we extract texture information from a set of images of that person and use it for texture transfer to other subjects or for texture mapping. To demonstrate this capability, we have built a unified Android application for texture extraction, transfer, and retrieval.
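The IUV-based texture extraction mentioned above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the project's actual implementation: it assumes a DensePose-style IUV output given as an (H, W, 3) array whose first channel is the body-part index (0 for background) and whose remaining channels are the U, V surface coordinates in [0, 1]; the function name `extract_texture` and the atlas layout are our own hypothetical choices.

```python
import numpy as np

def extract_texture(image, iuv, num_parts=24, tex_size=64):
    """Accumulate per-part texture maps from a single image.

    image: (H, W, 3) uint8 RGB image.
    iuv:   (H, W, 3) array; channel 0 is the body-part index
           (0 = background, 1..num_parts), channels 1-2 are the
           U, V surface coordinates in [0, 1].
    Returns a (num_parts, tex_size, tex_size, 3) texture atlas.
    """
    atlas = np.zeros((num_parts, tex_size, tex_size, 3), dtype=np.float64)
    counts = np.zeros((num_parts, tex_size, tex_size, 1), dtype=np.float64)

    parts = iuv[..., 0].astype(int)
    u = np.clip((iuv[..., 1] * (tex_size - 1)).astype(int), 0, tex_size - 1)
    v = np.clip((iuv[..., 2] * (tex_size - 1)).astype(int), 0, tex_size - 1)

    mask = parts > 0  # ignore background pixels
    np.add.at(atlas, (parts[mask] - 1, v[mask], u[mask]), image[mask])
    np.add.at(counts, (parts[mask] - 1, v[mask], u[mask]), 1.0)

    return atlas / np.maximum(counts, 1.0)  # average overlapping pixels
```

Atlases accumulated this way from several images of the same person can be averaged into one texture and then re-rendered onto another subject's IUV map to perform the transfer.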
Efficiently retrieve and transfer textures of human appearance, replacing the expensive and laborious techniques in use today
Reduce inference time so that the architecture can be used in real-world applications
Reduce model size so that the model can feasibly run on low-resource devices
Map a character/texture onto an image so that the cost of constructing animations can be reduced for industries such as gaming and animation
The aforementioned network architecture devised by Facebook provides high accuracy, and the model is robust to scale variance, translation, overlapping instances, background clutter, and occlusions. However, it requires substantial computation even at test time; the reference implementation currently runs on a GTX 1080. This deters real-world use of the algorithm on devices like cell phones, which have comparatively little processing power and memory.
Taking these limitations into consideration, we propose changes to reduce the overall number of parameters and operations needed by the model. We replace standard convolutions with depthwise separable convolutions, which reduce the number of operations by a factor of 8-9. Furthermore, we pool early in the network, as suggested in the SqueezeNet paper, to reduce the spatial dimensions used later in the network. The proposed network (Figure 5) can be divided into the following three sub-networks:
Backbone or feature extractor
Region Proposal Network
Detector Network
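The 8-9x reduction quoted above can be checked with a quick parameter count: a standard k x k convolution needs k*k*c_in*c_out weights, while a depthwise separable convolution needs only k*k*c_in (depthwise) plus c_in*c_out (1x1 pointwise). The channel sizes below are illustrative, not taken from the actual backbone.

```python
def conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """k x k depthwise convolution followed by a 1x1 pointwise convolution."""
    return k * k * c_in + c_in * c_out

# Example layer: 256 -> 256 channels with 3x3 kernels (hypothetical sizes).
standard = conv_params(256, 256, 3)                   # 589,824 weights
separable = depthwise_separable_params(256, 256, 3)   # 67,840 weights
print(standard / separable)  # ~8.7x fewer weights, consistent with the 8-9x claim
```

The same ratio applies to multiply-accumulate operations, since each weight is applied once per output spatial location.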

Figure 4 - Inverted residual blocks
We would ideally like human pose estimators to generate a 3D model of the entire human body instead of just a few joints. Moreover, we want such an algorithm to do all of this in real time, at a decent FPS, without requiring a high-end GPU, and even on devices other than a PC such as a camera or mobile phone.
Many previous works have tried to achieve this objective. A few partially succeeded, e.g. DenseReg, which performs 3D pose estimation for faces. However, the first to achieve high accuracy while mapping 2D images to 3D models was Facebook's AI team with DensePose, a full-fledged 3D human pose estimator. The problem, however, is that it requires a high-end GPU at test time; even a GTX 1080 gives only 4-5 FPS on 800x1100 images. This means that people with low-end GPUs or devices without GPUs cannot use it.
We plan to experiment with different network architectures, loss functions and other 3D Pose Estimator components that might help us optimize these Pose Estimators to generate a 3D model of the entire body with a faster test time without the need of heavy GPUs.
Results:
DensePose is the state-of-the-art algorithm for 3D pose estimation, released in February 2018 by Facebook AI Research in collaboration with INRIA. They used DensePose-RCNN to estimate 3D human poses, trained on the hardware resources shown in Table 6.
| Architecture | Framework | Estimated Training Time | No. of GPUs (Nvidia K-80) | Optimizer | Batch Size | GPU Sharing |
| --- | --- | --- | --- | --- | --- | --- |
| DensePose-RCNN | Caffe2 | 2 Days | 8 | ADAM | 8 | Asynchronous |
| DenseMobile R-CNN | Caffe2 | 60 Days | 1 | SGD | 1 | - |
| DenseMobile R-CNN | Tensorflow | 6 Days | 2 | SGD | 2 | Asynchronous |
| DenseSqueeze R-CNN | Caffe2 | 8 Days | 2 | SGD | 2 | Asynchronous |
Table 6 - Comparison of training resources with DensePose-RCNN
We evaluate and compare the current results of our model with DensePose-RCNN and summarize them in Table 7, reporting the average precision (AP) on images from the COCO minival subset.
| Method | AP50 | AP75 |
| --- | --- | --- |
| DensePose-RCNN (ResNet-101) | 83.5 | 54.2 |
| DensePose-RCNN (ResNet-50) | 83.7 | 56.3 |
| DenseMobile-RCNN (MobileNet v2) | 55.2 | 31.1 |
| DenseSqueeze-RCNN (SqueezeNet) | 70.31 | 26.18 |
Table 7 - Per-instance evaluation of different architectures on COCO
We have successfully modeled a network with far fewer parameters and lower memory requirements than DensePose-RCNN. This has been achieved by reducing the parameters of the backbone.
Conclusion:
Reducing model size leads to a drop in accuracy, which is not always significant, but it can lead to a significant improvement in speed. We have validated this empirically and shown a direct practical application of model optimization in a texture-transfer application. Coupled with this architecture, a generative model tailored to the intended application's needs can be used to further improve results.