MaTe3D: Mask-guided Text-based 3D-aware Portrait Editing

1VCIP, CS, Nankai University, 2University of Science and Technology of China, 3Ant Group, 4Faculty of Applied Sciences, Macao Polytechnic University, 5Institute for Intelligent Computing, Alibaba Group, 6Nanjing University, 7Computer Vision Center, Universitat Autònoma de Barcelona
*This work was partly done during an internship at Alibaba Group. Corresponding authors.

Abstract

Recently, 3D-aware face editing has witnessed remarkable progress. Although current approaches successfully perform mask-guided or text-based editing, these two properties have not yet been combined in a single method. To address this limitation, we propose MaTe3D: mask-guided text-based 3D-aware portrait editing. First, we propose a new SDF-based 3D generator. To better support mask-guided editing, which mainly affects local areas, we introduce SDF and density consistency losses that jointly model global and local representations. Second, we introduce an inference-optimized editing method built on two Score Distillation Sampling (SDS) techniques: blending SDS, which overcomes the mismatch between geometry and appearance that would otherwise harm fidelity, and conditional SDS, which further produces satisfactory and stable results. Additionally, we create CatMask-HQ, a large-scale dataset of high-resolution cat face annotations. Experiments on both the FFHQ and CatMask-HQ datasets demonstrate the effectiveness of the proposed method: given a modified mask and a text prompt, our method faithfully generates the edited 3D-aware face image. Our code and models will be publicly released.
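As background, here is a minimal PyTorch sketch of the vanilla SDS gradient on which the blending and conditional variants build. The `diffusion_model` signature and the weighting w(t) are illustrative assumptions for exposition, not the released MaTe3D code.

import torch

def sds_gradient(diffusion_model, rendered_image, text_embedding, t, alphas_cumprod):
    """Return the SDS gradient w(t) * (eps_pred - eps) w.r.t. the render."""
    noise = torch.randn_like(rendered_image)
    alpha_bar = alphas_cumprod[t]
    # Forward-diffuse the current render to timestep t.
    noisy = alpha_bar.sqrt() * rendered_image + (1.0 - alpha_bar).sqrt() * noise
    with torch.no_grad():
        # Frozen text-conditioned diffusion prior predicts the added noise.
        eps_pred = diffusion_model(noisy, t, text_embedding)
    w = 1.0 - alpha_bar  # a common weighting choice
    # SDS skips the diffusion-model Jacobian: the residual is pushed straight
    # back into the renderer, e.g. rendered_image.backward(gradient=grad).
    return w * (eps_pred - noise)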

Method


Framework of MaTe3D. The MaTe3D generator (a) consists of tri-plane generation and neural rendering: tri-plane generation constructs 3D volumes of texture and shape in the tri-plane representation, and neural rendering renders the 3D-aware image and mask from the learned SDF and color fields. In the inference-optimized editing phase (b), we use a frozen generator ($G_{frz}$) and a learnable generator ($G_{opt}$), both initialized from the generator in (a). We extract 3D masks from both generators to guide the tri-plane feature fusion in (c). In addition, we propose blending SDS (d) and conditional SDS (e) to achieve mask-guided and text-based editing while maintaining consistent texture across views and producing reasonable geometry.
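To make the fusion step in (c) concrete, the following is a minimal sketch under our own naming assumptions: outside the edited region, features from the frozen generator $G_{frz}$ are kept, while inside it features from the learnable generator $G_{opt}$ take over. `mask_3d` is assumed to be a soft mask in [0, 1], broadcastable over the tri-plane feature maps; this is an illustrative linear blend, not the exact released implementation.

import torch

def fuse_triplanes(feat_frz: torch.Tensor,
                   feat_opt: torch.Tensor,
                   mask_3d: torch.Tensor) -> torch.Tensor:
    """Linear blend of frozen and optimized tri-plane features."""
    return mask_3d * feat_opt + (1.0 - mask_3d) * feat_frz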


Results



Please Stay Tuned!!!

Videos



BibTeX

@article{zhou2023mate3d,
      title     = {MaTe3D: Mask-guided Text-based 3D-aware Portrait Editing},
      author    = {Kangneng Zhou and Daiheng Gao and Xuan Wang and Jie Zhang and Peng Zhang and Xusen Sun and Longhao Zhang and Shiqi Yang and Bang Zhang and Liefeng Bo and Yaxing Wang and Ming-Ming Cheng},
      journal   = {arXiv preprint arXiv:2312.06947},
      website   = {https://MontaEllis.github.io/MaTe3D/},
      year      = {2023}
}