MatPedia: A Universal Generative Foundation for High-Fidelity Material Synthesis
CVPR 2026
-
Di Luo*
Nankai University
-
Shuhui Yang*
Tencent Hunyuan
-
Mingxin Yang*
Tencent Hunyuan
-
Jiawei Lu
Nankai University
-
Yixuan Tang
Xi'an Jiaotong University
-
Xintong Han
Tencent Hunyuan
-
Zhuo Chen
Tencent Hunyuan
-
Beibei Wang✝
Nanjing University
-
Chunchao Guo✝
Tencent Hunyuan
Abstract
Physically-based rendering (PBR) materials are fundamental to photorealistic graphics, yet their creation remains labor-intensive and requires specialized expertise. While generative models have advanced material synthesis, existing methods lack a unified representation bridging natural image appearance and PBR properties, leading to fragmented task-specific pipelines and an inability to leverage large-scale RGB image data. We present MatPedia, a foundation model built upon a novel joint RGB-PBR representation that compactly encodes materials into two interdependent latents: one for RGB appearance and one for the four PBR maps encoding complementary physical properties. By formulating them as a 5-frame sequence and employing video diffusion architectures, MatPedia naturally captures their correlations while transferring visual priors from RGB generation models. This joint representation enables a unified framework handling multiple material tasks—text-to-material generation, image-to-material generation, and intrinsic decomposition—within a single architecture. Trained on MatHybrid-410K, a mixed corpus combining PBR datasets with large-scale RGB images, MatPedia achieves native 1024×1024 synthesis that substantially surpasses existing approaches in both quality and diversity.
Pipeline
Pipeline of the proposed MatPedia framework. Left: The 3D VAE encodes a shaded RGB frame together with optional PBR maps into a joint RGB-PBR latent representation, where PBR maps are conditioned on the RGB appearance. This compact representation supports both (a) shaded RGB decoding and (b) PBR decoding at native 1024×1024 resolution. Right: The DiT, initialized from large-scale video generation models and adapted via LoRA, operates on the joint latents to perform three tasks: Text-to-PBR (generate RGB/PBR from material captions), Image-to-PBR (generate planar RGB/PBR from distorted input images), and Material Decomposition (recover PBR maps from natural images). DiT blocks integrate self-attention (SA), cross-attention (CA), and LoRA modules to enable flexible conditioning across modalities.
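The 5-frame formulation above can be illustrated with a minimal sketch. This is a hypothetical packing routine, not the paper's implementation: it assumes the shaded RGB frame and the four PBR maps (basecolor, normal, roughness, metallic) are stacked along a frame axis so a video-style backbone can attend across them, with the scalar roughness and metallic maps broadcast to three channels for a uniform layout. The function name and argument conventions are illustrative.

```python
import numpy as np

def pack_material_sequence(rgb, basecolor, normal, roughness, metallic):
    """Pack a shaded RGB frame and four PBR maps into a 5-frame sequence.

    Hypothetical sketch of the joint RGB-PBR layout described on this page:
    scalar maps (roughness, metallic) are broadcast to 3 channels so all
    five frames share one video-style tensor shape. Inputs are H x W x 3
    or H x W arrays with values in [0, 1].
    """
    def to_rgb3(m):
        # Broadcast a single-channel map to 3 channels; pass RGB through.
        return np.repeat(m[..., None], 3, axis=-1) if m.ndim == 2 else m

    frames = [rgb, basecolor, normal, to_rgb3(roughness), to_rgb3(metallic)]
    return np.stack(frames, axis=0)  # shape: (5, H, W, 3)

# Example at the paper's native 1024x1024 resolution (placeholder data).
H = W = 1024
seq = pack_material_sequence(
    rgb=np.zeros((H, W, 3)),
    basecolor=np.zeros((H, W, 3)),
    normal=np.zeros((H, W, 3)),
    roughness=np.zeros((H, W)),
    metallic=np.zeros((H, W)),
)
print(seq.shape)  # (5, 1024, 1024, 3)
```

In an actual pipeline, each frame would then be encoded by the 3D VAE into the joint latent on which the DiT operates; the frame axis here plays the role of the video time axis.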
Text-to-Material Generation
Qualitative comparison of text-conditioned PBR material generation among our method, MatFuse, ControlMat, and MaterialPicker. For each prompt, we show the generated PBR maps (Basecolor, Normal, Roughness, Metallic) followed by a rendered view under point-light illumination. We note that MatFuse generates a specular map rather than a metallic map.
Image-to-Material Generation
Qualitative comparison of image-conditioned PBR generation. For each sample, the first column shows the distorted input image (cropped from the scene), and the remaining columns present the generated material maps together with a rendering under point-light illumination. Our method produces geometrically flattened and artifact-free maps, while MatFuse shows reduced roughness fidelity and Material Palette retains geometric distortions from the input.
Material Decomposition
Qualitative comparison of material decomposition. For each sample, the first column shows the planar input image, and the remaining columns present the generated material maps together with a rendering under environment lighting. Our method produces consistent structural patterns, yielding rendered views that closely match the input appearance.
Citation
Acknowledgements
The website template was borrowed from BakedSDF.