MetaDreamer: Efficient Text-to-3D Creation With Disentangling Geometry and Texture

Abstract

Generative models for 3D object synthesis have advanced significantly by incorporating prior knowledge distilled from 2D diffusion models. Nevertheless, existing 3D synthesis methods still face challenges such as multi-view geometric inconsistency and low generation efficiency. These can be attributed to two factors: first, the lack of abundant geometric prior knowledge during optimization, and second, the entanglement between geometry and texture in conventional 3D generation methods. In response, we introduce MetaDreamer, a two-stage optimization approach that leverages rich 2D (texture) and 3D (geometry) prior knowledge. In the first stage, we focus on optimizing the geometric representation to ensure the geometric integrity and multi-view consistency of the 3D objects. In the second stage, we concentrate on fine-tuning the geometry and optimizing the texture to achieve a more refined 3D object. By leveraging 3D and 2D prior knowledge in the first and second stages, respectively, we alleviate the entanglement between geometry and texture and thus significantly improve optimization efficiency. Furthermore, we introduce non-main object suppression (NMOS) to prevent geometric collapse and propose 3D Knowledge Mining (3DKM) to improve the quality of 3D generation. MetaDreamer can generate high-quality 3D objects from textual prompts within 20 minutes, and to the best of our knowledge, it is the most efficient text-to-3D generation method. Extensive qualitative and quantitative comparative experiments demonstrate that our method outperforms state-of-the-art methods in both efficiency and the quality of the generated 3D content.

Framework

Overview of the architecture. MetaDreamer is a two-stage, coarse-to-fine optimization pipeline designed to generate 3D content from arbitrary input text. In the first stage, we optimize a rough 3D model (Instant-NGP), guided simultaneously by a reference image and a view-dependent diffusion prior model. In the second stage, we continue to refine the Instant-NGP model using a text-to-image 2D diffusion prior model. The entire optimization process takes only 20 minutes.
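The control flow described above can be pictured as a simple two-stage schedule. The following is a minimal sketch, not the authors' implementation: the `Stage` class, the function names, the guidance-term labels, and in particular the 300/1000 iteration split are all hypothetical. This page reports only a total of 1300 iterations and does not say how they are divided between the stages.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stage:
    name: str
    iterations: int
    guidance: List[str]  # labels of the priors active in this stage (illustrative)


def build_schedule(geometry_iters: int, texture_iters: int) -> List[Stage]:
    """Coarse-to-fine schedule: geometry first, then texture refinement."""
    return [
        # Stage 1: reference image + view-dependent diffusion prior.
        Stage("geometry", geometry_iters,
              ["reference_image", "view_dependent_diffusion_prior"]),
        # Stage 2: text-to-image 2D diffusion prior.
        Stage("texture", texture_iters,
              ["text_to_image_diffusion_prior"]),
    ]


def active_stage(schedule: List[Stage], step: int) -> Tuple[str, List[str]]:
    """Return (stage name, guidance labels) for a 0-based global step."""
    start = 0
    for stage in schedule:
        if step < start + stage.iterations:
            return stage.name, stage.guidance
        start += stage.iterations
    raise ValueError("step is beyond the end of the schedule")


# Hypothetical split chosen only so that the total matches 1300 iterations.
schedule = build_schedule(geometry_iters=300, texture_iters=1000)
print(active_stage(schedule, 0))
print(active_stage(schedule, 300))
```

In a real training loop, the returned labels would select which loss terms are evaluated at each step. Both stages update the same Instant-NGP representation; only the source of guidance changes.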

Comparison With SOTA

Method	Latent-NeRF	DreamFusion	Magic3D	SJC	ProlificDreamer	MetaDreamer
Time (min)	100	60	125	65	420	20
Iterations	20000	10000	20000	10000	70000	1300

Table: Comparison of training times between MetaDreamer and other text-to-3D methods. All experiments were conducted on a single NVIDIA A100 GPU. All experimental settings (number of iterations, random seeds, etc.) follow the official default settings of threestudio.
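To make the table easier to read, the short script below derives two quantities from it: how many times longer each method trains than MetaDreamer, and the average wall-clock time per iteration. The numbers are copied from the table; the helper names `time_ratio` and `seconds_per_iteration` are illustrative only.

```python
# (training time in minutes, number of iterations), copied from the table.
results = {
    "Latent-NeRF": (100, 20000),
    "DreamFusion": (60, 10000),
    "Magic3D": (125, 20000),
    "SJC": (65, 10000),
    "ProlificDreamer": (420, 70000),
    "MetaDreamer": (20, 1300),
}


def time_ratio(method: str, baseline: str = "MetaDreamer") -> float:
    """Training time of `method` divided by that of `baseline`."""
    return results[method][0] / results[baseline][0]


def seconds_per_iteration(method: str) -> float:
    """Average wall-clock seconds spent per optimization iteration."""
    minutes, iterations = results[method]
    return minutes * 60 / iterations


for name in results:
    print(f"{name:16s} {time_ratio(name):6.2f}x  "
          f"{seconds_per_iteration(name):.3f} s/iter")
```

By this arithmetic, MetaDreamer's individual iterations are slower on average than those of the other methods (about 0.92 s versus 0.36 to 0.39 s), so the shorter total time comes from needing far fewer iterations.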

A rainbow-colored umbrella

A futuristic, sleek electric car model

A flamingo scratching its neck

A pair of shiny black leather shoes

A green enameled watering can

A cherry red vintage lipstick tube

A cactus with pink flowers

A bent steel crowbar

A long woolen scarf, striped red and black

A red fire hydrant with an open sign on it

A frog with a purple toothbrush in his hand

A shiny red apple

A crumpled silver aluminum soda can

A blue motorbike has a Minnesota license plate

A bright red kite with a frayed tail

A brown and white horse is wearing a blue muzzle

A castle-shaped sandcastle

A ceramic teapot with floral patterns

A chameleon perched on a tree branch

A cobweb-covered old wooden chest

A crisp paper airplane

A donut is covered with glaze

A gleaming silver saxophone

A gold tie is tied under a brown dress shirt with stripes

A partly broken shell of a tortoise

A piece of gray luggage with travel stickers

A plush velvet armchair

A ripe watermelon sliced in half

A rustic wrought-iron candle holder

A shimmering emerald pendant necklace

A sleek, black top hat

A sleek stainless steel teapot

A small porcelain white rabbit figurine

A sparkling crystal chandelier

A sparkling diamond tiara

A steaming mug of hot chocolate with whipped cream

A sturdy mahogany walking cane

A vintage porcelain doll with a frilly dress

An antique glass perfume bottle

An antique wooden rocking horse

An intricate ceramic vase with peonies painted on it

An intricately-carved wooden chess set

An orange motorcycle is shown at close range

Crisp, folded origami paper

Comparison of Results Under the Same Training Time

We compared the results generated by each method given the same training time (20 minutes). Our method generates high-quality 3D objects, while the other methods produce only blurry outlines.

Two-stage optimization

(a) Geometry Stage (b) Texture Stage

We present the optimization results of the two stages (360° renderings and normal maps). The left two columns show the results of the first stage, and the right two columns show the results after fine-tuning in the second stage.

Appearance control

(a) W/O Control Image (b) Control Image (c) W/ Control Image

A broccoli

A donut is covered with glaze

A Halloween-themed tree

Ablation Study of NMOS

In the second stage, the 2D diffusion prior provides only RGB constraints and lacks a 3D prior. NMOS effectively suppresses the diffusion of noise, thereby preventing geometric collapse.

(a) w/ NMOS (b) w/o NMOS

Ablation Study of 3DKM

From the experimental results, it is evident that 3DKM plays a crucial role in generating higher-quality textures.

(a) w/ 3DKM (b) w/o 3DKM

Comparison With Zero123

Zero123 takes 7 minutes to run 500 iterations on one A100, while MetaDreamer has a total runtime of 20 minutes. Overall, MetaDreamer excels in both geometry and texture.

(a) Zero123 (b) MetaDreamer