Text-to-3D Model Locally: No API Key, No Cloud, No Cost
Generate 3D meshes from text prompts entirely on your local GPU using OpenAI's Shap-E model. Export to OBJ, GLB, STL — no API key needed, and no internet connection after the one-time weight download.
You can generate a 3D model from a text prompt right now, on your own machine, with no API key and no subscription. OpenAI's Shap-E model runs fully locally — it produces textured 3D meshes from text or images in under 60 seconds on a consumer GPU.
Here's a complete walkthrough: installation, code, export formats, and tricks to improve output quality.
What Is Shap-E?
Shap-E (released by OpenAI Research in 2023) is a conditional generative model for 3D assets. Unlike NeRF-based approaches that take hours to converge on a single object, Shap-E is a feed-forward model — it infers the 3D representation in a single pass through the network.
Under the hood, it generates implicit neural representations (encoded as MLP weights) and converts them to explicit meshes. Two generation modes:
- Text-to-3D: prompt → 3D mesh
- Image-to-3D: single RGB image → 3D mesh
Output formats: PLY point clouds, OBJ meshes, or SDF grids you can march to a mesh.
Hardware Requirements
| Setup | Generation Time | Notes |
|---|---|---|
| RTX 3090 (24GB) | ~15–25 seconds | Ideal |
| RTX 3060 (12GB) | ~30–45 seconds | Works great |
| RTX 3070 (8GB) | ~45–60 seconds | Use fp16 |
| CPU only | 5–15 minutes | Slow but functional |
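If you're not sure which row you fall into, you can pick the device and precision at runtime. A minimal sketch; the 10GB cutoff is our own rule of thumb, not a Shap-E requirement:

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Use fp16 on GPUs with limited VRAM; stick to fp32 on CPU
if device.type == 'cuda':
    vram_gb = torch.cuda.get_device_properties(device).total_memory / 1e9
    use_fp16 = vram_gb < 10
else:
    use_fp16 = False

print(f"Device: {device}, fp16: {use_fp16}")

You can pass the resulting use_fp16 flag straight into the sample_latents calls below.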
Installation
# Clone and install Shap-E
git clone https://github.com/openai/shap-e.git
cd shap-e
pip install -e .
# Also install for mesh export
pip install trimesh open3d
The first run will download the model weights (~2.6GB) automatically.
Text-to-3D: Basic Generation
import torch
from shap_e.diffusion.sample import sample_latents
from shap_e.diffusion.gaussian_diffusion import diffusion_from_config
from shap_e.models.download import load_model, load_config
from shap_e.util.notebooks import create_pan_cameras, decode_latent_images
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load models (downloads ~2.6GB on first run)
xm = load_model('transmitter', device=device)
model = load_model('text300M', device=device)
diffusion = diffusion_from_config(load_config('diffusion'))
# Generate 3D latent from text prompt
batch_size = 1
guidance_scale = 15.0 # Higher = closer to prompt, less diverse
latents = sample_latents(
    batch_size=batch_size,
    model=model,
    diffusion=diffusion,
    guidance_scale=guidance_scale,
    model_kwargs=dict(texts=['a red leather armchair'] * batch_size),
    progress=True,
    clip_denoised=True,
    use_fp16=True,    # Save VRAM
    use_karras=True,  # Better sampling schedule
    karras_steps=64,  # More steps = better quality (64 is a good default)
    sigma_min=1e-3,
    sigma_max=160,
    s_churn=0,
)
print(f"Generated {len(latents)} latent(s)")
Exporting to OBJ / GLB
from shap_e.util.notebooks import decode_latent_mesh
import numpy as np
import os
import trimesh

def export_3d_model(latent, output_path: str, format: str = "obj"):
    """
    Export a Shap-E latent to a 3D file.
    format: 'obj', 'glb', 'stl', 'ply'
    """
    # Decode latent to a triangle mesh
    t = decode_latent_mesh(xm, latent).tri_mesh()

    # Build a trimesh object; the per-channel vertex colors must be
    # stacked into a single (N, 3) array
    mesh = trimesh.Trimesh(
        vertices=t.verts,
        faces=t.faces,
        vertex_colors=np.stack(
            [t.vertex_channels['R'], t.vertex_channels['G'], t.vertex_channels['B']],
            axis=1,
        ),
    )

    # Fix common mesh issues
    mesh.fix_normals()
    mesh.fill_holes()

    # Export — trimesh infers the format from the file extension
    os.makedirs(os.path.dirname(output_path) or ".", exist_ok=True)
    mesh.export(f"{output_path}.{format}")
    print(f"Exported to: {output_path}.{format}")
    return mesh
# Export the first generated model
mesh = export_3d_model(latents[0], "./output/armchair", format="glb")
print(f"Mesh stats: {len(mesh.vertices)} vertices, {len(mesh.faces)} faces")
Image-to-3D Generation
Shap-E can also reconstruct 3D from a single photo:
from PIL import Image
from shap_e.util.image_util import load_image
# Load image (works best with clean product photos or isolated objects)
image = load_image("product_photo.png")
# Use image model instead of text model
image_model = load_model('image300M', device=device)
latents = sample_latents(
    batch_size=1,
    model=image_model,
    diffusion=diffusion,
    guidance_scale=3.0,  # Lower guidance for image conditioning
    model_kwargs=dict(images=[image]),
    progress=True,
    clip_denoised=True,
    use_fp16=True,
    use_karras=True,
    karras_steps=64,
    sigma_min=1e-3,
    sigma_max=160,
    s_churn=0,
)
export_3d_model(latents[0], "./output/from_photo", format="glb")
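Image conditioning degrades noticeably with busy backgrounds. One easy fix is to strip the background first with the third-party rembg package (pip install rembg; our suggestion, not part of Shap-E). A sketch, assuming a raw photo named product_photo_raw.png:

from rembg import remove
from PIL import Image

raw = Image.open("product_photo_raw.png")
cutout = remove(raw)  # RGBA image with the background made transparent

# Composite onto plain white before handing it to Shap-E's load_image
background = Image.new("RGBA", cutout.size, (255, 255, 255, 255))
Image.alpha_composite(background, cutout).convert("RGB").save("product_photo.png")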
Batch Generation Pipeline
prompts = [
    "a wooden coffee table",
    "a sci-fi helmet with glowing visor",
    "a ceramic coffee mug",
    "a park bench with armrests",
    "a game controller",
]

results = []
for i, prompt in enumerate(prompts):
    print(f"\n[{i+1}/{len(prompts)}] Generating: {prompt}")
    latents = sample_latents(
        batch_size=1,
        model=model,
        diffusion=diffusion,
        guidance_scale=15.0,
        model_kwargs=dict(texts=[prompt]),
        progress=False,
        clip_denoised=True,
        use_fp16=True,
        use_karras=True,
        karras_steps=64,
        sigma_min=1e-3,
        sigma_max=160,
        s_churn=0,
    )
    filename = f"./output/{prompt.replace(' ', '_')[:40]}"
    mesh = export_3d_model(latents[0], filename, format="glb")
    results.append({"prompt": prompt, "file": filename + ".glb", "faces": len(mesh.faces)})

    # Clear VRAM between generations
    torch.cuda.empty_cache()

print("\nBatch complete!")
for r in results:
    print(f"  ✓ {r['prompt']} → {r['faces']} faces")
Tips for Better Output Quality
Prompt engineering for 3D:
- Be specific about materials: "an oak dining chair" > "a chair"
- Include scale hints: "a small ceramic vase" — helps with proportions
- Reference real object categories the model knows well (furniture, vehicles, tools, food)
- Avoid abstract concepts — "happiness" generates garbage; "a golden trophy cup" works well
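Output quality also varies run to run, so a cheap trick is to sample several candidates of one prompt in a single batch and keep the best by eye. A sketch reusing the text model and export helper from earlier:

# Sample 4 candidates of one prompt in a single batch
n_candidates = 4
latents = sample_latents(
    batch_size=n_candidates,
    model=model,
    diffusion=diffusion,
    guidance_scale=15.0,
    model_kwargs=dict(texts=['a golden trophy cup'] * n_candidates),
    progress=True,
    clip_denoised=True,
    use_fp16=True,
    use_karras=True,
    karras_steps=64,
    sigma_min=1e-3,
    sigma_max=160,
    s_churn=0,
)

# Export every candidate, inspect them, and keep the best
for i, latent in enumerate(latents):
    export_3d_model(latent, f"./output/trophy_candidate_{i}", format="glb")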
Quality settings:
# Higher guidance = more literal, less creative
guidance_scale = 15.0 # Default — good balance
guidance_scale = 20.0 # Tighter adherence to prompt
guidance_scale = 10.0 # More variation/creativity
# More diffusion steps = better geometry
karras_steps = 64 # Fast, decent quality
karras_steps = 128 # ~2x slower, noticeably better edges
Post-process in Blender:
# After export, you can decimate heavy meshes programmatically.
# Note: these operator names are for Blender 3.x; Blender 4.x renamed
# them to bpy.ops.wm.obj_import / bpy.ops.wm.obj_export.
import bpy

bpy.ops.import_scene.obj(filepath="./output/armchair.obj")
# Make the imported mesh the active object so the modifier applies to it
bpy.context.view_layer.objects.active = bpy.context.selected_objects[0]
bpy.ops.object.modifier_add(type='DECIMATE')
bpy.context.object.modifiers["Decimate"].ratio = 0.5  # keep 50% of the faces
bpy.ops.object.modifier_apply(modifier="Decimate")
bpy.ops.export_scene.obj(filepath="./output/armchair_optimized.obj")
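To run this without opening the Blender UI, save the snippet as a standalone script and launch it headlessly: blender --background --python decimate_meshes.py (the script name here is just an example).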
Limitations to Know
- Shap-E excels at furniture, vehicles, food, and common objects
- Complex articulated characters (humans, animals) are rough — use it as a base mesh
- Texture quality is moderate; for game-ready assets you'll want to re-bake textures
- Max resolution is determined by the SDF grid size (~128³) — fine for background props
The 3D Viewport Workspace wraps Shap-E and 12 other 3D methods into a production-ready toolkit: image-to-3D, model optimization, LOD chain generation, format conversion, VR/AR prep, and collision mesh generation — all runnable from a clean Python API.