2026-03-22

Text-to-3D Model Locally: No API Key, No Cloud, No Cost

Generate 3D meshes from text prompts entirely on your local GPU using OpenAI's Shap-E model. Export to OBJ, GLB, STL — no API key or internet connection required.

You can generate a 3D model from a text prompt right now, on your own machine, with no API key and no subscription. OpenAI's Shap-E model runs fully locally — it produces textured 3D meshes from text or images in under 60 seconds on consumer hardware.

Here's a complete walkthrough: installation, code, export formats, and tricks to improve output quality.

What Is Shap-E?

Shap-E (released by OpenAI Research in 2023) is a conditional generative model for 3D assets. Unlike NeRF-based approaches that take hours to converge on a single object, Shap-E is a feed-forward model — it infers the 3D representation in a single pass through the network.

Under the hood, it generates implicit neural representations (encoded as MLP weights) and converts them to explicit meshes. Two generation modes:

  • Text-to-3D: prompt → 3D mesh
  • Image-to-3D: single RGB image → 3D mesh

Output formats: PLY point clouds, OBJ meshes, or SDF grids you can march to a mesh.
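
The implicit-to-explicit step is easier to picture with a toy example. The sketch below (numpy only, not Shap-E code) evaluates an analytic sphere SDF on a voxel grid and counts the cells where the sign flips; those are exactly the cells a marching-cubes pass would turn into triangles.

```python
import numpy as np

def sphere_sdf(points, radius=0.5):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(points, axis=-1) - radius

# Sample the implicit function on a coarse voxel grid over [-1, 1]^3
n = 32
axis = np.linspace(-1, 1, n)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)
sdf = sphere_sdf(grid)  # shape (n, n, n)

# Surface cells: voxels whose 8 corners disagree in sign
inside = sdf < 0
corner_sum = sum(
    inside[dx:n - 1 + dx, dy:n - 1 + dy, dz:n - 1 + dz].astype(int)
    for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)
)
surface_cells = (corner_sum > 0) & (corner_sum < 8)
print(f"{surface_cells.sum()} of {(n - 1) ** 3} cells cross the surface")
```

Shap-E's decoder does the same thing at higher resolution with a learned SDF instead of an analytic one.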

Hardware Requirements

| Setup | Generation Time | Notes |
|---|---|---|
| RTX 3090 (24GB) | ~15–25 seconds | Ideal |
| RTX 3060 (12GB) | ~30–45 seconds | Works great |
| RTX 3070 (8GB) | ~45–60 seconds | Use fp16 |
| CPU only | 5–15 minutes | Slow but functional |
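
For a rough sense of why fp16 matters on 8GB cards: assuming the `text300M` name reflects roughly 300M parameters (an assumption on my part; the transmitter's weights and activations come on top of this), halving bytes per parameter halves the weight footprint:

```python
def weight_footprint_gb(params_millions: float, bytes_per_param: int) -> float:
    """Memory for model weights alone, in GB (ignores activations and CUDA overhead)."""
    return params_millions * 1e6 * bytes_per_param / 1e9

fp32 = weight_footprint_gb(300, 4)  # float32
fp16 = weight_footprint_gb(300, 2)  # float16
print(f"text300M weights: ~{fp32:.1f} GB in fp32 vs ~{fp16:.1f} GB in fp16")
```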

Installation

# Clone and install Shap-E
git clone https://github.com/openai/shap-e.git
cd shap-e
pip install -e .

# Also install for mesh export
pip install trimesh open3d

The first run will download the model weights (~2.6GB) automatically.
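
Since that first run pulls several gigabytes, a quick stdlib pre-flight check can save a failed download (the 2.6GB figure is from above; torch is optional here so the script also runs before installation):

```python
import shutil

REQUIRED_GB = 2.6  # approximate first-run weight download

free_gb = shutil.disk_usage(".").free / 1e9
try:
    import torch
    has_cuda = torch.cuda.is_available()
except ImportError:
    has_cuda = False  # torch not installed yet

print(f"Free disk: {free_gb:.1f} GB "
      f"({'ok' if free_gb > REQUIRED_GB else 'not enough for the weights'})")
print(f"CUDA available: {has_cuda}")
```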

Text-to-3D: Basic Generation

import torch
from shap_e.diffusion.sample import sample_latents
from shap_e.diffusion.gaussian_diffusion import diffusion_from_config
from shap_e.models.download import load_model, load_config

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load models (downloads ~2.6GB on first run)
xm = load_model('transmitter', device=device)
model = load_model('text300M', device=device)
diffusion = diffusion_from_config(load_config('diffusion'))

# Generate 3D latent from text prompt
batch_size = 1
guidance_scale = 15.0  # Higher = closer to prompt, less diverse

latents = sample_latents(
    batch_size=batch_size,
    model=model,
    diffusion=diffusion,
    guidance_scale=guidance_scale,
    model_kwargs=dict(texts=['a red leather armchair'] * batch_size),
    progress=True,
    clip_denoised=True,
    use_fp16=True,   # Save VRAM
    use_karras=True, # Better sampling schedule
    karras_steps=64, # More steps = better quality (64 is a good default)
    sigma_min=1e-3,
    sigma_max=160,
    s_churn=0,
)

print(f"Generated {len(latents)} latent(s)")

Exporting to OBJ / GLB

from shap_e.util.notebooks import decode_latent_mesh
import numpy as np
import trimesh

def export_3d_model(latent, output_path: str, format: str = "obj"):
    """
    Export a Shap-E latent to a 3D file.
    format: 'obj', 'glb', 'stl', 'ply'
    """
    # Decode latent to mesh
    t = decode_latent_mesh(xm, latent).tri_mesh()
    
    # Build trimesh object; stack the per-channel colors into the (n, 3)
    # array trimesh expects (a tuple of three 1-D arrays would be read
    # as shape (3, n))
    mesh = trimesh.Trimesh(
        vertices=t.verts,
        faces=t.faces,
        vertex_colors=np.stack([t.vertex_channels['R'],
                                t.vertex_channels['G'],
                                t.vertex_channels['B']], axis=1)
    )
    
    # Fix mesh issues
    mesh.fix_normals()
    mesh.fill_holes()
    
    # Export; trimesh picks the writer from the file extension
    mesh.export(f"{output_path}.{format}")
    
    print(f"Exported to: {output_path}.{format}")
    return mesh

# Export the first generated model
mesh = export_3d_model(latents[0], "./output/armchair", format="glb")
print(f"Mesh stats: {len(mesh.vertices)} vertices, {len(mesh.faces)} faces")
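
Generated meshes aren't always clean even after `fill_holes()`. One cheap sanity check that needs no extra libraries is the Euler characteristic: for a watertight triangle mesh without holes (topologically a sphere), V − E + F = 2. The helper below (my own addition) works on raw vertex counts and face lists, so you can feed it `len(mesh.vertices)` and `mesh.faces` from the export step:

```python
def euler_characteristic(num_vertices: int, faces) -> int:
    """V - E + F for a triangle mesh; 2 for a closed, genus-0 surface."""
    edges = set()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edges.add((min(u, v), max(u, v)))  # undirected, deduplicated
    return num_vertices - len(edges) + len(faces)

# A tetrahedron: 4 vertices, 6 edges, 4 faces -> 4 - 6 + 4 = 2
tet_faces = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
print(euler_characteristic(4, tet_faces))  # → 2
```

A value other than 2 doesn't automatically mean a broken export, but it's a hint to inspect the mesh before 3D printing or game import.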

Image-to-3D Generation

Shap-E can also reconstruct 3D from a single photo:

from PIL import Image
from shap_e.util.image_util import load_image

# Load image (works best with clean product photos or isolated objects)
image = load_image("product_photo.png")

# Use image model instead of text model
image_model = load_model('image300M', device=device)

latents = sample_latents(
    batch_size=1,
    model=image_model,
    diffusion=diffusion,
    guidance_scale=3.0,  # Lower guidance for image conditioning
    model_kwargs=dict(images=[image]),
    progress=True,
    clip_denoised=True,
    use_fp16=True,
    use_karras=True,
    karras_steps=64,
    sigma_min=1e-3,
    sigma_max=160,
    s_churn=0,
)

export_3d_model(latents[0], "./output/from_photo", format="glb")

Batch Generation Pipeline

prompts = [
    "a wooden coffee table",
    "a sci-fi helmet with glowing visor",
    "a ceramic coffee mug",
    "a park bench with armrests",
    "a game controller",
]

results = []
for i, prompt in enumerate(prompts):
    print(f"\n[{i+1}/{len(prompts)}] Generating: {prompt}")
    
    latents = sample_latents(
        batch_size=1,
        model=model,
        diffusion=diffusion,
        guidance_scale=15.0,
        model_kwargs=dict(texts=[prompt]),
        progress=False,
        clip_denoised=True,
        use_fp16=True,
        use_karras=True,
        karras_steps=64,
        sigma_min=1e-3,
        sigma_max=160,
        s_churn=0,
    )
    
    filename = f"./output/{prompt.replace(' ', '_')[:40]}"
    mesh = export_3d_model(latents[0], filename, format="glb")
    results.append({"prompt": prompt, "file": filename + ".glb", "faces": len(mesh.faces)})
    
    # Clear VRAM between generations
    torch.cuda.empty_cache()

print("\nBatch complete!")
for r in results:
    print(f"  ✓ {r['prompt']} → {r['faces']} faces")
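
The `prompt.replace(' ', '_')` filename scheme above breaks on prompts containing slashes or punctuation. A defensive slug helper (my own addition) is worth the extra lines:

```python
import re

def slugify(prompt: str, max_len: int = 40) -> str:
    """Turn a prompt into a safe filename stem."""
    slug = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")
    return slug[:max_len]

print(slugify("a sci-fi helmet with glowing visor"))  # → a_sci_fi_helmet_with_glowing_visor
```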

Tips for Better Output Quality

Prompt engineering for 3D:

  • Be specific about materials: "an oak dining chair" > "a chair"
  • Include scale hints: "a small ceramic vase" — helps with proportions
  • Reference real object categories the model knows well (furniture, vehicles, tools, food)
  • Avoid abstract concepts — "happiness" generates garbage; "a golden trophy cup" works well
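
The material/scale/category advice composes naturally into a tiny prompt builder for sweeping variations of one object (illustrative only):

```python
def build_prompts(obj: str, materials, scale: str = "") -> list:
    """Combine one object with several materials and an optional scale hint."""
    prefix = f"a {scale} " if scale else "a "
    return [f"{prefix}{m} {obj}" for m in materials]

prompts = build_prompts("dining chair", ["oak", "steel", "red leather"], scale="small")
print(prompts)
# → ['a small oak dining chair', 'a small steel dining chair', 'a small red leather dining chair']
```

Feed the result straight into the batch pipeline above.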

Quality settings:

# Higher guidance = more literal, less creative
guidance_scale = 15.0   # Default — good balance
guidance_scale = 20.0   # Tighter adherence to prompt
guidance_scale = 10.0   # More variation/creativity

# More diffusion steps = better geometry
karras_steps = 64   # Fast, decent quality
karras_steps = 128  # ~2x slower, noticeably better edges
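
For intuition on what `guidance_scale` actually does: Shap-E's sampler uses classifier-free guidance, which extrapolates from the unconditional prediction toward the text-conditioned one. A toy numpy version of the update rule (the real model applies this to denoising predictions at every diffusion step, not to raw vectors):

```python
import numpy as np

def cfg(uncond: np.ndarray, cond: np.ndarray, scale: float) -> np.ndarray:
    """Classifier-free guidance: move past `cond` along the cond - uncond direction."""
    return uncond + scale * (cond - uncond)

uncond = np.array([0.0, 0.0])
cond = np.array([1.0, 2.0])
print(cfg(uncond, cond, 1.0))   # scale 1: exactly the conditional prediction
print(cfg(uncond, cond, 15.0))  # scale 15: pushed far past it -> more literal, less diverse
```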

Post-process in Blender:

# After export, you can decimate heavy meshes programmatically
# (Blender 4.x operators; on Blender 3.x use bpy.ops.import_scene.obj
# and bpy.ops.export_scene.obj instead)
import bpy
bpy.ops.wm.obj_import(filepath="./output/armchair.obj")
bpy.ops.object.modifier_add(type='DECIMATE')
bpy.context.object.modifiers["Decimate"].ratio = 0.5  # keep 50% of faces
bpy.ops.object.modifier_apply(modifier="Decimate")
bpy.ops.wm.obj_export(filepath="./output/armchair_optimized.obj")

Limitations to Know

  • Shap-E excels at furniture, vehicles, food, and common objects
  • Complex articulated characters (humans, animals) are rough — use it as a base mesh
  • Texture quality is moderate; for game-ready assets you'll want to re-bake textures
  • Max resolution is determined by the SDF grid size (~128³) — fine for background props
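
The ~128³ figure translates directly into a minimum feature size. Assuming the SDF grid spans the object's bounding box uniformly (my assumption; the exact extent depends on the decoder), each voxel covers about 1/128 of the object:

```python
def min_feature_mm(object_size_mm: float, grid_resolution: int = 128) -> float:
    """Smallest detail a uniform SDF grid can represent: roughly one voxel."""
    return object_size_mm / grid_resolution

# A 300mm-tall prop sampled at 128^3:
print(f"~{min_feature_mm(300):.1f} mm per voxel")  # → ~2.3 mm per voxel
```

Anything thinner than a voxel (thin rims, wires, fine engravings) will be merged or dropped, which is why these meshes work better as props than as hero assets.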

The 3D Viewport Workspace wraps Shap-E and 12 other 3D methods into a production-ready toolkit: image-to-3D, model optimization, LOD chain generation, format conversion, VR/AR prep, and collision mesh generation — all runnable from a clean Python API.

→ Get 3D Viewport Workspace on the Shop