A build log for setting up SD1.5 + ControlNet on a GTX 1660 SUPER (6 GB VRAM). The three real roadblocks were not VRAM shortage but: RAM spiking to ~26 GB causing OOM, fp16 producing all-black NaN images, and a forced math SDPA path causing a 3.2× slowdown. A fourth surprise — every benchmark had been measured while spilling to host RAM — appeared later. Full breakdown from cause to fix.
What This Article Covers
- An SD1.5 + ControlNet configuration that works on a 6 GB VRAM GPU
- The fp16 NaN issue on GTX 16xx (TU116) and how to work around it
- Why sequential_cpu_offload eats RAM instead of VRAM
- Final benchmarks: Hyper-SD15 1-step ~5 s/image, full 25-step ~1 min/image
What I Used
- MSI GeForce GTX 1660 SUPER GAMING X 6GB — the GPU used here
- CFD DDR4-3200 32 GB desktop RAM — 32 GB or more recommended after hitting the OOM wall
- WSL2 (Ubuntu 22.04), Python 3.12, diffusers 0.38
- Realistic Vision V6.0 B1 (noVAE) checkpoint
- ControlNet (OpenPose / Canny / Depth) + LCM-LoRA / Hyper-SD15
Background: SDXL ControlNet Was Abandoned in 1 Hour 13 Minutes
The original plan was to implement ControlNet on an existing SDXL-based environment. On 2026-04-30 at 21:40 the implementation finished. By 22:53 it had been reverted. One hour, thirteen minutes.
RAM was the cause. With SDXL and enable_sequential_cpu_offload, VRAM peaks at ~4 GB — but the entire model lives in RAM. Adding ControlNet (fp32, ~5 GB) on top:
| Configuration | VRAM peak | RAM resident | Result |
|---|---|---|---|
| SDXL fp32 + FaceID | ~4 GB | ~21 GB | ✅ (24 GB system) |
| + ControlNet added | ~4 GB (unchanged) | ~26 GB | ❌ OOM |
VRAM was fine. ControlNet's cost shows up on the RAM side. The confusion — "VRAM is within limits, so why OOM?" — only resolves once you understand how sequential_cpu_offload works.
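The arithmetic behind that table can be checked with a few lines. This is only a sketch: the sizes are the approximate figures quoted above, and the function name and the idea of simply summing resident sizes are my own illustration, not a measurement tool.

```python
def ram_headroom_gb(system_ram_gb, resident_gb):
    """Return free RAM (GB) once every resident component is loaded.

    A negative value means the configuration does not fit and will OOM.
    """
    return system_ram_gb - sum(resident_gb.values())


# Approximate figures quoted in this article (not measured by this code).
sdxl_faceid = {"SDXL fp32 + FaceID": 21.0}
with_controlnet = {**sdxl_faceid, "ControlNet fp32": 5.0}

print(ram_headroom_gb(24.0, sdxl_faceid))      # 3.0 GB of headroom
print(ram_headroom_gb(24.0, with_controlnet))  # -2.0 GB: over budget
```

VRAM never appears in this calculation, which is why watching the VRAM peak alone gave no warning.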
How sequential_cpu_offload Works
diffusers' enable_sequential_cpu_offload iterates through ~700 UNet submodules on every step: each one is transferred to VRAM, run, and moved back to RAM. The full model lives in RAM throughout. That is ~700 CPU↔GPU round trips per step, with the Python GIL pinning a single core at 100%.
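To make the cost concrete, here is a toy model of the behaviour described above. It involves no GPU and no diffusers code: the function name is hypothetical, and the 700-submodule count is the approximate figure from this article.

```python
def simulate_sequential_offload(num_submodules, num_steps):
    """Toy model of per-submodule offloading (no real GPU involved).

    Every submodule starts in RAM, is copied to the GPU for its forward
    pass, then is moved back. Returns the number of round trips and the
    largest number of submodules that sat on the GPU at the same time.
    """
    on_gpu = set()
    round_trips = 0
    max_on_gpu = 0
    for _ in range(num_steps):
        for idx in range(num_submodules):
            on_gpu.add(idx)        # RAM -> VRAM
            round_trips += 1
            max_on_gpu = max(max_on_gpu, len(on_gpu))
            # ... the forward pass of this submodule would run here ...
            on_gpu.discard(idx)    # VRAM -> RAM
    return round_trips, max_on_gpu


round_trips, max_on_gpu = simulate_sequential_offload(700, 25)
print(round_trips)  # 17500 round trips for one 25-step image
print(max_on_gpu)   # 1: only one submodule occupies VRAM at a time
```

The second number explains the low VRAM peak, and the first explains the slowness: the GPU spends most of its time waiting on transfers.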
SD1.5 weighs ~4 GB (fp32) — it fits in 6 GB VRAM, making sequential_cpu_offload unnecessary. That was the motivation for building a separate SD1.5 environment.
Three Problems That Blocked Progress
① fp16 Produces All-Black NaN Images (Known GTX 16xx Issue)
Five configurations were benchmarked:
| Configuration | Step time | Output |
|---|---|---|
| T1: SD1.5 fp16, no CN, math SDPA | 15.8 s/step | All-black 1224 bytes (NaN) |
| T3: + ControlNet ×1, math SDPA | 22.6 s/step | NaN |
| T4: + flash + mem-efficient SDPA | 7.1 s/step | NaN (faster, but still broken) |
| Switch to fp32 (default SDPA) | see below | ✅ Clean output |
Changing the SDPA kernel did not fix the NaN. The root cause: the GTX 1660 SUPER (TU116) has no Tensor Cores, causing underflow/overflow in fp16 softmax inside the attention layers. This is the same issue behind AUTOMATIC1111's --no-half flag for GTX 16xx cards. Switching to fp32 fixed it immediately.
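The overflow mechanism itself is easy to see on the CPU. The sketch below rounds every intermediate value to half precision with the standard library's struct module and runs a softmax without max-subtraction. It is a toy illustration of fp16's narrow range, not a reproduction of what the TU116 kernels do (real kernels normally stabilise softmax), and the function names are my own.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # Beyond the fp16 maximum (65504): saturate to +/- infinity.
        return math.copysign(math.inf, x)


def naive_softmax(logits, quantize=to_fp16):
    """Softmax without max-subtraction, rounding every intermediate."""
    exps = [quantize(math.exp(v)) for v in logits]
    total = quantize(sum(exps))
    return [quantize(e / total) for e in exps]


# exp(12) is about 162755, which exceeds 65504 and becomes inf in fp16.
half = naive_softmax([12.0, 0.0])
full = naive_softmax([12.0, 0.0], quantize=lambda v: v)

print(math.isnan(half[0]))  # True: inf / inf produced a NaN
print(math.isnan(full[0]))  # False: the same input is fine without rounding
```

One NaN in an attention map propagates through every later layer, which is consistent with the all-black output in the table.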
② Forced math SDPA Was Causing a 3.2× Slowdown
The gap between T3 (22.6 s/step) and T4 (7.1 s/step) traced back to legacy code copied from another environment:
# Carried over from another env (written as an fp16 NaN workaround)
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
Removing these lines and letting PyTorch auto-select the SDPA kernel gives the 3.2× improvement. Note: flash SDPA requires fp16/bf16, so after switching to fp32 the math kernel is used anyway — the correct combination is "remove the forced lines + use fp32."
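If the forcing lines live somewhere that is hard to delete (a shared module, for example), an alternative is to switch all three kernels back on explicitly. The helper below is a sketch with a hypothetical name; it takes the backend object as a parameter so that it can be exercised without a GPU, and it relies only on the enable_*_sdp and *_sdp_enabled functions of torch.backends.cuda.

```python
def restore_default_sdpa(cuda_backends):
    """Re-enable all three SDPA kernels so PyTorch can choose per call.

    `cuda_backends` is expected to be `torch.backends.cuda` (or any
    object exposing the same functions). Returns the resulting flags.
    """
    cuda_backends.enable_flash_sdp(True)
    cuda_backends.enable_mem_efficient_sdp(True)
    cuda_backends.enable_math_sdp(True)
    return {
        "flash": cuda_backends.flash_sdp_enabled(),
        "mem_efficient": cuda_backends.mem_efficient_sdp_enabled(),
        "math": cuda_backends.math_sdp_enabled(),
    }
```

With PyTorch installed, the call would be restore_default_sdpa(torch.backends.cuda) after import torch. Enabling a kernel only makes it eligible; PyTorch still decides per call which one it can actually use.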
③ All Benchmarks Were Measured While Spilling (Discovered Later)
After the environment was set up, adding vram_peak_mb metadata to generated images revealed that fp32 + 1 ControlNet peaks at 6,726 MB VRAM. The WSL2 effective limit on this card is ~6,100 MB.
Everything had been spilling to host RAM. Every earlier number (13.7 s/step, full 25-step in 5 min 55 s) was measured in a spill-to-RAM state.
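A check like the following would have flagged the problem on day one. It is a minimal sketch: the 6,100 MB limit is the approximate figure quoted above, the function name is hypothetical, and how vram_peak_mb gets measured is left outside the sketch.

```python
WSL2_EFFECTIVE_LIMIT_MB = 6100  # approximate figure quoted in this article


def spill_mb(vram_peak_mb, limit_mb=WSL2_EFFECTIVE_LIMIT_MB):
    """Return how many MB exceed the usable VRAM (0 if the run fits)."""
    return max(0, vram_peak_mb - limit_mb)


print(spill_mb(6726))  # 626: fp32 + 1 ControlNet without offload spills
print(spill_mb(5298))  # 0: a 5,298 MB peak fits
```

Anything above zero means the benchmark is measuring host-RAM traffic rather than the GPU.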
The fix: enable_model_cpu_offload (model-granularity, ~5–10 CPU↔GPU swaps per step):
| Configuration | --offload none (spilling) | --offload model (current default) | Speedup |
|---|---|---|---|
| Full 25-step + 1 CN | 584.7 s / 6,726 MB VRAM | 61.5 s / 5,298 MB VRAM | 9.5× |
| Hyper-SD15 4-step + 1 CN | 62.9 s / 6,680 MB VRAM | 11.1 s / 5,146 MB VRAM | 5.7× |
| Hyper-SD15 1-step | 32.4 s / 6,680 MB VRAM | 4.9 s / 5,106 MB VRAM | 6.6× |
| Multi-CN (×2) 25-step | 884.6 s / 8,124 MB VRAM | 680.1 s / 6,694 MB VRAM | 1.3× (still spilling) |
With 1 CN, VRAM drops to 5.1–5.3 GB — no spill — yielding the 9.5× improvement. Multi-CN (×2) still peaks at 6.7 GB even with --offload model, so the gain is limited.
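For reference, a flag like --offload could map onto diffusers calls roughly as follows. This is a sketch under assumptions: the function name is hypothetical, the "sequential" branch is added only for comparison, and my actual script may wire things differently. The three pipeline methods themselves (to, enable_model_cpu_offload, enable_sequential_cpu_offload) are standard diffusers pipeline methods.

```python
def apply_offload(pipe, mode):
    """Place a diffusers pipeline according to an offload mode.

    "none"       -> keep everything on the GPU (spills if it does not fit)
    "model"      -> swap whole models (UNet, VAE, ...) in and out of VRAM
    "sequential" -> swap individual submodules (lowest VRAM, slowest)
    """
    if mode == "none":
        pipe.to("cuda")
    elif mode == "model":
        pipe.enable_model_cpu_offload()
    elif mode == "sequential":
        pipe.enable_sequential_cpu_offload()
    else:
        raise ValueError(f"unknown offload mode: {mode!r}")
    return pipe
```

The offload methods manage device placement themselves, so the sketch does not combine them with an explicit move to the GPU.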
Low-Step Distillation for SD1.5: What Replaces SDXL Lightning?
SDXL Lightning has no SD1.5 equivalent — ByteDance only released it for SDXL. SD1.5 alternatives:
- LCM-LoRA (--lcm): 4-step, ~10 s/image
- Hyper-SD15 CFG-distilled (--hyper {1,2,4,8}): 1-step, ~5 s. Negative prompt disabled.
- Hyper-SD15 CFG-preserved (--hyper-cfg {8,12}): 8-step, ~4.5 min. Negative prompt enabled, less over-saturation.
For pose exploration, --hyper 1 (5 s/image) lets you iterate quickly. For portrait finishing, --hyper-cfg 8 (negative prompt active) is more controllable.
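The choice between these modes can be written down as a small lookup. The sketch below uses only the rough timings listed above; the Preset type and pick_preset function are hypothetical, and LCM-LoRA's negative-prompt behaviour is marked as unknown because this article does not state it.

```python
from typing import NamedTuple, Optional


class Preset(NamedTuple):
    flag: str
    steps: int
    approx_seconds: float
    negative_prompt: Optional[bool]  # None = not stated in this article


PRESETS = [
    Preset("--lcm", 4, 10, None),
    Preset("--hyper 1", 1, 5, False),
    Preset("--hyper-cfg 8", 8, 270, True),
]


def pick_preset(needs_negative_prompt, budget_seconds):
    """Return the fastest preset flag that fits the budget, or None."""
    candidates = [
        p for p in PRESETS
        if p.approx_seconds <= budget_seconds
        and (not needs_negative_prompt or p.negative_prompt is True)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p.approx_seconds).flag


print(pick_preset(False, 60))   # --hyper 1
print(pick_preset(True, 300))   # --hyper-cfg 8
print(pick_preset(True, 60))    # None: nothing with a negative prompt is that fast
```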
Final Role Split Between the Two Environments
| | SD1.5 environment | SDXL environment |
|---|---|---|
| Primary use | Pose & composition exploration | Final output & FaceID portrait |
| Speed (full 25-step) | ~1 min/image (61.5 s measured) | ~16 min/image |
| Speed (low-step distillation) | Hyper 1-step ~5 s | Lightning 8-step ~6–7 min |
| Resolution | 512×768 (native) | 1280×720 / 832×1216 |
| ControlNet | OpenPose / Canny / Depth ✅ | ❌ (dropped due to RAM limit) |
Recommended workflow: use SD1.5 with --hyper 1 (5 s/image) to rapidly explore poses and compositions, then pass the chosen reference image to the SDXL environment for final output. The SD1.5 ControlNet models cannot be reused with SDXL (the two UNets have different architectures), and the SDXL environment here runs without ControlNet, so the pose is not transferred directly. Instead, the original reference image can be passed as IP-Adapter input as a bridge.
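A quick back-of-the-envelope comparison shows what the split buys. The per-image times are the measured figures from this article; the batch of 20 candidate poses is a made-up example, and both function names are hypothetical.

```python
def split_workflow_minutes(candidates, explore_seconds, final_minutes):
    """Explore every candidate on SD1.5, then render one final on SDXL."""
    return candidates * explore_seconds / 60 + final_minutes


def sdxl_only_minutes(candidates, final_minutes):
    """Render every candidate at full SDXL cost."""
    return candidates * final_minutes


# 20 candidate poses (hypothetical), 4.9 s per Hyper-SD15 1-step image,
# 16 min per SDXL 25-step image.
print(round(split_workflow_minutes(20, 4.9, 16), 1))  # 17.6 minutes
print(sdxl_only_minutes(20, 16))                      # 320 minutes
```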
FAQ
Q. Can I use fp16 on the GTX 1660 SUPER?
A. fp32 only for now. TU116 has no Tensor Cores, causing underflow/overflow in fp16 softmax inside the attention layers. This is the same root cause as AUTOMATIC1111's --no-half and ComfyUI's --force-fp32. RTX 3060 and later (Ampere) support fp16 and can reach 1–2 s/step with flash SDPA.
Q. Is there a way to add ControlNet to the SDXL environment?
A. In a txt2img + ControlNet only setup (no FaceID), RAM usage stays at ~19 GB and it works. Running FaceID and ControlNet simultaneously pushes RAM to ~26 GB, which makes distributing the work to SD1.5 the more practical choice.
Q. How slow is Multi-ControlNet (×2)?
A. Even with --offload model, VRAM still peaks at 6.7 GB with spilling, so it takes ~11 min/image. The improvement is limited compared to 1 CN (~1 min) because the spill is not fully resolved.
※ Steps and numbers in this article were verified in a 2026-May environment (diffusers 0.38, GTX 1660 SUPER, WSL2 + Ubuntu 22.04). Behavior may differ with different library or driver versions. If something does not work, please let me know in the comments.
Summary
I set up SD1.5 + ControlNet on a GTX 1660 SUPER (6 GB VRAM, 24 GB RAM). After working through three pitfalls — fp16 NaN, forced math SDPA slowdown, and VRAM spill causing misleading benchmarks — the final configuration delivers Hyper-SD15 1-step in ~5 s/image and full 25-step in ~1 min/image. Pairing this with the SDXL environment (16 min/image) for final output makes pose exploration much faster.
References
- diffusers — Hugging Face
- ByteDance/Hyper-SD — Hugging Face
- lllyasviel/ControlNet
- AUTOMATIC1111 Wiki: Install and Run on NVidia GPUs (--no-half explanation)
※ This article is part of an automated blogging experiment using Claude Code.