Saturday, June 6, 2026

Building SD1.5 + ControlNet on a GTX 1660 SUPER — The Bottleneck Was RAM, Not VRAM

A build log for setting up SD1.5 + ControlNet on a GTX 1660 SUPER (6 GB VRAM). The three real roadblocks were not VRAM shortage but: RAM spiking to ~26 GB causing OOM, fp16 producing all-black NaN images, and a forced math SDPA path causing a 3.2× slowdown. A fourth surprise — every benchmark had been measured while spilling to host RAM — appeared later. Full breakdown from cause to fix.

What This Article Covers

  • An SD1.5 + ControlNet configuration that works on a 6 GB VRAM GPU
  • The fp16 NaN issue on GTX 16xx (TU116) and how to work around it
  • Why sequential_cpu_offload eats RAM instead of VRAM
  • Final benchmarks: Hyper-SD15 1-step ~5 s/image, full 25-step ~1 min/image

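What I Used

  • GPU: GTX 1660 SUPER (TU116, 6 GB VRAM)
  • System RAM: 24 GB
  • OS: WSL2 + Ubuntu 22.04
  • Library: diffusers 0.38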

Background: SDXL ControlNet Was Abandoned in 1 Hour 13 Minutes

The original plan was to implement ControlNet on an existing SDXL-based environment. On 2026-04-30 at 21:40 the implementation finished. By 22:53 it had been reverted. One hour, thirteen minutes.

RAM was the cause. With SDXL and enable_sequential_cpu_offload, VRAM peaks at ~4 GB — but the entire model lives in RAM. Adding ControlNet (fp32, ~5 GB) on top:

| Configuration | VRAM peak | RAM resident | Result |
|---|---|---|---|
| SDXL fp32 + FaceID | ~4 GB | ~21 GB | ✅ (24 GB system) |
| + ControlNet added | ~4 GB (unchanged) | ~26 GB | ❌ OOM |

VRAM was fine. ControlNet's cost shows up on the RAM side. The confusion — "VRAM is within limits, so why OOM?" — only resolves once you understand how sequential_cpu_offload works.
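The table's arithmetic can be written out as a quick budget check. This is a minimal sketch using only the figures above; the function name `fits_in_ram` and the zero-headroom default are illustrative assumptions, not part of the actual setup.

```python
# RAM budget check using the figures from the table above (all values in GB).
SYSTEM_RAM_GB = 24.0


def fits_in_ram(resident_gb, system_gb=SYSTEM_RAM_GB, headroom_gb=0.0):
    """Return True when the resident set plus headroom fits in system RAM."""
    return resident_gb + headroom_gb <= system_gb


base_gb = 21.0        # SDXL fp32 + FaceID, resident in RAM under sequential offload
controlnet_gb = 5.0   # fp32 ControlNet added on top

print(fits_in_ram(base_gb))                  # True
print(fits_in_ram(base_gb + controlnet_gb))  # False
```

In practice the OS and other processes also need RAM, so a non-zero `headroom_gb` would make the check stricter.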

How sequential_cpu_offload Works

diffusers' enable_sequential_cpu_offload walks through the UNet's ~700 submodules and, for each one, copies it to VRAM, runs its forward pass, and moves it back to RAM. This repeats on every step, so the full model stays resident in RAM throughout. That adds up to ~700 CPU↔GPU round trips per step, with the Python GIL pinning a single core at 100%.
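The access pattern described above can be modeled with a toy loop. This is only a sketch of the transfer pattern, not diffusers' implementation: `FakeSubmodule` and `run_step_sequential` are hypothetical names, no GPU is involved, and the 700 figure is taken from the text.

```python
# Toy model of per-submodule offload: no real GPU work, only transfer counting.
class FakeSubmodule:
    def __init__(self):
        self.device = "cpu"


def run_step_sequential(submodules, counter):
    """One denoising step: move each submodule in, run it, move it back out."""
    for sub in submodules:
        sub.device = "cuda"   # host -> device copy
        # ... the submodule's forward pass would run here ...
        sub.device = "cpu"    # device -> host, freeing VRAM for the next one
        counter["round_trips"] += 1


submodules = [FakeSubmodule() for _ in range(700)]  # ~700, per the text above
counter = {"round_trips": 0}
for _ in range(25):                                 # a full 25-step generation
    run_step_sequential(submodules, counter)

print(counter["round_trips"])  # 17500
```

Every submodule ends each step back on the CPU, which is why VRAM stays low while RAM holds the whole model.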

SD1.5 weighs ~4 GB (fp32) — it fits in 6 GB VRAM, making sequential_cpu_offload unnecessary. That was the motivation for building a separate SD1.5 environment.
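As a rough sanity check on the ~4 GB figure, weight size is parameter count times bytes per parameter. The parameter counts below are approximate, commonly cited values for the SD1.5 components and are an assumption here, not measurements from this build.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}

# Approximate, commonly cited parameter counts for SD1.5 (assumption).
SD15_PARAMS = {
    "unet": 860e6,
    "text_encoder": 123e6,
    "vae": 84e6,
}


def weights_gb(params, dtype):
    """Size of the weights alone, in GB (10^9 bytes)."""
    return sum(params.values()) * BYTES_PER_PARAM[dtype] / 1e9


print(round(weights_gb(SD15_PARAMS, "fp32"), 2))  # 4.27
print(round(weights_gb(SD15_PARAMS, "fp16"), 2))  # 2.13
```

This counts weights only. Activations, the CUDA context, and any ControlNet come on top of it.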

Three Problems That Blocked Progress

① fp16 Produces All-Black NaN Images (Known GTX 16xx Issue)

Five configurations were benchmarked; the table lists four of them:

| Configuration | Step time | Output |
|---|---|---|
| T1: SD1.5 fp16, no CN, math SDPA | 15.8 s/step | All-black image, 1,224 bytes (NaN) |
| T3: + ControlNet ×1, math SDPA | 22.6 s/step | NaN |
| T4: + flash + mem-efficient SDPA | 7.1 s/step | NaN (faster, but still broken) |
| Switch to fp32 (default SDPA) | see below | ✅ Clean output |

Changing the SDPA kernel did not fix the NaN. GTX 16xx cards such as the GTX 1660 SUPER (TU116) have no Tensor Cores and are widely reported to produce NaN or all-black output in fp16; the likely mechanism is underflow/overflow in the fp16 softmax inside the attention layers. This is the same issue that AUTOMATIC1111's --no-half flag works around on GTX 16xx cards. Switching to fp32 fixed it immediately.
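To see how an overflow can turn into NaN inside a softmax, here is a toy sketch in plain Python. It does not reproduce TU116 behavior: `to_fp16_range` is a crude, hypothetical stand-in for an fp16 cast that models only the finite range (±65504), and the scores are made up.

```python
import math

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def to_fp16_range(x):
    """Crude stand-in for an fp16 cast: values beyond the finite range become inf."""
    if x > FP16_MAX:
        return math.inf
    if x < -FP16_MAX:
        return -math.inf
    return x


def stable_softmax(scores):
    """Max-subtracted softmax; inf - inf yields nan when the max is inf."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


logits = [70000.0, 1.0]  # made-up attention scores, one beyond the fp16 range

full_range = stable_softmax(logits)
half_range = stable_softmax([to_fp16_range(v) for v in logits])

print(full_range)                               # [1.0, 0.0]
print(any(math.isnan(p) for p in half_range))   # True
```

Once a single NaN enters the attention output it propagates through every later layer, which matches the all-black images in the table.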

② Forced math SDPA Was Causing a 3.2× Slowdown

The gap between T3 (22.6 s/step) and T4 (7.1 s/step) traced back to legacy code copied from another environment:

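```python
import torch

# Carried over from another env (written as an fp16 NaN workaround).
# These calls disable both fused kernels and leave only the math path enabled.
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(True)
```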

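Removing these lines and letting PyTorch auto-select the SDPA kernel gives the 3.2× improvement (22.6 → 7.1 s/step, measured in fp16). Note: flash SDPA requires fp16/bf16, so under fp32 PyTorch cannot select it and picks from the remaining kernels (memory-efficient attention accepts fp32, with math as the fallback). The correct combination is therefore "remove the forced lines + use fp32."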
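A quick way to catch leftover settings like the forced lines shown above is to read back PyTorch's global SDPA toggles. This is a minimal sketch: `sdpa_flags` and `is_math_forced` are hypothetical helper names, and the torch import is guarded so the snippet also runs where PyTorch is not installed.

```python
def sdpa_flags():
    """Return PyTorch's global SDPA toggles, or None when torch is not installed."""
    try:
        import torch
    except ImportError:
        return None
    return {
        "flash": torch.backends.cuda.flash_sdp_enabled(),
        "mem_efficient": torch.backends.cuda.mem_efficient_sdp_enabled(),
        "math": torch.backends.cuda.math_sdp_enabled(),
    }


def is_math_forced(flags):
    """True when only the math kernel is left enabled."""
    return flags["math"] and not flags["flash"] and not flags["mem_efficient"]


flags = sdpa_flags()
if flags is None:
    print("torch not installed; nothing to check")
elif is_math_forced(flags):
    print("math SDPA is forced; look for leftover enable_*_sdp(False) calls")
else:
    print("SDPA kernel selection is left to PyTorch:", flags)
```

Running this at startup, after all imports, would have flagged the copied workaround before any benchmark was taken.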

③ All Benchmarks Were Measured While Spilling (Discovered Later)

After the environment was set up, adding vram_peak_mb metadata to generated images revealed that fp32 + 1 ControlNet peaks at 6,726 MB VRAM. The WSL2 effective limit on this card is ~6,100 MB.

Everything had been spilling to host RAM. Every earlier number (13.7 s/step, full 25-step in 5 min 55 s) was measured in a spill-to-RAM state.
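The spill check itself is a single comparison. The sketch below pairs it with one possible way to record a peak, using PyTorch's allocator statistics. How the article's vram_peak_mb value was actually computed is not stated, so `measure_peak_mb` is an assumption, as are both helper names.

```python
WSL2_LIMIT_MB = 6100  # effective limit reported above for this card


def is_spilling(peak_mb, limit_mb=WSL2_LIMIT_MB):
    """True when the recorded VRAM peak exceeds the usable limit."""
    return peak_mb > limit_mb


def measure_peak_mb(run):
    """Run a callable and return PyTorch's peak allocated VRAM in MiB.

    Returns None when torch or CUDA is unavailable. Only memory handled by
    PyTorch's caching allocator is counted.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    torch.cuda.reset_peak_memory_stats()
    run()
    return torch.cuda.max_memory_allocated() / (1024 ** 2)


print(is_spilling(6726))  # True:  fp32 + 1 ControlNet, no offload
print(is_spilling(5298))  # False: a peak below the limit
```

Logging this value next to every timing makes it obvious when a benchmark was taken in a spilled state.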

The fix: enable_model_cpu_offload (model-granularity, ~5–10 CPU↔GPU swaps per step):

| Configuration | --offload none (spilling) | --offload model (current default) | Speedup |
|---|---|---|---|
| Full 25-step + 1 CN | 584.7 s / 6,726 MB VRAM | 61.5 s / 5,298 MB VRAM | 9.5× |
| Hyper-SD15 4-step + 1 CN | 62.9 s / 6,680 MB VRAM | 11.1 s / 5,146 MB VRAM | 5.7× |
| Hyper-SD15 1-step | 32.4 s / 6,680 MB VRAM | 4.9 s / 5,106 MB VRAM | 6.6× |
| Multi-CN (×2) 25-step | 884.6 s / 8,124 MB VRAM | 680.1 s / 6,694 MB VRAM | 1.3× (still spilling) |

With 1 CN, VRAM drops to 5.1–5.3 GB — no spill — yielding the 9.5× improvement. Multi-CN (×2) still peaks at 6.7 GB even with --offload model, so the gain is limited.
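A flag like --offload could map onto diffusers' pipeline methods roughly as follows. This is a sketch under assumptions: `apply_offload` is a hypothetical helper, the "sequential" mode is added here for completeness, and `FakePipe` is a stand-in so the example runs without a GPU. `enable_model_cpu_offload` and `enable_sequential_cpu_offload` are the real diffusers method names.

```python
def apply_offload(pipe, mode):
    """Apply an offload strategy to a diffusers-style pipeline object."""
    if mode == "none":
        pipe.to("cuda")                       # whole pipeline resident in VRAM
    elif mode == "model":
        pipe.enable_model_cpu_offload()       # swap whole models (UNet, VAE, ...)
    elif mode == "sequential":
        pipe.enable_sequential_cpu_offload()  # swap individual submodules
    else:
        raise ValueError(f"unknown offload mode: {mode!r}")
    return pipe


class FakePipe:
    """Stand-in that records calls instead of touching a GPU."""

    def __init__(self):
        self.calls = []

    def to(self, device):
        self.calls.append(("to", device))
        return self

    def enable_model_cpu_offload(self):
        self.calls.append(("enable_model_cpu_offload",))

    def enable_sequential_cpu_offload(self):
        self.calls.append(("enable_sequential_cpu_offload",))


pipe = apply_offload(FakePipe(), "model")
print(pipe.calls)  # [('enable_model_cpu_offload',)]
```

Note that the offload branches do not call `pipe.to("cuda")` first; the offload hooks decide when each part is moved to the GPU.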

Low-Step Distillation for SD1.5: What Replaces SDXL Lightning?

SDXL Lightning has no SD1.5 equivalent — ByteDance only released it for SDXL. SD1.5 alternatives:

  • LCM-LoRA (--lcm): 4-step, ~10 s/image
  • Hyper-SD15 CFG-distilled (--hyper {1,2,4,8}): 1-step, ~5 s. Negative prompt disabled.
  • Hyper-SD15 CFG-preserved (--hyper-cfg {8,12}): 8-step, ~4.5 min. Negative prompt enabled, less over-saturation.

For pose exploration, --hyper 1 (5 s/image) lets you iterate quickly. For portrait finishing, --hyper-cfg 8 (negative prompt active) is more controllable.
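The three flags above could be declared with argparse along these lines. This is a sketch, not the actual script: the mutually exclusive grouping, the 25-step default, and the reading of the --hyper-cfg value as a step count are assumptions drawn from how the flags are described here.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="SD1.5 low-step presets (sketch)")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--lcm", action="store_true",
                       help="LCM-LoRA, 4 steps")
    group.add_argument("--hyper", type=int, choices=[1, 2, 4, 8],
                       help="Hyper-SD15 CFG-distilled; negative prompt disabled")
    group.add_argument("--hyper-cfg", type=int, choices=[8, 12], dest="hyper_cfg",
                       help="Hyper-SD15 CFG-preserved; negative prompt enabled")
    return parser


def steps_for(args):
    """Map the selected preset to a step count (25 = full run, assumed default)."""
    if args.lcm:
        return 4
    if args.hyper is not None:
        return args.hyper
    if args.hyper_cfg is not None:
        return args.hyper_cfg
    return 25


args = build_parser().parse_args(["--hyper", "1"])
print(steps_for(args))  # 1
```

Making the presets mutually exclusive means a command line such as `--lcm --hyper 1` is rejected rather than silently picking one of the two.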

Final Role Split Between the Two Environments

| | SD1.5 environment | SDXL environment |
|---|---|---|
| Primary use | Pose & composition exploration | Final output & FaceID portrait |
| Speed (full 25-step) | ~1 min/image (61.5 s measured) | ~16 min/image |
| Speed (low-step distillation) | Hyper 1-step ~5 s | Lightning 8-step ~6–7 min |
| Resolution | 512×768 (native) | 1280×720 / 832×1216 |
| ControlNet OpenPose / Canny / Depth | ✅ | ❌ (dropped due to RAM limit) |

Recommended workflow: use SD1.5 with --hyper 1 (5 s/image) to rapidly explore poses and compositions, then pass the chosen reference image to the SDXL environment for final output. The SD1.5 ControlNet weights cannot be loaded into SDXL (the UNet architectures differ), and the SDXL environment runs without ControlNet anyway, so the OpenPose skeleton map cannot steer SDXL directly. Passing the original reference image as IP-Adapter input serves as the bridge.

FAQ

Q. Can I use fp16 on the GTX 1660 SUPER?

A. fp32 only for now. GTX 16xx cards (TU116, no Tensor Cores) are widely reported to produce NaN in fp16, most likely from underflow/overflow in the fp16 softmax inside the attention layers. AUTOMATIC1111's --no-half and ComfyUI's --force-fp32 work around the same issue. RTX 3060 and later (Ampere) handle fp16 correctly and can reach 1–2 s/step with flash SDPA.

Q. Is there a way to add ControlNet to the SDXL environment?

A. In a txt2img + ControlNet setup without FaceID, RAM usage stays at ~19 GB and it works. Running FaceID and ControlNet simultaneously pushes RAM to ~26 GB, so moving the ControlNet work to the SD1.5 environment is the more practical choice.

Q. How slow is Multi-ControlNet (×2)?

A. Even with --offload model, VRAM still peaks at 6.7 GB and spills, so it takes ~11 min/image (680.1 s). The improvement over no offload is limited (1.3×, versus 9.5× with 1 CN at ~1 min) because the spill is not fully resolved.

※ Steps and numbers in this article were verified in a May 2026 environment (diffusers 0.38, GTX 1660 SUPER, WSL2 + Ubuntu 22.04). Behavior may differ with other library or driver versions. If something does not work, please let me know in the comments.

Summary

I set up SD1.5 + ControlNet on a GTX 1660 SUPER (6 GB VRAM, 24 GB RAM). After working through three pitfalls — fp16 NaN, forced math SDPA slowdown, and VRAM spill causing misleading benchmarks — the final configuration delivers Hyper-SD15 1-step in ~5 s/image and full 25-step in ~1 min/image. Pairing this with the SDXL environment (16 min/image) for final output makes pose exploration much faster.

※ This article is part of an automated blogging experiment using Claude Code.
