ON THE HAND: Building SD1.5 + ControlNet on a GTX 1660 SUPER

This is a build log for setting up SD1.5 + ControlNet on a GTX 1660 SUPER (6 GB VRAM). The real roadblocks were not a simple shortage of VRAM. They were RAM usage spiking to ~26 GB and triggering OOM, fp16 producing all-black NaN images, and a forced math SDPA path causing a 3.2× slowdown. A fourth surprise appeared later: every benchmark had been measured while spilling to host RAM. What follows is the full breakdown, from cause to fix.

What This Article Covers

An SD1.5 + ControlNet configuration that works on a 6 GB VRAM GPU
The fp16 NaN issue on GTX 16xx (TU116) and how to work around it
Why sequential_cpu_offload eats RAM instead of VRAM
Final benchmarks: Hyper-SD15 1-step ~5 s/image, full 25-step ~1 min/image

What I Used

MSI GeForce GTX 1660 SUPER GAMING X 6GB — the GPU used here
CFD DDR4-3200 32 GB desktop RAM — 32 GB or more recommended after hitting the OOM wall
WSL2 (Ubuntu 22.04), Python 3.12, diffusers 0.38
Realistic Vision V6.0 B1 (noVAE) checkpoint
ControlNet (OpenPose / Canny / Depth) + LCM-LoRA / Hyper-SD15

Background: SDXL ControlNet Was Abandoned in 1 Hour 13 Minutes

The original plan was to implement ControlNet on an existing SDXL-based environment. On 2026-04-30 at 21:40 the implementation finished. By 22:53 it had been reverted. One hour, thirteen minutes.

RAM was the cause. With SDXL and enable_sequential_cpu_offload, VRAM peaks at ~4 GB — but the entire model lives in RAM. Adding ControlNet (fp32, ~5 GB) on top:

Configuration	VRAM peak	RAM resident	Result
SDXL fp32 + FaceID	~4 GB	~21 GB	✅ (24 GB system)
+ ControlNet added	~4 GB (unchanged)	~26 GB	❌ OOM

VRAM was fine. ControlNet's cost shows up on the RAM side. The confusion — "VRAM is within limits, so why OOM?" — only resolves once you understand how sequential_cpu_offload works.
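The arithmetic behind that table can be written as a small budget check. This is only an illustration: the function name and the headroom_gb parameter are hypothetical, and the figures are the approximate values from the table above.

```python
def fits_in_ram(resident_gb: float, system_gb: float, headroom_gb: float = 0.0) -> bool:
    """Return True if the resident model footprint fits in system RAM.

    With sequential CPU offload the whole model stays in host RAM,
    so RAM (not VRAM) is the budget that decides whether the run survives.
    """
    return resident_gb + headroom_gb <= system_gb


SYSTEM_RAM_GB = 24    # system RAM at the time (from the table)
SDXL_FACEID_GB = 21   # approx. resident RAM for SDXL fp32 + FaceID
CONTROLNET_GB = 5     # approx. size of the fp32 ControlNet

# SDXL + FaceID alone
print(fits_in_ram(SDXL_FACEID_GB, SYSTEM_RAM_GB))
# SDXL + FaceID + ControlNet
print(fits_in_ram(SDXL_FACEID_GB + CONTROLNET_GB, SYSTEM_RAM_GB))
```

In practice the OS and WSL2 itself also need memory, so a non-zero headroom_gb would make the check stricter than shown here.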

How sequential_cpu_offload Works

diffusers' enable_sequential_cpu_offload walks through the UNet's ~700 submodules on every step: each one is copied to VRAM, run, and moved back to RAM. The full model stays resident in RAM throughout, which is why VRAM stays low while RAM fills up. That is roughly 700 CPU↔GPU round trips per step, and because the loop is driven from Python under the GIL, a single CPU core sits at 100%.
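To make the cost model concrete, here is a toy simulation of submodule-level offload. It is not how diffusers or accelerate implement it; the function name is hypothetical and the per-submodule sizes are made-up round numbers, chosen only to show why VRAM stays small while RAM holds everything.

```python
def simulate_sequential_offload(submodule_sizes_mb, steps):
    """Toy model of submodule-level offload: one submodule on the GPU at a time.

    Returns (round_trips, peak_weight_vram_mb, ram_resident_mb).
    Activations and other buffers are ignored, so real VRAM peaks are higher.
    """
    round_trips = 0
    peak_weight_vram_mb = 0.0
    for _ in range(steps):
        for size_mb in submodule_sizes_mb:
            # Copy weights to the GPU, run the forward pass, move them back.
            round_trips += 1
            peak_weight_vram_mb = max(peak_weight_vram_mb, size_mb)
    # The full set of weights never leaves host RAM.
    ram_resident_mb = sum(submodule_sizes_mb)
    return round_trips, peak_weight_vram_mb, ram_resident_mb


# 700 submodules of 10 MB each (illustrative sizes), 25 denoising steps.
trips, vram_mb, ram_mb = simulate_sequential_offload([10.0] * 700, steps=25)
print(trips, vram_mb, ram_mb)
```

The round-trip count grows with submodules × steps, while the weight footprint on the GPU is bounded by the largest single submodule.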

SD1.5 weighs ~4 GB (fp32) — it fits in 6 GB VRAM, making sequential_cpu_offload unnecessary. That was the motivation for building a separate SD1.5 environment.

Three Problems That Blocked Progress

① fp16 Produces All-Black NaN Images (Known GTX 16xx Issue)

Five configurations were benchmarked:

Configuration	Step time	Output
T1: SD1.5 fp16, no CN, math SDPA	15.8 s/step	All-black image, 1,224 bytes (NaN)
T3: + ControlNet ×1, math SDPA	22.6 s/step	NaN
T4: + flash + mem-efficient SDPA	7.1 s/step	NaN (faster, but still broken)
Switch to fp32 (default SDPA)	see below	✅ Clean output

Changing the SDPA kernel did not fix the NaN. The root cause: the GTX 1660 SUPER (TU116) has no Tensor Cores, causing underflow/overflow in fp16 softmax inside the attention layers. This is the same issue behind AUTOMATIC1111's --no-half flag for GTX 16xx cards. Switching to fp32 fixed it immediately.
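For readers who want to see how a limited numeric range turns into NaN, here is a standard-library sketch. It does not reproduce the GPU kernel or the actual failure on TU116. It only rounds values to IEEE half precision and runs a deliberately naive softmax (no max-subtraction) so that the overflow path is visible. Both function names are hypothetical.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    if math.isnan(x) or math.isinf(x):
        return x
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # Beyond the fp16 maximum (65504); hardware would yield +/-inf.
        return math.copysign(math.inf, x)


def naive_softmax_fp16(logits):
    """Softmax with each intermediate rounded to fp16, without max-subtraction."""
    exps = [to_fp16(math.exp(v)) for v in logits]
    total = to_fp16(sum(exps))
    return [to_fp16(e / total) for e in exps]


print(to_fp16(70000.0))                 # overflow: too large for fp16
print(to_fp16(1e-8))                    # underflow: too small for fp16
print(naive_softmax_fp16([12.0, 0.0]))  # inf / inf produces a NaN entry
```

Once a single NaN appears in an attention output, it propagates through the remaining layers, which is consistent with the all-black images above.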

② Forced math SDPA Was Causing a 3.2× Slowdown

The gap between T3 (22.6 s/step) and T4 (7.1 s/step) traced back to legacy code copied from another environment:

import torch

# Carried over from another env (written as an fp16 NaN workaround).
# Disabling flash and mem-efficient leaves only the slow math kernel.
torch.backends.cuda.enable_flash_sdp(False)
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_math_sdp(True)

Removing these lines and letting PyTorch auto-select the SDPA kernel produced the 3.2× improvement measured between T3 and T4. Note that flash SDPA requires fp16/bf16, so it is never selected once the pipeline runs in fp32; PyTorch then picks between the memory-efficient and math kernels. The correct combination is "remove the forced lines + use fp32."
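If the forced settings are buried somewhere in copied code, one option is to reset the flags explicitly at startup and print what is enabled. The sketch below uses the torch.backends.cuda toggles shown above plus their matching *_enabled() getters. The helper name and the cuda_backend parameter are my own additions, there only so the function can be exercised without a GPU.

```python
def restore_default_sdpa(cuda_backend=None):
    """Re-enable all three SDPA backends so PyTorch can choose per call.

    `cuda_backend` is normally `torch.backends.cuda`; it is a parameter
    only so the function can be tested with a stand-in object.
    """
    if cuda_backend is None:
        import torch  # imported lazily; needs PyTorch 2.x

        cuda_backend = torch.backends.cuda

    cuda_backend.enable_flash_sdp(True)
    cuda_backend.enable_mem_efficient_sdp(True)
    cuda_backend.enable_math_sdp(True)

    # Report the resulting state so it can be logged next to benchmarks.
    return {
        "flash": cuda_backend.flash_sdp_enabled(),
        "mem_efficient": cuda_backend.mem_efficient_sdp_enabled(),
        "math": cuda_backend.math_sdp_enabled(),
    }
```

These flags only say which kernels are allowed. Which one actually runs still depends on dtype, tensor shapes, and the GPU generation.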

③ All Benchmarks Were Measured While Spilling (Discovered Later)

After the environment was set up, adding vram_peak_mb metadata to generated images revealed that fp32 + 1 ControlNet peaks at 6,726 MB VRAM. The WSL2 effective limit on this card is ~6,100 MB.

Everything had been spilling to host RAM. Every earlier number (13.7 s/step, full 25-step in 5 min 55 s) was measured in a spill-to-RAM state.
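A minimal sketch of that kind of check follows. I am assuming the peak is read from PyTorch's allocator statistics; the article's own vram_peak_mb field may be collected differently. The function names and the constant are hypothetical, and the 6,100 MB limit is the approximate figure quoted above.

```python
WSL2_EFFECTIVE_LIMIT_MB = 6100  # approximate usable VRAM on this card under WSL2


def is_spilling(vram_peak_mb: float, limit_mb: float = WSL2_EFFECTIVE_LIMIT_MB) -> bool:
    """Return True if the recorded peak exceeds what physically fits in VRAM."""
    return vram_peak_mb > limit_mb


def measure_vram_peak_mb(run):
    """Call `run()` and return (result, peak allocated VRAM in MB)."""
    import torch  # imported lazily; requires a CUDA-enabled PyTorch build

    torch.cuda.reset_peak_memory_stats()
    result = run()
    torch.cuda.synchronize()
    peak_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
    return result, peak_mb


print(is_spilling(6726))  # fp32 + 1 ControlNet, no offload
print(is_spilling(5298))  # a peak that stays under the limit
```

Storing the peak next to each timing makes it obvious afterwards which numbers were taken in a spilled state.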

The fix: enable_model_cpu_offload (model-granularity, ~5–10 CPU↔GPU swaps per step):

Configuration	--offload none (spilling)	--offload model (current default)	Speedup
Full 25-step + 1 CN	584.7 s / 6,726 MB VRAM	61.5 s / 5,298 MB VRAM	9.5×
Hyper-SD15 4-step + 1 CN	62.9 s / 6,680 MB VRAM	11.1 s / 5,146 MB VRAM	5.7×
Hyper-SD15 1-step	32.4 s / 6,680 MB VRAM	4.9 s / 5,106 MB VRAM	6.6×
Multi-CN (×2) 25-step	884.6 s / 8,124 MB VRAM	680.1 s / 6,694 MB VRAM	1.3× (still spilling)

With 1 CN, VRAM drops to 5.1–5.3 GB, below the ~6,100 MB limit, so nothing spills and the speedup is 5.7× to 9.5×. Multi-CN (×2) still peaks at 6.7 GB even with --offload model, so it keeps spilling and the gain is limited to 1.3×.
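Here is how the --offload switch could be wired up, as a sketch rather than the script used for these measurements. enable_model_cpu_offload, StableDiffusionControlNetPipeline, and ControlNetModel are real diffusers APIs. The helper names, the mapping, and the assumption that both models load via from_pretrained (a Hub ID or a local diffusers-format directory) are mine.

```python
OFFLOAD_METHODS = {
    "none": None,                         # keep the whole pipeline on the GPU
    "model": "enable_model_cpu_offload",  # move whole models on and off the GPU
}


def apply_offload(pipe, mode: str = "model"):
    """Apply the chosen offload strategy to a diffusers pipeline."""
    if mode not in OFFLOAD_METHODS:
        raise ValueError(f"unknown offload mode: {mode!r}")
    method_name = OFFLOAD_METHODS[mode]
    if method_name is None:
        pipe.to("cuda")
    else:
        # Offload hooks handle device placement; do not call .to("cuda") here.
        getattr(pipe, method_name)()
    return pipe


def build_pipeline(base_model: str, controlnet_model: str, offload: str = "model"):
    """Load SD1.5 + one ControlNet in fp32 and apply the offload mode."""
    import torch  # lazy imports: heavy dependencies
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    controlnet = ControlNetModel.from_pretrained(
        controlnet_model, torch_dtype=torch.float32
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        base_model, controlnet=controlnet, torch_dtype=torch.float32
    )
    return apply_offload(pipe, offload)
```

A checkpoint distributed as a single .safetensors file would need a different loader than from_pretrained, which this sketch does not cover.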

Low-Step Distillation for SD1.5: What Replaces SDXL Lightning?

SDXL Lightning has no SD1.5 equivalent — ByteDance only released it for SDXL. SD1.5 alternatives:

LCM-LoRA (--lcm): 4-step, ~10 s/image
Hyper-SD15 CFG-distilled (--hyper {1,2,4,8}): 1-step, ~5 s. Negative prompt disabled.
Hyper-SD15 CFG-preserved (--hyper-cfg {8,12}): 8-step, ~4.5 min. Negative prompt enabled, less over-saturation.

For pose exploration, --hyper 1 (5 s/image) lets you iterate quickly. For portrait finishing, --hyper-cfg 8 (negative prompt active) is more controllable.
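The flag layout above maps naturally onto argparse. The following is a hypothetical reconstruction, not the actual script: the flag names and step choices come from the list above, while the function names, the dictionary keys, and the 25-step default with negative prompt enabled are assumptions. The article does not say whether the negative prompt is active under --lcm, so that entry is left as None.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """CLI sketch for choosing one low-step mode at a time."""
    parser = argparse.ArgumentParser(description="SD1.5 sampling mode (sketch)")
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--lcm", action="store_true",
                      help="LCM-LoRA, 4 steps")
    mode.add_argument("--hyper", type=int, choices=[1, 2, 4, 8],
                      help="Hyper-SD15 CFG-distilled, N steps")
    mode.add_argument("--hyper-cfg", type=int, choices=[8, 12],
                      help="Hyper-SD15 CFG-preserved, N steps")
    return parser


def resolve_sampling(args: argparse.Namespace) -> dict:
    """Translate parsed flags into step count and negative-prompt behaviour."""
    if args.lcm:
        return {"mode": "lcm", "steps": 4, "negative_prompt": None}
    if args.hyper is not None:
        return {"mode": "hyper", "steps": args.hyper, "negative_prompt": False}
    if args.hyper_cfg is not None:
        return {"mode": "hyper-cfg", "steps": args.hyper_cfg, "negative_prompt": True}
    # No low-step flag: assume the full 25-step run.
    return {"mode": "full", "steps": 25, "negative_prompt": True}


parser = build_parser()
print(resolve_sampling(parser.parse_args(["--hyper", "1"])))
print(resolve_sampling(parser.parse_args(["--hyper-cfg", "8"])))
```

Putting the three modes in a mutually exclusive group makes it impossible to combine, say, --lcm with --hyper by accident.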

Final Role Split Between the Two Environments

	SD1.5 environment	SDXL environment
Primary use	Pose & composition exploration	Final output & FaceID portrait
Speed (full 25-step)	~1 min/image (61.5 s measured)	~16 min/image
Speed (low-step distillation)	Hyper 1-step ~5 s	Lightning 8-step ~6–7 min
Resolution	512×768 (native)	1280×720 / 832×1216
ControlNet	OpenPose / Canny / Depth ✅	❌ (dropped due to RAM limit)

Recommended workflow: use SD1.5 with --hyper 1 (5 s/image) to rapidly explore poses and compositions, then pass the chosen reference image to the SDXL environment for final output. The SD1.5 ControlNet setup does not carry over directly, since SD1.5 ControlNet weights are not compatible with SDXL and the SDXL environment runs without ControlNet, but the reference image can be passed as IP-Adapter input as a bridge.

FAQ

Q. Can I use fp16 on the GTX 1660 SUPER?

A. fp32 only for now. TU116 has no Tensor Cores, causing underflow/overflow in fp16 softmax inside the attention layers. This is the same root cause as AUTOMATIC1111's --no-half and ComfyUI's --force-fp32. RTX 3060 and later (Ampere) support fp16 and can reach 1–2 s/step with flash SDPA.

Q. Is there a way to add ControlNet to the SDXL environment?

A. In a txt2img + ControlNet only setup (no FaceID), RAM usage stays around ~19 GB and it works. Running FaceID and ControlNet simultaneously pushes RAM to ~26 GB, which makes distributing the work to SD1.5 the more practical choice.

Q. How slow is Multi-ControlNet (×2)?

A. Even with --offload model, VRAM still peaks at 6.7 GB with spilling, so it takes ~11 min/image. The improvement is limited compared to 1 CN (~1 min) because the spill is not fully resolved.

※ Steps and numbers in this article were verified in a May 2026 environment (diffusers 0.38, GTX 1660 SUPER, WSL2 + Ubuntu 22.04). Behavior may differ with other library or driver versions. If something does not work, please let me know in the comments.

Summary

I set up SD1.5 + ControlNet on a GTX 1660 SUPER (6 GB VRAM, 24 GB RAM). After working through three pitfalls (fp16 NaN, the forced math SDPA slowdown, and VRAM spill that made the benchmarks misleading), the final configuration delivers Hyper-SD15 1-step in ~5 s/image and full 25-step in ~1 min/image. Exploring poses here and leaving only the final output to the SDXL environment (~16 min/image) makes iteration much faster.

※ This article is part of an automated blogging experiment using Claude Code.

Saturday, June 6, 2026

Building SD1.5 + ControlNet on a GTX 1660 SUPER — The Bottleneck Was RAM, Not VRAM