Stable Diffusion Model Line: Evolution, Architecture, and Ecosystem
Executive Summary
Stable Diffusion encompasses a series of open (or semi-open) text-to-image models developed by Stability AI and collaborators. Its evolution spans from the original v1.x models (released August 2022) through v2.x (late 2022), the large SDXL models (mid–late 2023), and the new transformer-based SD3 and SD3.5 models (announced 2024). Each generation brought new technical designs and licensing policies. Stability AI’s strategy shifted from fully open releases to gated “community license” releases with revenue thresholds, notably for SD3/3.5.
Company context: Stability AI (founded 2019) funded and helped engineer Stable Diffusion alongside academic and industry collaborators. The company grew rapidly on Stable Diffusion’s success, raising funding rounds in 2022–24 and weathering leadership changes (founder Emad Mostaque was replaced; Sean Parker joined the board). By late 2024, Stability AI was pursuing an enterprise strategy, introducing revenue-based licenses to monetize its ecosystem.
Availability and licensing: All major models are distributed via Hugging Face (and some via Stability AI’s own studios or API) but with evolving terms. The v1.x and SDXL models use CreativeML Open RAIL++ licensing (free for most uses). In contrast, SD3 and SD3.5 models use a new Stability AI Community License: free for research, non-commercial, and small-business (below US$1M revenue), but requiring enterprise licensing above that threshold. The SD3/3.5 checkpoints are gated on Hugging Face: users must log in, accept terms, and provide contact info before downloading (as documented on the model pages). SDXL weights remain publicly downloadable (with standard license click-through). Cloud integrations exist: e.g. SD3.5 Large is available via Amazon Bedrock (AWS) and NVIDIA NIM, subject to the same gating/licensing.
Technical specifications: The core architectural shift across the line is from UNet-based diffusion to diffusion transformers. All pre-SD3 models (v1.x, v2.x, SDXL) are latent diffusion models: a pixel-space autoencoder (typically a 4-channel latent at 8× downsampling) feeds a UNet denoiser. They condition on CLIP-family text encoders via cross-attention (v1.x: CLIP ViT-L/14; v2.x: OpenCLIP ViT-H/14; SDXL: OpenCLIP ViT/G plus CLIP ViT/L). SDXL also added a smaller “refiner” UNet for polishing. In contrast, SD3/3.5 are Multimodal Diffusion Transformers (MMDiT). They encode text with three pretrained encoders (CLIP ViT/L, OpenCLIP ViT/G, and a large T5) and interleave text and image tokens in joint attention, giving bidirectional information flow between the two modalities. Key specs by version:
- SD v1.x: ~860M-parameter UNet (the original latent-diffusion denoiser). Text encoder: CLIP ViT-L/14. Latent: 4 channels, 8× downsampling. Trained on a LAION-2B English subset filtered by aesthetic score. Uses classifier-free guidance. VRAM: ~7–8 GB for 512×512 at ~50 steps.
- SD v2.x (512/768): ~860M-parameter UNet (similar scale). Text encoder: OpenCLIP ViT-H/14 (1024-dim context). Latent: 8× downsampling. Trained on filtered LAION-5B (aesthetic score ≥5, explicit-NSFW filter) with a v-prediction objective. Native 512 and 768 resolution variants, plus depth2img, inpainting, and upscaling variants (each with its own conditioning channels). VRAM: roughly 2–3× higher in 768 mode (the latent is 1.5× larger per side).
- SDXL (v1.0, v1.0+refiner): 2.6B-parameter core UNet (about 3× larger than SD v2). Dual text encoders: OpenCLIP ViT/G and CLIP ViT/L (per the model card). Native 1024×1024 resolution; still operates in latent space with 8× downsampling and 4-channel latents. Training dataset: a proprietary multi-aspect set (LAION-5B-derived) of >540M high-quality images with aesthetic score >4.3 (the report calls it a “massive corpus”; exact composition is undisclosed). SDXL uses two-stage diffusion: a base UNet followed by a separate “refiner” UNet. Conditioning also includes the original image size and crop coordinates (micro-conditioning). VRAM: large; with base and refiner loaded concurrently, ~12–20 GB is typical, so use offloading or TensorRT/compile optimizations. An invisible watermark is applied by default in software.
- SD3 Medium: Introduces the MMDiT backbone. Stability AI describes Medium as a ~2B-parameter model (the SD3 paper covers a family spanning roughly 800M–8B parameters). Uses three frozen encoders: CLIP ViT/L, OpenCLIP ViT/G, and T5-XXL (the CLIP tokenizers use a ~50k-token BPE vocabulary). Latent: 4 channels, 8× downsampling. Training data: 1.0 billion public and synthetic images for pretraining, fine-tuned on 30M highly aesthetic images and 3M preference-data images. No separate refiner stage; output quality is single-pass. VRAM: medium-high; users can drop the T5 encoder or use an FP8 variant to save memory (packaging variants are provided).
- SD3.5 Large: ~8.1B parameters (per the announcement). Same triple-encoder scheme, with QK-normalization and dual attention in the transformer blocks. No refiner; a distilled “Turbo” version provides 4-step inference. Training data not detailed. Native resolution ~1 megapixel (1024×1024). VRAM: very high (~19 GB baseline, reduced to ~11 GB with TensorRT FP8).
- SD3.5 Medium: Uses the MMDiT-X variant (dual-attention layers in the first 12 blocks); announced at ~2.5B parameters. Otherwise similar to SD3 Medium. TensorRT optimizations yield ~1.7× speedup (VRAM figures not given).
Figure: Latent diffusion architecture of early Stable Diffusion (SD v1.x/2.x). A U-Net denoises an 8× downsampled latent; conditioning (text features) enters via cross-attention blocks.
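The 4-channel, 8×-downsampled latent space described above fixes the tensor shapes at every stage; a small helper (hypothetical, for illustration only) makes the arithmetic concrete:

```python
def latent_shape(height, width, channels=4, downsample=8):
    """Latent tensor shape for the SD-family VAEs (4 channels, 8x downsampling)."""
    if height % downsample or width % downsample:
        raise ValueError("image dimensions must be multiples of the downsample factor")
    return (channels, height // downsample, width // downsample)

print(latent_shape(512, 512))    # SD v1/v2 native: (4, 64, 64)
print(latent_shape(1024, 1024))  # SDXL/SD3 native: (4, 128, 128)
```

A 1024×1024 image thus touches a 128×128 latent grid, four times the spatial elements of the 64×64 grid used at 512×512, which is one reason SDXL/SD3 inference is so much heavier per step.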
Release Timeline and Key Events
Key release and company events are summarized below.
Timeline Table:
| Date | Event | Details & Source |
|---|---|---|
| 2022-08-22 | SD v1 public release | Latent diffusion UNet with CLIP conditioning; v1.4 checkpoint released publicly; demo launched. |
| 2022-11 | SD v2.0 release | New OpenCLIP encoder, 512/768 modes, refiltered dataset; depth2img and inpainting variants added. |
| 2023-07 | SDXL 1.0 release | Large UNet (2.6B), dual text encoders, base+refiner pipeline, 1024px. |
| 2024-03 | SD3 (MMDiT) paper | Introduced transformer backbone, rectified flow; source code/models promised. |
| 2024-06-12 | SD3 Medium release | Gated release of Medium model. |
| 2024-10-22 | SD3.5 release | SD3.5 Large & Turbo (4-step) released; 8B param; gated distribution. |
| 2024-11 | SD3.5 Medium release | SD3.5 Medium checkpoint added on HF. |
| 2024-12 | SD3.5 on AWS & NVIDIA | SD3.5 Large deployed on Amazon Bedrock, NVIDIA NIM (with gating). |
| 2025-05 | SD3.5 TensorRT optimization | 2× speed, 40% less VRAM (11GB) on RTX GPUs. |
| 2025-XX | Legal rulings/lawsuits | UK court rules SD model weights not direct infringing copies; ongoing US cases. |
| (2019–2025) | Company events | Founding (2019), Emad Mostaque’s departure (2024), CEO/board changes, $101M funding round (2022). |
Availability, Gating, and Licensing
Each model generation’s distribution and license differs:
- SD v1.x: Weights were released openly on Hugging Face under the CreativeML Open RAIL-M license (later superseded by OpenRAIL++ variants). No gate beyond agreeing to the license; code was open on GitHub.
- SD v2.x: Similarly open code, with the updated CreativeML OpenRAIL++ license. The Hugging Face weight repository requires accepting the license terms, but there is no contact-info gate; users simply check a box.
- SDXL 1.0: Open release under CreativeML OpenRAIL++. Hugging Face weights download without extra gating (just a license click-through). The GitHub “generative-models” repo provides the VAE and sample code.
- SD3 & SD3.5: Weights are gated on Hugging Face. The model pages explicitly state that users must log in, complete a license-acceptance form, and share contact information before downloading. The underlying license is the Stability AI Community License, introduced in July 2024, which permits free use for research, non-commercial purposes, and “qualified small commercial” use (revenue below US$1M), and requires an enterprise license above that threshold; it is effectively a revenue-triggered license. The model cards reference this threshold and direct higher-tier users to contact Stability AI.
Cloud availability: Stability AI offers SD3/3.5 through its API and “Stable Assistant” products, and partners have integrated them (e.g., SD3.5 Large is available on AWS Bedrock and NVIDIA NIM). In all cases, the same gating and license apply. One consequence is that many community scripts changed: Hugging Face’s diffusers docs warn that “the model is gated…you first need to go to the Stable Diffusion 3.5 Large Hugging Face page, fill in the form and accept the gate. Then log in using huggingface-cli.”
Openness comparison: V1.x, V2.x, and SDXL had fully public weights and open licenses (RAIL++), whereas SD3/3.5 are “open” in source but gated in access and under a revenue-restricted license. All major releases are on Hugging Face (with necessary gating) and Stability AI’s GitHub for code. The SD3.5 launch also provided an “inference-only” GitHub repo (Stability-AI/sd3.5) that automates downloading the gated weights.
Technical Specifications by Version
The table below summarizes core technical specs (architecture, resolution, text encoders, etc.) and contrasts each major Stable Diffusion version. Unspecified or undisclosed figures are marked “(n.d.)”.
| Version | Architecture / Encoder | Cond. Text Encoders | Latent/VAE (channels, downsample) | Native Res. / Guidance | Training Data (size/filter) | License / Access |
|---|---|---|---|---|---|---|
| SD 1.x (v1.4/1.5) | UNet (~860M params) | CLIP ViT-L/14 (768-dim) | 4 channels, 8× downsample | 512×512 | LAION-5B English subset, filtered by aesthetic score | OpenRAIL (open); HF weights open |
| SD 2.0/2.1 | UNet (similar scale) | OpenCLIP ViT-H/14 (1024-dim) | 4 channels, 8× downsample | 512/768 | LAION-5B high-aesthetic subset, NSFW filtered | OpenRAIL++ (open, HF gated accept) |
| SDXL 1.0 | UNet (≈2.6B params) | CLIP ViT/L + OpenCLIP ViT/G | 4 ch, 8× downsample (in latent) | 1024×1024 | 540M+ images (multi-aspect LAION subset, aesthetics >4.3) (private) | OpenRAIL++ (open) |
| SDXL refiner | UNet refinement stage | (uses same encoders) | - | 1024×1024 | Trained on faces and details (internal) | - |
| SD3 Medium | MMDiT Transformer (~2B) | CLIP ViT/L, OpenCLIP ViT/G, T5-XXL | 4 ch, 8× downsample | ~1024×1024 (guided) | 1.0B pretrain (synth+public) + 30M aesthetic + 3M preference (stability.ai) | Community License (gated) |
| SD3.5 Large | MMDiT Transformer (8.1B) | CLIP ViT/L, OpenCLIP ViT/G, T5-XXL | 4 ch, 8× downsample | 1024×1024 | (not disclosed) | Community License (gated) |
| SD3.5 Large (Turbo) | Same + ADD Distillation | Same | Same | 1024×1024 | (Distilled version for speed) | Community License (gated) |
| SD3.5 Medium | MMDiT-X Transformer (~2.5B) | CLIP/L, OpenCLIP/G, T5-XXL | 4 ch, 8× downsample | 1024×1024 | (not disclosed) | Community License (gated) |
Each model uses classifier-free guidance by default: the text conditioning is randomly dropped for a fraction of training examples, so the model learns both conditional and unconditional predictions, which are then combined at sampling time. The latent diffusion diagram above shows the v1/v2/XL flow.
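The combination rule applied at sampling time can be sketched in a few lines of numpy (toy tensors, not real model outputs):

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    # Classifier-free guidance: extrapolate from the unconditional prediction
    # toward the conditional one. Typical scales: ~7-15 for SD1/2, ~4-7 for SD3.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

eps_u = np.zeros((4, 64, 64))  # stand-in unconditional noise prediction
eps_c = np.ones((4, 64, 64))   # stand-in conditional noise prediction
guided = cfg_combine(eps_u, eps_c, guidance_scale=7.0)
```

At a scale of 1.0 the result is just the conditional prediction; larger values trade sample diversity for prompt adherence, which is why distilled guidance-free models like SD3.5 Turbo expose no such knob.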
Compute and Inference Cost
Larger models require significantly more compute. Rough estimates:
- SD v1.5: A 512×512 image requires ~50 diffusion steps. With an 860M-parameter UNet operating on a 64×64 latent (8× downsampling), this is feasible on a 6–8 GB GPU at reduced speed.
- SDXL 1.0: 1024×1024 generation (a ~128×128 latent) with a 2.6B UNet makes inference roughly 10× heavier per step. Running base+refiner together needs ≥10–12 GB VRAM at 28–50 steps. Users often offload or use TensorRT/quantization (NVIDIA reports a 40% memory cut).
- SD3.5 Large: 1024×1024, 8B parameters. Official tests: ~19 GB VRAM for the base model at 32 steps; TensorRT FP8 reduces this to ~11 GB (a 2.3× speedup). The reference release runs in bfloat16.
- SD3.5 Turbo: 4–8 steps only; GPU time is roughly 1/5 of the base model for a given quality target, at the cost of classifier-free guidance (the distilled model runs without it).
- SD3 Medium and SD3.5 Medium: ~2–2.5B parameters; TensorRT speedups (~1.7×) reduce the footprint, but detailed metrics are unpublished.
Generation flow: prompt text + initial latent noise → text encoders (CLIP/T5) → MMDiT transformer (iterative denoising) → VAE decoder → output image.
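This flow can be mocked end-to-end with stand-in components (every function here is a hypothetical placeholder that only tracks tensor shapes, not a real model):

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(prompt, dim=64):
    # Stand-in for the CLIP/T5 encoders: one embedding vector per token.
    return rng.standard_normal((len(prompt.split()), dim))

def mmdit_step(latent, text_emb):
    # Stand-in for one denoising step; a real MMDiT attends jointly over
    # text and image tokens to predict the update. Here: a toy shrink.
    return 0.9 * latent

def vae_decode(latent, downsample=8):
    # Stand-in VAE decoder: 4-channel latent back up to RGB pixels.
    _, h, w = latent.shape
    return np.zeros((3, h * downsample, w * downsample))

latent = rng.standard_normal((4, 128, 128))        # 1024/8 = 128 latent grid
text_emb = encode_text("an astronaut riding a horse")
for _ in range(4):                                 # e.g. Turbo's 4 steps
    latent = mmdit_step(latent, text_emb)
image = vae_decode(latent)
print(image.shape)  # (3, 1024, 1024)
```

The point is structural: all the heavy iteration happens on the small latent grid, and the VAE decode back to pixel space runs only once per image.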
Table: Pros/Cons and costs by version
| Model | Pros (Quality, Features) | Cons (Cost, Issues) | GPU VRAM (@32 steps, ~1024px) | Notes |
|---|---|---|---|---|
| SD v1.5 | Smallest; very fast; extensive finetune support | Lower fidelity on text; small native resolution | ~7 GB (an 8 GB GPU suffices) | — |
| SD v2.1 | Better non-human detail; added inpainting/depth modes | 2× VRAM for 768px mode; people-rendering caveats | ~8–10 GB (768) | — |
| SDXL 1.0 | Highest detail, natural composition; refiner improves faces | Very high resource needs; refiner doubles load; complex distribution | ~10–12 GB (base); +10 GB (refiner) | Invisible watermark included |
| SD3 Medium | Major leap in prompt fidelity and typography; encoded knowledge via T5 | High memory due to T5; slower per step; gating makes access less trivial | ~12+ GB (varies by batch) | Optional no-T5 variant, FP8 T5, etc. |
| SD3.5 Large | State-of-art quality; distilled Turbo for speed | Extremely high resource needs; gating/licensing; no built-in refiner | ~19 GB baseline; 11 GB w/ TensorRT | Turbo (4 steps) exists but no guidance; still gated |
| SD3.5 Turbo | Near-instant generation (4-8 steps) | Lower guidance (no classifier-free) vs. base | ~11 GB (FP8) | Distilled |
| SD3.5 Medium | (expected similar to SD3 Medium) | (expected similar to SD3 Medium) | (unknown) | New architecture improvements (MMDiT-X) |
Unspecified parameters: Exact size of “Medium” variants not given publicly. VRAM notes are from official benchmarks or recommended setups.
Ecosystem, Adoption, and Use Cases
Stable Diffusion’s impact is vast. Open-source code and weights have spurred countless extensions. Notable ecosystem components:
- Diffusers library: Hugging Face’s diffusers library fully supports SD2, SDXL, SD3, and SD3.5 pipelines, including custom schedulers, compile optimizations, and integration of ControlNet and LoRA. Official docs (and the Hugging Face blog) provide SD3/SD3.5 usage examples with diffusers.
- Web UIs: The AUTOMATIC1111 web UI (161k stars) and InvokeAI (27k stars) initially targeted v1/v2; third-party forks have since added SDXL and SD3 support. ComfyUI (104k stars) natively supports SD3/3.5 and has become popular for advanced pipelines.
- Hugging Face: The model hub shows enormous usage. As of Feb 24, 2026: SD3.5 Medium (stable-diffusion-3.5-medium) was downloaded ~131,993 times in the prior month, with 2.6k likes; SD3.5 Large ~42k/month, 2.0k likes; SD3 Medium ~5k/month, 4.9k likes. By comparison, SDXL base is at ~2 million/month with 7.5k likes. The SD3.5 hub lists hundreds of fine-tuned checkpoints and LoRAs. There are also Stability AI-provided TensorRT and ONNX-quantized variants (for AMD/NVIDIA) and optimized ControlNet versions (e.g. “stable-diffusion-3.5-controlnets-tensorrt” on HF), demonstrating active adaptation.
- Cloud/Commercial: In addition to Bedrock/NIM, SD models appear in SageMaker containers, Google Colab notebooks, and proprietary apps (e.g. Canva uses SD in its image tools). Reports cite enterprises in design, marketing, gaming, and film leveraging custom SD3/3.5 pipelines.
| Metric | Value |
|---|---|
| SDXL base downloads | 2,062,317 |
| SD3.5 Medium downloads | 131,993 |
| AUTOMATIC1111 stars | 161,000 |
| ComfyUI stars | 104,000 |
| Diffusers stars | 32,800 |
Legal and Ethical Issues
Stable Diffusion’s dataset and outputs have been the subject of intense debate:
-
Training data: Early models were trained on LAION’s web-scraped image–text pairs, whose licensing status is mixed and largely unverified. Investigative reports later found that LAION contained identifiable people and even minors, raising privacy concerns. SD3 model cards emphasize “red teaming” and claim toxic or illegal content was removed, but independent audits (and the appearance of Getty-style watermarks in outputs) suggest these filters are imperfect.
-
Copyright lawsuits: Getty Images sued Stability AI in the UK and US in early 2023. In 2025, the UK High Court ruled that Stable Diffusion’s model weights are not “copies” of Getty’s photos, since the model does not store pixel-level images. This was a narrow outcome: Getty prevailed only on limited trademark claims tied to its watermarks appearing in outputs. Similar suits by artists (Andersen v. Stability AI, etc.) are underway in the US, with motions to dismiss granted in part. (Separately, Midjourney and DeviantArt are defendants in related cases, per Reuters.) The gist: the legality of training on scraped art remains unsettled, especially outside limited text-and-data-mining exceptions.
-
Ethical use: SD ships with safety tooling (Stable Diffusion 1.4+ included an NSFW classifier by default). SDXL and SD3 apply invisible watermarks (via a watermarking SDK) to tag AI-generated images, but watermark-removal attacks have been demonstrated by researchers. The Community License also restricts uses such as automated face recognition, biometric analysis, surveillance, and illegal content generation, as enumerated in its terms.
Practical Guidance for Users
Access: To download gated models (SD3/3.5), create a Hugging Face account and agree to the license on the model page. Then use huggingface-cli login before running any pipeline. The diffusers example for SD3.5 shows precisely this step. Stability AI provides a GitHub script (Stability-AI/sd3.5) to automate fetching the required files.
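As a hedged sketch, the whole access flow typically looks like the following shell session; the repo id matches the Hugging Face listing, and the huggingface-cli tool ships with the huggingface_hub package (exact package versions and cache paths may differ on your system):

```shell
# One-time setup: first accept the license on the model page in a browser.
pip install -U huggingface_hub diffusers transformers

# Authenticate with a read token from your Hugging Face account settings.
huggingface-cli login

# Optionally pre-fetch the gated checkpoint into the local cache.
huggingface-cli download stabilityai/stable-diffusion-3.5-large
```

After this, diffusers pipelines can load the model from the local cache without prompting again.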
Inference: Use mixed precision (fp16 or bf16) and frameworks like Torch 2.0 with torch.compile or TensorRT for speed. For example, SD3.5 Large inference is demonstrated in bfloat16 (the originally published precision). Use classifier-free guidance values of ~4–7 for SD3 models (versus ~7–15 for SD1/2). On VRAM-limited hardware, load one model at a time (e.g. don’t load the refiner alongside SDXL by default) or use offloading. On 8–12 GB GPUs, SDXL may run at reduced batch size, whereas SD3.5 Large typically needs ~16 GB or more without optimization.
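The VRAM figures above follow largely from parameter count times precision; a rough estimator (weights only, ignoring activations and the text encoders) makes the arithmetic explicit:

```python
def weight_gb(n_params, bytes_per_param):
    # Weight footprint only; activations, T5/CLIP encoders, and the VAE add more.
    return n_params * bytes_per_param / 1e9

SD35_LARGE = 8.1e9  # parameters, per the announcement
print(round(weight_gb(SD35_LARGE, 2), 1))  # bf16/fp16: 2 bytes/param -> 16.2 GB
print(round(weight_gb(SD35_LARGE, 1), 1))  # fp8: 1 byte/param -> 8.1 GB
```

These weight-only numbers are consistent with the reported ~19 GB baseline and ~11 GB TensorRT FP8 figures once activations and the encoders are included.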
Finetuning and Plugins: LoRA fine-tuning works with any version, but LoRAs are not cross-compatible between architecture types (e.g. a v1.5 LoRA won’t plug into SD3). ControlNet requires a separate model variant per base; Stability AI released SD3.5-specific ControlNets (TensorRT-optimized) on HF. Many community adapters (LoRAs, DreamBooth models) exist, as seen in the HF model trees. Always ensure LoRAs or ControlNets match the base model version.
All data above is drawn from official sources (model cards, papers) or reputable analyses. Figures marked n.d. were not specified by the sources. We cite training data and VRAM from Stability AI announcements and papers; where not public, we note it. This snapshot is accurate as of Feb 24, 2026.