Stable Diffusion Model Line: Evolution, Architecture, and Ecosystem

Executive Summary

Stable Diffusion encompasses a series of open (or semi-open) text-to-image models developed by Stability AI and collaborators. Its evolution spans from the original v1.x models (released August 2022) through v2.x (late 2022), the large SDXL models (mid–late 2023), and the new transformer-based SD3 and SD3.5 models (announced 2024). Each generation brought new technical designs and licensing policies. Stability AI’s strategy shifted from fully open releases to gated “community license” releases with revenue thresholds, notably for SD3/3.5.

Company context: Stability AI (founded 2020) was the primary backer and engineer of Stable Diffusion. The company grew rapidly on Stable Diffusion's success, raising funding rounds in 2022–24 and weathering leadership changes (founder Emad Mostaque departed in 2024; Sean Parker joined the board). By late 2024 Stability AI was pursuing an enterprise strategy, introducing revenue-based licenses to monetize its ecosystem.

Availability and licensing: All major models are distributed via Hugging Face (and some via Stability AI's own tools or API), but under evolving terms. The v1.x models use the CreativeML OpenRAIL-M license and SDXL uses OpenRAIL++-M (both free for most uses). In contrast, SD3 and SD3.5 use the newer Stability AI Community License: free for research, non-commercial use, and small businesses (below US$1M annual revenue), with an enterprise license required above that threshold. The SD3/3.5 checkpoints are gated on Hugging Face: users must log in, accept the terms, and provide contact information before downloading (as documented on the model pages). SDXL weights remain publicly downloadable (with a standard license click-through). Cloud integrations exist: e.g. SD3.5 Large is available via Amazon Bedrock (AWS) and NVIDIA NIM, subject to the same gating/licensing.

Technical specifications: The core architectural shift is from diffusion UNets to diffusion transformers. All pre-3 models (v1.x, v2.x, SDXL) are latent diffusion models: a pixel-space autoencoder (4-channel latent at 8× downsampling) feeds a UNet denoiser. They condition on text via cross-attention (v1.x: CLIP ViT-L/14; v2.x: OpenCLIP ViT-H/14; SDXL: OpenCLIP ViT-bigG/14 plus CLIP ViT-L/14). SDXL added that second text encoder and a separate "refiner" UNet for polishing. In contrast, SD3/3.5 are Multimodal Diffusion Transformers (MMDiT). They encode text with three pretrained encoders (CLIP ViT-L/14, OpenCLIP ViT-bigG/14, and T5-XXL), use a 16-channel autoencoder, and interleave text and image tokens so that information flows bidirectionally between the two modalities. Key specs by version are tabulated below.
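The latent-space geometry described above can be made concrete with a few lines of plain Python (no ML libraries). This is a sanity-check sketch, assuming the 8× downsampling factor common to all versions and the per-version latent channel counts discussed in this article (4 for v1/v2/SDXL, 16 for SD3/3.5); the function name is illustrative.

```python
# Compute the latent tensor shape for a given pixel resolution.
# Channel counts: 4 for the v1/v2/SDXL VAE, 16 for the SD3/3.5 autoencoder.
LATENT_CHANNELS = {"sd1": 4, "sd2": 4, "sdxl": 4, "sd3": 16, "sd3.5": 16}
DOWNSAMPLE = 8  # all versions use an 8x spatial downsampling autoencoder

def latent_shape(version: str, height: int, width: int) -> tuple:
    """Return (channels, latent_height, latent_width) for one image."""
    if height % DOWNSAMPLE or width % DOWNSAMPLE:
        raise ValueError("pixel dimensions must be multiples of 8")
    return (LATENT_CHANNELS[version], height // DOWNSAMPLE, width // DOWNSAMPLE)

print(latent_shape("sd1", 512, 512))      # (4, 64, 64)
print(latent_shape("sd3.5", 1024, 1024))  # (16, 128, 128)
```

This is why 1024px SDXL/SD3 generation works on a 128×128 latent grid rather than a million-pixel image, which is the key efficiency win of latent diffusion.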

Figure: Latent diffusion architecture of early Stable Diffusion (SD v1.x/2.x). A U-Net denoises an 8× downsampled latent; conditioning (text features) enters via cross-attention blocks.

Release Timeline and Key Events

Key release and company events are summarized below.

Timeline Table:

| Date | Event | Details & Source |
|---|---|---|
| 2022-08-22 | SD v1 public release | Latent diffusion UNet with CLIP text conditioning; first public release; demo launched. |
| 2022-11-24 | SD v2.0 release | New OpenCLIP encoder, 512/768 modes, refined dataset; depth and inpainting variants added (v2.1 followed in December 2022). |
| 2023-07 | SDXL 1.0 release | Large UNet (~2.6B parameters), dual text encoders, base+refiner pipeline, native 1024px. |
| 2024-03 | SD3 (MMDiT) paper | Introduced transformer backbone and rectified flow; code and models promised. |
| 2024-06-12 | SD3 Medium release | Gated release of the Medium model. |
| 2024-10-22 | SD3.5 release | SD3.5 Large (8B params) and Large Turbo (4-step distilled) released; gated distribution. |
| 2024-10/11 | SD3.5 Medium release | SD3.5 Medium checkpoint added on Hugging Face. |
| 2024-12 | SD3.5 on AWS & NVIDIA | SD3.5 Large deployed on Amazon Bedrock and NVIDIA NIM (same gating applies). |
| 2025-05 | SD3.5 TensorRT optimization | ~2× speed and ~40% less VRAM (11 GB) on RTX GPUs. |
| 2025 | Legal rulings / lawsuits | UK court rules SD model weights are not themselves infringing copies; US cases ongoing. |
| 2020–2025 | Company events | Founding (2020), $101M round (2022), Emad Mostaque's departure (2024), CEO/board changes. |

Availability, Gating, and Licensing

Each model generation's distribution and license terms differ:

Cloud availability: Stability AI offers SD3/3.5 through its API and "Stable Assistant" products, and partners have integrated them (e.g. SD3.5 Large is available on AWS Bedrock and NVIDIA NIM). In all cases the same gating and license apply. One consequence is that many community scripts changed: Hugging Face's diffusers docs warn that "the model is gated…you first need to go to the Stable Diffusion 3.5 Large Hugging Face page, fill in the form and accept the gate. Then log in using huggingface-cli."

Openness comparison: v1.x, v2.x, and SDXL have fully public weights under open RAIL licenses, whereas SD3/3.5 are open-weight but gated in access and governed by a revenue-restricted license. All major releases are on Hugging Face (with gating where required), with code on Stability AI's GitHub. The SD3.5 launch also provided an inference-only GitHub repo (Stability-AI/sd3.5) that automates downloading the gated weights.

Technical Specifications by Version

The table below summarizes core technical specs (architecture, resolution, text encoders, etc.) and contrasts each major Stable Diffusion version. Unspecified or undisclosed figures are marked “(n.d.)”.

| Version | Architecture | Text encoders | Latent / VAE (channels, downsample) | Native res. | Training data (size/filter) | License / Access |
|---|---|---|---|---|---|---|
| SD 1.x (v1.4/1.5) | UNet (~860M params) | CLIP ViT-L/14 (768-dim) | 4 ch, 8× | 512×512 | LAION-5B English subset, aesthetic-filtered | CreativeML OpenRAIL-M (open); HF weights open |
| SD 2.0/2.1 | UNet (similar scale) | OpenCLIP ViT-H/14 (1024-dim) | 4 ch, 8× | 512/768 | LAION-5B high-aesthetic subset, NSFW-filtered | OpenRAIL++ (open; HF license accept) |
| SDXL 1.0 | UNet (~2.6B params) | CLIP ViT-L/14 + OpenCLIP ViT-bigG/14 | 4 ch, 8× | 1024×1024 | 540M+ images (multi-aspect LAION subset, aesthetics >4.3; internal) | OpenRAIL++ (open) |
| SDXL refiner | UNet refinement stage | (same encoders) | — | 1024×1024 | Trained on faces and fine detail (internal) | — |
| SD3 Medium | MMDiT transformer (2B) | CLIP ViT-L/14, OpenCLIP ViT-bigG/14, T5-XXL | 16 ch, 8× | ~1024×1024 | 1.0B pretrain (synthetic + public) + 30M aesthetic + 3M preference (stability.ai) | Community License (gated) |
| SD3.5 Large | MMDiT transformer (8.1B) | same three encoders | 16 ch, 8× | 1024×1024 | (n.d.) | Community License (gated) |
| SD3.5 Large Turbo | 8.1B + ADD distillation | same | 16 ch, 8× | 1024×1024 | Distilled from SD3.5 Large | Community License (gated) |
| SD3.5 Medium | MMDiT-X transformer (~2.5B) | same | 16 ch, 8× | 1024×1024 | (n.d.) | Community License (gated) |

Each model uses classifier-free guidance (CFG) at inference. To enable this, training randomly drops the text conditioning for a fraction of steps so the model also learns an unconditional prediction; at sampling time the conditional and unconditional predictions are combined with a guidance scale. SD3/3.5 additionally train with a rectified-flow objective rather than the earlier noise-prediction loss. The latent diffusion diagram above shows the v1/v2/XL flow.
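The CFG combination step is a one-line extrapolation. A minimal sketch (NumPy stand-ins for the two noise predictions; the function name is illustrative):

```python
import numpy as np

def cfg_combine(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the text-conditioned one by the guidance scale."""
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)

# Toy 2-D "predictions": scale 1.0 reproduces the conditional prediction,
# larger scales push further in the text-conditioned direction.
uncond = np.array([0.0, 0.0])
cond = np.array([1.0, -1.0])
print(cfg_combine(uncond, cond, 7.0))  # [ 7. -7.]
```

At guidance scale 0 the model samples unconditionally; this is why distilled Turbo variants that run "without guidance" skip the second (unconditional) forward pass entirely.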

Compute and Inference Cost

Larger models require significantly more compute: per-image cost grows with parameter count, resolution, and step count, so the figures below are rough estimates.

Generation flow: prompt text + initial latent noise -> text encoders (CLIP/T5) -> MMDiT transformer (iterative denoising) -> VAE decoder -> output image
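To make that data flow concrete, here is a toy mock of the stages in plain Python. Every function here is an illustrative stand-in, not a diffusers API; the 768/1280 figures match the CLIP-L and OpenCLIP bigG embedding widths, and the latent shape assumes SD3's 16-channel, 8×-downsampled autoencoder.

```python
# Mock stages of the SD3-style generation flow; each stage returns
# shape metadata instead of real tensors, purely to show the hand-offs.
def encode_text(prompt):
    # Three encoders feed the transformer (dims illustrative)
    return {"prompt": prompt, "clip_l": 768, "clip_g": 1280, "t5": "tokens"}

def denoise(latent_shape, text_features, steps=28):
    # MMDiT iteratively denoises the latent, conditioned on the text tokens
    return {"latent": latent_shape, "steps": steps, "cond": text_features["prompt"]}

def vae_decode(latent):
    c, h, w = latent["latent"]
    return ("RGB", h * 8, w * 8)  # 8x upsampling back to pixel space

features = encode_text("a lighthouse at dusk")
latent = denoise((16, 128, 128), features)
print(vae_decode(latent))  # ('RGB', 1024, 1024)
```

Note that the text encoders run once per prompt while the transformer runs once per denoising step, which is why step count dominates latency.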

Table: Pros/Cons and costs by version

| Model | Pros (quality, features) | Cons (cost, issues) | GPU VRAM (~32 steps, ~1024px) | Notes |
|---|---|---|---|---|
| SD v1.5 | Smallest; very fast; extensive finetune support | Lower fidelity on text; small native resolution | ~7 GB (8 GB GPUs suffice) | — |
| SD v2.1 | Better non-human detail; added inpainting/depth modes | ~2× VRAM for 768px mode; caveats rendering people | ~8–10 GB (768px) | — |
| SDXL 1.0 | Highest detail, natural composition; refiner improves faces | High resource needs; refiner doubles load; more complex distribution | ~10–12 GB (base); +10 GB (refiner) | Invisible watermark included |
| SD3 Medium | Major leap in prompt fidelity and typography; world knowledge via T5 | High memory due to T5; slower per step; gated access | ~12+ GB (varies by batch) | Optional no-T5 variant, FP8 T5, etc. |
| SD3.5 Large | State-of-the-art quality; distilled Turbo for speed | Very high resource needs; gating/licensing; no built-in refiner | ~19 GB baseline; 11 GB w/ TensorRT | Turbo (4-step) variant runs without CFG; still gated |
| SD3.5 Large Turbo | Near-instant generation (4–8 steps) | No classifier-free guidance vs. base | ~11 GB (FP8) | ADD-distilled |
| SD3.5 Medium | (expected similar to SD3 Medium) | (expected similar to SD3 Medium) | (n.d.) | New architecture improvements (MMDiT-X) |

Unspecified parameters: figures marked (n.d.) were not publicly specified by the sources. VRAM notes are from official benchmarks or recommended setups.
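The VRAM figures above can be cross-checked with back-of-envelope arithmetic on weight storage alone (activations, the T5 encoder, and the VAE add more on top). A small sketch, assuming SD3.5 Large's published 8.1B parameter count:

```python
def weight_gib(params_billion: float, bytes_per_param: int) -> float:
    """Approximate weight memory in GiB (weights only, no activations)."""
    return params_billion * 1e9 * bytes_per_param / 2**30

for dtype, nbytes in [("fp16/bf16", 2), ("fp8", 1)]:
    print(f"SD3.5 Large (8.1B) in {dtype}: {weight_gib(8.1, nbytes):.1f} GiB")
# fp16/bf16 -> ~15.1 GiB, fp8 -> ~7.5 GiB
```

This matches the pattern in the table: ~16 GB-class requirements at bf16, and roughly half that after FP8 quantization, before TensorRT's additional activation-memory savings.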

Ecosystem, Adoption, and Use Cases

Stable Diffusion’s impact is vast. Open-source code and weights have spurred countless extensions. Notable ecosystem components:

| Metric | Value |
|---|---|
| SDXL base downloads (Hugging Face) | 2,062,317 |
| SD3.5 Medium downloads (Hugging Face) | 131,993 |
| AUTOMATIC1111 WebUI GitHub stars | ~161,000 |
| ComfyUI GitHub stars | ~104,000 |
| Diffusers GitHub stars | ~32,800 |

Stable Diffusion's training data and outputs have been the subject of intense legal and ethical debate (see the lawsuit entries in the timeline above).

Practical Guidance for Users

Access: To download gated models (SD3/3.5), create a Hugging Face account and agree to the license on the model page. Then use huggingface-cli login before running any pipeline. The diffusers example for SD3.5 shows precisely this step. Stability AI provides a GitHub script (Stability-AI/sd3.5) to automate fetching the required files.
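The download step described above can also be scripted with the huggingface_hub library (a real library; `hf_hub_download` is its standard file-fetch call). A minimal sketch: the repo id matches the public SD3.5 Large model page, but the exact weight filename is an assumption taken from the model card, and the call only succeeds for an account that has accepted the license gate and logged in.

```python
REPO_ID = "stabilityai/stable-diffusion-3.5-large"  # gated HF repo
WEIGHT_FILE = "sd3.5_large.safetensors"  # filename assumed from the model card

def fetch_weights(token=None) -> str:
    """Download one gated file and return its local path. Requires an HF
    account that accepted the gate, plus a token (huggingface-cli login)."""
    from huggingface_hub import hf_hub_download  # lazy import: network call
    return hf_hub_download(repo_id=REPO_ID, filename=WEIGHT_FILE, token=token)

if __name__ == "__main__":
    print(fetch_weights())  # raises a gated-repo error if the gate is not accepted
```

Without gate acceptance the call fails with a 401/403-style gated-repo error, which is the symptom most commonly reported by community scripts after the SD3 licensing change.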

Inference: Use mixed precision (fp16 or bf16) and PyTorch 2.x with torch.compile, or TensorRT, for speed. For example, SD3.5 Large inference is demonstrated in bfloat16 (the originally published precision). Use classifier-free guidance scales of roughly 4–7 for SD3 models (versus ~7–15 for SD1/2). On VRAM-limited hardware, load one model at a time (e.g. don't load the refiner alongside SDXL by default) or use CPU offloading. On 8–12 GB GPUs, SDXL may run at reduced batch size, whereas SD3.5 Large typically needs ~16 GB or more without optimization.
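As a rough aide-mémoire, the settings above can be collected into a lookup table. The values below are heuristics restated from this article (e.g. zero guidance for the distilled Turbo), not official defaults:

```python
# Rule-of-thumb inference settings per model family; values are
# heuristics from this article, not vendor-recommended defaults.
SETTINGS = {
    "sd1":         {"guidance": 7.5, "steps": 50, "dtype": "fp16"},
    "sd2":         {"guidance": 7.5, "steps": 50, "dtype": "fp16"},
    "sdxl":        {"guidance": 7.0, "steps": 40, "dtype": "fp16"},
    "sd3.5-large": {"guidance": 4.5, "steps": 28, "dtype": "bf16"},
    "sd3.5-turbo": {"guidance": 0.0, "steps": 4,  "dtype": "bf16"},  # distilled: no CFG
}

def settings_for(model: str) -> dict:
    return SETTINGS[model]

print(settings_for("sd3.5-turbo"))
```

Note the pattern: newer rectified-flow models want lower guidance and fewer steps than the v1/v2 UNets, and the distilled Turbo drops guidance entirely.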

Finetuning and plugins: LoRA fine-tuning works with any version, but LoRAs are not cross-compatible between architecture families (e.g. a v1.5 LoRA won't load into SD3). ControlNet requires per-architecture variants; Stability AI released SD3.5-specific ControlNets (TensorRT-optimized) on Hugging Face. Many community adapters exist (LoRAs, DreamBooth models), as seen in the HF model trees. Always ensure LoRAs or ControlNets match the base model version.
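The compatibility rule above amounts to a simple check. A toy sketch (family labels and model keys are illustrative); note this is a necessary condition only, since even within a family the adapter's tensor shapes must match the exact base checkpoint:

```python
# Architecture families: adapters (LoRA/ControlNet) trained against one
# family cannot load into another. Labels here are illustrative.
FAMILY = {
    "sd1.5": "unet-v1", "sd2.1": "unet-v2", "sdxl": "unet-xl",
    "sd3-medium": "mmdit", "sd3.5-large": "mmdit",
}

def maybe_compatible(base_model: str, adapter_base: str) -> bool:
    """True only if both models share an architecture family
    (necessary, not sufficient: layer dimensions must also match)."""
    return FAMILY[base_model] == FAMILY[adapter_base]

print(maybe_compatible("sd1.5", "sdxl"))            # False: v1 UNet vs XL UNet
print(maybe_compatible("sd1.5", "sd3.5-large"))     # False: UNet vs MMDiT
```

In practice, loaders like diffusers raise shape-mismatch errors when this rule is violated, which is the usual symptom of loading a v1.5 LoRA into an SDXL or SD3 pipeline.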

All data above is drawn from official sources (model cards, papers) or reputable analyses. Figures marked n.d. were not specified by the sources. We cite training data and VRAM from Stability AI announcements and papers; where not public, we note it. This snapshot is accurate as of Feb 24, 2026.
