Vision LLMs

LLMs that can work with images, screenshots, or multimodal file inputs in addition to text.

Browse vision LLM tools filtered by practical fit and workflow needs.

23 matching tools.

Tools in this category

ChatGPT

Cloud LLM with a free tier for writing, research, and file-based analysis.

  • Freemium
  • cloud-llm
  • chat-assistant
  • multimodal

Best for: Daily writing, rewriting, and brainstorming; Quick research and summary work from uploaded files

Claude

Cloud LLM known for strong writing quality and explicit model-improvement controls.

  • Freemium
  • cloud-llm
  • chat-assistant
  • multimodal

Best for: Proposal and client communication drafting, Long-form editing and narrative refinement

DeepSeek-VL2

Mixture-of-experts local vision-language family for OCR, documents, charts, and grounded multimodal reasoning.

  • Free
  • local-inference
  • open-weights
  • self-hosted

Best for: Private visual document analysis, Multimodal document understanding

Gemini

Cloud LLM with a free tier, published daily prompt limits, and research-focused workflows.

  • Freemium
  • cloud-llm
  • chat-assistant
  • multimodal

Best for: Research briefs and competitive scans, Long-form summarization and outline generation

Gemma 3

Multimodal Gemma family with 128K context and broad local deployment options under Gemma terms.

  • Free
  • local-inference
  • open-weights
  • on-device

Best for: Local assistants with manageable compliance processes, Multimodal summarization and extraction

Gemma 3n

Device-first Gemma branch with multimodal support, long context, and efficient E2B/E4B variants.

  • Free
  • local-inference
  • open-weights
  • on-device

Best for: Multimodal local assistant workflows, Privacy-sensitive visual assistant tasks

Gemma 4

Newest Gemma family with Apache-2.0 licensing, multimodal input, 256K context, and sparse on-device variants.

  • Free
  • local-inference
  • open-weights
  • on-device

Best for: Multimodal local assistant workflows, Multimodal document understanding

GLM (Z.AI)

Z.AI’s hosted GLM stack, spanning GLM-5.1, GLM-5V-Turbo, and earlier GLM branches for coding, reasoning, and multimodal workflows.

  • Freemium
  • cloud-llm
  • chat-assistant
  • multimodal

Best for: Hosted GLM access across text and vision workloads, Cloud coding assistants and technical drafting

GLM-5V-Turbo

Latest GLM vision branch for multimodal coding, screenshot understanding, GUI agents, and visually grounded execution workflows.

  • Freemium
  • cloud-llm
  • multimodal
  • vision

Best for: Screenshot-based coding help, GUI and browser agent workflows

InternVL 3.5

Apache-2.0 multimodal family with many size options and a strong focus on reasoning, OCR, and agent-style visual tasks.

  • Free
  • local-inference
  • open-weights
  • self-hosted

Best for: Multimodal internal analysis workflows, Builders experimenting with vision-language tasks

Kimi K2.6

Latest open-weight Kimi model for long-horizon coding, agent swarms, multimodal execution, and large-context local experimentation.

  • Free
  • local-inference
  • open-weights
  • reasoning

Best for: Local agentic coding workflows, Multimodal local assistant builds

Le Chat

Mistral’s cloud LLM chat with clear plan-level training defaults and opt-out controls.

  • Freemium
  • cloud-llm
  • chat-assistant
  • multimodal

Best for: Multilingual drafting and editing, Teams that require explicit training opt-out controls

Llama 3.2 Vision

Vision-capable Llama model for local image-plus-text understanding tasks.

  • Free
  • local-inference
  • open-weights
  • self-hosted

Best for: Local image + text analysis workflows, Multimodal document understanding

Llama 4

Open-weight multimodal family with massive context, but significant policy and license constraints.

  • Free
  • local-inference
  • open-weights
  • multimodal

Best for: Large multi-document summarization pipelines, Multimodal internal analysis workflows

MiniCPM-V 2.6

Efficient local VLM with strong OCR, multi-image, and video understanding in an 8B-class footprint.

  • Free
  • local-inference
  • open-weights
  • self-hosted

Best for: Private visual document analysis, Multimodal local assistant workflows

Mistral Small 4

Open hybrid Mistral model that combines instruct, reasoning, coding, OCR, and transcription in one 256K-context family.

  • Free
  • local-inference
  • open-weights
  • self-hosted

Best for: Multimodal local assistant workflows, Multimodal document understanding

Molmo

Open vision-language family from AI2 focused on strong multimodal quality with Apache-2.0 licensing.

  • Free
  • local-inference
  • open-weights
  • self-hosted

Best for: Multimodal document understanding, Private visual document analysis

Phi-3.5 Vision Instruct

Compact MIT-licensed multimodal model for local image, OCR, chart, and multi-image reasoning tasks.

  • Free
  • local-inference
  • open-weights
  • on-device

Best for: Multimodal document understanding, Private visual document analysis

Qwen Chat

Alibaba’s cloud Qwen assistant with multilingual support and enterprise-grade API access through Model Studio.

  • Freemium
  • cloud-llm
  • chat-assistant
  • multimodal

Best for: Multilingual drafting and rewriting, Cost-controlled cloud assistant operations

Qwen2.5 VL

Multimodal Qwen model family for local vision-language workflows.

  • Free
  • local-inference
  • open-weights
  • self-hosted

Best for: Multimodal local assistant workflows, Private visual document analysis

Qwen3.5

Native multimodal Qwen family with sparse MoE scaling, strong agent behavior, and a flagship 397B total / 17B active open model.

  • Free
  • local-inference
  • open-weights
  • self-hosted

Best for: Multimodal local assistant workflows, Private visual document analysis

Qwen3.6

Model family covering the hosted Qwen3.6-Plus flagship and the first open-weight release, Qwen3.6-35B-A3B.

  • Free
  • cloud-llm
  • local-inference
  • open-weights

Best for: Teams choosing between hosted and local Qwen generation, Agentic coding workflows

Qwen3.6-35B-A3B

First open-weight Qwen3.6 model: a 35B total / 3B active multimodal MoE focused on agentic coding and practical local use.

  • Free
  • local-inference
  • open-weights
  • apache-2-0

Best for: Local agentic coding workflows, Multimodal local assistant builds
