Vision LLMs
LLMs that can work with images, screenshots, or multimodal file inputs in addition to text.
Browse vision LLM tools, filtered by practical fit and workflow needs.
23 matching tools.
Tools in this category
ChatGPT
Freemium cloud LLM for writing, research, and file-based analysis.
- Freemium
- cloud-llm
- chat-assistant
- multimodal
Best for: Daily writing, rewriting, and brainstorming; Quick research and summary work from uploaded files
Claude
Cloud LLM known for strong writing quality and explicit model-improvement controls.
- Freemium
- cloud-llm
- chat-assistant
- multimodal
Best for: Proposal and client communication drafting, Long-form editing and narrative refinement
DeepSeek-VL2
Mixture-of-experts local vision-language family for OCR, documents, charts, and grounded multimodal reasoning.
- Free
- local-inference
- open-weights
- self-hosted
Best for: Private visual document analysis, Multimodal document understanding
Gemini
Freemium cloud LLM with published daily prompt limits and research-focused workflows.
- Freemium
- cloud-llm
- chat-assistant
- multimodal
Best for: Research briefs and competitive scans, Long-form summarization and outline generation
Gemma 3
Multimodal Gemma family with 128K context and broad local deployment options under Gemma terms.
- Free
- local-inference
- open-weights
- on-device
Best for: Local assistants with manageable compliance processes, Multimodal summarization and extraction
Gemma 3n
Device-first Gemma branch with multimodal support, long context, and efficient E2B/E4B variants.
- Free
- local-inference
- open-weights
- on-device
Best for: Multimodal local assistant workflows, Privacy-sensitive visual assistant tasks
Gemma 4
Newest Gemma family with Apache-2.0 licensing, multimodal input, 256K context, and sparse on-device variants.
- Free
- local-inference
- open-weights
- on-device
Best for: Multimodal local assistant workflows, Multimodal document understanding
GLM (Z.AI)
Z.AI’s hosted GLM stack now spanning GLM-5.1, GLM-5V-Turbo, and earlier GLM branches for coding, reasoning, and multimodal workflows.
- Freemium
- cloud-llm
- chat-assistant
- multimodal
Best for: Hosted GLM access across text and vision workloads, Cloud coding assistants and technical drafting
GLM-5V-Turbo
Latest GLM vision branch for multimodal coding, screenshot understanding, GUI agents, and visually grounded execution workflows.
- Freemium
- cloud-llm
- multimodal
- vision
Best for: Screenshot-based coding help, GUI and browser agent workflows
InternVL 3.5
Apache-2.0 multimodal family with many size options and a strong focus on reasoning, OCR, and agent-style visual tasks.
- Free
- local-inference
- open-weights
- self-hosted
Best for: Multimodal internal analysis workflows, Builders experimenting with vision-language tasks
Kimi K2.6
Latest open-weight Kimi model for long-horizon coding, agent swarms, multimodal execution, and large-context local experimentation.
- Free
- local-inference
- open-weights
- reasoning
Best for: Local agentic coding workflows, Multimodal local assistant builds
Le Chat
Mistral’s cloud LLM chat with clear plan-level training defaults and opt-out controls.
- Freemium
- cloud-llm
- chat-assistant
- multimodal
Best for: Multilingual drafting and editing, Teams that require explicit training opt-out controls
Llama 3.2 Vision
Vision-capable Llama model for local image-plus-text understanding tasks.
- Free
- local-inference
- open-weights
- self-hosted
Best for: Local image + text analysis workflows, Multimodal document understanding
Llama 4
Open-weight multimodal family with massive context, but significant policy and license constraints.
- Free
- local-inference
- open-weights
- multimodal
Best for: Large multi-document summarization pipelines, Multimodal internal analysis workflows
MiniCPM-V 2.6
Efficient local VLM with strong OCR, multi-image, and video understanding in an 8B-class footprint.
- Free
- local-inference
- open-weights
- self-hosted
Best for: Private visual document analysis, Multimodal local assistant workflows
Mistral Small 4
Open hybrid Mistral model that combines instruct, reasoning, coding, OCR, and transcription in one 256K-context family.
- Free
- local-inference
- open-weights
- self-hosted
Best for: Multimodal local assistant workflows, Multimodal document understanding
Molmo
Open vision-language family from AI2 focused on strong multimodal quality with Apache-2.0 licensing.
- Free
- local-inference
- open-weights
- self-hosted
Best for: Multimodal document understanding, Private visual document analysis
Phi-3.5 Vision Instruct
Compact MIT-licensed multimodal model for local image, OCR, chart, and multi-image reasoning tasks.
- Free
- local-inference
- open-weights
- on-device
Best for: Multimodal document understanding, Private visual document analysis
Qwen Chat
Alibaba’s cloud Qwen assistant with multilingual support and enterprise-grade API access through Model Studio.
- Freemium
- cloud-llm
- chat-assistant
- multimodal
Best for: Multilingual drafting and rewriting, Cost-controlled cloud assistant operations
Qwen2.5 VL
Multimodal Qwen model family for local vision-language workflows.
- Free
- local-inference
- open-weights
- self-hosted
Best for: Multimodal local assistant workflows, Private visual document analysis
Qwen3.5
Native multimodal Qwen family with sparse MoE scaling, strong agent behavior, and a flagship 397B total / 17B active open model.
- Free
- local-inference
- open-weights
- self-hosted
Best for: Multimodal local assistant workflows, Private visual document analysis
Qwen3.6
Qwen3.6 family covering the hosted Qwen3.6-Plus flagship and the first open-weight Qwen3.6-35B-A3B release.
- Free
- cloud-llm
- local-inference
- open-weights
Best for: Teams choosing between hosted and local Qwen generation, Agentic coding workflows
Qwen3.6-35B-A3B
First open-weight Qwen3.6 model: a 35B total / 3B active multimodal MoE focused on agentic coding and practical local use.
- Free
- local-inference
- open-weights
- apache-2-0
Best for: Local agentic coding workflows, Multimodal local assistant builds