Vision LLMs
LLMs that can work with images, screenshots, or multimodal file inputs in addition to text.
Browse Vision LLM tools filtered by practical fit and workflow needs.
15 matching tools.
Tools in this category
ChatGPT
Freemium cloud LLM for writing, research, and file-based analysis.
- Freemium
- cloud-llm
- chat-assistant
- multimodal
Best for: Daily writing, rewriting, and brainstorming; Quick research and summary work from uploaded files
Claude
Cloud LLM known for strong writing quality and explicit model-improvement controls.
- Freemium
- cloud-llm
- chat-assistant
- multimodal
Best for: Proposal and client communication drafting, Long-form editing and narrative refinement
DeepSeek-VL2
Mixture-of-experts local vision-language family for OCR, documents, charts, and grounded multimodal reasoning.
- Free
- local-inference
- open-weights
- self-hosted
Best for: Private visual document analysis, Multimodal document understanding
Gemini
Freemium cloud LLM with published daily prompt limits and research-focused workflows.
- Freemium
- cloud-llm
- chat-assistant
- multimodal
Best for: Research briefs and competitive scans, Long-form summarization and outline generation
Gemma 3
Portable open-weight family with long context and multimodal options under custom terms.
- Free
- local-inference
- open-weights
- on-device
Best for: Local assistants with manageable compliance processes, Multimodal summarization and extraction
GLM (Z.AI)
Z.AI’s cloud GLM assistant and API stack for coding, reasoning, and multilingual business workflows.
- Freemium
- cloud-llm
- chat-assistant
- multimodal
Best for: Cloud coding assistants and technical drafting, Multilingual business operations support
InternVL 3.5
Apache-2.0 multimodal family with many size options and a strong focus on reasoning, OCR, and agent-style visual tasks.
- Free
- local-inference
- open-weights
- self-hosted
Best for: Multimodal internal analysis workflows, Builders experimenting with vision-language tasks
Le Chat
Mistral’s cloud LLM chat with clear plan-level training defaults and opt-out controls.
- Freemium
- cloud-llm
- chat-assistant
- multimodal
Best for: Multilingual drafting and editing, Teams that require explicit training opt-out controls
Llama 3.2 Vision
Vision-capable Llama model for local image-plus-text understanding tasks.
- Free
- local-inference
- open-weights
- self-hosted
Best for: Local image-plus-text analysis workflows, Multimodal document understanding
Llama 4
Open-weight multimodal family with massive context, but significant policy and license constraints.
- Free
- local-inference
- open-weights
- multimodal
Best for: Large multi-document summarization pipelines, Multimodal internal analysis workflows
MiniCPM-V 2.6
Efficient local VLM with strong OCR, multi-image, and video understanding in an 8B-class footprint.
- Free
- local-inference
- open-weights
- self-hosted
Best for: Private visual document analysis, Multimodal local assistant workflows
Molmo
Open vision-language family from AI2 focused on strong multimodal quality with Apache-2.0 licensing.
- Free
- local-inference
- open-weights
- self-hosted
Best for: Multimodal document understanding, Private visual document analysis
Phi-3.5 Vision Instruct
Compact MIT-licensed multimodal model for local image, OCR, chart, and multi-image reasoning tasks.
- Free
- local-inference
- open-weights
- on-device
Best for: Multimodal document understanding, Private visual document analysis
Qwen Chat
Alibaba’s cloud Qwen assistant with multilingual support and enterprise-grade API access through Model Studio.
- Freemium
- cloud-llm
- chat-assistant
- multimodal
Best for: Multilingual drafting and rewriting, Cost-controlled cloud assistant operations
Qwen2.5 VL
Multimodal Qwen model family for local vision-language workflows.
- Free
- local-inference
- open-weights
- self-hosted
Best for: Multimodal local assistant workflows, Private visual document analysis