Vision LLMs

LLMs that can work with images, screenshots, or multimodal file inputs in addition to text.

Browse Vision LLM tools filtered by practical fit and workflow needs.

23 matching tools.

Tools in this category

ChatGPT

Cloud LLM with a free tier for writing, research, and file-based analysis.

Freemium
cloud-llm
chat-assistant
multimodal

Best for: Daily writing, rewriting, and brainstorming; Quick research and summary work from uploaded files

Claude

Cloud LLM known for strong writing quality and explicit model-improvement controls.

Freemium
cloud-llm
chat-assistant
multimodal

Best for: Proposal and client communication drafting, Long-form editing and narrative refinement

DeepSeek-VL2

Mixture-of-experts local vision-language family for OCR, documents, charts, and grounded multimodal reasoning.

Free
local-inference
open-weights
self-hosted

Best for: Private visual document analysis, Multimodal document understanding

Gemini

Cloud LLM with a free tier, published daily prompt limits, and research-focused workflows.

Freemium
cloud-llm
chat-assistant
multimodal

Best for: Research briefs and competitive scans, Long-form summarization and outline generation

Gemma 3

Multimodal Gemma family with 128K context and broad local deployment options under Gemma terms.

Free
local-inference
open-weights
on-device

Best for: Local assistants with manageable compliance processes, Multimodal summarization and extraction

Gemma 3n

Device-first Gemma branch with multimodal support, long context, and efficient E2B/E4B variants.

Free
local-inference
open-weights
on-device

Best for: Multimodal local assistant workflows, Privacy-sensitive visual assistant tasks

Gemma 4

Newest Gemma family with Apache-2.0 licensing, multimodal input, 256K context, and sparse on-device variants.

Free
local-inference
open-weights
on-device

Best for: Multimodal local assistant workflows, Multimodal document understanding

GLM (Z.AI)

Z.AI’s hosted GLM stack now spanning GLM-5.1, GLM-5V-Turbo, and earlier GLM branches for coding, reasoning, and multimodal workflows.

Freemium
cloud-llm
chat-assistant
multimodal

Best for: Hosted GLM access across text and vision workloads, Cloud coding assistants and technical drafting

GLM-5V-Turbo

Latest GLM vision branch for multimodal coding, screenshot understanding, GUI agents, and visually grounded execution workflows.

Freemium
cloud-llm
multimodal
vision

Best for: Screenshot-based coding help, GUI and browser agent workflows

InternVL 3.5

Apache-2.0 multimodal family with many size options and a strong focus on reasoning, OCR, and agent-style visual tasks.

Free
local-inference
open-weights
self-hosted

Best for: Multimodal internal analysis workflows, Builders experimenting with vision-language tasks

Kimi K2.6

Latest open-weight Kimi model for long-horizon coding, agent swarms, multimodal execution, and large-context local experimentation.

Free
local-inference
open-weights
reasoning

Best for: Local agentic coding workflows, Multimodal local assistant builds

Le Chat

Mistral’s cloud LLM chat with clear plan-level training defaults and opt-out controls.

Freemium
cloud-llm
chat-assistant
multimodal

Best for: Multilingual drafting and editing, Teams that require explicit training opt-out controls

Llama 3.2 Vision

Vision-capable Llama model for local image-plus-text understanding tasks.

Free
local-inference
open-weights
self-hosted

Best for: Local image + text analysis workflows, Multimodal document understanding

Llama 4

Open-weight multimodal family with massive context, but significant policy and license constraints.

Free
local-inference
open-weights
multimodal

Best for: Large multi-document summarization pipelines, Multimodal internal analysis workflows

MiniCPM-V 2.6

Efficient local VLM with strong OCR, multi-image, and video understanding in an 8B-class footprint.

Free
local-inference
open-weights
self-hosted

Best for: Private visual document analysis, Multimodal local assistant workflows

Mistral Small 4

Open hybrid Mistral model that combines instruct, reasoning, coding, OCR, and transcription in one 256K-context family.

Free
local-inference
open-weights
self-hosted

Best for: Multimodal local assistant workflows, Multimodal document understanding

Molmo

Open vision-language family from AI2 focused on strong multimodal quality with Apache-2.0 licensing.

Free
local-inference
open-weights
self-hosted

Best for: Multimodal document understanding, Private visual document analysis

Phi-3.5 Vision Instruct

Compact MIT-licensed multimodal model for local image, OCR, chart, and multi-image reasoning tasks.

Free
local-inference
open-weights
on-device

Best for: Multimodal document understanding, Private visual document analysis

Qwen Chat

Alibaba’s cloud Qwen assistant with multilingual support and enterprise-grade API access through Model Studio.

Freemium
cloud-llm
chat-assistant
multimodal

Best for: Multilingual drafting and rewriting, Cost-controlled cloud assistant operations

Qwen2.5 VL

Multimodal Qwen model family for local vision-language workflows.

Free
local-inference
open-weights
self-hosted

Best for: Multimodal local assistant workflows, Private visual document analysis

Qwen3.5

Native multimodal Qwen family with sparse MoE scaling, strong agent behavior, and a flagship 397B total / 17B active open model.

Free
local-inference
open-weights
self-hosted