In the fast-evolving world of Vision-Language Models (VLMs), FastVLM emerges as a game-changer. Designed for efficiency and speed, it introduces a novel vision encoder that sharply reduces vision encoding time while outperforming other lightweight models on both accuracy and latency.
Whether you’re building a multimodal assistant, deploying models to mobile devices, or optimizing inference at scale, FastVLM is a compelling new tool to explore.

What is FastVLM?
FastVLM is an efficient vision encoding framework built for Vision-Language Models. At its core is FastViTHD, a hybrid vision encoder that processes high-resolution images by outputting fewer tokens, enabling significantly faster inference without sacrificing performance.
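To build intuition for why fewer visual tokens matter, here is a minimal back-of-the-envelope sketch (not FastVLM code): time-to-first-token (TTFT) is roughly the vision-encoding time plus the LLM prefill over all visual and text tokens, so shrinking the visual token count directly shrinks the prefill. All numbers below are made-up placeholders, not measurements.

```python
# Illustrative only: a toy cost model for time-to-first-token (TTFT).
# All timings and token counts are made-up placeholders, not FastVLM measurements.

def estimated_ttft(vision_encode_ms: float, num_visual_tokens: int,
                   num_text_tokens: int, prefill_ms_per_token: float) -> float:
    """TTFT ~= vision encoding + LLM prefill over visual + text tokens."""
    prefill_ms = (num_visual_tokens + num_text_tokens) * prefill_ms_per_token
    return vision_encode_ms + prefill_ms

# A conventional ViT-style encoder emitting many visual tokens for a high-res image...
baseline = estimated_ttft(vision_encode_ms=300, num_visual_tokens=2880,
                          num_text_tokens=64, prefill_ms_per_token=0.5)
# ...versus a hybrid encoder that emits far fewer tokens for the same image.
fewer_tokens = estimated_ttft(vision_encode_ms=120, num_visual_tokens=256,
                              num_text_tokens=64, prefill_ms_per_token=0.5)

print(f"baseline TTFT ≈ {baseline:.0f} ms, reduced-token TTFT ≈ {fewer_tokens:.0f} ms")
```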
Key Highlights
- FastViTHD Encoder: A hybrid transformer-based encoder that dramatically reduces encoding time for high-resolution images.
- Unmatched Speed:
  - 85× faster Time-to-First-Token (TTFT) compared to LLaVA-OneVision-0.5B.
  - 7.9× faster TTFT than Cambrian-1-8B with FastVLM’s 7B variant.
- Mobile-Ready: A demo iOS app showcases FastVLM running efficiently on Apple devices.
- Compact and Powerful:
  - 3.4× smaller vision encoder than comparable models.
  - Larger variants use Qwen2-7B and still outperform models with significantly higher latency.
Accuracy vs Latency
The detailed charts are in the paper, but the headline is clear: FastVLM offers a markedly better trade-off between model size, accuracy, and latency than comparable VLMs. Unlike traditional vision encoders that flood the LLM with visual tokens at high resolution, FastViTHD keeps the token count lean and inference lightning-fast.
Model Zoo & Checkpoints
FastVLM provides a full suite of pretrained models:
| Model | Stage 2 Checkpoint | Stage 3 Checkpoint |
|---|---|---|
| FastVLM-0.5B | fastvlm_0.5b_stage2 | fastvlm_0.5b_stage3 |
| FastVLM-1.5B | fastvlm_1.5b_stage2 | fastvlm_1.5b_stage3 |
| FastVLM-7B | fastvlm_7b_stage2 | fastvlm_7b_stage3 |
Download all models easily:
bash get_models.sh # Downloads to the `checkpoints/` directory
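Once the script finishes, the directories under `checkpoints/` are what you pass to `--model-path` in the inference example below. A tiny sketch to confirm what was downloaded (assumes the default `checkpoints/` location used by the script):

```python
# List downloaded checkpoint directories.
# Assumes get_models.sh placed everything under the default `checkpoints/` folder.
from pathlib import Path

for ckpt in sorted(Path("checkpoints").iterdir()):
    if ckpt.is_dir():
        print(ckpt)  # pass one of these paths as --model-path below
```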
Getting Started
Installation
conda create -n fastvlm python=3.10
conda activate fastvlm
pip install -e .
Inference Example
python predict.py --model-path /path/to/checkpoint-dir \
    --image-file /path/to/image.png \
    --prompt "Describe the image."
Inference on Apple Silicon
FastVLM supports efficient inference on Apple devices (Mac, iPhone, iPad) through model export and quantization. Instructions are provided in the model_export/
subfolder of the repo.
Pre-exported Apple-compatible models:
- fastvlm_0.5b_stage3
- fastvlm_1.5b_stage3
- fastvlm_7b_stage3
A demo iOS app is also available to test performance in real-world mobile scenarios.
Built on LLaVA
FastVLM uses the LLaVA codebase as a training and inference foundation. If you’re already familiar with LLaVA, you’ll feel right at home with FastVLM.
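Because the codebase follows LLaVA’s layout, loading a checkpoint programmatically should look roughly like LLaVA’s usual entry point. The sketch below assumes FastVLM keeps LLaVA’s `load_pretrained_model` helper unchanged; verify the exact module paths and arguments against the repo before relying on it.

```python
# Rough sketch assuming FastVLM retains LLaVA's standard loader;
# check the repo if module paths or arguments differ.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "/path/to/checkpoint-dir"  # e.g. a Stage 3 checkpoint directory
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)
print(type(model).__name__, "loaded with context length", context_len)
```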
Use Cases
- On-device multimodal assistants
- High-resolution image captioning
- Counting and object detection tasks
- Handwriting and emoji recognition
- Real-time AI agents on mobile
Final Thoughts
FastVLM isn’t just another VLM; it’s a blueprint for efficient, scalable, and mobile-friendly multimodal AI. With impressive TTFT numbers, a compact encoder design, and support for Apple devices, it’s ready to power the next generation of multimodal applications.
Check out the official GitHub repo, read the paper, and start building with FastVLM today.