Kailash Labs

Kailash Labs is a video research and product lab. We are on a mission to convert the world’s videos into structured data that machines can reason over. Our aim is to build small, grounded video understanding models that can be adapted to any downstream video analysis task with little data. Towards a world where machines can watch and understand videos as fluently as humans.

What we’re betting on

01. The bitter lesson, for vision.
Specialised CV models do not compose. A PPE-detection model cannot be extended to monitor an SOP. A shelf-compliance classifier cannot be repurposed for queue analytics. In images, the field has already moved on: practitioners increasingly prompt or fine-tune a VLM rather than train a task-specific detector [1][2]. Video MLLMs have now crossed the same threshold. Pretrained video representations transfer across detection, retrieval, captioning, and temporal grounding with orders of magnitude less labelled data [3].
02. Efficiency dominates capability in video.
Camera feeds produce 30 to 60 frames per second; a 100B-parameter frontier model is the wrong shape for that load. The use cases that matter most are real-time by definition: robotics, security, industrial monitoring, and live broadcast, each with strict latency, cost, and privacy constraints. We bet on small, specialised video models. Big models in development, small models in production [4].
03. Video is the last missing context layer for AI agents.
Modern agents read code, text, screens, and APIs fluently. They are blind to the physical world. Working with video is still hard: choosing a model, sampling frames, wiring it to a camera, evaluating it on your footage. Our SDK collapses that work into a few lines. Swap models, connect live feeds, run evals, give your agent eyes.
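To make the second bet concrete, here is a minimal back-of-the-envelope sketch of the load a single camera implies. It assumes every frame is processed as it arrives; the function names are illustrative only and are not part of any SDK.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def frame_budget_ms(fps: float) -> float:
    """Wall-clock time available per frame if every frame is processed in real time."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def frames_per_day(fps: float) -> int:
    """Number of frames one continuous camera feed produces in 24 hours."""
    return int(fps * SECONDS_PER_DAY)


for fps in (30, 60):
    print(
        f"{fps} fps: {frame_budget_ms(fps):.1f} ms per frame, "
        f"{frames_per_day(fps):,} frames per day"
    )
```

At 30 fps the budget is about 33 ms per frame and at 60 fps about 17 ms, for roughly 2.6 to 5.2 million frames per camera per day. Sampling every n-th frame relaxes the budget by the same factor n, which is one reason frame sampling is among the first decisions in any video pipeline.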

· · ·

If you are building with video — robotics, security, industrial, sports, media — we want to hear from you. Early SDK access is open to design partners.

[ Shubham Sharma ]

References

[1] Moondream. Small open-weight vision language model. moondream.ai.
[2] Bai et al. Qwen2.5-VL Technical Report, 2025. arXiv:2502.13923.
[3] Yuan et al. Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding, 2025. arXiv:2501.07888.
[4] Vikhyat Korrapati. Tiny models will run the world, Moondream public talk, 2025.
[5] Qwen Team. Qwen3-VL Technical Report, 2025. arXiv:2511.21631. Base model for Marlin.
[6] CaReBench. Careful captioning benchmark with spatial/temporal decomposition. carebench.github.io.
[7] TimeLens-Bench. Temporal grounding benchmark for video VLMs. arXiv:2512.14698.