Mohammad Mahdi Derakhshani

Computer Vision | Machine Learning

Welcome! I'm Mohammad, a Ph.D. student at the University of Amsterdam's VIS lab under the supervision of Cees Snoek and Yuki Asano. My research delves into multi-modal foundation models, with a keen interest in unified generative models. I am currently a community researcher at Cohere Lab, focusing on multilingual generative models.

In summer 2023, I interned at Microsoft Research, Cambridge, working on fine-tuning large language models (e.g., GPT-3 and GPT-3.5) and on conditional text-to-image generation alongside Molly Xia, Harkirat Behl and Victor Ruehle. In summer 2022, I was at Samsung AI Center, Cambridge, researching large-scale language-image models and federated learning with Brais Martinez and Georgios Tzimiropoulos. Before that, I pursued my master's at the University of Tehran under the guidance of Babak Nadjar Araabi and Mohammad Amin Sadeghi. My studies at the Machine Learning and Computational Modeling lab focused on object detection and image compression. I also researched object detection with Mohammad Rastegari.

I'm proud to be an ELLIS society member and have reviewed for prestigious conferences and journals such as CVPR, NeurIPS, ICLR, ICML, ICCV and TPAMI.

News

📜 ICLR'25: TULIP (🌷): Token-length Upgraded CLIP
📜 ICCV'25: TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning
📜 CompFuser: Unlocking Spatial Comprehension in Text-to-Image Diffusion Models
📜 ICLR'24 FoMo workshop: SeCAt: Self-Supervised Open-Ended Classification with Small Visual Language Models
📜 ICCV'23: Bayesian Prompt Learning for Image-Language Model Generalization
🔥 Joined Microsoft Research as an ML summer intern in Cambridge, UK.

Tutorials & Lectures

Scaling Transformers 5D Parallelism

Foundation Model Course 2025

In-depth tutorial on parallelism techniques for training transformers.

Research

TULIP: Token-length Upgraded CLIP

Ivona Najdenkoska*, Mohammad Mahdi Derakhshani*, Yuki M. Asano, Nanne van Noord, Marcel Worring, Cees G. M. Snoek.

*: equal contribution; random order.

ICLR 2025 bibtex

We propose a generalizable method, named TULIP, that can extend the token length of CLIP-like models to any length. We do so by improving the architecture with relative position encodings, followed by a training procedure that (i) distills the original CLIP text encoder into an encoder with relative position encodings and (ii) enhances the model for aligning longer captions with images.
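
To illustrate why relative position encodings help with longer inputs, here is a minimal sketch. It uses rotary position embeddings as one well-known relative scheme; this choice, the function names (`rotate`, `attention_score`) and the toy vectors are assumptions for illustration and are not taken from the paper. The point it shows is that the query-key score depends only on the offset between two tokens, not on their absolute positions.

```python
import math


def rotate(vec, position, base=10000.0):
    """Apply a rotary position encoding to an even-length vector (illustrative)."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each consecutive pair of features is rotated by a position-dependent angle.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def attention_score(query, key, q_pos, k_pos):
    """Dot product of a rotated query and a rotated key."""
    q = rotate(query, q_pos)
    k = rotate(key, k_pos)
    return sum(a * b for a, b in zip(q, k))


# Toy vectors (hypothetical values).
query = [0.3, -1.2, 0.7, 0.5]
key = [1.0, 0.4, -0.6, 0.9]

# Both pairs of positions are 3 tokens apart, one early and one late in the sequence.
near = attention_score(query, key, 5, 2)
far = attention_score(query, key, 305, 302)
print(abs(near - far) < 1e-9)
```

Because the score is a function of the offset alone, positions beyond a fixed absolute-position table are not a special case for the encoder, which is the property a relative scheme contributes.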

TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning

Aritra Bhowmik*, Mohammad Mahdi Derakhshani*, Dennis Koelma, Martin R. Oswald, Yuki M. Asano, Cees G. M. Snoek

*: equal contribution; random order.

ICCV 2025 bibtex

We propose TWIST, a twin-expert stepwise tuning module that modifies the decoder of the language model using one frozen module pre-trained on image understanding tasks and another learnable one for visual grounding tasks. This allows the MLLM to retain previously learned knowledge and skills, while acquiring what is missing. To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT, which mimics human reasoning in visual grounding. This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process, thereby simplifying the task of visual grounding.

Any-Shift Prompting for Generalization over Distributions

Zehao Xiao, Jiayi Shen, Mohammad Mahdi Derakhshani, Shengcai Liao, Cees G. M. Snoek

CVPR 2024 bibtex

We propose any-shift prompting: a general probabilistic inference framework that considers the relationship between training and test distributions during prompt learning. We explicitly connect training and test distributions in the latent space by constructing training and test prompts in a hierarchical architecture.

Unlocking Spatial Comprehension in Text-to-Image Diffusion Models

Mohammad Mahdi Derakhshani, Menglin Xia, Harkirat Behl, Cees G. M. Snoek, Victor Rühle.

arXiv 2023 bibtex

We propose CompFuser, an image generation pipeline that enhances spatial comprehension and attribute assignment in text-to-image generative models. Our pipeline enables the interpretation of instructions defining spatial relationships between objects in a scene.

Self-Supervised Open-Ended Classification with Small Visual Language Models

Mohammad Mahdi Derakhshani*, Ivona Najdenkoska*, Cees G. M. Snoek, Marcel Worring, Yuki M. Asano.

ICLR FoMo workshop 2024 bibtex

We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks open-ended few-shot abilities of small visual language models. Our proposed adaptation algorithm explicitly learns from symbolic, yet self-supervised training tasks.

Bayesian Prompt Learning for Image-Language Model Generalization

Mohammad Mahdi Derakhshani, Enrique Sanchez, Adrian Bulat, Victor Guilherme Turrisi da Costa, Cees G. M. Snoek, Georgios Tzimiropoulos, Brais Martinez.

ICCV 2023 bibtex

We propose a probabilistic modeling of the underlying distribution of prompts, allowing prompts within the support of an associated concept to be derived through stochastic sampling. This results in a more complete and richer transfer of the information captured by the language model, providing better generalization capabilities for downstream tasks.
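
The idea of deriving prompts through stochastic sampling can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: the prompt distribution is assumed to be a diagonal Gaussian, the "text encoder" is a hypothetical stand-in that simply adds the prompt to a class embedding, and all names (`sample_prompt`, `monte_carlo_logits`) and values are invented for the example.

```python
import math
import random


def sample_prompt(mean, log_std, rng):
    """Draw one prompt vector from a diagonal Gaussian (reparameterization form)."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mean, log_std)]


def monte_carlo_logits(mean, log_std, image_feat, class_embs, n_samples, rng):
    """Average class scores over several sampled prompts."""
    totals = [0.0] * len(class_embs)
    for _ in range(n_samples):
        prompt = sample_prompt(mean, log_std, rng)
        for c, emb in enumerate(class_embs):
            # Hypothetical stand-in for a text encoder: prompt plus class embedding.
            text_feat = [p + e for p, e in zip(prompt, emb)]
            totals[c] += sum(t * v for t, v in zip(text_feat, image_feat))
    return [t / n_samples for t in totals]


# Toy inputs (hypothetical values).
rng = random.Random(0)
prompt_mean = [0.5, -0.5]
prompt_log_std = [-1.0, -1.0]
image_feat = [1.0, 0.0]
class_embs = [[1.0, 0.0], [0.0, 1.0]]

logits = monte_carlo_logits(prompt_mean, prompt_log_std, image_feat, class_embs, 16, rng)
print(len(logits))
```

Averaging over samples means the prediction reflects a distribution of prompts around the learned mean rather than a single fixed prompt; shrinking the standard deviation toward zero recovers ordinary deterministic prompting.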

Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models

Tom van Sonsbeek*, Mohammad Mahdi Derakhshani*, Ivona Najdenkoska*, Cees G. M. Snoek, Marcel Worring.

MICCAI 2023 bibtex

We introduce a novel method particularly suited for small, domain-specific, medical datasets. To properly communicate the medical images to the language model, we develop a network that maps the extracted visual features to a set of learnable tokens. Then, alongside the question, these learnable tokens directly prompt the language model.

LifeLonger: A Benchmark for Continual Disease Classification

Mohammad Mahdi Derakhshani*, Ivona Najdenkoska*, Tom van Sonsbeek*, Xiantong Zhen, Dwarikanath Mahapatra, Marcel Worring, Cees G. M. Snoek.

MICCAI 2022 bibtex

We introduce LifeLonger, a benchmark for continual disease classification on the MedMNIST collection, by applying existing state-of-the-art continual learning methods. We perform a thorough analysis of the performance and examine how the well-known challenges of continual learning, such as catastrophic forgetting, manifest themselves in this setting.

Generative Kernel Continual learning

Mohammad Mahdi Derakhshani, Xiantong Zhen, Ling Shao, Cees G. M. Snoek

arXiv 2021 bibtex

We introduce generative kernel continual learning, which explores and exploits the synergies between generative models and kernels for continual learning.

Kernel Continual Learning

Mohammad Mahdi Derakhshani, Xiantong Zhen, Ling Shao, Cees G. M. Snoek

ICML 2021 bibtex

This paper introduces kernel continual learning, a simple but effective variant of continual learning that leverages the non-parametric nature of kernel methods to tackle catastrophic forgetting.

Assisted Excitation of Activations: A Learning Technique to Improve Object Detectors

Mohammad Mahdi Derakhshani, Saeed Masoudnia, Amir Hossein Shaker, Omid Mersa, Mohammad Amin Sadeghi, Mohammad Rastegari, Babak N. Araabi

CVPR 2019 bibtex

We present a simple and effective learning technique that significantly improves mAP of YOLO object detectors without compromising their speed.

BlockCNN: A Deep Network for Artifact Removal and Image Compression

Danial Maleki, Soheila Nadalian, Mohammad Mahdi Derakhshani, Mohammad Amin Sadeghi

CVPR (Workshop) 2018 bibtex

We present a general technique that performs both artifact removal and image compression. For artifact removal, we input a JPEG image and try to remove its compression artifacts.