We build OV-DGSS, a depth-aware open-vocabulary semantic segmentation pipeline for adverse-condition urban scenes (night, rain, fog). It augments a frozen CLIP–ConvNeXt backbone with a monocular depth prior and a geometry-conditioned text query decoder, trained in a parameter-efficient regime. On 50 ACDC adverse-condition frames, OV-DGSS reaches mIoU 0.620 / 0.486 on Seen / Unseen Cityscapes classes — beating FC-CLIP (0.435 / 0.468) and SED (0.376 / 0.457).
Zero-shot vision–language segmenters work well on clean scenes but fail in the cases that matter most for mobility and safety analysis: nighttime, rain, glare, occlusion. Symptoms are predictable — boundaries blur into shadow, small but safety-critical classes (signs, poles) vanish, and large regions collapse onto a single background label.
Our hypothesis: under adverse conditions texture alone is a weak cue, but a geometric prior — ground plane, vertical surfaces, near/far — survives illumination loss. The challenge is to inject that prior into a frozen open-vocabulary backbone without sacrificing zero-shot text alignment and without the prohibitive cost of full-parameter VLM training.
OV-DGSS has three frozen foundation streams (visual, depth, text) plus a small trainable head. The depth stream is projected into the visual feature space and added to it; class prompts become GeoText queries that attend to the fused features through a cross-attention decoder, with a cross-modal positional encoding (CMPE) injecting depth-derived geometry into the attention keys.
Why depth? Adverse-condition frames are dominated by low-contrast texture: shadows swallow object edges and rain blurs long-range structure. The depth-VFM produces features that carry coarse geometric structure — near/far, ground plane, vertical surfaces — that survives illumination loss. We compute a delta feature $F_\ell = V_\ell + \phi_\ell(D_\ell)$, where $\phi_\ell$ is a small learnable projection that aligns depth features into the VLM feature space. This is also the exact intervention point of the zero-tensor ablation reported below.
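The fusion step can be sketched in a few lines of PyTorch. The channel widths and the choice of a 1×1 convolution for $\phi_\ell$ are illustrative assumptions, not the project's actual configuration:

```python
import torch
import torch.nn as nn

class DepthFusion(nn.Module):
    """Fuse frozen depth-VFM features into the frozen VLM feature space.

    Implements F_l = V_l + phi_l(D_l), where phi_l is the only trainable part.
    Channel sizes below are illustrative; the real backbones define them.
    """

    def __init__(self, depth_dim: int, vlm_dim: int):
        super().__init__()
        # phi_l: a 1x1 conv mapping depth channels into the VLM channel space.
        self.phi = nn.Conv2d(depth_dim, vlm_dim, kernel_size=1)

    def forward(self, v_feat: torch.Tensor, d_feat: torch.Tensor) -> torch.Tensor:
        # v_feat: (B, C_vlm, H, W) frozen visual features
        # d_feat: (B, C_depth, H, W) frozen depth features, same spatial size
        return v_feat + self.phi(d_feat)

fuse = DepthFusion(depth_dim=384, vlm_dim=768)
v = torch.randn(1, 768, 32, 64)
d = torch.randn(1, 384, 32, 64)
f = fuse(v, d)
print(f.shape)  # torch.Size([1, 768, 32, 64])
```

Because $\phi_\ell$ is the only trainable piece here, the frozen streams keep their zero-shot text alignment.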
Why GeoText queries + CMPE? So that a query for road does not have to rediscover the ground plane from raw pixels every time the lighting changes. CMPE injects depth-derived position into the attention keys, keeping the road–sidewalk boundary stable even when shadow + headlight glare would otherwise shift it by tens of pixels.
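A minimal sketch of one decoder step with CMPE, assuming depth enters the attention keys as an additive encoding. The MLP shape of the encoder, the head count, and all names below are our illustrative choices, not the project's API:

```python
import torch
import torch.nn as nn

class CMPEDecoderLayer(nn.Module):
    """One cross-attention step of a GeoText query decoder (illustrative).

    Text-derived queries attend to fused image features; a depth-derived
    positional encoding is added to the keys only, so geometry conditions
    *where* a query looks without contaminating the attended values.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # CMPE: maps per-token depth (1 channel) to a dim-sized encoding.
        self.cmpe = nn.Sequential(nn.Linear(1, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, queries, feats, depth):
        # queries: (B, K, C) GeoText queries, one per class prompt
        # feats:   (B, HW, C) fused visual+depth features, flattened
        # depth:   (B, HW, 1) per-token depth values
        keys = feats + self.cmpe(depth)           # geometry enters the keys
        out, _ = self.attn(queries, keys, feats)  # values stay appearance-based
        return out
```

Zeroing the depth input (as in the ablation below) removes the geometric conditioning while leaving the attention machinery intact.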
Data & baselines. We evaluate on the adverse-condition splits (night, rain, fog) of the ACDC validation set, with 19 Cityscapes-aligned categories. We compare against FC-CLIP and SED using their officially released ConvNeXt-L checkpoints, run via Detectron2 with the same prompt set and the same 50 adverse-condition frames per model. We split the 19 Cityscapes classes into Seen (trainId 0–9: road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, terrain) and Unseen (trainId 10–18: sky, person, rider, car, truck, bus, train, motorcycle, bicycle).
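The scoring protocol can be sketched as follows (a hypothetical helper, not the actual harness): per-class intersection and union are accumulated over all frames, IoU is averaged within each split, and classes absent from both prediction and ground truth are excluded:

```python
import numpy as np

SEEN, UNSEEN, IGNORE = set(range(10)), set(range(10, 19)), 255

def miou_split(pred: np.ndarray, gt: np.ndarray, num_classes: int = 19):
    """pred/gt: (N, H, W) integer trainId maps; gt pixels of 255 are ignored."""
    valid = gt != IGNORE
    iou = np.full(num_classes, np.nan)
    for c in range(num_classes):
        p, g = (pred == c) & valid, (gt == c) & valid
        union = np.logical_or(p, g).sum()
        if union > 0:  # skip classes absent from both pred and gt
            iou[c] = np.logical_and(p, g).sum() / union
    seen = np.nanmean([iou[c] for c in SEEN])
    unseen = np.nanmean([iou[c] for c in UNSEEN])
    return seen, unseen
```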
| Scenario | FC-CLIP | SED | OV-DGSS (Ours) |
|---|---|---|---|
| Seen (trainId 0–9) | 0.435 | 0.376 | 0.620 |
| Unseen (trainId 10–18) | 0.468 | 0.457 | 0.486 |
| Average | 0.452 | 0.417 | 0.553 |
The Seen-class margin is large (+0.185 over FC-CLIP, +0.244 over SED) because ground-plane / wall geometry is exactly where depth helps when contrast collapses. On Unseen classes the margin is small but consistent (+0.018 over FC-CLIP) — depth is a structure prior, not a recognition prior, so it does not push the frontier on objects whose shape (not geometry) is the dominant cue.
Columns: RGB · Ground Truth · FC-CLIP · SED · OV-DGSS. All masks use the Cityscapes color palette.
Two patterns are worth calling out. First, in nighttime frames both baselines suffer boundary bleed along the road–sidewalk transition; OV-DGSS's fused depth keeps the ground plane consistent. Second, in rain/fog frames where long-range structure is washed out, the baselines tend to collapse large regions onto a single background class (sky or vegetation), while OV-DGSS retains the road–vegetation–building partition because depth disambiguates it.
At inference time we monkey-patch the depth-fusion step to replace $D_\ell$ with `torch.zeros_like(depth_features)`, leaving everything else untouched. The model is forced to produce a mask using only the frozen visual stream plus the CMPE/decoder: the depth signal is surgically severed.
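The patch itself can be sketched as below, using a hypothetical fusion module with a (visual, depth) forward signature (the real module's name and signature are the project's, not shown here):

```python
import torch

def zero_depth_patch(fusion_module):
    """Wrap a fusion module's forward so the depth input is replaced by zeros.

    `fusion_module` is any module whose forward takes (visual_feats,
    depth_feats); the name and signature are illustrative.
    Returns the original forward so it can be restored afterwards.
    """
    original_forward = fusion_module.forward

    def patched(v_feat, d_feat):
        # Sever the depth signal: same shape/dtype/device, all zeros.
        return original_forward(v_feat, torch.zeros_like(d_feat))

    fusion_module.forward = patched
    return original_forward
```

Keeping the returned handle lets us restore the original forward and re-run the unablated model on the same frame.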
Differences appear at the road/sidewalk boundary, in road markings, and along vertical structures (poles, signs). This is a pseudo-ablation: we do not retrain without depth, so the surviving weights were never adapted to live without it. The degradation is therefore an upper bound on the depth branch's contribution — but it is the cleanest white-box evidence that the model actually uses its geometric prior.
A forward hook on the decoder's cross-attention submodule captures the output activation; we normalize, reshape into an $H\times W$ heatmap, colorize with `COLORMAP_JET`, and blend onto the RGB input.
Honesty note. What we capture is the output activation of the cross-attention submodule, not the softmax-normalized attention tensor; the reshape is heuristic. The most we can claim is that the decoder's cross-attention produces a spatially non-trivial response that broadly follows scene structure — not per-class grounding.
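The capture-and-normalize step can be sketched as below. Consistent with the note above, it records the submodule's output activation, not attention weights, and the row-major reshape is an assumption; all names are illustrative:

```python
import numpy as np
import torch

def attach_response_hook(module, store: dict):
    """Capture a submodule's output activation via a forward hook."""
    def hook(_mod, _inputs, output):
        # nn.MultiheadAttention returns (attn_output, attn_weights)
        out = output[0] if isinstance(output, tuple) else output
        store["act"] = out.detach()
    return module.register_forward_hook(hook)

def to_heatmap(act: torch.Tensor, h: int, w: int) -> np.ndarray:
    """Token-wise L2 norm -> min-max normalize -> reshape to (h, w).

    The reshape assumes row-major token order, which is heuristic.
    """
    a = act[0].norm(dim=-1)                       # (tokens,) from (B, tokens, C)
    a = (a - a.min()) / (a.max() - a.min() + 1e-8)
    return a.reshape(h, w).cpu().numpy()
```

The resulting map is then colorized with `cv2.applyColorMap` using `cv2.COLORMAP_JET` and alpha-blended onto the RGB frame with `cv2.addWeighted`.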
OV-DGSS shows that a small set of trainable adapters — a depth projection, a cross-modal positional encoding, and a cross-attention decoder — is enough to turn a frozen open-vocabulary VLM into a competitive adverse-condition segmenter, without retraining the backbone. Geometry helps where contrast fails; depth is a structure prior, not a recognition prior; PEFT-style fusion is sufficient to bridge the domain gap.
Caveats. The mIoU study is a 50-frame subset, not full ACDC val. The cross-attention figure is a heuristic response map, not a rigorous per-class attention. The depth ablation is pseudo (not retrained). These are honest scope decisions, not claims of leaderboard parity.
References:
[1] Q. Yu, J. He, X. Deng, X. Shen, L.-C. Chen. "Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP" (FC-CLIP). NeurIPS, 2023.
[2] B. Xie, J. Cao, J. Xie, F. S. Khan, Y. Pang. "SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation." CVPR, 2024.
[3] C. Sakaridis, D. Dai, L. Van Gool. "ACDC: The Adverse Conditions Dataset with Correspondences for Semantic Driving Scene Understanding." ICCV, 2021.
[4] E. J. Hu, Y. Shen, P. Wallis, Y. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen. "LoRA: Low-Rank Adaptation of Large Language Models." ICLR, 2022.
[5] N. Ding et al. "Parameter-Efficient Fine-Tuning of Large-Scale Pre-Trained Language Models." Nature Machine Intelligence, 5(3):220–235, 2023.
[6] A. Radford et al. "Learning Transferable Visual Models From Natural Language Supervision" (CLIP). ICML, 2021.
[7] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, H. Zhao. "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data." CVPR, 2024.