Depth-Aware Open-Vocabulary Segmentation
of Adverse-Condition Urban Streetscapes

CS 766 Final Project  ·  OV-DGSS

Chen Wei
University of Wisconsin–Madison

Project Presentation

Introduction

We build OV-DGSS, a depth-aware open-vocabulary semantic segmentation pipeline for adverse-condition urban scenes (night, rain, fog). It augments a frozen CLIP–ConvNeXt backbone with a monocular depth prior and a geometry-conditioned text query decoder, trained in a parameter-efficient regime. On 50 ACDC adverse-condition frames, OV-DGSS reaches mIoU 0.620 / 0.486 on Seen / Unseen Cityscapes classes — beating FC-CLIP (0.435 / 0.468) and SED (0.376 / 0.457).

Qualitative comparison teaser: RGB | Ground Truth | FC-CLIP | SED | Ours.

Motivation

Zero-shot vision–language segmenters work well on clean scenes but fail in the cases that matter most for mobility and safety analysis: nighttime, rain, glare, occlusion. Symptoms are predictable — boundaries blur into shadow, small but safety-critical classes (signs, poles) vanish, and large regions collapse onto a single background label.

Our hypothesis: under adverse conditions texture alone is a weak cue, but a geometric prior — ground plane, vertical surfaces, near/far — survives illumination loss. The challenge is to inject that prior into a frozen open-vocabulary backbone without sacrificing zero-shot text alignment and without the prohibitive cost of full-parameter VLM training.

Method

OV-DGSS has three frozen foundation streams (visual, depth, text) plus a small trainable head. The depth stream is projected into the visual feature space and added to it; class prompts become GeoText queries that attend to the fused features through a cross-attention decoder, with a cross-modal positional encoding (CMPE) injecting depth-derived geometry into the attention keys.

Conceptual data flow of OV-DGSS (architecture figure). The RGB image feeds two frozen encoders: the CLIP–ConvNeXt visual encoder ($V_\ell$) and the Depth-VFM encoder ($D_\ell$); the class prompts (19 Cityscapes classes) pass through the frozen CLIP text tower ($q_c$). A trainable depth projection $\phi_\ell$ fuses the streams as $F_\ell = V_\ell + \phi_\ell(D_\ell)$, and a trainable cross-attention decoder with CMPE turns the resulting GeoText queries into per-class segmentation masks. Only the depth projection, CMPE, and cross-attention decoder are trained.

Why depth? Adverse-condition frames are dominated by low-contrast texture: shadows swallow object edges and rain blurs long-range structure. The depth-VFM produces features that carry coarse geometric structure — near/far, ground plane, vertical surfaces — that survives illumination loss. We compute the fused feature $F_\ell = V_\ell + \phi_\ell(D_\ell)$, where $\phi_\ell$ is a small learnable projection that aligns depth features into the VLM feature space. This is also the exact intervention point of the zero-tensor ablation reported below.
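For concreteness, here is a minimal PyTorch sketch of the fusion step. It assumes a 1×1-convolution instantiation of $\phi_\ell$ and illustrative channel sizes; the module name and dimensions are not taken from the project code.

```python
import torch
import torch.nn as nn

class DepthProjection(nn.Module):
    """Illustrative per-level projection phi_l: maps frozen depth-VFM features
    into the CLIP-ConvNeXt feature space so they can be added residually."""
    def __init__(self, depth_dim: int, visual_dim: int):
        super().__init__()
        # A 1x1 conv keeps the trainable adapter tiny (one linear map per pixel).
        self.proj = nn.Conv2d(depth_dim, visual_dim, kernel_size=1)

    def forward(self, depth_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(depth_feats)

# Assumed shapes: both streams resized to the same (B, C, H, W) spatial grid.
phi = DepthProjection(depth_dim=384, visual_dim=768)
V_l = torch.randn(2, 768, 32, 64)   # frozen CLIP-ConvNeXt features V_l
D_l = torch.randn(2, 384, 32, 64)   # frozen depth-VFM features D_l
F_l = V_l + phi(D_l)                # fused features F_l = V_l + phi_l(D_l)
```

In this additive form, zeroing $D_\ell$ leaves essentially the frozen visual stream (plus the projection's bias), which is why the zero-tensor ablation below is a natural intervention point.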

Why GeoText queries + CMPE? They exist so that a query for road does not have to rediscover the ground plane from raw pixels every time the lighting changes. CMPE injects depth-derived position into the attention keys, keeping the road–sidewalk boundary stable even when shadow and headlight glare would otherwise shift it by tens of pixels.
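A hedged sketch of how depth-derived geometry could enter the attention keys: nn.MultiheadAttention stands in for the actual decoder layer, and the small MLP labelled cmpe is a placeholder for CMPE's real parameterization.

```python
import torch
import torch.nn as nn

class GeoTextCrossAttention(nn.Module):
    """Illustrative decoder block: text-derived class queries attend to the fused
    visual+depth features, with a depth-derived positional term added to the keys."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Placeholder for CMPE: a small MLP over per-location depth-geometry features.
        self.cmpe = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, queries, fused_feats, geo_feats):
        # queries:     (B, num_classes, dim)  text-tower class embeddings q_c
        # fused_feats: (B, H*W, dim)          flattened fused features F_l
        # geo_feats:   (B, H*W, dim)          depth-derived geometry per location
        keys = fused_feats + self.cmpe(geo_feats)        # geometry biases the keys
        out, attn = self.attn(queries, keys, fused_feats)
        return out, attn
```

In this sketch the values are left untouched, so geometry only changes where each class query looks, not what it reads out once it gets there.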

Data & baselines. We evaluate on the adverse-condition splits (night, rain, fog) of the ACDC validation set, with 19 Cityscapes-aligned categories. We compare against FC-CLIP and SED using their official released ConvNeXt-L checkpoints, run via Detectron2 against the same prompt set.

Results

1 · Scenario-based mIoU

50 adverse-condition frames per model. We split the 19 Cityscapes classes into Seen (trainId 0–9: road, sidewalk, building, wall, fence, pole, traffic light, traffic sign, vegetation, terrain) and Unseen (trainId 10–18: sky, person, rider, car, truck, bus, train, motorcycle, bicycle).

Scenario                  FC-CLIP   SED     OV-DGSS (Ours)
Seen (trainId 0–9)        0.435     0.376   0.620
Unseen (trainId 10–18)    0.468     0.457   0.486
Average                   0.452     0.417   0.553
Scenario-based mIoU bar chart
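The Seen/Unseen grouping itself is just an average of per-class IoUs by trainId; a minimal sketch follows, assuming the length-19 per-class IoU vector has already been computed upstream from the 50-frame confusion matrix.

```python
import numpy as np

SEEN = range(0, 10)     # trainIds 0-9: road ... terrain
UNSEEN = range(10, 19)  # trainIds 10-18: sky ... bicycle

def scenario_miou(per_class_iou: np.ndarray) -> dict:
    """per_class_iou: length-19 array of IoUs, NaN for classes absent from the GT."""
    seen = float(np.nanmean(per_class_iou[list(SEEN)]))
    unseen = float(np.nanmean(per_class_iou[list(UNSEEN)]))
    return {"seen": seen, "unseen": unseen, "average": (seen + unseen) / 2}
```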

The Seen-class margin is large (+0.185 over FC-CLIP, +0.244 over SED) because ground-plane / wall geometry is exactly where depth helps when contrast collapses. On Unseen classes the margin is small but consistent (+0.018 over FC-CLIP) — depth is a structure prior, not a recognition prior, so it does not push the frontier on objects whose shape (not geometry) is the dominant cue.

2 · Qualitative comparison (1×5)

Columns: RGB · Ground Truth · FC-CLIP · SED · OV-DGSS. All masks use the Cityscapes color palette.

Qualitative 1×5 grid (full)

Two patterns are worth calling out. First, in nighttime frames both baselines suffer boundary bleed along the road–sidewalk transition; OV-DGSS's fused depth keeps the ground plane consistent. Second, in rain/fog frames where long-range structure is washed out, the baselines tend to collapse large regions onto a single background class (sky or vegetation), while OV-DGSS retains the road–vegetation–building partition because depth disambiguates it.

3 · Zero-tensor depth ablation

At inference time we monkey-patch the depth-fusion step to replace $D_\ell$ with torch.zeros_like(depth_features), leaving everything else untouched. The model is forced to produce a mask using only the frozen visual stream plus the CMPE/decoder — the depth signal is surgically severed.
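A sketch of the monkey-patch, assuming the depth features enter the model through a single projection module; the attribute name depth_proj is hypothetical.

```python
import torch

def apply_zero_depth_ablation(model):
    """Wrap the depth projection so it receives zeros instead of the real
    depth-VFM features at inference time; everything else runs unchanged."""
    original_forward = model.depth_proj.forward

    def zeroed_forward(depth_features):
        # Sever the depth signal: same shape/dtype/device, all zeros.
        return original_forward(torch.zeros_like(depth_features))

    model.depth_proj.forward = zeroed_forward  # instance-level patch, reversible
    return model
```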

Zero-tensor depth ablation
Left: RGB. Middle: normal mask. Right: mask after replacing depth features with zeros.

Differences appear at the road/sidewalk boundary, in road markings, and along vertical structures (poles, signs). This is a pseudo-ablation: we do not retrain without depth, so the surviving weights were never adapted to live without it. The degradation is therefore an upper bound on the depth branch's contribution — but it is the cleanest white-box evidence that the model actually uses its geometric prior.

4 · Cross-attention response visualization

A forward hook on the decoder's cross-attention submodule captures the output activation; we normalize, reshape into an $H\times W$ heatmap, colorize with COLORMAP_JET, and blend onto the RGB input.
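A sketch of the hook-and-overlay procedure, under the same caveats as the honesty note below; the attention-module handle, the feature-grid shape, and the blend weights are assumptions.

```python
import cv2
import numpy as np
import torch

def cross_attention_overlay(model, attn_module, rgb_tensor, rgb_bgr, feat_hw):
    """Capture the output activation of a cross-attention submodule via a forward
    hook, reshape it heuristically to the feature grid, and blend a JET heatmap
    onto the input image (rgb_bgr: uint8 BGR array of the same frame)."""
    captured = {}

    def hook(module, inputs, output):
        # Some attention modules return (output, weights); keep the output activation.
        captured["act"] = output[0] if isinstance(output, tuple) else output

    handle = attn_module.register_forward_hook(hook)
    with torch.no_grad():
        model(rgb_tensor)                                   # one forward pass
    handle.remove()

    act = captured["act"].float().mean(dim=-1).squeeze(0)   # collapse channels per token
    act = act.reshape(feat_hw).cpu().numpy()                # heuristic H x W reshape
    act = (act - act.min()) / (act.max() - act.min() + 1e-8)
    heat = cv2.applyColorMap((act * 255).astype(np.uint8), cv2.COLORMAP_JET)
    heat = cv2.resize(heat, (rgb_bgr.shape[1], rgb_bgr.shape[0]))
    return cv2.addWeighted(rgb_bgr, 0.6, heat, 0.4, 0)      # blend heatmap onto RGB
```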

Cross-attention overlay

Honesty note. What we capture is the output activation of the cross-attention submodule, not the softmax-normalized attention tensor; the reshape is heuristic. The most we can claim is that the decoder's cross-attention produces a spatially non-trivial response that broadly follows scene structure — not per-class grounding.

Conclusion

OV-DGSS shows that a small set of trainable adapters — a depth projection, a cross-modal positional encoding, and a cross-attention decoder — is enough to turn a frozen open-vocabulary VLM into a competitive adverse-condition segmenter, without retraining the backbone. Geometry helps where contrast fails; depth is a structure prior, not a recognition prior; PEFT-style fusion is sufficient to bridge the domain gap.

Caveats. The mIoU study is a 50-frame subset, not full ACDC val. The cross-attention figure is a heuristic response map, not a rigorous per-class attention. The depth ablation is pseudo (not retrained). These are honest scope decisions, not claims of leaderboard parity.

Future Directions:

  • Full-set ACDC evaluation with confidence intervals.
  • Retrained no-depth variant for a calibrated bound on the depth branch's contribution.
  • Rigorous per-class cross-attention maps using the actual softmax-normalized attention tensor.

References

[1] Q. Yu, J. He, X. Deng, X. Shen, L.-C. Chen. "Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP" (FC-CLIP). NeurIPS, 2023.

[2] B. Xie, J. Cao, J. Xie, F. S. Khan, Y. Pang. "SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation." CVPR, 2024.

[3] C. Sakaridis, D. Dai, L. Van Gool. "ACDC: The Adverse Conditions Dataset with Correspondences for Semantic Driving Scene Understanding." ICCV, 2021.

[4] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen. "LoRA: Low-Rank Adaptation of Large Language Models." ICLR, 2022.

[5] N. Ding et al. "Parameter-Efficient Fine-Tuning of Large-Scale Pre-Trained Language Models." Nature Machine Intelligence, 5(3):220–235, 2023.

[6] A. Radford et al. "Learning Transferable Visual Models From Natural Language Supervision" (CLIP). ICML, 2021.

[7] L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, H. Zhao. "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data." CVPR, 2024.