CVPR 2025 oral notes

P.S. The paper descriptions are based on my personal understanding. Some text was extracted from the abstracts.

Domain

Vision-language models (VLMs)

  • Dataset
    • Spatial reasoning
      • PixMo MattDeitke2025CVPR provides a dataset of highly detailed image captions for pre-training, a free-form image Q&A dataset for fine-tuning, and an innovative 2D pointing dataset. It's used to train a fully open-source 72B VLM from scratch, outperforming some proprietary models #🚀
      • VSI-Bench JihanYang2025CVPR presents a video-based visual-spatial intelligence benchmark of over 5,000 question-answer pairs. It shows that spatial reasoning capabilities remain the primary bottleneck and prevailing linguistic reasoning techniques (e.g., chain-of-thought, self-consistency, tree-of-thoughts) fail to improve performance
    • OpenING PengfeiZhou2025CVPR provides a dataset for interleaved image-text generation and trains a judge model
    • Q-Eval-100K ZichengZhang2025CVPR presents a large dataset focused on visual quality and alignment for 100K instances (60K images and 40K videos)
  • Fine-tuning
    • OPA-DPO ZhiheYang2025CVPR reveals that the benefits of DPO for mitigating hallucination are largely contingent on whether the constructed data aligns on-policy. It therefore proposes an on-policy alignment DPO framework
  • Inference
    • FarSight FeilongTang2025CVPR proposes a decoding strategy that intervenes on outlier tokens in the token interaction process to enhance in-context inference, which is shown to mitigate hallucination in MLLMs
  • Application cases
    • EgoLM FangzhouHong2025CVPR integrates the rich contextual information from egocentric videos and motion sensors afforded by wearable devices. It models the joint distribution of egocentric motions and natural language using LLMs and unifies a range of motion understanding tasks
    • M2F2-Det XiaoGuo2025CVPR employs tailored face forgery prompt learning with CLIP to improve generalization to unseen forgeries. An LLM then provides detailed textual explanations of its detection decisions
    • Agent
      • GEA AndrewSzot2025CVPR adapts an MLLM into a generalist embodied agent capable of grounding itself across varied embodied domains through a multi-embodiment action tokenizer #🚀
  • Evaluation
    • SoFA XinyuTian2025CVPR reveals the positional bias of multi-image VLMs and proposes a training-free approach, SoFt attention, that employs linear interpolation between inter-image causal attention and bidirectional counterparts to mitigate this bias
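
The SoFt-attention idea above can be sketched as a convex blend of causal and bidirectional attention. This is a minimal numpy sketch; the `alpha` parameter and the placement of the softmax are my assumptions, not the paper's exact formulation:

```python
import numpy as np

def soft_attention_weights(scores: np.ndarray, alpha: float) -> np.ndarray:
    """Blend causal and bidirectional attention weights (hypothetical sketch).

    scores: (T, T) raw attention logits; alpha in [0, 1] interpolates between
    inter-image causal attention (alpha=1) and bidirectional (alpha=0).
    """
    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    causal = np.where(np.tril(np.ones_like(scores)) > 0, scores, -np.inf)
    w_causal = softmax(causal)   # lower-triangular (causal) attention
    w_bidir = softmax(scores)    # full bidirectional attention
    return alpha * w_causal + (1 - alpha) * w_bidir

w = soft_attention_weights(np.random.randn(4, 4), alpha=0.5)
```

Since both mask variants produce rows summing to one, any convex combination is still a valid attention distribution.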

3D vision

  • Foundational models
    • VGGT (Visual Geometry Grounded Transformer) JianyuanWang2025CVPR infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views
  • Generation
    • DiT
      • CraftsMan3D WeiyuLi2025CVPR proposes a 3D-native DiT that directly models the distribution of 3D data in latent space, generating coarse geometries (inspired by craftsmanship) with regular mesh topology in seconds and then uses a normal-based geometry refiner to enhance local details (either automatically or interactively)
      • DNF XinyiZhang2025CVPR uses dictionary learning to disentangle 4D motion from shape as neural fields to generate high-fidelity 4D animations
      • ChenGeng2025CVPR generates temporal object intrinsics with signals distilled from pretrained 2D diffusion models, including the evolving sequences of object geometry, reflectance, and texture, such as a blooming rose
    • Neural rendering
      • NeRF
        • DIFIX3D+ JayZhangjieWu2025CVPR uses a single-step image diffusion model trained to enhance rendered novel views and remove artifacts caused by under-constrained regions of the 3D representation, improving FID by 2x #🚀
      • 3DGS
        • SSS JialinZhu2025CVPR proposes to replace the Gaussian mixture model of 3DGS with Student's t distribution, thus enabling both positive (splatting) and negative (scooping) densities, providing better expressivity #🧠
        • RunfengLi2025CVPR adapts 3D Gaussian splatting optimization, which commonly assumes multi-view input, to monocular dynamic scene reconstruction from C-ToF cameras
        • 3DGUT QiWu2025CVPR supports distorted cameras beyond the simple pinhole model, with time-dependent effects such as rolling shutter, by replacing the EWA splatting formulation with the Unscented Transform, which approximates the particles through sigma points
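
The Unscented Transform behind 3DGUT can be illustrated in its textbook form: deterministically sample 2n+1 sigma points of a Gaussian, push them through a nonlinear map (e.g., a distorted camera projection), and re-estimate moments. This is a generic sketch of the transform, not 3DGUT's exact particle parameterization:

```python
import numpy as np

def sigma_points(mu, cov, lam=1.0):
    """2n+1 sigma points of the Unscented Transform (textbook form)."""
    n = mu.size
    L = np.linalg.cholesky((n + lam) * cov)  # matrix square root via Cholesky
    pts = [mu] + [mu + L[:, i] for i in range(n)] + [mu - L[:, i] for i in range(n)]
    w0 = lam / (n + lam)
    wi = 1.0 / (2 * (n + lam))
    return np.stack(pts), np.array([w0] + [wi] * (2 * n))

def unscented_push(f, mu, cov, lam=1.0):
    """Approximate mean/cov of f(x), x ~ N(mu, cov), via sigma points."""
    pts, w = sigma_points(mu, cov, lam)
    y = np.array([f(p) for p in pts])
    mean = (w[:, None] * y).sum(axis=0)
    d = y - mean
    covy = sum(wi * np.outer(di, di) for wi, di in zip(w, d))
    return mean, covy

# e.g., a nonlinear (distorted) projection x -> x / (1 + 0.1 * ||x||^2)
mu, cov = np.array([1.0, 2.0]), np.eye(2) * 0.01
m, C = unscented_push(lambda x: x / (1 + 0.1 * x @ x), mu, cov)
```

For linear maps the transform is exact, which makes it easy to sanity-check.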
    • Human scene interaction (HSI)
      • TokenHSI LiangPan2025CVPR proposes a unified transformer-based policy capable of multi-skill unification and flexible adaptation. The key insight is to model the humanoid proprioception as a separate shared token and combine it with distinct task tokens via a masking mechanism
  • Segmentation
    • Point clouds
      • GRC LongyuYang2025CVPR explicitly separates feature extraction for geometry and reflectance to address the decreased accuracy when LiDAR semantic segmentation models are exposed to adverse weather conditions
  • Reconstruction
    • ShoichiroTakeda2025CVPR exploits the cyclic symmetry property for faster solution of the Gromov-Wasserstein (GW) problem, which underlies various real-world computer vision applications, e.g., image registration, point cloud registration, stereo matching, and 3D reconstruction
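
For reference, the discrete GW objective being accelerated is the standard quadratic-assignment form, which matches pairwise distances within each space rather than across spaces (notation assumed: D^X, D^Y are intra-space distance matrices, Π(μ, ν) the set of couplings):

```latex
\mathrm{GW}(\mu,\nu) \;=\; \min_{\pi \in \Pi(\mu,\nu)}
\sum_{i,k}\sum_{j,l} \bigl| D^{X}_{ik} - D^{Y}_{jl} \bigr|^{2}\, \pi_{ij}\, \pi_{kl}
```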
    • Image sensors
      • CUT3R QianqianWang2025CVPR uses a transformer to construct a stateful recurrent model that continuously updates its state representation with each new observation for 3D/4D reconstruction, admitting highly flexible inputs like video streams or unordered photo collections
      • Monocular 3D reconstruction
        • MoGe RuichengWang2025CVPR proposes an affine-invariant representation, which is agnostic to true global scale and shift, for geometry learning. Together with a robust and efficient point cloud alignment solver, a set of global and local geometry supervisions is introduced. Eventually, it outperforms SOTA monocular geometry estimation methods #🧠
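
The affine-invariant idea can be made concrete with the simplest possible alignment step: solve for a global scale and shift that best maps the prediction onto ground truth before computing a loss. This is a simplified closed-form sketch, not MoGe's actual (more robust) solver:

```python
import numpy as np

def align_scale_shift(pred: np.ndarray, gt: np.ndarray):
    """Least-squares scale s and shift t minimizing ||s * pred + t - gt||^2.

    A simplified global alignment via the normal equations; the paper's
    solver is more robust (e.g., outlier-aware), this is just the idea.
    """
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)  # columns: [pred, 1]
    (s, t), *_ = np.linalg.lstsq(A, gt.ravel(), rcond=None)
    return s, t

pred = np.array([0.0, 1.0, 2.0])
s, t = align_scale_shift(pred, 2.0 * pred + 3.0)  # recovers s=2, t=3
```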
      • Multi-view 3D reconstruction
        • Murre HaoyuGuo2025CVPR first estimates SfM and uses it to condition the diffusion model to generate multi-view metric depth maps, providing an effective way to leverage foundation vision models for 3D reconstruction
        • MV-DUSt3R+ ZhenggangTang2025CVPR introduces multi-view decoder blocks to deal with the combinatorial number of pairwise reconstructions and the expensive global optimization in DUSt3R/MASt3R-like single-stage scene reconstruction methods, reducing inference time to 2 seconds #🚀
        • FoundationStereo BowenWen2025CVPR constructs a large-scale (1M pairs) synthetic training dataset, with an automatic scheme to remove ambiguous samples. On network design, it uses a vision foundation model as a side-tuning feature backbone, mitigating the sim-to-real gap and enabling long-range context reasoning
      • Dynamic 3D scene reconstruction (spatial-temporal)
        • Stereo4D LinyiJin2025CVPR fuses and filters the output of camera pose estimation, stereo depth estimation, and temporal tracking to achieve high-quality 4D reconstructions from internet stereoscopic, wide-angle videos
        • YiqingLiang2025CVPR trains a generalized model for scene flow estimation, introducing (i) a new method for joint geometry-motion estimation and (ii) a new data recipe that yields 1M annotated data samples
        • Generate multi-view from single-view
          • CAT4D RundiWu2025CVPR uses a video diffusion model to transform a single monocular video into a multi-view video, enabling robust 4D reconstruction via optimization of a deformable 3D Gaussian representation
          • FluidNexus YueGao2025CVPR reconstructs 3D fluid from monocular video with (i) a novel-view video synthesizer and (ii) a physics-integrated differentiable particle simulator
      • SLAM
        • MegaSaM KaiyuLi2025CVPR presents a deep visual SLAM framework robust to real-world videos of complex dynamic scenes with unconstrained camera paths and little camera parallax
      • Vanishing point estimation
        • GlobustVP BangyanLiao2025CVPR introduces convex relaxation techniques to solve the vanishing points problem
      • Camera pose estimation
        • JuanCDibene2025CVPR presents a marker-based geometric estimation framework for the absolute pose of a camera by analyzing the 1D observations in a single radially distorted pixel scanline
    • Depth sensors
      • TacoDepth YiranWang2025CVPR fuses Radar & image data for depth estimation with one-stage fusion. The graph-based Radar structure extractor and the pyramid-based Radar fusion module are used to capture and integrate the graph structures of Radar point clouds
      • AnaghMalik2025CVPR presents a system for physically based, neural inverse rendering from multi-viewpoint videos of propagating light (flash lidar)
      • DORNet ZhengxueWang2025CVPR proposes a RGB-guided depth super-resolution method augmented with self-supervised real-world degradation learning
      • YuhuiLiu2025CVPR presents a hardware-software co-designed RGBD imaging framework that leverages both stereo and focus cues to reconstruct texture-rich color images along with detailed depth maps
      • SotirisNousias2025CVPR explores how to detect and leverage "ambient" laser pulses from other devices for passive 3D vision #🧠
    • Object-specific priors
      • Human reconstruction
        • MEGA GuenoleFiche2025CVPR tokenizes the human pose and shape and formulates the human mesh reconstruction (HMR) task as generating a sequence of discrete tokens conditioned on an input image. It models the 2D -> 3D ambiguity with the generative paradigm #🧠
        • Pose estimation
          • YanXia2025CVPR reconstructs 3D humans with a biomechanically accurate skeleton from a single image. It addresses the lack of training data with pseudo ground truth labels that are iteratively refined
        • Avatar
          • CAP4D FelixTaubner2025CVPR uses a morphable multi-view diffusion model to reconstruct photo-real 4D (dynamic 3D) portrait avatars from any number of reference images (i.e., one to 100) and animate and render them in real time
            Key contribution: transfer the advantage of multi-view approach (multi-view stereo & neural rendering) to single-view approach

2D vision

  • Generation
    • Auto-regressive generation
      • RandAR ZiqiPang2025CVPR enables arbitrary image token generation order with the position instruction token accompanying each image token
      • Infinity JianHan2025CVPR expands the tokenizer vocabulary size to infinity by refactoring the visual autoregressive model into a bitwise token prediction framework with an infinite-vocabulary classifier and a bitwise self-correction mechanism #🧠
      • KaihangPan2025CVPR constructs a proper visual language for LLM-diffusion combined models by leveraging diffusion timesteps to learn discrete, recursive visual tokens, replacing spatial tokens, which lack the recursive structure inherent to language and thus form an impossible language for LLMs to master #🧠
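
Infinity's bitwise-token idea can be sketched in a few lines: represent each token as d sign bits, so the effective vocabulary is 2^d without ever materializing a 2^d-way softmax. A minimal toy version (the helper names are mine, not the paper's):

```python
import numpy as np

def to_bits(z: np.ndarray) -> np.ndarray:
    """Binary quantization: keep only the sign of each latent channel."""
    return (z > 0).astype(np.int8)

def bits_to_index(bits: np.ndarray) -> int:
    """A d-bit token indexes an implicit vocabulary of size 2**d."""
    return int(bits @ (1 << np.arange(bits.size)))

z = np.array([0.3, -1.2, 0.7, 0.1])
bits = to_bits(z)          # [1, 0, 1, 1]
idx = bits_to_index(bits)  # 1 + 4 + 8 = 13
```

Predicting d independent bits instead of one index over 2^d classes is what keeps the classifier tractable as the vocabulary grows.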
    • DiT
      • Visual token
        • VA-VAE JingfengYao2025CVPR speeds up latent diffusion model training by 21x and achieves SoTA when increasing the per-token feature dimension, by aligning the latent space with pre-trained vision foundation models during visual tokenizer training #🚀
        • TexTok KaiwenZha2025CVPR conditions the tokenization process on descriptive text captions, allowing more learning capacity and token space to be allocated to capture fine-grained visual details #🧠
      • Sample efficiency
        • SoobinUm2025CVPR develops an online prompt optimization framework to promote learning of minority samples
      • Few-step generation
        • ARD YeongminKim2025CVPR proposes a method to distill a DiT to a few-step generator which leverages the historical trajectory of the ODE to predict future steps. Key innovations are (i) adding token-wise time embedding to mark each input from the trajectory history and (ii) employing a block-wise causal attention mask for training
      • Scenario-specific improvement
        • DreamRelation QingyuShi2025CVPR addresses the gap of relation-aware image generation by (i) dataset with relation-specific images, (ii) keypoint matching loss for pose guides, and (iii) local features from the image prompts to better distinguish between objects
        • DesignDiffusion ZhendongWang2025CVPR improves text-to-poster generation by removing intricate components like position and layout modeling. Instead, it uses distinctive character embedding, character localization loss, and self-play DPO
        • CustAny LingjieKong2025CVPR extends SOTA zero-shot object customization from specific domains to the general domain by providing a large-scale general identity (preserving) dataset, MC-IDC
    • Flow
      • LookingGlass PascalChang2025CVPR uses latent rectified flow models to generate anamorphic images that still retain a valid interpretation when viewed directly, while encoding hidden image(s) that require a special device for viewing. The key is the introduction of Laplacian Pyramid Warping, a frequency-aware image warping technique
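
The Laplacian pyramid underlying that warping technique is a classic decomposition: each level stores a band-pass residual, and the levels sum back to the original image. A minimal sketch with 2x average-pool down / nearest-neighbor up (the paper's frequency-aware warping operates on top of such a decomposition; this shows only the decomposition itself):

```python
import numpy as np

def build_laplacian_pyramid(img: np.ndarray, levels: int):
    """Minimal Laplacian pyramid for a 2D array with power-of-two sides."""
    pyramid, cur = [], img
    for _ in range(levels):
        h, w = cur.shape
        down = cur.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))  # downsample
        up = down.repeat(2, axis=0).repeat(2, axis=1)               # upsample
        pyramid.append(cur - up)                                    # band-pass residual
        cur = down
    pyramid.append(cur)  # low-frequency residual
    return pyramid

def collapse(pyramid):
    """Invert the decomposition by upsampling and adding residuals back."""
    cur = pyramid[-1]
    for lap in reversed(pyramid[:-1]):
        cur = cur.repeat(2, axis=0).repeat(2, axis=1) + lap
    return cur

img = np.random.rand(8, 8)
rec = collapse(build_laplacian_pyramid(img, levels=2))  # reconstructs img
```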
    • Safety
      • AndreasMuller2025CVPR reveals that attackers can easily forge or remove semantic watermarks of the generated images
      • Adv-CPG JunyingWang2025CVPR introduces facial adversarial attacks into Customized Portrait Generation (CPG) to prevent the generated images from being misused by malicious face recognition systems.
  • Editing
    • EricKee2025CVPR trains a reflection removal network using synthetic RAW photos with reflection simulation, showing more significant improvement than architectural variations. Another insight is using the "selfie" camera to take an optional contextual photo to disambiguate real reflections
    • ShangquanSun2025CVPR improves rain streak removal with dual-branch spatio-temporal state-space models
    • Dataset
      • AnyEdit QifanYu2025CVPR presents a comprehensive multi-modal instruction editing dataset, comprising 2.5 million high-quality editing pairs spanning over 20 editing types and five domains
    • Super resolution
      • DiffFNO XiaoyiLiu2025CVPR introduces the Weighted Fourier Neural Operator (WFNO), capturing critical frequency components from the spectral domain. It's accompanied by Attention-based Neural Operator (AttnNO) to capture spatial domain feature. An Adaptive Time-Step (ATS) ODE solver is also used for acceleration
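
The core Fourier-operator step in WFNO-style layers is: FFT the signal, act on a truncated set of low-frequency modes with learned weights, and inverse FFT. A 1D numpy sketch with fixed (non-learned) weights; DiffFNO's contribution is learning how to re-weight the critical frequencies, which this does not capture:

```python
import numpy as np

def spectral_conv1d(x: np.ndarray, weights: np.ndarray, modes: int) -> np.ndarray:
    """FFT -> multiply the lowest `modes` frequencies by `weights` -> inverse FFT."""
    X = np.fft.rfft(x)
    out = np.zeros_like(X)
    out[:modes] = X[:modes] * weights  # act only on retained low modes
    return np.fft.irfft(out, n=x.size)

x = np.random.rand(64)
y = spectral_conv1d(x, weights=np.ones(8, dtype=complex), modes=8)  # low-pass
```

With all-ones weights over every mode the layer is the identity, which makes a handy correctness check.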
  • Tracking
    • Descriptor-In-Pixel LaurieBose2025CVPR performs point-feature detection and tracking entirely in-pixel, with the Pixel Processor Array (PPA) hardware, achieving 1000x reduction in data transfer compared to raw image output. It tracks point-features reliably even under violent motion
  • Detection
    • KunyuWang2025CVPR achieves efficient continual test-time adaptive object detection (CTTA-OD) with a sensitivity-guided channel pruning strategy that scores each channel by its sensitivity to domain discrepancies at both image and instance levels
  • Segmentation
    • JiaxinCai2025CVPR proposes a symmetric parameter-efficient fine-tuning framework for multimodal segmentation, featuring a modality-aware prompting and adaptation scheme. It leverages the capabilities of the pre-trained model on RGB while not relying on RGB alone, thus fully utilizing other modalities
    • AishikKonwer2025CVPR enhances Segment Anything Model (SAM) to enable the generation of high-fidelity segmentations. The core is a policy trained by DPO with simple ratings or rankings provided by a virtual annotator simulating the human annotation process
    • Effective SAM MinhyeokLee2025CVPR replaces the two-stage proposal-VLM SAM approach with a one-stage open-vocabulary approach. It refines the spatial aggregation for mask predictions by embedding pseudo prompts generated from image-text correlations into SAM's promptable segmentation framework

Video vision

  • Representation learning
    • AnnaManasyan2025CVPR augments unsupervised object-centric representation learning from videos with an object-level temporal contrastive loss, improving temporal consistency and object discovery, outperforming even some weakly supervised methods
    • ViewpointRosetta MiLuo2025CVPR uses diffusion-based Rosetta Stone Translator to align the ego- and exo-viewpoint videos in feature space. It provides a new cross-view benchmark using Ego-Exo4D to illustrate the advantage of the learned feature
  • Video understanding
    • SEAL LanWang2025CVPR decomposes the video into three distinct types of semantic entities: scenes, objects, and actions, allowing models to operate on a compact set of entities rather than a large number of frames or pixels. It also proposes an attention learning module that balances token relevance with diversity
    • AVIGATE BoseungJeong2025CVPR leverages audio cues through a gated attention mechanism that selectively filters out uninformative audio signals, instead of blindly utilizing the audio input
    • Tokenization
      • TEAM SuBeenLee2025CVPR discards pre-defined and length-dependent alignment units (e.g., frames or tuples) for few-shot action recognition. Instead, it represents videos with a fixed set of pattern tokens that capture globally discriminative clues for token-wise comparisons among videos
      • VST YanShu2025CVPR introduces Visual Summarization Token for hour-scale video understanding, with dynamic information compression which exploits MLLMs' inherent key-value (KV) sparsification capacity to condense the visual input. This token is trained with instruction fine-tuning
    • Dataset
      • VideoEspresso SonghaoHan2025CVPR presents a dataset with video QA pairs preserving essential spatial details and temporal coherence, along with multimodal annotations of intermediate CoT reasoning steps generated with GPT-4o
      • PanAf-FGBG OttoBrookes2025CVPR presents a dataset featuring 21 hours of wild chimpanzee behaviors, which enables direct evaluation of in-distribution and out-of-distribution conditions, and for the impact of backgrounds on behavior recognition models
  • Generation
    • Diffusion Renderer RuofanLiang2025CVPR uses video diffusion model for both inverse and forward rendering, enabling video editing like relighting, material editing, and realistic object insertion
    • Conditioning
      • Motion Prompting DanielGeng2025CVPR uses spatio-temporally sparse or dense motion trajectories for motion conditioning, instead of text prompts. It also proposes motion prompt expansion to translate high-level user requests into detailed, semi-dense motion prompts
      • RyanBurgert2025CVPR enhances video diffusion models by allowing motion control via structured latent noise sampling, which is derived from the optical flow fields in the data preprocessing stage. It is agnostic to diffusion model design and enables a wide range of user-friendly motion control: local object motion control, global camera movement control, and motion transfer
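
The structured-noise idea can be caricatured as transporting per-pixel latent noise along the optical flow so the noise follows scene motion. This is a crude nearest-neighbor sketch with integer flow; the actual method warps noise more carefully so it remains Gaussian:

```python
import numpy as np

def warp_noise(noise: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Transport per-pixel noise along a (dx, dy) optical-flow field.

    Nearest-neighbor gather with clamped borders; a toy stand-in for
    Gaussianity-preserving noise warping.
    """
    h, w = noise.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(ys - flow[..., 1].round().astype(int), 0, h - 1)
    src_x = np.clip(xs - flow[..., 0].round().astype(int), 0, w - 1)
    return noise[src_y, src_x]

noise = np.random.randn(4, 4)
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0                 # shift everything one pixel right
warped = warp_noise(noise, flow)   # interior columns are copies of their left neighbor
```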

Embodied intelligence

  • Robotics
    • PDFactor JingyiTian2025CVPR decomposes a 3D point cloud into three orthogonal feature planes and leverages a tri-perspective view transformer to produce dense cubic features as a latent diffusion field, representing 6-DoF action probability distribution. It then employs a small denoising network for feature & action inference #🧠
    • Dataset
      • RoboSpatial ChanHeeSong2025CVPR presents a large-scale dataset for spatial understanding in robotics, consisting of real indoor and tabletop scenes, captured as 3D scans and egocentric images, and annotated with rich spatial information relevant to robotics
    • Reward
      • GROVE JiemingCui2025CVPR presents a generalized reward framework that enables open-vocabulary physical skill learning without manual engineering or task-specific demonstrations, with LLMs generating physical constraints capturing task requirements and VLMs evaluating motion semantics and naturalness. For efficiency, Pose2CLIP is trained to project agent poses directly into semantic feature space without computationally expensive rendering #🚀
    • Navigation
      • NWM AmirBar2025CVPR presents a controllable video generation model that predicts future visual observations based on past observations and navigation actions. It is used to simulate and plan navigation trajectories by evaluating whether they achieve the desired goal
  • Autonomous vehicle
    • CAT-K ZhejunZhang2025CVPR addresses the covariate shift of multi-agent traffic simulation with a fine-tuning strategy, when shifting from open-loop to closed-loop simulation

Misc

  • AI4Science
    • Neuroscience
      • BrainNRDS JacobYeung2025CVPR uses fMRI brain activity signals to condition image reanimation. It conducts a series of experiments based on video diffusion models: fMRI → optical flow, video encoder → brain activity, and fMRI → image reanimation/full video decoding
    • Geoscience
      • SegEarth-OV KaiyuLi2025CVPR presents a training-free open-vocabulary semantic segmentation (OVSS) system for remote sensing, augmented by a low-resolution feature upsampler and a subtraction operation to alleviate the global bias in patch CLS tokens
      • IceDiff JingyiXu2025CVPR uses a vision transformer to generate coarse Arctic sea ice forecasts, which are then refined into fine-level forecasts with an unconditional diffusion model
  • AI4Health
    • TopoCellGen MeilongXu2025CVPR integrates topological constraints into a diffusion model to improve the generation of realistic, contextually accurate cell topologies. It also proposes the Topological Frechet Distance (TopoFD) to assess the fidelity of the generated samples, beyond the traditional FID score

Architecture

  • Diffusion model
    • AF-LDM YifanZhou2025CVPR makes latent diffusion models (LDMs) shift-equivariant for output consistency by redesigning the attention modules to be shift-equivariant and proposing an equivariance loss that effectively suppresses the frequency bandwidth of the features in the continuous domain
    • Sampling steps
      • DAPS BingliangZhang2025CVPR decouples consecutive steps in a diffusion sampling trajectory, allowing them to vary considerably from one another while ensuring their time-marginals anneal to the true posterior. It's proved to be effective for complicated nonlinear inverse problems like image restoration
    • Use as feature extractor
      • NickStracke2025CVPR deals with the requirement of adding noise when using large-scale pre-trained diffusion models as a feature extractor. It shows that ensembling with different random noises can't remedy the consistency issue. Instead, it proposes an unsupervised fine-tuning framework to enable feature extraction without adding noise
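
Shift-equivariance, as targeted by AF-LDM above, is easy to check numerically for any image operator: compare shifting-then-applying against applying-then-shifting. A generic test helper (not from either paper), demonstrated on a circular blur, which commutes with circular shifts exactly:

```python
import numpy as np

def shift_equivariance_error(f, x: np.ndarray, dy: int, dx: int) -> float:
    """max |f(shift(x)) - shift(f(x))| — zero iff f commutes with circular shifts."""
    shifted_in = f(np.roll(x, (dy, dx), axis=(0, 1)))
    shifted_out = np.roll(f(x), (dy, dx), axis=(0, 1))
    return float(np.abs(shifted_in - shifted_out).max())

x = np.random.rand(16, 16)
blur = lambda im: (np.roll(im, 1, 0) + np.roll(im, -1, 0) + im) / 3  # circular conv
err = shift_equivariance_error(blur, x, dy=3, dx=5)  # ~0 for this operator
```

Operators with zero-padded borders or patch-wise attention typically give a large error here, which is exactly the failure mode AF-LDM addresses.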
  • Transformer
    • Interpretability
      • LibraGrad FaridounMehri2025CVPR digs into why gradient-based explanation, which works well with CNN, doesn't work for Transformer. It proposes a post-hoc approach that corrects gradient imbalances through pruning and scaling of backward paths
  • CNN
    • OverLoCK MengLou2025CVPR mimics the overview-first-look-closely-next pattern of the human visual system by constructing a Base-Net to encode low/mid-level features, an Overview-Net to generate top-down attention, and a Focus-Net that performs finer-grained perception guided by the top-down attention
      P.S. Kind of like introducing a hand-engineered attention mechanism to the CNN framework that mimics the human visual system's behavior
  • Spiking Neural Network (SNN)
    • a-XNOR YichenXiao2025CVPR attributes the performance gap of spiking Transformers to the ineffectiveness of the dot product in measuring similarity between spiking queries and keys, and replaces it with a-XNOR similarity
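
The dot-product failure mode is easy to see on binary spike vectors: it counts only (1,1) co-firing positions and is blind to (0,0) agreement. An XNOR-style similarity rewards both; the `a` weight on silent matches below is my guess at the role of the paper's parameter, not its exact definition:

```python
import numpy as np

def dot_sim(q, k):
    return int(q @ k)  # counts only co-firing (1,1) positions

def xnor_sim(q, k, a: float = 1.0):
    """XNOR-style similarity: (1,1) matches plus a-weighted (0,0) matches."""
    both_fire = q * k
    both_silent = (1 - q) * (1 - k)
    return float((both_fire + a * both_silent).sum())

q = np.array([1, 0, 0, 1])
k = np.array([1, 0, 0, 0])
# the dot product ignores the two agreeing silent positions:
assert dot_sim(q, k) == 1
assert xnor_sim(q, k) == 3.0
```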
  • Activation function
    • DamienTeney2025CVPR proposes a method to meta-learn activation functions and shows that the simplicity bias, i.e. using the simple ReLU, doesn't always hold
      Interestingly, in tasks neural networks have historically struggled with, e.g. tabular data & regression tasks, more complex activation functions are learned; whereas on images the learned activation functions are similar to ReLU/GeLU

Supervision

  • Infra
    • UniAP HaoLin2025CVPR develops an automatic parallelism method that jointly optimizes inter- & intra-layer parallelisms
  • Dataset distillation
    • UniDD DingQi2025CVPR extends dataset distillation beyond image classification to universal tasks with (i) Universal Task Knowledge Mining, which captures task-relevant information through task-specific proxy model training and (ii) Universal Task-Driven Diffusion, where these proxies guide the diffusion process to generate task-specific synthetic images
  • Post-training learning
    • Quantization
      • KaiZhao2025CVPR uses a multi-layer feature mixer, normalizing-flow-based attention, and dedicated losses to promote data diversity in the data generator for model quantization
    • LoRA
      • LoRASculpt JianLiang2025CVPR mitigates the forgetting of generalized knowledge during LoRA fine-tuning by controlling harmful parameter-update redundancy #🧠
  • Federated learning
    • YanbiaoMa2025CVPR proposes a geometry-guided data generation method that centers on simulating the global embedding distribution locally, addressing unstable training with heterogeneous data distributions where label skew and domain skew coexist
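
A minimal version of "simulating the global embedding distribution locally" is pooling per-client Gaussian statistics of embeddings into global moments that every client can sample from. This uses only the standard pooled first/second moment identities and is a loose stand-in for the paper's geometry-guided generation:

```python
import numpy as np

def pooled_gaussian(means, covs, counts):
    """Combine per-client (mean, cov, n) into global Gaussian moments.

    Uses E[x] = sum_i w_i mu_i and E[xx^T] = sum_i w_i (C_i + mu_i mu_i^T),
    then subtracts the global outer product of the mean.
    """
    counts = np.asarray(counts, dtype=float)
    w = counts / counts.sum()
    mu = sum(wi * m for wi, m in zip(w, means))
    second = sum(wi * (c + np.outer(m, m)) for wi, m, c in zip(w, means, covs))
    return mu, second - np.outer(mu, mu)

mu, cov = pooled_gaussian(
    means=[np.array([0.0, 0.0]), np.array([2.0, 2.0])],
    covs=[np.eye(2), np.eye(2)],
    counts=[100, 100],
)
```

Note the between-client spread of the means shows up as extra variance in the pooled covariance, which is exactly the label/domain-skew signal a purely local model never sees.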