All-Day Multi-Scenes Lifelong Vision-and-Language Navigation with Tucker Adaptation

Main picture
The proposed All-day Multi-scenes Lifelong Vision-and-Language Navigation (AML-VLN) task and AlldayWalker with Tucker Adaptation (TuKA). AML-VLN requires navigation agents to learn continually across multiple scenes and multiple environments (low-light, overexposure, and scattering), progressively consolidating navigation knowledge so that they evolve continually and achieve all-day multi-scenes navigation. TuKA decouples navigation knowledge and represents it in a high-dimensional space to continuously learn specific and shared knowledge.

Abstract

Deploying vision-and-language navigation (VLN) agents requires adaptation across diverse scenes and environments, but fine-tuning on a specific scenario often causes catastrophic forgetting in others, which severely limits flexible long-term deployment. We formalize this challenge as the all-day multi-scenes lifelong VLN (AML-VLN) problem. Existing parameter-efficient adapters (e.g., LoRA and its variants) are limited by their two-dimensional matrix form, which fails to capture the multi-hierarchical navigation knowledge spanning multiple scenes and environments. To address this, we propose Tucker Adaptation (TuKA), which represents the multi-hierarchical navigation knowledge as a high-order tensor and leverages Tucker decomposition to decouple the knowledge into shared subspaces and scenario-specific experts. We further introduce a decoupled knowledge incremental learning strategy to consolidate shared subspaces while constraining specific experts for decoupled lifelong learning. Building on TuKA, we also develop a VLN agent named AlldayWalker, which continually learns across multiple navigation scenarios, achieving all-day multi-scenes navigation. Extensive experiments show that AlldayWalker consistently outperforms state-of-the-art baselines.

Method

Main picture

Illustration comparing the existing LoRA architecture with our TuKA architecture. Unlike LoRA and its MoE-LoRA variants, which represent knowledge within a two-dimensional matrix, TuKA represents the multi-hierarchical knowledge within a high-order tensor and decouples it into shared and specific components.
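For reference, LoRA's per-layer update is the product of two low-rank matrices (this is the standard LoRA formulation, shown here for contrast and not reproduced from the paper):
$$\Delta W_{\text{LoRA}} = B A, \quad B \in \mathbb{R}^{a \times r},\; A \in \mathbb{R}^{r \times b},$$
so each task stores its own independent $(B, A)$ pair, with no built-in mechanism for sharing structure across scenes and environments, whereas TuKA reads its update out of a shared fourth-order tensor (Eq. 2 below).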

Main picture
Illustration of the proposed Decoupled Knowledge Incremental Learning. Our TuKA performs decoupled representation and incremental learning of knowledge in a high-dimensional space.
To learn the $t$-th navigation scenario task (with the $s$-th scene and the $e$-th environment) $T_{t} = \{S_{s},E_{e}\}$, we propose a new fine-tuning method named Tucker Adaptation (TuKA), which fine-tunes the pretrained StreamVLN agent $\mathcal{F}_{\theta_{0}}$ in a higher-dimensional space on the task-specific data $D_{t}=\{\mathcal{O}_{t},\mathcal{I}_{t}\}$ and obtains an updated model $\mathcal{F}_{\theta'_{t}}$, where $\theta'_{t}=\theta_{0}+\Delta\theta_{t}$, $\Delta\theta_{t} = \{\Delta W^{l}_{t}\}^{L}_{l=1}$, and $\Delta W^{l}_{t}\in\mathbb{R}^{a_{l} \times b_{l}}$ is the weight update in the $l$-th of $L$ transformer layers for task $T_{t}$. In TuKA, we follow the Tensor Tucker Decomposition Approach (TAKE) to decouple the higher-order tensor knowledge for better multi-hierarchical navigation knowledge learning.

Specifically, a fourth-order tensor $\mathcal{X} \in \mathbb{R}^{a\times b\times M \times N}$ can be decomposed as: $$\mathcal{X} = \mathcal{G} \times_{1} U^{1} \times_{2} U^{2} \times_{3} U^{3} \times_{4} U^{4}, \tag{1}$$ where $\times_{n}$, $n=1,2,3,4$, denotes the $n$-mode product of a tensor and a matrix. $\color{orange}{\mathcal{G}}\in \mathbb{R}^{r_{1}\times r_{2}\times r_{3}\times r_{4}}$ is the core tensor, which contains the interaction information among all modes and is used to learn the $\color{orange}{\text{navigation-shared knowledge}}$. The factor matrix $\color{orange}{U^{1}} \in \mathbb{R}^{a \times r_{1}}$ represents the transformation of features from dimension $r_{1}$ to $a$ and can be regarded as a decoder; $\color{orange}{U^{2}} \in \mathbb{R}^{b \times r_{2}}$ represents the transformation of features from dimension $b$ to $r_{2}$ and can be regarded as an encoder. The factor matrix $\color{purple}{U^{3}} \in \mathbb{R}^{M \times r_{3}}$ represents a group of $M$ scene experts, where each scene expert $\color{purple}{U^{3}[i,:]}$ learns the $i$-th $\color{purple}{\text{specific scene knowledge}}$; $\color{cyan}{U^{4}} \in \mathbb{R}^{N \times r_{4}}$ represents a group of $N$ environment experts, where each environment expert $\color{cyan}{U^{4}[j,:]}$ learns the $j$-th $\color{cyan}{\text{specific environment knowledge}}$.

Thus, for adaptation to the $t$-th scenario with the $s$-th scene and the $e$-th environment, we extract the task-specific rows $U^{3}[s,:]$ and $U^{4}[e,:]$ from the high-order tensor $\mathcal{X}$ to constitute the weight update $\Delta W_{t}$: $$\Delta W_{t} = U^{1} \cdot(\mathcal{G} \times_{3} U^{3}[s,:] \times_{4} U^{4}[e,:]) \cdot (U^{2})^{T}. \tag{2}$$
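To make Eq. (2) concrete, the following minimal NumPy sketch composes $\Delta W_{t}$ from the shared core tensor and the selected expert rows; the function name, shapes, and toy values are illustrative assumptions, not the paper's released code:

```python
import numpy as np

def tuka_delta_w(G, U1, U2, U3, U4, s, e):
    """Compose Delta W_t (Eq. 2) for scene s and environment e.

    G:  core tensor, shape (r1, r2, r3, r4) -- navigation-shared knowledge
    U1: decoder factor, shape (a, r1)
    U2: encoder factor, shape (b, r2)
    U3: scene experts, shape (M, r3); row s is the s-th scene expert
    U4: environment experts, shape (N, r4); row e is the e-th environment expert
    """
    # Mode-3 product with the scene expert: (r1, r2, r3, r4) -> (r1, r2, r4)
    core = np.einsum('pqrt,r->pqt', G, U3[s])
    # Mode-4 product with the environment expert: (r1, r2, r4) -> (r1, r2)
    core = np.einsum('pqt,t->pq', core, U4[e])
    # Decode back to the layer's weight shape: (a, r1) @ (r1, r2) @ (r2, b)
    return U1 @ core @ U2.T

# Toy usage with arbitrary ranks and layer size
a, b, M, N = 64, 64, 4, 4        # layer dims, number of scenes, number of environments
r1, r2, r3, r4 = 8, 8, 4, 4      # Tucker ranks
rng = np.random.default_rng(0)
G = rng.normal(size=(r1, r2, r3, r4))
U1 = rng.normal(size=(a, r1))
U2 = rng.normal(size=(b, r2))
U3 = rng.normal(size=(M, r3))
U4 = rng.normal(size=(N, r4))
dW = tuka_delta_w(G, U1, U2, U3, U4, s=1, e=2)
assert dW.shape == (a, b)
```

Because the two mode products contract the scene and environment modes down to vectors, the per-task update costs no more parameters than a single LoRA pair at inference time, while $\mathcal{G}$, $U^{1}$, and $U^{2}$ remain shared across all tasks.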

AlldayWalker Algorithms

Main picture
Main picture

For clarity, summaries of the proposed AlldayWalker learning algorithm and inference algorithm are provided above.

Task Setting

Main picture

Illustration of our all-day multi-scenes lifelong VLN benchmark setting. Normal environments are marked in green, low-light in blue, scattering in yellow, and overexposure in purple.

Real-World Experiments

Robotic Navigation System Introduction

Main picture
The platform consists of a DeepRobotDog Lite2 quadruped robot, a Hikvision DS-E12 camera, a portable WiFi communication module, and a remote computation server equipped with an NVIDIA A6000 GPU. During deployment, the Hikvision DS-E12 mounted on the robot captures RGB visual streams from the real environment, providing essential perception signals for navigation and scene understanding. These visual data are transmitted in real time to the remote server through the portable WiFi module. The server performs inference using the proposed AlldayWalker system running on the A6000 GPU, which processes both user instructions and visual observations to generate navigation actions. The resulting control signals are then sent back to the DeepRobotDog Lite2 via the wireless communication channel. The robot executes these actions in the physical environment, enabling closed-loop interaction between perception, language reasoning, and embodied control. This platform supports flexible and robust experimentation with all-day multi-scenes lifelong navigation, bridging simulation and real-world deployment.
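As a rough picture of this closed loop, here is a hedged Python sketch of the client side; the endpoint URL, message format, and the `camera`/`robot` helper objects are all hypothetical stand-ins, not the system's actual interface:

```python
import requests  # generic HTTP transport as a stand-in for the portable WiFi link

SERVER_URL = "http://remote-server:8000/navigate"  # placeholder address for the A6000 server

def navigation_loop(camera, robot, instruction, max_steps=100):
    """Stream RGB frames to the remote AlldayWalker server and execute returned actions."""
    for _ in range(max_steps):
        frame = camera.capture_rgb()           # hypothetical wrapper around the DS-E12 stream
        reply = requests.post(SERVER_URL, json={
            "instruction": instruction,        # natural-language navigation task
            "frame": frame.tolist(),           # RGB observation (assumed NumPy array)
        }).json()
        if reply["action"] == "STOP":          # assumed stop token ends the episode
            break
        robot.execute(reply["action"])         # hypothetical quadruped control interface
```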

Vision-Language Navigation in the Real World

Scene: real-world 2

Environment: Normal

Speed 1X

Task21

Walk straight ahead, bypass the pole on the left, and find the cup in the middle of the blue cone and in front of the yellow cone.

Scene: real-world 1

Environment: Low-light

Speed 1X

Task22

Find the red cone-shaped barrel on the right, pass through from the left side, then pass to the right of the blue cone-shaped barrel in front, and finally find the red ball next to the cardboard box.

Scene: real-world 2

Environment: Low-light

Speed 1X

Task23

Walk straight, pass around the left side of the cardboard box, and find the Cube located in front of the blue cone.

Scene: real-world 1

Environment: Normal

Speed 1X

Task24

Pass through the left side of the blue cone-shaped barrel on the right, go around the left side of the cardboard box to the back of the box, and find the small red ball.

Habitat Simulator Experiments

Allday-Habitat Simulation Platform

Main picture
We extended the Habitat embodied AI simulation platform to create diverse degraded environments for training and evaluation on our AML-VLN task. In addition to normal environments, we synthesize three types of challenging visual conditions (see the sketch after this list):
Scattering environments: Created using atmospheric scattering models that simulate how particles in the air (like fog, haze, or dust) affect image clarity by reducing visibility and adding atmospheric light.
Low-light environments: Generated using abnormal light imaging models that account for reduced illumination conditions, incorporating factors like camera response functions, system gain, exposure time, and various types of noise.
Overexposure environments: Synthesized using models that simulate excessive brightness conditions, where parts of the image become saturated due to too much light, resulting in loss of detail in bright regions.
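As a concrete reference for these three models, the following NumPy sketch shows one standard way to instantiate them; the parameter values and the noise model are illustrative assumptions, not the exact synthesis pipeline used for Allday-Habitat:

```python
import numpy as np

def add_scattering(img, depth, beta=1.2, airlight=0.8):
    """Atmospheric scattering model: I = J*t + A*(1 - t), with t = exp(-beta * depth)."""
    t = np.exp(-beta * depth)[..., None]        # per-pixel transmission from scene depth
    return img * t + airlight * (1.0 - t)       # attenuated signal plus atmospheric light

def add_low_light(img, gain=0.15, read_noise=0.02):
    """Darken the clean image and add sensor noise, mimicking short-exposure capture."""
    dark = img * gain                           # reduced illumination / exposure
    noise = np.random.normal(0.0, read_noise, img.shape)
    return np.clip(dark + noise, 0.0, 1.0)

def add_overexposure(img, gain=2.5):
    """Boost brightness so highlights saturate, losing detail in bright regions."""
    return np.clip(img * gain, 0.0, 1.0)

# Conventions assumed here: img is float RGB in [0, 1] with shape (H, W, 3),
# and depth is a per-pixel depth map with shape (H, W).
```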

Visual Examples for Allday-Habitat

Habitat Natural Light Scene Simulator

Natural lighting scene simulation

Simulating Scattering Scenes Based on Atmospheric Scattering Model

Scene scattering simulation

Simulating Low-light Scenes Based on Camera Response Model

Low light scene simulation

Simulating Overexposure Scenes Based on Camera Response Model

Overexposure scene simulation

Vision-Language Navigation in the Habitat Simulator

Scene: ac26ZMwG7aT

Environment: Normal

Speed 1X

Task1

Go out of the room an take an immediate left. When you reach the hallway, go right. The second door on the left is a powder room, go in and wait there.

Scene: 5LpN3gDmAk7

Environment: Scattering

Speed 1X

Task2

Go straight and into the bedroom to the left of the stairs. Stop in the large doorway to the right of the bed.

Scene: S9hNv5qa7GM

Environment: Normal

Speed 1X

Task3

Exit the room using the door on the right. Turn left and go through the kitchen. Go past the table and chairs and wait near the stairs on the right.

Scene: mJXqzFtmKg4

Environment: Low-light

Speed 1X

Task4

Exit study and turn left. Stop at double doors and tall planter.

Scene: mJXqzFtmKg4

Environment: Overexposure

Speed 1X

Task5

Enter the room go out the door on the right turn left and wait right at the living room entrance.

Scene: ULsKaCPVFJR

Environment: Overexposure

Speed 1X

Task6

Walk past the mirror on your right, and continue to walk down the hallway. Walk into the room directly right of the doorway in front of you, and stop once you walk in.

Scene: 5LpN3gDmAk7

Environment: Normal

Speed 1X

Task7

Enter the house and walk around the chair to the open door on your right. Walk through the open door just to the left of the stairs. Walk to the end of the hall and turn left. Enter the bedroom and wait by the fireplace.

Scene: ac26ZMwG7aT

Environment: Overexposure

Speed 1X

Task8

Leave the room, then take the left door in the next room. Turn left and go all the way down the hallway into the walk-in closet.

Scene: ULsKaCPVFJR

Environment: Normal

Speed 1X

Task9

Walk straight through the doorway and take a right. Walk straight through the bathroom and stop passed the sink.

Scene: ac26ZMwG7aT

Environment: Scattering

Speed 1X

Task10

Walk into the office and turn left. Walk into the bedroom and turn left. Walk down the hall and turn left. Walk into the bathroom and stop.

Scene: ULsKaCPVFJR

Environment: Low-light

Speed 1X

Task11

Turn around and go down the long hallway. Turn the right corner, and turn left into the bathroom.

Scene: ULsKaCPVFJR

Environment: Scattering

Speed 1X

Task12

Walk out of the bathroom and take a left and walk straight, down the hall, into the bedroom. In the bedroom stop next to the photo of sailboat on your right.

Scene: 5LpN3gDmAk7

Environment: Low-light

Speed 1X

Task13

Walk outside and pass the outdoor furniture. Make a left before you reach the table, and enter the inside room on the left. Wait here.

Scene: mJXqzFtmKg4

Environment: Normal

Speed 1X

Task14

Walk toward the fireplace. Exit the room and turn left. Stop in front of the double doors and tall potted plant.

Scene: S9hNv5qa7GM

Environment: Low-light

Speed 1X

Task15

Walk straight down the hall and into the room straight ahead. Wait near the entrance.

Scene: S9hNv5qa7GM

Environment: Scattering

Speed 1X

Task16

Walk into the hardwood floored room, heading towards the fireplace at the opposite end. Once you're at the fireplace, turn right and go into the room ahead of you.

Scene: mJXqzFtmKg4

Environment: Overexposure

Speed 1X

Task17

Enter the room go out the door on the right turn left and wait right at the living room entrance.

Scene: ac26ZMwG7aT

Environment: Low-light

Speed 1X

Task18

Exit the room with the TV, into the room with a wooden desk with two office chairs. Exit that room through the double doors, into the curving hall. Turn into the first room on the right and stop in the doorway.

Scene: 5LpN3gDmAk7

Environment: Overexposure

Speed 1X

Task19

Turn right and exit into the bedroom. Turn left and walk to the wall and then turn left. Walk to the end of that area and turn right out the last door. Once out the door walk across the small room and enter the door to the far left. Stop once you enter the bedroom.

Scene: mJXqzFtmKg4

Environment: Scattering

Speed 1X

Task20

go straight down hall to archway on your left turn left and walk straight to first doorway on your right, stop in doorway.

Comparison Experiment Results

Main picture
Main picture

Test results ((a) SPL ↑, (b) F-SPL ↓, (c) OSR ↑, (d) F-OSR ↓) of the comparison experiments under the AML-VLN setting.

Extension Experiments

Main picture
Main picture

Illustration of the fifth-order tensor TuKA architecture.