HoRD: Robust Humanoid Control via History-Conditioned Reinforcement Learning and Online Distillation

1University of Auckland   2University of New South Wales   3University of Cambridge   4University of Adelaide   5Macquarie University   6University of Melbourne

Abstract

Humanoid robots can suffer significant performance drops under small changes in dynamics, task specifications, or environment setup. We propose HoRD, a two-stage learning framework for robust humanoid control under domain shift. First, we train a high-performance teacher policy via history-conditioned reinforcement learning, where the policy infers latent dynamics context from recent state–action trajectories to adapt online to diverse randomized dynamics. Second, we perform online distillation to transfer the teacher’s robust control capabilities into a transformer-based student policy that operates on sparse root-relative 3D joint keypoint trajectories. By combining history-conditioned adaptation with online distillation, HoRD enables a single policy to adapt zero-shot to unseen domains without per-domain retraining. Extensive experiments indicate that HoRD attains stronger robustness and transfer performance than several state-of-the-art baselines, particularly under unseen domains and external perturbations.

HoRD Framework

HoRD Framework Overview

Framework overview. Two-stage teacher–student learning pipeline for robust humanoid control under partial observability. Stage I: an expert policy π⋆ is trained with PPO in simulation using privileged full-state observations s_t^full, dense future motion intent Y_t^full, and episode-level domain randomization ψ(e). A shared HCDR module encodes the interaction history H_t into a temporal memory embedding m_t for online dynamics inference and adaptive modulation. Stage II: a deployable student policy π receives only sparse proprioception s_t^sparse, environment context g_t, and standardized motion commands Y_t^sparse via SSJR, and is trained by distillation to match the expert’s actions. SSJR maps a global planner command into a platform-agnostic sparse-joint command interface, enabling cross-platform transfer, while HCDR provides in-context adaptation to latent dynamics during deployment.

Two Core Components

SSJR

Standardized Sparse-Joint Representation

SSJR encodes short-horizon future motion as root-relative trajectories of a small set of key joints on a canonical human skeleton (e.g., SMPL-X), obtained by projecting large-scale motion capture datasets such as AMASS onto this sparse joint set. This standardized interface removes platform-specific details while preserving task-relevant intent, so the student policy can map SSJR commands, together with proprioception and context, to low-level torque modulation. Combined with HCDR and domain randomization, SSJR improves data scalability, cross-platform transfer, and anticipatory control.
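As a rough illustration of this interface, the sketch below converts a short horizon of world-frame key-joint positions into a root-relative command vector. The joint set, array shapes, and frame conventions here are assumptions made for illustration, not the paper's exact SSJR definition.

```python
# Minimal sketch of an SSJR-style command encoder (illustrative only; joint names,
# shapes, and the yaw-aligned root frame are assumptions, not the paper's API).
import numpy as np

KEY_JOINTS = ["head", "left_hand", "right_hand", "left_foot", "right_foot"]  # assumed sparse set

def to_root_relative(keypoints_world, root_pos, root_yaw):
    """Express future keypoint positions in a yaw-aligned, root-centered frame.

    keypoints_world: (T, K, 3) future positions of the K key joints in the world frame
    root_pos:        (3,) current root position
    root_yaw:        scalar, current root heading (rad)
    """
    c, s = np.cos(-root_yaw), np.sin(-root_yaw)
    yaw_rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])  # R_z(-yaw)
    centered = keypoints_world - root_pos          # remove global translation
    return centered @ yaw_rot.T                    # remove global heading

def build_ssjr_command(future_keypoints_world, root_pos, root_yaw):
    """Stack a short horizon of root-relative key-joint positions into one command vector."""
    rel = to_root_relative(future_keypoints_world, root_pos, root_yaw)
    return rel.reshape(-1)                         # (T * K * 3,) platform-agnostic command
```

Because the command is expressed relative to the robot's own root and a canonical joint set, the same retargeted AMASS clip can drive different platforms without embedding-specific details.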

HCDR

History-Conditioned Dynamics Representation

HCDR performs online dynamics inference from historical state–action interactions using a lightweight Query-Transformer (Q-Former). Learnable latent tokens attend over recent history to produce a temporal memory embedding that serves as a compact dynamics fingerprint (e.g., friction, mass distribution, actuator delays). Conditioning control on this embedding lets the policy adapt its gait, balance, and torque outputs to how the environment has responded so far, enabling online system identification without direct access to latent physical parameters.
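To make the mechanism concrete, here is a minimal Q-Former-style history encoder in PyTorch: learnable latent tokens cross-attend over the embedded state–action history and are fused into a single memory vector. Layer sizes, token count, and interfaces are illustrative assumptions, not the released HCDR architecture.

```python
# Minimal sketch of a Q-Former-style history encoder (assumed sizes, not the paper's design).
import torch
import torch.nn as nn

class HistoryEncoder(nn.Module):
    def __init__(self, obs_act_dim, embed_dim=128, num_latents=4, num_heads=4):
        super().__init__()
        self.embed = nn.Linear(obs_act_dim, embed_dim)              # per-step (state, action) embedding
        self.latents = nn.Parameter(torch.randn(num_latents, embed_dim) * 0.02)  # learnable query tokens
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.out = nn.Linear(num_latents * embed_dim, embed_dim)    # fuse latents into one memory vector

    def forward(self, history):
        # history: (B, H, obs_act_dim) recent state-action pairs
        tokens = self.embed(history)                                             # (B, H, D)
        queries = self.latents.unsqueeze(0).expand(history.size(0), -1, -1)      # (B, L, D)
        attended, _ = self.cross_attn(queries, tokens, tokens)                   # latents attend over history
        m_t = self.out(attended.flatten(1))                                      # (B, D) temporal memory embedding
        return m_t

# Example: encode a 32-step history of 64-dim (state, action) pairs into m_t.
enc = HistoryEncoder(obs_act_dim=64)
m_t = enc(torch.randn(8, 32, 64))   # -> (8, 128)
```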

Training Methodology

Two-Stage Training Architecture

HoRD employs a teacher–student framework. Stage I: an expert policy π⋆ is trained with PPO in simulation using privileged full-state observations s_t^full, dense future motion intent Y_t^full, and episode-level domain randomization ψ(e); the shared HCDR module encodes interaction history H_t into a temporal memory embedding m_t. Stage II: a student policy π is distilled from the expert via supervised imitation, operating from sparse observations s_t^sparse and SSJR-formatted sparse motion commands Y_t^sparse, with the same m_t from HCDR.

Stage I: Expert Training

The expert receives the full state s_t^full, dense future motion intent Y_t^full from AMASS, environment context g_t, and the temporal memory m_t from HCDR. It is trained with PPO to maximize expected discounted return across domain-randomized episodes, with rewards penalizing tracking error, balance loss, excessive control effort, and falls.
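A hedged sketch of such a reward composition is shown below; the individual terms and weights are placeholders chosen for illustration, not the paper's actual reward function.

```python
# Illustrative expert reward: tracking, balance, effort, and fall terms with assumed weights.
import torch

def expert_reward(ref_keypoints, robot_keypoints, root_height, joint_torques, fell):
    # ref_keypoints / robot_keypoints: (K, 3) tensors; root_height: scalar tensor;
    # joint_torques: (J,) tensor; fell: bool termination flag.
    track_err = (ref_keypoints - robot_keypoints).norm(dim=-1).mean()   # mean keypoint tracking error
    r_track   = torch.exp(-5.0 * track_err)                             # exponential tracking reward
    r_balance = torch.exp(-10.0 * (root_height - 0.93).abs())           # stay near a nominal root height
    r_effort  = -1e-4 * joint_torques.pow(2).sum()                      # penalize excessive control effort
    r_fall    = -10.0 if fell else 0.0                                  # terminal fall penalty
    return r_track + 0.5 * r_balance + r_effort + r_fall
```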

Stage II: Student Distillation

The student receives only sparse observations s_t^sparse and sparse future motion cues Y_t^sparse via SSJR, plus the same m_t. It is trained via supervised distillation (e.g., MSE on actions) to match the expert's actions. By sharing the temporal memory and SSJR interface across both stages, the student recovers the expert's anticipatory behavior and remains robust across physical and environmental conditions.
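The sketch below shows one way such an online distillation step could look (a DAgger-style reading, where the student's own actions drive the rollout while the expert provides action labels); the environment and policy interfaces (`env`, `expert`, `student`, `hcdr`) are placeholders, not the released code.

```python
# Minimal sketch of one online distillation step, assuming PyTorch policies and a vectorized sim.
import torch
import torch.nn.functional as F

def distillation_step(env, expert, student, hcdr, optimizer):
    obs = env.get_observations()                  # dict with full/sparse obs, history, SSJR commands (placeholder API)
    m_t = hcdr(obs["history"])                    # shared temporal memory embedding

    with torch.no_grad():                         # expert labels computed from privileged inputs
        a_expert = expert(obs["full_state"], obs["dense_intent"], obs["context"], m_t)

    a_student = student(obs["sparse_state"], obs["ssjr_command"], obs["context"], m_t)

    loss = F.mse_loss(a_student, a_expert)        # match the expert's actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    env.step(a_student.detach())                  # roll out the *student* so it visits its own state distribution
    return loss.item()
```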

Experimental Setup

  • Robot: Unitree-style G1 humanoid, 29 actuated DoF, PD control at 50 Hz
  • Training: Isaac Lab with contact-rich rigid-body dynamics
  • Evaluation: Zero-shot transfer to Genesis (unseen physics engine)
  • Motion Data: AMASS motion corpus; clips retargeted and converted to SSJR
  • Metrics: Success rate (%), Eg-mpjpe (mm), Empjpe (mm)
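For reference, the snippet below computes the two tracking metrics under the common reading that Eg-mpjpe is the global mean per-joint position error and Empjpe its root-relative counterpart; these exact definitions are our assumption, not quoted from the paper.

```python
# Sketch of the tracking metrics; inputs are (T, J, 3) joint positions and (T, 3) root positions in meters.
import numpy as np

def mpjpe_mm(pred, gt):
    return 1000.0 * np.linalg.norm(pred - gt, axis=-1).mean()      # mean per-joint error in mm

def tracking_metrics(pred_joints, gt_joints, pred_root, gt_root):
    eg_mpjpe = mpjpe_mm(pred_joints, gt_joints)                                          # global-frame error
    e_mpjpe = mpjpe_mm(pred_joints - pred_root[:, None], gt_joints - gt_root[:, None])   # root-relative error
    return eg_mpjpe, e_mpjpe
```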

Domain Randomization Strategy

We use episode-level domain randomization: a dynamics parameter vector ψ(e) is sampled once at the start of each episode and held fixed, so the policy infers latent dynamics from interaction history rather than uncorrelated noise.

  • Inertial properties: link masses, center-of-mass locations, and joint damping coefficients are randomized within predefined ranges.
  • Contact dynamics: ground friction coefficients and contact stiffness are varied to model different terrains and surfaces.
  • Actuation delay: joint command signals are probabilistically delayed by 1–2 control steps to emulate communication and actuator latencies.
  • External perturbations: random external forces are applied to the torso at random time intervals to simulate contacts, collisions, and environmental disturbances.
  • Observation noise: slight sensor noise is added to proprioceptive signals to account for imperfect measurements in deployment.

Together, these variations encourage HoRD to learn control policies that transfer across simulators and physical conditions.
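A minimal sketch of episode-level sampling is given below; the parameter ranges are placeholders for illustration, not the paper's randomization bounds.

```python
# Illustrative episode-level domain-randomization sampler (assumed ranges).
import numpy as np

def sample_domain(rng: np.random.Generator) -> dict:
    """Sample one dynamics vector psi(e) at episode start and hold it fixed for the episode."""
    return {
        "mass_scale":        rng.uniform(0.8, 1.2),         # per-link mass multiplier
        "com_offset_m":      rng.uniform(-0.02, 0.02, 3),   # center-of-mass shift
        "joint_damping":     rng.uniform(0.5, 1.5),         # joint damping scale
        "friction":          rng.uniform(0.3, 1.2),         # ground friction coefficient
        "contact_stiffness": rng.uniform(0.8, 1.2),         # contact stiffness scale
        "action_delay":      rng.integers(0, 3),            # command delay in control steps (0-2)
        "push_interval_s":   rng.uniform(4.0, 10.0),        # time between random torso pushes
        "obs_noise_std":     rng.uniform(0.0, 0.02),        # proprioceptive noise level
    }

rng = np.random.default_rng(0)
psi_e = sample_domain(rng)   # fixed for the whole episode; re-sampled at the next reset
```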

Experimental Results & Validation

Overall Performance Comparison

ID = in-distribution (IsaacLab), OOD = out-of-distribution (Genesis). “+ DR” columns add extra evaluation-time dynamics randomization on top of the already randomized training policy. Genesis results are evaluated zero-shot, without collecting test-domain data for retraining.

Each setting reports Succ. (%)↑, Eg-mpjpe (mm)↓, and Empjpe (mm)↓.

| Method | IsaacLab (ID) Succ. | Eg-mpjpe | Empjpe | IsaacLab + DR (ID) Succ. | Eg-mpjpe | Empjpe | Genesis (OOD) Succ. | Eg-mpjpe | Empjpe | Genesis + DR (OOD) Succ. | Eg-mpjpe | Empjpe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MaskedMimic | 32.1 | 376 | 182 | <10 | 713 | 283 | <10 | 742 | 322 | <10 | 738 | 326 |
| OmniH2O | 85.2 | 266 | 132 | 83.2 | 282 | 168 | 72.3 | 312 | 165 | 70.2 | 335 | 191 |
| ExBody2 | 86.6 | 247 | 108 | 85.9 | 237 | 128 | 73.1 | 305 | 157 | 69.4 | 342 | 202 |
| HOVER | 71.2 | 278 | 138 | 67.9 | 375 | 196 | 16.2 | 722 | 258 | 15.5 | 746 | 282 |
| HoRD (Ours) | 90.7 | 102 | 76 | 88.4 | 124 | 87 | 86.0 | 162 | 96 | 84.4 | 171 | 108 |

MaskedMimic
Train → OOD

MaskedMimic failure under dynamics shift

Challenging Transfer: In this setting MaskedMimic struggles to maintain stability under a substantial dynamics change, highlighting the difficulty of using animation-focused controllers for torque-level sim-to-sim transfer.

ExBody2
Train → OOD

ExBody2 sim-to-sim transfer

Moderate Transfer: ExBody2 maintains locomotion under domain shift, but with noticeably increased tracking errors and reduced stability compared to its training environment.

HoRD
Train → OOD

HoRD sim-to-sim transfer to Genesis

Robust Transfer: HoRD maintains high success rates and low tracking error under sim-to-sim transfer to Genesis, with HCDR enabling online dynamics inference and adaptation without target-domain fine-tuning.

Why Does HCDR Make the Difference?

HCDR encodes recent state–action history into a temporal memory embedding that serves as a dynamics fingerprint. Without this history-conditioned representation, policies trained under domain randomization must rely on a single fixed strategy that cannot infer which latent dynamics (e.g., friction, delay, mass distribution) are currently in effect. In contrast, HCDR enables online adaptation: the policy conditions on how the environment has responded to previous actions and adjusts gait, balance, and torque targets accordingly. This mechanism is key to HoRD's strong zero-shot sim-to-sim transfer to unseen physics engines (e.g., Genesis), achieved without any target-domain data or fine-tuning.
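As a small illustration, a policy can be conditioned on the memory embedding simply by concatenating m_t with its other inputs, so the same observation and command yield different actions under different inferred dynamics; the architecture below is an assumption for illustration, not the released implementation.

```python
# Tiny sketch of conditioning a policy on the HCDR memory m_t (assumed layer sizes).
import torch
import torch.nn as nn

class ConditionedPolicy(nn.Module):
    def __init__(self, obs_dim, cmd_dim, mem_dim, act_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + cmd_dim + mem_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, sparse_obs, ssjr_cmd, m_t):
        # The same observation and command produce different outputs under different m_t,
        # which is what lets the controller adapt to the inferred dynamics.
        return self.net(torch.cat([sparse_obs, ssjr_cmd, m_t], dim=-1))
```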

Ablation Studies

Component Analysis

We analyze the contribution of domain randomization (D) and the HCDR module (H) under both in-domain and zero-shot conditions. Without SSJR, there is no interface converting motion commands into torque-level control, and success remains below 10% in all settings. The results show that D and H play complementary roles in robust PD control under cross-domain and perturbed settings.

Ablation Results

| Method | IsaacLab Succ. (%) | IsaacLab + DR Succ. (%) | Genesis Succ. (%) | Genesis + DR Succ. (%) |
|---|---|---|---|---|
| HoRD | 90.7 | 88.4 | 86.0 | 84.4 |
| HoRD w/o D | 79.8 | <10 | <10 | <10 |
| HoRD w/o H | 91.3 | 70.5 | <10 | <10 |

Key findings: (1) Removing domain randomization (w/o D) causes severe collapse under test-time perturbations or a simulator change. (2) Removing HCDR (w/o H) maintains competitive performance on the training distribution but degrades sharply under distribution shift (e.g., near-zero success in Genesis), highlighting HCDR's role as a learned mechanism for online adaptation to latent simulator and contact variations.

Ablation Training Curves Comparison

Training curves comparing HoRD with ablation variants across different metrics. The curves demonstrate the contribution of domain randomization (D) and HCDR (H) to overall performance and training stability.

Ablation Training Curves - Method Comparison
Ablation Training Curves - Metrics Comparison
Detailed metrics comparison showing the impact of domain randomization and HCDR on training dynamics and final performance.

Comprehensive Evaluation Metrics on Test Set

Comprehensive evaluation metrics comparison across different methods on the test set. All policies are evaluated in IsaacLab with domain randomization.

| Metric | HoRD | HoRD w/o D | HoRD w/o H |
|---|---|---|---|
| Cartesian Error ↓ | 0.087 | 0.286 | 0.182 |
| Global Rotation Error ↓ | 0.369 | 1.594 | 1.011 |
| Global Translation Error ↓ | 0.124 | 0.832 | 0.341 |
| DOF Velocity Reward ↑ | 0.972 | 0.980 | 0.960 |
| Key Body Reward ↑ | 0.907 | 0.412 | 0.612 |
| Local Rotation Reward ↑ | 0.833 | 0.274 | 0.507 |
| Root Angular Velocity Reward ↑ | 0.480 | 0.722 | 0.412 |
| Root Velocity Reward ↑ | 0.954 | 0.878 | 0.904 |

Comparison of Deployment Capabilities

We compare HoRD with representative humanoid control frameworks along three deployment capabilities: Unified Skill Coverage (single policy over diverse skills), Sparse-Command Generalization (robustness when only sparse keypoint-level commands are available), and Explicit Online Dynamics Adaptation (dedicated mechanism for adapting to dynamics variations at test time). HoRD supports all three; prior approaches typically support only a subset.

| Method | Unified Skill Coverage | Sparse-Command Generalization | Explicit Online Dynamics Adaptation |
|---|---|---|---|
| HoRD (Ours) | ✓ | ✓ | ✓ |
| BumbleBee | × | × | × |
| HOVER | ✓ | × | × |
| MaskedMimic | ✓ | ✓ | × |
| ASAP | × | × | × |
| PHC | × | × | × |
| PULSE | × | × | × |
| ExBody2 | × | × | × |
| OmniH2O | × | ✓ | × |

Notation: ✓ supported; × not supported (per paper Appendix A.3, Table 5).

Qualitative Demonstrations

Disturbance Recovery

HoRD disturbance recovery under random pushes

Random external push: HoRD quickly re-stabilizes and resumes the intended trajectory after a lateral push (zero-shot Genesis). Recovery success 85.2% (HoRD) vs. OmniH2O 71.8%, ExBody2 72.1%.

Online Distillation

Teacher–student online distillation demo

Teacher–student distillation: The student policy is trained by online distillation to match the expert's actions using only sparse SSJR commands and HCDR history, enabling deployable control from sparse inputs.

Representative Motion Results

HoRD performance across six representative motions (punch, get up, high kick, common walk, side walk, martial arts) under zero-shot transfer to the unseen Genesis environment. All videos show robust execution without target-domain fine-tuning. Red markers indicate ground-truth skeleton joints.

Punch

HoRD punch motion in Genesis

Get Up

HoRD get-up motion in Genesis

High Kick

HoRD high-kick motion in Genesis

Common Walk

HoRD common walk motion in Genesis

Side Walk

HoRD side-walk motion in Genesis

Martial

HoRD martial-arts motion in Genesis

Terrain Robustness

HoRD Terrain Robustness

HoRD is evaluated on flat ground, smooth slope, and rough slope in zero-shot Genesis. Success rates: flat 86.2%, smooth slope 85.8%, rough slope 84.2%, demonstrating graceful degradation with terrain difficulty.

BibTeX

@article{wang2026hord,
  title={HoRD: Robust Humanoid Control via History-Conditioned Reinforcement Learning and Online Distillation},
  author={Wang, Puyue and Hu, Jiawei and Gao, Yan and Wang, Junyan and Zhang, Yu and Dobbie, Gillian and Gu, Tao and Johal, Wafa and Dang, Ting and Jia, Hong},
  year={2026},
  journal={Preprint},
  url={https://github.com/tonywang-0517/hord}
}