HoRD: Robust Humanoid Control via History-Conditioned Reinforcement Learning and Online Distillation

1University of Auckland   2University of New South Wales   3University of Cambridge   4University of Adelaide   5Macquarie University   6University of Melbourne

Abstract

Humanoid robots can suffer significant performance drops under small changes in dynamics, task specifications, or environment setup. We propose HoRD, a two-stage learning framework for robust humanoid control under domain shift. First, we train a high-performance teacher policy via history-conditioned reinforcement learning, where the policy infers latent dynamics context from recent state–action trajectories to adapt online to diverse randomized dynamics. Second, we perform online distillation to transfer the teacher’s robust control capabilities into a transformer-based student policy that operates on sparse root-relative 3D joint keypoint trajectories. By combining history-conditioned adaptation with online distillation, HoRD enables a single policy to adapt zero-shot to unseen domains without per-domain retraining. Extensive experiments indicate that HoRD attains stronger robustness and transfer performance than several state-of-the-art baselines, particularly under unseen domains and external perturbations.

HoRD Framework

HoRD Framework Overview

Framework overview. Two-stage teacher–student learning pipeline for robust humanoid control under partial observability. Stage I: an expert policy π⋆ is trained with PPO in simulation using privileged full-state observations s_t^full, dense future motion intent Y_t^full, and episode-level domain randomization ψ(e). A shared HCDR module encodes the interaction history H_t into a temporal memory embedding m_t for online dynamics inference and adaptive modulation. Stage II: a deployable student policy π receives only sparse proprioception s_t^sparse, environment context g_t, and standardized motion commands Y_t^sparse via SSJR, and is trained by distillation to match the expert’s actions. SSJR maps a global planner command into a platform-agnostic sparse-joint command interface, enabling cross-platform transfer, while HCDR provides in-context adaptation to latent dynamics during deployment.

Two Core Components

SSJR

Standardized Sparse-Joint Representation

SSJR encodes short-horizon future motion as root-relative trajectories of a small set of key joints on a canonical human skeleton (e.g., SMPL-X), obtained by projecting large-scale motion capture datasets such as AMASS onto this sparse joint set. This standardized interface removes platform-specific details while preserving task-relevant intent, so the student policy can map SSJR commands, together with proprioception and context, to low-level torque modulation. Combined with HCDR and domain randomization, SSJR improves data scalability, cross-platform transfer, and anticipatory control.
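As a rough illustration of this interface, the sketch below converts a short horizon of world-frame key-joint positions into a root-relative command vector. The joint set, array shapes, and frame conventions here are assumptions made for illustration, not the paper's exact SSJR definition.

```python
# Minimal sketch of an SSJR-style command encoder (illustrative only; joint names,
# shapes, and the yaw-aligned root frame are assumptions, not the paper's API).
import numpy as np

KEY_JOINTS = ["head", "left_hand", "right_hand", "left_foot", "right_foot"]  # assumed sparse set

def to_root_relative(keypoints_world, root_pos, root_yaw):
    """Express future keypoint positions in a yaw-aligned, root-centered frame.

    keypoints_world: (T, K, 3) future positions of the K key joints in the world frame
    root_pos:        (3,) current root position
    root_yaw:        scalar, current root heading (rad)
    """
    c, s = np.cos(-root_yaw), np.sin(-root_yaw)
    yaw_rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])  # R_z(-yaw)
    centered = keypoints_world - root_pos          # remove global translation
    return centered @ yaw_rot.T                    # remove global heading

def build_ssjr_command(future_keypoints_world, root_pos, root_yaw):
    """Stack a short horizon of root-relative key-joint positions into one command vector."""
    rel = to_root_relative(future_keypoints_world, root_pos, root_yaw)
    return rel.reshape(-1)                         # (T * K * 3,) platform-agnostic command
```

Because the command is expressed relative to the robot's own root and a canonical joint set, the same retargeted AMASS clip can drive different platforms without embedding-specific details.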

HCDR

History-Conditioned Dynamics Representation

HCDR performs online dynamics inference from historical state–action interactions using a lightweight Query-Transformer (Q-Former). Learnable latent tokens attend over recent history to produce a temporal memory embedding that serves as a compact dynamics fingerprint (e.g., friction, mass distribution, actuator delays). Conditioning control on this embedding lets the policy adapt its gait, balance, and torque outputs to how the environment has responded so far, enabling online system identification without direct access to latent physical parameters.
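To make the mechanism concrete, here is a minimal Q-Former-style history encoder in PyTorch: learnable latent tokens cross-attend over the embedded state–action history and are fused into a single memory vector. Layer sizes, token count, and interfaces are illustrative assumptions, not the released HCDR architecture.

```python
# Minimal sketch of a Q-Former-style history encoder (assumed sizes, not the paper's design).
import torch
import torch.nn as nn

class HistoryEncoder(nn.Module):
    def __init__(self, obs_act_dim, embed_dim=128, num_latents=4, num_heads=4):
        super().__init__()
        self.embed = nn.Linear(obs_act_dim, embed_dim)              # per-step (state, action) embedding
        self.latents = nn.Parameter(torch.randn(num_latents, embed_dim) * 0.02)  # learnable query tokens
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.out = nn.Linear(num_latents * embed_dim, embed_dim)    # fuse latents into one memory vector

    def forward(self, history):
        # history: (B, H, obs_act_dim) recent state-action pairs
        tokens = self.embed(history)                                             # (B, H, D)
        queries = self.latents.unsqueeze(0).expand(history.size(0), -1, -1)      # (B, L, D)
        attended, _ = self.cross_attn(queries, tokens, tokens)                   # latents attend over history
        m_t = self.out(attended.flatten(1))                                      # (B, D) temporal memory embedding
        return m_t

# Example: encode a 32-step history of 64-dim (state, action) pairs into m_t.
enc = HistoryEncoder(obs_act_dim=64)
m_t = enc(torch.randn(8, 32, 64))   # -> (8, 128)
```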

Training Methodology

Two-Stage Training Architecture

HoRD employs a teacher–student framework. Stage I: an expert policy π⋆ is trained with PPO in simulation using privileged full-state observations s_t^full, dense future motion intent Y_t^full, and episode-level domain randomization ψ(e); the shared HCDR module encodes interaction history H_t into a temporal memory embedding m_t. Stage II: a student policy π is distilled from the expert via supervised imitation, operating from sparse observations s_t^sparse and SSJR-formatted sparse motion commands Y_t^sparse, with the same m_t from HCDR.

Stage I: Expert Training

The expert receives the full state s_t^full, dense future motion intent Y_t^full from AMASS, environment context g_t, and the temporal memory m_t from HCDR. It is trained with PPO to maximize expected discounted return across domain-randomized episodes, with rewards penalizing tracking error, balance loss, excessive control effort, and falls.
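A hedged sketch of such a reward composition is shown below; the individual terms and weights are placeholders chosen for illustration, not the paper's actual reward function.

```python
# Illustrative expert reward: tracking, balance, effort, and fall terms with assumed weights.
import torch

def expert_reward(ref_keypoints, robot_keypoints, root_height, joint_torques, fell):
    # ref_keypoints / robot_keypoints: (K, 3) tensors; root_height: scalar tensor;
    # joint_torques: (J,) tensor; fell: bool termination flag.
    track_err = (ref_keypoints - robot_keypoints).norm(dim=-1).mean()   # mean keypoint tracking error
    r_track   = torch.exp(-5.0 * track_err)                             # exponential tracking reward
    r_balance = torch.exp(-10.0 * (root_height - 0.93).abs())           # stay near a nominal root height
    r_effort  = -1e-4 * joint_torques.pow(2).sum()                      # penalize excessive control effort
    r_fall    = -10.0 if fell else 0.0                                  # terminal fall penalty
    return r_track + 0.5 * r_balance + r_effort + r_fall
```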

Stage II: Student Distillation

The student receives only sparse observations s_t^sparse and sparse future motion cues Y_t^sparse via SSJR, plus the same m_t. It is trained via supervised distillation (e.g., MSE on actions) to match the expert's actions. By sharing the temporal memory and SSJR interface across both stages, the student recovers the expert's anticipatory behavior and remains robust across physical and environmental conditions.
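The sketch below shows one way such an online distillation step could look (a DAgger-style reading, where the student's own actions drive the rollout while the expert provides action labels); the environment and policy interfaces (`env`, `expert`, `student`, `hcdr`) are placeholders, not the released code.

```python
# Minimal sketch of one online distillation step, assuming PyTorch policies and a vectorized sim.
import torch
import torch.nn.functional as F

def distillation_step(env, expert, student, hcdr, optimizer):
    obs = env.get_observations()                  # dict with full/sparse obs, history, SSJR commands (placeholder API)
    m_t = hcdr(obs["history"])                    # shared temporal memory embedding

    with torch.no_grad():                         # expert labels computed from privileged inputs
        a_expert = expert(obs["full_state"], obs["dense_intent"], obs["context"], m_t)

    a_student = student(obs["sparse_state"], obs["ssjr_command"], obs["context"], m_t)

    loss = F.mse_loss(a_student, a_expert)        # match the expert's actions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    env.step(a_student.detach())                  # roll out the *student* so it visits its own state distribution
    return loss.item()
```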

Experimental Setup

  • Robot: Unitree-style G1 humanoid, 29 actuated DoF, PD control at 50 Hz
  • Training: Isaac Lab with contact-rich rigid-body dynamics
  • Evaluation: Zero-shot transfer to Genesis (unseen physics engine)
  • Motion Data: AMASS motion corpus; clips retargeted and converted to SSJR
  • Metrics: Success rate (%), Eg-mpjpe (mm), Empjpe (mm)
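For reference, the snippet below computes the two tracking metrics under the common reading that Eg-mpjpe is the global mean per-joint position error and Empjpe its root-relative counterpart; these exact definitions are our assumption, not quoted from the paper.

```python
# Sketch of the tracking metrics; inputs are (T, J, 3) joint positions and (T, 3) root positions in meters.
import numpy as np

def mpjpe_mm(pred, gt):
    return 1000.0 * np.linalg.norm(pred - gt, axis=-1).mean()      # mean per-joint error in mm

def tracking_metrics(pred_joints, gt_joints, pred_root, gt_root):
    eg_mpjpe = mpjpe_mm(pred_joints, gt_joints)                                          # global-frame error
    e_mpjpe = mpjpe_mm(pred_joints - pred_root[:, None], gt_joints - gt_root[:, None])   # root-relative error
    return eg_mpjpe, e_mpjpe
```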

Domain Randomization Strategy

We use episode-level domain randomization: a dynamics parameter vector ψ(e) is sampled once at the start of each episode and held fixed, so the policy infers latent dynamics from interaction history rather than uncorrelated noise.

  • Inertial properties: link masses, center-of-mass locations, and joint damping coefficients are randomized within predefined ranges.
  • Contact dynamics: ground friction coefficients and contact stiffness are varied to model different terrains and surfaces.
  • Actuation delay: joint command signals are probabilistically delayed by 1–2 control steps to emulate communication and actuator latencies.
  • External perturbations: random external forces are applied to the torso at random time intervals to simulate contacts, collisions, and environmental disturbances.
  • Observation noise: slight sensor noise is added to proprioceptive signals to account for imperfect measurements in deployment.

Together, these variations encourage HoRD to learn control policies that transfer across simulators and physical conditions.
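A minimal sketch of episode-level sampling is given below; the parameter ranges are placeholders for illustration, not the paper's randomization bounds.

```python
# Illustrative episode-level domain-randomization sampler (assumed ranges).
import numpy as np

def sample_domain(rng: np.random.Generator) -> dict:
    """Sample one dynamics vector psi(e) at episode start and hold it fixed for the episode."""
    return {
        "mass_scale":        rng.uniform(0.8, 1.2),         # per-link mass multiplier
        "com_offset_m":      rng.uniform(-0.02, 0.02, 3),   # center-of-mass shift
        "joint_damping":     rng.uniform(0.5, 1.5),         # joint damping scale
        "friction":          rng.uniform(0.3, 1.2),         # ground friction coefficient
        "contact_stiffness": rng.uniform(0.8, 1.2),         # contact stiffness scale
        "action_delay":      rng.integers(0, 3),            # command delay in control steps (0-2)
        "push_interval_s":   rng.uniform(4.0, 10.0),        # time between random torso pushes
        "obs_noise_std":     rng.uniform(0.0, 0.02),        # proprioceptive noise level
    }

rng = np.random.default_rng(0)
psi_e = sample_domain(rng)   # fixed for the whole episode; re-sampled at the next reset
```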

Experimental Results & Validation

Overall Performance Comparison

ID = in-distribution (IsaacLab), OOD = out-of-distribution (Genesis). “+ DR” columns add extra evaluation-time dynamics randomization on top of the already randomized training policy. Genesis results are evaluated zero-shot, without collecting test-domain data for retraining.

Each setting reports Succ. (%)↑, Eg-mpjpe (mm)↓, and Empjpe (mm)↓.

| Method | IsaacLab (ID) Succ. | Eg-mpjpe | Empjpe | IsaacLab + DR (ID) Succ. | Eg-mpjpe | Empjpe | Genesis (OOD) Succ. | Eg-mpjpe | Empjpe | Genesis + DR (OOD) Succ. | Eg-mpjpe | Empjpe |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MaskedMimic | 32.1 | 376 | 182 | <10 | 713 | 283 | <10 | 742 | 322 | <10 | 738 | 326 |
| OmniH2O | 85.2 | 266 | 132 | 83.2 | 282 | 168 | 72.3 | 312 | 165 | 70.2 | 335 | 191 |
| ExBody2 | 86.6 | 247 | 108 | 85.9 | 237 | 128 | 73.1 | 305 | 157 | 69.4 | 342 | 202 |
| HOVER | 71.2 | 278 | 138 | 67.9 | 375 | 196 | 16.2 | 722 | 258 | 15.5 | 746 | 282 |
| HoRD (Ours) | 90.7 | 102 | 76 | 88.4 | 124 | 87 | 86.0 | 162 | 96 | 84.4 | 171 | 108 |

MaskedMimic
Train → OOD

MaskedMimic failure under dynamics shift

Challenging Transfer: In this setting MaskedMimic struggles to maintain stability under a substantial dynamics change, highlighting the difficulty of using animation-focused controllers for torque-level sim-to-sim transfer.

ExBody2
Train → OOD

ExBody2 sim-to-sim transfer

Moderate Transfer: ExBody2 maintains locomotion under domain shift, but with noticeably increased tracking errors and reduced stability compared to its training environment.

HoRD
Train → OOD

HoRD sim-to-sim transfer to Genesis

Robust Transfer: HoRD maintains high success rates and low tracking error under sim-to-sim transfer to Genesis, with HCDR enabling online dynamics inference and adaptation without target-domain fine-tuning.

Why Does HCDR Make the Difference?

HCDR encodes recent state–action history into a temporal memory embedding that serves as a dynamics fingerprint. Without this history-conditioned representation, policies trained under domain randomization must rely on a single fixed strategy that cannot infer which latent dynamics (e.g., friction, delay, mass distribution) are currently in effect. In contrast, HCDR enables online adaptation: the policy conditions on how the environment has responded to previous actions and adjusts gait, balance, and torque targets accordingly. This mechanism is key to HoRD's strong zero-shot sim-to-sim transfer to unseen physics engines (e.g., Genesis), achieved without any target-domain data or fine-tuning.
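As a small illustration, a policy can be conditioned on the memory embedding simply by concatenating m_t with its other inputs, so the same observation and command yield different actions under different inferred dynamics; the architecture below is an assumption for illustration, not the released implementation.

```python
# Tiny sketch of conditioning a policy on the HCDR memory m_t (assumed layer sizes).
import torch
import torch.nn as nn

class ConditionedPolicy(nn.Module):
    def __init__(self, obs_dim, cmd_dim, mem_dim, act_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + cmd_dim + mem_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, sparse_obs, ssjr_cmd, m_t):
        # The same observation and command produce different outputs under different m_t,
        # which is what lets the controller adapt to the inferred dynamics.
        return self.net(torch.cat([sparse_obs, ssjr_cmd, m_t], dim=-1))
```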

Ablation Studies

Component Analysis

We analyze the contribution of domain randomization (D) and the HCDR module (H) under both in-domain and zero-shot conditions. Without SSJR, there is no interface converting motion commands into torque-level control, and success remains below 10% in all settings. The results show that D and H play complementary roles in robust PD control under cross-domain and perturbed settings.

Ablation Results

| Method | IsaacLab Succ. (%) | IsaacLab + DR Succ. (%) | Genesis Succ. (%) | Genesis + DR Succ. (%) |
|---|---|---|---|---|
| HoRD | 90.7 | 88.4 | 86.0 | 84.4 |
| HoRD w/o D | 79.8 | <10 | <10 | <10 |
| HoRD w/o H | 91.3 | 70.5 | <10 | <10 |

Key findings: (1) Removing domain randomization (w/o D) causes severe collapse under test-time perturbations or a simulator change. (2) Removing HCDR (w/o H) maintains competitive performance on the training distribution but degrades sharply under distribution shift (e.g., near-zero success in Genesis), highlighting HCDR's role as a learned mechanism for online adaptation to latent simulator and contact variations.

Ablation Training Curves Comparison

Training curves comparing HoRD with ablation variants across different metrics. The curves demonstrate the contribution of domain randomization (D) and HCDR (H) to overall performance and training stability.

Ablation Training Curves - Method Comparison
Ablation Training Curves - Metrics Comparison
Detailed metrics comparison showing the impact of domain randomization and HCDR on training dynamics and final performance.

Comprehensive Evaluation Metrics on Test Set

Comprehensive evaluation metrics comparison across different methods on the test set. All policies are evaluated in IsaacLab with domain randomization.

| Metric | HoRD | HoRD w/o D | HoRD w/o H |
|---|---|---|---|
| Cartesian Error ↓ | 0.087 | 0.286 | 0.182 |
| Global Rotation Error ↓ | 0.369 | 1.594 | 1.011 |
| Global Translation Error ↓ | 0.124 | 0.832 | 0.341 |
| DOF Velocity Reward ↑ | 0.972 | 0.980 | 0.960 |
| Key Body Reward ↑ | 0.907 | 0.412 | 0.612 |
| Local Rotation Reward ↑ | 0.833 | 0.274 | 0.507 |
| Root Angular Velocity Reward ↑ | 0.480 | 0.722 | 0.412 |
| Root Velocity Reward ↑ | 0.954 | 0.878 | 0.904 |

Comparison of Deployment Capabilities

We compare HoRD with representative humanoid control frameworks along three deployment capabilities: Unified Skill Coverage (single policy over diverse skills), Sparse-Command Generalization (robustness when only sparse keypoint-level commands are available), and Explicit Online Dynamics Adaptation (dedicated mechanism for adapting to dynamics variations at test time). HoRD supports all three; prior approaches typically support only a subset.

| Method | Unified Skill Coverage | Sparse-Command Generalization | Explicit Online Dynamics Adaptation |
|---|---|---|---|
| HoRD (Ours) | ✓ | ✓ | ✓ |
| BumbleBee | × | × | × |
| HOVER | ✓ | × | × |
| MaskedMimic | ✓ | ✓ | × |
| ASAP | × | × | × |
| PHC | × | × | × |
| PULSE | × | × | × |
| ExBody2 | × | × | × |
| OmniH2O | × | ✓ | × |

Notation: ✓ supported; × not supported (per paper Appendix A.3, Table 5).

Qualitative Demonstrations

Disturbance Recovery

HoRD disturbance recovery under random pushes

Random external push: HoRD quickly re-stabilizes and resumes the intended trajectory after a lateral push (zero-shot Genesis). Recovery success 85.2% (HoRD) vs. OmniH2O 71.8%, ExBody2 72.1%.

Online Distillation

Teacher–student online distillation demo

Teacher–student distillation: The student policy is trained by online distillation to match the expert's actions using only sparse SSJR commands and HCDR history, enabling deployable control from sparse inputs.

Representative Motion Results

HoRD performance across six representative motions (punch, get up, high kick, common walk, side walk, martial arts) under zero-shot transfer to the unseen Genesis environment. All videos show robust execution without target-domain fine-tuning. Red markers indicate ground-truth skeleton joints.

Punch

HoRD punch motion in Genesis

Get Up

HoRD get-up motion in Genesis

High Kick

HoRD high-kick motion in Genesis

Common Walk

HoRD common walk motion in Genesis

Side Walk

HoRD side-walk motion in Genesis

Martial

HoRD martial-arts motion in Genesis

Terrain Robustness

HoRD Terrain Robustness

HoRD is evaluated on flat ground, smooth slope, and rough slope in zero-shot Genesis. Success rates: flat 86.2%, smooth slope 85.8%, rough slope 84.2%, demonstrating graceful degradation with terrain difficulty.

BibTeX

@article{wang2026hord,
  title={HoRD: Robust Humanoid Control via History-Conditioned Reinforcement Learning and Online Distillation},
  author={Wang, Puyue and Hu, Jiawei and Gao, Yan and Wang, Junyan and Zhang, Yu and Dobbie, Gillian and Gu, Tao and Johal, Wafa and Dang, Ting and Jia, Hong},
  year={2026},
  journal={Preprint},
  url={https://github.com/tonywang-0517/hord}
}