
The Road to Generalist Robots: A Taxonomy of Deep Reinforcement Learning and the Sim-to-Real Gap

[Figure: A digital robotic dog and arm on a simulation grid transitioning to their real-world counterparts outdoors, surrounded by trees and barrels, under soft sunlight.]

1. Introduction: The Convergence of Control and Learning

The history of robotics has long been defined by a fundamental tension between precision and adaptability. Classical control theory, the discipline that gave us industrial automation and precise flight control, relies on explicit mathematical models. By defining the kinematics of a robot arm or the aerodynamics of a plane using differential equations, engineers can derive control laws that guarantee stability and performance—provided the real world matches the model. However, the real world is rarely so cooperative. It is messy, chaotic, and filled with unmodeled dynamics, from the slippage of a foot on loose gravel to the unpredictable friction of a deformable object. In these unstructured environments, the rigidity of classical control often becomes a liability, leading to the "reality gap" where robots fail because the world does not behave as their equations predict.

Deep Reinforcement Learning (DRL) offers a radical alternative. Instead of pre-programming the physics of the world, DRL enables a robot to learn them through interaction. Inspired by behavioral psychology, an RL agent perceives the state of the environment, takes an action, and receives a scalar reward signal indicating the success or failure of that action.1 When combined with deep neural networks—layers of artificial neurons capable of approximating complex, non-linear functions—this framework allows robots to learn policies that map raw sensory inputs, such as camera pixels or joint torques, directly to motor commands.2

For years, the success of DRL was confined to the digital realm. Agents like AlphaGo and OpenAI Five demonstrated superhuman capabilities in board games and video games, environments where data is cheap, and physics are perfectly deterministic. Transferring this success to physical robots, however, proved immensely difficult. The complexity of physical interaction, the cost of real-world data collection, and the safety risks of trial-and-error learning on hardware created a barrier known as the "Sim-to-Real" gap.2

This report, based on the seminal 2025 survey Deep Reinforcement Learning for Robotics: A Survey of Real-World Successes by Tang et al., provides an exhaustive analysis of how the field has begun to bridge this gap. We explore the sophisticated methodologies—such as Domain Randomization, Asymmetric Actor-Critic architectures, and Teacher-Student learning—that have allowed DRL agents to escape the simulation and master the physical world. From quadrupedal robots traversing Swiss mountains to drones outracing human champions, we examine the case studies that mark the maturation of robotic learning.2

2. A Taxonomy of Robotic Reinforcement Learning

To understand the landscape of DRL in robotics, it is necessary to move beyond broad generalizations and adopt a structured taxonomy. The recent survey by Tang et al. (2025) proposes a framework that categorizes robotic learning achievements along four critical axes. This taxonomy allows researchers to evaluate not just if a method works, but how it works and how mature it is for deployment.5

2.1 Axis 1: Robot Competencies

The first axis defines what the robot is learning to do. Competencies are broadly divided into mobility and manipulation, though they increasingly overlap.

  • Locomotion: The ability to move the robot's base through the world. This includes quadrupedal walking, bipedal running, and aerial flight. DRL has been particularly successful here because locomotion is a self-contained physics problem where the reward (forward velocity) is clear.1

  • Navigation: The high-level task of planning a path through an environment while avoiding obstacles. Unlike locomotion, which controls motors, navigation often outputs velocity commands or waypoints.

  • Manipulation: The ability to interact with objects. This ranges from "pick and place" (rigid objects) to dexterous in-hand manipulation (reorienting a pen in the fingers) and manipulating deformable objects (folding cloth). This is widely considered the hardest competency due to the complexity of contact physics.7

  • Multi-Robot and Human-Robot Interaction (HRI): Emerging competencies where the robot must coordinate with other agents or safely interact with humans. These are less mature due to the unpredictability of the "other" agents.1

2.2 Axis 2: Problem Formulation

How is the learning problem structured?

  • Standard Reinforcement Learning: The agent learns purely from trial and error to maximize rewards.

  • Imitation Learning: The agent learns by mimicking a dataset of expert demonstrations (either from humans or other controllers). This is often used to "jumpstart" the learning process, preventing the robot from flailing aimlessly in the beginning.9

  • Hierarchical RL: The problem is broken into layers. A "high-level" policy might decide to "walk to the door," while a "low-level" policy controls the specific leg movements to execute that command.5

2.3 Axis 3: Solution Approach

This axis describes the specific algorithmic machinery used to solve the problem.

  • Model-Free vs. Model-Based: Model-free methods (like PPO or SAC) learn a policy directly, without learning an internal physics engine. Model-based methods try to predict "what happens next" and plan accordingly. The vast majority of real-world successes currently rely on model-free methods trained in simulation, as learning a model of complex contact physics is notoriously difficult.5

  • Sim-to-Real Transfer: The specific techniques used to port the policy from the simulator to the robot, such as domain randomization or adaptation modules.

2.4 Axis 4: Level of Real-World Success (Maturity)

Perhaps the most insightful contribution of the Tang et al. survey is the definition of "Readiness Levels" for DRL in robotics, similar to Technology Readiness Levels (TRL) in engineering.7

  • Level 0 (validated only in simulation): Common in academic papers; no physical proof.

  • Level 1 (validated under limited lab conditions): Works once, on one specific robot, with no lighting changes.

  • Level 2 (validated under diverse lab conditions): Robust to lighting changes, slight nudges, and minor variations in the lab.

  • Level 3 (validated under confined real-world conditions): The robot can operate outdoors or in a test facility, but with safety nets.

  • Level 4 (validated under diverse real-world conditions): The robot works in the "wild" (forests, public streets) with high reliability.

  • Level 5 (deployed on commercialized products): The technology is in a consumer product (e.g., a Roomba or a logistics robot).

Currently, quadrupedal locomotion has reached Level 4/5, while dexterous manipulation hovers between Level 1 and 2.1

3. The Sim-to-Real Gap: The Central Challenge

The primary bottleneck in robotic DRL is the "Sim-to-Real" gap. Simulators are preferred for training because they are fast, safe, and scalable. A researcher can simulate ten thousand robots in parallel, accumulating centuries of experience in a few days. However, simulators are approximations. They use simplified contact models (like the Coulomb friction model) that fail to capture the softness of rubber tires, the backlash in gearboxes, or the micro-slippage of fingertips.9

When a policy overfits to the simulator, it learns to exploit these inaccuracies—"gaming" the physics engine to achieve high scores in ways that are physically impossible. When transferred to a real robot, this policy fails catastrophically. The following sections detail the three dominant methodologies developed to overcome this gap: Domain Randomization, Privileged Learning, and History-Based Adaptation.

4. Methodology I: Domain Randomization (The Chaos Engine)

Domain Randomization (DR) is the most ubiquitous technique for Sim-to-Real transfer. It operates on a counterintuitive premise: instead of trying to make the simulation more accurate (which is expensive and difficult), researchers make it more chaotic.9

4.1 The Mechanics of Randomization

In a standard simulation, gravity is exactly 9.81 m/s², the robot’s mass is exactly 12 kg, and the floor friction is constant. In a Domain Randomized simulation, these parameters are treated as variables drawn from a probability distribution.

  • Dynamics Randomization: In every training episode, the simulator selects a random mass for the robot (e.g., between 10 kg and 15 kg), a random friction coefficient for the floor, and random damping values for the joints.

  • Visual Randomization: For vision-based robots, the simulator randomizes camera position, lighting, textures, and background colors.13

The goal is to force the neural network to learn a policy that is robust to variability. If the agent can walk effectively whether the robot weighs 10 kg or 15 kg, it is likely to walk effectively on the real robot, even if the real robot’s true effective mass is 12.5 kg. The real world becomes just "one more sample" from the randomized distribution the agent has already mastered.9
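
To make this concrete, the sketch below shows how dynamics randomization is typically wired into a training loop: a fresh set of physics parameters is sampled at the start of every episode. The parameter names, ranges, and the make_env constructor are illustrative assumptions for this article, not values from the survey; real pipelines expose their own interfaces for setting these quantities.

```python
import numpy as np

# Illustrative randomization ranges; the names and bounds are assumptions, not survey values.
DYNAMICS_RANGES = {
    "mass_kg":       (10.0, 15.0),   # robot base mass
    "friction":      (0.4, 1.2),     # ground friction coefficient
    "joint_damping": (0.05, 0.30),   # per-joint damping
}

def sample_dynamics(rng: np.random.Generator) -> dict:
    """Draw one set of physics parameters for a single training episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in DYNAMICS_RANGES.items()}

def train(num_episodes: int, make_env, policy, seed: int = 0):
    """Resample the simulated physics at the start of every episode (dynamics randomization)."""
    rng = np.random.default_rng(seed)
    for _ in range(num_episodes):
        env = make_env(**sample_dynamics(rng))   # hypothetical constructor taking physics params
        obs, done = env.reset(), False
        while not done:
            obs, reward, done, info = env.step(policy(obs))
```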

4.2 Automatic Domain Randomization (ADR)

A significant evolution of this technique is Automatic Domain Randomization (ADR), pioneered by OpenAI for the Dactyl project. Manual DR requires engineers to guess the appropriate ranges for randomization. If the ranges are too wide, the task becomes impossible; if too narrow, the policy overfits.8

ADR automates this curriculum. It starts with a non-randomized environment (Level 1 difficulty). As the agent meets a performance threshold, the algorithm automatically expands the bounds of the randomization (e.g., allowing friction to drop lower or gravity to fluctuate more). This creates a "frontier" of difficulty that grows with the agent's capability, ensuring the robot is always challenged but never overwhelmed. This technique was crucial for learning dexterous manipulation, where the physics are too complex for manual tuning.16
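
A minimal sketch of the ADR idea follows, assuming a single scalar success rate drives the curriculum and every parameter range expands or shrinks symmetrically. OpenAI's actual implementation tracks per-boundary performance buffers and adjusts each bound independently, which is omitted here for brevity.

```python
import numpy as np

class AutoDomainRandomizer:
    """Sketch of ADR: randomization ranges grow outward as the agent clears a success
    threshold, and shrink back if performance collapses."""

    def __init__(self, nominal: dict, step: float = 0.05,
                 expand_above: float = 0.8, shrink_below: float = 0.4):
        # Start non-randomized: every range collapses to its nominal value.
        self.ranges = {name: [value, value] for name, value in nominal.items()}
        self.step = step
        self.expand_above = expand_above
        self.shrink_below = shrink_below

    def update(self, success_rate: float) -> None:
        """Widen every range after good performance, narrow it after poor performance."""
        for bounds in self.ranges.values():
            delta = self.step * max(abs(bounds[1]), 1.0)
            if success_rate >= self.expand_above:
                bounds[0] -= delta
                bounds[1] += delta
            elif success_rate <= self.shrink_below:
                bounds[0] = min(bounds[0] + delta, bounds[1])
                bounds[1] = max(bounds[1] - delta, bounds[0])

    def sample(self, rng: np.random.Generator) -> dict:
        """Draw one environment's physics from the current (growing) distribution."""
        return {name: rng.uniform(lo, hi) for name, (lo, hi) in self.ranges.items()}

# Illustrative usage: widen the friction/mass ranges whenever success exceeds 80%.
adr = AutoDomainRandomizer({"friction": 0.8, "mass_kg": 12.0})
rng = np.random.default_rng(0)
for epoch_success_rate in [0.9, 0.95, 0.85]:
    adr.update(epoch_success_rate)
params = adr.sample(rng)
```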

4.3 Comparison: System Identification vs. Domain Randomization

It is useful to contrast DR with the traditional control approach of System Identification (SysID).

  • Goal: SysID measures real-world parameters (friction, mass) precisely and tunes the model to match; DR trains across a wide distribution of parameters so that precision does not matter.

  • Data requirement: SysID requires real-world data collection to calibrate the model; DR can often be applied "zero-shot," with no real data required.

  • Robustness: SysID is brittle; if the world changes (e.g., the robot picks up a load), the model is wrong. DR is robust because the policy is trained to handle varying masses and loads.

  • Complexity: SysID is computationally efficient at runtime; DR requires training large neural networks to handle the variance.

While SysID is still used, DR has largely superseded it for DRL because it removes the need for tedious calibration.13

5. Methodology II: Privileged Learning and Teacher-Student Architectures

One of the unique advantages of simulation is the availability of "Privileged Information"—data that exists in the code but is invisible to the robot's sensors. In a simulator, we know the exact terrain geometry under the robot's feet, the exact location of obstacles, and the precise velocity of every joint. A real robot, however, is "blind" to these ground truths, relying on noisy cameras and encoders.18

To leverage this, researchers have adopted Teacher-Student (or Asymmetric Actor-Critic) architectures. This process effectively "distills" the omniscience of the simulator into a robust sensorimotor policy.

5.1 The Teacher Policy

The first step is to train a "Teacher" agent in simulation. This agent is given access to everything: the exact friction of the floor, the mass of the object being manipulated, and a noise-free map of the surroundings. Because it has perfect information, the Teacher learns quickly and achieves high performance. However, this policy cannot be deployed because the real robot does not have "Privileged Sensors".20

5.2 The Student Policy

The second step trains a "Student" agent. The Student is restricted to the observations available to the real robot (e.g., depth images, joint angles, IMU data). The Student is trained using supervised learning to mimic the actions of the Teacher. Alternatively, the Student is trained via RL, but its reward is augmented by how well it approximates the Teacher's internal state representations.18

Through this process, the Student learns to infer the privileged information from its noisy sensors. For example, if the Student sees its leg slipping (via joint encoders), it implicitly learns that the friction is low—a fact the Teacher knew explicitly. This approach has been instrumental in quadruped locomotion and drone racing.23
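
The sketch below illustrates the supervised (behavior-cloning style) variant of this distillation step, assuming a hypothetical environment that exposes both the student's observation history and the teacher's privileged state during training. The network sizes and the environment interface are illustrative, not taken from any specific system.

```python
import torch
import torch.nn as nn

class StudentPolicy(nn.Module):
    """Maps a flattened history of onboard observations (no privileged info) to actions."""
    def __init__(self, obs_dim: int, history_len: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim * history_len, 256), nn.ELU(),
            nn.Linear(256, 128), nn.ELU(),
            nn.Linear(128, act_dim),
        )

    def forward(self, obs_history: torch.Tensor) -> torch.Tensor:
        return self.net(obs_history.flatten(start_dim=1))

def distill(student, teacher, env, optimizer, steps: int):
    """Roll out the student, but supervise its actions with the privileged teacher."""
    obs_history, priv_state = env.reset()            # hypothetical env returning both views
    for _ in range(steps):
        student_action = student(obs_history)
        with torch.no_grad():
            teacher_action = teacher(priv_state)     # teacher sees ground-truth simulator state
        loss = nn.functional.mse_loss(student_action, teacher_action)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Step the simulator with the student's own action so the training-state
        # distribution matches what the student will encounter at deployment.
        obs_history, priv_state = env.step(student_action.detach())
```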

5.3 Asymmetric Actor-Critic

A variation of this is the Asymmetric Actor-Critic. In algorithms like PPO or SAC, the Critic network estimates the value of a state to guide the Actor network. Since the Critic is only used during training, it can be given privileged information without breaking the deployment constraints. The Actor (which runs on the robot) remains blind, but it receives gradients from an omniscient Critic, stabilizing the learning process.25
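
In code, the asymmetric split is mostly a matter of input wiring, as in the minimal PyTorch sketch below: the critic consumes privileged simulator state during training, while the actor is restricted to observations that will exist on the real robot. Layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class AsymmetricActorCritic(nn.Module):
    def __init__(self, obs_dim: int, priv_dim: int, act_dim: int):
        super().__init__()
        # Actor: restricted to observations the real robot can actually measure.
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ELU(),
            nn.Linear(256, act_dim), nn.Tanh(),
        )
        # Critic: used only during training, so it may also consume privileged simulator state.
        self.critic = nn.Sequential(
            nn.Linear(obs_dim + priv_dim, 256), nn.ELU(),
            nn.Linear(256, 1),
        )

    def act(self, obs: torch.Tensor) -> torch.Tensor:
        return self.actor(obs)                                  # deployable path

    def value(self, obs: torch.Tensor, priv: torch.Tensor) -> torch.Tensor:
        return self.critic(torch.cat([obs, priv], dim=-1))      # simulation-only path
```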

6. Methodology III: Adaptation and History-Based Policies

Domain Randomization creates a robust policy, but sometimes the environment changes in ways that require the robot to adapt its strategy on the fly. To achieve this without retraining, researchers utilize History-Based Policies.2

Instead of feeding the neural network a snapshot of only the current state s_t, the agent is fed a history of the last N observations (s_{t-N}, ..., s_{t-1}, s_t). This temporal sequence allows the network to function as an implicit system identification module.

  • Example: A walking robot steps onto a patch of ice.

  • Instantaneous View: The robot sees its leg moving. It might not realize it is slipping.

  • History View: The robot sees that for the last 500ms, its legs have been moving fast, but its body velocity (IMU) has not increased. The discrepancy between motor effort and movement allows the network to infer "Low Friction" and switch to a more conservative gait.27

This technique, often implemented using Temporal Convolutional Networks (TCNs) or Transformers, allows a single static policy to adapt to diverse environments (sand, snow, concrete) by recognizing the "signature" of the interaction dynamics in its short-term memory.29
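
A minimal sketch of such a history-based policy is shown below, assuming the last few dozen proprioceptive observations are stacked along a time axis and encoded with a small temporal convolution before the action head; all dimensions are illustrative placeholders.

```python
import torch
import torch.nn as nn

class HistoryPolicy(nn.Module):
    """Encodes a short observation history so the policy can infer hidden dynamics
    (e.g., low friction) from the mismatch between commands and measured motion."""
    def __init__(self, obs_dim: int, history_len: int, act_dim: int):
        super().__init__()
        # Temporal convolution over the history axis; assumes a history of a few dozen steps.
        self.encoder = nn.Sequential(
            nn.Conv1d(obs_dim, 32, kernel_size=5, stride=2), nn.ELU(),
            nn.Conv1d(32, 32, kernel_size=3, stride=2), nn.ELU(),
            nn.Flatten(),
        )
        with torch.no_grad():   # infer the flattened latent size for the action head
            latent_dim = self.encoder(torch.zeros(1, obs_dim, history_len)).shape[-1]
        self.head = nn.Sequential(nn.Linear(latent_dim, 128), nn.ELU(), nn.Linear(128, act_dim))

    def forward(self, obs_history: torch.Tensor) -> torch.Tensor:
        # obs_history: (batch, history_len, obs_dim) -> (batch, obs_dim, history_len) for Conv1d
        return self.head(self.encoder(obs_history.transpose(1, 2)))

# Illustrative usage: 50 proprioceptive steps of 48 features each, 12 joint targets out.
policy = HistoryPolicy(obs_dim=48, history_len=50, act_dim=12)
action = policy(torch.zeros(2, 50, 48))
```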

7. Case Study: Legged Locomotion (ANYmal)

The domain of quadrupedal locomotion is widely considered the "flagship" success of robotic DRL. Prior to DRL, controlling legged robots required complex model-based approaches like Model Predictive Control (MPC), which struggled with compliant surfaces (mud, vegetation) and required extensive tuning. Today, DRL-controlled quadrupeds like ANYmal (ETH Zurich) and Spot (Boston Dynamics) exhibit robustness that rivals biological movement.1

7.1 The "Blind" Locomotive Breakthrough

One of the seminal works in this field was the deployment of a DRL policy on the ANYmal robot by Hwangbo et al. (2019). The team utilized a sophisticated simulation that modeled the actuator dynamics—the specific way the robot's motors translate commands into torque—rather than assuming idealized torque output. This reduced the reality gap significantly.23
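
The sketch below shows the general shape of such a learned actuator model, assuming it maps a short history of joint position-tracking errors and velocities to the torque the actuator actually delivers; the layer sizes, activation, and history length are illustrative rather than the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class ActuatorNet(nn.Module):
    """Learned actuator model: predicts the torque the real actuator actually delivers
    from a short history of position-tracking errors and joint velocities."""
    def __init__(self, history_len: int = 3):
        super().__init__()
        in_dim = 2 * history_len                      # (position error, velocity) per history step
        self.net = nn.Sequential(
            nn.Linear(in_dim, 32), nn.Softsign(),
            nn.Linear(32, 32), nn.Softsign(),
            nn.Linear(32, 1),                         # realized joint torque
        )

    def forward(self, pos_err_hist: torch.Tensor, vel_hist: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([pos_err_hist, vel_hist], dim=-1))

# During policy training, a module like this replaces the simulator's ideal torque source,
# so the agent experiences realistic actuator lag, compliance, and saturation effects.
model = ActuatorNet()
torque = model(torch.zeros(1, 3), torch.zeros(1, 3))
```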

In follow-up work using a Teacher-Student architecture, the robot learned to walk blindly (without vision) over rugged terrain. The "Student" policy used only proprioception (joint angles and body velocity). By processing the history of these proprioceptive signals, the robot could "feel" the terrain. If a foot hit an obstacle, the history would reflect the impact, and the policy would reflexively lift the leg higher. This system achieved Level 4 maturity, traversing hiking trails, stairs, and snow in the real world.23

7.2 From Blindness to Parkour

Subsequent work integrated exteroception (vision). By feeding depth camera data into the RL agent (again, often using asymmetric training where the Critic sees a perfect map and the Actor sees noisy depth points), robots like ANYmal learned "Parkour" skills. They could identify gaps in the floor and jump over them, or identify high ledges and climb them. The DRL agent learned to correlate the visual input of an obstacle with the necessary motor torque to vault over it, integrating navigation and locomotion into a single end-to-end policy.19

8. Case Study: Aerial Agility (Swift & Drone Racing)

While quadrupeds operate at relatively low speeds, drone racing introduces the challenges of high-speed aerodynamics and extreme system latency. In 2023, the "Swift" system, developed by the University of Zurich, achieved a historic milestone by defeating human world champions in First-Person View (FPV) drone racing.6

8.1 The Physics of Speed

Racing drones fly at speeds exceeding 100 km/h, pulling accelerations of up to 5g. At these speeds, aerodynamic drag becomes complex and turbulent, making traditional analytic models inaccurate. Furthermore, the delay between the camera capturing an image and the motors responding (latency) can be fatal—a 50ms delay means the drone travels 1.4 meters before the computer reacts.33
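
The latency figure follows directly from the quoted speed; as a quick check:

```latex
100~\text{km/h} \approx 27.8~\text{m/s}, \qquad
27.8~\text{m/s} \times 0.050~\text{s} \approx 1.4~\text{m}
```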

8.2 The Swift Architecture

Swift employed a hybrid Deep RL approach that leveraged Residual Physics Learning.

  1. Simulation with Residuals: The training simulation included a standard rigid-body model plus a learned "residual" term (a rough sketch of this idea follows the list). This residual term was a neural network trained on real-world flight data to predict the difference between the simple model and reality (e.g., the complex drag coefficients). This minimized the Sim-to-Real gap for aerodynamics.6

  2. Perception-Control Split: Unlike pure end-to-end approaches that map pixels to motors, Swift decoupled perception. A vision system (CNN) detected the racing gate corners to estimate the drone's position. The DRL policy then took this estimated state and output control commands.

  3. Time-Optimal Policies: The DRL agent was rewarded for minimizing lap time. Through millions of simulated races, it discovered trajectories that pushed the battery and motors to their absolute thermal and magnetic limits—lines that human pilots, relying on intuition, could not consistently execute.
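
Below is a minimal sketch of the residual-dynamics idea from step 1, assuming a simple analytic rigid-body step function plus a learned correction regressed on logged real flights; the model, dimensions, and training loop are illustrative placeholders rather than the Swift implementation.

```python
import torch
import torch.nn as nn

class ResidualDynamics(nn.Module):
    """Hybrid model: analytic rigid-body prediction plus a learned residual that absorbs
    unmodeled effects (e.g., turbulent drag) identified from real flight logs."""
    def __init__(self, state_dim: int, act_dim: int):
        super().__init__()
        self.residual = nn.Sequential(
            nn.Linear(state_dim + act_dim, 128), nn.ELU(),
            nn.Linear(128, state_dim),
        )

    def forward(self, state, action, nominal_step):
        # nominal_step: analytic one-step predictor, e.g. the quadrotor rigid-body equations
        return nominal_step(state, action) + self.residual(torch.cat([state, action], dim=-1))

def fit_residual(model, nominal_step, states, actions, next_states, epochs: int = 100):
    """Regress the residual so the hybrid model reproduces logged real-world transitions."""
    opt = torch.optim.Adam(model.residual.parameters(), lr=1e-3)
    for _ in range(epochs):
        pred = model(states, actions, nominal_step)
        loss = nn.functional.mse_loss(pred, next_states)
        opt.zero_grad()
        loss.backward()
        opt.step()
```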

In head-to-head races against human champions, Swift won consistently, marking a Level 3/4 success for DRL in high-speed aerial robotics. It demonstrated that RL could manage systems where the physics are changing rapidly and the margins for error are measured in milliseconds.24

9. Case Study: Dexterous Manipulation (Dactyl)

Manipulation is widely regarded as the "Grand Challenge" of robotics. Unlike locomotion, where the robot interacts with the ground (a relatively static object), manipulation involves interacting with free-floating objects that have unknown mass, friction, and shapes. The contact physics of a robotic hand manipulating a cube are discontinuous and difficult to simulate accurately.8

9.1 OpenAI's Dactyl and the Rubik's Cube

In 2019, OpenAI demonstrated "Dactyl," a system that used a Shadow Dexterous Hand (an anthropomorphic hand with 24 degrees of freedom) to solve a Rubik's Cube one-handed. The complexity of this task is immense: the robot must rotate the face of the cube without dropping it, requiring precise coordination of five fingers and constant friction management.35

9.2 Solving the Contact Problem with Massive Randomization

Dactyl was the proving ground for Automatic Domain Randomization (ADR). The team concluded that no single simulation could capture the physics of the real hand and cube perfectly. Instead, they randomized everything:

  • Physical Properties: Cube size, mass, friction, finger joint damping, gravity.

  • Perturbations: During training, simulated "invisible hands" would poke the cube or the robot.

  • Visuals: The colors and lighting were shifted wildly.

The resulting policy was not just robust; it exhibited Meta-Learning. The agent utilized an LSTM (memory) network. When the cube was heavy, the hand would drop slightly, the LSTM state would update, and the policy would tighten its grip. This happened without any explicit "weight sensor"—the adaptability emerged purely from the recurrent dynamics of the network trained over the randomized distribution.16
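
A rough sketch of such a recurrent policy is given below, assuming an LSTM whose hidden state is reset once per episode and then carried across control steps, so that adaptation emerges from memory rather than from any explicit sensor; the sizes are illustrative.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """LSTM policy: the hidden state accumulates evidence about the current physics
    (cube mass, friction, a glove on the hand) and the action head adapts accordingly."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, act_dim)

    def forward(self, obs, state=None):
        # obs: (batch, 1, obs_dim) for a single control step; `state` persists across steps.
        out, state = self.lstm(obs, state)
        return self.head(out[:, -1]), state

# Illustrative usage: the hidden state is kept within an episode, so a heavier-than-expected
# cube shifts the state and the policy tightens its grip on later steps.
policy = RecurrentPolicy(obs_dim=48, act_dim=20)
obs = torch.zeros(1, 1, 48)
action, state = policy(obs)            # first step: default behavior
action, state = policy(obs, state)     # later steps: behavior conditioned on accumulated memory
```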

Remarkably, the robot could continue to solve the cube even when researchers tied its fingers together or put a rubber glove on the hand. The agent had never seen a rubber glove in simulation, but the glove's physics (different friction, slight resistance) fell within the broad distribution of "weird physics" the agent had mastered via ADR.35

10. Case Study: High-Speed Driving (Gran Turismo Sophy)

Autonomous driving is typically associated with safety and traffic rules. However, high-speed racing represents a different challenge: driving at the "friction limit." A racing car must brake at the latest possible millisecond and turn while balancing on the edge of tire adhesion. Sony AI's Gran Turismo Sophy (GT Sophy) applied DRL to this domain, defeating top esports drivers in the simulator Gran Turismo Sport.37

10.1 The Friction Circle and Tactics

GT Sophy required the agent to master two distinct skills:

  1. Vehicle Dynamics: Controlling the car at the limit. The agent learned "trail braking" (braking while turning) and how to manage tire wear, skills that require delicate continuous control.39

  2. Race Tactics: Unlike time-trial racing, competitive racing involves opponents. GT Sophy learned to "slipstream" (draft) behind opponents to gain speed and to block opponents from passing.

The agent was trained with Quantile Regression Soft Actor-Critic (QR-SAC), a distributional RL algorithm that estimates the distribution of future rewards rather than just the mean. This allowed the agent to better manage the variance and risk inherent in racing (e.g., the risk of a collision vs. the reward of an overtake). While GT Sophy operates in a digital simulator, the fidelity of Gran Turismo makes it a relevant proxy for real-world autonomous racing, demonstrating DRL's ability to handle high-speed, adversarial multi-agent interactions.37
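
As an illustration of the distributional idea, the snippet below implements the standard quantile-regression (pinball) loss that such critics minimize, here without the Huber smoothing used in the full algorithms; the tensor shapes are illustrative and this is not Sony AI's implementation.

```python
import torch

def quantile_regression_loss(pred_quantiles: torch.Tensor,
                             target_samples: torch.Tensor) -> torch.Tensor:
    """Pinball loss between predicted return quantiles and target return samples.

    pred_quantiles: (batch, n_quantiles)  -- critic's estimated quantiles of the return
    target_samples: (batch, n_targets)    -- e.g. Bellman targets from a target network
    """
    n_quantiles = pred_quantiles.shape[1]
    # Quantile midpoints tau_i = (i + 0.5) / N, one per predicted quantile.
    taus = (torch.arange(n_quantiles, dtype=torch.float32) + 0.5) / n_quantiles
    # Pairwise errors u_ij = target_j - prediction_i, shape (batch, n_quantiles, n_targets).
    u = target_samples.unsqueeze(1) - pred_quantiles.unsqueeze(2)
    # Asymmetric weighting: under- and over-estimates are penalized by tau vs. (1 - tau).
    weight = torch.abs(taus.view(1, -1, 1) - (u.detach() < 0).float())
    return (weight * u.abs()).mean()

# Illustrative usage: 32 predicted quantiles vs. 32 sampled Bellman targets per transition.
pred = torch.randn(8, 32, requires_grad=True)
target = torch.randn(8, 32)
loss = quantile_regression_loss(pred, target)
loss.backward()
```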

11. The Generalist Turn: Vision-Language-Action Models (RT-1, RT-2)

The case studies above (ANYmal, Swift, Dactyl) represent "Specialist" agents—robots designed to do one thing (walk, fly, manipulate) extremely well. The current frontier of robotic learning is the "Generalist" robot: a machine that can perform any task requested of it in an unstructured environment.40

11.1 The Integration of Large Language Models

Google DeepMind's Robotic Transformer (RT) series represents a shift away from pure RL toward Vision-Language-Action (VLA) models.

  • RT-1: A transformer model trained on 130,000 real-world demonstrations. It takes images and text instructions ("pick up the Coke can") and outputs tokenized actions (see the sketch after this list). It showed high success rates (97%) on seen tasks but struggled with new objects.40

  • RT-2: This model took a massive leap by co-training on robotic data and internet-scale vision-language data. By "distilling" the knowledge of a Visual Language Model (VLM) into the robot, RT-2 gained the ability to reason.
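
For readers unfamiliar with "tokenized actions," the helper below illustrates the underlying idea: each continuous action dimension is discretized into a fixed number of bins so that a transformer can predict actions as categorical tokens (the RT-1 paper describes 256 bins per dimension). The function names, bounds, and 7-D action layout are illustrative assumptions, not Google's implementation.

```python
import numpy as np

NUM_BINS = 256   # bins per action dimension (256 per the RT-1 paper; scheme here is illustrative)

def tokenize_action(action: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Map each continuous action dimension to an integer token in [0, NUM_BINS - 1]."""
    normalized = (action - low) / (high - low)                  # scale to [0, 1]
    return np.clip((normalized * NUM_BINS).astype(int), 0, NUM_BINS - 1)

def detokenize_action(tokens: np.ndarray, low: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Map integer tokens back to bin-center continuous values."""
    return low + (tokens + 0.5) / NUM_BINS * (high - low)

# Illustrative 7-D arm action (x, y, z, roll, pitch, yaw, gripper); the bounds are assumptions.
low, high = np.full(7, -1.0), np.full(7, 1.0)
tokens = tokenize_action(np.array([0.1, -0.3, 0.5, 0.0, 0.2, -0.8, 1.0]), low, high)
recovered = detokenize_action(tokens, low, high)
```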

11.2 Emergent Generalization

If a user asks RT-2 to "pick up the extinct animal," the robot can identify a plastic dinosaur on the table and pick it up. The robot was never explicitly trained on "extinct animals." It transfers the semantic concept from its web-based pre-training to the robotic control task. This suggests that the future of DRL may not be training from scratch, but rather using RL to fine-tune these massive pre-trained foundation models (RLHF for Robotics), correcting their physical execution errors while leveraging their semantic reasoning.41

12. Critical Challenges and Limitations

Despite the successes of Level 4/5 in locomotion and Level 3 in racing, DRL in robotics faces significant hurdles before it can be ubiquitous.5

12.1 The Sample Efficiency Wall

RL is inherently data-hungry. Dactyl required the equivalent of roughly 13,000 years of simulated experience to learn to manipulate the Rubik's Cube. In simulation, this is feasible. In domains where simulation is difficult—such as interacting with liquids, granular media (sand), or cutting food—Sim-to-Real is not an option, and real-world training is prohibitively slow. Methods to improve sample efficiency, such as Model-Based RL (Dreamer, MBPO), are showing promise by learning a world model from limited data and "dreaming" inside it, but they are often less stable than model-free methods.11

12.2 Safety and Constraints

Standard DRL maximizes reward. It does not inherently respect safety constraints. A robot learning to open a door might learn that "smashing the glass" is the fastest way to get to the other side. Safe RL attempts to solve this using Constrained Markov Decision Processes (CMDPs), where the agent maximizes reward subject to cost limits. However, enforcing "hard constraints" (never hit a human) during the exploration phase (when the robot knows nothing) is a paradox. Current solutions often rely on "Safety Shields"—classical control layers that override the RL policy if it approaches a dangerous state—but this limits the agent's performance capabilities.45
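
The "Safety Shield" pattern is simple to express as a wrapper around the learned policy, as in the sketch below: a classical checker vetoes any proposed action that would violate a constraint and substitutes a verified fallback. The is_safe and safe_fallback functions are hypothetical stand-ins for, e.g., a one-step look-ahead check and an emergency-stop controller.

```python
def shielded_policy(policy, is_safe, safe_fallback):
    """Wrap an RL policy so a classical safety layer can override unsafe actions.

    policy(obs) -> action            learned (possibly unsafe) controller
    is_safe(obs, action) -> bool     e.g. forward-simulate one step and check constraints
    safe_fallback(obs) -> action     verified controller, e.g. braking or hold-position
    """
    def act(obs):
        action = policy(obs)
        if is_safe(obs, action):
            return action             # the RL action passes the check
        return safe_fallback(obs)     # the shield overrides near a constraint boundary
    return act

# Illustrative usage with toy 1-D dynamics: never let the position exceed a limit.
LIMIT = 1.0
policy = lambda obs: 0.5                                    # always push forward
is_safe = lambda obs, a: abs(obs + a) < LIMIT               # one-step look-ahead check
safe_fallback = lambda obs: 0.0                             # stop
controller = shielded_policy(policy, is_safe, safe_fallback)
assert controller(0.2) == 0.5 and controller(0.8) == 0.0
```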

12.3 Generalization to Open Worlds

While RT-2 shows promise, most RL agents are still brittle to distribution shifts that fall outside their randomization. A robot trained to walk on stairs might fail on an escalator. Achieving the "Level 5" autonomy where a robot can be unboxed in a new home and work immediately without fine-tuning remains an open challenge. This is often referred to as the "Moravec Paradox" in a new form: high-level reasoning (ChatGPT) is becoming easy, but low-level sensorimotor control in a chaotic world remains incredibly hard.1

13. Conclusion

The last decade has witnessed a transformation in robotic control. We have moved from the "era of equations," where every motion was derived from a model, to the "era of experience," where robots learn to move by interacting with simulated replicas of our world.

The survey by Tang et al. (2025) and the case studies of ANYmal, Swift, and Dactyl confirm that Deep Reinforcement Learning is no longer just a research curiosity. It has solved the problem of quadrupedal locomotion (Level 5), cracked high-speed drone flight (Level 4), and is making inroads into the holy grail of dexterous manipulation. The winning formula is now clear: Massive Domain Randomization + Privileged Teacher-Student Learning + History-Based Adaptation.

As we look forward, the integration of DRL with Vision-Language-Action models suggests a future where robots are not just agile, but also intelligent. The "brain" of the chatbot is meeting the "body" of the robot. The challenge for the next decade will be to ensure that this synthesis is safe, efficient, and capable of operating not just in the curated chaos of a lab, but in the unbridled complexity of the human world.

Key Takeaways

  • Sim-to-Real is the Standard: Almost all real-world DRL successes rely on training in simulation with Domain Randomization. Direct real-world training is too slow and dangerous for complex tasks.

  • Locomotion is Solved: Quadruped walking via RL is mature, robust, and commercially deployed (Level 5).

  • Manipulation is the Frontier: Handling objects remains Level 1-2 due to complex contact physics; ADR is the primary solution.

  • Privileged Information is Essential: "Cheating" in simulation (using ground truth) to train a Teacher, then distilling that into a blind Student, is the dominant architecture for high-performance control.

  • Foundation Models are Incoming: The next generation of robots will likely use RL to fine-tune generalist models (like RT-2) rather than learning from scratch.

Works cited

  1. Deep Reinforcement Learning for Robotics: A Survey of Real-World Successes - arXiv, accessed January 17, 2026, https://arxiv.org/html/2408.03539v1

  2. Deep Reinforcement Learning for Robotics: A Survey of Real-World Successes, accessed January 17, 2026, https://www.annualreviews.org/content/journals/10.1146/annurev-control-030323-022510

  3. Peter Stone: Deep Reinforcement Learning for Robotics: A Survey of ..., accessed January 17, 2026, https://www.cs.utexas.edu/~pstone/Papers/bib2html/b2hd-chen_tang_ARCRAS2024.html

  4. Deep Reinforcement Learning for Robotics: A Survey of Real-World Successes, accessed January 17, 2026, https://discovery.researcher.life/article/deep-reinforcement-learning-for-robotics-a-survey-of-real-world-successes/06749930e25439f0b00e81b1b8f126df

  5. Deep Reinforcement Learning for Robotics: A Survey of Real-World Successes - Annual Reviews, accessed January 17, 2026, https://www.annualreviews.org/doi/pdf/10.1146/annurev-control-030323-022510

  6. Champion-Level Drone Racing Using Deep Reinforcement Learning, accessed January 17, 2026, https://contest.techbriefs.com/2024/entries/robotics-and-automation/12793-0609-054901-champion-level-drone-racing-using-deep-reinforcement-learning

  7. Deep Reinforcement Learning for Robotics: A Survey of Real-World Successes - AAAI Publications, accessed January 17, 2026, https://ojs.aaai.org/index.php/AAAI/article/view/35095/37250

  8. Machine Learning | Jonas Schneider, accessed January 17, 2026, https://jonasschneider.com/machine-learning/

  9. Sim2Real Transfer Methods - Emergent Mind, accessed January 17, 2026, https://www.emergentmind.com/topics/sim2real-transfer-method

  10. Training Sim-to-Real Transferable Robotic Assembly Skills over Diverse Geometries, accessed January 17, 2026, https://developer.nvidia.com/blog/training-sim-to-real-transferable-robotic-assembly-skills-over-diverse-geometries/

  11. What is sample efficiency in RL? - Milvus, accessed January 17, 2026, https://milvus.io/ai-quick-reference/what-is-sample-efficiency-in-rl

  12. What exactly makes sim to real transfer a challenge in reinforcement learning? - Reddit, accessed January 17, 2026, https://www.reddit.com/r/robotics/comments/1j99vrt/what_exactly_makes_sim_to_real_transfer_a/

  13. Domain Randomization for Sim2Real Transfer | Lil'Log, accessed January 17, 2026, https://lilianweng.github.io/posts/2019-05-05-domain-randomization/

  14. Domain Adaptation and Domain Randomization for Sim2Real RL - LAMDA, accessed January 17, 2026, http://www.lamda.nju.edu.cn/chenxh/slides/Domain-Adaptation-and-Domain-Randomization-for-Sim2Real-RL.pdf

  15. Domain Adaptation Using System Invariant Dynamics Models, accessed January 17, 2026, http://proceedings.mlr.press/v144/wang21c/wang21c.pdf

  16. Products and applications of OpenAI - Wikipedia, accessed January 17, 2026, https://en.wikipedia.org/wiki/Products_and_applications_of_OpenAI

  17. Artificial Intelligence: Emerging Themes, Issues, and Narratives - CNA.org., accessed January 17, 2026, https://www.cna.org/reports/2020/11/DOP-2020-U-028073-Final.pdf

  18. Teacher–Student Reinforcement Learning - Emergent Mind, accessed January 17, 2026, https://www.emergentmind.com/topics/teacher-student-reinforcement-learning-framework

  19. Deep Whole-Body Control: Learning a Unified Policy for Manipulation and Locomotion, accessed January 17, 2026, https://manipulation-locomotion.github.io/resources/Deep-Whole-Body-Control.pdf

  20. TGRL: Teacher Guided Reinforcement Learning Algorithm for POMDPs - OpenReview, accessed January 17, 2026, https://openreview.net/pdf?id=kTqjkIvjj7

  21. Deep reinforcement learning for real-world quadrupedal locomotion: a comprehensive review - OAE Publishing Inc., accessed January 17, 2026, https://www.oaepublish.com/articles/ir.2022.20

  22. A more effective way to train machines for uncertain, real-world situations | MIT News, accessed January 17, 2026, https://news.mit.edu/2023/more-effective-train-machines-uncertain-real-world-situations-0531

  23. [1901.08652] Learning agile and dynamic motor skills for legged robots - arXiv, accessed January 17, 2026, https://arxiv.org/abs/1901.08652

  24. Vision-based Navigation of Micro Aerial Vehicles (MAVs) - Robotics and Perception Group, accessed January 17, 2026, https://rpg.ifi.uzh.ch/research_mav.html

  25. Attention-Privileged Reinforcement Learning, accessed January 17, 2026, https://proceedings.mlr.press/v155/salter21a/salter21a.pdf

  26. A Learning Framework for Diverse Legged Robot Locomotion Using Barrier-Based Style Rewards - arXiv, accessed January 17, 2026, https://arxiv.org/html/2409.15780v1

  27. UNDERSTANDING DOMAIN RANDOMIZATION FOR SIM-TO-REAL TRANSFER - OpenReview, accessed January 17, 2026, https://openreview.net/pdf?id=T8vZHIRTrY

  28. Learned perception modules for autonomous aerial vehicle navigation and control - IRoM-Lab - Princeton University, accessed January 17, 2026, https://irom-lab.princeton.edu/wp-content/uploads/2025/07/Simon_princeton_0181D_15442-1.pdf

  29. The Reality Gap in Robotics: Challenges, Solutions, and Best Practices - arXiv, accessed January 17, 2026, https://arxiv.org/html/2510.20808v1

  30. Learn to Adapt: A Policy for History-Based Online Adaptation - IEEE Xplore, accessed January 17, 2026, https://ieeexplore.ieee.org/iel8/7433297/11267152/10947352.pdf

  31. Jemin Hwangbo - Google Scholar, accessed January 17, 2026, https://scholar.google.com/citations?user=Uam1ZB8AAAAJ&hl=en

  32. Learning and Reusing Quadruped Robot Movement Skills from Biological Dogs for Higher-Level Tasks, accessed January 17, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC10780440/

  33. Champion-level drone racing using deep reinforcement learning - PMC - NIH, accessed January 17, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC10468397/

  34. Drone Racing - Robotics and Perception Group, accessed January 17, 2026, https://rpg.ifi.uzh.ch/research_drone_racing.html

  35. Reinforcement learning helped robots solve Rubik's Cube—does it matter? - TechTalks, accessed January 17, 2026, https://bdtechtalks.com/2019/10/21/openai-rubiks-cube-reinforcement-learning/

  36. "Solving Rubik's Cube with a Robot Hand", on Akkaya et al 2019 {OA} [Dactyl followup w/improved curriculum-learning domain randomization; emergent meta-learning] : r/reinforcementlearning - Reddit, accessed January 17, 2026, https://www.reddit.com/r/reinforcementlearning/comments/did0cu/solving_rubiks_cube_with_a_robot_hand_on_akkaya/

  37. Gran Turismo Sophy: Training AI to be a Champion-Level Racer, accessed January 17, 2026, https://sonyinteractive.com/en/news/blog/gran-turismo-sophy/

  38. Unveiling Gran Turismo Sophy™ : An AI Breakthrough - Sony AI, accessed January 17, 2026, https://ai.sony/blog/Unveiling-Gran-Turismo-Sophy-An-AI-Breakthrough/

  39. Outracing champion Gran Turismo drivers with deep reinforcement learning - UT Austin Computer Science, accessed January 17, 2026, https://www.cs.utexas.edu/~pstone/Papers/bib2html-links/nature22.pdf

  40. RT-2: Vision-Language-Action Models, accessed January 17, 2026, https://robotics-transformer2.github.io/

  41. RT-2: New model translates vision and language into action - Google DeepMind, accessed January 17, 2026, https://deepmind.google/blog/rt-2-new-model-translates-vision-and-language-into-action/

  42. Advances in Transformers for Robotic Applications: A Review - arXiv, accessed January 17, 2026, https://arxiv.org/html/2412.10599v1

  43. Deep Reinforcement Learning of Mobile Robot Navigation in Dynamic Environment: A Review - MDPI, accessed January 17, 2026, https://www.mdpi.com/1424-8220/25/11/3394

  44. Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion - NIPS papers, accessed January 17, 2026, http://papers.neurips.cc/paper/8044-sample-efficient-reinforcement-learning-with-stochastic-ensemble-value-expansion.pdf

  45. [2409.12045] Handling Long-Term Safety and Uncertainty in Safe Reinforcement Learning, accessed January 17, 2026, https://arxiv.org/abs/2409.12045

  46. State-wise Safe and Robust Reinforcement Learning for Continuous Control, accessed January 17, 2026, https://icontrol.ri.cmu.edu/research/safe-rl.html

  47. SOAR-RL: Safe and Open-Space Aware Reinforcement Learning for Mobile Robot Navigation in Narrow Spaces - PubMed Central, accessed January 17, 2026, https://pmc.ncbi.nlm.nih.gov/articles/PMC12431151/
