
Deep Reinforcement Learning for Lunar Lander
GitHub Link
OVERVIEW
--------------------------------------------------
Task:
• Train an agent to land the Lunar Lander safely between the yellow flags by maximizing the total reward. Two Deep Reinforcement Learning (DRL) methods are implemented: Policy Gradient and Actor-Critic.
Environment:
• The Lunar Lander environment simulates the landing of a spacecraft on the surface of the moon.
• State: An 8-dimensional vector representing:
- Lander’s x and y coordinates
- Linear velocities in x and y directions
- Lander's angle and angular velocity
- Two booleans indicating whether each leg is in contact with the ground
• Actions: 4 discrete actions:
0: No action
1: Fire left engine (accelerate right)
2: Fire main engine (thrust upward, slowing the descent)
3: Fire right engine (accelerate left)
• Reward Structure:
- Approaching the landing pad and coming to rest yields roughly 100–140 points.
- Moving away from the pad penalizes the agent.
- Crashing incurs an additional -100 points.
- Successfully coming to rest awards an extra +100 points.
- Each leg with ground contact provides +10 points.
- Firing the main engine costs -0.3 points per frame, and the side engines cost -0.03 points per frame.
• The environment is considered solved when the total reward reaches 200 points. A minimal interaction sketch follows below.
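For reference, the snippet below sketches one random-policy episode in this environment. It assumes the classic Gym API and the "LunarLander-v2" environment id; newer Gymnasium releases change the reset() and step() return signatures, so treat it as an illustration rather than the repository's exact setup.

import gym

# Minimal sketch: one random-policy episode in Lunar Lander.
# Assumes the classic Gym API ("LunarLander-v2"); Gymnasium returns
# (state, info) from reset() and a 5-tuple from step().
env = gym.make("LunarLander-v2")
state = env.reset()                       # 8-dimensional state vector
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()    # one of the 4 discrete actions
    state, reward, done, info = env.step(action)
    total_reward += reward
print("Episode return:", total_reward)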
--------------------------------------------------
METHODS
--------------------------------------------------
Policy Gradient:
• Implementation:
- A neural network is trained using the policy gradient method, where the network directly maps states to action probabilities.
- The baseline implementation follows the provided sample code and is trained for a fixed number of epochs (e.g., 100); initial results show an average total reward of approximately -176.83 over 5 test runs.
• Tuning:
- Learning rate, number of epochs, and other hyperparameters are tuned until the total reward converges.
- Modifications such as switching to an accumulative decaying (discounted) reward structure have been explored (see the sketch below).
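The sketch below illustrates the core of such a policy gradient (REINFORCE-style) update, including the accumulative decaying return mentioned above. It is a simplified illustration rather than the repository's exact code; the network size, learning rate, and discount factor are assumed values.

import torch
import torch.nn as nn

# Illustrative policy network: maps the 8-dim state to probabilities over the 4 actions.
policy = nn.Sequential(
    nn.Linear(8, 64), nn.ReLU(),
    nn.Linear(64, 4), nn.Softmax(dim=-1),
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)   # assumed learning rate

def select_action(state):
    """Sample an action from the current policy and keep its log-probability."""
    probs = policy(torch.as_tensor(state, dtype=torch.float32))
    dist = torch.distributions.Categorical(probs)
    action = dist.sample()
    return action.item(), dist.log_prob(action)

def discounted_returns(rewards, gamma=0.99):
    """Accumulative decaying reward: G_t = r_t + gamma * G_{t+1}."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    return torch.tensor(returns, dtype=torch.float32)

def policy_gradient_update(log_probs, rewards):
    """One REINFORCE update: maximize sum_t log pi(a_t | s_t) * G_t."""
    returns = discounted_returns(rewards)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # simple variance reduction
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()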
Actor-Critic:
• Implementation:
- Builds upon the policy gradient approach by adding a value function estimator (critic) that serves as a baseline, reducing the variance of the gradient estimates and stabilizing training (as sketched below).
• Advanced Tuning:
- Further modifications to the reward mechanism and hyperparameter adjustments have been applied to improve overall performance.
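A minimal sketch of the baseline idea is shown below: a learned critic V(s) is subtracted from the return, so the actor's gradient uses the advantage instead of the raw return. The network sizes, the shared optimizer, and the equal loss weighting are assumptions, not the repository's exact configuration.

import torch
import torch.nn as nn

# Illustrative actor-critic: the critic estimates V(s) and acts as a baseline,
# which reduces the variance of the policy gradient estimate.
actor = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4), nn.Softmax(dim=-1))
critic = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def actor_critic_update(states, log_probs, returns):
    """One update from a finished episode (states, log-probs, discounted returns)."""
    states = torch.as_tensor(states, dtype=torch.float32)
    returns = torch.as_tensor(returns, dtype=torch.float32)
    values = critic(states).squeeze(-1)
    advantages = returns - values.detach()            # baseline-subtracted returns
    actor_loss = -(torch.stack(log_probs) * advantages).sum()
    critic_loss = nn.functional.mse_loss(values, returns)
    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()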
--------------------------------------------------
TRAINING AND EVALUATION
--------------------------------------------------
Training Details:
• The agent is trained in the Lunar Lander environment using gradient descent updates on the policy network (and value network for the actor-critic method).
• Hyperparameters such as the learning rate, number of epochs, and network architecture are tuned based on total reward performance; an illustrative training loop is sketched below.
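Tying the earlier sketches together, one training epoch might look as follows. This reuses the hypothetical select_action and policy_gradient_update helpers from the policy gradient sketch, and the epoch count is an assumed placeholder, not the repository's setting.

import gym

# Illustrative training loop: one episode per epoch, then one policy gradient update.
env = gym.make("LunarLander-v2")
for epoch in range(100):                      # assumed number of epochs
    state, done = env.reset(), False
    log_probs, rewards = [], []
    while not done:
        action, log_prob = select_action(state)
        state, reward, done, _ = env.step(action)
        log_probs.append(log_prob)
        rewards.append(reward)
    policy_gradient_update(log_probs, rewards)
    print(f"epoch {epoch}: episode return = {sum(rewards):.1f}")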
Evaluation:
• The trained agent is evaluated over 5 test runs.
• The average total reward across these runs is the primary performance metric; a higher reward indicates better performance (see the evaluation sketch below).
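The sketch below mirrors this protocol: run the trained policy for 5 test episodes and report the mean total reward. Taking the greedy (argmax) action at test time is an assumption for illustration, not necessarily what the repository does.

import gym
import numpy as np
import torch

def evaluate(policy, n_runs=5):
    """Average total reward over n_runs test episodes (the metric described above)."""
    env = gym.make("LunarLander-v2")
    totals = []
    for _ in range(n_runs):
        state, done, total = env.reset(), False, 0.0
        while not done:
            with torch.no_grad():
                probs = policy(torch.as_tensor(state, dtype=torch.float32))
            action = int(torch.argmax(probs))         # greedy action at test time (assumed)
            state, reward, done, _ = env.step(action)
            total += reward
        totals.append(total)
    return float(np.mean(totals))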
--------------------------------------------------
CODE STRUCTURE
--------------------------------------------------
Implementation:
• The repository includes all necessary code for:
- Setting up the Lunar Lander environment
- Implementing the policy gradient and actor-critic methods
- Conducting training loops
• The code covers:
- Environment interaction
- Neural network definition
- Gradient descent updates
- Hyperparameter tuning
- Evaluation routines
Dependencies:
• Libraries utilized include:
- gym (for environment simulation)
- numpy (for numerical computations)
- torch (for building and training neural networks)
--------------------------------------------------
CONCLUSION
--------------------------------------------------
This project demonstrates the implementation of deep reinforcement learning techniques—specifically policy gradient and actor-critic methods—from scratch to solve the Lunar Lander environment. By carefully tuning hyperparameters and experimenting with different reward structures, the agent learns to navigate the challenges of landing safely. All code, experiments, and evaluation results are provided in this repository, highlighting the process of developing DRL agents for complex control tasks.