
English-to-Traditional-Chinese-using-sequence-to-sequence-Machine-Translation
OVERVIEW
--------------------------------------------------
The goal of this project is to translate English sentences into Traditional Chinese using a sequence-to-sequence model. The model is composed of two main components: an encoder that transforms the input English sentence into a vector (or sequence of vectors) and a decoder that generates the translation one token at a time, conditioned on the encoder output and the previously generated tokens.
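As an illustrative sketch only (not the notebook's exact architecture), an encoder-decoder pair of this kind can be written in PyTorch as follows; the GRU cells and layer sizes are assumptions chosen for brevity:

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

        def forward(self, src):
            # src: (batch, src_len) -> per-token vectors and a final hidden state
            outputs, hidden = self.rnn(self.embed(src))
            return outputs, hidden

    class Decoder(nn.Module):
        def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
            self.out = nn.Linear(hid_dim, vocab_size)

        def forward(self, prev_token, hidden):
            # prev_token: (batch, 1); one decoding step conditioned on the
            # previously generated tokens (via hidden) and the encoder state
            output, hidden = self.rnn(self.embed(prev_token), hidden)
            return self.out(output), hidden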
--------------------------------------------------
DATASET
--------------------------------------------------
• Download the dataset from the provided Google Drive link:
https://drive.google.com/file/d/1Sys5keuiw4C27_cG3LMyYi6v5eHJvu1L/view
• Training Data:
- train_dev.raw.en: 390,112 English sentences
- train_dev.raw.zh: 390,112 corresponding Traditional Chinese sentences
• Test Data:
- test.raw.en: 3,940 English sentences
Note: The Chinese reference translations for the test set are not released; the provided .zh file contains only pseudo (placeholder) translations.
--------------------------------------------------
KEY COMPONENTS
--------------------------------------------------
1. Sub-word Tokenization
• Purpose: Reduce vocabulary size and address out-of-vocabulary (OOV) issues.
• Method: Split words into sub-word units (common stems, prefixes, and suffixes) to handle rare words and morphological variations, e.g., "new", "_new", "_ways", where "_" marks a word boundary.
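Below is a minimal sketch of this step using the sentencepiece library; the input file names come from the dataset above, while the vocabulary size and model prefix are illustrative assumptions, not the notebook's settings:

    import sentencepiece as spm

    # Train a joint sub-word model on the raw parallel text
    # (vocab_size=8000 is an illustrative choice).
    spm.SentencePieceTrainer.train(
        input="train_dev.raw.en,train_dev.raw.zh",
        model_prefix="spm8000",
        vocab_size=8000,
    )

    sp = spm.SentencePieceProcessor(model_file="spm8000.model")
    print(sp.encode("new ways", out_type=str))
    # e.g. ['_new', '_ways'] -- the leading '_' marks a word boundary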
2. Label Smoothing Regularization
• Purpose: Assign a small amount of probability mass to the non-target labels when computing the loss, rather than training against hard one-hot targets.
• Benefit: Reduces overfitting and improves generalization during training.
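A minimal sketch of a label-smoothed cross-entropy loss in PyTorch (the smoothing value 0.1 is an illustrative assumption):

    import torch
    import torch.nn.functional as F

    def label_smoothed_loss(logits, target, smoothing=0.1):
        # logits: (batch, vocab), target: (batch,)
        log_probs = F.log_softmax(logits, dim=-1)
        # The true class keeps 1 - smoothing of the probability mass;
        # the remaining mass is spread uniformly over the vocabulary.
        nll = -log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
        uniform = -log_probs.mean(dim=-1)
        return ((1.0 - smoothing) * nll + smoothing * uniform).mean()

Recent PyTorch versions also expose this directly via nn.CrossEntropyLoss(label_smoothing=0.1).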
3. Learning Rate Scheduling
• Implementation:
- The learning rate increases linearly for the first warmup_steps and then decreases proportionally to the inverse square root of the step number.
• Benefit: Stabilizes training, especially in the early stages, which is particularly important for transformer-based architectures.
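This is the inverse-square-root ("Noam") schedule popularized by the original Transformer paper; a minimal sketch, where the d_model and warmup_steps values are illustrative:

    import torch

    def noam_lr(step, d_model=512, warmup_steps=4000):
        # Linear warm-up for the first warmup_steps, then decay
        # proportional to step ** -0.5.
        step = max(step, 1)
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

    model = torch.nn.Linear(512, 512)  # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=1.0)  # scaled by the lambda
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_lr)
    # Call scheduler.step() once per training step (not per epoch).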
--------------------------------------------------
TRAINING AND EVALUATION
--------------------------------------------------
• Benchmark Model:
- A simple RNN sequence-to-sequence model is implemented.
- Expected running time: Approximately 0.5 hours for data processing and 0.5 hours for model training.
- Benchmark BLEU score: Approximately 15.
• Improvements:
- Hyperparameters such as the number of epochs, encoder/decoder layers, and embedding dimensions have been tuned.
- A learning rate scheduler is integrated to further stabilize and improve training performance.
• Evaluation Metric:
- BLEU score is used to evaluate translation quality by measuring modified n-gram precision (n = 1 to 4), combined with a brevity penalty for translations that are too short.
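For reference, BLEU can be computed with the sacrebleu library; a minimal sketch with illustrative sentences (not taken from the dataset):

    import sacrebleu

    hypotheses = ["今天 天氣 很 好"]
    references = [["今天 天氣 很 好"]]  # one inner list per reference set

    bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="zh")
    print(bleu.score)  # 100.0 for an exact match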
--------------------------------------------------
CODE STRUCTURE
--------------------------------------------------
• Notebook:
- Week6_ap2938DL (1).ipynb contains the full implementation, including:
• Data Preprocessing: Loading and tokenizing the English and Chinese datasets using sub-word units.
• Model Definition: Implementing the sequence-to-sequence model (encoder and decoder) with label smoothing regularization.
• Training Loop: Training the model with a learning rate scheduler and tuning hyperparameters.
• Evaluation: Generating translations and computing the BLEU score.
--------------------------------------------------
CONCLUSION
--------------------------------------------------
This project demonstrates the application of sequence-to-sequence models for machine translation from English to Traditional Chinese. By incorporating sub-word tokenization, label smoothing, and learning rate scheduling, the model achieves improved performance and training stability. All code, experiments, and generated predictions are provided in this repository.