
English-to-Traditional-Chinese-using-sequence-to-sequence-Machine-Translation

GitHub Link

--------------------------------------------------
OVERVIEW
--------------------------------------------------
The goal of this project is to translate English sentences into Traditional Chinese using a sequence-to-sequence model. The model is composed of two main components: an encoder that transforms the input English sentence into a vector (or sequence of vectors) and a decoder that generates the translation one token at a time, conditioned on the encoder output and the previously generated tokens.
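Below is a minimal sketch of this encoder-decoder structure in PyTorch; the GRU layers, dimensions, and vocabulary sizes are illustrative assumptions, not the notebook's exact architecture.

    import torch
    import torch.nn as nn

    class Encoder(nn.Module):
        def __init__(self, src_vocab_size, emb_dim=256, hid_dim=512):
            super().__init__()
            self.embed = nn.Embedding(src_vocab_size, emb_dim)
            self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

        def forward(self, src):                      # src: (batch, src_len)
            outputs, hidden = self.rnn(self.embed(src))
            return outputs, hidden                   # hidden summarizes the source

    class Decoder(nn.Module):
        def __init__(self, tgt_vocab_size, emb_dim=256, hid_dim=512):
            super().__init__()
            self.embed = nn.Embedding(tgt_vocab_size, emb_dim)
            self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
            self.out = nn.Linear(hid_dim, tgt_vocab_size)

        def forward(self, prev_tokens, hidden):      # one step at a time
            output, hidden = self.rnn(self.embed(prev_tokens), hidden)
            return self.out(output), hidden          # logits over target vocab

    enc = Encoder(src_vocab_size=8000)
    dec = Decoder(tgt_vocab_size=8000)
    _, state = enc(torch.randint(0, 8000, (2, 7)))   # dummy batch of 2 sentences
    logits, state = dec(torch.randint(0, 8000, (2, 1)), state)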

--------------------------------------------------
DATASET
--------------------------------------------------
• Download the dataset from the provided Google Drive link:
  https://drive.google.com/file/d/1Sys5keuiw4C27_cG3LMyYi6v5eHJvu1L/view
• Training Data:
  - train_dev.raw.en: 390,112 English sentences
  - train_dev.raw.zh: 390,112 corresponding Traditional Chinese sentences
• Test Data:
  - test.raw.en: 3,940 English sentences
  Note: The Chinese test translations are undisclosed (the provided .zh file is a pseudo translation).
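As a sketch, the parallel corpus can be loaded by pairing the two files line by line (the paths below assume the files sit in the working directory):

    def load_parallel(en_path, zh_path):
        with open(en_path, encoding="utf-8") as f_en, \
             open(zh_path, encoding="utf-8") as f_zh:
            return [(en.strip(), zh.strip()) for en, zh in zip(f_en, f_zh)]

    pairs = load_parallel("train_dev.raw.en", "train_dev.raw.zh")
    print(len(pairs))    # expected: 390112 sentence pairs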

--------------------------------------------------
KEY COMPONENTS
--------------------------------------------------
1. Sub-word Tokenization
  • Purpose: Reduce vocabulary size and address out-of-vocabulary (OOV) issues.
  • Method: Split words into frequent sub-word units so that rare words and morphological variations can still be represented (e.g., "new", "▁new", "▁ways", where "▁" marks a word boundary).
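Below is a sketch using the sentencepiece library; the library choice, vocabulary size, and model type are assumptions rather than confirmed settings from the notebook.

    import sentencepiece as spm

    # One-time step: train a shared sub-word model on the raw training text.
    spm.SentencePieceTrainer.train(
        input="train_dev.raw.en,train_dev.raw.zh",
        model_prefix="spm8000",
        vocab_size=8000,
        model_type="unigram",
    )

    sp = spm.SentencePieceProcessor(model_file="spm8000.model")
    print(sp.encode("new ways", out_type=str))
    # e.g. ['▁new', '▁ways']; '▁' marks a word boundary, and rare words
    # break into smaller known pieces instead of becoming OOV tokens.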

2. Label Smoothing Regularization
  • Purpose: Reserve some probability mass for incorrect labels during loss calculation.
  • Benefit: Reduces overfitting and improves generalization during training.
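A minimal sketch of a label-smoothed negative log-likelihood loss (the smoothing value eps=0.1 is illustrative):

    import torch
    import torch.nn.functional as F

    def label_smoothed_nll(logits, target, eps=0.1):
        # logits: (batch, vocab_size), target: (batch,) of class indices
        log_probs = F.log_softmax(logits, dim=-1)
        nll = -log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
        smooth = -log_probs.mean(dim=-1)     # uniform mass over all labels
        return ((1.0 - eps) * nll + eps * smooth).mean()

    logits = torch.randn(4, 8000)            # dummy scores for 4 target tokens
    target = torch.randint(0, 8000, (4,))
    print(label_smoothed_nll(logits, target))

Recent PyTorch releases also expose this directly as nn.CrossEntropyLoss(label_smoothing=0.1).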

3. Learning Rate Scheduling
  • Implementation:
    - The learning rate increases linearly for the first warmup_steps updates and then decays proportionally to the inverse square root of the step number.
  • Benefit: Stabilizes the early stages of training, which is especially important for transformer-based architectures.
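A sketch of this schedule, following the inverse-square-root formula from the Transformer paper ("Attention Is All You Need"); the d_model and warmup_steps values are illustrative:

    import torch

    def get_rate(step, d_model=512, warmup_steps=4000):
        # rate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5):
        # linear warmup for warmup_steps updates, then ~1/sqrt(step) decay.
        step = max(step, 1)
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

    model = torch.nn.Linear(512, 512)         # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=1.0)   # base lr of 1.0,
    scheduler = torch.optim.lr_scheduler.LambdaLR(             # so get_rate sets
        optimizer, lr_lambda=get_rate)                         # the actual rate
    # Call scheduler.step() after each optimizer.step() during training.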

--------------------------------------------------
TRAINING AND EVALUATION
--------------------------------------------------
• Benchmark Model:
  - A simple RNN sequence-to-sequence model is implemented.
  - Expected running time: Approximately 0.5 hours for data processing and 0.5 hours for model training.
  - Benchmark BLEU score: Approximately 15.

• Improvements:
  - Hyperparameters such as the number of epochs, encoder/decoder layers, and embedding dimensions have been tuned.
  - A learning rate scheduler is integrated to further stabilize and improve training performance.

• Evaluation Metric:
  - BLEU score is used to evaluate translation quality by measuring modified n-gram precision (n = 1 to 4), combined with a brevity penalty for translations shorter than their references.
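A sketch of computing corpus-level BLEU with the sacrebleu library (an assumption; the notebook may use a different BLEU implementation):

    import sacrebleu

    hypotheses = ["今天天氣很好"]       # model outputs, one string per sentence
    references = [["今天天氣很好"]]     # one reference stream covering all sentences
    bleu = sacrebleu.corpus_bleu(hypotheses, references, tokenize="zh")
    print(bleu.score)                   # corpus BLEU on a 0-100 scale

The tokenize="zh" option applies sacrebleu's built-in Chinese tokenizer before n-gram matching, since Chinese text is not whitespace-delimited.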

--------------------------------------------------
CODE STRUCTURE
--------------------------------------------------
• Notebook:
  - Week6_ap2938DL (1).ipynb contains the full implementation, including:
    • Data Preprocessing: Loading and tokenizing the English and Chinese datasets using sub-word units.
    • Model Definition: Implementing the sequence-to-sequence model (encoder and decoder) with label smoothing regularization.
    • Training Loop: Training the model with a learning rate scheduler and tuning hyperparameters.
    • Evaluation: Generating translations and computing the BLEU score.

--------------------------------------------------
CONCLUSION
--------------------------------------------------
This project demonstrates the application of sequence-to-sequence models for machine translation from English to Traditional Chinese. By incorporating sub-word tokenization, label smoothing, and learning rate scheduling, the model achieves improved performance and training stability. All code, experiments, and generated predictions are provided in this repository.
