
# Transformer-based Speaker Classification

This repository contains the code for a transformer-based approach to speaker classification. The task is to predict the speaker identity (among 600 speakers) from short audio segments, represented as sequences of extracted audio features. By leveraging the self-attention mechanism introduced in the paper "Attention Is All You Need", the model captures global context across each utterance efficiently.

## Overview

- **Task:** Multiclass classification (speaker identification)
- **Input:** Audio features for each utterance (sequence data)
- **Output:** Predicted speaker label (integer from 0 to 599)
- **Model:** Transformer-based network utilizing self-attention

*Model diagram (image not available).*

## Data

- **Training Set:** 62,464 utterances (each with a label indicating one of the 600 speakers)
- **Test Set:** 6,944 utterances (no labels provided)
- Each utterance is a sequence of preprocessed audio features. The model reads these features and outputs a predicted speaker ID.

## Transformer Model

### Self-Attention
- Allows the model to attend to different positions within the input sequence to learn context more effectively than a simple RNN or CNN.
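
For intuition, here is a minimal single-head PyTorch sketch of scaled dot-product self-attention. It is illustrative only: the model in the notebook would use learned query/key/value projections and multiple heads (e.g., via `torch.nn.MultiheadAttention`).

```python
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product self-attention over a sequence.

    x: (batch, seq_len, d_model). For clarity this sketch uses the raw
    input as queries, keys, and values; a real layer would apply
    learned Q/K/V projections and multiple heads.
    """
    d_model = x.size(-1)
    scores = x @ x.transpose(-2, -1) / d_model ** 0.5  # (batch, seq, seq)
    weights = F.softmax(scores, dim=-1)                # each row sums to 1
    return weights @ x                                 # frames mix in global context
```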

### Positional Encoding
- Injects sequence order information into the model, since self-attention alone does not inherently encode positional details.
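
The fixed sinusoidal encoding from the original paper is a common choice; the sketch below (assuming an even `d_model`) shows how it is built. The notebook's exact encoding may differ.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal encodings from "Attention Is All You Need".

    Returns a (seq_len, d_model) tensor to be added to the input
    features; assumes d_model is even.
    """
    position = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)  # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float)
        * (-math.log(10000.0) / d_model)
    )                                                  # (d_model / 2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)       # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)       # odd dimensions
    return pe
```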

### Feedforward Layers
- Applied after the self-attention block, adding non-linearity and increasing model capacity.
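
A minimal position-wise feedforward block might look like the following; the dimensions and dropout rate are placeholders, not values from the notebook.

```python
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """Two linear layers with a ReLU between them, applied to each
    time step independently. All sizes here are illustrative."""
    def __init__(self, d_model: int = 80, d_ff: int = 256, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # expand
            nn.ReLU(),                  # non-linearity
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),   # project back
        )

    def forward(self, x):  # x: (batch, seq_len, d_model)
        return self.net(x)
```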

### Classification Head
- A final linear layer to map the pooled output or final hidden representation to a speaker class (0–599).
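
Putting the pieces together, a compact end-to-end sketch using PyTorch's built-in encoder could look like this. Every hyperparameter value below is illustrative, and mean pooling is just one common way to collapse the sequence into a single vector.

```python
import torch.nn as nn

class SpeakerClassifier(nn.Module):
    """End-to-end sketch: transformer encoder, mean pooling over time,
    and a linear head over the 600 speaker classes. Hyperparameter
    values are placeholders, not the notebook's actual settings."""
    def __init__(self, d_model: int = 80, n_heads: int = 2,
                 n_layers: int = 2, n_speakers: int = 600):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=256, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_speakers)

    def forward(self, x):              # x: (batch, seq_len, d_model)
        h = self.encoder(x)            # contextualized frame features
        pooled = h.mean(dim=1)         # average pooling over time
        return self.head(pooled)       # (batch, n_speakers) logits
```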

## Training Procedure

### Data Segmentation
- Each utterance is treated as a sequence of audio frames/features.
- The model processes the sequence in parallel using self-attention.

*Data processing diagram (image not available).*
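
A hypothetical `Dataset` wrapper for this segmentation step is sketched below; the in-memory layout and the segment length of 128 frames are assumptions, not values taken from the notebook.

```python
import random
import torch
from torch.utils.data import Dataset

class UtteranceDataset(Dataset):
    """Hypothetical wrapper: `features` is a list of (frames, feat_dim)
    tensors and `labels` a list of integer speaker IDs. Long utterances
    are randomly cropped to a fixed segment length."""
    def __init__(self, features, labels, segment_len: int = 128):
        self.features = features
        self.labels = labels
        self.segment_len = segment_len

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        feat = self.features[idx]
        if feat.size(0) > self.segment_len:  # random crop of long utterances
            start = random.randint(0, feat.size(0) - self.segment_len)
            feat = feat[start:start + self.segment_len]
        return feat, torch.tensor(self.labels[idx])
```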

### Loss & Optimization
- A cross-entropy loss is used for speaker classification.
- Adam is used as the optimizer, a typical choice that balances training speed and stable convergence.
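
A minimal sketch of this loop, with placeholder learning rate, epoch count, and device:

```python
import torch
import torch.nn as nn

def train(model, loader, epochs: int = 10, lr: float = 1e-3, device: str = "cpu"):
    """Sketch of the training loop described above; all values shown
    here are placeholders, not the notebook's settings."""
    criterion = nn.CrossEntropyLoss()                        # speaker-ID loss
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # Adam optimizer
    model.to(device).train()
    for _ in range(epochs):
        for feats, labels in loader:
            feats, labels = feats.to(device), labels.to(device)
            loss = criterion(model(feats), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```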

### Hyperparameter Tuning
- The number of transformer layers, the hidden dimension, and the number of attention heads are tuned for better performance.
- The number of training epochs is chosen so that the model converges without overfitting.
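
As a sketch, tuning could iterate over a small grid of these knobs, reusing the illustrative `SpeakerClassifier` above; the values actually explored in the notebook are not recorded here.

```python
from itertools import product

# Illustrative grid over the knobs listed above; every value below is
# a placeholder, not a setting taken from the notebook.
search_space = {
    "n_layers": [1, 2, 4],       # encoder depth
    "d_model": [80, 128, 256],   # hidden dimension (divisible by n_heads)
    "n_heads": [1, 2, 4],        # attention heads per layer
}
for n_layers, d_model, n_heads in product(*search_space.values()):
    model = SpeakerClassifier(d_model=d_model, n_heads=n_heads, n_layers=n_layers)
    # Train on the training split, evaluate on a held-out validation
    # split, and keep the configuration with the best accuracy.
```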

## Evaluation

- **Metric:** Accuracy on the test set (i.e., the proportion of correctly predicted speaker labels).
- **Output Format:** a CSV file with two columns:
  - **Id:** uttr0, uttr1, uttr2, ...
  - **Category:** predicted speaker label (0, 1, 2, ...)
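
A sketch of how such a file could be produced; the function name, output filename, and loader are assumptions, and the test utterances are assumed to arrive unlabeled.

```python
import csv
import torch

@torch.no_grad()
def write_submission(model, test_loader, path: str = "submission.csv",
                     device: str = "cpu"):
    """Export predictions in the Id/Category format described above."""
    model.to(device).eval()
    preds = []
    for feats in test_loader:
        feats = feats.to(device)
        preds.extend(model(feats).argmax(dim=-1).tolist())
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Category"])
        for i, p in enumerate(preds):
            writer.writerow([f"uttr{i}", p])
```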

## Code Structure

### Notebook
- `Week5_HW_ap2938 (1).ipynb` – Contains:
  - Data loading and preprocessing.
  - Definition of the transformer-based model with self-attention.
  - Training loop, including the cross-entropy loss function and optimizer setup.
  - Evaluation loop for measuring classification accuracy.
  - Generation of the final CSV submission file.

## Getting Started

### Clone the Repository & Install Dependencies
- Ensure you have a Python environment with the necessary libraries (e.g., PyTorch, NumPy).

### Download the Data
- Obtain the training set (62,464 utterances) and test set (6,944 utterances).
- Place them in the appropriate directory so the notebook can load them.

### Run the Notebook
- Execute `Week5_HW_ap2938 (1).ipynb` to train the transformer model and produce predictions.

