Duplicate Question Detector
Quora Duplicate Question Detection
This project uses an LSTM (Long Short-Term Memory) model to identify whether two questions from the Quora dataset are semantically similar (duplicates).
Overview
This project tackles the problem of semantic similarity between question pairs. By combining custom feature engineering with deep text preprocessing, the model learns to classify whether two questions convey the same meaning.
Links
- Live Demo: Try the App
- GitHub Repo: View on GitHub
Key Features & Workflow
This project combines structured feature engineering with deep text preprocessing to model the semantic similarity between two questions.
Feature Engineering (22 Features)
These features are designed to capture lexical, syntactic, and fuzzy similarities between question pairs (a sketch of how a few of them are computed follows the list):
- q1_len → Number of characters in Question 1
- q2_len → Number of characters in Question 2
- q1_num_words → Number of words in Question 1
- q2_num_words → Number of words in Question 2
- word_common → Count of common words in both questions
- word_total → Total unique words across both questions
- word_share → Ratio of shared words to total unique words
- cwc_min → Common word count divided by the minimum of the two word counts
- cwc_max → Common word count divided by the maximum of the two word counts
- csc_min → Ratio of common stopwords to the minimum stopword count
- csc_max → Ratio of common stopwords to the maximum stopword count
- ctc_min → Ratio of common tokens to the minimum token count
- ctc_max → Ratio of common tokens to the maximum token count
- last_word_eq → Whether the last words of both questions match
- first_word_eq → Whether the first words of both questions match
- abs_len_diff → Absolute difference in length between the questions
- mean_len → Average length of the two questions
- longest_substr_ratio → Ratio of the longest common substring length to the minimum question length
- fuzz_ratio → Fuzzy string match ratio
- fuzz_partial_ratio → Partial fuzzy match score
- token_sort_ratio → Fuzzy match after sorting tokens
- token_set_ratio → Fuzzy match using token sets
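For illustration, here is a minimal sketch of how a few of these features might be computed. The fuzzy scores use rapidfuzz's fuzz module (fuzzywuzzy exposes the same functions); the repo's exact implementation may differ.

```python
# Illustrative sketch of a few of the 22 features (not the repo's exact code).
from rapidfuzz import fuzz

def basic_features(q1: str, q2: str) -> dict:
    # Token lists and sets for the overlap-based features.
    t1, t2 = q1.lower().split(), q2.lower().split()
    common = set(t1) & set(t2)
    total = set(t1) | set(t2)
    return {
        "q1_len": len(q1),                 # characters in Question 1
        "q2_len": len(q2),                 # characters in Question 2
        "q1_num_words": len(t1),
        "q2_num_words": len(t2),
        "word_common": len(common),
        "word_total": len(total),
        "word_share": len(common) / max(len(total), 1),
        "abs_len_diff": abs(len(t1) - len(t2)),
        "mean_len": (len(t1) + len(t2)) / 2,
        "fuzz_ratio": fuzz.ratio(q1, q2),
        "fuzz_partial_ratio": fuzz.partial_ratio(q1, q2),
        "token_sort_ratio": fuzz.token_sort_ratio(q1, q2),
        "token_set_ratio": fuzz.token_set_ratio(q1, q2),
    }

print(basic_features("How do I learn Python?", "What is the best way to learn Python?"))
```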
Text Preprocessing Workflow
- Lowercasing & Cleaning: Removed punctuation, special characters, and HTML tags.
- Tokenization: Used the Keras tokenizer to convert text to sequences.
- Padding: Ensured uniform input length using pad_sequences (see the sketch after this list).
- Vocabulary Size: Limited the vocabulary to reduce complexity (e.g., the top 45,000 words).
- Embedding (optional): Used pre-trained embeddings or trained embeddings from scratch.
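A minimal sketch of the tokenization and padding steps with the Keras utilities named above. The 45,000-word cap comes from the list; MAX_LEN = 50 is an assumed sequence length, not the repo's value.

```python
# Tokenize question text and pad to a fixed length.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE = 45_000   # top-N words kept, as described above
MAX_LEN = 50          # assumed padded sequence length

questions = ["How do I learn Python?", "What is the best way to learn Python?"]

tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="<unk>")
tokenizer.fit_on_texts(questions)

sequences = tokenizer.texts_to_sequences(questions)
padded = pad_sequences(sequences, maxlen=MAX_LEN, padding="post")
print(padded.shape)  # (2, 50)
```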
Model Architecture
- Embedding Layer: Converts tokens into dense vectors.
- LSTM Layers: Capture sequential dependencies in both questions.
- Concatenation: Combines the representations of both questions.
- Dense Layers: Produce the final binary classification (a sketch follows this list).
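A minimal Keras sketch of this architecture. The shared encoder for the two questions, the layer sizes (100-d embeddings, 64 LSTM units), and the point where the 22 engineered features are concatenated in are assumptions, not the trained model's exact configuration.

```python
# Two-input LSTM model plus engineered features, ending in a sigmoid output.
from tensorflow.keras import layers, Model

VOCAB_SIZE, MAX_LEN, EMB_DIM, N_FEATS = 45_000, 50, 100, 22

q1_in = layers.Input(shape=(MAX_LEN,), name="q1")
q2_in = layers.Input(shape=(MAX_LEN,), name="q2")
feats_in = layers.Input(shape=(N_FEATS,), name="engineered_features")

embed = layers.Embedding(VOCAB_SIZE, EMB_DIM)  # text -> dense vectors
lstm = layers.LSTM(64)                         # sequential dependencies

q1_vec = lstm(embed(q1_in))
q2_vec = lstm(embed(q2_in))

# Concatenate both question encodings with the engineered features.
merged = layers.Concatenate()([q1_vec, q2_vec, feats_in])
x = layers.Dense(64, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(x)  # duplicate vs. not duplicate

model = Model(inputs=[q1_in, q2_in, feats_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```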
Final Workflow
- Input Preparation: Text sequences and the engineered features are combined into a single input vector of shape (1, 518).
- Model Selection: Applied an LSTM to capture semantic similarity between question pairs more effectively and boost performance.
- Evaluation: Measured accuracy, F1-score, precision, and recall on the validation set, and used a confusion matrix and ROC curve for analysis (a metrics sketch follows this list).
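As a sketch, the metrics above can be computed with scikit-learn. The arrays here are placeholders standing in for the validation labels and the model's predicted scores.

```python
# Compute the evaluation metrics named above on placeholder data.
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, roc_auc_score)

y_val = np.array([0, 1, 1, 0, 1])             # placeholder validation labels
y_prob = np.array([0.2, 0.8, 0.6, 0.3, 0.9])  # placeholder model scores
y_pred = (y_prob >= 0.5).astype(int)          # threshold at 0.5

print("accuracy :", accuracy_score(y_val, y_pred))
print("precision:", precision_score(y_val, y_pred))
print("recall   :", recall_score(y_val, y_pred))
print("f1       :", f1_score(y_val, y_pred))
print("confusion matrix:\n", confusion_matrix(y_val, y_pred))
print("roc auc  :", roc_auc_score(y_val, y_prob))  # feeds the ROC curve
```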
Tech Stack
- Python
- TensorFlow / Keras
- Pandas, NumPy
- NLTK / spaCy (for preprocessing)
- Jupyter Notebook
Deployment
This project is deployed on Hugging Face Spaces using Gradio as the frontend; a minimal sketch of the interface follows the list below.
Live Demo: Click Here to Try the App
- Enter two questions.
- The model will tell you if they're likely duplicates.
- Built with a simple and clean Gradio interface for fast user interaction.
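A minimal Gradio front end of the kind described above. predict_duplicate is a hypothetical wrapper; the real app would preprocess both questions, build the input vector, and call the LSTM model.

```python
# Two-textbox Gradio interface for duplicate-question prediction.
import gradio as gr

def predict_duplicate(question1: str, question2: str) -> str:
    score = 0.5  # placeholder for the model's predicted probability
    return "Likely duplicates" if score >= 0.5 else "Not duplicates"

demo = gr.Interface(
    fn=predict_duplicate,
    inputs=[gr.Textbox(label="Question 1"), gr.Textbox(label="Question 2")],
    outputs=gr.Textbox(label="Prediction"),
    title="Quora Duplicate Question Detector",
)

demo.launch()
```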
Feel free to fork or star this repo if you found it helpful!