Duplicate Question Detector

Quora Duplicate Question Detection
This project uses an LSTM (Long Short-Term Memory) model to identify whether two questions from the Quora dataset are semantically similar (duplicates).
Overview
This project tackles the problem of semantic similarity between question pairs. By combining custom feature engineering with deep text preprocessing, the model learns to classify whether two questions convey the same meaning.
Links
- Live Demo: Try the App
- GitHub Repo: View on GitHub
Key Features & Workflow
This project combines structured feature engineering with deep text preprocessing to model the semantic similarity between two questions.
Feature Engineering (22 Features)
These features are designed to capture lexical, syntactic, and fuzzy similarities between question pairs:
- q1_len → Number of characters in Question 1
- q2_len → Number of characters in Question 2
- q1_num_words → Number of words in Question 1
- q2_num_words → Number of words in Question 2
- word_common → Count of common words in both questions
- word_total → Total unique words across both questions
- word_share → Ratio of shared words to total unique words
- cwc_min → Common word count divided by minimum of word counts
- cwc_max → Common word count divided by maximum of word counts
- csc_min → Ratio of common stopwords to minimum stopword count
- csc_max → Ratio of common stopwords to maximum stopword count
- ctc_min → Common token count to minimum token count
- ctc_max → Common token count to maximum token count
- last_word_eq → Whether the last words of both questions match
- first_word_eq → Whether the first words of both questions match
- abs_len_diff → Absolute difference in length between questions
- mean_len → Average length of both questions
- longest_substr_ratio → Ratio of longest common substring length to minimum question length
- fuzz_ratio → Fuzzy string match ratio
- fuzz_partial_ratio → Partial fuzzy match score
- token_sort_ratio → Fuzzy match after sorting tokens
- token_set_ratio → Fuzzy match using token sets
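As a hedged illustration, here is a minimal Python sketch of a few of these features. It assumes the `rapidfuzz` library for the fuzzy scores (the `fuzzywuzzy` API is nearly identical), and the exact feature definitions in the project may differ slightly:

```python
from rapidfuzz import fuzz

def basic_features(q1: str, q2: str) -> dict:
    # Word sets for the overlap-based features
    w1, w2 = set(q1.lower().split()), set(q2.lower().split())
    common = len(w1 & w2)
    total = len(w1) + len(w2)
    return {
        "q1_len": len(q1),                       # characters in Question 1
        "q2_len": len(q2),                       # characters in Question 2
        "word_common": common,                   # shared unique words
        "word_share": common / total if total else 0.0,
        "abs_len_diff": abs(len(w1) - len(w2)),  # difference in word counts (assumed)
        "fuzz_ratio": fuzz.ratio(q1, q2),
        "token_sort_ratio": fuzz.token_sort_ratio(q1, q2),
        "token_set_ratio": fuzz.token_set_ratio(q1, q2),
    }

print(basic_features("How do I learn Python?", "What is the best way to learn Python?"))
```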
Text Preprocessing Workflow
- Lowercasing & Cleaning
  - Removed punctuation, special characters, and HTML tags.
- Tokenization
  - Used the Keras tokenizer to convert text to integer sequences.
- Padding
  - Ensured uniform input length using pad_sequences.
- Vocabulary Size
  - Limited the vocabulary to reduce complexity (e.g., top 45,000 words).
- Embedding (optional)
  - Used pre-trained embeddings or trained embeddings from scratch.
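A minimal sketch of this tokenize-and-pad pipeline in Keras is shown below. The 45,000-word vocabulary cap comes from the description above; MAX_LEN and the stand-in corpus are assumptions for illustration:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

VOCAB_SIZE = 45_000  # vocabulary cap from the description above
MAX_LEN = 30         # assumed maximum sequence length

# Tiny stand-in corpus for illustration
train_q1 = ["how do i learn python", "what is machine learning"]
train_q2 = ["best way to learn python", "explain machine learning"]

tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="<unk>")
tokenizer.fit_on_texts(train_q1 + train_q2)  # fit on all training questions

q1_seq = pad_sequences(tokenizer.texts_to_sequences(train_q1), maxlen=MAX_LEN)
q2_seq = pad_sequences(tokenizer.texts_to_sequences(train_q2), maxlen=MAX_LEN)
print(q1_seq.shape)  # (2, 30)
```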
Model Architecture
- Embedding Layer: Converts token sequences to dense vectors.
- LSTM Layers: Capture sequential dependencies in each question.
- Concatenation: Combines the learned representations of both questions.
- Dense Layers: Produce the final binary classification.
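A hedged sketch of this architecture using the Keras functional API is shown below. The layer sizes, the shared encoder, and the wiring of the 22 engineered features are assumptions based on the description, not the project's exact configuration:

```python
from tensorflow.keras import Input, Model, layers

MAX_LEN, VOCAB_SIZE, EMB_DIM = 30, 45_000, 128  # assumed sizes

q1_in = Input(shape=(MAX_LEN,), name="q1")
q2_in = Input(shape=(MAX_LEN,), name="q2")
feat_in = Input(shape=(22,), name="engineered_features")

embed = layers.Embedding(VOCAB_SIZE, EMB_DIM)  # shared embedding layer
encoder = layers.LSTM(64)                      # shared LSTM encoder

q1_vec = encoder(embed(q1_in))
q2_vec = encoder(embed(q2_in))

# Concatenate both question encodings with the engineered features
merged = layers.concatenate([q1_vec, q2_vec, feat_in])
x = layers.Dense(64, activation="relu")(merged)
out = layers.Dense(1, activation="sigmoid")(x)  # duplicate vs. not duplicate

model = Model(inputs=[q1_in, q2_in, feat_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```

Sharing one embedding and one LSTM between the two questions keeps the encoder symmetric, so swapping Question 1 and Question 2 yields comparable representations.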
Final Workflow
- Input Preparation
  - Text sequences and engineered features are combined into a single input vector of shape (1, 518).
- Model Selection
  - Applied an LSTM to capture the semantic similarity between question pairs more effectively and boost performance.
- Evaluation
  - Measured accuracy, F1-score, precision, and recall on the validation set.
  - Used a confusion matrix and ROC curve for analysis.
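For reference, the sketch below reproduces this evaluation step with scikit-learn. The labels and probabilities are stand-ins for illustration; in practice they would come from the validation set and the trained model:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Stand-in labels and model probabilities for illustration only
y_val = np.array([0, 1, 1, 0, 1, 0])
y_prob = np.array([0.2, 0.8, 0.6, 0.4, 0.9, 0.7])
y_pred = (y_prob >= 0.5).astype(int)  # threshold probabilities at 0.5

print("accuracy :", accuracy_score(y_val, y_pred))
print("precision:", precision_score(y_val, y_pred))
print("recall   :", recall_score(y_val, y_pred))
print("f1       :", f1_score(y_val, y_pred))
print("roc auc  :", roc_auc_score(y_val, y_prob))
print(confusion_matrix(y_val, y_pred))
```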
Tech Stack
- Python
- TensorFlow / Keras
- Pandas, NumPy
- NLTK / spaCy (for preprocessing)
- Jupyter Notebook
Deployment
This project is deployed on Hugging Face Spaces using Gradio as a frontend.
Live Demo: Click Here to Try the App
- Enter two questions.
- The model will tell you if they're likely duplicates.
- Built with a simple and clean Gradio interface for fast user interaction.
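A minimal sketch of such a Gradio app is shown below. Here `predict_duplicate` is a hypothetical wrapper around the trained model's preprocessing and inference, not the actual app code:

```python
import gradio as gr

def predict_duplicate(question1: str, question2: str) -> str:
    # Hypothetical placeholder: the real app would preprocess both
    # questions, build the feature vector, and run the LSTM model here.
    prob = 0.5
    return "Duplicate" if prob >= 0.5 else "Not duplicate"

demo = gr.Interface(
    fn=predict_duplicate,
    inputs=[gr.Textbox(label="Question 1"), gr.Textbox(label="Question 2")],
    outputs=gr.Textbox(label="Prediction"),
    title="Quora Duplicate Question Detector",
)

if __name__ == "__main__":
    demo.launch()
```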
Feel free to fork or star this repo if you found it helpful!