AI Project – Research Projects
👉 My AI learning path — On-processing: AIO Project
🧠 Research Projects
📰 1. News Article Topic Classification (NLP)
Dataset: arXiv abstracts (5 classes: astro-ph, cond-mat, cs, math, physics)
Goal: Build an extensible NLP pipeline for topic classification.
- Compared BoW / TF-IDF / Sentence Embeddings (+ LSA / Faiss)
- Evaluated across KNN, Decision Tree, Naive Bayes, Logistic Regression, SVM, Random Forest, AdaBoost, Gradient Boost, Stacking
- Applied advanced preprocessing (lemmatization, stopword removal), data augmentation & imbalance handling (back-translation, synonyms, SMOTE/ADASYN, class weights)
- Evaluated with Accuracy, macro-F1, ROC-AUC, Confusion Matrix
- Deployed a Streamlit app for EDA, training, and live prediction with word-level explanations (LR/NB)
- Modular codebase (OOP), uv for package management, ruff for linting
🏠 2. House Price Prediction (Advanced Regression Techniques)
Dataset: Kaggle - Advanced Regression Techniques.
Goal: Build a reproducible ML pipeline for regression on structured tabular data.
- Modular data pipeline with preprocessing (imputation, one-hot, scaling, polynomial features, feature selection)
- 10 models from regularized linear (Lasso/Ridge/ElasticNet) to tree-based ensembles (RF, GBM, XGBoost, LightGBM, CatBoost)
- Used k-fold cross-validation and Optuna for hyperparameter tuning
- Evaluated by RMSE / MAE / R² + training time
- Delivered a Streamlit app for EDA, experiment runner, live prediction
- Added SHAP & tree feature importance for model interpretability
❤️ 3. Heart Disease Risk Prediction (Classification)
Dataset: UCI Heart Cleveland · 303 samples · 11 features + 1 label
Goal: Predict cardiovascular risk with a robust, generalizable ML pipeline.
- Feature engineering: created age-normalized ratios (chol_per_age, trestbps_per_age)
- Feature selection: top 10 by Decision Tree importance
- Pipeline + GridSearchCV (cv = 3, scoring = ROC-AUC) across 9 models (LR, KNN, DT, SVM-RBF, RF, AdaBoost, GBM, LGBM, XGB)
- Performance: AUC up to 0.97 (LR), ~0.95 (RF/XGB), Accuracy ~0.87
- Recall-oriented configs to minimize false negatives
- Built Streamlit interface for patient-level input, ROC curve, Confusion Matrix, Feature Importance visualization
© 2025 – Nguyễn Tuấn Anh et al., AIO 2025 (MIX002 Teams)
