Thanh M. Brown

Data Scientist with an M.S. in Operations Research and 4+ years working with complex, real-world data. I combine deep analytical experience with hands-on ML — building predictive models, scalable pipelines, and data products across health, public, and economic domains.

Capabilities

Technical Skills

Full-stack data science from raw data to deployed model.

🐍 Languages: Python, R, SQL
⚙️ Data Science: Machine Learning, Statistical Modeling, Feature Engineering, Hypothesis Testing
Big Data & Distributed Computing: PySpark, HPC (SLURM), Parallel Processing
🐳 MLOps & Deployment: Docker, Containerized Workflows, Reproducible Pipelines
📈 Data Visualization & Applications: R Shiny, Plotly, Tableau
🛠️ Tools & Environment: Git, Jupyter Notebook, VS Code

Work

Portfolio Projects

Health data, clinical ML, large-scale EDA, and bioinformatics research.

02 — SQL Data Model · Snowflake · Python New

Neighborhood Economic Risk — U.S. Census at Scale

Built a 3-layer Snowflake SQL model that transforms raw ACS Census data (2019–2020, 220k+ neighborhoods) into income adversity insights. Unemployment in the lowest income tier is 4× higher than in the highest tier. The number of High-Risk neighborhoods grew 57% from 2019 to 2020, reflecting the economic impact of COVID-19. Includes a staged data pipeline, a data dictionary, and a binary classification model (Stage 2 in progress).

SQL, Snowflake, US Census · ACS, Python, scikit-learn, Dash
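As a rough illustration of what a layered SQL model can look like, the sketch below builds staging, intermediate, and summary layers. It uses Python's built-in sqlite3 in place of Snowflake so it runs anywhere. The table and view names (stg_acs, int_income_tier, mart_tier_summary), the income thresholds, and every data value are hypothetical and are not taken from the project.

```python
import sqlite3

# In-memory SQLite stands in for Snowflake; names and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Layer 1, staging: raw rows as loaded (made-up values).
CREATE TABLE stg_acs (
    geo_id TEXT, year INTEGER, median_income REAL, unemployment_rate REAL
);
INSERT INTO stg_acs VALUES
    ('n1', 2019,  25000, 0.10),
    ('n2', 2019,  30000, 0.07),
    ('n3', 2019,  60000, 0.05),
    ('n4', 2019, 120000, 0.02),
    ('n5', 2019, 110000, 0.03);

-- Layer 2, intermediate: derive an income tier (thresholds are assumptions).
CREATE VIEW int_income_tier AS
SELECT geo_id, year, unemployment_rate,
       CASE WHEN median_income < 40000  THEN 'low'
            WHEN median_income < 100000 THEN 'middle'
            ELSE 'high' END AS income_tier
FROM stg_acs;

-- Layer 3, summary: one row per year and income tier.
CREATE VIEW mart_tier_summary AS
SELECT year, income_tier,
       COUNT(*) AS neighborhoods,
       AVG(unemployment_rate) AS avg_unemployment
FROM int_income_tier
GROUP BY year, income_tier;
""")

# Read the summary layer into a dict keyed by tier.
summary = {
    tier: (n, avg)
    for tier, n, avg in conn.execute(
        "SELECT income_tier, neighborhoods, avg_unemployment "
        "FROM mart_tier_summary WHERE year = 2019"
    )
}
for tier in ("low", "middle", "high"):
    n, avg = summary[tier]
    print(f"{tier}: {n} neighborhoods, avg unemployment {avg:.3f}")
```

Keeping each layer as its own table or view means downstream consumers, such as a dashboard or a classifier, only need to read the summary layer.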

03 — Clinical ML · Python · Scikit-learn

Osteoporosis Risk Prediction with Ensemble Methods

Explored an osteoporosis case-control dataset through demographic-driven EDA and built classification models (Logistic Regression, Random Forest, SVM, Gradient Boosting). Top models achieved strong ROC performance, but consistently favored the negative class — highlighting the real-world challenges of identifying positive cases in imbalanced clinical data.

Classification, EDA, ROC/AUC, Class Imbalance, Scikit-learn
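The gap between strong ROC performance and weak detection of positive cases can be shown with a small, self-contained sketch. The labels and scores below are synthetic and made up for illustration; they are not the project's data, and the helper functions roc_auc and recall are written here from scratch rather than taken from scikit-learn.

```python
def roc_auc(y_true, scores):
    """AUC as the chance a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall(y_true, scores, threshold):
    """Share of true positives whose score reaches the decision threshold."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    return sum(1 for s in positives if s >= threshold) / len(positives)


# Synthetic, imbalanced example: 16 negatives and 4 positives.
y_true = [0] * 16 + [1] * 4
scores = [
    0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.20, 0.22,
    0.25, 0.28, 0.30, 0.32, 0.35, 0.38, 0.40, 0.45,  # negatives
    0.42, 0.44, 0.48, 0.55,                          # positives
]

auc = roc_auc(y_true, scores)
recall_default = recall(y_true, scores, threshold=0.50)
recall_lowered = recall(y_true, scores, threshold=0.41)

print(f"ROC AUC: {auc:.3f}")                     # 0.969
print(f"Recall at 0.50: {recall_default:.2f}")   # 0.25
print(f"Recall at 0.41: {recall_lowered:.2f}")   # 1.00
```

In this toy case the ranking is nearly perfect, yet the default 0.5 cutoff catches only one of four positives. ROC AUC measures ranking quality, not the decisions made at a particular threshold, which is why threshold tuning or class weighting often matters on imbalanced clinical data.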

04 — FDA Data · R · Interactive App

FDA Medical Device Harm Trends — RShiny Dashboard

Built an interactive visualization app on the 2016 MAUDE (Manufacturer and User Facility Device Experience) dataset, the FDA's passive surveillance database for medical device adverse events. Users can explore temporal harm trends across device categories and manufacturers. The project demonstrates stakeholder-facing data product design.

R Shiny, FDA · MAUDE, Time-series, Dashboard, R

05 — Bioinformatics Research · Docker Deployment

ISCVAM — Interactive Visual Analytics for Single-Cell Multiomics Research

An interactive visual analytics platform for single-cell multiome data. Integrates scRNA-seq and scATAC-seq data to study transcriptomic and epigenetic profiles simultaneously. Features flexible clustering to identify rare cell populations and supports cross-dataset comparison of up to three datasets for reproducibility. Accepted for presentation at AACR 2023.

Multiomics, Reproducible pipeline, HPC (SLURM), Docker Deployment, Research

Background

About Me

I'm a Data Scientist with an M.S. in Operations Research and 4+ years working with some of the messiest, most complex data out there: clinical records, genomic profiles, and large-scale census datasets. That background has made me unusually comfortable with ambiguity, such as when the data is sparse and domain-specific and nothing works out of the box.

My work spans the full data science stack — statistical modeling, ML pipelines, big data processing with PySpark, and building data products that non-technical stakeholders can actually use. I care about end-to-end ownership: from raw, untidy data to a deployed, reproducible result.

Outside of health data, I've worked on economic risk modeling with U.S. Census data and large-scale EDA on income and healthcare spending patterns. I'm now focused on applying this foundation to general data science problems — anywhere rigorous analysis and practical ML can drive real decisions.

Contact

Get in Touch