Camellia (Xinyue) Rui

PhD Candidate in Biostatistics · University of Southern California

I love AI Maxxing, from deep generative models and variational inference to LLM-powered agents and RAG pipelines. I am passionate about using AI agents to accelerate scientific research workflows. I currently conduct research with Prof. Nicholas Mancuso and Prof. Steven Gazal, and was previously an ML intern at Genentech.

Skills

Core Competencies

Deep Generative Modeling · Representation Learning · Variational Autoencoder · LLM Agents · Machine Learning · Deep Learning · Transformer · Statistical Inference · NLP · Big Data Technologies · RAG

Languages

Python · R · SQL · Bash · LaTeX

Libraries & Frameworks

JAX · PyTorch · Scikit-learn · NumPy · Pandas · SciPy · Keras · TensorFlow · Spark · Dask · LangChain · Chroma · OpenAI SDK · Claude Code SDK · MCP

Experience

PhD Agent — AI-Powered Research Assistant

University of Southern California · Sep 2025 – Present

  • Built an LLM agent with Claude Code SDK, integrating GitLab MCP to summarize weekly commits and track project progress
  • Integrated Slack MCP and Zotero MCP to retrieve research papers, store references, and automatically generate structured meeting agendas contextualized by the prior week's work
  • Extended the agent with a Retrieval-Augmented Generation (RAG) system using LangChain, Chroma, and the OpenAI API to answer domain-specific questions from papers
  • Developed an AI-powered conference session recommender using RAG and vector similarity search to automatically filter and rank 500+ abstracts against personalized research interests, achieving 92.3% precision (F1: 0.96) validated through systematic human evaluation
  • Designed an end-to-end agent pipeline combining multi-platform MCP APIs, RAG workflows, and conversational interfaces for a research assistant that serves PhD researchers

ML Research Intern — AI for Biology

Genentech, Inc. · May 2025 – Aug 2025

  • Worked in a team of four to research and develop a Variational Autoencoder (VAE) in PyTorch for modeling gene regulatory networks
  • Engineered a model prototype from scratch in JAX and uncovered identifiability and misparameterization issues in the existing codebase
  • Reduced overall model loss from 6.7 × 10⁻² to 1 × 10⁻⁷ and improved inference accuracy by 19.7%
  • Implemented the knockout procedure within the VAE model to denoise real biological signals while keeping the false discovery rate (FDR) below 10%
  • Managed reproducible code in GitLab using a Merge Request-based Model Context Protocol (MCP) workflow, supporting open collaboration and clear communication across the team

Research Assistant — SCFM

Prof. Nicholas Mancuso & Prof. Steven Gazal · Mar 2024 – Present

  • Developed SCFM, a machine learning method that identifies gene-to-disease associations in the largest-scale single-cell RNA-seq data (4.1 GB) using coordinate ascent variational inference
  • Improved sensitivity by 32% on average and discovered 15% more genetic variants on average than the existing method in extensive simulation benchmarks
  • Built a new Python package implementing the SCFM framework in JAX, leveraging big data technologies and HPC clusters to run inference 15x faster on average than the existing method (1.3 s vs 20 s)
  • Demonstrated calibration and robustness to model misspecification across 4,000+ simulation scenarios, benchmarking the method against baseline and other published models
  • First-author abstract accepted at the American Society of Human Genetics (ASHG) conference, a top-tier venue, demonstrating strong communication and publication skills

Research Assistant — PerturbVI

Prof. Nicholas Mancuso · Mar 2024 – Present

  • Developed PerturbVI, a machine learning method that discovers gene regulatory networks from CRISPR perturbation data and single-cell RNA-seq data, using variational inference and JAX in a team of three
  • Simulated model misspecification of latent variables in Python and improved sensitivity by 6.5% over existing methods
  • Achieved 70x faster average convergence than the existing method on the largest-scale perturbation matrix (310,385 × 8,563)
  • Optimized core algorithms, cutting the computation time of the false signal rate by 4x and accelerating large-scale genetic analysis
  • Collaborated with team members to enhance model initialization, decreasing compilation time from 3.5 minutes to 1 minute

Accomplishments

Keck School of Medicine/Graduate School Fellowship

Aug 2022

University of Southern California

Jennifer Battat Scholarship

Jun 2020

University of Southern California

Provost's Research Fellowship

Sep 2019

University of Southern California

Publications

scFM: an efficient statistical fine-mapping approach for eQTLs using large-scale single-cell data

Rui X, et al. (1st author) · ASHG 2024 Abstract, 2024

perturbVI: A Scalable Latent Factor Model to Infer Genetic Regulatory Modules through CRISPR Perturbation Data

2nd author · In preparation, 2025

Estimating heritability explained by local ancestry and evaluating stratification bias in admixture mapping from summary statistics

Contributing author · American Journal of Human Genetics, 2024

A global view of disparity in imputation resources for conducting genetic studies in diverse populations

Rui X, et al. (2nd author) · American Journal of Human Genetics, 2022