Click each tab below for a description of the topic and representative papers. See the "Publications" page for a complete list of papers.


Synthetic Data

Synthetic data are artificially generated data that preserve the key statistical patterns of real data while reducing reliance on sensitive or hard-to-access records. Our work studies both the foundations and applications of synthetic data, with an emphasis on how generated samples can support modern data analysis and machine learning when privacy concerns, limited sample size, or class imbalance make real-world data insufficient. We are particularly interested in understanding when synthetic augmentation improves prediction, when it may introduce bias, and how to use it responsibly in practice. This includes developing statistical theory for synthetic data generation, designing data-driven methods for tuning synthetic sample size, and exploring applications in domains where privacy, scarcity, and fairness are central concerns.
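
As a toy illustration of data-driven tuning of the synthetic sample size (a minimal numpy sketch, not our published methodology; the Gaussian generator, the nearest-centroid classifier, and the candidate sizes are all illustrative assumptions), one can augment a scarce minority class with synthetic draws and pick the synthetic sample size by held-out error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" data: two imbalanced Gaussian classes in two dimensions.
n_major, n_minor = 200, 20
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(n_major, 2)),
               rng.normal([2.5, 2.5], 1.0, size=(n_minor, 2))])
y = np.r_[np.zeros(n_major, int), np.ones(n_minor, int)]

def augment(X, y, m, cls=1):
    """Append m synthetic samples for class `cls`, drawn from a Gaussian fit to it."""
    Xc = X[y == cls]
    mu, S = Xc.mean(0), np.cov(Xc.T) + 1e-6 * np.eye(Xc.shape[1])
    Xs = rng.multivariate_normal(mu, S, size=m)
    return np.vstack([X, Xs]), np.r_[y, np.full(m, cls)]

def centroid_error(Xtr, ytr, Xte, yte):
    """Nearest-centroid error -- a stand-in for any downstream classifier."""
    C = np.stack([Xtr[ytr == k].mean(0) for k in (0, 1)])
    pred = np.argmin(((Xte[:, None, :] - C) ** 2).sum(-1), axis=1)
    return float((pred != yte).mean())

# Large held-out set for tuning the synthetic sample size m.
Xte = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(500, 2)),
                 rng.normal([2.5, 2.5], 1.0, size=(500, 2))])
yte = np.r_[np.zeros(500, int), np.ones(500, int)]

# Data-driven choice of m: pick the size with the lowest held-out error.
errors = {m: centroid_error(*augment(X, y, m), Xte, yte) for m in (0, 50, 200)}
best_m = min(errors, key=errors.get)
```

The same template applies with any generator and any downstream model; the point is that the synthetic sample size is itself a tuning parameter.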

Synthetic data illustration

Representative papers:


Health Informatics

Electronic health records (EHRs) play a critical role in modern healthcare, providing comprehensive digital records of patient histories, treatments, and outcomes. Our team leverages EHR data in several key research areas: (1) phenotyping, identifying and classifying patient subgroups with shared characteristics to better understand disease patterns; (2) timeline registration, aligning and integrating medical events across different timelines into a cohesive patient history; and (3) the development of synthetic EHR data. We collaborate with medical researchers on conditions such as sepsis and acute kidney disease, aiming to enhance our ability to simulate, analyze, and predict outcomes in these critical health areas.

Real vs synthetic EHR illustration
EHR logic diagram

Representative papers:


Generative Models

Generative models are a class of machine learning models that learn the underlying distribution of a dataset, allowing researchers to generate synthetic samples that resemble the original data. These models have broad applications in biomedical data analysis, where generating realistic data is crucial for tasks such as simulation, privacy preservation, and data augmentation.
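
A minimal sketch of the idea, assuming the simplest possible generative model (a single multivariate Gaussian fit by moment matching; real applications use far richer models): estimate the distribution from data, then sample new observations from the fitted model.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Real" data from an unknown distribution (here, a correlated Gaussian).
true_mean = np.array([1.0, -2.0])
true_cov = np.array([[1.0, 0.6],
                     [0.6, 2.0]])
X_real = rng.multivariate_normal(true_mean, true_cov, size=2000)

# Fit the model to the data, then sample synthetic observations from it.
mu_hat = X_real.mean(axis=0)
cov_hat = np.cov(X_real.T)
X_synth = rng.multivariate_normal(mu_hat, cov_hat, size=2000)

# The synthetic sample should reproduce the real data's first two moments.
mean_gap = float(np.abs(X_synth.mean(axis=0) - mu_hat).max())
cov_gap = float(np.abs(np.cov(X_synth.T) - cov_hat).max())
```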

Example of real and synthetic time series

Representative papers:


Tensor Data Analysis

Our group focuses on the analysis of high-dimensional tensors, which commonly arise in fields like neuroimaging, microbiology, bioinformatics, and materials science. Traditional statistical methods often fall short when applied to these complex data structures, leading to computational challenges and suboptimal results. We have developed statistically optimal, computationally efficient methods with strong theoretical guarantees for tensor problems, including completion, regression, SVD/PCA, and clustering. These methods have been successfully applied to microscopy imaging, neuroimaging, genomics data, and more.
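
For illustration, a truncated higher-order SVD, one of the basic tensor tools mentioned above, can be sketched in a few lines of numpy (an illustrative implementation, not our optimized methods; the tensor sizes and ranks are toy choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def unfold(T, mode):
    """Matricize a tensor along one mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T, ranks):
    """Truncated higher-order SVD: one factor matrix per mode plus a core tensor."""
    U = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
         for m, r in enumerate(ranks)]
    core = T
    for m, Um in enumerate(U):           # contract each mode with its factor
        core = np.moveaxis(np.tensordot(Um.T, np.moveaxis(core, m, 0), axes=1), 0, m)
    return core, U

# A noisy 20 x 25 x 30 tensor with multilinear rank (2, 2, 2).
r = 2
A, B, C = (rng.normal(size=(n, r)) for n in (20, 25, 30))
G = rng.normal(size=(r, r, r))
T = np.einsum('ia,jb,kc,abc->ijk', A, B, C, G)
T += 0.01 * rng.normal(size=T.shape)

core, U = hosvd(T, (r, r, r))
T_hat = np.einsum('ia,jb,kc,abc->ijk', U[0], U[1], U[2], core)
rel_err = float(np.linalg.norm(T_hat - T) / np.linalg.norm(T))
```

Because the signal has exact multilinear rank (2, 2, 2), the truncated reconstruction recovers it up to the noise level.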

Illustration of tensor data analysis

Representative papers:


Microbiome Data Analysis

The human microbiome is the full collection of microorganisms living in and on the human body. These microbes play a significant role in human metabolism and energy generation and are crucial to human health. Our group's research focuses on analyzing human microbiome data, with particular attention to the challenges posed by its compositional nature: sequencing yields only relative, not absolute, abundances.
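
A standard starting point for compositional data is the centered log-ratio (CLR) transform, sketched below in numpy (illustrative only; the pseudo-count of 0.5 and the toy count table are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def clr(counts, pseudo=0.5):
    """Centered log-ratio transform for compositional data.

    A pseudo-count handles the zeros common in microbiome count tables.
    Each transformed row sums to zero, and (for strictly positive data with
    pseudo=0) the result is invariant to rescaling a sample's total counts.
    """
    x = counts + pseudo
    logx = np.log(x / x.sum(axis=1, keepdims=True))  # log relative abundances
    return logx - logx.mean(axis=1, keepdims=True)   # center within each sample

# Toy taxon-count table: 5 samples x 8 taxa, sparse as in real microbiome data.
counts = rng.poisson(lam=[50, 20, 10, 5, 2, 1, 0.5, 0.2], size=(5, 8)).astype(float)
Z = clr(counts)
```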

Microbiome illustration

Representative papers:


High-dimensional Statistics

High-dimensional statistics focuses on statistical inference when the number of variables (dimensions) is comparable to or greater than the number of observations. Traditional low-dimensional methods often fail in such settings due to challenges like overfitting, multicollinearity, and computational complexity. Our group has worked on various problems in this field, including compressed sensing, sparse linear regression, low-rank matrix recovery, and their applications.
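
As a small worked example of sparse linear regression, iterative soft-thresholding (ISTA) for the lasso can be sketched as follows (an illustrative numpy implementation; the design, noise level, and penalty are toy assumptions): with far more variables than observations, the l1 penalty still recovers the few truly active coefficients.

```python
import numpy as np

rng = np.random.default_rng(4)

def ista(X, y, lam, n_iter=500):
    """Iterative soft-thresholding for the lasso: min_b 0.5||y - Xb||^2 + lam*||b||_1."""
    L = np.linalg.norm(X, 2) ** 2                  # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        z = b - X.T @ (X @ b - y) / L              # gradient step on the smooth part
        b = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft-threshold
    return b

# p >> n, but only s coefficients are nonzero.
n, p, s = 100, 400, 5
X = rng.normal(size=(n, p)) / np.sqrt(n)           # roughly unit-norm columns
beta = np.zeros(p)
beta[:s] = 5.0
y = X @ beta + 0.1 * rng.normal(size=n)

beta_hat = ista(X, y, lam=0.4)
support_recovered = set(np.flatnonzero(np.abs(beta_hat) > 0.5)) == set(range(s))
```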

Polytope representation illustration

Representative papers:


Non-convex/Riemannian Optimization & Statistics

Riemannian optimization is a framework for solving optimization problems on smooth manifolds, where traditional methods in Euclidean spaces are not directly applicable. By leveraging the geometric structure of the manifold, Riemannian optimization enables more accurate and efficient optimization on curved spaces. Our group has been utilizing and developing Riemannian optimization theory and methods to tackle complex, high-dimensional problems.
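
The basic template of Riemannian gradient descent (project the Euclidean gradient onto the tangent space, take a step, retract back to the manifold) can be illustrated on the unit sphere, where minimizing the Rayleigh quotient recovers the smallest eigenvector. This is a toy numpy sketch, not our research code; the matrix, step size, and iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(5)

def min_rayleigh_sphere(A, x0, step, n_iter=500):
    """Riemannian gradient descent for min x^T A x over the unit sphere."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(n_iter):
        egrad = 2.0 * A @ x                  # Euclidean gradient
        rgrad = egrad - (x @ egrad) * x      # project onto the tangent space at x
        x = x - step * rgrad                 # step in the tangent direction
        x = x / np.linalg.norm(x)            # retract back onto the sphere
    return x

# Symmetric matrix with known spectrum 1, 2, ..., 10.
Q, _ = np.linalg.qr(rng.normal(size=(10, 10)))
A = Q @ np.diag(np.arange(1.0, 11.0)) @ Q.T

x = min_rayleigh_sphere(A, rng.normal(size=10),
                        step=1.0 / (2.0 * np.linalg.norm(A, 2)))
val = float(x @ A @ x)   # should approach the smallest eigenvalue, 1.0
```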

Riemannian optimization illustration

Representative papers:


Markov (Decision) Processes

Our research focuses on model reduction of Markov processes, a crucial problem in high-dimensional state-transition systems and reinforcement learning. We develop methods for estimating and aggregating states in discrete-time Markov processes using empirical trajectories, with a focus on key properties such as representability, aggregatability, and lumpability. We also study the tensor structure of the transition kernel in continuous-state-action Markov decision processes, proposing a tensor-inspired unsupervised learning method to identify low-dimensional state and action representations.
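
A toy version of state aggregation from empirical trajectories (illustrative numpy, not our published estimators; the lumpable chain, trajectory length, and distance threshold are all assumptions) estimates the transition matrix from a single trajectory and groups states with similar transition rows:

```python
import numpy as np

rng = np.random.default_rng(6)

# A lumpable chain: 6 states in 2 blocks; transition rows are identical within a block.
P_block = np.array([[0.8, 0.2],
                    [0.3, 0.7]])              # meta-state transition matrix
membership = np.array([0, 0, 0, 1, 1, 1])     # state -> block
P = P_block[membership][:, membership] / 3.0  # split each block's mass over 3 states

def estimate_transition_matrix(traj, n_states):
    """Empirical transition matrix from a single observed trajectory."""
    C = np.zeros((n_states, n_states))
    for s, t in zip(traj[:-1], traj[1:]):
        C[s, t] += 1.0
    return C / np.maximum(C.sum(axis=1, keepdims=True), 1.0)

# Simulate one long trajectory and estimate P from it.
T_len = 50_000
traj = np.zeros(T_len, dtype=int)
for t in range(1, T_len):
    traj[t] = rng.choice(6, p=P[traj[t - 1]])
P_hat = estimate_transition_matrix(traj, 6)

# Aggregate: states with nearly identical estimated rows share a meta-state.
# (A crude pairwise grouping; spectral methods do this more robustly.)
row_dist = np.linalg.norm(P_hat[:, None, :] - P_hat[None, :, :], axis=2)
same_block = row_dist < 0.1
```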

State aggregation illustration 1 State aggregation illustration 2

   High-order Markov chain illustration 1 High-order Markov chain illustration 2

Representative papers:


Network Analysis

Network analysis studies the relationships and interactions among entities modeled as nodes and edges. It is widely applied in fields such as social network analysis and gene interaction studies. Our group's research spans various projects involving tensor networks and multilayer networks.


Representative papers:


Computational Complexity of Statistical Inference

Traditional statistical inference has focused on determining fundamental statistical limits and developing algorithms to achieve them. However, a key challenge arises when statistically optimal estimators are computationally infeasible, while efficient algorithms often fall short of these theoretical limits, requiring more data or higher signal strength. This disconnect suggests that the true benchmark in modern high-dimensional settings is the statistical limit achievable by computationally efficient algorithms. Our team has investigated several topics related to the computational complexity of statistical inference, particularly for problems arising from tensor and network data.

Statistical-computational tradeoff diagram

Representative papers:


Collaborative Research

Collaborative research is essential for advancing scientific knowledge across disciplines. I served in the BERD (Biostatistics, Epidemiology, and Research Design) Core at Duke Biostatistics & Bioinformatics from 2020 to 2023, collaborating on various projects with the Departments of Neurosurgery, Radiology, and Psychiatry & Behavioral Sciences at Duke School of Medicine. Additionally, I have worked on several projects involving scientific topics outside the School of Medicine.

Duke BERD Methods Core cover

Representative papers:


Our research is supported in part by the NSF CAREER Grant 2203741 (sole PI) and NIH Grants R01HL169347 (sole PI) and R01HL168940 (multi-PI).

NSF Logo NIH Logo NHLBI Logo

 
