Seminars and Events
Past Seminars
2024-09-24, Dr. Miyeon Yeon, UTHSC
Mediation analysis has been widely adopted to elucidate the role of intermediary variables
derived from neuroimaging data. Structural equation models (SEMs) are typically employed
to investigate the influences of exposures on outcomes, with model coefficients being
interpreted as causal effects. While existing SEMs are effective tools, limited research
has considered shape mediators. In addition, the linear assumption may lead to efficiency
losses and decreased predictive accuracy in real-world applications. To address these
challenges, we introduce a novel framework for shape mediation analysis, designed
to explore the causal relationships between genetic exposures and clinical outcomes,
whether mediated or unmediated by shape-related factors while accounting for potential
confounding variables. We propose a two-layer shape regression model to characterize
the relationships among neurocognitive outcomes, elastic shape mediators, genetic
exposures, and clinical confounders. Both simulation studies and real-data analyses
demonstrate the superior performance of our proposed method in terms of estimation
accuracy and robustness when compared to existing approaches.
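For readers less familiar with mediation SEMs, the classical linear two-equation form, given here only for orientation as a simplified scalar analogue (not the speaker's two-layer shape regression model), with exposure X, mediator M, outcome Y, and confounders W is:

    M = \alpha X + \gamma_1^T W + \epsilon_M
    Y = \beta M + \tau X + \gamma_2^T W + \epsilon_Y

Here \alpha\beta is the mediated (indirect) effect and \tau the direct effect; the framework described above replaces the scalar mediator M with an elastic shape mediator and relaxes the linearity assumption.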
Past Biostatistics Seminars
2024-08-26, Durbadal Ghosh, Ph.D. candidate in Biostatistics, Florida State University,
Tallahassee, BigRiverQTL: A Toolbox for Navigating Big Genetic Data Workflows
Genotype-phenotype associations provide crucial insights for disciplines such as biology,
agriculture, and medicine. This project introduces BigRiverQTL.jl, a comprehensive
Julia package designed to streamline quantitative trait locus (QTL) analysis. It features
three main components: preprocessing, genome scans, and visualization. The preprocessing
functions convert genomic data into an efficient format, calculate kinship matrices,
and prepare data for analysis. For genome scans, BigRiverQTL.jl utilizes BulkLMM.jl
for univariate scans and integrates the FlxQTL.jl module for multivariate and longitudinal
trait analyses. Additionally, BigRiverQTL includes visualization tools that facilitate
the examination of both QTL and eQTL scans. Overall, BigRiverQTL.jl represents a significant
computational advancement in high-throughput QTL analysis, making sophisticated statistical
tools accessible within the Julia programming environment.
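For orientation, single-trait genome scans of the kind performed by BulkLMM.jl are typically based on a linear mixed model of roughly the following form (a generic textbook sketch, not necessarily the package's exact specification):

    y = \mu + x_j \beta_j + g + \epsilon,   g ~ N(0, \sigma_g^2 K),   \epsilon ~ N(0, \sigma_e^2 I)

where x_j holds the genotypes at marker j and K is the kinship matrix computed in preprocessing; the scan compares this model against the null model with \beta_j = 0 at each marker, typically reporting a LOD score.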
2024-08-19, Shanthi Sree Edara, M.S., University of Southern Mississippi, Hattiesburg,
Impact of Gift Card Challenges on the Quality and Quantity of Research
Financial incentives, such as gift cards, are essential for increasing participant
recruitment in research involving human subjects. However, administrative hurdles, such
as Internal Revenue Service (IRS) regulations and university-specific challenges, often
complicate the process of securing and distributing these incentives. Despite their
benefits, these difficulties can impact both the quality and quantity of research.
Therefore, understanding and addressing these challenges is crucial for ensuring that
incentives effectively improve participant engagement and the success of research
studies. This study examines the perceived impact of challenges in obtaining gift
cards on both the quality and quantity of research, and how these effects vary depending
on researchers' professional experience levels. Methods: A Qualtrics online survey
was distributed to participants who were at least 18 years old and used gift cards
in research. Participants were recruited through institutional listservs, scientific
organization listservs, NIH-funded researcher email lists, and multiple sampling
strategies. Experience levels were grouped into trainees (undergraduate, graduate,
post-doctoral, resident), managers (project managers, research coordinators), and
faculty (junior faculty with less than 5 years of experience, established faculty with
more than 5 years of experience, and research scientists). Responses to the question
"How much do challenges in obtaining gift cards affect the quality of your research?"
were recoded into not affected ("not at all"), moderately affected ("a little bit" and
"moderately"), and highly affected ("quite a bit" and "very much"); responses to the
corresponding question about the quantity of research were recoded in the same way.
Data were
analyzed using IBM SPSS Statistics 29.0, with emphasis on descriptive statistics and
ordinal regression to compare the perceived impact of challenges in obtaining gift
cards on the quality and quantity of research between trainees, managers, and faculty,
with faculty as the reference group. Results: Among the 242 respondents, there were
117 established faculty, 46 junior faculty, 30 project managers, and 20 graduate students.
These respondents were grouped into three categories: 168 faculty, 34 managers, and
42 trainees. Researchers reported that both research quality (73.1%) and quantity
(68.3%) were moderately to highly affected by challenges in obtaining gift cards.
Ordinal regression analysis showed that trainees experienced a significantly greater
perceived impact on both the quality (parameter estimate = 1.10, p = 0.001) and quantity
(parameter estimate = 1.68, p < 0.001) of their research compared to faculty. In
contrast, managers did not report a significant difference in perceived impact on
research quality (parameter estimate = 0.16, p = 0.64) or quantity (parameter estimate
= 0.41, p = 0.24) when compared to faculty. Conclusion: The analysis reveals that
trainees perceive a significantly greater impact on both the quality and quantity
of their research due to challenges in obtaining gift cards, compared to faculty,
while managers' perceptions do not significantly differ from those of faculty. Future
research should explore strategies to streamline gift card procurement to mitigate
these impacts.
2024-08-12, Yinan Chen, M.S. Candidate in Statistics, University of Illinois Urbana-Champaign,
Exploring trends in US adolescent and young adult drug overdose mortality and creating
an interactive web application
Drug overdose has affected adolescents and young adults in the US drastically. This
study examined trends in drug overdose deaths related to illicitly manufactured fentanyls
(IMFs) among adolescents aged 10-24 from 1999 to 2020. Linear regression was employed
to analyze the changes in drug use across this demographic, while bootstrap resampling was used
to compare rates of increase. The results indicate that IMFs are implicated in the
majority of overdose deaths among youth. Specifically, individuals aged 20-24 exhibited
the most rapid increase in overdose rates and the highest incidence of fatalities.
Additionally, significant racial disparities were observed in the rates of drug overdose
deaths. An interactive web application has been developed to help researchers and
the public visualize these trends.
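As a rough illustration of the slope comparison described above (a generic sketch with simulated numbers, not the study's code or data; names such as rates_20_24 are hypothetical), bootstrapping can be used to compare rates of increase between two age groups:

    import numpy as np

    rng = np.random.default_rng(0)

    def slope(years, rates):
        # Ordinary least-squares slope of overdose rate on year.
        return np.polyfit(years, rates, 1)[0]

    def bootstrap_slope_diff(years, rates_a, rates_b, n_boot=5000):
        # Bootstrap the difference in slopes between two groups by
        # resampling years (with their paired rates) with replacement.
        n = len(years)
        diffs = np.empty(n_boot)
        for b in range(n_boot):
            idx = rng.integers(0, n, n)
            diffs[b] = slope(years[idx], rates_a[idx]) - slope(years[idx], rates_b[idx])
        return np.percentile(diffs, [2.5, 97.5])  # 95% bootstrap CI for the difference

    # Hypothetical yearly overdose mortality rates per 100,000 (illustrative only)
    years = np.arange(1999, 2021)
    rates_20_24 = 2.0 + 0.45 * (years - 1999) + rng.normal(0, 0.3, len(years))
    rates_15_19 = 1.0 + 0.20 * (years - 1999) + rng.normal(0, 0.3, len(years))
    print(bootstrap_slope_diff(years, rates_20_24, rates_15_19))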
2024-07-15, Dr. Chris Emfinger, Department of Biochemistry, University of Wisconsin-Madison,
Systems Genetics of Diet Responsivity
Diabetes, resulting from insufficient insulin secretion to match metabolic demand,
affects hundreds of millions of people. Diabetes risk is highly heritable and the
majority of single-nucleotide polymorphisms conferring diabetes risk are thought to
influence insulin production. Conversely, environmental factors such as obesity and
diet can elevate metabolic demand and drive resistance to insulin. Individuals display
a wide variation in diet responsivity. Consequently, despite many years of research
there is no consensus on a single diet most compatible with health and most useful
in preventing or reversing metabolic dysfunction in susceptible individuals. Our lab
focuses on understanding the genetic factors driving insulin production and diet responsivity.
In contrast to many labs studying metabolism using only a handful of highly inbred
mouse strains, we focus on mice with high genetic and phenotypic variation. This allows
us to perform genetic screens in well controlled experiments with high power, identifying
novel target genes regulating metabolism which we can then validate experimentally.
This talk will be an overview of our genetics pipeline and some of the techniques
we are currently applying to nominate and validate interesting targets. Bio: https://attielab.biochem.wisc.edu/staff/emfinger-chris/
2024-07-01, Dr. Mehmet Koçak, Istanbul Medipol University, Istanbul, Turkey, Modeling
GeoSpatial Data: Turkish Health Studies
Spatial autocorrelation is a fundamental concept in spatial statistics, describing
the degree to which a set of spatial data points are correlated with each other across
a geographical space. This presentation provides a comprehensive overview of spatial
autocorrelation, discussing its definitions, detection methods, and modeling techniques.
Firstly, we delve into the definition of spatial autocorrelation, highlighting its
significance in understanding geographical data patterns. We cover various aspects
such as self-correlation due to geographical ordering, information content in geo-referenced
data, and its role as a diagnostic tool for spatial model misspecification. Next,
we explore methods for detecting spatial autocorrelation, focusing on Moran’s I and
Geary’s C statistics, which provide formal tests for spatial dependency. The presentation
explains how to compute these statistics and interpret their results, supported by
visual examples and practical applications. We then shift to modeling spatial autocorrelation,
demonstrating the use of the SAS procedure PROC SPATIALREG. The SPATIALREG procedure enables
the application of various spatial models, including Spatial Auto-regressive (SAR),
Spatial Durbin Model (SDM), Spatial Error Model (SEM), etc. We also provide examples
of geospatial modeling using data sets from the Turkish Health Studies, national surveys
with more than 20,000 participants, considering spatial autocorrelation in both
the response and predictor variables. We discuss the generation of spatial weight
matrices and the fitting of spatial models to the data, highlighting key findings
and model performance metrics. https://sabita.medipol.edu.tr/index.php/portfolio-item/prof-dr-mehmet-kocak/
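For reference, with observations x_1, ..., x_n at n locations, spatial weights w_{ij}, and S_0 = \sum_i \sum_j w_{ij}, the two statistics discussed above take their standard forms:

    Moran's I:  I = (n / S_0) \sum_i \sum_j w_{ij} (x_i - \bar{x})(x_j - \bar{x}) / \sum_i (x_i - \bar{x})^2
    Geary's C:  C = ((n - 1) / (2 S_0)) \sum_i \sum_j w_{ij} (x_i - x_j)^2 / \sum_i (x_i - \bar{x})^2

Values of I above its expectation of -1/(n - 1), or values of C below 1, indicate positive spatial autocorrelation.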
2023-05-06, Dr. Lu You, Health Informatics Institute, University of South Florida,
Joint Modeling of Multivariate Nonparametric Longitudinal Data and Survival Data by
A Local Smoothing Approach
In many clinical studies, evaluating the association between longitudinal and survival
outcomes is of primary concern. For analyzing data from such studies, joint modeling
of longitudinal and survival data becomes an appealing approach. In some applications,
there are multiple longitudinal outcomes whose longitudinal pattern is difficult to
describe by a parametric form. For such applications, existing research on joint modeling
is limited. In this paper, we develop a novel joint modeling method to fill the gap.
In the new method, a local polynomial mixed-effects model is used for describing the
nonparametric longitudinal pattern of the multiple longitudinal outcomes. Two model
estimation procedures, i.e., the local EM algorithm and the local penalized quasi-likelihood
estimation, are explored. Practical guidelines for choosing tuning parameters and
for variable selection are provided. The new method is justified by some theoretical
arguments and numerical studies. Bio: https://usf.discovery.academicanalytics.com/scholar/599629/LU-YOU
2024-04-22, Dr. Lorin Crawford, Brown University and Microsoft Research New England,
Probabilistic methods to identify multi-scale enrichment in genomic sequencing studies
A consistent theme of the work done in my lab group is to take modern computational
approaches and develop theory that enable their interpretations to be related back
to classical genomic principles. The central aim of this talk is to address variable
selection questions in nonlinear and nonparametric regression. Motivated by statistical
genetics, where nonlinear interactions and non-additive variation are of particular
interest, we introduce a novel, interpretable, and computationally efficient way to
summarize the relative importance of predictor variables. Methodologically, we present
flexible and scalable classes of Bayesian models which provide interpretable probabilistic
summaries such as posterior inclusion probabilities and credible sets for association
mapping tasks in high-dimensional studies. We illustrate the benefits of our methods
over state-of-the-art linear approaches using extensive simulations. We also demonstrate
the ability of these methods to recover both novel and previously discovered genomic
associations using real human complex traits from the Wellcome Trust Case Control
Consortium (WTCCC), the Framingham Heart Study, and the UK Biobank. Bio: https://www.lorincrawford.com/
2024-03-28, Dr. Hailin Sang, University of Mississippi, Error analysis of generative
adversarial network
The generative adversarial network (GAN) is an important model developed for high-dimensional
distribution learning in recent years. However, there is a pressing need for a comprehensive
method to understand its error convergence rate. In this research, we focus on studying
the error convergence rate of the GAN model that is based on a class of functions
encompassing the discriminator and generator neural networks. These functions are
VC type with bounded envelope function under our assumptions, enabling the application
of the Talagrand inequality. By employing the Talagrand inequality and Borel-Cantelli
lemma, we establish a tight convergence rate for the error of GAN. This method can
also be applied on existing error estimations of GAN and yields improved convergence
rates. In particular, the error defined with the neural network distance is a special
case error in our definition. This talk is based on the project jointly with Mahmud
Hasan. Bio: https://math.olemiss.edu/hailin-sang/
2024-02-12, Dr. Florian Privé, Aarhus University, Denmark, Quality control of GWAS
summary statistics
Results from genome-wide association studies (GWAS summary statistics) have been extensively
used in different applications such as estimating the genetic architecture of complex
traits and diseases, identifying causal variants with fine-mapping, and predicting
complex traits with polygenic scores. One reason behind the popularity of GWAS summary
statistics is that they are widely available and shared, e.g. in the GWAS Catalog.
However, these GWAS summary statistics come with varying degrees of quality, and from
many different tools and studies. In this presentation, I go over several things that
can go wrong when using GWAS summary statistics and what we can do about them. Many
of these issues are still unknown to people who use GWAS summary statistics on a
regular basis and can cause the results they derive to be biased or suboptimal. I
present and discuss past and current work on how to perform some quality control and
ultimately improve the quality of GWAS summary statistics, in order to make best use
of them. Bio: https://privefl.github.io/
2024-01-22, Dr. Madeleine Udell, Stanford University, Big Data is Low Rank
Data scientists are often faced with the challenge of understanding a high dimensional
data set organized as a table. These tables may have columns of different (sometimes,
non-numeric) types, and often have many missing entries. This talk surveys methods
based on low rank models to analyze these big messy data sets. We show that low rank
models perform well — indeed, suspiciously well — across a wide range of data science
applications, including in social science, medicine, and machine learning. This good
performance demands (and this talk provides) a simple mathematical explanation for
their effectiveness, which identifies when low rank models perform well and when to
look beyond low rank. Bio: https://web.stanford.edu/~udell/bio.html
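As a minimal illustration of the low-rank idea (a generic sketch under my own assumptions, not the speaker's methods or software; all names are hypothetical), missing entries in a numeric table can be filled in by alternating between imputation and a truncated SVD:

    import numpy as np

    def low_rank_impute(X, rank=2, n_iter=50):
        # Fill NaN entries of a numeric matrix with a rank-`rank` SVD approximation,
        # iterating impute -> truncated SVD -> re-impute.
        X = np.asarray(X, dtype=float)
        mask = np.isnan(X)
        filled = np.where(mask, np.nanmean(X, axis=0), X)  # start from column means
        for _ in range(n_iter):
            U, s, Vt = np.linalg.svd(filled, full_matrices=False)
            approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # best rank-`rank` approximation
            filled = np.where(mask, approx, X)              # only overwrite missing cells
        return filled

    # Tiny synthetic example: a rank-1 table with a few entries missing
    rng = np.random.default_rng(1)
    truth = np.outer(rng.normal(size=8), rng.normal(size=5))
    obs = truth.copy()
    obs[rng.random(truth.shape) < 0.2] = np.nan
    print(np.nanmax(np.abs(low_rank_impute(obs, rank=1) - truth)))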
2023-11-13, Dr. Xichen Mou, University of Memphis, Generalized kernel machine regression
Kernel Machine Regression (KMR) serves as a nonparametric regression approach fundamental
in numerous scientific domains. By utilizing a map determined by the kernel function,
KMR transforms original predictors into a higher-dimensional feature space, simplifying
the recognition of patterns between outcomes and independent variables. KMR is invaluable
in studies within the biomedical and environmental health sectors, where it aids in
identifying crucial exposure points and gauging their impact on results. In our study,
we introduce the Generalized Bayesian Kernel Machine Regression (GBKMR) which integrates
the KMR model within the Bayesian context. GBKMR not only complements the conventional
KMR but also suits a range of outcome data, from continuous to binary and count data.
Simulation studies confirm GBKMR's superior precision and robustness. We further employ
this method on a real data set to pinpoint specific cytosine phosphate guanine (CpG)
locations correlated with health-related outcomes or exposures. Bio: https://www.memphis.edu/publichealth/contact/faculty_profiles/mou.php
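In generic form (a standard formulation given here for orientation, not necessarily the exact GBKMR specification), kernel machine regression relates the outcome to covariates x_i and exposures of interest z_i (e.g., CpG methylation values) through a link function g:

    g(E[y_i]) = x_i^T \beta + h(z_i),   h \in H_K,   h(z) = \sum_j \alpha_j K(z, z_j)

where H_K is the reproducing kernel Hilbert space generated by the kernel K; the "generalized" extension allows continuous, binary, and count outcomes through the choice of g.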
2023-10-30, Kip Handwerker MSc, A Clustering Approach to Non-Equal Length Joint Pattern
Genetic and Epigenetics Factors Weighted by Covariates
University of Memphis Biostatistics PhD Student. Clustering analysis is a popular
approach to gaining insight into the structure of data, especially on a large scale.
Some of the most popular approaches are the K-means and K-prototype algorithms which
are partitioning methods that use distance measures to assign groups. While these
methods are good, especially for large datasets, when it comes to genetics data they
fail to consider potential joint effects and require the same dimensionality across
variables. The Vector in Partition (VIP) algorithm fills this gap with a distance
measure designed to partition genetic and epigenetic data with non-equal length dimensions;
specifically, gene expression (GE), DNA methylation (CPG), and single nucleotide polymorphisms
(SNP). The VIP extension method extends this framework by adding another layer of
complex joint effects of genetic and epigenetic data with other potential health-related
variables to dictate clustering. The extension algorithm performs well on simulated
data when the clustering of the covariates follows the same clustering scheme of the
genetics data. Like other distance measures, when the data does not follow a clear
clustering scheme the algorithm tends to underperform, especially against numeric
data. The results highlight many aspects of the algorithm’s performance capabilities,
as well as multiple areas for future improvements.
2023-05-22, Dr. Hua Zhou, Inferring Within-Subject Variances From Intensive Longitudinal
Data
University of California Los Angeles, The availability of vast amounts of longitudinal
data from electronic health records (EHR) and personal wearable devices opens the
door to numerous new research questions. In many studies, individual variability of
a longitudinal outcome is as important as the mean. Blood pressure fluctuations, glycemic
variations, and mood swings are prime examples where it is critical to identify factors
that affect the within-individual variability. We propose a scalable method, within-subject
variance estimator by robust regression (WiSER), for the estimation and inference
of the effects of both time-varying and time-invariant predictors on within-subject
variance. It is robust against the misspecification of the conditional distribution
of responses or the distribution of random effects. It shows performance similar to that
of correctly specified likelihood methods but is orders of magnitude faster. The estimation
algorithm scales linearly in the total number of observations, making it applicable
to massive longitudinal data sets. The effectiveness of WiSER is illustrated using
the accelerometry data from the Women's Health Study and a clinical trial for longitudinal
diabetes care.
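Schematically (a simplified sketch of the class of models targeted, not WiSER's exact estimating equations), the mean and the within-subject variance are modeled jointly as:

    y_{ij} = x_{ij}^T \beta + z_{ij}^T \gamma_i + \epsilon_{ij},   log Var(\epsilon_{ij}) = w_{ij}^T \tau + \ell_i

where \gamma_i and \ell_i are subject-level random effects and \tau quantifies how time-varying and time-invariant predictors w_{ij} shift the within-subject variability; as described above, WiSER estimates \tau by robust regression rather than a full likelihood.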
2023-04-24, Dr. Qian Li, Statistical Frameworks for Longitudinal Metagenomic and
Transcriptomic Data
St. Jude Children's Research Hospital, Longitudinal sampling has become popular in
the omics studies, such as microbiome and transcriptome. To identify operational taxonomy
units (OTUs) signaling disease onset, a powerful and cost-efficient strategy was selecting
participants by matched sets and profiling their temporal metagenomes, followed by
trajectory analysis. We proposed a joint model with matching and regularization (JMR)
to detect OTU-specific trajectories predictive of host disease status. The between-
and within-matched-set heterogeneity in OTU relative abundance and disease risk was
linked by nested random effects. The inherent negative correlation in microbiota composition
was adjusted for by incorporating and regularizing the top-correlated taxa as a longitudinal
covariate, pre-selected by Bray-Curtis distance and elastic net regression. For the
longitudinal bulk transcriptomes, we propose a statistical framework ISLET to infer
individual-specific and cell-type-specific transcriptome reference panels. ISLET models
the repeatedly measured bulk gene expression data to optimize the usage of shared
information within each subject. ISLET is the first available method to achieve individual-specific
reference estimation in repeated samples. In the simulation study and an application
to a large-scale metagenomic study, JMR outperformed the competing methods and identified
important taxa in infants' fecal samples with dynamics preceding host autoimmune status.
We also show outstanding performance of ISLET in the reference estimation and downstream
cell-type-specific differentially expressed genes testing in simulation. An application
of ISLET to the longitudinal PBMC transcriptomes in the same study confirms the cell-type-specific
gene signatures for early-life autoimmunity.
2023-03-27, Dr. Patrick Breheny, False Discovery Rates for Penalized Regression Models
University of Iowa, Penalized regression is an attractive methodology for dealing
with high-dimensional data where classical likelihood approaches to modeling break
down. However, its widespread adoption has been hindered by a lack of inferential
tools. In particular, penalized regression is very useful for variable selection,
but how confident should one be about those selections? How many of those selections
would likely have occurred by chance alone? In this talk, I will review recent developments
in this area, with an emphasis on my work and that of my recent graduate students.
2023-02-20, Dr. Cécile Ané, Methods for Phylogenetic Networks
University of Wisconsin-Madison, Phylogenetic networks and admixture graphs can represent
the past history of a group of species or populations and how they diversified. Unlike
trees, networks can represent events such as migration between populations, admixture,
hybridization between species, or recombination between viral strains. I will give
an overview of how phylogenetic networks are used and the difficulties of estimating
phylogenetic networks from genome-wide data. Then I will focus on the characterization
of what is (or is not) knowable about the network based on genetic distance data.
2022-10-31, Chenhao Zhao, MatrixLM: a flexible, interpretable framework for high-throughput
data
Geisel School of Medicine, Dartmouth College, The Matrix Linear Model (MLM) is an
efficient and computationally feasible solution to association analysis for biomedical
high-throughput data. Sen and Liang (2018) developed the MatrixLM.jl package in the
Julia programming language, providing core functions to estimate matrix linear models. The
project's main goal was to collaborate on user-friendly documentation, increase testing
features, and improve certain coding functionalities. We used simulated data to demonstrate
how to use the package and used a case study example based on an actual disease metabolomics
study to showcase MLM's benefits. Nonalcoholic fatty liver disease (NAFLD) is a progressive
liver disease that is strongly associated with type II diabetes. Using Matrix Linear
Models, our analysis investigated the association between metabolite characteristics
(e.g., pathways) and patient characteristics such as type II diabetes.
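The matrix linear model underlying MatrixLM.jl (in the form described by Sen and Liang, 2018; the notation here is generic) relates the response matrix to row and column covariates simultaneously:

    Y = X B Z^T + E

with Y the n samples by m metabolites response matrix, X (n by p) the patient characteristics such as type II diabetes status, Z (m by q) the metabolite characteristics such as pathway membership, B (p by q) the coefficients of interest, and E the error matrix.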
2022-10-17, Dr. Heather M. Highland, A multi-omics approach to understanding the role
of APOL1 in CKD amongst African Ancestry individuals
The University of North Carolina at Chapel Hill, APOL1 is an integral part of the
complement system, a component of innate immunity that serves as the first line of
defense against pathogens. The trypanosome that causes African sleeping sickness developed
mechanisms to evade the innate immune system after the migration out of Africa.
This created a selective pressure on non-synonymous variants in the APOL1 gene. These
variants are only observed in people with recent African ancestry. People carrying
two variants in APOL1 are at increased risk of developing a variety of kidney diseases.
The mechanism by which APOL1 alters kidney function is currently unclear. To investigate
potential mechanisms, we have looked at differences in DNA methylation across the
genome and metabolomic profiles. Using 1740 African Americans (AA) in the ARIC study
and 3886 Hispanic/Latinos in SOL, we did not observe any statistically significant
differences in metabolism in preliminary analyses after adjusting for multiple comparisons.
In 947 AAs in ARIC, 949 AAs in JHS, and 332 AAs in MESA, we identified extensive differences
in methylation near the APOL gene family region on chromosome 22 (p = 2.7x10^-80). Additional
methylation differences were seen near FEZF2, FAM20A, and KIAA0556. The role of these
loci in the development of kidney disease in people with two APOL1 risk alleles continues
to be investigated.
2022-09-19, Dr. Luis FS Castro-de-Araujo, Bidirectional Causal Modeling With Instrumental
Variables and Data From Relatives
The University of Melbourne, Establishing (or falsifying) causal associations is an
essential step towards developing effective interventions for psychiatric and substance
use problems. While randomized controlled trials (RCTs) are considered the gold standard
for causal inference in health research, they are impossible or unethical in many
common scenarios. Mendelian randomization (MR) can be used where RCTs are not feasible,
but it requires stringent assumptions that can be fundamentally flawed when applied
to complex traits. Some assumptions of MR can be avoided by using structural equation
modeling. In this paper we developed an extension of the Direction of Causation twin
model (Neale 1994) that includes two polygenic risk scores in the specification, as
an approach to avoid some inherent restrictions of both MR and RCT. We hypothesize
that adding a second PRS will generate a more flexible model in terms of identification,
whilst maintaining reasonable power and allowing for bidirectional causation. OpenMx
software is used to explore the power of such a model and its identification. We arrive
at an extension of the Direction of Causation model that can be used in either a twin
design or an extended family design while relaxing some of MR's assumptions.
We further report that the model is adequately powered for current data set sizes (around
13,000 observations or fewer, depending on the variance of the instruments)
and for a range of additive, shared, and environmental variances found in common clinical
scenarios.
2022-08-29, Dr. Arash Shaban-Nejad, Using Ontologies for Knowledge Engineering and
Management in Medicine and Healthcare
University of Tennessee Health Science Center, Health intelligence relies on the systematic
collection and integration of data from diverse distributed and heterogeneous sources
at various levels of granularity. These sources include data from multiple disciplines
represented in different formats, languages, and structures, posing significant integration
and analytics challenges. Using a series of clinical and population health applications
and use cases, this seminar highlights the contribution made by emerging semantic
technologies that offer enhanced interoperability, interpretability, and explainability
through the adoption of ontologies (a computational artifact capturing domain knowledge
using concepts, relations, and complex logical rules and axioms), and knowledge graphs.
2022-08-22, Dr. Abraham Palmer, Using Outbred HS Rats to Study the Genetic Basis of
Almost Everything You Can Think Of
University of California San Diego, Whereas there are well established methods for
translating findings about single genes from humans to non-humans, there is an urgent
need for methods to translate the polygenic signals obtained from GWAS across species.
This is difficult because GWAS produces information about SNPs rather than genes,
and SNPs are inherently species-specific. My lab is helping to develop two complementary
methods to address this problem. Both methods depend on translation of GWAS signals
from SNPs to genes. In one method, this is done by choosing the gene that is nearest
to an implicated SNP. The lists of orthologous genes from two or more species are then
projected into a previously defined gene network and a random walk is used to diffuse
the signal to neighboring genes. The overlap between the network defined by each species
is then assessed for significance relative to permuted gene sets. In the second method,
SNPs are used to predict gene expression and these predictions are used to estimate
the effect of each gene's expression on phenotype, creating what we term a polygenic
transcriptomic risk score (PTRS). A PTRS can then be used in conjunction with orthologous
genes such that a PTRS defined in one species can be used to estimate an analogous
trait in individuals from another species. In preliminary work we found that both
methods identify highly statistically significant overlap in the signals associated
with both BMI and body length. We are extending these methods to behavioral traits,
including those relevant for substance use disorders.
2022-04-11, Dr. Daniel Roden, A Single-Cell and Spatially Resolved Atlas of Human
Breast Cancers
University of New South Wales, Breast cancers are complex cellular ecosystems where
heterotypic interactions play central roles in disease progression and response to
therapy. However, our knowledge of their cellular composition and organization remains
limited. Recently we published an integrated cellular and spatial atlas of 26 primary
human breast cancers spanning all major molecular subtypes. This provided a systematic,
high-resolution, characterization of the cellular diversity of the epithelial, immune
and stromal cellular landscape. To investigate neoplastic cell heterogeneity, we developed
a single cell classifier of intrinsic subtype (scSubtype) and revealed recurrent transcriptional
gene modules that define the neoplastic cells. This detailed cellular taxonomy was
then used to deconvolute large breast cancer cohorts, allowing their stratification
into nine clusters, termed 'ecotypes', with unique cellular compositions and association
with clinical outcome. Further, Visium spatial profiling provided an initial view
of how stromal, immune and neoplastic cells are spatially organized in tumours, offering
insights into tumour regulation. This work is now being expanded to the generation
and integration of 100s of cellular profiles and a matching dataset of spatially resolved
tumour transcriptomes. We propose that this will identify cellular niches that are
spatially organized in breast tumours, offering insights into anti-tumour immune regulation
and neoplastic heterogeneity. In particular, I'll discuss recent work, using whole-transcriptome
Nanostring spatial profiling, that analyses T-cell and cancer rich regions of the
tumour micro-environment in triple-negative breast cancers, showing how integration
with our cellular taxonomy allows for deconvolution of the cell-type abundances in
these specific tissue regions. Our work highlights the potential of large-scale, integrated,
cellular and spatial genomics to unravel the complex cellular heterogeneity within
tumours and identify novel cell types, niches, and regulatory states that will inform
treatment response.
2022-03-28, Dr. Li Hsu, A Mixed-Effects Model for Powerful Association Tests in Integrative
Functional Genomic Data
Fred Hutchinson Cancer Research Center, Genome-wide association studies (GWAS) have
successfully identified thousands of genetic variants for many complex diseases; however,
these variants explain only a small fraction of the heritability. Recent developments
in genotype-omics studies have shown promises for discovering novel loci by leveraging
genetically regulated molecular phenotypes (e.g., gene expression, methylation, proteomics)
into GWAS. However, existing approaches have a limitation: some variants can individually
influence disease risk through alternative functional mechanisms, so testing only the
association of imputed molecular phenotypes potentially loses power. To tackle this
challenge, we consider a unified mixed
effects model that formulates the association of intermediate phenotypes such as imputed
gene expression through fixed effects, while allowing for residual effects of individual
variants as random effects. We consider a set-based score testing framework, MiST
(Mixed effects Score Test), and propose data-driven combination approaches to jointly
test for the fixed and random effects. We also provide p-values for fixed and random
effects separately to enhance interpretability of the association signals. Recently,
we extended MiST to depend only on GWAS summary statistics instead of individual-level
data, allowing for broad application of MiST to GWAS data. Extensive simulations
demonstrate that MiST is more powerful than existing approaches and that summary statistics-based
MiST (sMiST) agrees well with results obtained from individual-level data, with substantially
improved computational speed. We apply sMiST to a large-scale GWAS of colorectal cancer
using summary statistics from >120,000 study participants and gene expression data
from the Genotype-Tissue Expression (GTEx) project. We identify several novel and
secondary independent genetic loci.
2022-03-21, Dr. Hongkai Ji, A statistical framework for differential pseudotime analysis
with multiple single-cell RNA-seq samples
Johns Hopkins Bloomberg School of Public Health, Pseudotime analysis with single-cell
RNA-sequencing (scRNA-seq) data has been widely used to study dynamic gene regulatory
programs along continuous biological processes. While many computational methods have
been developed to infer the pseudo-temporal trajectories of cells within a biological
sample, methods that compare pseudo-temporal patterns with multiple samples (or replicates)
across different experimental conditions are lacking. Lamian is a comprehensive and
statistically-rigorous computational framework for differential multi-sample pseudotime
analysis. It can be used to identify changes in a biological process associated with
sample covariates, such as different biological conditions, and also to detect changes
in gene expression, cell density, and topology of a pseudotemporal trajectory. Unlike
existing methods that ignore sample variability, Lamian draws statistical inference
after accounting for cross-sample variability and hence substantially reduces sample-specific
false discoveries that are not generalizable to new samples. Using both simulations
and real scRNA-seq data, including an analysis of differential immune response programs
between COVID-19 patients with different disease severity levels, we demonstrate the
advantages of Lamian in decoding cellular gene expression programs in continuous biological
processes.
2022-03-14, Dr. Lin Hou, Inference of disease associated genomic segments in post-GWAS
analysis
Tsinghua University, Identification and interpretation of disease associated loci
remain a paramount challenge in genome-wide association studies (GWAS) of complex disease.
We develop post-GWAS analysis tools, which leverage pleiotropy and functional annotations
to dissect the genetic architecture of complex traits. In this talk, I will first
introduce LOGODetect, a powerful and efficient statistical method to identify small
genome segments harboring local genetic correlation signals. LOGODetect automatically
identifies genetic regions showing consistent association with multiple phenotypes
through a scan statistic approach. Applied to seven neuropsychiatric traits, we identify
hub regions showing concordant effect on five or more traits. Next, I will introduce
Openness Weighted Association Studies (OWAS), a computational approach that leverages
and aggregates predictions of chromatin accessibility in personal genomes for prioritizing
GWAS signals. In extensive simulation and real data analysis, OWAS identifies genes/segments
that explain more heritability than existing methods, and has a better replication
rate in independent cohorts than GWAS. Moreover, the identified genes/segments show
tissue-specific patterns and are enriched in disease relevant pathways.
2022-02-14, Dr. Danyu Lin, Durability of Covid-19 Vaccines
University of North Carolina at Chapel Hill, Evaluating the durability of protection
afforded by Covid-19 vaccines is a public health priority, with the results needed
to inform policies around booster vaccinations as well as those around non-pharmaceutical
interventions. In this talk, I will present a general framework for estimating the
effects of Covid-19 vaccines over time in phase 3 clinical trials and observational
studies. I will show some results on the duration of vaccine protection from the Moderna
pivotal trial and from the North Carolina statewide surveillance data. The latter
data, which were published in the New England Journal of Medicine in January, provided
rich information about the effectiveness of the Pfizer, Moderna, and Johnson & Johnson
vaccines in reducing the risks of Covid-19, hospitalization, and death over time.
I will discuss the implications of these results for booster vaccinations.
2022-02-07, Dr. Arjun Krishnan, Democratizing data-driven biology: Tackling incomplete
data, unstructured metadata, and hidden curricula
Michigan State University, There is much enthusiasm about using omics and biomedical
data collections to fuel research on complex traits and diseases. However, there are
still some well-known fundamental challenges in seamlessly and effectively using these
data to drive research. For instance, there are >1.5 million human gene expression
profiles that are publicly available, but, depending on the technology/platform used
to record each profile, different subsets of genes in the genome are measured in these
transcriptomes, leading to thousands of unmeasured genes in many of these profiles.
These gaps in data are major hurdles for integrative analysis. Critical problems also
exist with data descriptions: the majority of >2 million publicly available omics
samples lack structured metadata, including information about tissue of origin, disease
status, and environmental conditions. Thus, discovering samples and datasets of interest
is not straightforward. In this seminar, I will present recent work from our group
on developing machine learning approaches to address these fundamental challenges.
In addition, I will discuss the need for improving advanced research training in biological
data analysis by formalizing concepts in statistical procedures, study design, data/code
management, critically consuming data-driven findings, and reproducible research.
2022-01-31, Dr. Yingying Wei, Meta-clustering of Genomic Data
Chinese University of Hong Kong, Like traditional meta-analysis that pools effect
sizes across studies to improve statistical power, it is of increasing interest to
conduct clustering jointly across datasets to identify disease subtypes for bulk genomic
data and to discover cell types for single-cell RNA-sequencing (scRNA-seq) data. Unfortunately,
due to the prevalence of technical batch effects among high-throughput experiments,
directly clustering samples from multiple datasets can lead to wrong results. The recently
emerging meta-clustering approaches require all datasets to contain all subtypes, which
is not feasible for many experimental designs. In this talk, I will present our
Batch-effects-correction-with-Unknown-Subtypes (BUS) framework. BUS is capable of correcting
batch effects explicitly, grouping samples that share similar characteristics into subtypes,
identifying features that distinguish subtypes, and enjoying a linear-order computational
complexity. We prove the identifiability of BUS for not only bulk data but also scRNA-seq
data whose dropout events suffer from missing not at random. We mathematically show
that under two very flexible and realistic experimental designs, the "reference panel"
and the "chain-type" designs, true biological variability can also be separated from
batch effects. Moreover, despite the active research on analysis methods for scRNA-seq
data, rigorous statistical methods to estimate treatment effects for scRNA-seq data,
that is, how an intervention or exposure alters the cellular composition and gene
expression levels, are still lacking. Building upon our BUS framework, we further develop
statistical methods to quantify treatment effects for scRNA-seq data.
2022-01-24, Dr. Xuexia Wang, Novel Genetic Association Test and Cardiomyopathy Risk
Prediction in Cancer Survivors
University of North Texas, This talk includes two projects. Project 1: Gene-based
association tests are widely used in GWAS. The power of a test is often limited by
the sample size, the effect size, and the number of causal variants or their directions
in a gene. In addition, access to individual-level data is often limited. To resolve
the existing limitations, we proposed an optimally weighted combination (OWC) test
based on summary statistics from GWAS. We analytically proved that aggregating the
variants in one gene is the same as using the weighted combination of Z-scores for
each variant based on the proposed score test. Several existing methods are special
cases. We also numerically illustrated that our proposed test outperforms several
existing methods via simulation studies. Furthermore, we utilized schizophrenia GWAS
data and fasting glucose GWAS meta-analysis data to demonstrate that our method outperforms
the existing methods in real data analyses. Project 2: We used a carefully curated
list of 87 previously published genetic variants to determine whether the incorporation
of genetic variants with non-genetic variables could improve the identification of
cancer survivors at risk for anthracycline-related cardiomyopathy. We used anthracycline-exposed
childhood cancer survivors from a Children's Oncology Group study (COG-ALTE03N1: 146
cases; 195 matched controls) as the discovery set. Replication was performed in two
anthracycline-exposed survivor populations: i) childhood cancer survivors from the
Childhood Cancer Survivor Study (CCSS: 126 cases; 250 controls); ii) autologous blood
or marrow transplantation (BMT) survivors from the BMT Survivor Study (BMTSS: 80 cases;
78 controls). The Clinical+Genetic model performed better than the Clinical Model
in COG-ALTE03N1 (AUC of Clinical+Genetic Model = 0.88 vs. AUC of Clinical Model =
0.81) and BMTSS (AUC of Clinical+Genetic Model = 0.72 vs. AUC of Clinical Model =
0.64), but not in CCSS (AUC of Clinical+Genetic Model = 0.88 vs. AUC of Clinical Model
= 0.89). However, the Clinical+Genetic model performed marginally better in CCSS patients
without CVRFs where cardiomyopathy developed within 30 years of anthracycline exposure
(AUC of Clinical+Genetic Model = 0.90 vs. AUC of Clinical Model = 0.85). Conclusions:
Adding a comprehensively assembled genetic profile to clinical characteristics improves
the identification of cancer survivors at risk for anthracycline-related cardiomyopathy.
2021-11-22, Dr. Zoltán Kutalik, Advances in Mendelian Randomization
University of Lausanne, First, I will motivate the need for causal inference in contrast
to observational correlations. In particular, I'll describe the principle of an instrumental
variable approach heavily applied in genetic research, termed Mendelian Randomization
(MR). Next, I will show four extensions of this method to different settings/assumptions:
(i) In-depth quantification and correction of the bias of the most popular MR method
(IVW); (ii) Modelling genetic architecture simultaneously with bidirectional causal
effects in the presence of a heritable confounder; (iii) Causal inference for composite
trait type exposures; (iv) estimation of non-linear causal effects. I'll finish with
two applications to omics data: first, we link differential gene expression analyses
with bi-directional causal effects and, finally, I'll touch on how causal effects of
different omics layers are mediated.
2021-11-01, Anna Reisetter, Standardization and Penalty Parameter Selection in Penalized
Linear Mixed Models
University of Iowa, Penalized linear mixed models (LMMs) have been developed to accurately
identify genotype-phenotype associations in the presence of dependent samples. In
spite of this, the statistical properties of these models are not well understood.
In addition, there is a lack of available software for their implementation. In this
talk, we provide an overview of penalized LMMs for the analysis of structured genetic
data, while examining their statistical properties in the genetic association setting.
We then focus on the statistical properties of penalized LMMs in a general setting,
and provide recommendations for key components of their implementation, including
appropriate standardization and penalty parameter selection. We demonstrate the benefits
of our recommendations using both a general setting, and one specific to genetic data.
We conclude with a detailed analysis of a large, empirical GWAS data set which contains
complex correlation among samples. We use this analysis to illustrate the benefits
of penalized LMMs compared to traditional genome-wide association methods, and to
demonstrate the utility of penalizedLMM, an R package we have developed for the flexible
and user-friendly implementation of penalized LMMs.
2021-10-18, Jane Liang, PanelPRO: A General Framework for Multi-Gene, Multi-Cancer
Mendelian Risk Prediction Models
Harvard University, Risk evaluation to identify individuals who are at greater risk
of cancer as a result of heritable pathogenic variants is a valuable component of
individualized clinical management. Using principles of Mendelian genetics, Bayesian
probability theory, and variant-specific knowledge, Mendelian models derive the probability
of carrying a pathogenic variant and developing cancer in the future, based on family
history. Existing Mendelian models are widely employed, but are generally limited
to specific genes and syndromes. However, the upsurge of multi-gene panel germline
testing has spurred the discovery of many new gene-cancer associations that are not
presently accounted for in these models. We have developed PanelPRO, a flexible, efficient
Mendelian risk prediction framework that can incorporate an arbitrary number of genes
and cancers, overcoming the computational challenges that arise because of the increased
model complexity. Using simulations and a clinical cohort with germline panel testing
data, we evaluate model performance, validate the reverse-compatibility of our approach
with existing Mendelian models, and illustrate its usage.
2021-10-04, Dr. Rebecca Hubbard, Principled Approaches to the Practical Challenges
of Real-World Data
University of Pennsylvania, Interest in conducting research using real-world data,
data generated as a by-product of digital transactions, has exploded over the past
decade and has been further spurred by the 21st Century Cures Act. Real-world data
facilitate understanding of treatment utilization and outcomes as they occur in routine
practice, and studies using these data sources can potentially proceed rapidly compared
to trials and observational studies that rely on primary data collection. However,
using data sources that were not collected for research purposes comes at a cost,
and naïve use of such data without considering their complexity and imperfect quality
can lead to bias and inferential error. Real-world data frequently violate the assumptions
of standard statistical methods, but it is not practicable to develop new methods
to address every possible complication arising in their analysis. The statistician
is faced with a quandary: how to effectively utilize real-world data to advance research
without compromising best practices for principled data analysis. In this talk I will
use examples from my research on methods for the analysis of electronic health record
(EHR)-derived data to illustrate approaches to understanding the data-generating mechanism
for real-world data. Drawing on this understanding, I will then discuss approaches
to identify, use, and develop principled methods for incorporating EHR into research.
The overarching goal of this presentation is to raise awareness of challenges associated
with the analysis of real-world data and demonstrate how a principled approach can
be grounded in an understanding of the scientific context and data generating process.
2021-09-27, Dr. Duyeol Lee, Generalized V-Learning Framework for Estimating Dynamic
Treatment Regimes
Quantitative Analytics Specialist, Model Innovations Team, Wells Fargo Bank, Precision
medicine is an approach that incorporates personalized information to efficiently
determine which treatments are best for which types of patients. A key component of
precision medicine is creating mathematical estimators for clinical decision-making.
Dynamic treatment regimes formalize tailored treatment plans as sequences of decision
rules. Recently, the V-learning method was introduced to estimate optimal dynamic
treatment regimes. This method showed good performance compared to the existing reinforcement
learning methods such as greedy gradient Q-learning. However, the complicated functional
form of its loss function makes it difficult to apply modern machine learning methods
for the estimation of value functions of treatment policies. We propose a generalized
V-learning framework for estimating optimal treatment regimes. The proposed method
adopts widely used loss functions and an iterative method to estimate value functions.
Simulation studies show that the proposed method provides better performance compared
with the original V-learning method.
2021-09-24, Dr. Karl Broman, OSGA Webinar - Organizing Data in Spreadsheets
University of Wisconsin-Madison, Spreadsheets are widely used software tools for data
entry, storage, analysis, and visualization. Focusing on the data entry and storage
aspects, this presentation will offer practical recommendations for organizing spreadsheet
data to reduce errors and ease later analyses. The basic principles are: be consistent,
write dates like YYYY-MM-DD, do not leave any cells empty, put just one thing in a
cell, organize the data as a single rectangle (with subjects as rows and variables
as columns, and with a single header row), create a data dictionary, do not include
calculations in the raw data files, do not use font color or highlighting as data,
choose good names for things, make backups, use data validation to avoid data entry
errors, and save the data in plaintext files.
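A tiny hypothetical example of the recommended layout: a single rectangle with one header row, ISO dates, one value per cell, and no empty cells, stored as plaintext and accompanied by a separate data dictionary:

    # mouse_weights.csv
    mouse_id,date,weight_g,diet
    M001,2021-03-01,21.4,chow
    M001,2021-03-08,22.1,chow
    M002,2021-03-01,19.8,high_fat

    # data_dictionary.csv
    variable,description,units
    mouse_id,unique animal identifier,
    date,measurement date (YYYY-MM-DD),
    weight_g,body weight,grams
    diet,diet group (chow or high_fat),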
2021-09-13, Dr. Phillip Hunter Allman, Mendelian Randomization Analyses in a Multivariate
Framework
University of Alabama at Birmingham School of Public Health, Mendelian randomization
(MR) is an application of instrumental variable (IV) methods to observational data
in which the IV is a genetic variant. An IV is a variable which functions similarly
to the random treatment group assignment seen in clinical trials. Numerous statistical
methods exist for subject-level MR or IV analysis; such as two-stage predictor substitution
(2SPS) or the correlated errors model (CEM). These methods have some limitations depending
on the distributions involved and the number of IVs, particularly when it comes to
asymptotic variance estimation and hypothesis testing. Our research explores extensions
of the CEM to scenarios with non-normal variates. This talk will provide a brief introduction
to MR; describe the popular statistical methods for such analyses; describe our extensions
to the correlated errors model; and compare these methods with simulations and real
data analysis.
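As a reference point for the methods named above, two-stage predictor substitution with a single genetic instrument G, exposure X, and outcome Y can be sketched in its textbook linear form (the talk's extensions address non-normal variates and multiple IVs):

    Stage 1:  X = \pi_0 + \pi_1 G + u,  yielding fitted values \hat{X}
    Stage 2:  Y = \theta_0 + \theta_1 \hat{X} + v

Here \theta_1 is interpreted as the causal effect of X on Y, provided G is associated with X, affects Y only through X, and shares no unmeasured confounders with Y.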
2021-08-24, Ye Eun Bae, Building tools for understanding the genetic architecture
of allopolyploidy
Florida State University, Analysis of quantitative trait loci (QTL) is a useful approach
to identify putatively causal genetic factors underlying quantitative trait variation
in species. While many QTL analysis tools and methods have been proposed, it is challenging
to extend them to allopolyploid species, which require polyploidy-aware genetic mapping
that uses the fact that there is preferential pairing between pairs of homologous chromosomes.
In this project, we developed a QTL analysis tool for allopolyploids to provide insights
into their genetic architecture and genotype-phenotype associations. As a starting
point, we applied our tool to switchgrass (Panicum virgatum) datasets and visualized
the results to aid understanding of the allopolyploid nature of switchgrass.
2021-08-23, Nadeesha Thewarapperuma, Analysis of DNA methylation data using a functional
data approach
University of Kansas Medical Center, We will use a data series that examines oral
and pharyngeal carcinoma, and which includes 154 cases and 72 controls. A functional
data approach will be used to analyze CpG island data for each gene to determine if
there is a difference between case and control groups. The package fdANOVA will be
used for the hypothesis tests. The null hypothesis is that the case and control groups
have the same mean function, and the alternative is that the mean functions differ.
2021-06-21, Dr. Tianxiao Huan, Genome-Wide Identification of DNA Methylation QTLs
in Whole Blood Highlights Pathways for Cardiovascular Disease
UMASS, Abstract: Identifying methylation quantitative trait loci (meQTLs) and integrating
them with disease-associated variants from genome-wide association studies (GWAS)
may illuminate functional mechanisms underlying genetic variant-disease associations.
Here, we perform GWAS of >415 thousand CpG methylation sites in whole blood from 4170
individuals and map 4.7 million cis- and 630 thousand trans-meQTL variants targeting
>120 thousand CpGs. Independent replication is performed in 1347 participants from
two studies. By linking cis-meQTL variants with GWAS results for cardiovascular disease
(CVD) traits, we identify 92 putatively causal CpGs for CVD traits by Mendelian randomization
analysis. Further integrating gene expression data reveals evidence of cis CpG-transcript
pairs causally linked to CVD. In addition, we identify 22 trans-meQTL hotspots each
targeting more than 30 CpGs and find that trans-meQTL hotspots appear to act in cis
on expression of nearby transcriptional regulatory genes. Our findings provide a powerful
meQTL resource and shed light on DNA methylation involvement in human diseases.
2021-05-29, Dr. Andrew Lawson, Bayesian Spatio-Temporal SIR Modeling of Covid-19 in
SC, USA
Medical University of South Carolina, The Covid-19 pandemic has focused awareness
on the need for good modeling of infectious disease spread and the need for surveillance
which can alert public health officials to developing adverse events such as clusters
of unusual risk (hot spots). Bayesian models can provide a dynamically flexible framework
for such modeling via recursive Bayesian learning. The use of susceptible-infected-removed
(SIR) compartment models is exemplified. In addition, monitoring of events can be
facilitated by using posterior functionals of risk. This talk will address some infectious
disease modeling basics and demonstrate the need for flexible models that can address
transmission, both symptomatic and asymptomatic, and be allowed to address the spatial
structure of the pandemic via neighborhood effects and correlated random effects.
The addition of predictors is considered, and it has been found that the percentage below
the poverty line at the county level makes a significant contribution to the transmission
variation. Prediction from both daily count data and smoothed data is also explored.
2021-05-10, Dr. Jay Greene, A Machine Learning Compatible Method for Ordinal Propensity
Score Stratification and Matching?
GlaxoSmithKline, Abstract: Although machine learning techniques that estimate propensity
scores for observational studies with multi-valued treatments have advanced rapidly
in recent years, the development of propensity score adjustment techniques has not
kept pace. While machine learning propensity models provide numerous benefits, they
do not produce a single variable balancing score that can be used for propensity score
stratification and matching. This issue motivates the development of a flexible ordinal
propensity scoring methodology that does not require parametric assumptions for the
propensity model. The proposed method fits a one-parameter power function to the cumulative
distribution function (CDF) of the generalized propensity score (GPS) vector resulting
from any machine learning propensity model, and is called the GPS-CDF method. The
estimated power parameter from the GPS-CDF method is a scalar balancing score that can be used to group similar subjects in outcome analyses. Specifically, subjects who received different levels of the treatment are stratified or matched based on their estimated parameter value to produce unbiased estimates of the average treatment effect (ATE). Simulation studies show remediation of covariate imbalance, minimal bias in ATEs, and maintained coverage probability. The proposed method is applied to the Mexican-American
Tobacco use in Children (MATCh) study to determine whether ordinal exposure to smoking
imagery in movies is causally associated with cigarette experimentation in Mexican-American
adolescents (manuscript: https://onlinelibrary.wiley.com/doi/full/10.1002/sim.8846)
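One way to picture the core of the GPS-CDF idea (a sketch under stated assumptions, not the authors' implementation): for each subject, form the cumulative distribution of the machine-learned GPS vector over the ordered treatment levels, fit the one-parameter power function F(x) = x^a to it, and use the fitted exponent as the scalar balancing score for stratification or matching.

    # Sketch of the GPS-CDF idea with simulated GPS vectors for an ordinal treatment.
    set.seed(2)
    K <- 4; n <- 200
    gps <- t(apply(matrix(rgamma(n * K, shape = 1), n, K), 1, function(p) p / sum(p)))

    power_score <- function(p) {
      x <- seq_along(p) / length(p)            # ordered treatment levels mapped to (0, 1]
      y <- cumsum(p)                           # CDF of the subject's GPS vector
      fit <- nls(y ~ x^a, start = list(a = 1)) # one-parameter power function
      coef(fit)[["a"]]
    }

    score  <- apply(gps, 1, power_score)       # scalar balancing score per subject
    strata <- cut(score, quantile(score, seq(0, 1, 0.2)), include.lowest = TRUE)
    table(strata)                              # e.g., quintile stratification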
2021-04-26, Dr. ShanShan Zhao, Assessing the Effects of Multiple Exposures Subject
to Limit of Detection
NIH/NIEHS, Studies on the health effects of environmental mixtures face the challenge
of limit of detection (LOD) in multiple correlated exposure measurements. Conventional
approaches to deal with covariates subject to LOD, including complete-case analysis,
substitution methods and parametric modeling of covariate distribution, are feasible
but may result in efficiency loss or bias. With a single covariate subject to LOD,
a flexible semiparametric accelerated failure time (AFT) model to accommodate censored
measurements has been proposed. We generalize this approach by considering a multivariate
AFT model for the multiple correlated covariates subject to LOD and a generalized
linear model for the outcome. A two-stage procedure based on semiparametric pseudo-likelihood
is proposed for estimating the effects of these covariates on health outcome. Consistency
and asymptotic normality of the estimators are derived for an arbitrary fixed dimension
of covariates. Simulations studies demonstrate good large sample performance of the
proposed methods versus conventional methods in realistic scenarios. We illustrate
the practical utility of the proposed method with the LIFECODES birth cohort data,
where we compare our approach to existing approaches in an analysis of multiple urinary
trace metals in association with oxidative stress in pregnant women.
2021-03-29, Dr. Omer Ozturk, Meta-analysis of quantile intervals from different studies
with an application to a pulmonary tuberculosis data
Ohio State University, After the completion of many studies, experimental results
are reported in terms of distribution-free confidence intervals that may involve pairs
of order statistics. This article considers a meta-analysis procedure to combine these
confidence intervals from independent studies to estimate or construct a confidence
interval for the true quantile of the population distribution. Data synthesis is made
under both fixed-effect and random-effect meta-analysis models. We show that mean
square error (MSE) of the combined quantile estimator is considerably smaller than
that of the best individual quantile estimator. We also show that the coverage probability
of the meta-analysis confidence interval is quite close to the nominal confidence
level. The random-effect meta-analysis model yields a better coverage probability
and smaller MSE than the fixed-effect meta-analysis model. The meta-analysis method
is then used to synthesize medians of patient delays in pulmonary tuberculosis diagnosis
in China to provide an illustration of the proposed methodology. (manuscript: https://onlinelibrary.wiley.com/doi/full/10.1002/sim.8738)
2021-03-15, Dr. Maiying Kong, Propensity score specification for optimal estimation
of average treatment effect with binary response
University of Louisville, Propensity score methods are commonly used in statistical
analyses of observational data to reduce the impact of confounding bias in estimations
of average treatment effect. While the propensity score is defined as the conditional
probability of a subject being in the treatment group given that subject's covariates,
the most precise estimation of average treatment effect results from specifying the
propensity score as a function of true confounders and predictors only. This property
has been demonstrated via simulation in multiple prior research articles. However,
we have seen no theoretical explanation as to why this should be so. This paper provides
that theoretical proof. Furthermore, this paper presents a method for performing the
necessary variable selection by means of elastic net regression, and then estimating
the propensity scores so as to obtain optimal estimates of average treatment effect.
The proposed method is compared against two other recently introduced methods, outcome-adaptive
lasso and covariate balancing propensity score. Extensive simulation analyses are
employed to determine the circumstances under which each method appears most effective.
We applied the proposed methods to examine the effect of pre-cardiac surgery coagulation
indicator on mortality based on a linked dataset from a retrospective review of 1390
patient medical records at Jewish Hospital (Louisville, KY) with the Society of Thoracic
Surgeons database. (manuscript: https://journals.sagepub.com/doi/full/10.1177/0962280220934847)
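A stripped-down version of the workflow described above (elastic net screening of covariates for the treatment model, then propensity scores and inverse-probability weighting) might look as follows in R; the data are simulated and the glmnet settings are illustrative, not those of the paper.

    # Elastic net covariate screening followed by IPW estimation of the ATE.
    library(glmnet)

    set.seed(3)
    n <- 500; p <- 20
    X   <- matrix(rnorm(n * p), n, p)
    trt <- rbinom(n, 1, plogis(0.8 * X[, 1] - 0.5 * X[, 2]))               # true confounders: X1, X2
    y   <- rbinom(n, 1, plogis(-0.5 + 0.7 * trt + 0.6 * X[, 1] + 0.4 * X[, 3]))

    cvfit    <- cv.glmnet(X, trt, family = "binomial", alpha = 0.5)        # elastic net
    selected <- which(as.vector(coef(cvfit, s = "lambda.min"))[-1] != 0)

    ps  <- fitted(glm(trt ~ X[, selected, drop = FALSE], family = binomial))
    ate <- mean(trt * y / ps) - mean((1 - trt) * y / (1 - ps))             # IPW estimate
    ate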
2021-03-08, Dr. Cen Wu, Sparse group variable selection for gene-environment interactions
in the longitudinal study
Kansas State University, Regularized variable selection for high dimensional longitudinal
data has received much attention as accounting for the correlation among repeated
measurements can provide additional and essential information for improved identification
and prediction performance. Despite the success, in longitudinal studies, the potential
of regularization methods is far from fully understood for accommodating structured
sparsity. In this work, we have developed a sparse group penalization method to conduct bi-level G-E interaction studies with repeatedly measured phenotypes. Within
the quadratic inference function (QIF) framework, the proposed method can achieve
simultaneous identification of main and interaction effects on both the group and
individual level. Simulation studies have shown that the proposed method outperforms
major competitors. In the case study of asthma data from the Childhood Asthma Management
Program (CAMP), our method leads to improved prediction, as well as the identification of main and interaction effects with important implications.
2021-02-22, Dr. Hyeonju Kim, A tutorial for a multivariate linear mixed model based QTL (Quantitative Trait Loci) analysis tool: FlxQTL
UTHSC, FlxQTL is a Julia software package developed for faster and more flexible genetic mapping of multivariate traits compared to existing linear mixed model-based algorithms. It offers comprehensive functionality for QTL analysis: genetic kinship computation, genome scans, pair scans, permutation tests, and visualization. In this talk, a step-by-step demo from package installation to visualization will be presented using Arabidopsis thaliana data (Recombinant Inbred Lines).
2021-02-08, Dr. Aiyi Liu, Nonparametric estimation of distributions and diagnostic
accuracy based on group-tested results with differential misclassification
NIH, This talk concerns the problem of estimating a continuous distribution in a diseased
or nondiseased population when only group-based test results on the disease status
are available. The problem is challenging in that individual disease statuses are
not observed and testing results are often subject to misclassification, with the further complication that the misclassification may be differential as the group size and
the number of the diseased individuals in the group vary. We propose a method to construct a nonparametric estimator of the distribution and obtain its asymptotic properties.
The performance of the distribution estimator is evaluated under various design considerations
concerning group sizes and classification errors. The method is exemplified with data
from the National Health and Nutrition Examination Survey (NHANES) study to estimate
the distribution and diagnostic accuracy of C-reactive protein in blood samples in
predicting chlamydia incidence.
2020-11-16, Dr. Rima Izem, Parallelizing Research Efforts with Hosted Git and Modern Asynchronous Workflows
Children's National Research Institute and the George Washington University, Git is
a modern distributed version control system that enables collaborators to explore
various avenues of research without interfering with the common core of their project
while also keeping track of contributions and project versions over time. It has been
the cornerstone of both the smallest and largest software development collaborations
over the past decade, scaling from a handful of collaborators to thousands. Taking cues from the software engineering ecosystem of workflows and tools that has evolved around Git, we can increase reproducibility, assuage pain points in distributed collaboration,
and pursue multiple research avenues within projects asynchronously.
2020-11-09, Dr. Lifeng Lin, Treatment ranking in Bayesian network meta-analysis and
predictions
Florida State University, Network meta-analysis (NMA) is an important tool to provide
high-quality evidence about available treatments' benefits and harms for comparative
effectiveness research. Compared with conventional meta-analyses that synthesize related
studies for pairs of treatments separately, an NMA uses both direct and indirect evidence
to simultaneously compare all available treatments for a certain disease. It is of
primary interest for clinicians to rank these treatments and select the optimal ones
for patients. Various methods have been proposed to evaluate treatment ranking; among
them, the mean rank, the so-called surface under the cumulative ranking curve (SUCRA),
and P-score are widely used in current practice of NMAs. However, these measures only
summarize treatment ranks among the studies collected in the NMA. Due to heterogeneity
between studies, they cannot predict treatment ranks in a future study and thus may
not be directly applied to healthcare for new patients. We propose innovative measures
to predict treatment ranks by accounting for the heterogeneity between the existing
studies in an NMA and a new study. They are the counterparts of the mean rank, SUCRA,
and P-score under the new study setting. We use illustrative examples and simulation
studies to evaluate the performance of the proposed measures.
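For orientation, the classical SUCRA referred to above is computed from the rank probabilities of each treatment; the measures proposed in the talk are predictive counterparts of this quantity and are not reproduced here. A toy R computation with made-up rank probabilities:

    # Classical SUCRA from a matrix of rank probabilities (rows = treatments, cols = ranks).
    rank_prob <- rbind(
      A = c(0.60, 0.25, 0.10, 0.05),
      B = c(0.25, 0.40, 0.20, 0.15),
      C = c(0.10, 0.25, 0.45, 0.20),
      D = c(0.05, 0.10, 0.25, 0.60)
    )
    K <- ncol(rank_prob)

    # SUCRA_j = sum of the first K-1 cumulative rank probabilities / (K - 1).
    sucra <- apply(rank_prob, 1, function(p) sum(cumsum(p)[1:(K - 1)]) / (K - 1))
    sort(sucra, decreasing = TRUE)   # larger values indicate better expected ranking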
2020-10-26, Dr. Brian Egleston , Statistical inference for natural language processing
algorithms with a demonstration using type 2 diabetes prediction from electronic health
record notes
Fox Chase Cancer Center, The pointwise mutual information statistic (PMI), which measures how much more often two words occur together in a document corpus than would be expected by chance, is a cornerstone of recently proposed popular natural language processing algorithms such as word2vec.
PMI and word2vec reveal semantic relationships between words and can be helpful in
a range of applications such as document indexing, topic analysis, or document categorization.
We use probability theory to demonstrate the relationship between PMI and word2vec.
We use the theoretical results to demonstrate how the PMI can be modeled and estimated
in a simple and straightforward manner. We further describe how one can obtain standard error estimates that account for within-patient clustering that arises from patterns
of repeated words within a patient's health record due to a unique health history.
We then demonstrate the usefulness of PMI on the problem of predictive identification
of disease from free text notes of electronic health records. Specifically, we use
our methods to distinguish those with and without type 2 diabetes mellitus in electronic
health record free-text data using over 400,000 clinical notes from an academic medical
center.
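To make the central quantity concrete: for words w1 and w2 with marginal probabilities p(w1) and p(w2) and joint probability p(w1, w2), PMI(w1, w2) = log[ p(w1, w2) / (p(w1) p(w2)) ]. A tiny hand-rolled R example on a toy corpus (not the clinical notes from the talk, and ignoring the clustering adjustments discussed there):

    # Toy pointwise mutual information from document co-occurrence.
    docs <- list(c("metformin", "diabetes", "a1c"),
                 c("diabetes", "a1c", "insulin"),
                 c("hypertension", "lisinopril"),
                 c("diabetes", "metformin"))

    pmi <- function(w1, w2) {
      p1  <- mean(sapply(docs, function(d) w1 %in% d))                # p(w1)
      p2  <- mean(sapply(docs, function(d) w2 %in% d))                # p(w2)
      p12 <- mean(sapply(docs, function(d) all(c(w1, w2) %in% d)))    # p(w1, w2)
      log(p12 / (p1 * p2))
    }

    pmi("metformin", "diabetes")     # positive: co-occur more often than chance
    pmi("metformin", "lisinopril")   # -Inf here: never co-occur in this toy corpus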
2020-09-28, Dr. Bin Zhu , A hidden Markov modeling approach for identifying tumor
subclones in next-generation sequencing studies
National Cancer Institute/NIH, Allele-specific copy number alteration (ASCNA) analysis is used to identify copy number abnormalities in tumor cells. Unlike normal cells,
tumor cells are heterogeneous as a combination of dominant and minor subclones with
distinct copy number profiles. Estimating the clonal proportion and identifying mainclone
and subclone genotypes across the genome are important for understanding tumor progression.
Several ASCNA tools have recently been developed, but they have been limited to the
identification of subclone regions, and not the genotype of subclones. In this article,
we propose subHMM, a hidden Markov model-based approach that estimates both subclone
region and region-specific subclone genotype and clonal proportion. We specify a hidden
state variable representing the conglomeration of clonal genotype and subclone status.
We propose a two-step algorithm for parameter estimation, where in the first step,
a standard hidden Markov model with this conglomerated state variable is fit. Then,
in the second step, region-specific estimates of the clonal proportions are obtained
by maximizing region-specific pseudo-likelihoods. We apply subHMM to study renal cell
carcinoma datasets in The Cancer Genome Atlas. In addition, we conduct simulation
studies that show the good performance of the proposed approach. The R source code
is available online at https://dceg.cancer.gov/tools/analysis/subhmm.
2020-09-21, Dr. Danh V. Nguyen , Profiling Dialysis Facilities for Adverse Recurrent
Events
UC Irvine , Profiling analysis aims to evaluate health care providers, such as hospitals,
nursing homes, or dialysis facilities, with respect to a patient outcome. Previous
profiling methods have considered binary outcomes, such as 30-day hospital readmission
or mortality. For the unique population of dialysis patients, regular blood work is required to evaluate effectiveness of treatment and avoid adverse events, including dialysis inadequacy, imbalanced mineral levels, and anemia, among others. For example,
anemic events (when hemoglobin levels fall outside the normative range) are recurrent and common
for patients on dialysis. Thus, we propose high-dimensional Poisson and negative binomial
regression models for rate/count outcomes and introduce a standardized event ratio
(SER) measure to compare the event rate at a specific facility relative to a chosen
normative standard, typically defined as an "average" national rate across all facilities.
Our proposed estimation and inference procedures overcome the challenge of high-dimensional
parameters for thousands of dialysis facilities. Also, we investigate how overdispersion
affects inference in the context of profiling analysis.
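As a back-of-the-envelope illustration of the standardized event ratio (the high-dimensional models and inference in the talk are far more involved), SER compares a facility's observed event count with the count expected under a national reference rate; the numbers below are made up.

    # Toy standardized event ratio (SER) per facility.
    facilities <- data.frame(
      id       = c("F1", "F2", "F3"),
      events   = c(40, 18, 95),     # observed recurrent events
      exposure = c(120, 75, 310)    # patient-years of follow-up
    )
    national_rate <- 0.28           # reference events per patient-year

    facilities$expected <- national_rate * facilities$exposure
    facilities$SER      <- facilities$events / facilities$expected
    facilities                      # SER > 1: more events than expected under the reference rate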
2020-07-27, Dr. Abolfazl Mollalo , Spatial Variations of the COVID-19 Incidence in
the United States: A GIS-based Approach
Baldwin Wallace University, The outbreak of COVID-19 in the United States is posing an unprecedented socioeconomic burden on the country. Due to inadequate research on
geographic modeling of COVID-19, we investigated county-level variations of disease
incidence across the continental United States. We compiled a geodatabase of 35 environmental,
socioeconomic, topographic, and demographic variables that could potentially explain
the spatial variability of disease incidence. Further, we employed spatial lag and
spatial error models to investigate spatial dependencies and geographically weighted
regression (GWR) and multiscale GWR (MGWR) models to locally examine spatial non-stationarity.
The results suggested that even though incorporating spatial autocorrelation could significantly improve the performance of the global ordinary least squares model, these models still performed significantly worse than the local models. Moreover, MGWR could explain the highest variation (adjusted R²: 68.1%) with the lowest AICc compared to the others. In the second study, we added mortality rates of several infectious and chronic diseases as explanatory variables and examined the performance of a multilayer perceptron (MLP) neural network in predicting the cumulative COVID-19 incidence rates using 57 variables. Our results indicated that a single-hidden-layer MLP could explain almost 65% of the correlation with the ground truth for the holdout samples. Sensitivity analysis conducted on this model showed that the age-adjusted mortality rates of ischemic heart disease, pancreatic cancer, and leukaemia, together with two socioeconomic and environmental factors (median household income and total precipitation), are among the most substantial factors for predicting COVID-19 incidence rates. The findings may provide useful insights for public health decision makers regarding the influence of potential risk factors associated with the COVID-19 incidence at the county level.
2020-07-13, Dr. Qi Yan, Deep-learning-based Prediction of Late Age-Related Macular Degeneration Progression
U of Pittsburgh
2020-06-29, Dr. Fatma Gunturkun , Artificial Intelligence for Prediction of Late Onset
Cardiomyopathy among Childhood Cancer Survivors
UTHSC , Background: Early identification of childhood cancer survivors at high risk
for treatment-related cardiomyopathy may improve outcomes by enabling timely intervention.
We implemented deep learning and signal processing methods using the Children's Oncology
Group (COG) guideline-recommended baseline electrocardiography (ECG) to predict future
cardiomyopathy. Methods: Signal processing and deep learning tools were applied to 12-lead electrocardiograms (ECGs) obtained on 1,217 adult survivors (≥18 years of age, ≥10 years from diagnosis) of childhood cancer, without evidence of cardiomyopathy,
prospectively followed in the St. Jude Lifetime Cohort (SJLIFE) Study. Clinical and
echocardiographic assessment of cardiac function was performed at baseline and follow-up
evaluations and graded per a modified version of the Common Terminology Criteria for
Adverse Events (CTCAE). Extreme gradient boosting (XGboost) algorithms were applied,
and model performance evaluated by 5-fold stratified cross-validation. Results: Median age at baseline evaluation was 31.7 years (range 18.4-66.4), and median age at cancer diagnosis was 8.4 years (range 0.01-22.7). The average length of follow-up time following the baseline SJLIFE evaluation was 5 years (range 0.5-9). Among survivors, 67.1% were exposed to chest radiation (median dose 1,200 cGy, range 4-6,200 cGy) and 76.6% were exposed to anthracyclines (mean dose 168.7 mg/m2, range 35.1-734.2 mg/m2). A total of
117 (9.6%) survivors developed cardiomyopathy during follow-up. In the model based
on ECG features, the cross-validation AUC was 0.87 (95% CI 0.83-0.90), with sensitivity
76% and specificity 79%, and in the model based on ECG and clinical features, the
cross-validation AUC was 0.89 (95% CI 0.86-0.91), with sensitivity 78% and specificity
81%. Conclusion: Artificial intelligence using electrocardiographic data may assist
in the early identification of childhood cancer survivors at high risk for cardiomyopathy.
2020-06-22, Dr. Hansapani Rodrigo , HIV-Associated Neurocognitive Disorder (HAND)
Biomarker Identification with Significance Analysis of Microarrays and Random Forest
Analysis
The University of Texas Rio Grande Valley , Genome-wide screening of transcription
regulation in brain tissue helps in identifying substantial abnormalities present in patients' gene transcripts and in discovering possible biomarkers for HIV-associated neurological disorders (HAND). This study explores the possibility of identifying differentially expressed (DE) genes, which can serve as potential biomarkers to detect HAND. In this regard, gene expression levels from three subject groups with different levels of HAND impairment, along with a control group, have been investigated and compared in three distinct brain regions (white matter, frontal cortex, and basal ganglia) using multiple statistical analysis methods, including Significance Analysis of Microarrays (SAM) and random forests (RF). Two aspects of gene expression have been investigated: single-gene and sub-gene network effects, where the latter model accounts for the co-regulation of, and interrelations among, genes. The sub-gene network RF model takes prior biological information into account and hence widens the path in the exploration of DE genes.
This study has shed light on new sets of genes, including CIRBP, RBM3, GPNMB, ISG15, IFIT6, IFI6, and IFIT3, which had previously been overlooked or not detected as significantly expressed among the subjects with HAND in either the frontal cortex or the basal ganglia. The gene GADD45A, a protein-coding gene whose transcript levels tend to increase under stressful growth arrest conditions, was consistently ranked among the top genes by two RF models within the frontal cortex.
2020-06-01, Dr. Sixia Chen, Nonparametric Mass Imputation for Data Integration
University of Oklahoma Health Sciences Center
2020-05-11, Dr. Zonghui Hu , Assessment of collective genetic impact from twin study:
a mixture distribution approach
NIH/NIAID , It is challenging to evaluate the genetic impacts on a biologic feature
and separate them from the environmental impacts. We approach this through twin studies
by assessing the collective genetic impact defined by the differential correlation
in monozygotic twins versus dizygotic twins. Since the underlying order in a twin,
determined by latent genetic factors, is unknown, the observed twin data are unordered.
Conventional methods for correlation are not appropriate. To handle the missing order,
we model twin data by a mixture bivariate distribution and estimate under two likelihood
functions: the likelihood over the monozygotic and dizygotic twins separately, and
the likelihood over the two twin types combined. Both likelihood estimators are consistent.
More importantly, the combined likelihood overcomes the drawback of mixture distribution
estimation, namely, slow convergence. It yields a correlation coefficient estimator with root-n consistency and allows effective statistical inference on the collective
genetic impact. The method is demonstrated by a twin study on immune traits. This
is a joint work with Pengfei Li, Dean Follmann and Jing Qin.
2020-05-04, Dr. Marco Geraci , Quantile contours and allometric modelling for risk
classification of abnormal ratios with an application to asymmetric growth restriction
in preterm infants
University of South Carolina , In this talk, I will present a novel approach to risk
classification based on quantile contours and allometric modelling of multivariate
anthropometric measurements. I propose the definition of allometric direction tangent
to the directional quantile envelope, which divides ratios of measurements into half-spaces.
This in turn provides an operational definition of directional quantile that can be
used as cutoff for risk assessment. I will apply the proposed approach to a large
dataset from the Vermont Oxford Network containing observations of birthweight (BW)
and head circumference (HC) for more than 150,000 preterm infants. The analysis suggests
that small preterm infants with large HC-to-BW ratio are at increased risk of mortality
as compared to appropriate-for-gestational-age as well as proportionately-growth-restricted
preterm infants. This study offers not only an approach to risk classification, but
also large-sample estimated cutoffs that can be immediately used by practitioners
(Geraci M et al, 2019, Statistical Methods in Medical Research, doi:10.1177/0962280219876963).
2020-04-20, Dr. Mehmet Kocak and Mr. Tristan Hayes, Tracking the Covid-19 Pandemic
in the U.S.
UTHSC , Mehmet Kocak, PhD, Associate Professor of Biostatistics in the Department
of Preventive Medicine and Tristan Hayes, MSc, Biostatistics Consulting Manager, will
present on efforts to track the COVID-19 virus using publicly available data sources.
Dr. Kocak's presentation, "A comparative look at the Covid-19 Pandemic in the U.S",
will focus on comparing the virus' progression in the US to other nations and will
also briefly touch on forecasting. Mr. Hayes's presentation, "Visualizing the Covid-19
Pandemic in the U.S." will cover several methods for depicting the epidemic visually,
using R and R Shiny.
2020-03-09, Dr. Yunxi Zhang , Variable Selection and Imputation for High-Dimensional
Incomplete Data
University of Mississippi Medical Center , Missing data are an inevitable problem
in data with a large number of variables. The presence of missing data obstructs the
implementation of the existing variable selection methods. This is especially an issue
when there is a limited number of observations. An applicable and efficient selection-with-imputation method is necessary to obtain valid results. In this talk, I will
propose an approach to efficiently select important variables from high dimensional
data in the presence of missing data. We employ the shrinkage prior and multiple imputation
for variable selection in the high-dimensional setting with missing values. Simulation studies and analysis results will be presented and compared with other possible approaches.
2020-02-24, Dr. Dong Wang , New technology and statistical issues for toxicity testing,
from in vitro assays to postmarket surveillance
National Center for Toxicological Research/FDA , The emergence of high throughput
in vitro assays has the potential to significantly improve toxicological evaluations
and lead to more efficient, accurate, and less animal-intensive testing. However,
effectively utilizing data from in vitro assays in a predictive model is still a challenging
problem. One major difficulty is caused by the small sample size of the training data
set in most toxicology problems. Thus, how to utilize data most efficiently is at a premium in predictive modeling of toxicity, and it requires some creative techniques not commonly used in settings with copious amounts of data. In this presentation, several examples of how to perform statistical modeling and machine learning in toxicity testing
within the constraints of small sample sizes will be discussed. In the first example,
a robust learning method was applied to the prediction of the point of departure so
that a small portion of outlier chemicals will not harm the overall performance of
the model. In the second example, adverse outcome pathway networks were utilized to
filter potential predictors from in vitro assays to construct a more parsimonious
model with improved performance. Finally, we will discuss how to use postmarket surveillance
databases to provide independent validation for predictive models based on in vitro
assays.
2019-11-18, Sedigheh Mirzaei Salehabadi , Estimation of time-to-event distribution
based on partially recalled data
Biostat, St. Jude , In some retrospective studies, participants are asked to recall
the time of occurrence of a landmark event, if experienced. Some respondents who had
experienced the event are able to recall the date exactly, some recall only the month
or year of the event, and some are unable to recall it. These interval-censored data bear evidence of being informatively censored, for which we build a special model
for estimating the time-to-event distribution. We provide a set of regularity conditions
on the distribution, subject to which the consistency and the asymptotic normality
of the parametric maximum likelihood estimator are established. We also provide a
computationally simple approximation to the nonparametric maximum likelihood estimator
and establish its consistency under mild conditions. The small-sample performance of the two estimators is studied through Monte Carlo simulations. Moreover, we provide
a graphical check of the assumption of the multinomial model for the recall probabilities,
which appears to hold for a menarcheal data set. Its analysis shows that the use of
the partially recalled part of the data indeed leads to smaller confidence intervals
of the survival function.
2019-11-11, Yujiao Mai , Classic Mediation Analysis for Complex Surveys with Balanced
Repeated Replications
Biostat, St. Jude, Mediation analysis investigates the role of a third variable as a transmitter in the relationship between the exposure and the outcome. The third variable is called the mediator. Although there are established methodology and computer tools for classic mediation analysis within the structural equation modeling (SEM) framework, applications to survey data from complex sampling designs have not been addressed. As complex sampling designs using balanced repeated replications are common in national surveys, this study introduces a classic mediation analysis algorithm adjusting for complex surveys with balanced repeated replications and develops software packages in R and SAS. The study then illustrates the application of the algorithm and packages to the Tobacco Use Supplement to the Current Population Survey. At the end of the talk, we will discuss the limitations of this study and future development.
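A stripped-down sketch of the balanced-repeated-replication idea for a mediated (indirect) effect is shown below; the data and replicate weights are simulated placeholders, the variance formula is the basic BRR estimator (1/R) * sum((theta_r - theta)^2), and the R and SAS packages developed in the talk are not reproduced here.

    # Indirect effect a*b re-estimated under each BRR replicate weight.
    set.seed(4)
    n <- 300; R <- 8
    x <- rnorm(n); m <- 0.5 * x + rnorm(n); y <- 0.4 * m + 0.2 * x + rnorm(n)
    w_full <- rep(1, n)
    w_rep  <- matrix(runif(n * R, 0.5, 1.5), n, R)   # stand-in for BRR replicate weights

    indirect <- function(w) {
      a <- coef(lm(m ~ x, weights = w))["x"]
      b <- coef(lm(y ~ m + x, weights = w))["m"]
      unname(a * b)
    }

    theta   <- indirect(w_full)
    theta_r <- apply(w_rep, 2, indirect)
    se_brr  <- sqrt(mean((theta_r - theta)^2))       # basic BRR standard error
    c(indirect = theta, se = se_brr)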
2019-09-16, Hyo Young Choi , SCISSOR: a novel framework for identifying changes in
RNA transcript structures
Computational Biology, UTHSC , High-throughput sequencing protocols such as RNA-seq
have made it possible to interrogate the sequence, structure and abundance of RNA
transcripts at higher resolution than previous microarray and other molecular techniques.
While many computational tools have been proposed for identifying mRNA variation through
differential splicing/alternative exon usage, the promise of RNA-seq remains largely
unrealized. In this talk, we propose a novel framework for unbiased and robust discovery
of aberrant RNA transcript structures using short read sequencing data based on shape
changes in an RNA-seq coverage profile. Shape changes in selecting sample outliers in RNA-seq (SCISSOR) is a series of procedures for transforming and normalizing base
level RNA sequencing coverage data in a transcript independent manner, followed by
a statistical framework for its analysis. The resulting high dimensional object is
amenable to unsupervised screening of structural alterations across RNA-seq cohorts
with nearly no assumption on the forms of underlying abnormalities. This enables independent recapture of known variants (such as splice site mutations in tumor suppressor genes) as well as novel variants that were previously unrecognized or difficult to identify by any existing method, including recurrent alternate start sites and recurrent complex deletions in 3' UTRs.
2019-08-12, Dr. Qian (Michelle) Zhou , Risk prediction with Cohort Studies
Statistics, Mississippi State U., Accurate risk prediction is a key component of precision medicine. In this talk, I will present several research projects on risk prediction that I have worked on. I will introduce a novel threshold-free metric to evaluate
the overall precision of a risk model. I will talk about how to identify subgroups
for which the new markers are most/least useful in improving risk prediction. In large
cohort studies, it is often not feasible to measure expensive biomarkers on the full
cohort, and efficient sampling designs, such as nested case-control (NCC), are employed.
I will illustrate how to accurately evaluate the incremental values of the new markers
under a three-phase NCC design. I will also present a flexible varying coefficient
model for estimating the age-specific absolute risk, with the marker effects varying with age.
2019-05-13, Lih-Yuan Deng , Survey of Current Random Number Generators used in R and
Comparison with DX Generators
Statistics, University of Memphis, In this talk, we review and compare the PRNGs available in RNGkind and compare them with our proposed DX (Deng-Xu) generator, a fast and efficient, huge-period multiple recursive generator (MRG) with equi-distribution in thousands of dimensions. Compared with these generators, DX generators are the fastest and have the longest period length. Furthermore, we show that DX generators consistently pass all the empirical tests, while most of the PRNGs used in base R, including the default generator MT19937, are not able to pass all these tests.
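For context, the base-R generators being compared can be listed and switched with RNGkind(); a brief example is below (the DX generators themselves are distributed separately and are not shown).

    # Query and switch the pseudo-random number generator used by base R.
    RNGkind()                          # current kinds; the default is "Mersenne-Twister" (MT19937)
    set.seed(2019); runif(3)

    RNGkind(kind = "L'Ecuyer-CMRG")    # another generator available through RNGkind
    set.seed(2019); runif(3)

    RNGkind(kind = "Mersenne-Twister") # restore the default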
2019-04-29, Yimei Li , Genome-wide Association Analysis on Functional Imaging Data
Biostat, St. Jude, Imaging genetics allows for the identification of how common and rare genetic polymorphisms, by influencing molecular processes (e.g., serotonin signaling), bias neural pathways (e.g., amygdala reactivity) that mediate individual differences in complex behavioral processes (e.g., trait anxiety) related to disease risk in response to environmental adversity. The challenges include how to integrate imaging data with genetic information, how to develop more advanced statistical methods to analyze those types of data, and how to produce reproducible analyses. Functional phenotypes
(e.g., subcortical surface representation), which commonly arise in imaging genetic
studies, have been used to detect putative genes for complexly inherited neuropsychiatric
and neurodegenerative disorders. However, existing statistical methods largely ignore
the functional features (e.g., functional smoothness and correlation). I will focus on introducing some published methods for analyzing functional medical imaging data with genetic information, such as functional structural equation models for twin imaging data, functional mixed effects models for candidate genetic mapping, and functional genome-wide association analysis, and their application in real data analysis.
2019-04-15, Hongmei Zhang , Bayesian network selection with ordered data
Biostat, U of Memphis , We propose an approach for constructing Bayesian networks
for data with unknown order and determining whether networks constructed under different
conditions are statistically identical. A Bayesian method is developed for this purpose.
A penalty-incorporated conditional posterior probability mass function, motivated
by conditional posterior distributions under non-informative priors, is implemented
to make a selection between identical networks and differential networks. Theoretical
assessment, simulations, and real data applications indicate the efficacy and efficiency
of the proposed method.
2019-04-08, Bernie J. Daigle, Jr. , Maximum Likelihood Parameter Estimation and Model
Identification from Single-Cell Distribution Data
Biological Sciences and Computer Science, U of Memphis , Recent advances in single-cell
experimental techniques have provided unprecedented access to the mechanisms underlying
fundamental cellular processes. In particular, techniques assaying populations of
single cells, including single-cell RNA sequencing (scRNA-seq), have highlighted the
importance of cellular noise: stochastic fluctuations within, and heterogeneity between, genetically identical cells. Despite the growing availability of such single-cell distribution
data, limitations in computational methods for parameter estimation and model identification
remain a bottleneck for developing accurate mechanistic descriptions of cellular processes.
Existing methods typically make simplifying assumptions about the underlying biochemical
model, impose limits on model size/complexity, or require prior knowledge of model
parameter values. I will present a novel maximum likelihood-based method for parameter estimation and model identification, distribution Monte Carlo Expectation-Maximization with Modified Cross-Entropy Method (dMCEM2), that does not have these limitations.
Building upon a method developed for single-cell time series data, dMCEM2 enables
automated, computationally efficient inference and identification of stochastic biochemical
models from single-cell distribution data. Using both synthetic and real-world scRNA-seq
data, I will demonstrate the ability of dMCEM2 to accurately construct mechanistic
models of gene expression.
2019-03-18, Natasha Sahr , Multi-level variable screening and selection for survival
data
Biostat, St. Jude, Variable selection methods for the marginal proportional hazards model are a relatively understudied research area in biostatistics. The limited available
methods focus on the selection of non-zero individual variables for a single outcome.
However, variable selection in the presence of grouped covariates is often required.
Some methods are available for the selection of non-zero group and within-group variables
for the Cox proportional hazards model. There are no available methods to perform
group variable selection in the clustered multivariate survival setting. In this context,
the hierarchical adaptive group bridge penalty is proposed to select non-zero group
and within-group variables for the independent or clustered marginal multivariate
proportional hazards model. Simulation studies show the hierarchical adaptive group
bridge method has superior performance compared to the extension of the adaptive group
bridge in terms of variable selection accuracy. Sometimes, survival data involve high-dimensional group variables. Most existing screening methods address the sure
screening property for individual variable selection. The sure group joint variable
screening method is proposed to screen independent and clustered multivariate survival
data in the presence of group variables. Simulation studies show the sure group joint
variable screening method performs better than existing screening procedures extended
to the multivariate survival setting. The hierarchical adaptive group bridge and sure
group joint variable screening methods can be effective tools, used in a two-step
process, in identifying non-zero group and within-group variables for high-dimensional
multivariate survival data.
2019-02-21, Mehmet Kocak , Statistical Modeling for Growth Curves
Preventive Medicine, UTHSC , In pediatric clinical trials and cohort studies, actual
height and weight of children at a specific age may be required for certain developmental
assessments such as energy expenditure. This necessitates the choice of a growth model
with desired characteristics to predict height and weight accurately. In this talk,
we introduce and compare two commonly used growth curve models, namely, Logistic and
Gompertz models, with respect to the distribution of their residuals as well as the
logistical challenges in model convergence using the US and Turkish Growth Curve Standards
for the first 3 years of life.
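Both models can be fit with self-starting nls() models in base R; a small simulated example (not the US or Turkish standards used in the talk) is sketched below.

    # Fit Logistic and Gompertz growth curves to simulated early-childhood lengths.
    set.seed(5)
    age <- rep(seq(0, 36, by = 3), each = 5)                         # age in months
    len <- 100 * exp(-0.6 * 0.92^age) + rnorm(length(age), sd = 1.2) # simulated length (cm)
    growth <- data.frame(age = age, len = len)

    fit_logis <- nls(len ~ SSlogis(age, Asym, xmid, scal), data = growth)
    fit_gomp  <- nls(len ~ SSgompertz(age, Asym, b2, b3), data = growth)

    AIC(fit_logis, fit_gomp)                   # compare the two fits
    plot(resid(fit_gomp) ~ fitted(fit_gomp))   # inspect residual behavior, as in the talk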
2018-10-22, Zheng Xu , Association Testing Based On Sequencing Data With Arbitrary
Depth
Statistics, University of Nebraska-Lincoln , Association studies have been widely
conducted to explain the relationship between phenotypes and genetic variants. Researchers have developed association testing methods for a single marker or a group of markers based on genotypes. The phenotypes, either continuous or binary traits, are
studied in the regression framework with genotypes as explanatory variables. Some
factors such as genotype calling uncertainty and sequencing depth may influence the
performance of these genotype-based testing methods. We propose association testing
methods directly based on next-generation sequencing data without genotype calling.
The methods have been applied to testing for a single marker or a group of markers, with applications to rare variants.
2018-04-09, Luhang Han , Identifying stable and dynamic CpG sites pre- and post-adolescence
transition via a longitudinal genome-scale study
Statistics, University of Memphis , There is some evidence that DNA methylation (DNA-M)
over time is stable at certain cytosine-phosphate-guanine (CpG) sites and varies at
others (dynamic methylation). Adolescence transition (puberty) is considered associated
with DNA-M change and this change is gender specific. In adolescence, a gender reversal
of asthma prevalence occurs, from male predominance of asthma prevalence in young
childhood to female predominance after adolescence. Given that DNA methylation may
play a central role in susceptibility to asthma, assessing the stability of DNA-M
provides a potential to understand the mechanisms of asthma transition during adolescence.
The aim of this study was to identify dynamic and stable DNA-M at the genome scale
and assess their gender-specificity. Data from children at 10 and 18 years old from the Isle of Wight (IOW) birth cohort in the United Kingdom were included. Epigenome-scale DNA-M was assessed using Illumina 450K and 850K EPIC platforms. Linear mixed models with repeated measures were implemented in our analysis. We identified 15,532 CpG sites that were dynamic during the adolescence transition in both genders; the level of DNA-M at 1,179 CpG sites was not stable during the adolescence transition, and this change was gender specific. The findings were further tested in an independent study, the Avon Longitudinal Study of Parents and Children (ALSPAC); as expected, the results agreed with the findings from the IOW cohort.
2018-03-26, Chi-Yang Chiu , An additive hazards model for gene level association analysis
of survival traits of complex disorders
Preventive Medicine, UTHSC , Based on counting processes and Doob-Meyer decomposition
of submartingales and functional regression models, we propose an additive hazards
model for gene level association analysis of survival traits of complex disorders.
The additive hazards model overcomes the proportional hazards assumption of Cox models and is more flexible for modeling association with time/age. Association between genetic
markers and eye disease will be investigated.
2018-03-12, Hyeonju Kim , Efficient algorithms for detecting GxE in Multivariate Linear
Model
Preventive Medicine, UTHSC , We develop a multivariate linear mixed model for detecting
gene-environment interactions when there are many environments, and we have information
annotating the environments. Our prototype example datasets are on segregating plant
populations grown in multiple sites in multiple years. We will have information on
the weather in each year as well as site-specific information such as latitude. The
goal is to find QTLs that depend on latitude accounting for weather patterns that
vary by year. We formulate a linear mixed model where traits can be correlated due
to genomewide similarities (genetic kinship) and due to weather similarities ("climate
kinship") between environments. We implement an efficient algorithm that uses an Expectation
Conditional Maximization (ECM) algorithm in conjunction with an acceleration step.
A simulation study will be presented.
2018-02-12, Saunak Sen , Three algorithms for statistical computing: the MM, EM, and
proximal gradient algorithms
Preventive Medicine, UTHSC , Many problems in statistical estimation and machine learning
boil down to optimization of a criterion such as the log likelihood, the residual
sum of squares, or a penalized version. Commonly used algorithms include iteratively
reweighted least squares (IWLS) and Newton-Raphson methods. If the objective function
depends on a large number of parameters, is not smooth, or is difficult to compute,
other methods are needed. In statistics the EM algorithm is a common choice; in machine
learning proximal gradient algorithms are useful. It turns out that both are special
cases of the MM (minorization-maximization) algorithm. This method is guaranteed to
improve the objective function in each iteration under general conditions. We will
provide an overview of the three algorithms, and examine their use in penalized regression
models.
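As a concrete instance of the proximal gradient idea, the sketch below runs a bare-bones ISTA-style iteration for the lasso (squared-error loss plus an L1 penalty) on simulated data; the step size uses the usual 1/L rule with L the largest eigenvalue of X'X.

    # Proximal gradient (ISTA) for the lasso: minimize 0.5*||y - Xb||^2 + lambda*||b||_1.
    set.seed(6)
    n <- 100; p <- 10
    X <- scale(matrix(rnorm(n * p), n, p))
    beta_true <- c(3, -2, rep(0, p - 2))
    y <- X %*% beta_true + rnorm(n)

    soft <- function(z, t) sign(z) * pmax(abs(z) - t, 0)    # proximal operator of t*||.||_1

    lambda <- 10
    step   <- 1 / max(eigen(crossprod(X), only.values = TRUE)$values)
    beta   <- rep(0, p)
    for (it in 1:500) {
      grad <- drop(-crossprod(X, y - X %*% beta))           # gradient of the smooth part
      beta <- soft(beta - step * grad, step * lambda)       # gradient step, then soft-threshold
    }
    round(beta, 2)                                          # sparse estimate; compare with beta_true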
2018-01-29, Chris Gignoux , The role of human genetic diversity in the architecture
of traits: lessons from PAGE
Un. of Colorado , As we enter the second decade of genome-wide association studies,
there remains a persistent bias towards focusing on populations of European descent.
This has numerous drawbacks, both limiting new discoveries and impairing our translation
of findings into precision medicine for people across the world. To counteract this,
the Population Architecture using Genomics and Epidemiology (PAGE) Study was developed,
currently leveraging genome-wide data from 52,000 individuals representing four multi-ethnic
longitudinal cohorts. This large-scale work in diverse populations presents several
computational challenges. I will present on our efforts to design new arrays that
capture variation in diverse populations. Furthermore, I will demonstrate the use
of both statistical and population genetics methods to uncover new insights in complex
traits and the role of population structure in elucidating genetic architecture. I
will conclude with some of our work on Mendelian genetic disease in the context of
human genetic diversity and its implications for modern health systems and biobanks.
2017-11-13, Mina Sartipi , mStroke: Mobile Technology for Post-Stroke Recurrence Prevention
and Recovery
The University of Tennessee at Chattanooga, In this talk, I will present mStroke, a real-time quantitative assessment of stroke rehabilitation using wearable sensors.
The goal of mStroke is to explore mobile technology to improve stroke recovery and
prevent stroke recurrence. I will focus on mStroke's current clinical functions/applications
such as the functional reach test (FRT), NIH Stroke Scale (NIHSS) motor arm/motor
leg tests, gait speed, activity recognition, and fall detection. mStroke has been
tested on more than 200 students emulating individuals post-stroke and also 40 patients
post-stroke. I will conclude my talk by discussing our vision for mStroke's next steps. We want to study a post-stroke management system through multi-modal big data analytics applied to joint real-time data (from mStroke and other physiological sensors) and EMR data.
2017-10-23, Saunak Sen , Statistical learning methods and the bias-variance tradeoff
Preventive Medicine, UTHSC , The bias-variance tradeoff is a fundamental concept underlying
statistical modeling or learning methods. As we increase the complexity of our models, we usually decrease the bias in their predictions. However, increasing model complexity
also tends to make the predictions more variable or unstable. In this expository talk,
we will examine how the bias-variance tradeoff is exhibited in the context of some
common statistical learning methods such as k-nearest neighbors, multiple linear regression,
gradient boosting machines, and regularized regression. We will examine their performance
in the context of predicting plant fitness using genomewide genotype data.
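A quick simulated illustration of the tradeoff, with polynomial degree standing in for model complexity (the talk's examples use the methods and genomewide genotype data listed above instead):

    # Training error keeps falling with complexity; test error eventually turns back up.
    set.seed(7)
    f <- function(x) sin(2 * pi * x)
    train <- data.frame(x = runif(100));  train$y <- f(train$x) + rnorm(100,  sd = 0.3)
    test  <- data.frame(x = runif(1000)); test$y  <- f(test$x)  + rnorm(1000, sd = 0.3)

    degrees <- 1:12
    err <- t(sapply(degrees, function(d) {
      fit <- lm(y ~ poly(x, d), data = train)
      c(train_mse = mean(resid(fit)^2),
        test_mse  = mean((test$y - predict(fit, newdata = test))^2))
    }))
    cbind(degree = degrees, round(err, 3))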
2017-09-18, Fridtjof Thomas , Predictive Modeling: Can't See the Forest for the Trees?
Preventive Medicine, UTHSC , Predictive modeling is a very active research area constantly
adding new approaches as well as refining existing ones. This seminar provides an
overview of the approaches referred to as Random Forests, Elastic Nets, Support Vector
Machines (SVM), K-Nearest Neighbors (KNN), and Neural Nets. R-code and packages implementing
these approaches are presented.
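In the same spirit, two of the listed approaches can be fit in a few lines of R on the built-in iris data; the packages below (randomForest and e1071) are common choices shown only as an illustration, not necessarily the exact code from the seminar.

    # Random forest and SVM classifiers with a simple holdout evaluation.
    library(randomForest)
    library(e1071)

    set.seed(8)
    idx   <- sample(nrow(iris), 100)
    train <- iris[idx, ]; test <- iris[-idx, ]

    rf <- randomForest(Species ~ ., data = train, ntree = 500)
    sv <- svm(Species ~ ., data = train)

    mean(predict(rf, newdata = test) == test$Species)   # holdout accuracy, random forest
    mean(predict(sv, newdata = test) == test$Species)   # holdout accuracy, SVM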
2017-08-28, Dr. Gregory Farage , Feature Detection in PolSAR Images using the wavelet
transform
Preventive Medicine, UTHSC , After a brief introduction to Geographical Information
System (GIS) and Remote Sensing, I will present some of their applications with a
focus on polarimetric radar data. I will present two main approaches for classification
of remote sensing data. Next I will present a filtering technique that we developed
based on wavelet multiresolution analysis as a feature detection approach for PolSAR
data. Finally, I will open a discussion on the similarities between genomic data and
remote sensing data.
2017-08-22, Dr. Laura Saba , Animal models and statistical strategies for describing
the transcriptional connectome and its role in complex traits
University of Colorado at Denver , Rarely are the exact same genes associated with
a complex phenotype in different populations, e.g., different recombinant inbred rodent
panels, different human populations, or different species. In most cases, it is the
biological pathway that is important in the etiology of disease rather than an individual
gene. In our first analysis, information across several rat populations is combined
to identify a set of co-expressed genes in brain that predispose to alcohol-related
traits using weighted gene co-expression network analysis (WGCNA) and quantitative
trait loci. At the heart of this co-expression module is an unannotated transcript
that is likely to be a non-coding transcript. When this unannotated transcript is genetically manipulated, not only does the phenotype of interest, alcohol consumption, change, but the transcription levels of many other genes are also altered. This
statistical model provides insight about relevant biological processes but may not
give detail about the directed pathway from genotype to disease necessary to identify
potential therapeutic targets. In an additional study, we performed a similar WGCNA
analysis of liver RNA expression levels and associated modules and genetic variants
with alcohol metabolism-related phenotypes. In this study, we identified a module
that contains multiple isoforms of alcohol dehydrogenase, which is known to have a
major role in the metabolism of alcohol to acetaldehyde. Using this co-expression
module, we also explored the application of Bayesian networks to further hypothesize
about the direction of relationships between transcripts using Mendelian randomization.
This model further distinguished transcripts.
2017-08-14, Dr. Ibrahim Abdelrazeq, Lévy Driven CARMA(2,1) vs Realized Volatility
Rhodes College, The Lévy driven CARMA(2,1) process is a popular one with which to model stochastic volatility. However, there has been little development in statistical tools to verify this model assumption and assess the goodness-of-fit of real-world data (realized volatility). When a Lévy driven CARMA(2,1) process is observed at high frequencies, the unobserved driving process can be approximated from the observed process. Since, under general conditions, the Lévy driven CARMA(2,1) process can be written as a sum of two dependent Lévy driven CAR(1) processes, the methods developed in Abdelrazeq, Ivanoff, and Kulik (2014) can be employed in order to use the approximated increments of the driving process to test the assumption that the process is Lévy-driven. The performance of the test is illustrated through simulation and real-world data.
2017-07-24, Demba Fofana , Gene Expression and Network Analysis
University of Texas Rio Grande Valley , Analyzing gene expression data rigorously
requires taking assumptions into consideration but also relies on using information
about network relations that exist among genes. Combining these different elements can not only improve statistical power but also provide a better framework through
which gene expression can be properly analyzed. We propose a novel statistical model
that combines assumptions and gene network information into the analysis. Assumptions
are important since every test statistic is valid only when required assumptions hold.
We incorporate gene network information into the analysis because neighboring genes
share biological functions. This correlation factor is taken into account via similar
prior probabilities for neighboring genes. With a series of simulations, our approach
is compared with other approaches. Our method that combines assumptions and network
information into the analysis is shown to be more powerful.
2017-07-17, Dr. Rishi Kamaleswaran , Continuous Big Data and Analytics at the Point
of Care
Department of Pediatrics, UTHSC , Critically ill patients are admitted to the intensive
care unit (ICU) for complex, time sensitive, and dynamic care. While traditional patient
monitors in the ICU have been used to generate large volumes of continuous physiological data from sensors attached to the patient, analytics applied to those systems have largely
been univariate and limited. Use of big data approaches, such as continuous and longitudinal
event stream analytics through open source software such as Apache Spark allows us
to analyze multiple channels of physiological data for prediction of potentially devastating
conditions prior to their clinical manifestation. The use of novel discretization
and machine learning methods allow us to identify salient 'physiomarkers', such as
reduced heart rate variability or arrhythmias. This presentation will highlight recent work and new directions we are pursuing at the Pediatric Intensive Care Unit at
Le Bonheur Children's Hospital.
2017-03-27, Fridtjof Thomas , P-values: Too Big To Fail?
Preventive Medicine, UTHSC, This seminar outlines what we like about p-values and
in which situations they are inherently meaningful. The participant will then gain an understanding of why their use is somewhat less straightforward in situations arising in observational studies and when very many hypotheses are tested (multiplicity of
testing problem). Details are given why p-values cannot serve as measures of support
for a specific hypothesis and why they do not measure "strength of scientific evidence"
in a general sense. The American Statistical Association has recently issued recommendations
on P-values and reporting of statistical analyses and the Journal of Basic and Applied
Social Psychology has banned P values altogether from its research articles. Are P
values still around because they are inherently beneficial or have they simply become
so prolific that they are "too big to fail"?
2016-10-03, Josh Callaway , The Next Generation of Data Analysis
CEO/Co-founder/Data Scientist, PendulumRock Analytics, LLC, According to Moore's Law,
the number of transistors on a microchip doubles nearly every 2 years. As we are now
approaching this theoretical limit, Ray Kurzweil points out that technological performance
will enter a new paradigm of quantum computing, neuromorphic chips, and 3-D stacking.
We can well extrapolate a similar paradigm shift in many other realms of information
technology. Apps are replacing the middle man and rendering life cheaper in nearly
any service imaginable from the hotel industry to hailing a taxi or ordering food.
It appears very likely, as well, that apps will make their way into medical diagnosis
and rental services for autonomous self-driving cars. The boon of this app awakening
manifests in the acceleration of declining costs and easy accessibility. In developing
countries, such technology is providing business opportunity where it was previously
too expensive. And, this is not limited to consumer services. The proliferation of
Big Data and high-performance computing has broken barriers in statistics. Ten years
ago, an analytical model that today takes 10 minutes to compute might have taken 10
days. Furthermore, this is not merely limited to high-cost licensed commercial software.
Breakthroughs have been realized that allow smaller institutions and individuals to
spread the work of programming across multiple processes. Additionally, cheap production
of statistical applications is now plausible through open-source platforms. Therefore,
we may be witnessing an analytical singularity in which nearly-instantaneous output
can be produced from low-cost input. Already underway, this is providing a mainspring
in technical startups and will fuel a similar paradigm shift in statistics and analytics.
2016-08-22, Fridtjof Thomas , R for power users: Compiled R code and parallel computing
techniques
Preventive Medicine, UTHSC, This seminar will demonstrate how to harness the power
of R for large numerical computations with focus on simulation studies or other repetitive
computational tasks. I will outline how to write R code with speedy execution in mind
and how to investigate where computation time is spent in your R function by profiling
your R execution. A live demonstration will show how R code can be compiled for faster execution, as well as how to create and utilize a cluster across the multiple cores that your hardware provides. We will specifically address (pseudo) random number generation
and repeatability of parallel/distributed computing as a cornerstone of reproducible
research. All techniques are demonstrated in a Windows 10 environment but are generally
applicable to UNIX and Macintosh systems as well.
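For readers who want to experiment with these techniques, here is a minimal R sketch (not taken from the seminar; the toy simulation function, file name, and cluster size are illustrative assumptions) showing profiling, byte-compilation, and reproducible parallel random number generation:

    library(compiler)
    library(parallel)

    # Toy simulation task (hypothetical; stands in for a real repetitive computation)
    one_sim <- function(i, n = 1e4) {
      x <- rnorm(n)
      mean(x > 1.96)                            # proportion of draws beyond 1.96
    }

    # Profile where computation time is spent across repeated calls
    Rprof("sim_profile.out")
    res_serial <- sapply(1:200, one_sim)
    Rprof(NULL)
    summaryRprof("sim_profile.out")$by.self

    # Byte-compile the function for faster execution
    one_sim_cmp <- cmpfun(one_sim)

    # Parallel execution with reproducible (pseudo) random number streams
    cl <- makeCluster(max(1, detectCores() - 1))
    clusterSetRNGStream(cl, iseed = 20160822)   # one reproducible stream per worker
    res_parallel <- parSapply(cl, 1:200, one_sim_cmp)
    stopCluster(cl)

Because makeCluster() defaults to a socket ("PSOCK") cluster, the same sketch runs on Windows as well as UNIX-like systems.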
2016-05-23, Hyeonju Kim, Probabilities of Ruin in Economics and Insurance under Light-
and Heavy-tailed Distributions
Preventive Medicine, UTHSC, This research is conducted on ruin problems in two fields.
First, the ruin or survival of an economic agent over finite and infinite time horizons
is explored for a one-good economy. A recursive relation derived for the intractable
ruin distribution is used to compute its moments. A new system of Chebyshev inequalities,
using an optimal allocation of different orders of moments over different ranges of
the initial stock, provides good conservative estimates of the true ruin distribution.
The second part of the research is devoted to the study of ruin probabilities in the
general renewal model of insurance under both light- and heavy-tailed claim size distributions.
Recent results on the dual problem of equilibrium of the Lindley-Spitzer Markov process
provide clues to the orders of magnitude of finite time ruin probabilities in insurance.
2016-04-25, Parichoy Pal Choudhury, Causal Effect Among The Treated: Multiple Data
Sources and Censored Outcomes
Johns Hopkins Bloomberg School of Public Health, We develop an inferential framework
for estimating the causal effect among "exposed" subjects on a time-to-event outcome,
based on multiple data sources and censored outcome information. We conceptualize
a hypothetical point exposure study where subjects are enrolled and allowed to select
their own exposure. Using information from two data sources (one for exposed subjects
and one for non-exposed subjects with multiple examination times), we describe a process
of manufacturing a dataset that closely mimics this hypothetical study. The identification
of the causal effect relies on a no unmeasured confounding assumption based on covariates
available at exposure selection and a non-informative censoring assumption. Estimation
proceeds by fitting separate proportional hazards regression models for exposed and
non-exposed subjects using the manufactured dataset and using G-computation to estimate,
for exposed subjects, the distributions of time-to-event under exposure and non-exposure.
Using these estimated distributions, we compute a parsimonious measure of the causal
effect of interest.
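A rough sketch of the G-computation step described above, using simulated data (the variable names, covariates, and five-year time point are illustrative assumptions, not details from the talk):

    library(survival)

    # Simulated stand-in for the "manufactured" dataset: one row per subject with
    # follow-up time, event indicator, exposure, and baseline covariates
    set.seed(1)
    n   <- 500
    dat <- data.frame(z1 = rnorm(n), z2 = rbinom(n, 1, 0.4))
    dat$exposure <- rbinom(n, 1, plogis(0.5 * dat$z1))
    dat$time     <- rexp(n, rate = exp(-2 + 0.5 * dat$exposure + 0.3 * dat$z1))
    dat$status   <- rbinom(n, 1, 0.8)    # crude censoring indicator

    # Separate proportional hazards models for exposed and non-exposed subjects
    fit_exp   <- coxph(Surv(time, status) ~ z1 + z2, data = subset(dat, exposure == 1))
    fit_unexp <- coxph(Surv(time, status) ~ z1 + z2, data = subset(dat, exposure == 0))

    # G-computation: predict survival for the exposed subjects under both models,
    # then average over their covariate distribution
    exposed    <- subset(dat, exposure == 1)
    surv_exp   <- survfit(fit_exp,   newdata = exposed)
    surv_unexp <- survfit(fit_unexp, newdata = exposed)

    # Difference in average survival among the exposed at t = 5 (illustrative time point)
    mean_surv <- function(sf, t) mean(summary(sf, times = t)$surv)
    mean_surv(surv_exp, 5) - mean_surv(surv_unexp, 5)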
2016-04-18, Stanley Pounds, Profiling Dialysis Facilities for Adverse Recurrent Events
St. Jude Children's Research Hospital, Profiling analysis aims to evaluate health
care providers, such as hospitals, nursing homes, or dialysis facilities, with respect
to a patient outcome. Previous profiling methods have considered binary outcomes,
such as 30-day hospital readmission or mortality. For the unique population of dialysis
patients, regular blood work is required to evaluate the effectiveness of treatment
and avoid adverse events, including dialysis inadequacy, imbalanced mineral levels,
and anemia, among others. For example, anemic events (when hemoglobin levels fall
below the normative range) are recurrent and common for patients on dialysis. Thus,
we propose high-dimensional Poisson and negative binomial regression models for rate/count
outcomes and introduce a standardized event ratio (SER) measure to compare the event
rate at a specific facility relative to a chosen normative standard, typically defined
as an "average" national rate across all facilities. Our proposed estimation and inference
procedures overcome the challenge of high-dimensional parameters for thousands of
dialysis facilities. Also, we investigate how overdispersion affects inference in
the context of profiling analysis. (manuscript: https://onlinelibrary.wiley.com/doi/epdf/10.1002/sim.8482)
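A rough sketch of the measure (the notation here is an assumption for illustration, not taken from the abstract): for facility f, SER_f = O_f / E_f, where O_f is the observed number of adverse events at facility f and E_f is the expected number obtained by applying the normative (e.g., national average) event rate to that facility's patient follow-up time; SER_f greater than 1 indicates a higher-than-expected event rate.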
2016-04-04, Jonathan S. Schildcrout, Epidemiological sampling designs for longitudinal
binary data with application to spirometry-based COPD diagnosis
Vanderbilt University, We discuss an epidemiological study design and analysis approach
for longitudinal binary response data. Similar to other epidemiological study designs,
we seek to gain efficiency and increase power compared to standard designs by over-sampling
relatively informative subjects for inclusion into the sample. In particular, we will
discuss a design that conducts a case-control sample (i.e., sample cases with high
probability and controls with low probability); however, subjects are then followed
longitudinally and case-control status is observed repeatedly for each subject. If
the sampling variable (case-control status at baseline) is closely related to the
binary response (case-control status over time), we are able to observe a sample that
is highly enriched with case-visits compared to a standard random sampling design.
We may therefore realize substantial improvements in power and efficiency. However,
because the design over-samples case-days, we must acknowledge the non-representativeness
of the sample when conducting statistical analyses. We will describe a sequentially
offsetted regression approach for valid inferences. Motivated by data provided by
the Lung Health Study, we will show that targeted sampling designs can yield valid
inferences and can be far more resource efficient than standard random sampling designs
for longitudinal data.
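The need to adjust for the enriched sampling can be illustrated with a simple (single-visit, non-sequential) offset correction in R; the sampling probabilities and model below are illustrative assumptions, and the seminar's sequentially offsetted method for repeated binary outcomes is more general than this sketch:

    # Outcome-dependent sampling: cases sampled with probability p1, controls with p0
    set.seed(1)
    N <- 1e5
    x <- rnorm(N)
    y <- rbinom(N, 1, plogis(-2 + 0.8 * x))       # population model

    p1 <- 0.9; p0 <- 0.1
    keep <- rbinom(N, 1, ifelse(y == 1, p1, p0)) == 1
    samp <- data.frame(y = y[keep], x = x[keep])

    # Naive fit: the intercept is biased upward by the case-enriched sample
    coef(glm(y ~ x, family = binomial, data = samp))

    # Including log(p1/p0) as a fixed offset recovers the population-scale intercept
    samp$off <- log(p1 / p0)
    coef(glm(y ~ x + offset(off), family = binomial, data = samp))   # compare to (-2, 0.8)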
2016-03-16, Charisse Madlock-Brown , Exploring the UTHSC co-authorship Network with
D3
Informatics and Information Management, UTHSC, The D3 JavaScript library is currently
having a huge impact on academic data analysis. The D3 library allows researchers
to build interactive data visualizations for exploration using a wide variety of charts,
networks, and graphs. These visualizations can further help researchers communicate
their findings in presentations and answer questions with on-the-fly interactive
investigation of their data. In this presentation, I will go over the basics of the
D3 library. I will also demonstrate its potential by exploring a collaborative network
visualization of UTHSC researchers. I will explore the co-authorship patterns between
basic and clinical scientists, investigate sub-structures, and explore the publication
history of prominent co-authorship communities using the network as a guide for exploration.
2016-02-29, Arzu Onar-Thomas, A critical look at non-inferiority trials: benefits
and pitfalls
St. Jude Children's Research Hospital, Non-inferiority trials have the potential to
be extremely useful and are the designs of choice when placebo-controlled trials are not
ethical or when a new treatment is thought to be similar in primary outcome to standard
of care but may have advantages in secondary endpoints such as quality of life, cost,
and compliance. Designing and conducting non-inferiority trials, however, can be considerably
more challenging than a superiority trial, as the sources of bias are more abundant
and lurk in unusual places, such as intent-to-treat populations. Furthermore,
full interpretation of the results may rely on information outside of the trial itself,
making non-inferiority trials vulnerable to the hazards of non-randomized trials. In this
talk, I will try to outline some of the issues that need to be taken into account
when designing and running a non-inferiority trial and provide guidance regarding
best practices. I will also discuss potential biases that may be present and discuss
proper ways of analyzing the data and interpreting the results. The talk will primarily
focus on conceptual aspects rather than mathematical details and is intended for statisticians,
physician-scientists, and others involved in the operational aspects of clinical
trials.
2016-01-25, Ethan Willis, Causal Inference in Rare Diseases, In Practice
Center for Biomedical Informatics, UTHSC, This presentation will give an overview
of how researchers can measure the impact of medical interventions on patients with
rare diseases using longitudinal data. In the United States, rare diseases affect
30 million patients in approximately 7000 disease areas. Causal inference to inform
therapy decisions in this area is challenged by several factors: the patient population
is small, the genotype or phenotype is variable, and the disease pathways can be uncharacterized
or lengthy. These challenges are not specific to the rare disease setting, as they
apply more broadly to answering causal inference questions in the era of precision
medicine, where public health decisions rely on subgroups with small sample sizes. In 2019,
the Food and Drug Administration published documents guiding best practices on data
collection, study design and use of registry data in the rare diseases setting. Over
the past few years, a few algorithms were published to guide researchers to appropriate
study design. This presentation will give an overview of best design practices and
innovative analysis methods and illustrate their use with case studies in rare diseases.
2015-11-04, Karl Broman, Reproducible Research
Biostatistics and Medical Informatics, University of Wisconsin Madison, A minimal
standard for data analysis and other scientific computations is that they be reproducible:
that the code and data are assembled in such a way that another group can re-create
all of the results (e.g., the figures in a paper). I will discuss my personal struggles
to make my work reproducible and will present a series of suggested steps on the path
towards reproducibility (see http://kbroman.org/steps2rr).
2015-11-02, Karl Broman, Interactive graphics for genetic data
Biostatistics and Medical Informatics, University of Wisconsin Madison, The value
of interactive graphics for making sense of high-dimensional data has long been appreciated,
but such graphics are still not in routine use. I will describe my efforts to develop interactive
graphical tools for genetic data, using JavaScript and D3. (The tools are available
as an R package: R/qtlcharts, http://kbroman.org/qtlcharts) I will focus on an expression
genetics experiment in the mouse, with gene expression microarray data on each of
six tissues, plus high-density genotype data, in each of 500 mice. I argue that in
research with such data, precise statistical inference is not as important as data
visualization.
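As a minimal usage sketch (assuming the R/qtlcharts package is installed from CRAN; the random matrix below is a placeholder for real expression data, not the mouse data from the talk):

    library(qtlcharts)

    # Toy matrix standing in for gene expression data (individuals x genes)
    set.seed(1)
    expr <- matrix(rnorm(100 * 20), nrow = 100,
                   dimnames = list(NULL, paste0("gene", 1:20)))

    # Interactive correlation heatmap; clicking a cell shows the underlying scatterplot
    iplotCorr(expr)

When run interactively, the plot opens in a web browser or viewer pane.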