---
title: "Capstone Health Analytics Project using Big Autism Data"
author: "SOCR DSPA Training Project"
date: "`r format(Sys.time(), '%B %Y')`"
output:
  html_document:
    theme: spacelab
    highlight: tango
    toc: yes
    number_sections: yes
    toc_depth: 3
    toc_float:
      collapsed: no
      smooth_scroll: yes
    code_folding: hide
  word_document:
    toc: yes
    toc_depth: '3'
---

This is an open-ended R&D project that SOCR/DSPA trainees can complete. Creative solutions can be sent to the instructor ([Ivo D. Dinov](https://www.socr.umich.edu/people/dinov/)).

Use the Autism Brain Imaging Data Exchange (ABIDE) data to design a meaningful biomedical study examining, characterizing, and contrasting normal and pathological (autism) brain neuro-development.

# Autism Brain Imaging Data Exchange (ABIDE) Data

These data consist of derived neuroimaging data, quality assessment (QA) metrics prefixed by *anat_* and *func_*, and manual quality assessment prefixed by *qc_*.

Automated QA measures: These columns reflect automated metrics where outliers may be identified by a statistical procedure (e.g., $2\sigma$).

*Anatomical* measures:

- Contrast to Noise Ratio [*anat_cnr*]: The mean of the gray matter values minus the mean of the white matter values, divided by the standard deviation of the air values.
- Entropy Focus Criterion [*anat_efc*]: Shannon’s entropy is used to summarize the principal directions distribution; higher energy indicates that the distribution is more uniform (i.e., less noisy).
- Foreground to Background Energy Ratio [*anat_fber*]: The mean energy of image values (i.e., mean of squares) within the head relative to outside the head.
- Smoothness of Voxels [*anat_fwhm*]: The full-width half maximum (FWHM) of the spatial distribution of the image intensity values in terms of voxels (e.g., a value of 3 implies smoothness of 3 voxels).
- Percent of Artifact Voxels [*anat_qi1*]: The proportion of voxels with intensity corrupted by artifacts, normalized by the number of voxels in the background.
- Signal to Noise Ratio [*anat_snr*]: The mean of image values within gray matter divided by the standard deviation of the image values within air (i.e., outside the head).

*Functional* measures:

- Entropy Focus Criterion [*func_efc*]: Shannon’s entropy is used to summarize the principal directions distribution; higher energy indicates that the distribution is more uniform (i.e., less noisy).
- Foreground to Background Energy Ratio [*func_fber*]: The mean energy of image values (i.e., mean of squares) within the head relative to outside the head. Uses the mean functional image.
- Smoothness of Voxels [*func_fwhm*]: The full-width half maximum (FWHM) of the spatial distribution of the image intensity values. Uses the mean functional image.
- Standardized DVARS [*func_dvars*]: The spatial standard deviation of the temporal derivative of the data, normalized by the temporal standard deviation and temporal autocorrelation.
- Fraction of Outlier Voxels [*func_outlier*]: The mean fraction of outliers found in each volume using the [`3dToutcount` command in AFNI](http://afni.nimh.nih.gov/afni).
- Mean Distance to Median Volume [*func_quality*]: The mean distance (1 - Spearman’s rho) between each time-point’s volume and the median volume using [AFNI’s `3dTqual` command](http://afni.nimh.nih.gov/afni).
- Mean Framewise Displacement (FD) [*func_mean_fd*]: A measure of subject head motion, which compares the motion between the current and previous volumes. It is calculated by summing the absolute values of the displacement changes in the x, y, and z directions and the rotational changes about those three axes. The rotational changes are converted to distances based on the displacement across the surface of a sphere of radius 50 mm.
- Number FD greater than 0.2 mm [*func_num_fd*]: The number of frames or volumes with displacement greater than 0.2 mm.
- Percent FD greater than 0.2 mm [*func_perc_fd*]: The percent of frames or volumes with displacement greater than 0.2 mm.
- Ghost to Signal Ratio [*func_gsr*]: A measure of the mean signal in the ‘ghost’ image (signal present outside the brain due to acquisition in the phase encoding direction) relative to the mean signal within the brain.

Manual QA measures: Manual inspection of the data was carried out by three independent raters.

More information and meta-data are available in the data-provenance DOCX in the [DSPA ABIDE Case-Study Folder](https://umich.instructure.com/courses/38100/files/folder/Case_Studies/17_ABIDE_Autism_CaseStudy).

## Load in the data

```{r warning=FALSE, error=FALSE, message=FALSE}
# install.packages("magrittr")
library(magrittr)

# Load the ABIDE data (ABIDE_Aggregated_Data.csv)
ABIDE_data <- read.csv('https://umich.instructure.com/files/20935287/download?download_frd=1', header = TRUE)
dim(ABIDE_data) # 1098 2145

# attach(ABIDE_data) is avoided here: attached copies of the columns would not
# reflect the data cleaning below. Reference columns as ABIDE_data$<name>.
```

## Data Modeling, EDA

```{r warning=FALSE, error=FALSE, message=FALSE}
# Review the data element types
# colnames(ABIDE_data)

# Potentially relevant outcomes (Y)
table(ABIDE_data$researchGroup)
# Autism Control
#    528     570
table(ABIDE_data$subjectSex)

# Data cleaning (QC)
# Replace the missing IQ values (coded as -9999) with 30
ABIDE_data$iq <- replace(ABIDE_data$iq, ABIDE_data$iq < 0, 30)

# Visualize the data
# table(ABIDE_data$iq)
library(plotly)
xLabel <- list(title = "Intelligence (IQ)")
yLabel <- list(title = "Frequency")
plot_ly(ABIDE_data, x = ~iq, type = "histogram") %>%
  layout(xaxis = xLabel, yaxis = yLabel)

# MODEL the data
# Fit and plot linear models according to specified predictors and outcomes
fitPlot_LM_Model <- function(Y, X) {
  # Y = outcome column name
  # X = vector of predictor column names
  ### .......
  # return (myPlot)
}

# Run the full model fitting and display the model predictions
```

# Predict diagnosis (Autism vs. Control)

```{r warning=FALSE, error=FALSE, message=FALSE}
# Logit modeling (e.g., researchGroup as the binary outcome)
```

# Multiple Imputation of incomplete Data

Introduce some missing-completely-at-random (MCAR) deletions.
Impute the missing values and compare the (simulated-missing) data and models to their complete (original) data counterparts.

```{r warning=FALSE, error=FALSE, message=FALSE}
# Introduce simulated MCAR missingness

# Imputation

# The Rhat convergence statistic compares the variance between chains to the
# variance within chains (similar to the ANOVA F-test).
# Rhat values ~ 1.0 indicate likely convergence;
# Rhat values > 1.1 indicate that the chains should be run longer
# (use a larger number of iterations)

# Compare the results of the complete-data models to the imputed-data models

# Plot the resulting models and quantify model differences
```

# Mixture Distribution modeling

Using the more elaborate mixture-distribution modeling approach described in [DSPA Chapter 3](https://www.socr.umich.edu/people/dinov/courses/DSPA_notes/03_DataVisualization.html#616_Mixture_distribution_data_modeling), develop some forward prediction models.

# Unsupervised clustering

Try some of the DSPA unsupervised clustering and classification techniques on the ABIDE dataset.

# Venture beyond ...

Think out-of-the-box in this interactive-learning project using the ABIDE data. Try to use the RMD source and the provided data to experiment with novel AI/ML techniques. Think of ways to **augment these data** (expand the sample and increase the feature richness).

# References

- [DSPA Techniques](https://dspa.predictive.space/).
- [Autism ABIDE Dataset](https://umich.instructure.com/courses/38100/files/folder/Case_Studies/17_ABIDE_Autism_CaseStudy).