HW2
- Due Oct 2, 2020 by 11:59pm
- Points 100
- Submitting a file upload
Homework Project 2
- Homeworks, projects and assignments
- Homework Submission Rules
- Homework Headers
Problem 2.1 (Data Manipulation):
Load the following two datasets separately, generate summary statistics for all features, plot some of the features using histograms, box plots, density plots, etc., as appropriate, and save the summaries locally as text files.
- Use the ALS case-study data and the SOCR Knee Pain Data to explore some bivariate relations (e.g., bivariate plots, correlations, cross-tables).
- Use the 07_UMich_AnnArbor_MI_TempPrecipitation_HistData_1900_2015 data to show the relations between temperature and time. [Hint: use geom_line or geom_bar.] Some sample code is included below.
<code>
library(ggplot2)  # provides ggplot(), geom_bar(), and geom_line()

Temp_Data <- as.data.frame(read.csv("https://umich.instructure.com/files/706163/download?download_frd=1",
                                    header=TRUE, na.strings=c("", ".", "NA", "NR")))
summary(Temp_Data)
# View(Temp_Data); colnames(Temp_Data)

# Wide-to-Long transformation: reshape arguments include
# (1) list of variable names that define the different times or metrics (varying),
# (2) the name we wish to give the variable containing these values in our long dataset (v.names),
# (3) the name we wish to give the variable describing the different times or metrics (timevar),
# (4) the values this variable will have (times), and
# (5) the end format for the data (direction)
# Before reshaping make sure all data types are the same as putting them in 1 column will
# otherwise generate inconsistencies/errors
colN <- colnames(Temp_Data[,-1])
longTempData <- reshape(Temp_Data, varying = colN, v.names = "Temps", timevar="Months",
                        times = colN, direction = "long")

# chronologically order months (calendar not alphabetically)
longTempData$Months <- factor(longTempData$Months, levels = month.abb)
# View(longTempData)

# Bar plot of temperatures by month
bar2 <- ggplot(longTempData, aes(x = Months, y = Temps, fill = Months)) +
  geom_bar(stat = "identity")
print(bar2)

# Bar plot of temperatures by year, stacked by month
bar3 <- ggplot(longTempData, aes(x = Year, y = Temps, fill = Months)) +
  geom_bar(stat = "identity")
print(bar3)

# Line plot of temperature over time, one line per month
p <- ggplot(longTempData, aes(x=Year, y=as.integer(Temps), colour=Months)) +
  geom_line()
p
</code>
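For the first part of Problem 2.1 (summaries, plots, and bivariate relations), the workflow might look roughly like the base-R sketch below. It is only an illustration: `demo_df` and its columns (`age`, `score`, `group`) are hypothetical stand-ins for the ALS or Knee Pain data, and the output file name is an assumption.

```r
# Hypothetical stand-in data; replace with the ALS or Knee Pain data frame.
set.seed(1234)
demo_df <- data.frame(
  age   = round(rnorm(100, mean = 55, sd = 10)),
  score = rnorm(100, mean = 30, sd = 5),
  group = factor(sample(c("A", "B"), 100, replace = TRUE))
)

# Summary statistics for all features, saved locally as a text file
# (the file name and location are assumptions)
summary_file <- file.path(tempdir(), "demo_summary.txt")
capture.output(summary(demo_df), file = summary_file)

# Bivariate relations: a correlation and a cross-table
age_score_cor <- cor(demo_df$age, demo_df$score, use = "complete.obs")
cross_tab <- table(group = demo_df$group, older_than_55 = demo_df$age > 55)
print(age_score_cor)
print(cross_tab)

# Univariate and bivariate plots: histogram, box plot, density plot
hist(demo_df$age, main = "Age", xlab = "Age")
boxplot(score ~ group, data = demo_df, main = "Score by group")
plot(density(demo_df$score), main = "Score density")
```

`capture.output()` writes exactly what `summary()` would print to the console, which keeps the saved text file easy to compare against the interactive output.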
Problem 2.3 (Missing Data)
Introduce (artificially) some missing data in the Knee Pain dataset, impute the missing values, and examine the differences between the original, incomplete, and imputed datasets.
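One possible shape for this exercise is sketched below in base R. The data frame `original` with columns `x` and `y` is a hypothetical stand-in for the numeric Knee Pain features, the 20% missingness rate is an arbitrary choice, and median imputation is used only because it needs no extra packages; a model-based imputation method could be substituted.

```r
set.seed(2020)
# Hypothetical stand-in for numeric Knee Pain features
original <- data.frame(x = rnorm(200, 50, 10), y = rnorm(200, 100, 20))

# Introduce 20% missing values completely at random in each column
incomplete <- original
for (col in names(incomplete)) {
  miss_idx <- sample(nrow(incomplete), size = 0.2 * nrow(incomplete))
  incomplete[miss_idx, col] <- NA
}

# Simple median imputation (an assumption; other imputation methods may be preferred)
imputed <- incomplete
for (col in names(imputed)) {
  imputed[is.na(imputed[[col]]), col] <- median(imputed[[col]], na.rm = TRUE)
}

# Compare column means and standard deviations across the three versions
comparison <- rbind(
  original_mean   = colMeans(original),
  incomplete_mean = colMeans(incomplete, na.rm = TRUE),
  imputed_mean    = colMeans(imputed),
  original_sd     = sapply(original, sd),
  incomplete_sd   = sapply(incomplete, sd, na.rm = TRUE),
  imputed_sd      = sapply(imputed, sd)
)
print(round(comparison, 2))
```

Filling every gap with a single value shrinks the spread of the imputed columns, so the standard deviations are worth comparing alongside the means.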
Problem 2.4 (Surface Plots)
Generate a surface plot for the (RF) Knee Pain data illustrating the 2D distribution of locations of the patient reported knee pain (use plot_ly and kernel density estimation).
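A minimal sketch of the density-surface idea follows, assuming the pain locations are available as two numeric coordinate vectors. The vectors `pain_x` and `pain_y` are simulated placeholders, the 50-by-50 grid is an arbitrary choice, and `MASS::kde2d` is one of several ways to obtain a 2D kernel density estimate. The plot itself is built only if the plotly package is installed.

```r
library(MASS)  # kde2d() for 2D kernel density estimation

set.seed(7)
# Hypothetical stand-in for the (x, y) pain locations of one knee view
pain_x <- rnorm(500, mean = 300, sd = 40)
pain_y <- rnorm(500, mean = 200, sd = 60)

# Estimate the 2D density on a 50 x 50 grid
dens <- kde2d(pain_x, pain_y, n = 50)

# Render the density as an interactive surface (requires plotly)
if (requireNamespace("plotly", quietly = TRUE)) {
  # plotly expects z[i, j] to match (y[i], x[j]); kde2d returns z[i, j] for
  # (x[i], y[j]), hence the transpose
  fig <- plotly::plot_ly(x = dens$x, y = dens$y, z = t(dens$z), type = "surface")
  print(fig)
}
```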
Problem 2.5 (Sample-Size Rebalancing)
Rebalance the groups of ALS (training data) patients according vs. , based on the synthetic minority oversampling technique (SMOTE) to ensure approximately equal cohort sizes.
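To illustrate the oversampling idea, here is a hand-written base-R sketch of SMOTE-style interpolation. It is not a package implementation: `smote_sketch` is a hypothetical helper, and the two simulated cohorts (`majority`, `minority`) with features `f1` and `f2` are placeholders for whichever ALS grouping is used. Each synthetic case is placed at a random point on the segment between a minority case and one of its `k` nearest minority neighbours.

```r
set.seed(42)
# Hypothetical imbalanced cohorts: 90 majority vs. 10 minority cases
majority <- data.frame(f1 = rnorm(90, 0, 1), f2 = rnorm(90, 0, 1))
minority <- data.frame(f1 = rnorm(10, 2, 1), f2 = rnorm(10, 2, 1))

# Hypothetical helper: create n_new synthetic minority rows by interpolating
# between a random minority case and one of its k nearest minority neighbours
smote_sketch <- function(minority, n_new, k = 5) {
  m <- as.matrix(minority)
  d <- as.matrix(dist(m))   # pairwise Euclidean distances
  diag(d) <- Inf            # a case is never its own neighbour
  synthetic <- matrix(NA_real_, nrow = n_new, ncol = ncol(m),
                      dimnames = list(NULL, colnames(m)))
  for (i in seq_len(n_new)) {
    base <- sample(nrow(m), 1)
    neighbours <- order(d[base, ])[seq_len(k)]
    nb <- neighbours[sample.int(k, 1)]
    gap <- runif(1)         # position along the segment base -> neighbour
    synthetic[i, ] <- m[base, ] + gap * (m[nb, ] - m[base, ])
  }
  as.data.frame(synthetic)
}

# Oversample the minority cohort up to the majority size
synthetic <- smote_sketch(minority, n_new = nrow(majority) - nrow(minority))
balanced <- rbind(
  cbind(majority, class = "majority"),
  cbind(rbind(minority, synthetic), class = "minority")
)
print(table(balanced$class))
```

Because the synthetic cases are interpolated rather than duplicated, they stay within the range of the observed minority cases while avoiding exact copies.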
Rubric
Criteria | Ratings | Pts
---|---|---
Correctness and scientific validity | |
Result reproducibility | |
Content focus, presentation style, and clarity | |
Total Points: 100