HW4
- Due Oct 30, 2020 by 11:59pm
- Points 100
- Submitting a file upload
Homework Project 4
- Due Fri, Oct 30, 2020
- Homeworks, projects and assignments
- Homework Submission Rules
- Homework Headers
Problem 4.1:
Use the SOCR 2011 US Job Satisfaction data, where the last column (`Description`) contains free text describing each job type. Replace all underscores (`_`) in the job descriptions with spaces and develop an R protocol to examine the job-stress level and hiring potential using the job description (JD) text:
- Remove the index (column 1), which contains the rank-ordering of the jobs
- Convert the textual JD meta-data into a corpus object.
- Triage some of the irrelevant punctuation and other symbols in the corpus document, change all text to lower case, etc.
- Tokenize the job descriptions into words. Examine the distributions of `Stress_Category` and `Hiring_Potential`.
- Split the data 90:10 training:testing (randomly).
- Binarize the Job Stress into two categories (low/high stress levels), separately for training and testing data.
- Generate a word cloud to visualize the job descriptions (training data).
- Graphically visualize the difference between low and high stress categories.
- Transform the word count features into categorical data.
- Ignore low frequency words and report the sparsity of your categorical data matrix.
- Analytics:
- Apply the Naive Bayes classifier on the high frequency terms to predict stress level (low/high).
- Fit an LDA prediction model for job stress level and compare it to the Naive Bayes classifier (stress level); report the error rates, specificity, and sensitivity (on testing data).
- Use `C5.0` and `rpart` to train a decision tree and compare their job-stress predictions to their Naive Bayes counterparts (report results on testing data).
- Fit a multivariate linear model to predict Overall job ranking (smaller is better). Generate some informative pairs plots. Use `backward` step-wise feature selection to simplify the model, and report the AIC.
- Would these models of stress improve if we add to the derived JD text predictors some of the numerical covariates (e.g., `Average_Income`)?
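As a starting point only, the text clean-up, the random 90:10 split, the stress binarization, and the test-set metrics listed above might be sketched in base R as follows. This is a minimal sketch, not the expected solution: the toy `jobs` data frame is invented for illustration, the helper names `clean_text`, `binarize_stress`, and `classification_metrics` are hypothetical, and the binarization cutoff of 3 is an assumption that you should justify from the observed distribution of `Stress_Category`.

```r
# Toy stand-in for the job data: these rows are invented for illustration only.
jobs <- data.frame(
  Description = rep(c("Develops_software, tests_code",
                      "Fights fires; rescues_people"), times = 5),
  Stress_Category = rep(c(1, 4), times = 5),
  stringsAsFactors = FALSE
)

# Replace underscores with spaces, lower-case, strip punctuation, squeeze spaces.
clean_text <- function(x) {
  x <- gsub("_", " ", x, fixed = TRUE)
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", " ", x)
  trimws(gsub("[[:space:]]+", " ", x))
}
jobs$Description <- clean_text(jobs$Description)

# Tokenize on single spaces and count word frequencies.
tokens <- strsplit(jobs$Description, " ", fixed = TRUE)
word_freq <- sort(table(unlist(tokens)), decreasing = TRUE)

# Binarize stress; the cutoff of 3 is an assumption, not part of the assignment.
binarize_stress <- function(s, cutoff = 3) {
  factor(ifelse(s >= cutoff, "high", "low"), levels = c("low", "high"))
}

# Random 90:10 split, then binarize each part separately.
set.seed(1234)
n <- nrow(jobs)
train_idx <- sample(n, size = floor(0.9 * n))
train <- jobs[train_idx, ]
test <- jobs[-train_idx, ]
train$stress <- binarize_stress(train$Stress_Category)
test$stress <- binarize_stress(test$Stress_Category)

# Error rate, sensitivity, and specificity, treating "high" as the positive class.
classification_metrics <- function(truth, pred, positive = "high") {
  tp <- sum(pred == positive & truth == positive)
  tn <- sum(pred != positive & truth != positive)
  fp <- sum(pred == positive & truth != positive)
  fn <- sum(pred != positive & truth == positive)
  c(error_rate = (fp + fn) / length(truth),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}
```

Once a classifier has produced predictions for the testing rows, a call such as `classification_metrics(test$stress, predicted)` (with `predicted` standing for that model's output) gives the three quantities to report. The same helper can be reused for the Naive Bayes, LDA, `C5.0`, and `rpart` models so that the comparisons rest on identical definitions.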
Rubric
Criteria | Ratings | Pts
---|---|---
Correctness and scientific validity | | --
Result reproducibility | | --
Content focus, presentation style, and clarity | | --

Total Points: 100