HW4

Due: Oct 30, 2020 by 11:59pm
Points: 100
Submitting: a file upload

Homework Project 4

Problem 4.1:

Use the SOCR 2011 US Job Satisfaction data, where the last column (Description) contains free text describing each job type. Replace all underscores (_) in the job descriptions with spaces, then develop an R protocol to examine job-stress level and hiring potential using the job description (JD) text:

- Remove the index (column 1), which contains the rank-ordering of the jobs.
- Convert the textual JD meta-data into a corpus object.
- Remove irrelevant punctuation and other symbols from the corpus documents, change all text to lower case, etc.
- Tokenize the job descriptions into words. Examine the distributions of Stress_Category and Hiring_Potential.
- Split the data 90:10 training:testing (randomly).
- Binarize the job stress into two categories (low/high stress levels), separately for training and testing data.
- Generate a word cloud to visualize the job descriptions (training data).
- Graphically visualize the difference between low and high stress categories.
- Transform the word count features into categorical data.
- Ignore low-frequency words and report the sparsity of your categorical data matrix.
- Analytics:
  - Apply the Naive Bayes classifier on the high-frequency terms to predict stress level (low/high).
  - Fit an LDA prediction model for job stress level and compare it to the Naive Bayes classifier (stress level). Report the error rates, specificity, and sensitivity (on testing data).
  - Use C5.0 and rpart to train decision trees and compare their job-stress predictions to their Naive Bayes counterparts (report results on testing data).
  - Fit a multivariate linear model to predict Overall job ranking (smaller is better). Generate some informative pairs plots. Use backward step-wise feature selection to simplify the model, and report the AIC.

Would these models of stress improve if we add some of the numerical covariates (e.g., `Average_Income`) to the derived JD text predictors?
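
The data-preparation steps and the requested test-set metrics can be sketched in base R as follows. This is only an illustrative outline, not a reference solution: the toy data frame, its values, the median-based stress cutoff, and the helper names `binarize_stress` and `classification_metrics` are all assumptions made for the example, and the real SOCR table has more columns than shown here.

```r
# Toy stand-in for the job data; all rows and values are invented for illustration.
jobs <- data.frame(
  Index = 1:10,
  Stress_Category = c(0, 1, 1, 2, 2, 3, 3, 4, 4, 5),
  Description = c("research_design_software", "study_body_function",
                  "apply_math_theory", "analyze_statistics_risk",
                  "plan_computer_systems", "forecast_weather_data",
                  "fight_fires_rescue", "drive_taxi_city",
                  "report_news_deadline", "fell_trees_forest"),
  stringsAsFactors = FALSE
)

# Replace underscores with spaces and convert the text to lower case.
jobs$Description <- tolower(gsub("_", " ", jobs$Description, fixed = TRUE))

# Drop the rank-ordering index (column 1).
jobs <- jobs[, -1]

# Random 90:10 training:testing split; the seed makes the split reproducible.
set.seed(1234)
n <- nrow(jobs)
train_idx <- sample(n, size = round(0.9 * n))
train <- jobs[train_idx, ]
test <- jobs[-train_idx, ]

# Binarize stress into low/high, separately for each subset.
# ASSUMPTION: the cutoff is the training-set median; other cutoffs are equally valid.
binarize_stress <- function(x, cutoff) {
  factor(ifelse(x > cutoff, "high", "low"), levels = c("low", "high"))
}
cutoff <- median(train$Stress_Category)
train$Stress_Bin <- binarize_stress(train$Stress_Category, cutoff)
test$Stress_Bin <- binarize_stress(test$Stress_Category, cutoff)

# Error rate, sensitivity, and specificity, treating "high" as the positive class.
classification_metrics <- function(truth, predicted, positive = "high") {
  tp <- sum(predicted == positive & truth == positive)
  tn <- sum(predicted != positive & truth != positive)
  fp <- sum(predicted == positive & truth != positive)
  fn <- sum(predicted != positive & truth == positive)
  c(error_rate = (fp + fn) / length(truth),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}
```

Once any of the classifiers has produced predictions for the testing data, a call such as `classification_metrics(test$Stress_Bin, predicted_labels)` (with `predicted_labels` being a hypothetical vector of predicted classes) would give the three quantities to report. Applying the same helper to every model keeps the comparisons consistent.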

Rubric

DSPA Rubric

- Correctness and scientific validity: 50 pts (Full Marks: 50 pts; No Marks: 0 pts)
- Result reproducibility: 30 pts (Full Marks: 30 pts; No Marks: 0 pts)
- Content focus, presentation style, and clarity: 20 pts (Full Marks: 20 pts; No Marks: 0 pts)

Total Points: 100
