HW_Project_4
- Due Mar 9, 2018 by 11:59pm
- Points 100
- Submitting a file upload
Homework Project 4
- Homeworks, projects and assignments
- Homework Submission Rules
- Homework Headers
Problem 4.1:
Use the SOCR 2011 US Job Satisfaction data, where the last column (Description) contains free text describing each job type. Replace all underscores (_) in the job descriptions with spaces and construct an R protocol to examine the job-stress level and hiring-potential using the job description (JD) text:
- Split the data 90:10 training:testing (randomly).
- Convert the textual JD meta-data into a corpus object.
- Triage some of the irrelevant punctuation and other symbols in the corpus document, change all text to lower case, etc.
- Tokenize the job descriptions into words. Examine the distributions of Stress_Category and Hiring_Potential.
- Binarize the Job Stress into two categories (low/high stress levels), separately for training and testing data.
- Generate a word cloud to visualize the job descriptions (training data).
- Graphically visualize the difference between low and high stress categories.
- Transform the word count features into categorical data.
- Ignore low frequency words and report the sparsity of your categorical data matrix.
- Apply the Naive Bayes classifier on the high frequency terms.
- Fit an LDA prediction model for job stress level and compare it to the Naive Bayes classifier (stress level). Report the error rates, specificity, and sensitivity (on testing data).
- Use C5.0 and rpart to train decision trees and compare their job-stress predictions to their Naive Bayes counterparts (report results on testing data).
- Fit a multivariate linear model to predict the Overall job ranking (smaller is better). Generate some informative pairs plots. Use backward step-wise feature selection to simplify the model and report the AIC.
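The first preprocessing steps in the list (underscore replacement, the random 90:10 split, and binarizing job stress) might be sketched in base R as follows. This is only a minimal sketch under stated assumptions: the data frame `jobs` is a small made-up stand-in for the SOCR table, not the real data; the column names `Description` and `Stress_Category` come from the assignment text; `Stress_Category` is assumed to be a numeric score; and the binarization cutoff (the median of the training data) is one possible choice that the assignment does not prescribe.

```r
# Toy stand-in for the SOCR job data (hypothetical values, NOT the real data).
set.seed(1234)
n <- 20
jobs <- data.frame(
  Description = paste0("job_type_", seq_len(n), "_sample_text"),
  Stress_Category = sample(0:5, n, replace = TRUE),  # assumed numeric stress score
  stringsAsFactors = FALSE
)

# Replace every underscore in the job descriptions with a space.
jobs$Description <- gsub("_", " ", jobs$Description, fixed = TRUE)

# Random 90:10 training:testing split on row indices.
train_idx <- sample(seq_len(nrow(jobs)), size = floor(0.9 * nrow(jobs)))
jobs_train <- jobs[train_idx, ]
jobs_test  <- jobs[-train_idx, ]

# Binarize stress into low/high, applied separately to each subset.
# Assumption: the cutoff is the median stress score of the training data.
cutoff <- median(jobs_train$Stress_Category)
binarize <- function(x, cutoff) {
  factor(ifelse(x > cutoff, "high", "low"), levels = c("low", "high"))
}
jobs_train$Stress <- binarize(jobs_train$Stress_Category, cutoff)
jobs_test$Stress  <- binarize(jobs_test$Stress_Category, cutoff)

# Examine the resulting class distribution in the training data.
print(table(jobs_train$Stress))
```

With the real data, `jobs` would instead be read from the SOCR table, and the cutoff should be chosen after examining the actual distribution of Stress_Category.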
Rubric
Criteria | Ratings | Pts
---|---|---
Correctness and scientific validity | |
Result reproducibility | |
Content focus, presentation style, and clarity | |
Total Points: 100