Homework 4
- Due Oct 27, 2023 by 11:59pm
- Points 100
- Submitting a file upload
Homework Project 4
- Homeworks, projects and assignments
- Homework Submission Rules
- Homework Headers
Problem 4.1:
Use the SOCR 2011 US Job Satisfaction data, where the last column (Description) contains free text describing each job type. Replace all underscores (_) in the job descriptions with spaces and develop an R protocol to examine the job-stress level and hiring potential using the job description (JD) text:
- Remove the index (column 1), which contains the rank-ordering of the jobs.
- Convert the textual JD meta-data into a corpus object.
- Triage some of the irrelevant punctuation and other symbols in the corpus document, change all text to lower case, etc.
- Tokenize the job descriptions into words. Examine the distributions of Physical_Demand and Hiring_Potential.
- Split the data 90:10 training:testing (randomly).
- Binarize the Job Physical Demand into two categories (low/high stress levels, < 20 lbs vs. ≥ 20 lbs) in both the training and testing data.
- Generate a word cloud to visualize the job descriptions (training data).
- Graphically visualize the difference between low and high physical-demand categories.
- Transform the word count features into categorical data.
- Ignore low frequency words and report the sparsity of your categorical data matrix.
- Analytics based on document-term matrices of the JDs
- Apply the Naïve Bayes classifier on the high frequency terms to predict the job physical-demand (low/high).
- Fit an LDA prediction model for job physical-demand and compare it to the Naïve Bayes classifier. Report the error rates, specificity, and sensitivity (on the testing data).
- Use C5.0 and rpart to train a decision tree and compare their physical-demand predictions to their Naïve Bayes counterparts (report results on testing data).
- Fit a multivariate linear model to predict Overall job ranking (smaller is better). Generate some informative pairs plots. Use backward step-wise feature selection to simplify the model and report the AIC.
- Would these models of physical demand improve if we add some of the numerical covariates (e.g., *Average_Income*) to the derived JD text predictors?
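
As a starting point only, the early preprocessing steps and the evaluation metrics listed above could be sketched in base R roughly as follows. This is a minimal sketch under stated assumptions, not a reference solution: the toy data frame `jobs` is invented for illustration (it is not the SOCR data), the helper `class_metrics` and the variable names are hypothetical, and the cut at 20 lbs treats values of exactly 20 as "high".

```r
# Toy stand-in for the SOCR table. These rows are INVENTED for illustration only.
jobs <- data.frame(
  Index = 1:10,
  Physical_Demand = c(5, 8, 12, 18, 20, 25, 30, 35, 40, 50),
  Description = c(
    "Designs_and_tests software; writes code.",
    "Analyzes_data and builds statistical_models.",
    "Teaches students, grades_exams.",
    "Prepares financial_reports for clients.",
    "Repairs_engines and replaces parts.",
    "Installs electrical_wiring in buildings.",
    "Loads_and_unloads freight from trucks.",
    "Operates heavy_equipment on site.",
    "Lays bricks, mixes_mortar.",
    "Fells_trees and hauls timber."
  ),
  stringsAsFactors = FALSE
)

# Remove the rank-ordering index (column 1) and replace underscores by spaces.
jobs <- jobs[, -1]
jobs$Description <- gsub("_", " ", jobs$Description, fixed = TRUE)

# Basic cleaning: lower case, drop punctuation, collapse repeated whitespace.
clean <- tolower(jobs$Description)
clean <- gsub("[[:punct:]]+", " ", clean)
clean <- trimws(gsub("[[:space:]]+", " ", clean))

# Tokenize each job description into words.
tokens <- strsplit(clean, " ", fixed = TRUE)

# Binarize physical demand (assumed cut: low is < 20 lbs, high is >= 20 lbs).
jobs$Demand_Class <- factor(
  ifelse(jobs$Physical_Demand < 20, "low", "high"),
  levels = c("low", "high")
)

# Random 90:10 training:testing split (seed fixed for reproducibility).
set.seed(1234)
n <- nrow(jobs)
test_idx <- sample(n, size = round(0.1 * n))
train <- jobs[-test_idx, ]
test  <- jobs[test_idx, ]

# Hypothetical helper: error rate, sensitivity, and specificity for a binary
# classifier, treating `positive` as the class of interest.
class_metrics <- function(truth, pred, positive = "high") {
  truth <- as.character(truth)
  pred  <- as.character(pred)
  tp <- sum(pred == positive & truth == positive)
  tn <- sum(pred != positive & truth != positive)
  fp <- sum(pred == positive & truth != positive)
  fn <- sum(pred != positive & truth == positive)
  c(error_rate  = (fp + fn) / length(truth),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp))
}
```

Once a classifier has produced predictions for the held-out jobs, a call such as `class_metrics(test$Demand_Class, predicted)` (with `predicted` being a hypothetical vector of predicted labels) would give the three quantities to report. The corpus, document-term matrix, word cloud, and model-fitting steps are not covered by this sketch.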
Rubric
Criteria | Ratings | Pts
---|---|---
Correctness and scientific validity | |
Result reproducibility | |
Content focus, presentation style, and clarity | |

Total Points: 100