Classification of job titles into categories, using different ML techniques

rootstrap, updated 🕥 2022-09-30 18:50:05

Classification of job positions by area

This is a project to classify job positions using machine learning, more specifically, supervised learning. The main goal is to get a classifier that receives a job position in the form of a sentence, written in natural language, for example CEO and Founder and returns the job area for that position. The different areas (labels for the classification) are:
* Business * Technical * Sales-Marketing * Other

In the example, CEO and Founder would return Business.

There is an analogous project but it classifies according to the level of the position: Classification of job positions by level.

Implementation

This project is programed using the Python language. The trained classifiers are implemented in the Scikit Learn library, a set of tools for machine learning in Python. Two classification classes are studied:
* SGDClassifier * MLPClassifier

Needed libraries

If you use pip and virtual environments, you can install easily the needed libraries, such as pandas and scikit learn:
pip install -r requirements.txt.

Process data

Since both classifiers belong to supervised learning, they are trained using manually classified data, that you can see on data_process/data_sets/classified_titles.tsv. That is a tab-separated-values file, that has two columns in the form:
<job position> | <classification for the job position>.
The script data_process/tsv_to_dataframe.py takes the tsv file and: 1. Generates a dataframe that represents the job positions and corresponding classifications. 2. Split the dataframe into train(X) and test(y) set. 3. Normalizes the dataframe according to a defined criteria. 4. Stores X_train, X_test, y_train, y_test sets.

Training and tuning

A classifier has:
* parameters: values that corresponds to the mathematical model, that are adjusted after the training. * hyper-parameters: values related to the way of training, that are adjusted using a selected part of the training set.

It's possible to use a fit function to train and adjust the params. Besides, Scikit learn provides a tool named Pipeline, and GridSearchCV in order to make exhaustive search to achieve the hyperparams that optimize the results.

Script execution

The steps are the same as the classification by level: 1. Run data_process/tsv_file_to_dataframe.py to extract the data from the tsv file and split the dataset. 2. Run <clf name>_fit_tune_classifier.py to fit and tune the classifier. fit is to learn and fit the model to the train set. tune is to search for the optimal combination of the hyperparams, the ones that achieves better results (tuning may take a while). 3. Run <clf name>_test_classifier.py to test the trained classifiers and show the results. Besides, a classified example set is stored in test_data/<clf name>_results.tsv.

Notes: * <clf name> can be mlp or sgd, depending on the classifier. * mlp_fit_tune_classifier.py can take days to complete the tuning.

Results

These are the results of each classification class:

MLP

``` precision recall f1-score support

   Business       0.91      0.92      0.91        63
      Other       0.79      0.86      0.83        44

Sales-Marketing 0.90 0.90 0.90 10 Technical 0.86 0.73 0.79 33

   accuracy                           0.86       150
  macro avg       0.86      0.85      0.86       150

weighted avg 0.86 0.86 0.86 150 ```

SGD

``` precision recall f1-score support

   Business       0.97      0.97      0.97        63
      Other       0.95      0.95      0.95        44

Sales-Marketing 0.89 0.80 0.84 10 Technical 0.97 1.00 0.99 33

   accuracy                           0.96       150
  macro avg       0.95      0.93      0.94       150

weighted avg 0.96 0.96 0.96 150 ```

SGD has better average than MLP. SGD uses SVM model, a good model for text classification.

Issues

Bump joblib from 0.14.1 to 1.2.0

opened on 2022-09-30 18:50:04 by dependabot[bot]

Bumps joblib from 0.14.1 to 1.2.0.

Changelog

Sourced from joblib's changelog.

Release 1.2.0

  • Fix a security issue where eval(pre_dispatch) could potentially run arbitrary code. Now only basic numerics are supported. joblib/joblib#1327

  • Make sure that joblib works even when multiprocessing is not available, for instance with Pyodide joblib/joblib#1256

  • Avoid unnecessary warnings when workers and main process delete the temporary memmap folder contents concurrently. joblib/joblib#1263

  • Fix memory alignment bug for pickles containing numpy arrays. This is especially important when loading the pickle with mmap_mode != None as the resulting numpy.memmap object would not be able to correct the misalignment without performing a memory copy. This bug would cause invalid computation and segmentation faults with native code that would directly access the underlying data buffer of a numpy array, for instance C/C++/Cython code compiled with older GCC versions or some old OpenBLAS written in platform specific assembly. joblib/joblib#1254

  • Vendor cloudpickle 2.2.0 which adds support for PyPy 3.8+.

  • Vendor loky 3.3.0 which fixes several bugs including:

    • robustly forcibly terminating worker processes in case of a crash (joblib/joblib#1269);

    • avoiding leaking worker processes in case of nested loky parallel calls;

    • reliability spawn the correct number of reusable workers.

Release 1.1.0

  • Fix byte order inconsistency issue during deserialization using joblib.load in cross-endian environment: the numpy arrays are now always loaded to use the system byte order, independently of the byte order of the system that serialized the pickle. joblib/joblib#1181

  • Fix joblib.Memory bug with the ignore parameter when the cached function is a decorated function.

... (truncated)

Commits
  • 5991350 Release 1.2.0
  • 3fa2188 MAINT cleanup numpy warnings related to np.matrix in tests (#1340)
  • cea26ff CI test the future loky-3.3.0 branch (#1338)
  • 8aca6f4 MAINT: remove pytest.warns(None) warnings in pytest 7 (#1264)
  • 067ed4f XFAIL test_child_raises_parent_exits_cleanly with multiprocessing (#1339)
  • ac4ebd5 MAINT add back pytest warnings plugin (#1337)
  • a23427d Test child raises parent exits cleanly more reliable on macos (#1335)
  • ac09691 [MAINT] various test updates (#1334)
  • 4a314b1 Vendor loky 3.2.0 (#1333)
  • bdf47e9 Make test_parallel_with_interactively_defined_functions_default_backend timeo...
  • Additional commits viewable in compare view


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/rootstrap/ai-job-title-area-classification/network/alerts).

Bump numpy from 1.17.4 to 1.22.0

opened on 2022-06-22 00:01:03 by dependabot[bot]

Bumps numpy from 1.17.4 to 1.22.0.

Release notes

Sourced from numpy's releases.

v1.22.0

NumPy 1.22.0 Release Notes

NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

  • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
  • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across application such as CuPy and JAX.
  • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
  • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
  • A new configurable allocator for use by downstream projects.

These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

Expired deprecations

Deprecated numeric style dtype strings have been removed

Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

(gh-19539)

Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

(gh-19615)

... (truncated)

Commits


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/rootstrap/ai-job-title-area-classification/network/alerts).

Bump ipython from 7.10.2 to 7.16.3

opened on 2022-01-21 20:10:27 by dependabot[bot]

Bumps ipython from 7.10.2 to 7.16.3.

Commits


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/rootstrap/ai-job-title-area-classification/network/alerts).

Bump pygments from 2.5.2 to 2.7.4

opened on 2021-03-29 21:25:20 by dependabot[bot]

Bumps pygments from 2.5.2 to 2.7.4.

Release notes

Sourced from pygments's releases.

2.7.4

  • Updated lexers:

    • Apache configurations: Improve handling of malformed tags (#1656)

    • CSS: Add support for variables (#1633, #1666)

    • Crystal (#1650, #1670)

    • Coq (#1648)

    • Fortran: Add missing keywords (#1635, #1665)

    • Ini (#1624)

    • JavaScript and variants (#1647 -- missing regex flags, #1651)

    • Markdown (#1623, #1617)

    • Shell

      • Lex trailing whitespace as part of the prompt (#1645)
      • Add missing in keyword (#1652)
    • SQL - Fix keywords (#1668)

    • Typescript: Fix incorrect punctuation handling (#1510, #1511)

  • Fix infinite loop in SML lexer (#1625)

  • Fix backtracking string regexes in JavaScript/TypeScript, Modula2 and many other lexers (#1637)

  • Limit recursion with nesting Ruby heredocs (#1638)

  • Fix a few inefficient regexes for guessing lexers

  • Fix the raw token lexer handling of Unicode (#1616)

  • Revert a private API change in the HTML formatter (#1655) -- please note that private APIs remain subject to change!

  • Fix several exponential/cubic-complexity regexes found by Ben Caller/Doyensec (#1675)

  • Fix incorrect MATLAB example (#1582)

Thanks to Google's OSS-Fuzz project for finding many of these bugs.

2.7.3

... (truncated)

Changelog

Sourced from pygments's changelog.

Version 2.7.4

(released January 12, 2021)

  • Updated lexers:

    • Apache configurations: Improve handling of malformed tags (#1656)

    • CSS: Add support for variables (#1633, #1666)

    • Crystal (#1650, #1670)

    • Coq (#1648)

    • Fortran: Add missing keywords (#1635, #1665)

    • Ini (#1624)

    • JavaScript and variants (#1647 -- missing regex flags, #1651)

    • Markdown (#1623, #1617)

    • Shell

      • Lex trailing whitespace as part of the prompt (#1645)
      • Add missing in keyword (#1652)
    • SQL - Fix keywords (#1668)

    • Typescript: Fix incorrect punctuation handling (#1510, #1511)

  • Fix infinite loop in SML lexer (#1625)

  • Fix backtracking string regexes in JavaScript/TypeScript, Modula2 and many other lexers (#1637)

  • Limit recursion with nesting Ruby heredocs (#1638)

  • Fix a few inefficient regexes for guessing lexers

  • Fix the raw token lexer handling of Unicode (#1616)

  • Revert a private API change in the HTML formatter (#1655) -- please note that private APIs remain subject to change!

  • Fix several exponential/cubic-complexity regexes found by Ben Caller/Doyensec (#1675)

  • Fix incorrect MATLAB example (#1582)

Thanks to Google's OSS-Fuzz project for finding many of these bugs.

Version 2.7.3

(released December 6, 2020)

... (truncated)

Commits
  • 4d555d0 Bump version to 2.7.4.
  • fc3b05d Update CHANGES.
  • ad21935 Revert "Added dracula theme style (#1636)"
  • e411506 Prepare for 2.7.4 release.
  • 275e34d doc: remove Perl 6 ref
  • 2e7e8c4 Fix several exponential/cubic complexity regexes found by Ben Caller/Doyensec
  • eb39c43 xquery: fix pop from empty stack
  • 2738778 fix coding style in test_analyzer_lexer
  • 02e0f09 Added 'ERROR STOP' to fortran.py keywords. (#1665)
  • c83fe48 support added for css variables (#1633)
  • Additional commits viewable in compare view


Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `@dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `@dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `@dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `@dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/rootstrap/ai-job-title-area-classification/network/alerts).

regarding test data

opened on 2020-10-19 10:54:46 by GV67

Hi, When i use different test data, the sgd/mlp_results shows. ",,,,,,,," Could you please help with that?

Added Random forest classifier

opened on 2020-10-08 17:37:57 by Darknez07

Random forest with parameter tuning, According to instructions included a fit tuning file with clf_name_fit_tune.py And also Added the clf_name_test.py file, Will be giving out the output instances soon

hacktoberfest