This is a project to classify job positions using machine learning, more specifically, supervised learning. The main goal is to get a classifier
that receives a job position in the form of a sentence, written in natural language, for example CEO and Founder
and returns the job area for that position. The different areas (labels for the classification) are:
* Business
* Technical
* Sales-Marketing
* Other
In the example, CEO and Founder
would return Business
.
There is an analogous project but it classifies according to the level of the position: Classification of job positions by level.
This project is programed using the Python language. The trained classifiers are implemented in the Scikit Learn library, a set of tools for machine learning in Python.
Two classification classes are studied:
* SGDClassifier
* MLPClassifier
If you use pip and virtual environments, you can install easily the needed libraries, such as pandas and scikit learn:
pip install -r requirements.txt
.
Since both classifiers belong to supervised learning, they are trained using manually classified data, that you can see on data_process/data_sets/classified_titles.tsv
. That is a tab-separated-values file, that has two columns in the form:
<job position> | <classification for the job position>
.
The script data_process/tsv_to_dataframe.py
takes the tsv
file and:
1. Generates a dataframe that represents the job positions and corresponding classifications.
2. Split the dataframe into train(X) and test(y) set.
3. Normalizes the dataframe according to a defined criteria.
4. Stores X_train, X_test, y_train, y_test sets.
A classifier has:
* parameters: values that corresponds to the mathematical model, that are adjusted after the training.
* hyper-parameters: values related to the way of training, that are adjusted using a selected part of the training set.
It's possible to use a fit
function to train and adjust the params. Besides, Scikit learn provides a tool named
Pipeline
, and GridSearchCV
in order to make exhaustive search to achieve the hyperparams that optimize the results.
The steps are the same as the classification by level:
1. Run data_process/tsv_file_to_dataframe.py
to extract the data from the tsv file and split the dataset.
2. Run <clf name>_fit_tune_classifier.py
to fit and tune the classifier. fit
is to learn and fit the model to the train set. tune
is to search for the optimal combination of the hyperparams, the ones that achieves better results (tuning may take a while).
3. Run <clf name>_test_classifier.py
to test the trained classifiers and show the results. Besides, a classified example set is stored in test_data/<clf name>_results.tsv
.
Notes:
* <clf name>
can be mlp
or sgd
, depending on the classifier.
* mlp_fit_tune_classifier.py
can take days to complete the tuning.
These are the results of each classification class:
``` precision recall f1-score support
Business 0.91 0.92 0.91 63
Other 0.79 0.86 0.83 44
Sales-Marketing 0.90 0.90 0.90 10 Technical 0.86 0.73 0.79 33
accuracy 0.86 150
macro avg 0.86 0.85 0.86 150
weighted avg 0.86 0.86 0.86 150 ```
``` precision recall f1-score support
Business 0.97 0.97 0.97 63
Other 0.95 0.95 0.95 44
Sales-Marketing 0.89 0.80 0.84 10 Technical 0.97 1.00 0.99 33
accuracy 0.96 150
macro avg 0.95 0.93 0.94 150
weighted avg 0.96 0.96 0.96 150 ```
SGD has better average than MLP. SGD uses SVM model, a good model for text classification.
Bumps joblib from 0.14.1 to 1.2.0.
Sourced from joblib's changelog.
Release 1.2.0
Fix a security issue where
eval(pre_dispatch)
could potentially run arbitrary code. Now only basic numerics are supported. joblib/joblib#1327Make sure that joblib works even when multiprocessing is not available, for instance with Pyodide joblib/joblib#1256
Avoid unnecessary warnings when workers and main process delete the temporary memmap folder contents concurrently. joblib/joblib#1263
Fix memory alignment bug for pickles containing numpy arrays. This is especially important when loading the pickle with
mmap_mode != None
as the resultingnumpy.memmap
object would not be able to correct the misalignment without performing a memory copy. This bug would cause invalid computation and segmentation faults with native code that would directly access the underlying data buffer of a numpy array, for instance C/C++/Cython code compiled with older GCC versions or some old OpenBLAS written in platform specific assembly. joblib/joblib#1254Vendor cloudpickle 2.2.0 which adds support for PyPy 3.8+.
Vendor loky 3.3.0 which fixes several bugs including:
robustly forcibly terminating worker processes in case of a crash (joblib/joblib#1269);
avoiding leaking worker processes in case of nested loky parallel calls;
reliability spawn the correct number of reusable workers.
Release 1.1.0
Fix byte order inconsistency issue during deserialization using joblib.load in cross-endian environment: the numpy arrays are now always loaded to use the system byte order, independently of the byte order of the system that serialized the pickle. joblib/joblib#1181
Fix joblib.Memory bug with the
ignore
parameter when the cached function is a decorated function.
... (truncated)
5991350
Release 1.2.03fa2188
MAINT cleanup numpy warnings related to np.matrix in tests (#1340)cea26ff
CI test the future loky-3.3.0 branch (#1338)8aca6f4
MAINT: remove pytest.warns(None) warnings in pytest 7 (#1264)067ed4f
XFAIL test_child_raises_parent_exits_cleanly with multiprocessing (#1339)ac4ebd5
MAINT add back pytest warnings plugin (#1337)a23427d
Test child raises parent exits cleanly more reliable on macos (#1335)ac09691
[MAINT] various test updates (#1334)4a314b1
Vendor loky 3.2.0 (#1333)bdf47e9
Make test_parallel_with_interactively_defined_functions_default_backend timeo...Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase
.
Bumps numpy from 1.17.4 to 1.22.0.
Sourced from numpy's releases.
v1.22.0
NumPy 1.22.0 Release Notes
NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:
- Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
- A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across application such as CuPy and JAX.
- NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
- New methods for
quantile
,percentile
, and related functions. The new methods provide a complete set of the methods commonly found in the literature.- A new configurable allocator for use by downstream projects.
These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.
The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.
Expired deprecations
Deprecated numeric style dtype strings have been removed
Using the strings
"Bytes0"
,"Datetime64"
,"Str0"
,"Uint32"
, and"Uint64"
as a dtype will now raise aTypeError
.(gh-19539)
Expired deprecations for
loads
,ndfromtxt
, andmafromtxt
in npyio
numpy.loads
was deprecated in v1.15, with the recommendation that users usepickle.loads
instead.ndfromtxt
andmafromtxt
were both deprecated in v1.17 - users should usenumpy.genfromtxt
instead with the appropriate value for theusemask
parameter.(gh-19615)
... (truncated)
4adc87d
Merge pull request #20685 from charris/prepare-for-1.22.0-releasefd66547
REL: Prepare for the NumPy 1.22.0 release.125304b
wipc283859
Merge pull request #20682 from charris/backport-204165399c03
Merge pull request #20681 from charris/backport-20954f9c45f8
Merge pull request #20680 from charris/backport-20663794b36f
Update armccompiler.pyd93b14e
Update test_public_api.py7662c07
Update init.py311ab52
Update armccompiler.pyDependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase
.
Bumps ipython from 7.10.2 to 7.16.3.
d43c7c7
release 7.16.35fa1e40
Merge pull request from GHSA-pq7m-3gw7-gq5x8df8971
back to dev9f477b7
release 7.16.2138f266
bring back release helper from master branch5aa3634
Merge pull request #13341 from meeseeksmachine/auto-backport-of-pr-13335-on-7...bcae8e0
Backport PR #13335: What's new 7.16.28fcdcd3
Pin Jedi to <0.17.2.2486838
release 7.16.120bdc6f
fix conda buildDependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase
.
Bumps pygments from 2.5.2 to 2.7.4.
Sourced from pygments's releases.
2.7.4
Updated lexers:
Fix infinite loop in SML lexer (#1625)
Fix backtracking string regexes in JavaScript/TypeScript, Modula2 and many other lexers (#1637)
Limit recursion with nesting Ruby heredocs (#1638)
Fix a few inefficient regexes for guessing lexers
Fix the raw token lexer handling of Unicode (#1616)
Revert a private API change in the HTML formatter (#1655) -- please note that private APIs remain subject to change!
Fix several exponential/cubic-complexity regexes found by Ben Caller/Doyensec (#1675)
Fix incorrect MATLAB example (#1582)
Thanks to Google's OSS-Fuzz project for finding many of these bugs.
2.7.3
... (truncated)
Sourced from pygments's changelog.
Version 2.7.4
(released January 12, 2021)
Updated lexers:
Fix infinite loop in SML lexer (#1625)
Fix backtracking string regexes in JavaScript/TypeScript, Modula2 and many other lexers (#1637)
Limit recursion with nesting Ruby heredocs (#1638)
Fix a few inefficient regexes for guessing lexers
Fix the raw token lexer handling of Unicode (#1616)
Revert a private API change in the HTML formatter (#1655) -- please note that private APIs remain subject to change!
Fix several exponential/cubic-complexity regexes found by Ben Caller/Doyensec (#1675)
Fix incorrect MATLAB example (#1582)
Thanks to Google's OSS-Fuzz project for finding many of these bugs.
Version 2.7.3
(released December 6, 2020)
... (truncated)
4d555d0
Bump version to 2.7.4.fc3b05d
Update CHANGES.ad21935
Revert "Added dracula theme style (#1636)"e411506
Prepare for 2.7.4 release.275e34d
doc: remove Perl 6 ref2e7e8c4
Fix several exponential/cubic complexity regexes found by Ben Caller/Doyenseceb39c43
xquery: fix pop from empty stack2738778
fix coding style in test_analyzer_lexer02e0f09
Added 'ERROR STOP' to fortran.py keywords. (#1665)c83fe48
support added for css variables (#1633)Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase
.
Hi, When i use different test data, the sgd/mlp_results shows. ",,,,,,,," Could you please help with that?
Random forest with parameter tuning, According to instructions included a fit tuning file with clf_name_fit_tune.py And also Added the clf_name_test.py file, Will be giving out the output instances soon
hacktoberfest