Data sourcing and pre-processing for raplyrics.eu - A rap music lyrics generation project

fpaupier, updated 🕥 2022-12-08 02:21:49

RapLyrics-Scraper

Context

This project aims to provide high-quality text datasets of rap music lyrics. These datasets are then fed to a neural network to build a lyrics-generation model. The resulting word-to-word lyrics-generation model is served on raplyrics.eu.

Feel free to tweak this scraper to fit your needs. Kudos to open source.

Setup

  • First, create a Genius API key so that you can call the Genius API. Once done, copy your client_access_token into genius/credentials.ini.

  • Get the repo - clone from GitHub

    $ git clone https://github.com/fpaupier/RapLyrics-Scraper

  • Set up a virtualenv

This project is built on Python 3 - I recommend using a virtual environment.

    $ `which python3` -m venv RapLyrics-Scraper
    $ source RapLyrics-Scraper/bin/activate
    $ pip install -r requirements.txt
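
To illustrate the first setup step, the sketch below shows one way the token could be read back from genius/credentials.ini using Python's standard configparser module. It is not taken from the repository: the section name `genius` and the helper name `load_access_token` are assumptions, so check the actual layout of the file before reusing it.

```python
import configparser
from pathlib import Path


def load_access_token(path="genius/credentials.ini", section="genius"):
    """Return the client_access_token stored in an INI file.

    The section name is an assumption; adapt it to the real file.
    """
    # ConfigParser.read() silently skips missing files, so check first.
    if not Path(path).is_file():
        raise FileNotFoundError(f"credentials file not found: {path}")
    parser = configparser.ConfigParser()
    parser.read(path, encoding="utf-8")
    token = parser.get(section, "client_access_token").strip()
    if not token:
        raise ValueError("client_access_token is empty")
    return token
```

Failing early on a missing or empty token gives a clearer error than a rejected API call later on.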

Run the lyrics scraper

  • Update the list of artists you want to get the lyrics from and the number of songs to get per artist. To do so, directly edit the artists list defined at lyrics_scraper.py:39.

  • To run the script, be sure to set the lyrics_dir and songs_per_artists arguments:

    • Specify the directory in which the scraped lyrics should be saved with lyrics_dir.
    • Specify the number of songs to scrape per artist with songs_per_artists.

    Run python lyrics_scraper.py --help for more information on the available arguments.

Say you want to scrape 2 songs per artist and save them in the folder my_lyrics_folder with verbose output. Run:

    $ python lyrics_scraper.py --verbose --lyrics_dir='my_lyrics_folder' --songs_per_artists=2
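
For readers adapting the script, a command-line interface accepting the three flags used in the command above could be declared roughly as follows. This is a sketch built on argparse, not the code of lyrics_scraper.py; the help strings and the choice to make both arguments required are assumptions.

```python
import argparse


def build_parser():
    """Build a parser accepting the flags used in the command above."""
    parser = argparse.ArgumentParser(
        description="Scrape rap lyrics, one output file per artist."
    )
    parser.add_argument(
        "--lyrics_dir",
        required=True,
        help="directory in which the scraped lyrics are saved",
    )
    parser.add_argument(
        "--songs_per_artists",
        type=int,
        required=True,
        help="number of songs to scrape per artist",
    )
    parser.add_argument(
        "--verbose",
        action="store_true",
        help="print progress information while scraping",
    )
    return parser


# Parse the same arguments as the shell command above. The shell strips
# the quotes around my_lyrics_folder before Python sees the value.
args = build_parser().parse_args(
    ["--verbose", "--lyrics_dir=my_lyrics_folder", "--songs_per_artists=2"]
)
print(args.lyrics_dir, args.songs_per_artists, args.verbose)
```

Declaring songs_per_artists with type=int lets argparse reject non-numeric values before any request is made.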

  • Once the scraping is done, one lyrics file is generated per scraped artist. Merge the files with:

    $ cat *_lyrics.txt > merged_lyrics.txt
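
Where cat is not available, the same merge can be done in Python. This is a minimal sketch, assuming the per-artist files sit in a single directory and end in _lyrics.txt; the function name `merge_lyrics` is hypothetical and not part of the project.

```python
from pathlib import Path


def merge_lyrics(lyrics_dir, output_name="merged_lyrics.txt"):
    """Concatenate every *_lyrics.txt file in lyrics_dir into one file."""
    lyrics_dir = Path(lyrics_dir)
    output_path = lyrics_dir / output_name
    # Sort for a deterministic order, and skip the output file itself:
    # "merged_lyrics.txt" also matches the *_lyrics.txt pattern.
    sources = sorted(
        p for p in lyrics_dir.glob("*_lyrics.txt") if p != output_path
    )
    with output_path.open("w", encoding="utf-8") as merged:
        for source in sources:
            merged.write(source.read_text(encoding="utf-8"))
    return output_path
```

Because the output file is excluded explicitly, running the merge a second time yields the same result instead of folding the previous merge into the new one.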

Utils

A toolbox is also provided to analyze some of the dataset's properties. To run a quick analysis of any .txt file, set the file to analyze in pre_processing/analysis.py, then run:

    $ python pre_processing/analysis.py
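
The kind of properties such an analysis might report can be sketched as follows. This is not the content of pre_processing/analysis.py: the function name `describe_text` and the chosen statistics (line count, word count, vocabulary size, most frequent words) are assumptions made for illustration.

```python
from collections import Counter


def describe_text(text, top_n=5):
    """Return basic statistics about a lyrics corpus given as a string."""
    lines = [line for line in text.splitlines() if line.strip()]
    # Naive tokenization: lowercase, split on whitespace, keep punctuation.
    words = text.lower().split()
    counts = Counter(words)
    return {
        "non_empty_lines": len(lines),
        "words": len(words),
        "vocabulary_size": len(counts),
        "most_common": counts.most_common(top_n),
    }
```

To try it on a real dataset, read merged_lyrics.txt into a string and pass it to the function. A small vocabulary relative to the word count is a hint that the corpus is repetitive, which matters for a word-to-word generative model.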

Notes

Songs are currently fetched in decreasing order of popularity.
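
The Genius API accepts a sort parameter on its artist songs endpoint, which is one way to obtain this ordering. The sketch below only builds the request and does not send it. It is not taken from lyrics_scraper.py, and the function name, the artist id and the token value are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

API_ROOT = "https://api.genius.com"


def build_songs_request(artist_id, token, per_page=20, page=1):
    """Build a request for an artist's songs, most popular first."""
    query = urlencode(
        {"sort": "popularity", "per_page": per_page, "page": page}
    )
    url = f"{API_ROOT}/artists/{artist_id}/songs?{query}"
    # The API expects the access token as a bearer token.
    return Request(url, headers={"Authorization": f"Bearer {token}"})


# Placeholder values, for illustration only.
request = build_songs_request(artist_id=123, token="YOUR_TOKEN", per_page=2)
print(request.full_url)
```

Passing the returned object to urllib.request.urlopen would perform the call; with sort=popularity, requesting the first N results corresponds to taking an artist's N most popular songs.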

Related work

This project was used intensively to generate high-quality text datasets that were consumed by:

Issues

chore(deps): bump certifi from 2018.4.16 to 2022.12.7

opened on 2022-12-08 02:21:48 by dependabot[bot]

Bumps certifi from 2018.4.16 to 2022.12.7.

chore(deps): bump ipython from 6.3.1 to 7.16.3

opened on 2022-01-21 19:28:54 by dependabot[bot]

Bumps ipython from 6.3.1 to 7.16.3.

François Paupier

Software Engineer

data data-mining scraper beautiful-soup beautiful-soup-scraper genius genius-api genius-lyrics-search genius-lyrics music python python3 mit-license