Create a ZIM file from a Youtube channel/username/playlist

openzim, updated 🕥 2023-02-07 16:39:35

Youtube2zim


youtube2zim allows you to create a ZIM file from a Youtube Channel/Username or one-or-more Playlists.

It downloads the videos (webm or mp4, optionally recompressed to a lower-quality, smaller-size version), the thumbnails, the subtitles and the authors' profile pictures; it then creates a folder of static HTML files from them before building a ZIM file from that folder.

Requirements

  • ffmpeg for video transcoding (only used with --lower-quality).
  • curl and unzip to install Javascript dependencies. See get_js_deps.sh if you want to do it manually.

Installation

Here are a few different ways to install youtube2zim.

Virtualenv

youtube2zim is Python 3 software. If you are not using the Docker image, you are advised to use it in a virtualenv to avoid installing software dependencies on your system.

virtualenv -p python3 ./      # Create virtualenv
source bin/activate           # Activate the virtualenv
pip3 install youtube2zim      # Install dependencies
youtube2zim --help            # Display youtube2zim help

At the end, call deactivate to quit the virtual environment.

See requirements.txt for the list of python dependencies.

Docker

docker run -v my_dir:/output openzim/youtube youtube2zim --help

Globally (on GNU/Linux)

sudo pip3 install -U youtube2zim

Usage

youtube2zim uses Youtube API v3 to fetch data from Youtube. You thus need to provide an API_KEY to use the scraper.

To get an API key:

  1. Connect to Google Developers Console
  2. Create a new Project then Select it.
  3. When asked, choose Create Credentials and select the API Key type. (Credentials page)

youtube2zim --api-key "<your-api-key>" --type user --id "Vsauce"
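
When the scraper is driven from another Python program, the command above can be assembled as an argument list and handed to `subprocess`. This is only a sketch: the helper names `build_youtube2zim_command` and `run_youtube2zim` are made up for illustration, and only the `--api-key`, `--type` and `--id` options shown above are used.

```python
import subprocess


def build_youtube2zim_command(api_key: str, kind: str, ident: str) -> list[str]:
    """Return the argument list for a basic youtube2zim run.

    Hypothetical helper: it only mirrors the command line shown above.
    """
    if not api_key:
        raise ValueError("an API key is required to query the Youtube API")
    return ["youtube2zim", "--api-key", api_key, "--type", kind, "--id", ident]


def run_youtube2zim(api_key: str, kind: str, ident: str) -> int:
    """Run youtube2zim (it must be installed and on PATH); return its exit code."""
    command = build_youtube2zim_command(api_key, kind, ident)
    return subprocess.run(command, check=False).returncode
```

Passing a list rather than a single shell string avoids quoting problems when an ID or key contains special characters.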

Notes

  • Your API_KEY is subject to usage quotas (10,000 requests/day), so use --only_test_branding when adjusting parameters and branding to avoid wasting your quota.
  • If you encounter issues reading ZIM files created using this scraper, please take a look at the Compatibility Matrix before opening a ticket.

youtube2zim-playlists

youtube2zim produces a single ZIM file for a Youtube request (channel, user, playlists).

youtube2zim-playlists allows you to create one ZIM file per playlist instead.

This script is a wrapper around youtube2zim and is bundled with the main package.

Usage

youtube2zim-playlists --help

Sample usage:

youtube2zim-playlists --indiv-playlists --api-key XXX --type user --id Vsauce --playlists-name="vsauce_en_playlist-{playlist_id}"

Those are the required arguments for youtube2zim-playlists but you can also pass any regular youtube2zim argument. Those will be forwarded to youtube2zim (which will be run independently for each playlist).

Specificities:

  • --title and --description are mutually exclusive with --playlists-title and --playlists-description.
  • If using --title or --description, all your playlists ZIMs will have the same, static metadata. This is rarely wanted.
  • --playlists-title and --playlists-description allow you to dynamically customize them via some playlist-related variables:
      • {title}: the playlist title
      • {description}: the playlist description
      • {slug}: slugified version of the playlist title
      • {playlist_id}: playlist ID on youtube
      • {creator_id}: playlist's owner channel/user ID.
      • {creator_name}: playlist's owner channel/user name.
  • You can omit them and youtube2zim will auto-generate those.
  • You must specify --playlists-name (supports variables listed above).
  • --playlists-name is used to set the Name metadata of the ZIM (should be unique) and if not set separately, the output file name for the ZIM.
  • --metadata-from lets you specify a path or URL to a JSON file specifying custom static metadata for individual playlists. Format:

{
    "<playlist-id>": {
        "name": "",
        "zim-file": "",
        "title": "",
        "description": "",
        "tags": "",
        "creator": "",
        "profile": "",
        "banner": ""
    }
}

All fields are optional and taken from command-line/default if not found. <playlist-id> represents the Youtube Playlist ID.
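
The fallback rule described above can be pictured with a short sketch. The function name `resolve_playlist_metadata` and the sample values are hypothetical, and treating an empty string the same as a missing key is an assumption, not documented behaviour.

```python
import json

# Field names taken from the JSON format above.
FIELDS = ("name", "zim-file", "title", "description", "tags", "creator", "profile", "banner")


def resolve_playlist_metadata(custom: dict, playlist_id: str, defaults: dict) -> dict:
    """Merge per-playlist overrides on top of command-line/default values.

    Assumption: an empty string or a missing key means "not set",
    so the default value is used instead.
    """
    overrides = custom.get(playlist_id, {})
    return {field: overrides.get(field) or defaults.get(field, "") for field in FIELDS}


# Example with made-up values.
custom = json.loads('{"PL123": {"title": "Optics", "tags": ""}}')
defaults = {"title": "Default title", "tags": "youtube", "creator": "Vsauce"}
resolved = resolve_playlist_metadata(custom, "PL123", defaults)
print(resolved["title"])    # Optics
print(resolved["tags"])     # youtube
print(resolved["creator"])  # Vsauce
```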

If you feel the need for setting additional details in this file, chances are you should run youtube2zim independently for that playlist (still possible!)
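
To illustrate how the playlist variables listed above could expand inside a --playlists-name, --playlists-title or --playlists-description value, here is a minimal sketch built on Python's `str.format`. Whether youtube2zim-playlists expands them this way is an assumption; the function name and the playlist data are invented.

```python
def expand_playlist_template(template: str, variables: dict) -> str:
    """Replace {placeholders} in a --playlists-* value with playlist data.

    Sketch only: relies on str.format, so an unknown placeholder raises KeyError.
    """
    return template.format(**variables)


# Made-up playlist data; the keys are the variables listed above.
playlist = {
    "title": "Mind Field",
    "description": "Season 1",
    "slug": "mind-field",
    "playlist_id": "PL0000",
    "creator_id": "UC0000",
    "creator_name": "Vsauce",
}

print(expand_playlist_template("vsauce_en_playlist-{playlist_id}", playlist))
# vsauce_en_playlist-PL0000
```

Including {playlist_id} in the name template is one simple way to keep the Name metadata unique across playlists.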

Development

Before contributing be sure to check out the CONTRIBUTING.md guidelines.

License

GPLv3 or later, see LICENSE for more details.

Issues

Long videos need clickable Tables-of-Contents: narrative monologues -> interactive learning

opened on 2023-01-26 20:23:24 by holta

Teachers & Learners need clickable Tables-of-Contents to navigate long videos, to greatly deepen learning around "2-hour movies" and similar long videos.

How should this thoughtful interactivity be implemented in a ZIM file, for long videos especially?

YouTube calls this Video Chapters. Here's a great example beginning with an 11-hour YouTube video on Learning CSS:

image

Above screenshot is from the Description > "Show more" section of https://youtu.be/OXGznpKZ_sA

Here's a visual view of the same idea:

image

▶️ Much the same question has been asked at: https://community.learningequality.org/t/clickable-chapters-on-videos/2707

▶️ Central Question: What is likely the best way to make this critical interactivity (clickable "Video Chapters") happen in a ZIM file, when scraping from YouTube or in any similar way?

ASIDE: YouTube shows the "Video Chapter" name at the bottom — much like CNN's Chyron ("the crawl"). That more advanced UX/UI is certainly elegant — but might not be at all necessary (MVP == Minimum Viable Product!) A very basic/clickable Table-of-Contents being all that learners (and teachers) need — to make video-centric ZIM files more practical/pragmatic for everyday learning!

CONCERN: ZIM files seem to all use video player https://github.com/brion/ogv.js which might make this very difficult, if not impossible? (i.e. to embed a Clickable-Video-Table-of-Contents within any ZIM file, that students need) Or am I wrong?!

SIDE QUESTION: Should warc2zim or any other approaches perhaps be considered — if there's no other/obvious way to make this happen?
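
One possible starting point for the central question, offered only as a sketch: YouTube chapter lists are written as timestamped lines in the video description, so a scraper could parse them into (start in seconds, title) pairs and render them as links that seek the player. The function name `parse_chapters` and the accepted line format are assumptions; real descriptions vary and would need more tolerant parsing.

```python
import re

# Matches lines such as "0:00 Intro", "12:34 - Flexbox" or "1:02:03 Grid".
CHAPTER_LINE = re.compile(r"^\s*(?:(\d+):)?(\d{1,2}):(\d{2})\s*[-:]?\s*(.+?)\s*$")


def parse_chapters(description: str) -> list[tuple[int, str]]:
    """Return (start_seconds, title) for every timestamped line found."""
    chapters = []
    for line in description.splitlines():
        match = CHAPTER_LINE.match(line)
        if not match:
            continue  # not a chapter line, e.g. ordinary description text
        hours, minutes, seconds, title = match.groups()
        start = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
        chapters.append((start, title))
    return chapters
```

Each pair could then become a link whose click handler sets the player's current time to the chapter start, which would not depend on YouTube's own chapter UI.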

Playlist scraping does not "slug" properly in the ZIM filenames

opened on 2022-12-22 13:29:22 by kelson42

Look at what the ZIM filenames of https://farm.openzim.org/recipes/zenius_id_playlists look like: https://farm.openzim.org/pipeline/46001c89460732f50a144a36

They don't respect the norm, and actually I suspect that if the playlist name has a character which is not supported by the filesystem then it will crash.
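
As an illustration of the kind of slugging the filenames would need, here is a minimal standard-library sketch. It is not the scraper's actual implementation; the function name `slugify_for_filename` and the fallback value are hypothetical.

```python
import re
import unicodedata


def slugify_for_filename(name: str, fallback: str = "playlist") -> str:
    """Reduce a playlist name to lowercase ASCII letters, digits and hyphens."""
    # Decompose accented characters, then drop anything outside ASCII.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Replace every run of characters other than letters and digits with one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")
    # A name made only of unsupported characters would otherwise be empty.
    return slug or fallback


print(slugify_for_filename("Géométrie / Optique: AP Physics 2"))
# geometrie-optique-ap-physics-2
```

Characters such as "/" or ":" that a filesystem may reject never reach the output, which addresses the suspected crash case.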

Revamp the UI to stick to original

opened on 2022-09-03 13:02:36 by kelson42

A typical channel page looks like: image

My proposal would be to stick to this UI by removing mostly what is related to the "social" part of Youtube. That means in particular the top and sidebar, but not only. We would also obviously remove the "community" and "channels" tabs (of screenshot above) and everything related to comments like we do today.

This is for the channel/user view because for the playlists and video pages, the UI should be rethought a bit more because all the comments and the related videos will disappear.

So basically this proposal leads to a few questions:

  • Would that be a good result to obtain for a new UI?
  • Is that doable for a reasonable amount of work?
  • Is that a sustainable solution?

Videos needing age confirmation are not scraped

opened on 2022-08-06 21:49:13 by kelson42

This is the case for example of this video of « deusnex silicium » https://www.youtube.com/watch?v=QJSnf04K9WI

Could we fix this limitation?

Set ZIM description properly in case of multiple playlist scraping

opened on 2021-07-04 12:54:04 by kelson42

Currently it seems to be a - character like at http://library.kiwix.org/khan-academy-videos_en_geometric-optics-ap-physics-2-khan-academy_2021-04/M/Description

Better youtube_dl requests mgmt

opened on 2020-09-18 15:40:06 by rgaudin

The new structure from PR #119 accidentally strengthened something wrong in our requests management. Every youtube_dl call leads to a request to the video webpage, from which youtube_dl extracts the information it needs to then:

  • download the video file(s)
  • download the thumbnail(s)
  • download the subtitle(s)

As this webpage is subject to the quota, this is not very efficient, as we'd call it 3 times in case we have nothing for that video in S3.
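
One way to picture the fix is a small cache that requests the video information once and reuses it for all three downloads. This is a sketch under assumptions: `VideoInfoCache` and `fake_fetch` are invented names, and the real extraction call would replace the fake fetcher.

```python
from typing import Callable


class VideoInfoCache:
    """Fetch the information for each video at most once, then reuse it."""

    def __init__(self, fetch: Callable[[str], dict]):
        self._fetch = fetch  # hypothetical callable that requests the video webpage
        self._cache: dict[str, dict] = {}

    def get(self, video_id: str) -> dict:
        if video_id not in self._cache:
            self._cache[video_id] = self._fetch(video_id)
        return self._cache[video_id]


# Demonstration with a fake fetcher that records how often it is called.
calls = []


def fake_fetch(video_id: str) -> dict:
    calls.append(video_id)
    return {"id": video_id, "formats": [], "thumbnails": [], "subtitles": {}}


cache = VideoInfoCache(fake_fetch)
for step in ("video", "thumbnail", "subtitles"):
    info = cache.get("QJSnf04K9WI")  # each download step reuses the same info
print(len(calls))  # 1
```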

We should also probably save those subtitles in S3 although those can change over time. As for thumbnails, I don't think this matters much until users raise concern about subtitles being outdated (then we'd expire them after some time).

Releases

2.1.18 2022-11-09 10:07:27

  • Switched to yt-dlp instead of youtube_dl
  • Added fallback for subtitle languages with IDs-like suffixes (#161)
  • Removed a reference to ZIM namespace that would break if first video has subtitles
  • Fixed expected returncodes on errors (#166)
  • Using ogv.js 1.8.9, videojs 7.20.3 and latest videojs-ogvjs (master)
  • Using zimscraperlib 1.8.0