youtube2zim allows you to create a ZIM file from a YouTube Channel/Username or one or more Playlists.
It downloads the videos (webm or mp4 format, optionally recompressed to a lower quality and smaller size), the thumbnails, the subtitles and the authors' profile pictures; then it creates a folder of static HTML files from them before creating a ZIM file from that folder.
ffmpeg for video transcoding (only used with --lower-quality).
curl and unzip to install JavaScript dependencies. See get_js_deps.sh if you want to do it manually.
Here are a few different ways to install youtube2zim.
youtube2zim is Python 3 software. If you are not using the Docker image, you are advised to use it in a virtualenv to avoid installing its software dependencies on your system.
```bash
virtualenv -p python3 ./     # Create the virtualenv
source bin/activate          # Activate the virtualenv
pip3 install youtube2zim     # Install youtube2zim and its dependencies
youtube2zim --help           # Display youtube2zim help
```
At the end, call deactivate to quit the virtual environment.
See requirements.txt for the list of Python dependencies.
```bash
docker run -v my_dir:/output openzim/youtube youtube2zim --help
```

```bash
sudo pip3 install -U youtube2zim
```
youtube2zim uses the YouTube Data API v3 to fetch data from YouTube, so you need to provide an API_KEY to use the scraper.
To get an API key:
```bash
youtube2zim --api-key "<your-api-key>" --type user --id "Vsauce"
```
Use --only_test_branding when adjusting parameters and branding to avoid wasting your quota.
youtube2zim produces a single ZIM file for a YouTube request (channel, user, playlists).
youtube2zim-playlists allows you to create one ZIM file per playlist instead. This script is a wrapper around youtube2zim and is bundled with the main package.
```bash
youtube2zim-playlists --help
```
Sample usage:
```bash
youtube2zim-playlists --indiv-playlists --api-key XXX --type user --id Vsauce --playlists-name="vsauce_en_playlist-{playlist_id}"
```
Those are the required arguments for youtube2zim-playlists, but you can also pass any regular youtube2zim argument. Those will be forwarded to youtube2zim (which will be run independently for each playlist).
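The `{playlist_id}` placeholder in the sample `--playlists-name` value can be pictured as a plain string substitution performed once per playlist. The following is a minimal sketch using Python's `str.format`; the function name and the playlist IDs are hypothetical, and this is not claimed to be how the wrapper does it internally.

```python
def expand_playlist_name(template: str, playlist_id: str) -> str:
    """Fill the {playlist_id} placeholder of a --playlists-name template."""
    return template.format(playlist_id=playlist_id)


# Hypothetical playlist IDs, for illustration only.
playlist_ids = ["PLaaa111", "PLbbb222"]
template = "vsauce_en_playlist-{playlist_id}"

# One distinct name per playlist, so each ZIM gets its own Name.
names = [expand_playlist_name(template, pid) for pid in playlist_ids]
print(names)
```

Because each playlist ID is distinct, a template containing `{playlist_id}` yields a distinct name per playlist.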
Specificities:
* --title and --description are mutually exclusive with --playlists-title and --playlists-description.
* If using --title or --description, all your playlist ZIMs will have the same, static metadata. This is rarely wanted.
* --playlists-title and --playlists-description allow you to dynamically customize them via some playlist-related variables:
  * {title}: the playlist title
  * {description}: the playlist description
  * {slug}: slugified version of the playlist title
  * {playlist_id}: playlist ID on YouTube
  * {creator_id}: playlist's owner channel/user ID
  * {creator_name}: playlist's owner channel/user name
* If you specify none of these, youtube2zim will auto-generate those.
* You must specify --playlists-name (supports the variables listed above).
* --playlists-name is used to set the Name metadata of the ZIM (should be unique) and, if not set separately, the output file name for the ZIM.
* --metadata-from allows you to specify a path or URL to a JSON file specifying custom static metadata for individual playlists. Format:
```json
{
    "<playlist-id>": {
        "name": "",
        "zim-file": "",
        "title": "",
        "description": "",
        "tags": "",
        "creator": "",
        "profile": "",
        "banner": ""
    }
}
```
All fields are optional and taken from command-line/default if not found. <playlist-id> represents the YouTube Playlist ID.
If you feel the need for setting additional details in this file, chances are you should run youtube2zim independently for that playlist (still possible!)
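The fallback rule for this file (fields are optional, and missing ones come from the command line or defaults) could be sketched as below. This is only an illustration: the function name, the sample playlist IDs and values, and the choice to treat empty strings as "not found" are assumptions, not the scraper's documented behaviour.

```python
import json

# Keys accepted per playlist, as listed in the format above.
FIELDS = ("name", "zim-file", "title", "description",
          "tags", "creator", "profile", "banner")


def resolve_metadata(custom: dict, playlist_id: str, defaults: dict) -> dict:
    """Return the defaults, overridden by the playlist's non-empty custom fields."""
    overrides = custom.get(playlist_id, {})
    resolved = dict(defaults)
    for field in FIELDS:
        value = overrides.get(field)
        if value:  # assumption: missing or empty values fall back to the default
            resolved[field] = value
    return resolved


# Hypothetical content of a --metadata-from file and hypothetical defaults.
custom = json.loads('{"PLaaa111": {"title": "Physics basics", "tags": ""}}')
defaults = {"title": "Auto title", "tags": "youtube"}

print(resolve_metadata(custom, "PLaaa111", defaults))
print(resolve_metadata(custom, "PLzzz999", defaults))
```

A playlist that is absent from the file simply keeps every default.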
Before contributing be sure to check out the CONTRIBUTING.md guidelines.
GPLv3 or later, see LICENSE for more details.
Teachers & Learners need clickable Tables-of-Contents to navigate long videos, to greatly deepen learning around "2-hour movies" and similar long videos.
How should this thoughtful interactivity be implemented in a ZIM file, for long videos especially?
YouTube calls this Video Chapters. Here's a great example beginning with an 11-hour YouTube video on Learning CSS:
Above screenshot is from the Description > "Show more" section of https://youtu.be/OXGznpKZ_sA
Here's a visual view of the same idea:
▶️ Much the same question has been asked at: https://community.learningequality.org/t/clickable-chapters-on-videos/2707
▶️ Central Question: What is likely the best way to make this critical interactivity (clickable "Video Chapters") happen in a ZIM file, when scraping from YouTube or in any similar way?
ASIDE: YouTube shows the "Video Chapter" name at the bottom — much like CNN's Chyron ("the crawl"). That more advanced UX/UI is certainly elegant — but might not be at all necessary (MVP == Minimum Viable Product!) A very basic/clickable Table-of-Contents being all that learners (and teachers) need — to make video-centric ZIM files more practical/pragmatic for everyday learning!
CONCERN: ZIM files seem to all use video player https://github.com/brion/ogv.js which might make this very difficult, if not impossible? (i.e. to embed a Clickable-Video-Table-of-Contents within any ZIM file, that students need) Or am I wrong?!
SIDE QUESTION: Should warc2zim or any other approaches perhaps be considered — if there's no other/obvious way to make this happen?
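One possible starting point, independent of which player ends up being used, is to extract the chapter list from the video description at scrape time. The sketch below assumes chapters appear as one line each, starting with a `M:SS` or `H:MM:SS` timestamp followed by a title; the function name and the sample description are hypothetical.

```python
import re

# Assumed chapter line shapes: "1:02:03 Title" or "02:03 - Title".
CHAPTER_RE = re.compile(r"^\s*(?:(\d+):)?(\d{1,2}):(\d{2})\s*-?\s*(.+?)\s*$")


def parse_chapters(description: str) -> list[tuple[int, str]]:
    """Return (start_seconds, title) pairs found in a video description."""
    chapters = []
    for line in description.splitlines():
        match = CHAPTER_RE.match(line)
        if match:
            hours, minutes, seconds, title = match.groups()
            start = int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)
            chapters.append((start, title))
    return chapters


# Hypothetical description text, for illustration only.
description = "Learn CSS\n0:00 Intro\n1:02:03 Selectors\n10:30:00 - Grid\n"
print(parse_chapters(description))
```

Each pair could then be rendered as a clickable entry in the video page, with the click handler seeking the player to `start_seconds`; whether the player used in the ZIM exposes such seeking is not something this sketch settles.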
Look at what the ZIM filenames of https://farm.openzim.org/recipes/zenius_id_playlists look like: https://farm.openzim.org/pipeline/46001c89460732f50a144a36
They don't respect the norm, and actually I suspect that if the playlist name has a character which is not supported by the filesystem then it will crash.
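A defensive way to avoid filesystem-hostile characters would be to reduce the playlist name to a conservative ASCII slug before using it in a filename. The sketch below is one such approach, not the scraper's current behaviour; the function name, the fallback value and the allowed character set are assumptions, and it does not by itself enforce any ZIM naming convention.

```python
import re
import unicodedata


def safe_filename(name: str, fallback: str = "playlist") -> str:
    """Reduce a name to lowercase ASCII letters, digits, '-' and '_'."""
    # Decompose accented characters, then drop anything that is not ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Replace every run of other characters (spaces, slashes, colons...) with "-".
    slug = re.sub(r"[^A-Za-z0-9_]+", "-", ascii_name).strip("-").lower()
    # A name made only of unsupported characters would otherwise be empty.
    return slug or fallback


print(safe_filename("Matematika: Kelas 10 / Bab 1?"))
print(safe_filename("???"))
```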
A typical channel page looks like:
My proposal would be to stick to this UI by removing mostly what is related to the "social" part of Youtube. That means in particular the top and sidebar, but not only. We would also obviously remove the "community" and "channels" tabs (of screenshot above) and everything related to comments like we do today.
This applies to the channel/user view; for the playlist and video pages, the UI should be rethought a bit more because all the comments and the related videos will disappear.
So basically this proposal leads to a few questions:
* Would that be a good result to obtain for a new UI?
* Is that doable for a reasonable amount of work?
* Is that a sustainable solution?
This is the case for example of this video of « deusnex silicium » https://www.youtube.com/watch?v=QJSnf04K9WI
Could we fix this limitation?
Currently it seems to be a - character, like at http://library.kiwix.org/khan-academy-videos_en_geometric-optics-ap-physics-2-khan-academy_2021-04/M/Description
The new structure from PR #119 accidentally reinforced a flaw in our request management. Every youtube_dl call triggers a request to the video webpage, from which youtube_dl extracts the information it needs to then:
* download the video file(s)
* download the thumbnail(s)
* download the subtitle(s)
As this webpage is subject to the quota, this is inefficient: we request it 3 times when we have nothing for that video in S3.
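One way to picture a fix is to fetch the video's information once and share it between the video, thumbnail and subtitle steps. The sketch below uses a small in-memory cache around an injected fetch function; `VideoInfoCache`, `fake_fetch`, the video ID and the returned keys are all hypothetical stand-ins, not youtube_dl's API or the scraper's code.

```python
from typing import Callable, Dict


class VideoInfoCache:
    """Fetch a video's info once and reuse it for later steps."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self._fetch = fetch  # performs the quota-limited page request
        self._cache: Dict[str, dict] = {}
        self.requests = 0  # number of real fetches, kept for illustration

    def get(self, video_id: str) -> dict:
        if video_id not in self._cache:
            self.requests += 1
            self._cache[video_id] = self._fetch(video_id)
        return self._cache[video_id]


def fake_fetch(video_id: str) -> dict:
    # Stand-in for the real page request; the returned keys are made up.
    return {"id": video_id, "video": "v.webm", "thumbnail": "t.jpg", "subtitles": ["en"]}


cache = VideoInfoCache(fake_fetch)
video = cache.get("abc123")["video"]          # first call: one real fetch
thumbnail = cache.get("abc123")["thumbnail"]  # served from the cache
subtitles = cache.get("abc123")["subtitles"]  # served from the cache
print(cache.requests)
```

With this shape, the three download steps for a video cost a single page request instead of three.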
We should also probably save those subtitles in S3 although those can change over time. As for thumbnails, I don't think this matters much until users raise concern about subtitles being outdated (then we'd expire them after some time).
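If expiry is ever needed, the check itself can stay very small. The following sketch assumes each cached subtitle has a known storage time; the function name, the 30-day limit and the dates are hypothetical, and how the storage time would be recorded in S3 is left open.

```python
from datetime import datetime, timedelta, timezone


def is_expired(stored_at: datetime, max_age: timedelta, now: datetime) -> bool:
    """Return True when a cached item is older than max_age."""
    return now - stored_at > max_age


# Hypothetical values, for illustration only.
max_age = timedelta(days=30)
now = datetime(2021, 6, 1, tzinfo=timezone.utc)
fresh = datetime(2021, 5, 20, tzinfo=timezone.utc)
stale = datetime(2021, 3, 1, tzinfo=timezone.utc)

print(is_expired(fresh, max_age, now), is_expired(stale, max_age, now))
```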