shell
python dyspider.py --uid 64742778880
抖音ID可以通过将主页分享链接发用浏览器打开查看,详细步骤参见
https://jingyan.baidu.com/article/d2b1d102ce2a885c7e37d4f5.html
运行代码将会在项目文件夹下新建一个与目标用户同名的文件夹,所有视频皆存入于此
从用户主页的分享页面入手
https://www.amemv.com/aweme/v1/aweme/post/?user_id=6xx1xx0&count=21&max_cursor=0&aid=1128&_signature=TG2uvBAbGAHzG19a.rniF0xtrq&dytk=14d65256b82dd042058b0eca9f85461b
观察一下,这是一个ajax链接,我们需要的所有信息都在返回的包里了
模拟请求,直接点击链接,copy as curl,然后复制到Postman里转成requests代码
返回json里有一个has_more字段,如果为1表示还可以下拉出现更多作品,如果为0表示已经是最后
当我们下拉的时候可以发现,新出现的url里只有max_cursor变了,新出现的max_cursor就是上次请求返回的max_cursor,有了has_more和max_cursor两个参数,我们就可以把所有urls取到了
写一个递归函数 get_all_video_urls 根据has_more字段将所有urls递归爬取下来,终止条件是has_more==0
用一个全局变量url_list = [] 存放爬到的每一个视频的名字和地址
运行编写好的代码后发现,视频数据格式不对,返回去检查,原来第9步中的url不是真实的视频url,而是一个302跳转地址,真实视频地址在第9步response headers里的Location里
添加禁止跳转的代码,获取真实视频url
python
response = requests.request("GET", url, headers=headers, allow_redirects=False)
video_url = response.headers['Location']
2018/09/11更新:添加显示下载进度条功能
todolist:
Bumps certifi from 2018.8.24 to 2022.12.7.
9e9e840
2022.12.07b81bdb2
2022.09.24939a28f
2022.09.14aca828a
2022.06.15.2de0eae1
Only use importlib.resources's new files() / Traversable API on Python ≥3.11 ...b8eb5e9
2022.06.15.147fb7ab
Fix deprecation warning on Python 3.11 (#199)b0b48e0
fixes #198 -- update link in license9d514b4
2022.06.154151e88
Add py.typed to MANIFEST.in to package in sdist (#196)Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase
.
Bumps ipython from 6.5.0 to 7.16.3.
d43c7c7
release 7.16.35fa1e40
Merge pull request from GHSA-pq7m-3gw7-gq5x8df8971
back to dev9f477b7
release 7.16.2138f266
bring back release helper from master branch5aa3634
Merge pull request #13341 from meeseeksmachine/auto-backport-of-pr-13335-on-7...bcae8e0
Backport PR #13335: What's new 7.16.28fcdcd3
Pin Jedi to <0.17.2.2486838
release 7.16.120bdc6f
fix conda buildDependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase
.
Bumps urllib3 from 1.23 to 1.26.5.
Sourced from urllib3's releases.
1.26.5
:warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap
- Fixed deprecation warnings emitted in Python 3.10.
- Updated vendored
six
library to 1.16.0.- Improved performance of URL parser when splitting the authority component.
If you or your organization rely on urllib3 consider supporting us via GitHub Sponsors
1.26.4
:warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap
- Changed behavior of the default
SSLContext
when connecting to HTTPS proxy during HTTPS requests. The defaultSSLContext
now setscheck_hostname=True
.If you or your organization rely on urllib3 consider supporting us via GitHub Sponsors
1.26.3
:warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap
Fixed bytes and string comparison issue with headers (Pull #2141)
Changed
ProxySchemeUnknown
error message to be more actionable if the user supplies a proxy URL without a scheme (Pull #2107)If you or your organization rely on urllib3 consider supporting us via GitHub Sponsors
1.26.2
:warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap
- Fixed an issue where
wrap_socket
andCERT_REQUIRED
wouldn't be imported properly on Python 2.7.8 and earlier (Pull #2052)1.26.1
:warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap
- Fixed an issue where two
User-Agent
headers would be sent if aUser-Agent
header key is passed asbytes
(Pull #2047)1.26.0
:warning: IMPORTANT: urllib3 v2.0 will drop support for Python 2: Read more in the v2.0 Roadmap
Added support for HTTPS proxies contacting HTTPS servers (Pull #1923, Pull #1806)
Deprecated negotiating TLSv1 and TLSv1.1 by default. Users that still wish to use TLS earlier than 1.2 without a deprecation warning should opt-in explicitly by setting
ssl_version=ssl.PROTOCOL_TLSv1_1
(Pull #2002) Starting in urllib3 v2.0: Connections that receive aDeprecationWarning
will failDeprecated
Retry
optionsRetry.DEFAULT_METHOD_WHITELIST
,Retry.DEFAULT_REDIRECT_HEADERS_BLACKLIST
andRetry(method_whitelist=...)
in favor ofRetry.DEFAULT_ALLOWED_METHODS
,Retry.DEFAULT_REMOVE_HEADERS_ON_REDIRECT
, andRetry(allowed_methods=...)
(Pull #2000) Starting in urllib3 v2.0: Deprecated options will be removed
... (truncated)
Sourced from urllib3's changelog.
1.26.5 (2021-05-26)
- Fixed deprecation warnings emitted in Python 3.10.
- Updated vendored
six
library to 1.16.0.- Improved performance of URL parser when splitting the authority component.
1.26.4 (2021-03-15)
- Changed behavior of the default
SSLContext
when connecting to HTTPS proxy during HTTPS requests. The defaultSSLContext
now setscheck_hostname=True
.1.26.3 (2021-01-26)
Fixed bytes and string comparison issue with headers (Pull #2141)
Changed
ProxySchemeUnknown
error message to be more actionable if the user supplies a proxy URL without a scheme. (Pull #2107)1.26.2 (2020-11-12)
- Fixed an issue where
wrap_socket
andCERT_REQUIRED
wouldn't be imported properly on Python 2.7.8 and earlier (Pull #2052)1.26.1 (2020-11-11)
- Fixed an issue where two
User-Agent
headers would be sent if aUser-Agent
header key is passed asbytes
(Pull #2047)1.26.0 (2020-11-10)
NOTE: urllib3 v2.0 will drop support for Python 2.
Read more in the v2.0 Roadmap <https://urllib3.readthedocs.io/en/latest/v2-roadmap.html>
_.Added support for HTTPS proxies contacting HTTPS servers (Pull #1923, Pull #1806)
Deprecated negotiating TLSv1 and TLSv1.1 by default. Users that still wish to use TLS earlier than 1.2 without a deprecation warning
... (truncated)
d161647
Release 1.26.52d4a3fe
Improve performance of sub-authority splitting in URL2698537
Update vendored six to 1.16.007bed79
Fix deprecation warnings for Python 3.10 ssl moduled725a9b
Add Python 3.10 to GitHub Actions339ad34
Use pytest==6.2.4 on Python 3.10+f271c9c
Apply latest Black formatting1884878
[1.26] Properly proxy EOF on the SSLTransport test suitea891304
Release 1.26.48d65ea1
Merge pull request from GHSA-5phf-pp7p-vc2rDependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase
.
Bumps pygments from 2.2.0 to 2.7.4.
Sourced from pygments's releases.
2.7.4
Updated lexers:
Fix infinite loop in SML lexer (#1625)
Fix backtracking string regexes in JavaScript/TypeScript, Modula2 and many other lexers (#1637)
Limit recursion with nesting Ruby heredocs (#1638)
Fix a few inefficient regexes for guessing lexers
Fix the raw token lexer handling of Unicode (#1616)
Revert a private API change in the HTML formatter (#1655) -- please note that private APIs remain subject to change!
Fix several exponential/cubic-complexity regexes found by Ben Caller/Doyensec (#1675)
Fix incorrect MATLAB example (#1582)
Thanks to Google's OSS-Fuzz project for finding many of these bugs.
2.7.3
... (truncated)
Sourced from pygments's changelog.
Version 2.7.4
(released January 12, 2021)
Updated lexers:
Fix infinite loop in SML lexer (#1625)
Fix backtracking string regexes in JavaScript/TypeScript, Modula2 and many other lexers (#1637)
Limit recursion with nesting Ruby heredocs (#1638)
Fix a few inefficient regexes for guessing lexers
Fix the raw token lexer handling of Unicode (#1616)
Revert a private API change in the HTML formatter (#1655) -- please note that private APIs remain subject to change!
Fix several exponential/cubic-complexity regexes found by Ben Caller/Doyensec (#1675)
Fix incorrect MATLAB example (#1582)
Thanks to Google's OSS-Fuzz project for finding many of these bugs.
Version 2.7.3
(released December 6, 2020)
... (truncated)
4d555d0
Bump version to 2.7.4.fc3b05d
Update CHANGES.ad21935
Revert "Added dracula theme style (#1636)"e411506
Prepare for 2.7.4 release.275e34d
doc: remove Perl 6 ref2e7e8c4
Fix several exponential/cubic complexity regexes found by Ben Caller/Doyenseceb39c43
xquery: fix pop from empty stack2738778
fix coding style in test_analyzer_lexer02e0f09
Added 'ERROR STOP' to fortran.py keywords. (#1665)c83fe48
support added for css variables (#1633)Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase
.