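Scraping Bilibili video danmaku (bullet comments)
Reference blogs:
- How to obtain avid and cid: https://blog.muna.uk/archives/Bilibili_apis.html
- How to fetch comments, and how to use some of Bilibili's APIs: https://zhuanlan.zhihu.com/p/357392015
- The video comment endpoint: https://blog.csdn.net/Mr_Ohahah/article/details/108315942
The approach is the same as in these posts, but the parsing part is my own.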
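The parsing step can be tried offline before any request is sent. Below is a minimal sketch, assuming the danmaku file is an XML document with one `<d>` element per danmaku, which is what the script's `//d` query relies on. The sample document and the name `extract_danmaku` are made up for illustration, and the standard library's `xml.etree.ElementTree` stands in for `lxml` so the snippet runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape the script expects: one <d> element per danmaku.
SAMPLE_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<i>'
    '<d>first</d>'
    '<d></d>'
    '<d>哈哈哈</d>'
    '</i>'
).encode('utf-8')


def extract_danmaku(xml_bytes):
    """Return the text of every non-empty <d> element, in document order."""
    root = ET.fromstring(xml_bytes)
    texts = []
    for d in root.iter('d'):
        # d.text is None for an empty element, so empty danmaku are skipped
        if d.text:
            texts.append(d.text)
    return texts


if __name__ == '__main__':
    print(extract_danmaku(SAMPLE_XML))
```

Returning a list, rather than appending to a module-level one, makes the function easy to check against a known input.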
import requests
from lxml import etree
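'''
Goal: given a BV id, download the video's danmaku directly.
References:
# How to obtain avid and cid
1. https://blog.muna.uk/archives/Bilibili_apis.html
# How to fetch comments; how to use some of Bilibili's APIs
2. https://zhuanlan.zhihu.com/p/357392015
# The comment endpoint
3. https://blog.csdn.net/Mr_Ohahah/article/details/108315942
Possible improvements:
1. Download more danmaku (currently capped at about 1200)
2. Scrape danmaku from multiple videos
3. Rewrite in async style
'''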
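# Site home page (not used below)
url = 'https://www.bilibili.com/'
# One session reused for every request
session = requests.session()
# Danmaku texts collected by parse_xml()
barrages = []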
def get_barrage_data(bv, header):
    # Look up the video's details by BV id to obtain its cid
    new_url = f'https://api.bilibili.com/x/web-interface/view/detail?bvid={bv}'
    res = session.get(new_url, headers=header)
    print(res.status_code)
    data = res.json()
    cid = data['data']['View']['cid']
    print('cid:', cid)
    # The danmaku for that cid are served as an XML file
    barrage_url = f'https://comment.bilibili.com/{cid}.xml'
    res = session.get(barrage_url, headers=header)
    res.encoding = 'utf-8'
    print(res.status_code)
    # Parse
    parse_xml(res)
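# Parse the XML file
def parse_xml(res):
    xml_data = res.content
    xml = etree.fromstring(xml_data)
    # Each danmaku is a <d> element
    divs = xml.xpath('//d')
    print(len(divs))
    for div in divs:
        texts = div.xpath("text()")
        if not texts:
            # Skip empty <d> elements instead of raising IndexError
            continue
        barrage = texts[0]
        print(barrage)
        barrages.append(barrage)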
# Persist to disk
def save_txt(bv):
    with open(f'./{bv}.txt', 'w', encoding='utf-8') as f:
        for line in barrages:
            # write() takes a single string; writelines() expects an iterable of lines
            f.write(line + '\n')
def main(bv):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",
        'Referer': f'https://www.bilibili.com/video/{bv}/?'
    }
    get_barrage_data(bv, headers)
    save_txt(bv)
if __name__ == '__main__':
    bv = 'BV137411B7mQ'
    main(bv)
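The second improvement listed in the script's docstring, scraping several videos, could look roughly like the sketch below. It is only an outline under stated assumptions: `crawl_many` and `fetch_one` are hypothetical names, and `fetch_one` stands for any function that takes a BV id and returns that video's danmaku as a list (for example, a variant of `get_barrage_data` that returns its results instead of appending to the global `barrages`, which would otherwise mix the danmaku of different videos). The one-second default pause is an arbitrary choice, not a documented limit.

```python
import time


def crawl_many(bvids, fetch_one, delay=1.0):
    """Fetch danmaku for several videos, one after another.

    fetch_one is a hypothetical callable: BV id -> list of danmaku strings.
    Returns a dict that maps each BV id to its list of danmaku.
    """
    results = {}
    for index, bv in enumerate(bvids):
        if index > 0:
            # Pause between videos so requests are not sent back to back
            time.sleep(delay)
        try:
            results[bv] = list(fetch_one(bv))
        except Exception as exc:
            # One failing video should not abort the whole batch
            print(f'{bv}: failed ({exc})')
            results[bv] = []
    return results


if __name__ == '__main__':
    # Stand-in fetcher so the sketch runs without network access
    def fake_fetch(bv):
        return [f'{bv} danmaku 1', f'{bv} danmaku 2']

    print(crawl_many(['BV137411B7mQ'], fake_fetch, delay=0))
```

Passing the fetcher in as an argument keeps the loop independent of how a single video is downloaded, so the same function can be exercised with a stand-in before it is pointed at the real site.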