Add WeChat article detail retrieval #190
mx472756841
commented
Apr 8, 2018
- Add WeChat article detail retrieval
- Add test cases for WeChat article detail retrieval
wechatsogou/structuring.py
Outdated
def get_article_detail(text, del_qqmusic=True, del_voice=True):
    """

    Get article details from a WeChat article's temporary link
Put this summary sentence on the same line as the opening """, then leave a blank line before the detailed description.
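A minimal sketch of the docstring layout requested above: the summary sits on the same line as the opening quotes, followed by a blank line and then the detailed section. The parameter descriptions are illustrative assumptions, not text from the PR, and the body is a placeholder.

```python
def get_article_detail(text, del_qqmusic=True, del_voice=True):
    """Get article details from a WeChat article's temporary link.

    Parameters
    ----------
    text : str
        HTML of the article page (assumed meaning of this argument).
    del_qqmusic : bool
        Whether to remove embedded qqmusic elements.
    del_voice : bool
        Whether to remove embedded voice elements.
    """
    # Placeholder body: the real parsing logic lives in the PR.
    raise NotImplementedError


# The first docstring line is now the one-sentence summary.
summary_line = get_article_detail.__doc__.splitlines()[0]
```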
wechatsogou/structuring.py
Outdated
    }
    """
    BACKGROUD_IMAGE_P = re.compile('background-image:[ ]+url\(\"([\w\W]+?)\"\)')
    JS_CONTENT = re.compile('js_content.*?>((\s|\S)+)</div>')
Compile the regexps outside the function, at module level, around line 17.
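A small sketch of what moving the compile step to module level could look like. The pattern is the one from the diff (the constant name is spelled as in the PR), written as a raw string; the helper `find_background_images` is a hypothetical name used only for illustration.

```python
import re

# Compiled once at import time instead of on every function call.
# A raw string keeps \( and \w from being treated as string escapes.
BACKGROUD_IMAGE_P = re.compile(r'background-image:[ ]+url\("([\w\W]+?)"\)')


def find_background_images(html):
    """Return every background-image URL found in inline styles."""
    return BACKGROUD_IMAGE_P.findall(html)


sample = '<p style="background-image: url("https://example.com/a.png")"></p>'
urls = find_background_images(sample)
```

Because the pattern object is created when the module is imported, repeated calls reuse it rather than going through `re.compile` each time.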
    if del_qqmusic:
        qqmusic = content_text.find_all('qqmusic')
        for music in qqmusic:
            music.parent.decompose()
If del_qqmusic is False, could we return the list of music links? Same question for voice.
The source URL of a qqmusic element can be obtained, and that URL does not expire; like an embedded QQ video, it redirects to Tencent's player.
The source URL of an mpvoice element has the form /cgi-bin/readtemplate?t=tmpl/audio_tmpl&name=%E6%97%A9%E5%AE%89%E6%AD%A6%E6%B1%893%E6%9C%8819%E6%97%A5&play_length=04:43
Even if these two kinds of content are crawled, they cannot be used as-is, because each needs a custom player. That is why they are simply removed here. If they were kept, they would be visible but not playable.
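To illustrate why the mpvoice source is not directly playable, here is a hedged sketch that only decodes the query string of the URL quoted above. It shows that the link carries a template name, a title, and a duration rather than an audio file. The helper name `describe_mpvoice` is hypothetical and not part of the PR.

```python
from urllib.parse import parse_qs, urlsplit

# Relative source URL quoted in the comment above.
MPVOICE_SRC = (
    "/cgi-bin/readtemplate?t=tmpl/audio_tmpl"
    "&name=%E6%97%A9%E5%AE%89%E6%AD%A6%E6%B1%893%E6%9C%8819%E6%97%A5"
    "&play_length=04:43"
)


def describe_mpvoice(src):
    """Extract the template, title, and play length from an mpvoice source URL."""
    query = parse_qs(urlsplit(src).query)
    return {
        "template": query["t"][0],
        "name": query["name"][0],  # parse_qs percent-decodes this as UTF-8
        "play_length": query["play_length"][0],
    }


info = describe_mpvoice(MPVOICE_SRC)
```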
wechatsogou/structuring.py
Outdated
    # 5. Return the data

    all_img_list = list(all_img_set)
This blank line is unnecessary.
wechatsogou/structuring.py
Outdated
    all_img_list = list(all_img_set)
    content_html = content_text.prettify()
    # Strip the div[id=js_content] wrapper
    content_html = re.findall(JS_CONTENT, content_html)[0][0]
Make sure this [0][0] cannot go out of range.
It will not go out of range unless WeChat changes its page layout.
If the layout does change, this whole part will need to be refactored anyway.
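One way to address the out-of-range concern without changing behavior on today's pages is to check the match explicitly. This is a sketch under assumptions: `ArticleLayoutError` and `extract_js_content` are hypothetical names, and the project's own WechatSogouException could be raised instead.

```python
import re

JS_CONTENT = re.compile(r'js_content.*?>((\s|\S)+)</div>')


class ArticleLayoutError(Exception):
    """Hypothetical error raised when the expected markup is missing."""


def extract_js_content(content_html):
    """Return the inner HTML of the js_content div, or raise a clear error."""
    match = JS_CONTENT.search(content_html)
    if match is None:
        # Surfaces a layout change as a readable error instead of an IndexError.
        raise ArticleLayoutError('div#js_content not found; page layout may have changed')
    # group(1) is the same value as re.findall(...)[0][0] in the diff.
    return match.group(1)


inner = extract_js_content('<div id="js_content"><p>hi</p></div>')
```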
wechatsogou/structuring.py
Outdated
    content_html = re.findall(JS_CONTENT, content_html)[0][0]
    content_info['content_html'] = content_html
    content_info['content_img_list'] = all_img_list
    return content_info
How about returning the dict directly:
return {
    'content_html': content_html,
    'content_img_list': all_img_list
}
Would that work? As far as I can see, content_info is only used here.
Yes, that works.
wechatsogou/api.py
Outdated
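        unlock_callback : callable
            Function that handles the captcha that appears on the history page; see unlock_callback_example
        identify_image_callback : callable
            Function that solves the captcha for the history page: takes the captcha image as binary data and returns its text; see identify_image_callback_example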
The "history page" wording in this docstring does not look accurate here.
wechatsogou/api.py
Outdated
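        identify_image_callback : callable
            Function that solves the captcha for the history page: takes the captcha image as binary data and returns its text; see identify_image_callback_example
        hosting_callback: callable
            Callback that hosts the collected WeChat article content on 7牛 or Alibaba Cloud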
The name should be written 七牛 (Qiniu), not 7牛.
Please document the callback's input argument and its return value.
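A hedged sketch of how such a callback and its contract might be documented. It assumes, based on the call `hosting_callback(img_url)` in this PR, that the callback receives one image URL and returns the hosted URL. The function name and the `cdn.example.com` host are placeholders.

```python
def example_hosting_callback(img_url):
    """Hypothetical hosting callback.

    Parameters
    ----------
    img_url : str
        Original image URL found in the WeChat article.

    Returns
    -------
    str
        URL of the re-hosted copy. Must not be None.
    """
    # A real callback would upload the image to a storage service here;
    # this stand-in only rewrites the URL onto a placeholder host.
    return 'https://cdn.example.com/' + img_url.rsplit('/', 1)[-1]


hosted = example_hosting_callback('http://mmbiz.qpic.cn/abc/640')
```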
wechatsogou/api.py
Outdated
        Parameters
        ----------
The parameter documentation is missing here.
wechatsogou/api.py
Outdated
        content_html = content_info.pop("content_html")
        for idx, img_url in enumerate(content_img_list):
            hosting_img_url = hosting_callback(img_url)
            assert hosting_img_url is None
Why is this assertion here?
It should be "is not None": hosting_callback must return a link.
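The corrected check could look roughly like the sketch below. It raises explicitly rather than using assert, since assert statements are skipped when Python runs with -O. The function name `replace_hosted_images` and the use of `str.replace` to swap URLs are assumptions; the PR's actual replacement logic is not shown in this diff.

```python
def replace_hosted_images(content_html, content_img_list, hosting_callback):
    """Swap each original image URL for the URL returned by hosting_callback."""
    for img_url in content_img_list:
        hosting_img_url = hosting_callback(img_url)
        # The callback must return a link; fail loudly if it does not.
        if hosting_img_url is None:
            raise ValueError('hosting_callback returned None for {}'.format(img_url))
        content_html = content_html.replace(img_url, hosting_img_url)
    return content_html


html = '<img src="a.png"><img src="b.png">'
rewritten = replace_hosted_images(html, ['a.png', 'b.png'], lambda url: 'https://h/' + url)
```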
wechatsogou/api.py
Outdated
        resp.encoding = 'utf-8'
        if '链接已过期' in resp.text:
            raise WechatSogouException('get_article_content 链接 [{}] 已过期'.format(url))
        content_info = WechatSogouStructuring.get_article_detail(resp.text)
get_article_content should also accept get_article_detail's optional parameters (del_qqmusic, del_voice) and pass them through.
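A simplified sketch of the pass-through being requested. To stay self-contained it uses a stand-in `get_article_detail` and takes the page text directly instead of fetching a URL, so the signatures here are assumptions rather than the project's real API.

```python
def get_article_detail(text, del_qqmusic=True, del_voice=True):
    """Stand-in for WechatSogouStructuring.get_article_detail."""
    return {'del_qqmusic': del_qqmusic, 'del_voice': del_voice, 'length': len(text)}


def get_article_content(text, del_qqmusic=True, del_voice=True):
    """Forward the optional flags instead of always using the defaults."""
    return get_article_detail(text, del_qqmusic=del_qqmusic, del_voice=del_voice)


result = get_article_content('<div></div>', del_qqmusic=False)
```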
        assert_equal(len(article_detail['content_img_list']), 2, article_detail)
        assert_true('data-wxurl' not in article_detail['content_html'], article_detail['content_html'])
        assert_true('qqmusic' not in article_detail['content_html'], article_detail['content_html'])
        assert_true('mpvoice' not in article_detail['content_html'], article_detail['content_html'])
Use assert_not_in here.
Thanks.
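For reference, a sketch of the membership-style assertion using the standard library's `assertNotIn`, which is the unittest counterpart of the assert_not_in helper suggested above. The HTML string is a stand-in for `article_detail['content_html']`, and the test class name is hypothetical.

```python
import unittest


class ArticleDetailTest(unittest.TestCase):
    def test_removed_elements(self):
        # Stand-in for article_detail['content_html'].
        content_html = '<p>text only</p>'
        for marker in ('data-wxurl', 'qqmusic', 'mpvoice'):
            # On failure, the message shows both the marker and the container.
            self.assertNotIn(marker, content_html)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ArticleDetailTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

Compared with `assert_true(x not in y, y)`, this form reports what was unexpectedly found without a hand-written message.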