Add WeChat article detail retrieval #190

Merged
merged 5 commits into from
Apr 9, 2018

mx472756841
Contributor

  1. Add WeChat article detail retrieval
  2. Add test cases for WeChat article detail retrieval

def get_article_detail(text, del_qqmusic=True, del_voice=True):
    """

    Get details from a WeChat article's temporary link

Owner

Put the sentence "Get details from a WeChat article's temporary link" directly after the opening """, then leave one blank line before the detailed description.
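
A minimal sketch of the docstring layout requested here: summary on the same line as the opening quotes, one blank line, then the details. The parameter descriptions are assumptions made for illustration and are not taken from the pull request; the body is deliberately left out.

```python
def get_article_detail(text, del_qqmusic=True, del_voice=True):
    """Get details from a WeChat article's temporary link.

    Parameters
    ----------
    text : str
        HTML source of the article page (assumed description).
    del_qqmusic : bool
        Whether to remove embedded qqmusic elements (assumed description).
    del_voice : bool
        Whether to remove embedded voice elements (assumed description).
    """
    # Only the docstring layout matters in this sketch.
    raise NotImplementedError('body omitted')
```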

}
"""
BACKGROUD_IMAGE_P = re.compile('background-image:[ ]+url\(\"([\w\W]+?)\"\)')
JS_CONTENT = re.compile('js_content.*?>((\s|\S)+)</div>')

Owner

Compile the regular expressions outside the function, around line 17 of the file.
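
One way to read this request is to compile the patterns once at module level so that every call reuses them. The sketch below uses the two patterns from the diff, written as raw strings; `find_background_images` is a hypothetical helper added only to show a call site.

```python
import re

# Compiled once at import time. The first name is spelled as in the pull request.
BACKGROUD_IMAGE_P = re.compile(r'background-image:[ ]+url\("([\w\W]+?)"\)')
JS_CONTENT = re.compile(r'js_content.*?>((\s|\S)+)</div>')


def find_background_images(html):
    """Return every URL used in an inline background-image style."""
    return BACKGROUD_IMAGE_P.findall(html)
```

Raw strings keep the backslashes intact without relying on Python passing unknown escape sequences through unchanged.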

if del_qqmusic:
    qqmusic = content_text.find_all('qqmusic')
    for music in qqmusic:
        music.parent.decompose()

Owner

If del_qqmusic is False, can the list of music links still be retrieved? The same question applies to voice.

Contributor Author

The source URL of a qqmusic element can be retrieved, and that URL does not expire. Like an embedded QQ video, it redirects to Tencent's player.

The source URL of an mpvoice element has the form /cgi-bin/readtemplate?t=tmpl/audio_tmpl&name=%E6%97%A9%E5%AE%89%E6%AD%A6%E6%B1%893%E6%9C%8819%E6%97%A5&play_length=04:43.

Neither element is usable once scraped, because each needs a custom player. That is why both are deleted outright here. If they were kept, they would be visible but could not be played.


# 5. Return the data

all_img_list = list(all_img_set)

Owner

This blank line is unnecessary.

all_img_list = list(all_img_set)
content_html = content_text.prettify()
# Strip the div[id=js_content] wrapper
content_html = re.findall(JS_CONTENT, content_html)[0][0]

Owner

Make sure this [0][0] cannot go out of range.

Contributor Author

It will not go out of range unless WeChat changes its page layout.
If the layout does change, this whole section will need to be refactored anyway.
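
A hedged sketch of one way to make the lookup fail with a clear message if the layout ever changes, instead of a bare IndexError. `extract_js_content` is a hypothetical helper, and ValueError stands in for whatever exception the project would prefer to raise.

```python
import re

JS_CONTENT = re.compile(r'js_content.*?>((\s|\S)+)</div>')


def extract_js_content(content_html):
    """Return the inner HTML of the js_content div, or raise a clear error."""
    match = JS_CONTENT.search(content_html)
    if match is None:
        # re.findall(...)[0][0] would raise IndexError at this point.
        raise ValueError('js_content container not found; the page layout may have changed')
    # group(1) is the same text that findall(...)[0][0] returns.
    return match.group(1)
```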

content_html = re.findall(JS_CONTENT, content_html)[0][0]
content_info['content_html'] = content_html
content_info['content_img_list'] = all_img_list
return content_info

Owner

Could this simply be

return {
    'content_html': content_html,
    'content_img_list': all_img_list
}

As far as I can see, content_info is only used here.

Contributor Author

Yes, that works.

unlock_callback : callable
    Function that handles the CAPTCHA shown when the history page is requested; see unlock_callback_example
identify_image_callback : callable
    Function that solves the CAPTCHA when handling the history page; takes the CAPTCHA image as binary data and returns its text; see identify_image_callback_example

Owner

In these comments, "history page" is probably not accurate, is it?

identify_image_callback : callable
    Function that solves the CAPTCHA when handling the history page; takes the CAPTCHA image as binary data and returns its text; see identify_image_callback_example
hosting_callback: callable
    Callback that hosts articles collected from WeChat on 7牛 or Alibaba Cloud (Aliyun)

Owner

七牛 (Qiniu), not 7牛

Owner

Please document the callback's input parameters and return value.


Parameters
----------

Owner

Add descriptions for the parameters.

content_html = content_info.pop("content_html")
for idx, img_url in enumerate(content_img_list):
    hosting_img_url = hosting_callback(img_url)
    assert hosting_img_url is None

Owner

Why is this?

Contributor Author

It should be "is not None"; hosting_callback must return a link.
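
A rough sketch of the corrected check, assuming the loop's purpose is to swap each original image URL for the hosted one. That purpose is an assumption, since the rest of the loop is not shown in the diff. `host_images` is a hypothetical helper, and an explicit exception is used in place of assert because asserts are skipped under python -O.

```python
def host_images(content_html, content_img_list, hosting_callback):
    """Replace each original image URL in the HTML with its hosted URL."""
    hosted_img_list = []
    for img_url in content_img_list:
        hosting_img_url = hosting_callback(img_url)
        # The callback must return a link, so None is treated as an error.
        if hosting_img_url is None:
            raise ValueError('hosting_callback returned None for {}'.format(img_url))
        content_html = content_html.replace(img_url, hosting_img_url)
        hosted_img_list.append(hosting_img_url)
    return content_html, hosted_img_list
```

A callback here takes the original image URL as a string and returns the URL of the hosted copy.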

resp.encoding = 'utf-8'
if '链接已过期' in resp.text:
    raise WechatSogouException('get_article_content 链接 [{}] 已过期'.format(url))
content_info = WechatSogouStructuring.get_article_detail(resp.text)

Owner

get_article_content needs to accept get_article_detail's optional parameters as well.
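
A simplified sketch of forwarding the flags. The real get_article_content takes a URL and fetches the page first; here it takes the HTML directly, and get_article_detail is a stand-in that only records the flags it receives, so both signatures are assumptions.

```python
def get_article_detail(text, del_qqmusic=True, del_voice=True):
    # Stand-in for WechatSogouStructuring.get_article_detail.
    return {'content_html': text, 'del_qqmusic': del_qqmusic, 'del_voice': del_voice}


def get_article_content(html, del_qqmusic=True, del_voice=True):
    """Forward the optional flags so callers can keep music or voice elements."""
    return get_article_detail(html, del_qqmusic=del_qqmusic, del_voice=del_voice)
```

Keeping the same defaults in both functions means existing callers see no change in behavior.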

assert_equal(len(article_detail['content_img_list']), 2, article_detail)
assert_true('data-wxurl' not in article_detail['content_html'], article_detail['content_html'])
assert_true('qqmusic' not in article_detail['content_html'], article_detail['content_html'])
assert_true('mpvoice' not in article_detail['content_html'], article_detail['content_html'])

Owner

Use assert_not_in here.
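
A small sketch of the same checks written with a membership assertion. It uses the standard library's assertNotIn, on the assumption that the assert_not_in the reviewer names behaves the same way; the article_detail value is made up for the example.

```python
import unittest


class ArticleDetailTest(unittest.TestCase):
    def test_removed_markup(self):
        # Hypothetical result; a real test would call get_article_detail.
        article_detail = {'content_html': '<p>text only</p>'}
        for marker in ('data-wxurl', 'qqmusic', 'mpvoice'):
            # On failure the message reports both the marker and the HTML.
            self.assertNotIn(marker, article_detail['content_html'])
```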

chyroc merged commit 94c1121 into chyroc:master Apr 9, 2018

Owner

chyroc commented Apr 9, 2018

thanks
