Add WeChat article detail retrieval #190

mx472756841 · 2018-04-08T07:48:17Z

Add WeChat article detail retrieval
Add test cases for WeChat article detail retrieval

chyroc · 2018-04-08T09:49:54Z

wechatsogou/structuring.py

+    def get_article_detail(text, del_qqmusic=True, del_voice=True):
+        """
+
+        Get article details from a WeChat article's temporary link


Put the one-line summary directly after the opening """, then leave a blank line before the detailed comments.
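A minimal sketch of the docstring layout requested here: summary on the same line as the opening quotes, one blank line, then the details. The parameter descriptions are placeholder wording and the body is omitted; neither is taken from the PR.

```python
def get_article_detail(text, del_qqmusic=True, del_voice=True):
    """Get article details from a WeChat article's temporary link.

    Parameters
    ----------
    text : str
        Raw HTML of the article page (placeholder wording).
    del_qqmusic : bool
        Whether to remove embedded qqmusic elements (placeholder wording).
    del_voice : bool
        Whether to remove embedded voice elements (placeholder wording).
    """
    # Body intentionally omitted; this sketch only shows the docstring shape.
    raise NotImplementedError("body omitted in this sketch")
```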

chyroc · 2018-04-08T10:25:18Z

wechatsogou/structuring.py

+        }
+        """
+        BACKGROUD_IMAGE_P = re.compile('background-image:[ ]+url\(\"([\w\W]+?)\"\)')
+        JS_CONTENT = re.compile('js_content.*?>((\s|\S)+)</div>')


Compile the regexps outside the function, around line 17.
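Roughly what that would look like, as a sketch: the two patterns from the diff are compiled once at module level and reused on every call. The helper `find_background_images` and the sample URL in the usage note are hypothetical. `BACKGROUD_IMAGE_P` keeps the spelling used in the diff, and raw strings are used here so the backslash escapes are passed to `re` unchanged.

```python
import re

# Compiled once at import time instead of on every function call.
BACKGROUD_IMAGE_P = re.compile(r'background-image:[ ]+url\("([\w\W]+?)"\)')
JS_CONTENT = re.compile(r'js_content.*?>((\s|\S)+)</div>')


def find_background_images(html):
    """Return every URL found in an inline background-image style.

    Hypothetical helper, only here to show the module-level pattern in use.
    """
    return BACKGROUD_IMAGE_P.findall(html)
```

Calling `find_background_images` on markup that contains `background-image: url("https://example.com/a.png")` returns a one-element list with that URL.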

chyroc · 2018-04-08T10:26:48Z

wechatsogou/structuring.py

+        if del_qqmusic:
+            qqmusic = content_text.find_all('qqmusic')
+            for music in qqmusic:
+                music.parent.decompose()


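If del_qqmusic is False, can we get a list of the music links instead? Same question for voice.

For qqmusic the source address can be retrieved, and that address does not expire. As with an embedded QQ video, it redirects to Tencent's player.

The mpvoice source address has the form /cgi-bin/readtemplate?t=tmpl/audio_tmpl&name=%E6%97%A9%E5%AE%89%E6%AD%A6%E6%B1%893%E6%9C%8819%E6%97%A5&play_length=04:43

Neither kind of content is usable once scraped, because each needs its own custom player. That is why both are deleted outright here. If they were kept, they would be visible but not playable.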

chyroc · 2018-04-08T10:29:55Z

wechatsogou/structuring.py

+
+        # 5. Return the data
+
+        all_img_list = list(all_img_set)


This blank line is unnecessary.

chyroc · 2018-04-08T10:30:28Z

wechatsogou/structuring.py

+        all_img_list = list(all_img_set)
+        content_html = content_text.prettify()
+        # Strip div[id=js_content]
+        content_html = re.findall(JS_CONTENT, content_html)[0][0]


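Make sure this [0][0] cannot go out of range.

It will not go out of range unless WeChat changes its page layout.
If the layout does change, this whole part will need to be refactored anyway.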
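One possible way to make the failure explicit, as a sketch rather than the PR's code: use `search` and check for `None` so a layout change produces a clear error instead of an `IndexError`. `ArticleParseError` and `extract_js_content` are hypothetical names; `JS_CONTENT` is the pattern from the diff.

```python
import re

JS_CONTENT = re.compile(r'js_content.*?>((\s|\S)+)</div>')


class ArticleParseError(Exception):
    """Hypothetical error type used only in this sketch."""


def extract_js_content(content_html):
    """Return the inner HTML of the js_content div, or raise a clear error."""
    match = JS_CONTENT.search(content_html)
    if match is None:
        # Reached when the page no longer contains the expected div.
        raise ArticleParseError('js_content div not found; the page layout may have changed')
    # group(1) is the same value that re.findall(...)[0][0] would return.
    return match.group(1)
```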

chyroc · 2018-04-08T10:31:48Z

wechatsogou/structuring.py

+        content_html = re.findall(JS_CONTENT, content_html)[0][0]
+        content_info['content_html'] = content_html
+        content_info['content_img_list'] = all_img_list
+        return content_info


Could this simply be

return { 'content_html': content_html, 'content_img_list': all_img_list }

instead? As far as I can see, content_info is only used here.

chyroc · 2018-04-08T10:34:07Z

wechatsogou/api.py

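+        unlock_callback : callable
+            Function that handles the CAPTCHA shown on the history page; see unlock_callback_example
+        identify_image_callback : callable
+            Function that solves the CAPTCHA for the history page: takes the CAPTCHA's binary data and returns the text; see identify_image_callback_example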


In these docstrings, "history page" (历史页) is probably not accurate, is it?

chyroc · 2018-04-08T10:34:23Z

wechatsogou/api.py

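+        identify_image_callback : callable
+            Function that solves the CAPTCHA for the history page: takes the CAPTCHA's binary data and returns the text; see identify_image_callback_example
+        hosting_callback: callable
+            Callback that hosts the scraped WeChat article on Qiniu or Aliyun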


Please document its input parameters and its return value.

chyroc · 2018-04-08T10:35:32Z

wechatsogou/api.py

+
+        Parameters
+        ----------
+


Document the parameters here.

chyroc · 2018-04-08T10:36:40Z

wechatsogou/api.py

+        content_html = content_info.pop("content_html")
+        for idx, img_url in enumerate(content_img_list):
+            hosting_img_url = hosting_callback(img_url)
+            assert hosting_img_url is None


Why is this?

It should be is not None; hosting_callback must return a link.
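A sketch of the corrected check, under the assumption that hosting_callback takes the original image URL as a string and returns the hosted URL as a string. The replacement step, `replace_with_hosted_images`, and `fake_hosting_callback` are illustrative and not taken from the PR.

```python
def replace_with_hosted_images(content_html, content_img_list, hosting_callback):
    """Swap each original image URL for the URL returned by hosting_callback.

    Assumption: hosting_callback(img_url: str) -> str. A None result is
    treated as a bug in the callback.
    """
    for img_url in content_img_list:
        hosting_img_url = hosting_callback(img_url)
        # Corrected direction of the check: the callback must return a link.
        assert hosting_img_url is not None, 'hosting_callback must return a URL'
        content_html = content_html.replace(img_url, hosting_img_url)
    return content_html


def fake_hosting_callback(img_url):
    """Stand-in callback; a real one would upload the image to a storage service."""
    return 'https://cdn.example.com/' + img_url.rsplit('/', 1)[-1]
```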

chyroc · 2018-04-08T10:38:13Z

wechatsogou/api.py

+        resp.encoding = 'utf-8'
+        if '链接已过期' in resp.text:
+            raise WechatSogouException('get_article_content 链接 [{}] 已过期'.format(url))
+        content_info = WechatSogouStructuring.get_article_detail(resp.text)


get_article_content should also accept get_article_detail's optional parameters.
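In other words, something along these lines. This is a sketch with stubs: `get_article_detail` here only echoes its flags, and the `fetch` argument is a hypothetical hook so the example runs without network access. The real method would keep its own request logic.

```python
def get_article_detail(text, del_qqmusic=True, del_voice=True):
    """Stub for the structuring function; it only echoes the flags it received."""
    return {'del_qqmusic': del_qqmusic, 'del_voice': del_voice, 'length': len(text)}


def get_article_content(url, fetch, del_qqmusic=True, del_voice=True):
    """Fetch the page, then forward the optional flags unchanged.

    `fetch` is a hypothetical callable taking a URL and returning HTML text.
    """
    text = fetch(url)
    return get_article_detail(text, del_qqmusic=del_qqmusic, del_voice=del_voice)
```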

chyroc · 2018-04-08T12:14:29Z

test/test_structuring.py

+        assert_equal(len(article_detail['content_img_list']), 2, article_detail)
+        assert_true('data-wxurl' not in article_detail['content_html'], article_detail['content_html'])
+        assert_true('qqmusic' not in article_detail['content_html'], article_detail['content_html'])
+        assert_true('mpvoice' not in article_detail['content_html'], article_detail['content_html'])


assert_not_in
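For illustration, the same three checks written with a not-in assertion. This sketch uses the standard library's `unittest.TestCase.assertNotIn` as a stand-in for `assert_not_in`, and the `article_detail` value is a hypothetical parsed result, not real test data from the PR.

```python
import unittest


class ArticleDetailTest(unittest.TestCase):
    def test_media_markers_removed(self):
        # Hypothetical parsed result standing in for get_article_detail output.
        article_detail = {'content_html': '<p>text only</p>'}
        for marker in ('data-wxurl', 'qqmusic', 'mpvoice'):
            # On failure the message already shows both the marker and the HTML.
            self.assertNotIn(marker, article_detail['content_html'])
```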

chyroc · 2018-04-09T02:03:34Z

thanks

Add WeChat article detail retrieval

ae480a1

chyroc reviewed Apr 8, 2018

Revise comments

a0e54f0

chyroc reviewed Apr 8, 2018

Revise some details

f642643

chyroc reviewed Apr 8, 2018

mengx added 2 commits April 9, 2018 09:04

Revise some details

4d03bf6

add article_detail test

e4b440d

chyroc merged commit 94c1121 into chyroc:master Apr 9, 2018
