<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Recollection</title><link href="https://recollection.saaj.me/" rel="alternate"></link><link href="https://recollection.saaj.me/feed.xml" rel="self"></link><id>https://recollection.saaj.me/</id><updated>2018-04-15T00:00:00+02:00</updated><entry><title>Developer-friendly job search. The tooling</title><link href="https://recollection.saaj.me/article/developer-friendly-job-search-the-tooling.html" rel="alternate"></link><published>2018-04-15T00:00:00+02:00</published><updated>2018-04-15T00:00:00+02:00</updated><author><name>saaj</name></author><id>tag:recollection.saaj.me,2018-04-15:/article/developer-friendly-job-search-the-tooling.html</id><summary type="html"><p>It is quite interesting to me to see how software job market works in the Netherlands
from an active candidate&#8217;s point of view. Unfortunately one great journey has almost
ended and there is a firm incentive to find something at least as good. Having my
current job through StackOverflow Jobs, I intended to keep using what proved to work,
like marking the profile active, entering <tt class="docutils literal">python <span class="pre">-django</span> <span class="pre">-sysadmin</span></tt> into the search
field, adding relevant filters and subscribing to the search. That, as such, of course
works, but it looks like it is not the most popular software talent recruitment service
that Dutch companies use. And many companies are still quite happy to pay 20% of the
gross yearly salary per new hire to the volume-oriented keyword-matching salesmen,
colloquially referred to as&nbsp;recruiters.</p>
</summary><content type="html"><p>It is quite interesting to me to see how software job market works in the Netherlands
from an active candidate&#8217;s point of view. Unfortunately one great journey has almost
ended and there is a firm incentive to find something at least as good. Having my
current job through StackOverflow Jobs, I intended to keep using what proved to work,
like marking the profile active, entering <tt class="docutils literal">python <span class="pre">-django</span> <span class="pre">-sysadmin</span></tt> into the search
field, adding relevant filters and subscribing to the search. That, as such, of course
works, but it looks like it is not the most popular software talent recruitment service
that Dutch companies use. And many companies are still quite happy to pay 20% of the
gross yearly salary per new hire to the volume-oriented keyword-matching salesmen,
colloquially referred to as&nbsp;recruiters.</p>
<p>On the other side of the price spectrum there are free (as in beer&#8230; at a degustation)
services like Indeed and Glassdoor <a class="footnote-reference" href="#id23" id="id1">[1]</a> where employers can place their vacancies.
The two seem to me most relevant and complete, but because the latter also provides
valuable insight into company state, salary range and interview process I will
focus on&nbsp;it.</p>
<p>StackOverflow Jobs is in between. It&#8217;s definitely not free for employers. For our
small company it cost around €2k for half a year of access to the active candidate database
and one job posting. Most probably it&#8217;ll require a company to have a full-time
<span class="caps">HR</span> employee to manage the interactions with candidates. But hiring 4-5 developers a
year already justifies the investment, so Stack Exchange has growth potential&nbsp;here.</p>
<p>Glassdoor has poor search features: only term-inclusion search and a location filter, which
makes it a chore to examine the postings&nbsp;manually.</p>
<div class="contents topic" id="contents">
<p class="topic-title first">Contents</p>
<ul class="simple">
<li><a class="reference internal" href="#problem" id="id44">Problem</a></li>
<li><a class="reference internal" href="#storage" id="id45">Storage</a><ul>
<li><a class="reference internal" href="#fts3-and-fts4" id="id46"><span class="caps">FTS3</span> and <span class="caps">FTS4</span></a></li>
<li><a class="reference internal" href="#json1" id="id47"><span class="caps">JSON1</span></a></li>
<li><a class="reference internal" href="#implementation" id="id48">Implementation</a></li>
</ul>
</li>
<li><a class="reference internal" href="#retrieval" id="id49">Retrieval</a><ul>
<li><a class="reference internal" href="#pyppeteer" id="id50">Pyppeteer</a></li>
<li><a class="reference internal" href="#iteration" id="id51">Iteration</a></li>
<li><a class="reference internal" href="#parsing" id="id52">Parsing</a></li>
</ul>
</li>
<li><a class="reference internal" href="#glue" id="id53">Glue</a></li>
<li><a class="reference internal" href="#listing-relevant-jobs" id="id54">Listing relevant&nbsp;jobs</a></li>
<li><a class="reference internal" href="#wrap-up" id="id55">Wrap-up</a></li>
</ul>
</div>
<div class="section" id="problem">
<h2><a class="toc-backref" href="#id44">Problem</a></h2>
<p>StackOverflow Jobs search features include <a class="footnote-reference" href="#id24" id="id2">[2]</a>:</p>
<ol class="arabic simple">
<li>tag inclusion and&nbsp;exclusion</li>
<li>full-text search on body, title and&nbsp;company</li>
<li>enumeration filters by contract type, industry and&nbsp;seniority</li>
<li>salary&nbsp;range</li>
<li>flag filters by remote option, visa and relocation&nbsp;sponsorship</li>
<li>posted time&nbsp;filtering</li>
</ol>
<p>My job search criteria were the&nbsp;following:</p>
<ul class="simple">
<li>Python (backend software development) job preferably in&nbsp;Amsterdam</li>
<li>The company develops own product, i.e. it&#8217;s not a&nbsp;consultancy</li>
<li>The job is available directly, i.e. it&#8217;s not an agency or any
other kind of&nbsp;intermediary</li>
</ul>
<p>At first glance it looks simple. What subset of the above list of features
is sufficient to cover&nbsp;these?</p>
<ol class="arabic simple">
<li>Location data should be good enough for equality comparison, but generally
the country is small, very connected and geospatial search is not&nbsp;needed.</li>
<li><em>A Python job</em> is a match against <tt class="docutils literal">[python]</tt> minus some other tags (though only
if carefully tagged), but Glassdoor doesn&#8217;t have tags and implementing <span class="caps">NLP</span> to
recognise them was not feasible. Doing just <tt class="docutils literal">Python</tt> will not only match all
sorts of jobs with it in nice-to-have, but also match jobs that extensively use
Python but are not about software development as such, e.g. devops, (data)
science, test automation, screen scraping and probably others. Also don&#8217;t forget
full-stack and junior/intern positions (no dedicated seniority field on
Glassdoor&nbsp;either).</li>
<li>Solving consultancies and intermediaries should be possible with a blacklist,
but needs full-text index on company name to mitigate dealing with optionally
present suffixes like &#8220;B.V.&#8221; and &#8220;<span class="caps">BV</span>&#8221;, geographical suffixes and the like.
Using <span class="caps">NLP</span> to classify jobs as direct or indirect would be interesting
research, but again it was not&nbsp;feasible.</li>
</ol>
<p>Items 1 and 3 look clear. 2 needs more attention. It should also be possible to
solve with smarter blacklisting. But let&#8217;s look at a sample of job titles that
match the <tt class="docutils literal">Python</tt> query:</p>
<ul class="simple">
<li>Lead Data&nbsp;Scientist</li>
<li>DevOps Engineer (Linux, <span class="caps">AWS</span>)</li>
<li>Software Engineer Embedded Smart Systems (C/C++; C#;&nbsp;Python)</li>
<li>Agile&nbsp;Tester</li>
<li>Backend Software&nbsp;Engineer</li>
<li>Experienced Python/Django Developer in Zaandam (<span class="caps">NL</span>/<span class="caps">ENG</span>)</li>
<li>Java&nbsp;Developer</li>
<li>Junior Full Stack&nbsp;Engineer</li>
<li>Medior Front-end&nbsp;Developer</li>
<li>Bioinformatic Postdoctoral&nbsp;Fellow</li>
<li>C++ Software engineer Artificial intelligence <span class="caps">DELFT</span>&nbsp;40k-70k</li>
<li>Portfolio Manager Quant&nbsp;Allocation</li>
<li>PhD Student “Modelling causal interactions between marine&nbsp;microbes”</li>
</ul>
<p>Yes, <em>all sorts of jobs</em> as promised. The good thing is that indexing title and
body separately can help a lot with blacklisting. Say, if the title includes
<tt class="docutils literal">Django</tt> it&#8217;s definitely a no-go, but if <tt class="docutils literal">Django</tt> is mentioned in the body
it depends and may be tolerable. The same way other terms, like <tt class="docutils literal">Java</tt> or
<tt class="docutils literal">Embedded</tt> can be&nbsp;handled.</p>
</div>
<div class="section" id="storage">
<h2><a class="toc-backref" href="#id45">Storage</a></h2>
<p>Thus on the level of storage/search it&#8217;s necessary to&nbsp;have:</p>
<ol class="arabic simple">
<li>primary key, original, if exposed, or hash of the whole&nbsp;record</li>
<li>title, full-text&nbsp;indexed</li>
<li>employer, full-text&nbsp;indexed</li>
<li>body, full-text&nbsp;indexed</li>
<li><span class="caps">JSON</span> field or just blob for other&nbsp;properties</li>
<li>status, just integer for own&nbsp;convenience</li>
</ol>
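<p>For a source that does not expose a stable identifier, the hash option from item 1 could be sketched as below. The helper name <tt class="docutils literal">record_docid</tt>, the choice of <span class="caps">SHA</span>-1 and the truncation to 63 bits are assumptions made for illustration (SQLite row ids are signed 64-bit integers); nothing here is prescribed by the tooling described in this article.</p>

```python
import hashlib
import json


def record_docid(record):
    """Derive a stable, non-negative 63-bit integer key from a job record.

    Hypothetical helper: serialises the record deterministically and
    truncates its SHA-1 digest so the value fits a signed 64-bit row id.
    """
    payload = json.dumps(record, sort_keys=True, default=str).encode('utf-8')
    digest = hashlib.sha1(payload).digest()
    return int.from_bytes(digest[:8], 'big') >> 1


# Key order does not matter because the serialisation sorts the keys.
job = {'title': 'Backend Software Engineer', 'employer': 'Example B.V.'}
same_job = {'employer': 'Example B.V.', 'title': 'Backend Software Engineer'}
print(record_docid(job) == record_docid(same_job))
```

<p>Truncating the digest makes collisions possible in principle, which is acceptable for a few thousand postings but worth keeping in mind.</p>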
<p>What embedded database has full-text search support, document storage
features and an easy Python interface? By document storage I mean the kind of <span class="caps">JSON</span>-field support
that PostgreSQL and MySQL have acquired. The answer is&nbsp;SQLite!</p>
<p>The version that I have from Ubuntu Xenial&#8217;s repository, at the time of writing
<tt class="docutils literal"><span class="pre">libsqlite3-0</span> <span class="pre">3.11.0-1ubuntu1</span></tt>, supports&nbsp;both:</p>
<div class="highlight"><pre><span></span>$ <span class="nb">echo</span> <span class="s2">&quot;PRAGMA compile_options;&quot;</span> <span class="p">|</span> sqlite3 <span class="p">|</span> grep -E <span class="s2">&quot;FTS|JSON&quot;</span>
ENABLE_FTS3
ENABLE_FTS3_PARENTHESIS
ENABLE_FTS4
ENABLE_JSON1
</pre></div>
<p>And Python standard library <tt class="docutils literal">sqlite3</tt> <a class="footnote-reference" href="#id25" id="id3">[3]</a> uses it! Now briefly about
the&nbsp;extensions.</p>
<div class="section" id="fts3-and-fts4">
<h3><a class="toc-backref" href="#id46"><span class="caps">FTS3</span> and <span class="caps">FTS4</span></a></h3>
<p>The documentation says <a class="footnote-reference" href="#id26" id="id4">[4]</a>:</p>
<blockquote>
<span class="caps">FTS3</span> and <span class="caps">FTS4</span> are SQLite virtual table modules that allow users to perform
full-text searches on a set of documents&#8230; <span class="caps">FTS3</span> and <span class="caps">FTS4</span> are nearly identical&#8230;
<span class="caps">FTS4</span> is an enhancement to <span class="caps">FTS3</span>. <span class="caps">FTS3</span> has been available since SQLite version
3.5.0 (2007-09-04). The enhancements for <span class="caps">FTS4</span> were added with SQLite version
3.7.4 (2010-12-07).</blockquote>
<p>A virtual table <a class="footnote-reference" href="#id27" id="id5">[5]</a> in SQLite&nbsp;is:</p>
<blockquote>
&#8230;an object that is registered with an open SQLite database connection. From
the perspective of an <span class="caps">SQL</span> statement, the virtual table object looks like any
other table or view. But behind the scenes, queries and updates on a virtual
table invoke callback methods of the virtual table object instead of reading
and writing on the database file.</blockquote>
<p>In the case of <span class="caps">FTS4</span> it looks like a view that behind the scenes operates on 5
<em>shadow tables</em> <a class="footnote-reference" href="#id28" id="id6">[6]</a> which are stored within the database file. And a note
about <tt class="docutils literal">ENABLE_FTS3_PARENTHESIS</tt> and query&nbsp;syntax:</p>
<blockquote>
New applications should also define the SQLITE_ENABLE_FTS3_PARENTHESIS macro
to enable the enhanced query syntax <a class="footnote-reference" href="#id29" id="id7">[7]</a>.</blockquote>
<p>It means that <tt class="docutils literal">python <span class="pre">-django</span></tt> won&#8217;t work. <tt class="docutils literal">python <span class="caps">AND</span> <span class="caps">NOT</span> django</tt>
(or <tt class="docutils literal">python <span class="caps">NOT</span> django</tt> because <tt class="docutils literal"><span class="caps">AND</span></tt> is the implicit operator) should be
used&nbsp;instead.</p>
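<p>A minimal sketch of the difference, assuming a throwaway in-memory table named <tt class="docutils literal">demo</tt> with made-up rows (neither is part of the real tooling). It picks the exclusion syntax according to the compile options of the linked SQLite library, and then shows how a column filter can reject a term in the title while tolerating it in the body.</p>

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE VIRTUAL TABLE demo USING fts4(title, body)')
conn.executemany('INSERT INTO demo(docid, title, body) VALUES (?, ?, ?)', [
    (1, 'Backend Software Engineer', 'Python and PostgreSQL'),
    (2, 'Python/Django Developer', 'Python with Django'),
    (3, 'Data Engineer', 'Python pipelines, some Django admin pages'),
])

# The exclusion operator depends on how the SQLite library was compiled.
options = {row[0] for row in conn.execute('PRAGMA compile_options')}
enhanced = 'ENABLE_FTS3_PARENTHESIS' in options
query = 'python NOT django' if enhanced else 'python -django'

no_django = [row[0] for row in conn.execute(
    'SELECT docid FROM demo WHERE demo MATCH ? ORDER BY docid', (query,))]

# Column filter: tolerate Django in the body, reject it in the title.
title_filtered = [row[0] for row in conn.execute('''
    SELECT docid FROM demo
    WHERE demo MATCH 'python'
      AND docid NOT IN (SELECT docid FROM demo WHERE demo MATCH 'title:django')
    ORDER BY docid
''')]
print(no_django, title_filtered)
```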
<p>To better understand how full-text search is going to contribute to solving the
problem, here is simplified code that represents the data structure <a class="footnote-reference" href="#id28" id="id8">[6]</a>, an inverted
index, that is used by the SQLite&nbsp;extension.</p>
<div class="highlight"><pre><span></span><span class="kn">from</span> <span class="nn">collections</span> <span class="k">import</span> <span class="n">defaultdict</span>

<span class="n">docs</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">(</span><span class="mi">100</span><span class="p">,</span> <span class="s1">&#39;alpha beta beta gamma gamma gamma&#39;</span><span class="p">),</span>
    <span class="p">(</span><span class="mi">101</span><span class="p">,</span> <span class="s1">&#39;gamma delta&#39;</span><span class="p">),</span>
<span class="p">]</span>
<span class="n">invertedIndex</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>
<span class="k">for</span> <span class="n">docId</span><span class="p">,</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">docs</span><span class="p">:</span>
    <span class="n">docIndex</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">offset</span><span class="p">,</span> <span class="n">term</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">text</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="o">.</span><span class="n">split</span><span class="p">()):</span>
        <span class="n">docIndex</span><span class="p">[</span><span class="n">term</span><span class="p">]</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">offset</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">term</span><span class="p">,</span> <span class="n">offsetList</span> <span class="ow">in</span> <span class="n">docIndex</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
        <span class="n">invertedIndex</span><span class="p">[</span><span class="n">term</span><span class="p">]</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="n">docId</span><span class="p">,</span> <span class="n">offsetList</span><span class="p">))</span>
</pre></div>
<p>The resulting <tt class="docutils literal">invertedIndex</tt> looks&nbsp;like:</p>
<div class="highlight"><pre><span></span><span class="p">{</span>
<span class="s1">&#39;alpha&#39;</span> <span class="p">:</span> <span class="p">[(</span><span class="mi">100</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">])],</span>
<span class="s1">&#39;beta&#39;</span> <span class="p">:</span> <span class="p">[(</span><span class="mi">100</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">])],</span>
<span class="s1">&#39;delta&#39;</span> <span class="p">:</span> <span class="p">[(</span><span class="mi">101</span><span class="p">,</span> <span class="p">[</span><span class="mi">1</span><span class="p">])],</span>
<span class="s1">&#39;gamma&#39;</span> <span class="p">:</span> <span class="p">[(</span><span class="mi">100</span><span class="p">,</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">]),</span> <span class="p">(</span><span class="mi">101</span><span class="p">,</span> <span class="p">[</span><span class="mi">0</span><span class="p">])]</span>
<span class="p">}</span>
</pre></div>
<p>Using this structure <span class="caps">FTS3</span> and <span class="caps">FTS4</span> serve the queries. <tt class="docutils literal">str.lower</tt> and <tt class="docutils literal">str.split</tt>
in the snippet represent tokenisation of the text. Here no stemming is needed
because most technical terms appear in the text unchanged, so I&#8217;ll use <tt class="docutils literal">unicode61</tt> for
the virtual&nbsp;table.</p>
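<p>To make the idea of serving queries from such a structure concrete, here is a small sketch of the implicit <tt class="docutils literal"><span class="caps">AND</span></tt> and of <tt class="docutils literal"><span class="caps">NOT</span></tt> as set operations over posting lists. The function names <tt class="docutils literal">match_all</tt> and <tt class="docutils literal">match_not</tt> are made up for illustration; the real extension works on compressed on-disk segments, not Python sets.</p>

```python
from collections import defaultdict

docs = [
    (100, 'alpha beta beta gamma gamma gamma'),
    (101, 'gamma delta'),
]

# Same data as in the snippet above, stored as term -> {docId: [offsets]}.
index = defaultdict(dict)
for doc_id, text in docs:
    for offset, term in enumerate(text.lower().split()):
        index[term].setdefault(doc_id, []).append(offset)


def match_all(*terms):
    """Documents containing every term (the implicit AND)."""
    postings = [set(index.get(term, {})) for term in terms]
    return set.intersection(*postings) if postings else set()


def match_not(include, exclude):
    """Documents containing `include` but not `exclude` (NOT)."""
    return set(index.get(include, {})) - set(index.get(exclude, {}))


print(match_all('gamma', 'delta'))
print(match_not('gamma', 'beta'))
```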
<blockquote>
The &#8220;unicode61&#8221; tokeniser is available beginning with SQLite version 3.7.13
(2012-06-11). Unicode61 works very much like &#8220;simple&#8221; except that it does
simple unicode case folding according to rules in Unicode Version 6.1 and
it recognises unicode space and punctuation characters and uses those to
separate tokens. The simple tokeniser only does case folding of <span class="caps">ASCII</span>
characters and only recognises <span class="caps">ASCII</span> space and punctuation characters as
token separators&#8230; By default, &#8220;unicode61&#8221; also removes all diacritics
from Latin script characters.</blockquote>
<p>One thing to point out is that &#8220;C++&#8221; and &#8220;C#&#8221; will be indistinguishable from &#8220;C&#8221;
because non-alphanumeric characters will be treated as separators. SQLite
doesn&#8217;t limit minimal term length or have built-in stop-word list. So to solve
this <tt class="docutils literal">tokenchars</tt> is&nbsp;enough:</p>
<blockquote>
&#8230;&#8221;tokenchars=&#8221; option may be used to specify one or more extra characters
that should be treated as part of tokens instead of as separator characters.</blockquote>
<p>I also want to exclude &#8220;.<span class="caps">NET</span>&#8221;. Adding &#8220;.&#8221; to <tt class="docutils literal">tokenchars</tt> will cause
an issue with words at the end of a sentence, but it&#8217;s negligible for the use case.
Alternatively <tt class="docutils literal"><span class="caps">LIKE</span></tt> can be safely used as a post-filter condition, or
generally it&#8217;s possible to implement tokenisers in Python with <tt class="docutils literal">sqlitefts</tt> <a class="footnote-reference" href="#id34" id="id9">[12]</a>.</p>
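<p>The effect of <tt class="docutils literal">tokenchars</tt>, including the end-of-sentence side effect, can be approximated in plain Python. This is a rough stand-in written for illustration, not the actual SQLite tokeniser: the <tt class="docutils literal">tokenize</tt> function is hypothetical, and it skips diacritics removal and proper Unicode case folding.</p>

```python
def tokenize(text, tokenchars=''):
    """Lower-case the text and split it on every character that is neither
    alphanumeric nor listed in `tokenchars`."""
    tokens, current = [], []
    for ch in text.lower():
        if ch.isalnum() or ch in tokenchars:
            current.append(ch)
        elif current:
            tokens.append(''.join(current))
            current = []
    if current:
        tokens.append(''.join(current))
    return tokens


sample = 'C, C++, C# and .NET. Also Python.'
print(tokenize(sample))                    # C, C++ and C# collapse into 'c'
print(tokenize(sample, tokenchars='.+#'))  # distinct, but 'python.' keeps its dot
```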
</div>
<div class="section" id="json1">
<h3><a class="toc-backref" href="#id47"><span class="caps">JSON1</span></a></h3>
<p>The documentation says <a class="footnote-reference" href="#id30" id="id10">[8]</a>:</p>
<blockquote>
The json1 extension is a loadable extension that implements fifteen
application-defined <span class="caps">SQL</span> functions and two table-valued functions that are useful
for managing <span class="caps">JSON</span> content stored in an SQLite database&#8230; The json1 extension
(currently) stores <span class="caps">JSON</span> as ordinary text&#8230; The json1 extension uses the
sqlite3_value_subtype() and sqlite3_result_subtype() interfaces that were
introduced with SQLite version 3.9.0 (2015-10-14).</blockquote>
<p>Out of the 15 functions I&#8217;ll use only <tt class="docutils literal">json_extract</tt> <a class="footnote-reference" href="#id31" id="id11">[9]</a>. Even though on the storage
level <tt class="docutils literal"><span class="caps">JSON1</span></tt> keeps <span class="caps">JSON</span> values as plain text, on the logical level this set of
functions extends the relational model <a class="footnote-reference" href="#id32" id="id12">[10]</a> of SQLite with a document storage model <a class="footnote-reference" href="#id33" id="id13">[11]</a>.
This provides great flexibility in data modelling, where one can take the best of
the two models and still use a declarative query language, <span class="caps">SQL</span>.</p>
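<p>As a quick illustration of mixing the two models, the sketch below stores a <span class="caps">JSON</span> document in an ordinary text column and filters on one of its fields. The table name <tt class="docutils literal">demo</tt> and the <tt class="docutils literal">location</tt> key are assumptions for the example, and it only runs if the linked SQLite library has <tt class="docutils literal"><span class="caps">JSON1</span></tt> available.</p>

```python
import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE demo(id INTEGER PRIMARY KEY, properties TEXT)')
conn.execute(
    'INSERT INTO demo(id, properties) VALUES (?, ?)',
    (1, json.dumps({'location': 'Amsterdam', 'dataset': {'id': 42}})),
)

# json_extract takes a path expression; nested keys are separated by dots.
row = conn.execute('''
    SELECT json_extract(properties, '$.location'),
           json_extract(properties, '$.dataset.id')
    FROM demo
    WHERE json_extract(properties, '$.location') = 'Amsterdam'
''').fetchone()
print(row)
```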
</div>
<div class="section" id="implementation">
<h3><a class="toc-backref" href="#id48">Implementation</a></h3>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">sqlite3</span>


<span class="k">class</span> <span class="nc">JobStorage</span><span class="p">:</span>

    <span class="n">_conn</span> <span class="o">=</span> <span class="kc">None</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_conn</span> <span class="o">=</span> <span class="n">sqlite3</span><span class="o">.</span><span class="n">connect</span><span class="p">(</span><span class="s1">&#39;db.sqlite&#39;</span><span class="p">,</span> <span class="n">isolation_level</span> <span class="o">=</span> <span class="kc">None</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_setup</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">_setup</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">sql</span> <span class="o">=</span> <span class="s1">&#39;&#39;&#39;</span>
<span class="s1">            CREATE VIRTUAL TABLE IF NOT EXISTS job USING fts4(</span>
<span class="s1">                properties, status, employer, title, body,</span>
<span class="s1">                notindexed=properties, notindexed=status,</span>
<span class="s1">                tokenize=unicode61 &quot;tokenchars=.+#&quot;)</span>
<span class="s1">        &#39;&#39;&#39;</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_conn</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="n">sql</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">save</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">job</span><span class="p">):</span>
        <span class="n">sql</span> <span class="o">=</span> <span class="s1">&#39;&#39;&#39;</span>
<span class="s1">            INSERT OR IGNORE INTO job(docid, properties, status, employer, title, body)</span>
<span class="s1">            VALUES(:docid, :properties, :status, :employer, :title, :body)</span>
<span class="s1">        &#39;&#39;&#39;</span>
        <span class="n">title</span> <span class="o">=</span> <span class="n">job</span><span class="o">.</span><span class="n">pop</span><span class="p">(</span><span class="s1">&#39;title&#39;</span><span class="p">)</span>
        <span class="n">body</span> <span class="o">=</span> <span class="n">job</span><span class="o">.</span><span class="n">pop</span><span class="p">(</span><span class="s1">&#39;body&#39;</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_conn</span><span class="o">.</span><span class="n">execute</span><span class="p">(</span><span class="n">sql</span><span class="p">,</span> <span class="p">{</span>
            <span class="s1">&#39;docid&#39;</span> <span class="p">:</span> <span class="n">job</span><span class="p">[</span><span class="s1">&#39;dataset&#39;</span><span class="p">][</span><span class="s1">&#39;id&#39;</span><span class="p">],</span>
            <span class="s1">&#39;employer&#39;</span> <span class="p">:</span> <span class="n">job</span><span class="p">[</span><span class="s1">&#39;employer&#39;</span><span class="p">],</span>
            <span class="s1">&#39;properties&#39;</span> <span class="p">:</span> <span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">job</span><span class="p">,</span> <span class="n">default</span> <span class="o">=</span> <span class="nb">str</span><span class="p">),</span>
            <span class="s1">&#39;status&#39;</span> <span class="p">:</span> <span class="mi">1</span><span class="p">,</span>
            <span class="s1">&#39;title&#39;</span> <span class="p">:</span> <span class="n">title</span><span class="p">,</span>
            <span class="s1">&#39;body&#39;</span> <span class="p">:</span> <span class="n">body</span>
        <span class="p">})</span>
</pre></div>
<p><tt class="docutils literal">isolation_level = None</tt> re-enables auto-commit (for simplicity, and generally
to let transactions be an explicit measure, because explicit is better than implicit).
<tt class="docutils literal"><span class="pre">job['dataset']</span></tt> will be described in the next&nbsp;section.</p>
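<p>What &#8220;transactions as an explicit measure&#8221; means in practice can be sketched with an in-memory database. The <tt class="docutils literal">demo</tt> table is an assumption for the example; the point is only that with <tt class="docutils literal">isolation_level = None</tt> single statements are committed immediately, while a group of statements needs its own <tt class="docutils literal"><span class="caps">BEGIN</span></tt>.</p>

```python
import sqlite3

conn = sqlite3.connect(':memory:', isolation_level=None)
conn.execute('CREATE TABLE demo(value INTEGER)')

# Autocommit: the statement is committed as soon as it completes.
conn.execute('INSERT INTO demo VALUES (1)')

# An explicit transaction groups statements and can be rolled back as a unit.
conn.execute('BEGIN')
conn.execute('INSERT INTO demo VALUES (2)')
conn.execute('INSERT INTO demo VALUES (3)')
conn.execute('ROLLBACK')

values = [row[0] for row in conn.execute('SELECT value FROM demo')]
print(values)
```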
</div>
</div>
<div class="section" id="retrieval">
<h2><a class="toc-backref" href="#id49">Retrieval</a></h2>
<p>Data retrieval in the context of the web, or screen scraping as it&#8217;s called in less neutral terms,
is usually implemented with an <span class="caps">HTTP</span> client and an <span class="caps">HTML</span> parser that is capable of
either XPath queries or <span class="caps">CSS</span> selectors. This time I wanted to try something new and the
time was right. These days we see a new wave of headless browsers <a class="footnote-reference" href="#id35" id="id14">[13]</a>. I&#8217;m not talking
about PhantomJS or the like, but rather ubiquitous desktop browsers that have gained
official automation APIs. For instance, Chromium since version 59 supports DevTools
Protocol (a.k.a. Remote Debugging Protocol) <a class="footnote-reference" href="#id36" id="id15">[14]</a> in&nbsp;Linux.</p>
<p>DevTools Protocol is a WebSocket-based protocol and Google maintains an official client
for it, <tt class="docutils literal">puppeteer</tt> <a class="footnote-reference" href="#id37" id="id16">[15]</a>, written in JavaScript. What&#8217;s better is that there&#8217;s a port to
Python 3.6+. It&#8217;s called <tt class="docutils literal">pyppeteer</tt> <a class="footnote-reference" href="#id38" id="id17">[16]</a>.</p>
<div class="section" id="pyppeteer">
<h3><a class="toc-backref" href="#id50">Pyppeteer</a></h3>
<blockquote>
Most things that you can do manually in the browser can be done using Puppeteer!</blockquote>
<p>As it should be with Pyppeteer. Basically the protocol provides WebSocket-based <span class="caps">RPC</span> to
<span class="caps">BOM</span>, <span class="caps">DOM</span> and user objects, and the implementation provides more-or-less transparent proxy
object <span class="caps">API</span>. In case of Pyppeteer it&#8217;s <tt class="docutils literal">asyncio</tt> <span class="caps">API</span>.</p>
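<p>The <span class="caps">RPC</span> idea can be sketched without a browser: each request is a <span class="caps">JSON</span> message with an <tt class="docutils literal">id</tt>, a <tt class="docutils literal">method</tt> and <tt class="docutils literal">params</tt>, and the reply carries the same <tt class="docutils literal">id</tt>. The message shape follows the protocol&#8217;s <tt class="docutils literal">Runtime.evaluate</tt> method, but the <tt class="docutils literal">FakeBrowser</tt> and <tt class="docutils literal">Session</tt> classes are hypothetical stand-ins that replace the WebSocket with a direct call and return a canned value. This is not how Pyppeteer is implemented internally.</p>

```python
import asyncio
import itertools
import json


class FakeBrowser:
    """Hypothetical in-memory stand-in for the browser end of the socket."""

    async def handle(self, raw):
        request = json.loads(raw)
        await asyncio.sleep(0)  # pretend the message crosses a WebSocket
        result = {'result': {'type': 'string', 'value': 'Example page'}}
        return json.dumps({'id': request['id'], 'result': result})


class Session:
    """Sends JSON messages tagged with an id and checks the reply's id."""

    def __init__(self, browser):
        self._browser = browser
        self._ids = itertools.count(1)

    async def send(self, method, **params):
        message = {'id': next(self._ids), 'method': method, 'params': params}
        reply = json.loads(await self._browser.handle(json.dumps(message)))
        if reply['id'] != message['id']:
            raise RuntimeError('reply does not match the request')
        return reply['result']


async def main():
    session = Session(FakeBrowser())
    reply = await session.send('Runtime.evaluate', expression='document.title')
    return reply['result']['value']


title = asyncio.run(main())
print(title)
```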
<p>Here is a demonstration of the proxy behaviour. It&#8217;ll use
<tt class="docutils literal">pyppeteer.element_handle.ElementHandle</tt> <a class="footnote-reference" href="#id40" id="id18">[18]</a>, which represents a <span class="caps">DOM</span> element. Here&#8217;s
a JavaScript snippet that one can run in Chromium&#8217;s Developer&nbsp;Tools.</p>
<div class="highlight"><pre><span></span><span class="kd">var</span> <span class="nx">a</span> <span class="o">=</span> <span class="nb">document</span><span class="p">.</span><span class="nx">querySelector</span><span class="p">(</span><span class="s1">&#39;div div.flexbox div a&#39;</span><span class="p">);</span>
<span class="kd">var</span> <span class="nx">hrefProperty</span> <span class="o">=</span> <span class="nx">a</span><span class="p">.</span><span class="nx">href</span><span class="p">;</span>
<span class="kd">var</span> <span class="nx">href</span> <span class="o">=</span> <span class="nx">hrefProperty</span><span class="p">.</span><span class="nx">valueOf</span><span class="p">();</span>
<span class="kd">var</span> <span class="nx">titleProperty</span> <span class="o">=</span> <span class="nx">a</span><span class="p">.</span><span class="nx">textContent</span><span class="p">;</span>
<span class="kd">var</span> <span class="nx">title</span> <span class="o">=</span> <span class="nx">titleProperty</span><span class="p">.</span><span class="nx">valueOf</span><span class="p">();</span>
<span class="nx">console</span><span class="p">.</span><span class="nx">log</span><span class="p">(</span><span class="nx">href</span><span class="p">,</span> <span class="nx">title</span><span class="p">);</span>
</pre></div>
<p>And the same thing with&nbsp;Pyppeteer:</p>
<div class="highlight"><pre><span></span><span class="n">a</span> <span class="o">=</span> <span class="k">await</span> <span class="n">page</span><span class="o">.</span><span class="n">querySelector</span><span class="p">(</span><span class="s1">&#39;div div.flexbox div a&#39;</span><span class="p">)</span>
<span class="n">hrefProperty</span> <span class="o">=</span> <span class="k">await</span> <span class="n">a</span><span class="o">.</span><span class="n">getProperty</span><span class="p">(</span><span class="s1">&#39;href&#39;</span><span class="p">)</span>
<span class="n">href</span> <span class="o">=</span> <span class="k">await</span> <span class="n">hrefProperty</span><span class="o">.</span><span class="n">jsonValue</span><span class="p">()</span>
<span class="n">titleProperty</span> <span class="o">=</span> <span class="k">await</span> <span class="n">a</span><span class="o">.</span><span class="n">getProperty</span><span class="p">(</span><span class="s1">&#39;textContent&#39;</span><span class="p">)</span>
<span class="n">title</span> <span class="o">=</span> <span class="k">await</span> <span class="n">titleProperty</span><span class="o">.</span><span class="n">jsonValue</span><span class="p">()</span>
<span class="nb">print</span><span class="p">(</span><span class="n">href</span><span class="p">,</span> <span class="n">title</span><span class="p">)</span>
</pre></div>
<p>There&#8217;re two must-know things about Pyppeteer. The first is that every attribute
and value access needs to be <tt class="docutils literal">await</tt>&#8217;ed. The second is that <tt class="docutils literal">pyppeteer.page.Page.click</tt>
and <tt class="docutils literal">pyppeteer.page.Page.waitForNavigation</tt> have a race condition when awaited one after the other, and should be
awaited simultaneously <a class="footnote-reference" href="#id41" id="id19">[19]</a>,&nbsp;like:</p>
<div class="highlight"><pre><span></span><span class="k">await</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span><span class="n">page</span><span class="o">.</span><span class="n">click</span><span class="p">(</span><span class="s1">&#39;#HeroSearchButton&#39;</span><span class="p">),</span> <span class="n">page</span><span class="o">.</span><span class="n">waitForNavigation</span><span class="p">())</span>
</pre></div>
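<p>The race can be illustrated without a browser. The sketch below is a toy model, not Pyppeteer code: <tt class="docutils literal">FakePage</tt> and its methods are hypothetical stand-ins, and it assumes that a navigation wait only observes navigations which finish after the wait has started. Under that assumption, awaiting the click first misses the navigation, whereas <tt class="docutils literal">asyncio.gather</tt> registers the wait before the navigation completes.</p>

```python
import asyncio


class FakePage:
    """Hypothetical stand-in for a browser page; not part of Pyppeteer."""

    def __init__(self):
        self._waiters = []

    async def click(self):
        # The click triggers a navigation that completes almost immediately.
        await asyncio.sleep(0.01)
        for fut in self._waiters:
            if not fut.done():
                fut.set_result('navigated')
        self._waiters.clear()

    async def wait_for_navigation(self, timeout=0.5):
        # Assumption: only navigations finishing after this call are observed.
        fut = asyncio.get_event_loop().create_future()
        self._waiters.append(fut)
        return await asyncio.wait_for(fut, timeout)


async def sequential(page):
    # The navigation is already over when the wait starts, so it times out.
    await page.click()
    try:
        return await page.wait_for_navigation()
    except asyncio.TimeoutError:
        return 'missed'


async def simultaneous(page):
    # Both coroutines are scheduled together, so the wait is registered in time.
    _, result = await asyncio.gather(page.click(), page.wait_for_navigation())
    return result


def run(coro):
    # Helper that also works on Python 3.6, which lacks asyncio.run.
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(coro)
    finally:
        loop.close()


print(run(sequential(FakePage())))    # prints: missed
print(run(simultaneous(FakePage())))  # prints: navigated
```

<p>In the sequential variant the wait simply times out, which in a real scraper would surface as a hang followed by a timeout error.</p>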
<p>Towards the implementation, the idea for the job retrieval <span class="caps">API</span> is to have an asynchronous
iterator object implemented as an asynchronous generator (one that mixes <tt class="docutils literal">yield</tt>
and <tt class="docutils literal">await</tt>) as defined in <span class="caps">PEP</span> 525 <a class="footnote-reference" href="#id39" id="id20">[17]</a>. It will encapsulate browser preparation,
pagination, additional page interactions and throttling sleeps, while
the calling code will be just an <tt class="docutils literal">async for</tt> loop. I&#8217;ll be&nbsp;using:</p>
<ul class="simple">
<li>Python&nbsp;3.6.5</li>
<li>Pyppeteer&nbsp;0.0.16</li>
<li>Chromium&nbsp;64.0.3282.167</li>
</ul>
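<p>To make the asynchronous generator idea concrete, here is a minimal sketch that runs without a browser. The class <tt class="docutils literal">PagedIter</tt>, the <tt class="docutils literal">PAGES</tt> data and the <tt class="docutils literal">collect</tt> helper are illustrative assumptions; only the shape (preparation on first use, pagination and throttling hidden behind <tt class="docutils literal">async for</tt>) mirrors the plan above.</p>

```python
import asyncio

# Hypothetical in-memory "site": three result pages of job titles.
PAGES = [['job 1', 'job 2'], ['job 3'], ['job 4', 'job 5']]


class PagedIter:
    """Toy asynchronous iterator; names and data are illustrative only."""

    def __init__(self, pages, delay=0.0):
        self._pages = pages
        self._delay = delay
        self._prepared = False

    async def _prepare(self):
        # Stands in for opening the page and submitting the search form.
        await asyncio.sleep(self._delay)

    async def __aiter__(self):
        # "async def" plus "yield" makes this an asynchronous generator.
        if not self._prepared:
            await self._prepare()
            self._prepared = True
        for page in self._pages:
            for item in page:
                yield item
            # Throttle between pages, as a real scraper would.
            await asyncio.sleep(self._delay)


async def collect():
    result = []
    # The caller sees nothing but a plain "async for" loop.
    async for item in PagedIter(PAGES):
        result.append(item)
    return result


loop = asyncio.new_event_loop()
try:
    jobs = loop.run_until_complete(collect())
finally:
    loop.close()
print(jobs)  # prints: ['job 1', 'job 2', 'job 3', 'job 4', 'job 5']
```

<p>Because <tt class="docutils literal">__aiter__</tt> is itself the generator function, calling it returns an asynchronous generator object, which is what <tt class="docutils literal">async for</tt> iterates over.</p>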
</div>
<div class="section" id="iteration">
<h3><a class="toc-backref" href="#id51">Iteration</a></h3>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">asyncio</span>
<span class="kn">import</span> <span class="nn">logging</span>


<span class="k">class</span> <span class="nc">JobIter</span><span class="p">:</span>

    <span class="n">_page</span> <span class="o">=</span> <span class="kc">None</span>
    <span class="n">_query</span> <span class="o">=</span> <span class="kc">None</span>
    <span class="n">_location</span> <span class="o">=</span> <span class="kc">None</span>
    <span class="n">_days</span> <span class="o">=</span> <span class="kc">None</span>
    <span class="n">_prepared</span> <span class="o">=</span> <span class="kc">False</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">page</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">location</span><span class="p">,</span> <span class="n">days</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_page</span> <span class="o">=</span> <span class="n">page</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_query</span> <span class="o">=</span> <span class="n">query</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_location</span> <span class="o">=</span> <span class="n">location</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_days</span> <span class="o">=</span> <span class="n">days</span>

    <span class="k">async</span> <span class="k">def</span> <span class="nf">_prepare</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s1">&#39;Opening main page&#39;</span><span class="p">)</span>
        <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_page</span><span class="o">.</span><span class="n">goto</span><span class="p">(</span><span class="s1">&#39;https://www.glassdoor.nl/index.htm&#39;</span><span class="p">)</span>
        <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s1">&#39;Entering query&#39;</span><span class="p">)</span>
        <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_page</span><span class="o">.</span><span class="n">type</span><span class="p">(</span><span class="s1">&#39;#KeywordSearch&#39;</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">_query</span><span class="p">)</span>
        <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_page</span><span class="o">.</span><span class="n">evaluate</span><span class="p">(</span>
            <span class="s1">&#39;function(id, loc) {document.getElementById(id).value = loc}&#39;</span><span class="p">,</span>
            <span class="s1">&#39;LocationSearch&#39;</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">_location</span><span class="p">)</span>
        <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s1">&#39;Opening result page&#39;</span><span class="p">)</span>
        <span class="k">await</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">_page</span><span class="o">.</span><span class="n">click</span><span class="p">(</span><span class="s1">&#39;#HeroSearchButton&#39;</span><span class="p">),</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">_page</span><span class="o">.</span><span class="n">waitForNavigation</span><span class="p">()</span>
        <span class="p">)</span>
        <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s1">&#39;Selecting posted date range&#39;</span><span class="p">)</span>
        <span class="n">datePosted</span> <span class="o">=</span> <span class="s1">&#39;#DKFilters &gt; div &gt; div &gt; div:nth-child(2)&#39;</span>
        <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_page</span><span class="o">.</span><span class="n">click</span><span class="p">(</span><span class="n">datePosted</span><span class="p">)</span>
        <span class="n">daysAgo</span> <span class="o">=</span> <span class="s1">&#39;#DKFilters .filter.expanded ul li[value=&quot;</span><span class="si">{}</span><span class="s1">&quot;]&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">_days</span><span class="p">)</span>
        <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_page</span><span class="o">.</span><span class="n">waitForSelector</span><span class="p">(</span><span class="n">daysAgo</span><span class="p">)</span>
        <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_page</span><span class="o">.</span><span class="n">click</span><span class="p">(</span><span class="n">daysAgo</span><span class="p">)</span>
        <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s1">&#39;Waiting for list to update&#39;</span><span class="p">)</span>
        <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_page</span><span class="o">.</span><span class="n">waitForSelector</span><span class="p">(</span><span class="s1">&#39;.jlGrid.updating&#39;</span><span class="p">)</span>
        <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_page</span><span class="o">.</span><span class="n">waitForSelector</span><span class="p">(</span><span class="s1">&#39;.jlGrid.updating&#39;</span><span class="p">,</span> <span class="n">hidden</span> <span class="o">=</span> <span class="kc">True</span><span class="p">)</span>

    <span class="k">async</span> <span class="k">def</span> <span class="nf">_dismiss</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">button</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_page</span><span class="o">.</span><span class="n">querySelector</span><span class="p">(</span><span class="s1">&#39;button.mfp-close&#39;</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">button</span><span class="p">:</span>
            <span class="k">await</span> <span class="n">button</span><span class="o">.</span><span class="n">click</span><span class="p">()</span>

    <span class="k">async</span> <span class="k">def</span> <span class="nf">__aiter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="o">.</span><span class="n">_prepared</span><span class="p">:</span>
            <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_prepare</span><span class="p">()</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">_prepared</span> <span class="o">=</span> <span class="kc">True</span>
        <span class="k">while</span> <span class="kc">True</span><span class="p">:</span>
            <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_dismiss</span><span class="p">()</span>
            <span class="n">button</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_page</span><span class="o">.</span><span class="n">querySelector</span><span class="p">(</span><span class="s1">&#39;#FooterPageNav .page.current&#39;</span><span class="p">)</span>
            <span class="n">p</span> <span class="o">=</span> <span class="k">await</span> <span class="n">button</span><span class="o">.</span><span class="n">getProperty</span><span class="p">(</span><span class="s1">&#39;textContent&#39;</span><span class="p">)</span>
            <span class="n">current</span> <span class="o">=</span> <span class="p">(</span><span class="k">await</span> <span class="n">p</span><span class="o">.</span><span class="n">jsonValue</span><span class="p">())</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
            <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s1">&#39;Processing page </span><span class="si">%s</span><span class="s1">&#39;</span><span class="p">,</span> <span class="n">current</span><span class="p">)</span>
            <span class="n">jobList</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_page</span><span class="o">.</span><span class="n">querySelectorAll</span><span class="p">(</span><span class="s1">&#39;ul.jlGrid &gt; li&#39;</span><span class="p">)</span>
            <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">jobList</span><span class="p">:</span>
                <span class="n">parser</span> <span class="o">=</span> <span class="n">JobParser</span><span class="p">(</span><span class="n">j</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">_page</span><span class="p">)</span>
                <span class="n">job</span> <span class="o">=</span> <span class="k">await</span> <span class="n">parser</span><span class="o">.</span><span class="n">parse</span><span class="p">()</span>
                <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s1">&#39;Processed job </span><span class="si">%s</span><span class="s1">&#39;</span><span class="p">,</span> <span class="n">job</span><span class="p">[</span><span class="s1">&#39;title&#39;</span><span class="p">])</span>
                <span class="k">yield</span> <span class="n">job</span>
                <span class="k">await</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.5</span><span class="p">)</span>
            <span class="n">button</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_page</span><span class="o">.</span><span class="n">querySelector</span><span class="p">(</span><span class="s1">&#39;#FooterPageNav .next&#39;</span><span class="p">)</span>
            <span class="n">disabled</span> <span class="o">=</span> <span class="k">await</span> <span class="n">button</span><span class="o">.</span><span class="n">querySelector</span><span class="p">(</span><span class="s1">&#39;.disabled&#39;</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">disabled</span><span class="p">:</span>
                <span class="n">logging</span><span class="o">.</span><span class="n">debug</span><span class="p">(</span><span class="s1">&#39;Last page has been reached&#39;</span><span class="p">)</span>
                <span class="k">return</span>
            <span class="k">await</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span><span class="n">button</span><span class="o">.</span><span class="n">click</span><span class="p">(),</span> <span class="bp">self</span><span class="o">.</span><span class="n">_page</span><span class="o">.</span><span class="n">waitForNavigation</span><span class="p">())</span>
            <span class="k">await</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
</pre></div>
</div>
<div class="section" id="parsing">
<h3><a class="toc-backref" href="#id52">Parsing</a></h3>
<p>The root element of a job listing is an <span class="caps">HTML</span> list item tag, which looks&nbsp;like:</p>
<div class="highlight"><pre><span></span><span class="o">&lt;</span><span class="n">li</span> <span class="n">class</span><span class="o">=</span><span class="s2">&quot;jl selected&quot;</span>
<span class="n">data</span><span class="o">-</span><span class="nb">id</span><span class="o">=</span><span class="s2">&quot;2618878823&quot;</span> <span class="n">data</span><span class="o">-</span><span class="n">emp</span><span class="o">-</span><span class="nb">id</span><span class="o">=</span><span class="s2">&quot;262109&quot;</span> <span class="n">data</span><span class="o">-</span><span class="ow">is</span><span class="o">-</span><span class="n">organic</span><span class="o">-</span><span class="n">job</span><span class="o">=</span><span class="s2">&quot;false&quot;</span>
<span class="n">data</span><span class="o">-</span><span class="n">sgoc</span><span class="o">-</span><span class="nb">id</span><span class="o">=</span><span class="s2">&quot;-1&quot;</span> <span class="n">data</span><span class="o">-</span><span class="n">purchase</span><span class="o">-</span><span class="n">ad</span><span class="o">-</span><span class="n">order</span><span class="o">-</span><span class="nb">id</span><span class="o">=</span><span class="s2">&quot;0&quot;</span> <span class="n">data</span><span class="o">-</span><span class="ow">is</span><span class="o">-</span><span class="n">easy</span><span class="o">-</span><span class="n">apply</span><span class="o">=</span><span class="s2">&quot;false&quot;</span>
<span class="n">data</span><span class="o">-</span><span class="n">njslv</span><span class="o">=</span><span class="s2">&quot;false&quot;</span><span class="o">/&gt;</span>
</pre></div>
<p>There&#8217;re several data attributes, most notably:</p>
<ol class="arabic simple">
<li><tt class="docutils literal">id</tt> is the internal Glassdoor identifier of a job, which will
help deduplicate jobs on subsequent&nbsp;runs,</li>
<li><tt class="docutils literal">empId</tt> is an employer identifier, which is somewhat&nbsp;useful,</li>
<li><tt class="docutils literal">isOrganicJob</tt>, which I initially thought was the flag to filter out
intermediaries, but unfortunately I didn&#8217;t grasp how exactly it is&nbsp;set.</li>
</ol>
<p>The attributes can be accessed as a dictionary under the <tt class="docutils literal">dataset</tt> property, see
<tt class="docutils literal">_getDataset</tt> below.</p>
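<p>The camel-cased names in the list come from the standard mapping between <tt class="docutils literal">data-*</tt> attributes and <tt class="docutils literal">dataset</tt> keys: the <tt class="docutils literal">data-</tt> prefix is dropped and every hyphen followed by a lowercase letter is replaced with that letter in upper case. A small sketch of that rule follows; the helper name <tt class="docutils literal">dataset_key</tt> is made up for illustration and isn&#8217;t used by the parser.</p>

```python
import re


def dataset_key(attribute):
    """Convert a data-* attribute name to its dataset property name."""
    if not attribute.startswith('data-'):
        raise ValueError('Not a data attribute: {}'.format(attribute))
    name = attribute[len('data-'):]
    # A hyphen followed by an ASCII lowercase letter becomes that letter uppercased.
    return re.sub(r'-([a-z])', lambda m: m.group(1).upper(), name)


for attr in ('data-id', 'data-emp-id', 'data-is-organic-job'):
    print(attr, '->', dataset_key(attr))
# prints:
# data-id -> id
# data-emp-id -> empId
# data-is-organic-job -> isOrganicJob
```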
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">datetime</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">from</span> <span class="nn">decimal</span> <span class="k">import</span> <span class="n">Decimal</span>
<span class="k">class</span> <span class="nc">JobParser</span><span class="p">:</span>

    <span class="n">_elementHandle</span> <span class="o">=</span> <span class="kc">None</span>
    <span class="n">_page</span> <span class="o">=</span> <span class="kc">None</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">eh</span><span class="p">,</span> <span class="n">page</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_elementHandle</span> <span class="o">=</span> <span class="n">eh</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">_page</span> <span class="o">=</span> <span class="n">page</span>

    <span class="k">async</span> <span class="k">def</span> <span class="nf">_getDataset</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">p</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_elementHandle</span><span class="o">.</span><span class="n">getProperty</span><span class="p">(</span><span class="s1">&#39;dataset&#39;</span><span class="p">)</span>
        <span class="n">d</span> <span class="o">=</span> <span class="k">await</span> <span class="n">p</span><span class="o">.</span><span class="n">getProperties</span><span class="p">()</span>
        <span class="n">dataset</span> <span class="o">=</span> <span class="p">{}</span>
        <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">jsh</span> <span class="ow">in</span> <span class="n">d</span><span class="o">.</span><span class="n">items</span><span class="p">():</span>
            <span class="n">dataset</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="k">await</span> <span class="n">jsh</span><span class="o">.</span><span class="n">jsonValue</span><span class="p">())</span>
        <span class="k">return</span> <span class="n">dataset</span>

    <span class="k">async</span> <span class="k">def</span> <span class="nf">_getRating</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">span</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_elementHandle</span><span class="o">.</span><span class="n">querySelector</span><span class="p">(</span><span class="s1">&#39;.compactStars&#39;</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">span</span><span class="p">:</span>
            <span class="n">p</span> <span class="o">=</span> <span class="k">await</span> <span class="n">span</span><span class="o">.</span><span class="n">getProperty</span><span class="p">(</span><span class="s1">&#39;textContent&#39;</span><span class="p">)</span>
            <span class="k">return</span> <span class="n">Decimal</span><span class="p">((</span><span class="k">await</span> <span class="n">p</span><span class="o">.</span><span class="n">jsonValue</span><span class="p">())</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s1">&#39;,&#39;</span><span class="p">,</span> <span class="s1">&#39;.&#39;</span><span class="p">))</span>

    <span class="k">async</span> <span class="k">def</span> <span class="nf">_getTitleLink</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">a</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_elementHandle</span><span class="o">.</span><span class="n">querySelector</span><span class="p">(</span><span class="s1">&#39;div div.flexbox div a&#39;</span><span class="p">)</span>
        <span class="n">p</span> <span class="o">=</span> <span class="k">await</span> <span class="n">a</span><span class="o">.</span><span class="n">getProperty</span><span class="p">(</span><span class="s1">&#39;href&#39;</span><span class="p">)</span>
        <span class="n">href</span> <span class="o">=</span> <span class="k">await</span> <span class="n">p</span><span class="o">.</span><span class="n">jsonValue</span><span class="p">()</span>
        <span class="n">p</span> <span class="o">=</span> <span class="k">await</span> <span class="n">a</span><span class="o">.</span><span class="n">getProperty</span><span class="p">(</span><span class="s1">&#39;textContent&#39;</span><span class="p">)</span>
        <span class="n">title</span> <span class="o">=</span> <span class="p">(</span><span class="k">await</span> <span class="n">p</span><span class="o">.</span><span class="n">jsonValue</span><span class="p">())</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
        <span class="k">return</span> <span class="n">title</span><span class="p">,</span> <span class="n">href</span>

    <span class="k">async</span> <span class="k">def</span> <span class="nf">_getEmployer</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">div</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_elementHandle</span><span class="o">.</span><span class="n">querySelector</span><span class="p">(</span><span class="s1">&#39;.empLoc div:nth-child(1)&#39;</span><span class="p">)</span>
        <span class="n">firstNode</span> <span class="o">=</span> <span class="k">await</span> <span class="n">div</span><span class="o">.</span><span class="n">getProperty</span><span class="p">(</span><span class="s1">&#39;firstChild&#39;</span><span class="p">)</span>
        <span class="n">p</span> <span class="o">=</span> <span class="k">await</span> <span class="n">firstNode</span><span class="o">.</span><span class="n">getProperty</span><span class="p">(</span><span class="s1">&#39;nodeValue&#39;</span><span class="p">)</span>
        <span class="k">return</span> <span class="p">(</span><span class="k">await</span> <span class="n">p</span><span class="o">.</span><span class="n">jsonValue</span><span class="p">())</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s1">&#39;–&#39;</span><span class="p">)</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>

    <span class="k">async</span> <span class="k">def</span> <span class="nf">_getCity</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">span</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_elementHandle</span><span class="o">.</span><span class="n">querySelector</span><span class="p">(</span><span class="s1">&#39;.subtle.loc&#39;</span><span class="p">)</span>
        <span class="n">p</span> <span class="o">=</span> <span class="k">await</span> <span class="n">span</span><span class="o">.</span><span class="n">getProperty</span><span class="p">(</span><span class="s1">&#39;textContent&#39;</span><span class="p">)</span>
        <span class="k">return</span> <span class="p">(</span><span class="k">await</span> <span class="n">p</span><span class="o">.</span><span class="n">jsonValue</span><span class="p">())</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>

    <span class="k">async</span> <span class="k">def</span> <span class="nf">_getDate</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">div</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_elementHandle</span><span class="o">.</span><span class="n">querySelector</span><span class="p">(</span><span class="s1">&#39;.hotListing&#39;</span><span class="p">)</span>
        <span class="n">ts</span> <span class="o">=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">datetime</span><span class="o">.</span><span class="n">now</span><span class="p">()</span>
        <span class="k">if</span> <span class="n">div</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">ts</span>
        <span class="n">span</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_elementHandle</span><span class="o">.</span><span class="n">querySelector</span><span class="p">(</span><span class="s1">&#39;.showHH span&#39;</span><span class="p">)</span>
        <span class="n">p</span> <span class="o">=</span> <span class="k">await</span> <span class="n">span</span><span class="o">.</span><span class="n">getProperty</span><span class="p">(</span><span class="s1">&#39;textContent&#39;</span><span class="p">)</span>
        <span class="n">value</span><span class="p">,</span> <span class="n">unit</span> <span class="o">=</span> <span class="p">(</span><span class="k">await</span> <span class="n">p</span><span class="o">.</span><span class="n">jsonValue</span><span class="p">())</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="n">maxsplit</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span>
        <span class="n">value</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">value</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">unit</span> <span class="o">==</span> <span class="s1">&#39;d&#39;</span><span class="p">:</span>
            <span class="n">ts</span> <span class="o">-=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">timedelta</span><span class="p">(</span><span class="n">days</span> <span class="o">=</span> <span class="n">value</span><span class="p">)</span>
        <span class="k">elif</span> <span class="n">unit</span> <span class="o">==</span> <span class="s1">&#39;u&#39;</span><span class="p">:</span>
            <span class="n">ts</span> <span class="o">-=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">timedelta</span><span class="p">(</span><span class="n">hours</span> <span class="o">=</span> <span class="n">value</span><span class="p">)</span>
        <span class="k">elif</span> <span class="n">unit</span> <span class="o">==</span> <span class="s1">&#39;min&#39;</span><span class="p">:</span>
            <span class="n">ts</span> <span class="o">-=</span> <span class="n">datetime</span><span class="o">.</span><span class="n">timedelta</span><span class="p">(</span><span class="n">minutes</span> <span class="o">=</span> <span class="n">value</span><span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">raise</span> <span class="ne">ValueError</span><span class="p">(</span><span class="s1">&#39;Unsupported unit </span><span class="si">{}</span><span class="s1">&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">unit</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">ts</span>

    <span class="k">async</span> <span class="k">def</span> <span class="nf">_getBody</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">p</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_elementHandle</span><span class="o">.</span><span class="n">getProperty</span><span class="p">(</span><span class="s1">&#39;classList&#39;</span><span class="p">)</span>
        <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_elementHandle</span><span class="o">.</span><span class="n">click</span><span class="p">()</span>
        <span class="n">spinner</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_page</span><span class="o">.</span><span class="n">querySelector</span><span class="p">(</span><span class="s1">&#39;.jobDetails .ajaxSpinner&#39;</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">spinner</span><span class="p">:</span>
            <span class="n">spinner</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_page</span><span class="o">.</span><span class="n">waitForSelector</span><span class="p">(</span><span class="s1">&#39;.jobDetails .ajaxSpinner&#39;</span><span class="p">,</span> <span class="n">hidden</span> <span class="o">=</span> <span class="kc">True</span><span class="p">)</span>
        <span class="n">div</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_page</span><span class="o">.</span><span class="n">querySelector</span><span class="p">(</span><span class="s1">&#39;.jobDetails .jobDescriptionContent&#39;</span><span class="p">)</span>
        <span class="n">p</span> <span class="o">=</span> <span class="k">await</span> <span class="n">div</span><span class="o">.</span><span class="n">getProperty</span><span class="p">(</span><span class="s1">&#39;textContent&#39;</span><span class="p">)</span>
        <span class="k">return</span> <span class="p">(</span><span class="k">await</span> <span class="n">p</span><span class="o">.</span><span class="n">jsonValue</span><span class="p">())</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>

    <span class="k">async</span> <span class="k">def</span> <span class="nf">parse</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="n">title</span><span class="p">,</span> <span class="n">url</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_getTitleLink</span><span class="p">()</span>
        <span class="n">dataset</span><span class="p">,</span> <span class="n">rating</span><span class="p">,</span> <span class="n">employer</span><span class="p">,</span> <span class="n">city</span><span class="p">,</span> <span class="n">date</span> <span class="o">=</span> <span class="k">await</span> <span class="n">asyncio</span><span class="o">.</span><span class="n">gather</span><span class="p">(</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">_getDataset</span><span class="p">(),</span> <span class="bp">self</span><span class="o">.</span><span class="n">_getRating</span><span class="p">(),</span> <span class="bp">self</span><span class="o">.</span><span class="n">_getEmployer</span><span class="p">(),</span> <span class="bp">self</span><span class="o">.</span><span class="n">_getCity</span><span class="p">(),</span> <span class="bp">self</span><span class="o">.</span><span class="n">_getDate</span><span class="p">())</span>
        <span class="n">body</span> <span class="o">=</span> <span class="k">await</span> <span class="bp">self</span><span class="o">.</span><span class="n">_getBody</span><span class="p">()</span>
        <span class="k">return</span> <span class="p">{</span>
            <span class="s1">&#39;dataset&#39;</span> <span class="p">:</span> <span class="n">dataset</span><span class="p">,</span>
            <span class="s1">&#39;rating&#39;</span> <span class="p">:</span> <span class="n">rating</span><span class="p">,</span>
            <span class="s1">&#39;url&#39;</span> <span class="p">:</span> <span class="n">url</span><span class="p">,</span>
            <span class="s1">&#39;employer&#39;</span> <span class="p">:</span> <span class="n">employer</span><span class="p">,</span>
            <span class="s1">&#39;city&#39;</span> <span class="p">:</span> <span class="n">city</span><span class="p">,</span>
<span class="s1">&#39;date&#39;</span> <span class="p">:</span> <span class="n">date</span><span class="p">,</span>
<span class="s1">&#39;title&#39;</span> <span class="p">:</span> <span class="n">title</span><span class="p">,</span>
<span class="s1">&#39;body&#39;</span> <span class="p">:</span> <span class="n">body</span>
<span class="p">}</span>
</pre></div>
</div>
</div>
<div class="section" id="glue">
<h2><a class="toc-backref" href="#id53">Glue</a></h2>
<p>By default, on first run Pyppeteer downloads a Chromium build that is guaranteed to work.
But because I already had a compatible version pre-installed, I explicitly provided
<tt class="docutils literal">executablePath</tt>, which disables this behaviour. <tt class="docutils literal">headless = False</tt> is useful for&nbsp;debugging.</p>
<div class="highlight"><pre><span></span><span class="kn">import</span> <span class="nn">asyncio</span>
<span class="kn">import</span> <span class="nn">logging</span>

<span class="kn">from</span> <span class="nn">pyppeteer</span> <span class="k">import</span> <span class="n">launch</span>


<span class="k">async</span> <span class="k">def</span> <span class="nf">main</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">location</span><span class="p">,</span> <span class="n">days</span><span class="p">):</span>
  <span class="n">browser</span> <span class="o">=</span> <span class="k">await</span> <span class="n">launch</span><span class="p">(</span><span class="n">executablePath</span> <span class="o">=</span> <span class="s1">&#39;/usr/bin/chromium-browser&#39;</span><span class="p">,</span> <span class="n">headless</span> <span class="o">=</span> <span class="kc">True</span><span class="p">)</span>
  <span class="n">page</span> <span class="o">=</span> <span class="k">await</span> <span class="n">browser</span><span class="o">.</span><span class="n">newPage</span><span class="p">()</span>

  <span class="n">storage</span> <span class="o">=</span> <span class="n">JobStorage</span><span class="p">()</span>
  <span class="k">async</span> <span class="k">for</span> <span class="n">job</span> <span class="ow">in</span> <span class="n">JobIter</span><span class="p">(</span><span class="n">page</span><span class="p">,</span> <span class="n">query</span><span class="p">,</span> <span class="n">location</span><span class="p">,</span> <span class="n">days</span><span class="p">):</span>
    <span class="n">storage</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">job</span><span class="p">)</span>

  <span class="k">await</span> <span class="n">browser</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>


<span class="k">if</span> <span class="vm">__name__</span> <span class="o">==</span> <span class="s1">&#39;__main__&#39;</span><span class="p">:</span>
  <span class="n">logging</span><span class="o">.</span><span class="n">basicConfig</span><span class="p">(</span>
    <span class="n">level</span> <span class="o">=</span> <span class="n">logging</span><span class="o">.</span><span class="n">DEBUG</span><span class="p">,</span> <span class="nb">format</span> <span class="o">=</span> <span class="s1">&#39;</span><span class="si">%(asctime)s</span><span class="s1"> </span><span class="si">%(levelname)s</span><span class="s1"> </span><span class="si">%(name)s</span><span class="s1"> </span><span class="si">%(message)s</span><span class="s1">&#39;</span><span class="p">)</span>
  <span class="k">for</span> <span class="n">l</span> <span class="ow">in</span> <span class="p">(</span><span class="s1">&#39;websockets&#39;</span><span class="p">,</span> <span class="s1">&#39;pyppeteer&#39;</span><span class="p">):</span>
    <span class="n">logging</span><span class="o">.</span><span class="n">getLogger</span><span class="p">(</span><span class="n">l</span><span class="p">)</span><span class="o">.</span><span class="n">setLevel</span><span class="p">(</span><span class="n">logging</span><span class="o">.</span><span class="n">WARNING</span><span class="p">)</span>

  <span class="n">asyncio</span><span class="o">.</span><span class="n">get_event_loop</span><span class="p">()</span><span class="o">.</span><span class="n">run_until_complete</span><span class="p">(</span><span class="n">main</span><span class="p">(</span><span class="s1">&#39;Python&#39;</span><span class="p">,</span> <span class="s1">&#39;Nederland&#39;</span><span class="p">,</span> <span class="mi">7</span><span class="p">))</span>
</pre></div>
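<p>The choice between a pre-installed browser and Pyppeteer&#8217;s own download can also be made at
runtime. What follows is only a minimal sketch: <tt class="docutils literal">launch_options</tt> is a hypothetical helper, not part
of Pyppeteer, and the candidate executable names are assumptions about what may be installed on a given&nbsp;machine.</p>

```python
import shutil


def launch_options(candidates=('chromium-browser', 'chromium', 'google-chrome'), debug=False):
    """Build keyword arguments for pyppeteer.launch (hypothetical helper).

    If one of the candidate executables is found on PATH, it is passed as
    executablePath; otherwise the key is omitted and Pyppeteer falls back
    to its default behaviour of downloading Chromium.
    """
    options = {'headless': not debug}  # a visible window helps when debugging
    for name in candidates:
        path = shutil.which(name)
        if path:
            options['executablePath'] = path
            break
    return options
```

<p>Inside <tt class="docutils literal">main</tt> it could then be used as <tt class="docutils literal">browser = await <span class="pre">launch(**launch_options())</span></tt>.</p>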
</div>
<div class="section" id="listing-relevant-jobs">
<h2><a class="toc-backref" href="#id54">Listing relevant&nbsp;jobs</a></h2>
<p>Finally, a few minutes after running the above code, there&#8217;s <tt class="docutils literal">db.sqlite</tt>.
I used <tt class="docutils literal">sqliteman</tt> <a class="footnote-reference" href="#id42" id="id21">[20]</a> (available from Ubuntu Xenial&#8217;s repository) to query and
explore it, because it works correctly with <span class="caps">JSON</span> attributes and the clipboard (surprisingly,
SQLiteStudio, which I usually open SQLite databases with, has issues with&nbsp;both).</p>
<p>A sample record looks&nbsp;like:</p>
<div class="highlight"><pre><span></span><span class="p">{</span>
<span class="s2">&quot;dataset&quot;</span><span class="o">:</span> <span class="p">{</span>
<span class="s2">&quot;id&quot;</span> <span class="o">:</span> <span class="mi">2622677334</span><span class="p">,</span>
<span class="s2">&quot;empId&quot;</span> <span class="o">:</span> <span class="mi">823453</span><span class="p">,</span>
<span class="s2">&quot;isOrganicJob&quot;</span> <span class="o">:</span> <span class="kc">true</span><span class="p">,</span>
<span class="s2">&quot;sgocId&quot;</span> <span class="o">:</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span>
<span class="s2">&quot;purchaseAdOrderId&quot;</span> <span class="o">:</span> <span class="mi">0</span><span class="p">,</span>
<span class="s2">&quot;isEasyApply&quot;</span> <span class="o">:</span> <span class="kc">false</span><span class="p">,</span>
<span class="s2">&quot;njslv&quot;</span> <span class="o">:</span> <span class="kc">false</span>
<span class="p">},</span>
<span class="s2">&quot;rating&quot;</span> <span class="o">:</span> <span class="s2">&quot;2.9&quot;</span><span class="p">,</span>
<span class="s2">&quot;url&quot;</span> <span class="o">:</span> <span class="s2">&quot;https://www.glassdoor.nl/partner/jobListing.htm?...&quot;</span><span class="p">,</span>
<span class="s2">&quot;employer&quot;</span> <span class="o">:</span> <span class="s2">&quot;Takeaway.com&quot;</span><span class="p">,</span>
<span class="s2">&quot;city&quot;</span> <span class="o">:</span> <span class="s2">&quot;Amsterdam&quot;</span><span class="p">,</span>
<span class="s2">&quot;date&quot;</span> <span class="o">:</span> <span class="s2">&quot;2018-03-23 16:10:59.413456&quot;</span>
<span class="p">}</span>
</pre></div>
<p>Here follows the query I used. Yours most probably will be&nbsp;different.</p>
<div class="highlight"><pre><span></span><span class="k">SELECT</span>
<span class="n">title</span><span class="p">,</span>
<span class="n">status</span><span class="p">,</span>
<span class="n">round</span><span class="p">(</span><span class="n">julianday</span><span class="p">(</span><span class="s1">&#39;now&#39;</span><span class="p">)</span> <span class="o">-</span> <span class="n">julianday</span><span class="p">(</span><span class="n">json_extract</span><span class="p">(</span><span class="n">properties</span><span class="p">,</span> <span class="s1">&#39;$.date&#39;</span><span class="p">)))</span> <span class="n">ago</span><span class="p">,</span>
<span class="n">json_extract</span><span class="p">(</span><span class="n">properties</span><span class="p">,</span> <span class="s1">&#39;$.employer&#39;</span><span class="p">)</span> <span class="n">employer</span><span class="p">,</span>
<span class="n">json_extract</span><span class="p">(</span><span class="n">properties</span><span class="p">,</span> <span class="s1">&#39;$.dataset.empId&#39;</span><span class="p">)</span> <span class="n">eid</span><span class="p">,</span>
<span class="n">json_extract</span><span class="p">(</span><span class="n">properties</span><span class="p">,</span> <span class="s1">&#39;$.rating&#39;</span><span class="p">)</span> <span class="n">rating</span><span class="p">,</span>
<span class="n">json_extract</span><span class="p">(</span><span class="n">properties</span><span class="p">,</span> <span class="s1">&#39;$.city&#39;</span><span class="p">)</span> <span class="n">city</span><span class="p">,</span>
<span class="n">json_extract</span><span class="p">(</span><span class="n">properties</span><span class="p">,</span> <span class="s1">&#39;$.url&#39;</span><span class="p">)</span> <span class="n">url</span>
<span class="k">FROM</span> <span class="n">job</span>
<span class="k">WHERE</span> <span class="n">job</span> <span class="k">MATCH</span> <span class="s1">&#39;</span>
<span class="s1"> python</span>
<span class="s1"> NOT (django OR php OR java OR ruby OR perl OR C# OR .NET)</span>
<span class="s1"> NOT (title:devops OR title:analyst OR title:analist OR title:support OR title:graduation OR</span>
<span class="s1"> title:director OR title:phd OR title:researcher OR title:biologist OR</span>
<span class="s1"> title:test OR title:tester OR title: &quot;quality assurance&quot; OR title:hacker OR</span>
<span class="s1"> title:intern OR title:internship OR title:junior OR title:young OR</span>
<span class="s1"> title: &quot;full stack&quot; OR title:fullstack OR title: &quot;front end&quot; OR</span>
<span class="s1"> title:embedded OR title:firmware OR title:fpga OR title:C++ OR</span>
<span class="s1"> title:science OR title:scientist OR title:consultant OR title:coach OR</span>
<span class="s1"> title:marketing OR title:network OR title:operations OR title:student OR title:linux OR</span>
<span class="s1"> title:architect OR title:android OR title:automation OR title: &quot;ops engineer&quot; OR</span>
<span class="s1"> title:verification OR title:validation OR title:manager OR title:expert OR</span>
<span class="s1"> title:azure OR title:vmware OR title: &quot;net developer&quot; OR title:trader)</span>
<span class="s1"> NOT (employer:CareerValue OR employer: &quot;Bright Cubes&quot; OR employer:Deloitte OR</span>
<span class="s1"> employer:Cegeka OR employer: &quot;Star Apple&quot; OR employer:HUMAN-CAPITAL OR employer:Bonque OR</span>
<span class="s1"> employer:Sogeti OR employer:Yacht OR employer:iSense OR employer: &quot;Talent Relations&quot; OR</span>
<span class="s1"> employer:Place-IT OR employer:Professionals OR employer:ISAAC OR</span>
<span class="s1"> employer:Trinamics OR employer:CINQ OR employer:Teqoia OR employer:JouwICTvacature OR</span>
<span class="s1"> employer:Gazelle OR employer:Consulting OR employer:Topicus OR employer: &quot;Blue Lynx&quot; OR</span>
<span class="s1"> employer:Nobru OR employer:Qualogy OR employer:YER OR employer: &quot;IT Human Resources&quot; OR</span>
<span class="s1"> employer:Montash OR employer:CodeGuild OR employer:recruitment OR employer: &quot;hot item&quot; OR</span>
<span class="s1"> employer:MobGen OR employer:Mobiquity OR employer: &quot;Orange Quarter&quot; OR</span>
<span class="s1"> employer: &quot;You Get&quot; OR employer:ITHR)</span>
<span class="s1">&#39;</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">employer</span><span class="p">,</span> <span class="nb">date</span><span class="p">(</span><span class="n">json_extract</span><span class="p">(</span><span class="n">properties</span><span class="p">,</span> <span class="s1">&#39;$.date&#39;</span><span class="p">))</span> <span class="k">DESC</span>
</pre></div>
<p>Yes, it&#8217;s pretty restrictive and thus gives relevant results, although I&#8217;m not very
happy with the constant manual blacklisting (there really is a whole lot of intermediaries).
Maybe the next time I&#8217;m looking for a job I&#8217;ll employ some <span class="caps">NLP</span> or other machine learning.
But for now this is&nbsp;it.</p>
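<p>Short of machine learning, the blacklists in the query above could at least live in plain Python lists
and be rendered into the <tt class="docutils literal">MATCH</tt> expression. The sketch below is only string formatting that mirrors
the shape of the query above; <tt class="docutils literal">build_match</tt> and <tt class="docutils literal">column_terms</tt> are hypothetical names, and the
sample terms are a small subset taken from the&nbsp;query.</p>

```python
def column_terms(column, terms):
    """Render terms as column filters, quoting multi-word terms as in the query above."""
    rendered = []
    for term in terms:
        if ' ' in term:
            rendered.append('{}: "{}"'.format(column, term))
        else:
            rendered.append('{}:{}'.format(column, term))
    return ' OR '.join(rendered)


def build_match(required, excluded_words=(), excluded_titles=(), excluded_employers=()):
    """Join the required term and the NOT (...) groups into one MATCH expression."""
    parts = [required]
    if excluded_words:
        parts.append('NOT ({})'.format(' OR '.join(excluded_words)))
    if excluded_titles:
        parts.append('NOT ({})'.format(column_terms('title', excluded_titles)))
    if excluded_employers:
        parts.append('NOT ({})'.format(column_terms('employer', excluded_employers)))
    return ' '.join(parts)


expression = build_match(
    'python',
    excluded_words=['django', 'php'],
    excluded_titles=['devops', 'quality assurance'],
    excluded_employers=['Deloitte', 'Bright Cubes'],
)
```

<p>The resulting string can be passed as a bound parameter, i.e. <tt class="docutils literal">WHERE job MATCH ?</tt>, so the lists
can be edited without touching the <span class="caps">SQL</span>&nbsp;itself.</p>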
</div>
<div class="section" id="wrap-up">
<h2><a class="toc-backref" href="#id55">Wrap-up</a></h2>
<ol class="arabic simple">
<li>SQLite is a Swiss army knife for ad-hoc storage and&nbsp;search,</li>
<li>Ubiquitous browsers have become automatable out of the&nbsp;box,</li>
<li>Even though some vacancy search platforms provide decent search features and others
try to automate job collection, like Indeed with its <span class="caps">XML</span> job feed format <a class="footnote-reference" href="#id43" id="id22">[21]</a>, it
looks like the effort is not enough, and the penetration of developer-friendly automated
job search tools can be significantly&nbsp;improved,</li>
<li>Python for the&nbsp;win!</li>
</ol>
<p><span class="caps">P.S.</span> The new journey has already been scheduled&nbsp;:-)</p>
<hr class="docutils" />
<table class="docutils footnote" frame="void" id="id23" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id1">[1]</a></td><td><a class="reference external" href="https://www.glassdoor.nl/">https://www.glassdoor.nl/</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id24" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id2">[2]</a></td><td><a class="reference external" href="https://stackoverflow.com/help/advanced-search-parameters-jobs">https://stackoverflow.com/help/advanced-search-parameters-jobs</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id25" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id3">[3]</a></td><td><a class="reference external" href="https://docs.python.org/3/library/sqlite3.html">https://docs.python.org/3/library/sqlite3.html</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id26" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id4">[4]</a></td><td><a class="reference external" href="https://sqlite.org/fts3.html">https://sqlite.org/fts3.html</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id27" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id5">[5]</a></td><td><a class="reference external" href="https://sqlite.org/vtab.html">https://sqlite.org/vtab.html</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id28" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label">[6]</td><td><em>(<a class="fn-backref" href="#id6">1</a>, <a class="fn-backref" href="#id8">2</a>)</em> <a class="reference external" href="https://sqlite.org/fts3.html#data_structures">https://sqlite.org/fts3.html#data_structures</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id29" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id7">[7]</a></td><td><a class="reference external" href="https://sqlite.org/fts3.html#_set_operations_using_the_enhanced_query_syntax">https://sqlite.org/fts3.html#_set_operations_using_the_enhanced_query_syntax</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id30" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id10">[8]</a></td><td><a class="reference external" href="https://sqlite.org/json1.html">https://sqlite.org/json1.html</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id31" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id11">[9]</a></td><td><a class="reference external" href="https://sqlite.org/json1.html#jex">https://sqlite.org/json1.html#jex</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id32" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id12">[10]</a></td><td><a class="reference external" href="https://en.wikipedia.org/wiki/Relational_database">https://en.wikipedia.org/wiki/Relational_database</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id33" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id13">[11]</a></td><td><a class="reference external" href="https://en.wikipedia.org/wiki/Document-oriented_database">https://en.wikipedia.org/wiki/Document-oriented_database</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id34" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id9">[12]</a></td><td><a class="reference external" href="https://pypi.org/project/sqlitefts/">https://pypi.org/project/sqlitefts/</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id35" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id14">[13]</a></td><td><a class="reference external" href="https://en.wikipedia.org/wiki/Headless_browser">https://en.wikipedia.org/wiki/Headless_browser</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id36" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id15">[14]</a></td><td><a class="reference external" href="https://github.com/ChromeDevTools/devtools-protocol">https://github.com/ChromeDevTools/devtools-protocol</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id37" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id16">[15]</a></td><td><a class="reference external" href="https://github.com/GoogleChrome/puppeteer">https://github.com/GoogleChrome/puppeteer</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id38" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id17">[16]</a></td><td><a class="reference external" href="https://pypi.org/project/pyppeteer/">https://pypi.org/project/pyppeteer/</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id39" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id20">[17]</a></td><td><a class="reference external" href="https://www.python.org/dev/peps/pep-0525/">https://www.python.org/dev/peps/pep-0525/</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id40" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id18">[18]</a></td><td><a class="reference external" href="https://miyakogi.github.io/pyppeteer/reference.html#elementhandle-class">https://miyakogi.github.io/pyppeteer/reference.html#elementhandle-class</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id41" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id19">[19]</a></td><td><a class="reference external" href="https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.click">https://miyakogi.github.io/pyppeteer/reference.html#pyppeteer.page.Page.click</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id42" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id21">[20]</a></td><td><a class="reference external" href="https://github.com/pvanek/sqliteman">https://github.com/pvanek/sqliteman</a></td></tr>
</tbody>
</table>
<table class="docutils footnote" frame="void" id="id43" rules="none">
<colgroup><col class="label" /><col /></colgroup>
<tbody valign="top">
<tr><td class="label"><a class="fn-backref" href="#id22">[21]</a></td><td><a class="reference external" href="https://www.indeed.com/intl/en/xmlinfo.html">https://www.indeed.com/intl/en/xmlinfo.html</a></td></tr>
</tbody>
</table>
</div>
</content><category term="python"></category><category term="asyncio"></category><category term="browser-automation"></category><category term="sqlite"></category><category term="full-text-search"></category><category term="document-storage"></category></entry><entry><title>CherryPy questions: testing, SSL and Docker</title><link href="https://recollection.saaj.me/article/cherrypy-questions-testing-ssl-and-docker.html" rel="alternate"></link><published>2015-04-23T00:00:00+02:00</published><updated>2015-04-23T00:00:00+02:00</updated><author><name>saaj</name></author><id>tag:recollection.saaj.me,2015-04-23:/article/cherrypy-questions-testing-ssl-and-docker.html</id><summary type="html"><p>Basically this is a CherryPy work-in-progress article. However, it aims to draw the line
with the testing infrastructure I was working on, as it seems to be complete. The article
covers what has been done and what issues I met along the way of reviving the test suite
to serve its purpose. Of course, the way has not been traversed fully and there&#8217;s a lot left to
do. It also seeks to justify concerns about the correctness and viability of CherryPy&#8217;s <span class="caps">SSL</span>
functionality. Finally, it tries to answer the questions that have arisen about the official Docker image and
its range of&nbsp;application.</p>
</summary><content type="html"><p>Basically this is a CherryPy work-in-progress article. However, it aims to draw the line
with the testing infrastructure I was working on, as it seems to be complete. The article
covers what has been done and what issues I met along the way of reviving the test suite
to serve its purpose. Of course, the way has not been traversed fully and there&#8217;s a lot left to
do. It also seeks to justify concerns about the correctness and viability of CherryPy&#8217;s <span class="caps">SSL</span>
functionality. Finally, it tries to answer the questions that have arisen about the official Docker image and
its range of&nbsp;application.</p>
<div class="contents topic" id="contents">
<p class="topic-title first">Contents</p>
<ul class="simple">
<li><a class="reference internal" href="#testing" id="id123">Testing</a><ul>
<li><a class="reference internal" href="#test-determinism" id="id124">Test&nbsp;determinism</a></li>
<li><a class="reference internal" href="#test-dependency" id="id125">Test&nbsp;dependency</a></li>
<li><a class="reference internal" href="#test-isolation" id="id126">Test&nbsp;isolation</a></li>
<li><a class="reference internal" href="#testing-progress" id="id127">Testing&nbsp;progress</a></li>
<li><a class="reference internal" href="#tagged-tox-configuration" id="id128">Tagged Tox&nbsp;configuration</a></li>
<li><a class="reference internal" href="#tackling-entropy-of-threading-and-in-process-server-juggling" id="id129">Tackling entropy of threading and in-process server&nbsp;juggling</a></li>
<li><a class="reference internal" href="#status-indication" id="id130">Status&nbsp;indication</a></li>
<li><a class="reference internal" href="#drawing-the-line" id="id131">Drawing the&nbsp;line</a></li>
</ul>
</li>
<li><a class="reference internal" href="#ssl" id="id132"><span class="caps">SSL</span></a><ul>
<li><a class="reference internal" href="#experiment" id="id133">Experiment</a></li>
<li><a class="reference internal" href="#problem" id="id134">Problem</a></li>
<li><a class="reference internal" href="#plan" id="id135">Plan</a></li>
</ul>
</li>
<li><a class="reference internal" href="#docker" id="id136">Docker</a><ul>
<li><a class="reference internal" href="#nature" id="id137">Nature</a></li>
<li><a class="reference internal" href="#application" id="id138">Application</a></li>
</ul>
</li>
<li><a class="reference internal" href="#questions" id="id139">Questions</a></li>
</ul>
</div>
<div class="section" id="testing">
<h2><a class="toc-backref" href="#id123">Testing</a></h2>
<p>As was discovered in <a class="reference external" href="https://recollection.saaj.me/article/future-of-cherrypy-bright-and-shiny.html">Future of CherryPy</a>, quality
assurance was a glaring problem. After I volunteered and started working on putting the
tests back on the right track, I discovered several methodological testing errors that had been
made over the years of CherryPy&#8217;s existence. Also, back in the day CherryPy had its own test
runner and later survived a migration to Nose, but lost its <span class="caps">SSL</span> tests. Not that the tests were purged
from the codebase, but they couldn&#8217;t be run automatically and required a tester
to change the code to run them with <span class="caps">SSL</span>. Obviously no one did that, and it led to several issues
that left the <span class="caps">SSL</span> functionality broken. Actually it still is: at the moment of writing,
<span class="caps">SSL</span> doesn&#8217;t work in the latest CherryPy release,&nbsp;3.6.</p>
<p>Besides testing issues, there were issues with CherryPy functionality as such. For instance,
session file locking didn&#8217;t seem to have been tested on machines with more than one execution unit.
As a result, session locks didn&#8217;t always work on machines with a multi-core <span class="caps">CPU</span>.</p>
<blockquote>
You may be tempted to think that as long as there is the <span class="caps">GIL</span> <a class="footnote-reference" href="#id67" id="id1">[1]</a>, you shouldn&#8217;t care about the number
of CPUs. Unfortunately, serial execution doesn&#8217;t mean execution in one particular order. Thus, with a
different number of execution units and with different <span class="caps">GIL</span> implementations (Python 3.2 has a new
<span class="caps">GIL</span> <a class="footnote-reference" href="#id68" id="id2">[2]</a>), a threaded Python program will likely behave differently.</blockquote>
<p>I&#8217;ve fixed file locking by postponing lock file removal to a cleanup process. As for the
other session backends, briefly: it was decided to remove the Postgres session backend
because it was broken, untested and mostly undocumented. The Memcached session backend was mostly
fine, except for a minor <tt class="docutils literal">str</tt>/<tt class="docutils literal">bytes</tt> issue in <em>py3</em> and the fact that it isn&#8217;t very useful in
its current condition. The problem is that it doesn&#8217;t implement distributed locking, so once you
have two or more CherryPy instances that use a Memcached server as a session backend and your
users are not distributed consistently across the instances, all sorts of concurrent update
issues may&nbsp;arise.</p>
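<p>To make the concurrent update problem concrete, here is a deliberately simplified sketch of a lost update.
<tt class="docutils literal">SharedStore</tt> is a hypothetical in-memory stand-in for a shared session backend without locking; it
is neither the Memcached client <span class="caps">API</span> nor CherryPy&#8217;s session code, and the interleaving is written out
sequentially to keep it&nbsp;deterministic.</p>

```python
import copy


class SharedStore:
    """In-memory stand-in for a shared session backend that has no locking."""

    def __init__(self):
        self._data = {}

    def load(self, session_id):
        # Each caller gets its own copy, like a value fetched over the network
        return copy.deepcopy(self._data.get(session_id, {}))

    def save(self, session_id, session):
        self._data[session_id] = copy.deepcopy(session)


store = SharedStore()
store.save('abc', {'cart': []})

# Two application instances serve requests of the same user at the same time:
# both load the session before either of them saves it.
first = store.load('abc')
second = store.load('abc')

first['cart'].append('book')
second['cart'].append('pen')

store.save('abc', first)
store.save('abc', second)  # silently overwrites the update made by the first instance

final = store.load('abc')
```

<p>After both saves the session contains only the second instance&#8217;s change. A distributed lock held from load
to save would force the second instance to wait and see the first&nbsp;update.</p>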
<p>Substantial methodological errors in the test suite&nbsp;were:</p>
<ul class="simple">
<li>test&nbsp;determinism</li>
<li>test&nbsp;dependency</li>
<li>test&nbsp;isolation</li>
</ul>
<div class="section" id="test-determinism">
<h3><a class="toc-backref" href="#id124">Test&nbsp;determinism</a></h3>
<p>There were garbage collection tests that were automatically added to every subclass of
<tt class="docutils literal">test.helper.CPWebCase</tt>. They are non-deterministic and fail sporadically (there are
other, normal tests that have non-deterministic behaviour, see below). Because CPython&#8217;s reference
counting garbage collector is deterministic, it&#8217;s a concurrency issue. What really surprised
me was hearing explanations like: okay, if it has failed, run it again until you&#8217;re sure that
the failure persists. This <tt class="docutils literal">test_gc</tt> supplementing is disabled for now, until everything else is
resolved and it is clear how to handle it&nbsp;properly.</p>
<p>Another thing that prevented a clear classification into <em>pass</em> and <em>fail</em>, and led to frequent
freezes, was request retries and the absence of a socket timeout in the <span class="caps">HTTP</span> client. I can only guess at
the original reasons for such &#8220;carefulness&#8221; and for not letting a test just fail. For instance, you can
take some of the concurrency tests that spawn up to a hundred client threads and imagine the stability
and predictability of the undertaking with 10 retries and no client timeout. These two were
also fixed: <tt class="docutils literal">test.webtest.openURL</tt> doesn&#8217;t retry on socket errors and sets a 10-second timeout
on its <span class="caps">HTTP</span> connections by&nbsp;default.</p>
<blockquote>
<tt class="docutils literal">test.helper.CPWebCase</tt> starts a full in-process CherryPy server and then uses a real
<tt class="docutils literal">httplib.HTTPConnection</tt> to connect to it. Most CherryPy tests are in fact either
integration <a class="footnote-reference" href="#id69" id="id3">[3]</a> or system tests <a class="footnote-reference" href="#id70" id="id4">[4]</a>.</blockquote>
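<p>The &#8220;fail fast&#8221; behaviour described above can be sketched as follows. This is an illustration only, not the
actual <tt class="docutils literal">test.webtest.openURL</tt> code: <tt class="docutils literal">open_url</tt> and its signature are hypothetical, and it uses Python 3&#8217;s
<tt class="docutils literal">http.client</tt> (the successor of <tt class="docutils literal">httplib</tt>).</p>

```python
import http.client


def open_url(host, port, path='/', method='GET', timeout=10):
    """Perform exactly one request and return (status, body).

    There is no retry loop: a refused connection or a socket timeout
    propagates to the caller, so the test fails instead of hanging.
    """
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request(method, path)
        response = conn.getresponse()
        return response.status, response.read()
    finally:
        conn.close()
```

<p>With a timeout on every connection, a deadlocked server turns into a failed test within seconds rather than
a frozen test&nbsp;run.</p>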
</div>
<div class="section" id="test-dependency">
<h3><a class="toc-backref" href="#id125">Test&nbsp;dependency</a></h3>
<p>How do you like tests named <tt class="docutils literal">test_0_check_something</tt>, <tt class="docutils literal">test_1_check_something_else</tt>, and
so on? And what if they were named this way intentionally, so that they run in order and fail
otherwise, because the latter depend on the former? In particular, you cannot run such a test individually. All
occurrences I&#8217;ve found were renamed and freed from sibling&nbsp;shackles.</p>
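<p>As an illustration only (the class and test names are made up, not taken from the CherryPy suite), an order-independent test case builds its own state in <tt class="docutils literal">setUp</tt> instead of inheriting it from a sibling test:</p>

```python
import unittest


class ItemListTest(unittest.TestCase):
    """Hypothetical test case whose tests do not depend on each other."""

    def setUp(self):
        # Fresh state for every test, so no test relies on a sibling having run.
        self.items = []

    def test_append(self):
        self.items.append("a")
        self.assertEqual(self.items, ["a"])

    def test_starts_empty(self):
        # Passes regardless of whether test_append ran before it.
        self.assertEqual(self.items, [])
```

<p>Either test can then be run on its own, for example by passing its dotted name to <tt class="docutils literal">python -m unittest</tt>.</p>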
</div>
<div class="section" id="test-isolation">
<h3><a class="toc-backref" href="#id126">Test&nbsp;isolation</a></h3>
<p><tt class="docutils literal">test.helper.CPWebCase</tt>&#8217;s server startup routine ran in the equivalent of <tt class="docutils literal">unittest</tt>&#8217;s
<tt class="docutils literal">setUpClass</tt>, so tests in a test case were not isolated from each other. That made it
problematic to deal with tests that interfere with internals, as it was really
hard to say whether such a test had failed on its own or as a consequence of an earlier one. I&#8217;ve added a
<tt class="docutils literal">per_test_setup</tt> attribute to <tt class="docutils literal">CPWebCase</tt> to conditionally run the routine in <tt class="docutils literal">setUp</tt>
and adapted the tests&#8217; code accordingly. I feared significant performance degradation, but the suite has only
become around 20% slower, which is why I made it true by default. I&#8217;m not sure;
maybe the attribute shouldn&#8217;t exist at all&nbsp;now.</p>
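<p>The switch can be sketched roughly as follows. This is not the actual <tt class="docutils literal">CPWebCase</tt> code: <tt class="docutils literal">FakeServer</tt> and <tt class="docutils literal">WebCase</tt> are hypothetical stand-ins, and only the <tt class="docutils literal">per_test_setup</tt> attribute name comes from the text above.</p>

```python
import unittest


class FakeServer(object):
    """Hypothetical stand-in for the in-process server; counts its starts."""
    started = 0

    def start(self):
        type(self).started += 1

    def stop(self):
        pass


class WebCase(unittest.TestCase):
    # True: start the server for every test; False: once per test case class.
    per_test_setup = True

    @classmethod
    def setUpClass(cls):
        if not cls.per_test_setup:
            cls.server = FakeServer()
            cls.server.start()

    @classmethod
    def tearDownClass(cls):
        if not cls.per_test_setup:
            cls.server.stop()

    def setUp(self):
        if self.per_test_setup:
            self.server = FakeServer()
            self.server.start()

    def tearDown(self):
        if self.per_test_setup:
            self.server.stop()


class IsolatedCase(WebCase):
    def test_one(self):
        self.assertIsNotNone(self.server)

    def test_two(self):
        self.assertIsNotNone(self.server)


class SharedCase(IsolatedCase):
    # Opt out: both inherited tests share one server instance.
    per_test_setup = False
```

<p>Running <tt class="docutils literal">IsolatedCase</tt> starts the stand-in server twice, once per test, while <tt class="docutils literal">SharedCase</tt> starts it once for the whole class.</p>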
<p>Unfortunately, this doesn&#8217;t resolve all isolation issues, because doing a complete reset
of an in-process threaded server is a juggling act. When a tricky test fails badly, it still
has a side effect (see below), like&nbsp;this:</p>
<pre class="literal-block">
./cherrypy/process/wspbus.py:233: RuntimeWarning: The main thread is exiting, but the Bus is
in the states.STARTING state; shutting it down automatically now. You must either call
bus.block() after start(), or call bus.exit() before the main thread exits.
</pre>
</div>
<div class="section" id="testing-progress">
<h3><a class="toc-backref" href="#id127">Testing&nbsp;progress</a></h3>
<p>The fixes and improvements described above live in my fork <a class="footnote-reference" href="#id71" id="id5">[5]</a>; they are not yet
pushed upstream. A more detailed description and discussion of this and the preceding sections is in
the thread in the CherryPy user group <a class="footnote-reference" href="#id72" id="id6">[6]</a>. Because it will only be there until Google decides
to shut down Google Groups because of low traffic, abuse or any other reason they use for such
an announcement, I will summarise the achievements in improving the CherryPy testing&nbsp;infrastructure.</p>
<ol class="arabic simple">
<li>Migrated tests to stdlib <tt class="docutils literal">unittest</tt>, with <tt class="docutils literal">unittest2</tt> as a fallback for <em>py26</em>. Nose is unfriendly
to <em>py3</em> and was probably related to some locking&nbsp;issues.</li>
<li>Overall improvement in avoiding freezes in various test cases and testing environments.
<span class="caps">TTY</span> in Docker, disabled interactive mode, various internals&#8217; tests wait in a thread, et&nbsp;cetera.</li>
<li>Implemented &#8220;tagged&#8221; Tox 1.8+ configuration for matrix of various environments (see&nbsp;below).</li>
<li>Integrated with Drone.io <a class="footnote-reference" href="#id73" id="id7">[7]</a> <span class="caps">CI</span> service and Codecov.io <a class="footnote-reference" href="#id74" id="id8">[8]</a> code coverage tracking&nbsp;service.</li>
<li>Made various changes to allow a parallel environment run with Detox to fit into the 15 minutes of the
Drone.io free tier. These include running tests in the install directory <a class="footnote-reference" href="#id75" id="id9">[9]</a>, starting the server each
time on a free port provided by the <span class="caps">OS</span>, locking Memcached cases with an atomic operation, and&nbsp;others.</li>
<li>Removed global and persistent configuration that prevented mixing <span class="caps">HTTP</span> and <span class="caps">HTTPS</span>&nbsp;cases.</li>
<li>Made it possible to work on the test suite in PyDev. When tests run in the PyDev test runner,
it adds another non-daemonic thread that the tests don&#8217;t expect, which leads to deadlocks or failures
in 3 tests. They are now simply skipped under the PyDev&nbsp;runner.</li>
</ol>
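<p>For the &#8220;free port provided by <span class="caps">OS</span>&#8221; item, one common technique is to bind a socket to port 0 and read back the port the kernel picked. The helper name below is made up and this is a sketch, not the code from the fork:</p>

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, 0))
        # getsockname() returns (host, port) with the port the kernel chose.
        return sock.getsockname()[1]
    finally:
        sock.close()
```

<p>There is a small window between closing the probe socket and the server binding the port, so another process could in principle grab it first; having the server itself bind to port 0 avoids that race where the server supports it.</p>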
<p>Here are the Drone.io commands required to run the test suite. They use Deadsnakes <a class="footnote-reference" href="#id76" id="id10">[10]</a> to install old
or not yet stable versions of Python. The development packages (<tt class="docutils literal"><span class="pre">-dev</span></tt>) are needed to build pyOpenSSL. The rest
comes with the Drone.io container out of the&nbsp;box.</p>
<div class="highlight"><pre><span></span><span class="nb">echo</span> <span class="s1">&#39;debconf debconf/frontend select noninteractive&#39;</span> <span class="p">|</span> sudo debconf-set-selections
sudo add-apt-repository ppa:fkrull/deadsnakes <span class="p">&amp;</span>&gt; /dev/null
sudo apt-get update <span class="p">&amp;</span>&gt; /dev/null
sudo apt-get -y install python2.6 python3.4 <span class="p">&amp;</span>&gt; /dev/null
sudo apt-get -y install python2.6-dev python3.4-dev <span class="p">&amp;</span>&gt; /dev/null
sudo pip install --quiet detox
detox
tox -e post
</pre></div>
<p>As you can see, the <tt class="docutils literal">post</tt> environment closes the build. Unless the build was successful, it won&#8217;t run,
and no coverage will be submitted and no build artifacts will become available. The build artifacts, which
also need to be listed in Drone.io, are several quality assurance reports that can be helpful for
further improving the&nbsp;codebase:</p>
<pre class="literal-block">
coverage_report.txt
coverage_report.html.tgz
maintenance_index.txt
code_complexity.txt
</pre>
<p>For the Codecov.io integration to work, the environment variable <tt class="docutils literal">CODECOV_TOKEN</tt> must be&nbsp;set.</p>
</div>
<div class="section" id="tagged-tox-configuration">
<h3><a class="toc-backref" href="#id128">Tagged Tox&nbsp;configuration</a></h3>
<p>The final Tox configuration looks like the following. It emerged through a series of rewrites that were
driven by tradeoffs between test duration, a combined coverage report and the need to test dependencies.
I arrived at the final design after I realised the importance of supporting pyOpenSSL and thus the need to
test it (see the <span class="caps">SSL</span> issues below). After that, the test sampling facility was easy to integrate into&nbsp;it.</p>
<div class="highlight"><pre><span></span><span class="k">[tox]</span>
<span class="na">minversion</span> <span class="o">=</span> <span class="s">1.8</span>
<span class="na">envlist</span> <span class="o">=</span> <span class="s">pre,docs,py{26-co,27-qa,33-nt,34-qa}{,-ssl,-ossl}</span>
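<span class="c1"># the generative envlist above expands to pre, docs and 12 test environments:</span>
<span class="c1"># py26-co, py26-co-ssl, py26-co-ossl, py27-qa, py27-qa-ssl, py27-qa-ossl,</span>
<span class="c1"># py33-nt, py33-nt-ssl, py33-nt-ossl, py34-qa, py34-qa-ssl, py34-qa-ossl</span>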
<span class="c1"># run tests from install dir, not checkout dir</span>
<span class="c1"># http://tox.rtfd.org/en/latest/example/pytest.html#known-issues-and-limitations</span>
<span class="k">[testenv]</span>
<span class="na">changedir</span> <span class="o">=</span> <span class="s">{envsitepackagesdir}/cherrypy</span>
<span class="na">setenv</span> <span class="o">=</span><span class="s"></span>
<span class="s"> qa: COVERAGE_FILE = {toxinidir}/.coverage.{envname}</span>
<span class="s"> ssl: CHERRYPY_TEST_SSL_MODULE = builtin</span>
<span class="s"> ossl: CHERRYPY_TEST_SSL_MODULE = pyopenssl</span>
<span class="s"> sa: CHERRYPY_TEST_XML_REPORT_DIR = {toxinidir}/test-xml-data</span>
<span class="na">deps</span> <span class="o">=</span><span class="s"></span>
<span class="s"> routes</span>
<span class="s"> py26: unittest2</span>
<span class="s"> py{26,27}: python-memcached</span>
<span class="s"> py{33,34}: python3-memcached</span>
<span class="s"> qa: coverage</span>
<span class="s"> ossl: pyopenssl</span>
<span class="s"> sa: unittest-xml-reporting</span>
<span class="na">commands</span> <span class="o">=</span><span class="s"></span>
<span class="s"> python --version</span>
<span class="s"> nt: python -m unittest {posargs: discover -v cherrypy.test}</span>
<span class="s"> co: unit2 {posargs:discover -v cherrypy.test}</span>
<span class="s"> qa: coverage run --branch --source=&quot;.&quot; --omit=&quot;t*/*&quot; \</span>
<span class="s"> qa: --module unittest {posargs: discover -v cherrypy.test}</span>
<span class="s"> sa: python test/xmltestreport.py</span>
<span class="k">[testenv:docs]</span>
<span class="na">basepython</span> <span class="o">=</span> <span class="s">python2.7</span>
<span class="na">changedir</span> <span class="o">=</span> <span class="s">docs</span>
<span class="na">commands</span> <span class="o">=</span> <span class="s">sphinx-build -q -E -n -b html . build</span>
<span class="na">deps</span> <span class="o">=</span><span class="s"></span>
<span class="s"> sphinx &lt; 1.3</span>
<span class="s"> sphinx_rtd_theme</span>
<span class="k">[testenv:pre]</span>
<span class="na">changedir</span> <span class="o">=</span> <span class="s">{toxinidir}</span>
<span class="na">deps</span> <span class="o">=</span> <span class="s">coverage</span>
<span class="na">commands</span> <span class="o">=</span> <span class="s">coverage erase</span>
<span class="c1"># must be run separately after main envlist, because of detox</span>
<span class="k">[testenv:post]</span>
<span class="na">changedir</span> <span class="o">=</span> <span class="s">{toxinidir}</span>
<span class="na">deps</span> <span class="o">=</span><span class="s"></span>
<span class="s"> coverage</span>
<span class="s"> codecov</span>
<span class="s"> radon</span>
<span class="na">commands</span> <span class="o">=</span><span class="s"></span>
<span class="s"> bash -c &#39;echo -e &quot;[paths]\nsource = \n cherrypy&quot; &gt; .coveragerc&#39;</span>
<span class="s"> bash -c &#39;echo &quot; .tox/*/lib/*/site-packages/cherrypy&quot; &gt;&gt; .coveragerc&#39;</span>
<span class="s"> coverage combine</span>
<span class="s"> bash -c &#39;coverage report &gt; coverage_report.txt&#39;</span>