This repository has been archived by the owner on Dec 3, 2020. It is now read-only.

Investigate Fathom-based page extraction #36

Closed
Osmose opened this issue Jul 30, 2018 · 8 comments

@Osmose
Contributor

Osmose commented Jul 30, 2018

As per the latest mockups, we probably want to extract the following info about a product:

  • Name of shopping site (e.g. Amazon)
  • Price
  • Product name
  • Representative image of product

In lieu of a better way of verifying whether the Fathom ruleset works, let's use this:

  • A training corpus of 100 pages from the 5 supported commerce sites.
  • Train the ruleset using 50 of the pages, verify it against the other 50
  • Aim for 95% accuracy
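
The evaluation plan above can be sketched as a split-and-score routine. This is a minimal sketch, not project code: `splitCorpus`, `accuracy`, and the page object shape (`expected`, `extracted`) are hypothetical names chosen for illustration, and the toy corpus stands in for real labeled pages.

```javascript
// Hypothetical helpers; names and the page shape are assumptions.

// Split a corpus in half: first half for training, second half for verification.
function splitCorpus(pages) {
  const middle = Math.floor(pages.length / 2);
  return {training: pages.slice(0, middle), validation: pages.slice(middle)};
}

// Fraction of pages for which the extractor returns the expected value.
function accuracy(pages, extract) {
  if (pages.length === 0) {
    return 0;
  }
  const correct = pages.filter(page => extract(page) === page.expected).length;
  return correct / pages.length;
}

// Toy corpus of 100 pages with a stand-in extraction result per page.
const corpus = Array.from({length: 100}, (_, i) => ({
  url: `https://example.com/product/${i}`,
  expected: `$${i}.99`,
  // Pretend extraction fails on every 25th page.
  extracted: i % 25 === 0 ? null : `$${i}.99`,
}));

const {training, validation} = splitCorpus(corpus);
const score = accuracy(validation, page => page.extracted);
const meetsTarget = score >= 0.95;
```

In practice the split would probably be shuffled or stratified per site so that each of the five sites appears in both halves.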
@Osmose
Contributor Author

Osmose commented Jul 30, 2018

@javaun What 5 shopping sites do we want to target for the first round? We can expand later if necessary.

@Osmose Osmose changed the title Prototype Fathom-based page extraction Implement Fathom-based page extraction Jul 30, 2018
biancadanforth added a commit that referenced this issue Jul 31, 2018
Got Fathom working in the WebExtension based on [Swathi Iyer](https://github.com/swathiiyer2/fathom-products) and [Victor Ng’s](https://github.com/mozilla/fathom-webextension) prior work. Right now it just logs the highest scoring nodes’ values for product title, image and price in the console from background.js.

While [`jsdom`](https://www.npmjs.com/package/jsdom) is a `fathom-web` dependency, it is used only for running `fathom-web` in the Node context for testing. To avoid build errors associated with `jsdom` and its dependencies, I added a `'null-loader'` for that `require` call, which mocks the module as an empty object. This loader is also used in webpack.config.test.js, from PR #32.

Issues:
* I get a notification from Firefox at times saying the extension is slowing down the browser.
* Background.js is not always logging the results to the console with `npm run watch` enabled. I’m not yet sure why.
@biancadanforth
Collaborator

Last night, I was able to get Fathom running in our web extension, in large part thanks to Victor Ng's and Swathi Iyer's prior work.

This currently runs Fathom in the content script for the tab, and I am getting consistent notifications from Firefox on an Amazon product page that it is slowing down Firefox.

@Osmose , would it be possible to run this script in a separate thread/process? In general, what might be some options to work through this? Currently I cannot get Fathom to run reliably without hanging.

[Screenshot taken 2018-07-31 at 9:02:59 AM]

biancadanforth added a commit that referenced this issue Aug 1, 2018
This patch will run Fathom against the page (not distinguishing a product from a non-product page) and log the extracted price value and page URL to the console via 'background.js'. Failing that, it will fall back to extraction via CSS selectors if any exist for the site in 'product_extraction_data.json', and failing that, it will try extraction via Open Graph meta tags.

This is heavily based on [Swathi Iyer](https://github.com/swathiiyer2/fathom-products) and [Victor Ng’s](https://github.com/mozilla/fathom-webextension) prior work. Currently, there is only one ruleset with one naive rule for one product feature, price. This initial commit is intended to cover Fathom integration into the web extension. A later commit will add rules and take training data into account.

Note: The 'runRuleset' method in 'productInfo.js' returns 'NaN' if it doesn't find any elements for any of its rules.
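
The fallback order described in this commit message (Fathom, then CSS selectors, then Open Graph meta tags) could look roughly like the following. This is a sketch under assumptions: `extractProduct` and the stand-in extractors are hypothetical, and the real extractors would inspect the page's DOM rather than return canned values.

```javascript
// Hypothetical sketch of the fallback order described above.
// Each extractor returns an object like {price: '...'} or a "nothing found"
// value (null, undefined, or NaN, the latter matching the runRuleset note).
function extractProduct(extractors) {
  for (const [method, extractor] of extractors) {
    const result = extractor();
    const found =
      result !== null && result !== undefined && !Number.isNaN(result);
    if (found) {
      return {method, ...result};
    }
  }
  return null;
}

// Stand-in extractors in priority order.
const extractors = [
  ['fathom', () => NaN], // pretend the ruleset found no elements
  ['css_selectors', () => ({price: '$19.99'})],
  ['open_graph', () => ({price: '$18.00'})],
];

const product = extractProduct(extractors);
```

Keeping the extractors in an ordered list makes it easy to swap the priority later, for example to make CSS selectors primary.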

Performance observations:
Originally, I had dumped Swathi's three rulesets (one each for product title, image and price) and tried to run them against any page, similar to Victor Ng's web extension. However, that was freezing up the tab, as described in my earlier comment on this issue. After replacing Swathi's rulesets with a single ruleset containing only one rule for one attribute, I profiled the content script that runs Fathom again: I did not see any warnings from Firefox, nor did I detect any significant performance hit in the DevTools profiler due to Fathom. It would therefore appear that the performance hit was related to the complex rulesets and not Fathom itself.

Webpack observations:
While [`jsdom`](https://www.npmjs.com/package/jsdom) is a `fathom-web` dependency, it is used only for running `fathom-web` in the Node context for testing. To avoid build errors associated with `jsdom` and its dependencies, I added a `'null-loader'` for that `require` call, which mocks the module as an empty object. This loader is also used in webpack.config.test.js, from PR #32.
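
For readers unfamiliar with `null-loader`, a webpack rule along these lines replaces a matched module with an empty one. This is only a sketch: the `test` pattern below is an assumption about where `jsdom` resolves from, not the pattern used in this repository.

```javascript
// webpack.config.js (sketch): replace the jsdom module with an empty module.
// The path pattern is an assumption; adjust it to the real module location.
const config = {
  module: {
    rules: [
      {
        test: /node_modules[\\/]jsdom[\\/]/,
        use: 'null-loader',
      },
    ],
  },
};

module.exports = config;
```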
@javaun

javaun commented Aug 1, 2018

Top domains for EN-US:

  1. Amazon
  2. eBay * +
  3. Walmart
  4. Best Buy
  5. Macy's
  • eBay is a distant no. 2 in sales and may be a PITA. It's also a marketplace and a mixed bag. But they're huge (2x the size of Walmart by online sales) AND we have a rev share with them. If they turn into a Fathom pain or don't (or testing says users don't want them), we could drop them at some point.
  • Not in this list is Apple.com, which is roughly the size of Walmart in ecommerce. But they're a brand destination and probably not a great use of one of our precious 5 slots for MVP.

@javaun

javaun commented Aug 1, 2018

My bad: for 5 let's use Homedepot.com instead of Macy's. Home Depot is slightly bigger and less likely to be roadkill like Macy's. Clothes are getting hammered. Hardware/home much more defensible

@brucko

brucko commented Aug 1, 2018

Dropping these here if you need more inspiration

Amazon, Walmart, Best Buy, Apple, Target, Etsy, Ikea, Gap, Macys, Home Depot, Lowes, Staples, Office Depot, Zappos, B&H, Costco

biancadanforth added a commit that referenced this issue Aug 2, 2018
biancadanforth added a commit that referenced this issue Aug 3, 2018
This commit builds off of PR #38, so that PR should merge before this.

Open questions:
* How to test interdependent rules, such as 'isNearProductImage' for the product title and product price candidate elements?
* The only feature that is pulled out accurately on my test page (an Amazon product page) is the image. What rules can I add/modify to get title and price correct?

TODO:
* Add rule to remove ancestor elements who have the same 'innerText' value.
* Consider adding image rule to see if an image element is the largest image on the page (above the fold).
* Add price rule to see if innerText starts with '$'.
biancadanforth added a commit that referenced this issue Aug 3, 2018
#36: Integrate Fathom-based page extraction with a simple ruleset.
@biancadanforth
Collaborator

biancadanforth commented Aug 4, 2018

Re: Perceived Fathom slowness

  • This was observed when running Fathom on every page with Swathi Iyer's rulesets.
  • We profiled the extension using Fathom with and without Swathi's rulesets, and we discovered the performance issues were largely related to the complexity of her rulesets (principally, some RegExp rules), not Fathom itself.

Possible rules to consider for extracting product features (title, price, image) from a product page

Fathom scores elements in the DOM by how well they follow a given ruleset, with modifiers applied based on how good that rule seems to be at finding the correct element.

Here is a list of some rule ideas collected from swathiiyer2, Osmose, erikrose and myself by feature, ranked in order of how well they seem to match a product page for our top 5 sites (listed here and here):

All three features:

  • Above the fold
  • Clustered "nearby" one another
    • This is an interdependency rule; for example, once we are confident in an image, find title-like and price-like elements nearby
  • Has the feature name (e.g. "title" or "price") in its id
  • Has the feature name (e.g. "title" or "price") in one of its class names
    • Use includes for better perf than a RegExp.
    • Check ancestor elements for class names and add a smaller modifier.
    • Add a smaller modifier for this rule compared to the id rule above.
  • Larger font size (for "title" and "price")
  • Optimal string length (for "title" and "price")
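
The id and class-name ideas above could be sketched as a single scoring helper. Everything here is illustrative: `keywordScore` and the weights are hypothetical, the weights are not trained coefficients, and multiplying scores is an assumption about how the modifiers combine.

```javascript
// Hypothetical scoring helper; the weights are illustrative, not trained.
const ID_WEIGHT = 2;
const CLASS_WEIGHT = 1.5;
const ANCESTOR_CLASS_WEIGHT = 1.2;

// Lowercase a value if it is a string; anything else counts as empty.
function lower(value) {
  return typeof value === 'string' ? value.toLowerCase() : '';
}

// `element` is any object with `id`, `className` and `parentElement`,
// which DOM elements provide. Uses String.prototype.includes, not a RegExp.
function keywordScore(element, keyword) {
  let score = 1;
  if (lower(element.id).includes(keyword)) {
    score *= ID_WEIGHT;
  }
  if (lower(element.className).includes(keyword)) {
    score *= CLASS_WEIGHT;
  }
  // Walk up the ancestors and apply the smaller bonus at most once.
  for (let node = element.parentElement; node; node = node.parentElement) {
    if (lower(node.className).includes(keyword)) {
      score *= ANCESTOR_CLASS_WEIGHT;
      break;
    }
  }
  return score;
}
```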

For each feature below, these are additional, feature-specific considerations on top of the cross-feature considerations above:

Product title:

  • An h1 element

Product price:

  • A span or h2 element
    • Ignore ancestors with same innerText value
  • Has a $ in the string
  • Looks like or contains a price-like pattern, like '$XXXX.XX', with any number of Xs to the left of the decimal and exactly two to the right. X is a digit, 0-9.
    • May need to trim off substring before $
    • Home Depot notably does not match the '.XX' at the end.
  • Non-default text color
    • Only seems to be the case for Amazon
  • A bunch more rules from Swathi's work
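
A rough sketch of the price-pattern ideas above follows. The helper names are hypothetical. Two choices here are assumptions rather than anything stated in the thread: the cents are optional (so prices without a trailing '.XX' still match), and commas are allowed as thousands separators.

```javascript
// Hypothetical helpers for the price rules listed above.

// Matches "$", then digits (commas allowed), then an optional ".XX".
const PRICE_PATTERN = /\$\s?\d[\d,]*(\.\d{2})?/;

function hasDollarSign(text) {
  return text.includes('$');
}

// Returns the price-like substring, dropping anything before the "$",
// or null when the text does not contain a price-like pattern.
function findPrice(text) {
  const match = PRICE_PATTERN.exec(text);
  return match ? match[0] : null;
}
```

For example, `findPrice('Price: $1,299.99 & FREE Shipping')` returns `'$1,299.99'`, and `findPrice('Free')` returns `null`.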

Product image:

  • An img element
  • Largest image
  • Sufficiently large image
  • Aspect ratio closest to 1:1 (image is more square in shape than rectangular)
  • A bunch more rules from Swathi's work
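
The image ideas above reduce to a few geometric checks. The following is a minimal sketch with hypothetical helper names and an illustrative size threshold; the image objects only need `width` and `height`, as would be available from an element's bounding rectangle.

```javascript
// Hypothetical helpers; the threshold is an illustrative assumption.
const MIN_AREA = 200 * 200;

function area({width, height}) {
  return width * height;
}

function isSufficientlyLarge(image) {
  return area(image) >= MIN_AREA;
}

// 1 for a perfectly square image, approaching 0 as it gets more elongated.
function squareness({width, height}) {
  if (width <= 0 || height <= 0) {
    return 0;
  }
  return Math.min(width, height) / Math.max(width, height);
}

// Returns the image with the largest area, or null for an empty list.
function largestImage(images) {
  return images.reduce(
    (best, image) => (best === null || area(image) > area(best) ? image : best),
    null
  );
}
```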

biancadanforth added a commit that referenced this issue Aug 4, 2018
These rules successfully pull out product title, price and image from the following product pages (one each from the 5 top sites):
* [Amazon](https://www.amazon.com/KitchenAid-KL26M1XER-Professional-6-Qt-Bowl-Lift/dp/B01LYV1U30?smid=ATVPDKIKX0DER&pf_rd_p=0c7b792f-241a-4510-94f4-dd184a76f201&pf_rd_r=AZD7BGV3JZGTB23F30X3)
* [Ebay](https://www.ebay.com/p/Best-Choice-Products-650W-6-speed-5-5QT-Kitchen-Food-Stand-Mixer-with-Stainless-Steels-Bowl-Black/3018375728?iid=253733404998)
* [Walmart](https://www.walmart.com/ip/KitchenAid-Classic-Series-4-5-Quart-Tilt-Head-Stand-Mixer-Onyx-Black-K45SSOB/29474640)
* [Best Buy](https://www.bestbuy.com/site/jbl-everest-elite-750nc-wireless-over-ear-noise-cancelling-headphones-gunmetal/5840136.p?skuId=5840136)
* [Home Depot](https://www.homedepot.com/p/Husky-SAE-Combination-Wrench-Set-10-Piece-HCW10PCSAE/202934501)

TODO:
* Create a training set with FathomFox and run these rules against them to measure their accuracy for 50 product pages (10 from each top site).
* Modify trimTitle method, so it doesn't cut off the color from the title for the product on Ebay.
* Generalize formatPrice method. @Osmose, would you have any suggestions?
biancadanforth added a commit that referenced this issue Aug 7, 2018
This commit adds 50 product pages with the correct product "title", "image" and "price" elements tagged using the [FathomFox](https://addons.mozilla.org/en-US/firefox/addon/fathomfox/) add-on for use as a training set with Fathom (e.g. with an attribute data-fathom="price").

There are 10 product pages each from our top 5 sites: Amazon, Ebay, Walmart, Best Buy and Home Depot, with one page per rough product category: Home, Food, Electronics, Books, Hardware, Clothing, Outdoors, Toys, Beauty and Crafts. There are at least a few pages that include a range of prices, a sale price, or some other common variation.
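
Given pages tagged this way, checking whether a ruleset picked the right element comes down to comparing the pick with the tagged element. A minimal sketch, assuming a hypothetical `isCorrectPick` helper and a `doc` argument that behaves like a `Document`:

```javascript
// Hypothetical check; `doc` is a Document (or anything with querySelector).
function isCorrectPick(doc, feature, pickedElement) {
  const tagged = doc.querySelector(`[data-fathom="${feature}"]`);
  return tagged !== null && tagged === pickedElement;
}
```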

#### Page URLs:
Amazon:
* https://www.amazon.com/KitchenAid-KL26M1XER-Professional-6-Qt-Bowl-Lift/dp/B01LYV1U30?smid=ATVPDKIKX0DER&pf_rd_p=0c7b792f-241a-4510-94f4-dd184a76f201&pf_rd_r=AZD7BGV3JZGTB23F30X3
* https://www.amazon.com/dp/B077YG22Y9/ref=sspa_dk_detail_3?psc=1&pd_rd_i=B077YG22Y9&pd_rd_wg=mfQm0&pd_rd_r=K27A84Z9Y01R47H7BJWR&pd_rd_w=km01K
* https://www.amazon.com/Lindt-Assorted-Chocolate-Gourmet-Truffles/dp/B004XDMS3C/ref=sr_1_6_s_it?s=grocery&ie=UTF8&qid=1533437720&sr=1-6&keywords=chocolates&th=1
* https://www.amazon.com/MAXPOWER-8-piece-Metric-Combination-Wrench/dp/B074R5T7TB/ref=sr_1_17?s=power-hand-tools&ie=UTF8&qid=1533145143&sr=1-17&keywords=wrench
* https://www.amazon.com/Signature-Levi-Strauss-Gold-Label/dp/B073V3TQHL/ref=sr_1_6?ie=UTF8&qid=1533601875&sr=8-6&keywords=pants
* https://www.amazon.com/gp/product/B00E257T6C/ref=s9_acsd_ps_bw_c_x_2_w?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=merchandised-search-3&pf_rd_r=3ZSXZBJ3FGDJ421XVQMT&pf_rd_r=3ZSXZBJ3FGDJ421XVQMT&pf_rd_t=101&pf_rd_p=fe185ec9-c8f5-44c0-897e-4c0bde93268c&pf_rd_p=fe185ec9-c8f5-44c0-897e-4c0bde93268c&pf_rd_i=283155
* https://www.amazon.com/Rene-Furterer-Naturia-Balancing-Shampoo/dp/B008K3PEAK/ref=sr_1_1_a_it?ie=UTF8&qid=1533434644&sr=8-1
* https://www.amazon.com/gp/product/B001NCDE8O/ref=s9_acsd_al_bw_c_x_3_w?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=merchandised-search-9&pf_rd_r=RDKRBA4WA8KS1S18N086&pf_rd_r=RDKRBA4WA8KS1S18N086&pf_rd_t=101&pf_rd_p=9ce7601e-e994-4475-8a00-833f2e408af7&pf_rd_p=9ce7601e-e994-4475-8a00-833f2e408af7&pf_rd_i=3400371
* https://www.amazon.com/gp/product/B077T6V74Z/ref=s9_acsd_topr_hd_bw_bEKkv_c_x_2_w?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=merchandised-search-4&pf_rd_r=HRV980DM7J7PFN02MGN6&pf_rd_t=101&pf_rd_p=e88c96d5-222c-5d22-af37-9401d11728f4&pf_rd_i=3416381
* https://www.amazon.com/Leather-Handle-PERSONALIZATION-Vegetable-Cotton/dp/B0718XQXHL/ref=lp_15735446011_1_1?s=handmade&ie=UTF8&qid=1533434852&sr=1-1

Ebay:
* https://www.ebay.com/p/Best-Choice-Products-650W-6-speed-5-5QT-Kitchen-Food-Stand-Mixer-with-Stainless-Steels-Bowl-Black/3018375728?iid=253733404998
* https://www.ebay.com/itm/Philadelphia-Candies-Milk-Chocolate-Covered-Assorted-Nuts-1-Pound-Gift-Box/263103368447?epid=1301555428&hash=item3d422eccff
* https://www.ebay.com/itm/Dell-Inspiron-7570-15-6-Touch-Laptop-i7-8550U-1-8GHz-8GB-1TB-NVIDIA-940MX-W10/263827294291
* https://www.ebay.com/itm/Craftsman-32-piece-Inch-and-Metric-Combination-Wrench-Set-Free-Shipping-NEW/112762904805?epid=1951893956&hash=item1a413160e5%3Ag%3AML4AAOSwxOFaYPfu&_sacat=0&_nkw=wrenches&_from=R40&rt=nc&_trksid=p2380057.m570.l1313.TR12.TRC2.A0.H0.Xwrenches.TRS0
* https://www.ebay.com/itm/NEW-Calvin-Klein-Jeans-Mens-5-Pocket-Herringbone-Straight-Pant-Variety-NWT/172602115751?hash=item282fe346a7%3Am%3AmzGTxRA3ng5fU6LdZwdQssg%3Asc%3AUSPSPriority%2194041%21US%21-1&var=472089425020&_pgn=8&_sacat=0&_nkw=pants&_from=R40&rt=nc
* https://www.ebay.com/itm/The-Eye-of-the-World-The-Wheel-of-Time-Book-1-by-Jordan-Robert/142480282854?epid=907028&hash=item212c7c94e6%3Ag%3A5AsAAOSwn3Vap4yC&_sacat=0&_nkw=book&_from=R40&rt=nc&_trksid=p2380057.m570.l1313.TR12.TRC2.A0.H0.Xbook.TRS0
* https://www.ebay.com/itm/Beard-Hair-Growther-Facial-Mustache-Growth-Fast-Grow-Rich-Texture-Natural-Oil/222292486406?epid=691256086&hash=item33c1aa2906%3Ag%3AcasAAOSwA3dYDofw&_sacat=0&_nkw=beard+oil&_from=R40&rt=nc&_trksid=p2047675.m570.l1313.TR12.TRC2.A0.H0.Xbeard+oil.TRS0
* https://www.ebay.com/itm/Portable-SAND-TENT-CAMPING-FISHING-BEACH-SHELTER-SUN-SHADE-OUTDOOR-CANOPY-NEW/252060937931?hash=item3ab000aecb%3Am%3AmINYPcBvUlI0fUVlLnP1HNQ&var=552341391997&_sacat=0&_nkw=camping&_from=R40&rt=nc&LH_TitleDesc=0
* https://www.ebay.com/itm/Magic-Track-Car-Toys-With-Flashing-Lights-Educational-Toys-For-Children-Boys-US/162809813967?hash=item25e8389bcf%3Am%3Am_1VuBe8hB83kv8pXINSySg&var=461850886879&_sacat=0&_nkw=toy&_from=R40&rt=nc&_trksid=p2047675.m570.l1313.TR12.TRC2.A0.H0.Xtoy.TRS0
* https://www.ebay.com/itm/99-9-Pure-Copper-Cu-Metal-Sheet-Foil-0-1x100x100MM-For-Handicraft-Aerospace/302042932136?epid=506287267&hash=item46532963a8%3Ag%3A6lQAAOSwxp9W7QXM&_pgn=2&_sacat=0&_nkw=handicraft&_from=R40&rt=nc

Walmart:
* https://www.walmart.com/ip/KitchenAid-Classic-Series-4-5-Quart-Tilt-Head-Stand-Mixer-Onyx-Black-K45SSOB/29474640
* https://www.walmart.com/ip/HP-15-bs212wm-15-6-Laptop-Windows-10-Intel-Celeron-N4000-Processor-4GB-Memory-500GB-Hard-Drive-DVD-Jet-Black/139270579
* https://www.walmart.com/ip/Reese-s-Kit-Kat-Chocolate-Candy-Miniatures-Assortment-40-Oz/10449915
* https://www.walmart.com/ip/Hyper-Tough-18-Piece-Combination-Wrench-Set/49701823
* https://www.walmart.com/ip/Athletic-Works-Women-s-Essential-Athleisure-Knit-Pant-Available-in-Regular-and-Petite/597959224
* https://www.walmart.com/ip/507-Mechanical-Movements-Mechanisms-and-Devices/53282684
* https://www.walmart.com/ip/Maracuja-Shea-Oils-Beard-Conditioning-Oil/177023383
* https://www.walmart.com/ip/GigaTent-Liberty-Trail-2-Dome-Tent-7-x-7-Sleeps-3/24547530
* https://www.walmart.com/ip/LEGO-Jurassic-World-Jurassic-Park-Velociraptor-Chase-75932/873652909
* https://www.walmart.com/ip/Sargent-Art-Tempera-Artist-Starter-Set/48002550

Best Buy:
* https://www.bestbuy.com/site/kitchenaid-artisan-tilt-head-stand-mixer-ocean-drive/6133651.p?skuId=6133651
* https://www.bestbuy.com/site/swiss-miss-milk-chocolate-hot-cocoa-k-cup-pods-16-pack/4700838.p?skuId=4700838
* https://www.bestbuy.com/site/jbl-everest-elite-750nc-wireless-over-ear-noise-cancelling-headphones-gunmetal/5840136.p?skuId=5840136
* https://www.bestbuy.com/site/hangman-handy-hammer-kit-silver/5609603.p?skuId=5609603
* https://www.bestbuy.com/site/bioworld-call-of-duty-retribution-t-shirt-extra-large-blue/5580896.p?skuId=5580896
* https://www.bestbuy.com/site/amazon-kindle-oasis-e-reader-7-high-resolution-display-300-ppi-waterproof-built-in-audible-32gb-wi-fi-black-and-silver/6102700.p?skuId=6102700
* https://www.bestbuy.com/site/philips-norelco-7200-beard-trimmer-silver/4820905.p?skuId=4820905
* https://www.bestbuy.com/site/char-broil-tabletop-grill-black/7141853.p?skuId=7141853
* https://www.bestbuy.com/site/mattel-pokemon-gyarados-building-set-multi/5888525.p?skuId=5888525
* https://www.bestbuy.com/site/osmo-creative-kit-2017-multi/5452007.p?skuId=5452007

Home Depot:
* https://www.homedepot.com/p/KitchenAid-Classic-4-5-Qt-Tilt-Head-White-Stand-Mixer-K45SSWH/202546032
* https://www.homedepot.com/p/Flossugar-1-2-Gal-Chocolate-93216CT/302429246
* https://www.homedepot.com/p/Commercial-Electric-Around-the-Neck-Premium-Bluetooth-Stereo-Earbuds-HD0855/300372796
* https://www.homedepot.com/p/Husky-SAE-Combination-Wrench-Set-10-Piece-HCW10PCSAE/202934501
* https://www.homedepot.com/p/Dickies-Men-s-Large-White-Pocket-T-Shirt-GS407WH-LG/203507188
* https://www.homedepot.com/p/The-Home-Depot-Tiling-2nd-Edition-0696228580/100491813
* https://www.homedepot.com/p/Dermatone-4-oz-50-SPF-Sunscreen-Tube-503172734/304343553
* https://www.homedepot.com/p/Wakeman-4-Person-Dome-Tent-M470026/302029202
* https://www.homedepot.com/p/Step2-Deluxe-Canyon-Road-Train-and-Track-Table-754700/100483237
* https://www.homedepot.com/p/ABOLOS-Handicraft-II-Brown-Orange-12-in-x-12-in-Glass-Mosaic-Tile-HMDHDCSQ-SF/303320516
biancadanforth added a commit that referenced this issue Aug 7, 2018
Product title rules previously pulled the unique 'title' element from the 'head' element on the page (part of the page's metadata). While this ostensibly requires less processing (we don't have to search the DOM or score any other elements), the title string often requires site-specific cleaning, such as removing the vendor name, and the final, cleaned-up string cannot be verified as accurate by Fathom, which only tells us if our rules picked the right element.

The alternative approach, implemented here, is to pull the title from the corresponding element in the content of the page. Since Fathom can verify that the right element was selected, and the string from this element would not require any cleaning, this approach is a much better proxy for extracting the correct product title.
biancadanforth added a commit that referenced this issue Aug 8, 2018
This enables the same script to be used for training and running in the commerce webextension.

How to train a ruleset with Fathom:
1. Follow Fathom's [instructions](https://github.com/erikrose/fathom-fox).
2. Open the [Fathom Trainees](https://github.com/mozilla/fathom-trainees) add-on in a new profile.
3. Install FathomFox in that window from AMO.
4. Drag and drop the training corpus (HTML files in ./training-set) into that window.
5. Copy ./src/fathom_ruleset.js into fathom-trainees/src/trainees.js and save over it.
6. Choose a feature to train, 'price', 'title' or 'image', and edit `trainees.set()` so that one of those features is the first argument.
7. Comment out the rules pertaining to all but that feature.
8. Click the FathomFox browserAction and select "Train"
9. Select the feature from the dropdown list and click the "Train against the tabs in this window" button.
10. You will see the accuracy based on the initial coefficients passed in, and Fathom will start generating optimized coefficients. This could take a while.
11. When Fathom is done, those coefficients will be logged to the Fathom page.
biancadanforth added a commit that referenced this issue Aug 8, 2018
biancadanforth added a commit that referenced this issue Aug 10, 2018
This enables the same script to be used for training and running in the commerce webextension.

How to train a ruleset with Fathom:
1. Follow Fathom's [Trainer instructions](https://github.com/erikrose/fathom-fox#the-trainer).
2. Open the [Fathom Trainees](https://github.com/mozilla/fathom-trainees) add-on in a new profile.
3. Install FathomFox in that window from AMO.
4. Drag and drop the training corpus into that window.
  - Note: The training corpus consists of HTML files frozen using [FathomFox's DevTools panel](https://github.com/erikrose/fathom-fox#the-developer-tools-panel); our training corpus is on the shared "commerce" Google drive.
  - Note: As of the date of this commit, the Corpus Collector is not a recommended option for building a training corpus due to a `freeze-dry` dependency bug that inserts a bunch of extra garbage when re-freezing a frozen page.
5. Copy ./src/fathom_ruleset.js into fathom-trainees/src/trainees.js and save over it.
6. Choose a feature to train, 'price', 'title' or 'image', and edit `trainees.set()` so that one of those features is the first argument.
7. Comment out the rules pertaining to all but that feature.
  - Currently, you can only train one ruleset at a time with Fathom, and only one `out` (e.g. 'title', 'image' or 'price') at a time for a given ruleset.
  - If you have multiple `out`s you'd like to train simultaneously, repeat this process for the remaining features so Fathom is running in a separate browser window for each feature and its corresponding rules.
8. Click the FathomFox browserAction and select "Train"
9. Select the feature from the dropdown list and click the "Train against the tabs in this window" button.
10. The array of coefficients displayed on the training page will update over time as Fathom optimizes them; this could take a while.
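
To make step 6 more concrete, a `trainees.js` entry might look roughly like this. It is a sketch under assumptions: the `{coeffs, rulesetMaker}` entry shape is assumed from the Fathom Trainees README of that period and may differ in other versions, the coefficient names are hypothetical, and the placeholder return value stands in for a real `fathom-web` ruleset.

```javascript
// Sketch of a trainees.js entry. The {coeffs, rulesetMaker} shape is an
// assumption based on the Fathom Trainees README of that period.
const trainees = new Map();

trainees.set(
  'price', // the feature to train: 'price', 'title' or 'image'
  {
    // Initial coefficients; the trainer optimizes these over time.
    coeffs: [1, 1, 1],
    // Hypothetical coefficient names, one per rule being trained.
    rulesetMaker([coeffHasDollarSign, coeffNearImage, coeffFontSize]) {
      // Placeholder: a real implementation would build and return a
      // fathom-web ruleset whose rules use these coefficients.
      return {coeffHasDollarSign, coeffNearImage, coeffFontSize};
    },
  }
);

// In the add-on, the map would then be exported for FathomFox to pick up.
```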
@Osmose Osmose changed the title Implement Fathom-based page extraction Investigate Fathom-based page extraction Aug 10, 2018
@Osmose
Contributor Author

Osmose commented Aug 10, 2018

Updating this issue to reflect that this is an investigation and we haven't yet committed to using Fathom-based extraction.

@Osmose Osmose added this to the November MVP milestone Aug 10, 2018
biancadanforth added a commit that referenced this issue Aug 12, 2018
These rules and coefficients yield the following accuracy based on a training corpus of 50 product pages from our top 5 sites (Amazon, Ebay, Walmart, Best Buy and Home Depot):
* 100% for product 'image'
* 96% for product 'title'
* 94% for product 'price'
Product 'price' and 'title' features have proximity rules based on the highest scoring product 'image' element. For now, this is done by accessing the image fnode using an internal '_ruleset' object; @erikrose is working on better support for this use case in the very near future, so this implementation can be improved at that time.
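
A proximity rule of this kind can be approximated by comparing the centers of two bounding rectangles. The sketch below is not the implementation in this commit: `center`, `distance`, `isNearImage`, and the `maxDistance` default are hypothetical, and the rect objects mirror what `getBoundingClientRect()` returns.

```javascript
// Hypothetical proximity helpers; rects mirror getBoundingClientRect().
function center({left, top, width, height}) {
  return {x: left + width / 2, y: top + height / 2};
}

// Euclidean distance between the centers of two rects, in pixels.
function distance(rectA, rectB) {
  const a = center(rectA);
  const b = center(rectB);
  return Math.hypot(a.x - b.x, a.y - b.y);
}

// True when a candidate element sits close to the top-scoring image.
function isNearImage(candidateRect, imageRect, maxDistance = 400) {
  return distance(candidateRect, imageRect) <= maxDistance;
}
```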
@biancadanforth
Collaborator

biancadanforth commented Aug 12, 2018

Referencing PR #45 , here are the conclusions from my Fathom assessment:

Conclusions:

  • My ruleset has an average accuracy of 98.7% across a training set of 50 product pages, 10 each from our top five sites.
    • Note that the accuracy “in the wild” will likely be lower, so ideally these rules need to be further refined and trained across more pages to get us to 100%.
  • My recommendation is to use Fathom with a CSS selector fallback. We can switch to CSS selectors as primary if there are performance or other concerns that arise with Fathom as the add-on takes shape.
  • For this approach, we will need to build up a test set of product pages to test Fathom “in the wild” and an extensive list of CSS selectors for our top five sites.

Additional information:

Osmose and I met with erikrose last Friday and outlined our needs and aspirations for Fathom. There are two action items for Erik there that he said he could complete in a matter of weeks.

biancadanforth added a commit that referenced this issue Aug 20, 2018
1. The first script, 'ruleset_factory.js', exports a class to create a ruleset based on a set of coefficients; instances of this class are used in production (via 'fathom_extraction.js') and for Fathom training (via 'trainees.js').
2. The second script, 'trainees.js', is used exclusively for training using the FathomFox web extension and does not ship with the commerce web extension.

Additional changes and notes:
* I chose not to make use of the 'autobind' decorator in 'ruleset_factory.js', since it is also used in the training add-on, where devDeps like 'babel-core' and 'babel-plugin-transform-decorators-legacy' do not exist.
* I also turned off an eslint rule that requires class methods to use 'this', since some methods in RulesetFactory don't require it, and it would be tedious and confusing to call some methods on the class instance and others on the class itself.
* The new training script ('trainees.js') has three elements in the map it exports, one for each product feature ('image', 'title', 'price'). This allows us to select which feature to train from a dropdown menu on FathomFox's trainer page.
* Currently, for training, four files must be copied over into the 'fathom-trainees' add-on src directory:
  * config.js
  * fathom_default_coefficients.json
  * ruleset_factory.js
  * trainees.js (overwriting the existing file)
* In a separate commit, I will put all the Fathom extraction files into an 'extraction' (or similar) subfolder.
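
The idea of a factory class that turns coefficients into a ruleset can be illustrated as follows. This is a hypothetical sketch, not the contents of 'ruleset_factory.js': the method names, the coefficient key, and the plain-object "ruleset" are all stand-ins for real `fathom-web` rules.

```javascript
// Hypothetical sketch; the real class builds fathom-web rules.
class RulesetFactory {
  constructor(coefficients) {
    this.coefficients = coefficients;
  }

  // A scoring helper that does not need `this`, still called on the instance.
  hasDollarSign(text) {
    return text.includes('$');
  }

  // Returns a stand-in "ruleset" whose scores depend on the coefficients.
  makeRuleset() {
    return {
      scorePrice: text =>
        (this.hasDollarSign(text) ? this.coefficients.hasDollarSignCoeff : 1),
    };
  }
}

// The same class can serve production (fixed coefficients) and training
// (coefficients supplied by the trainer).
const productionRuleset = new RulesetFactory({hasDollarSignCoeff: 3}).makeRuleset();
```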
biancadanforth added a commit that referenced this issue Aug 24, 2018
biancadanforth added a commit that referenced this issue Aug 24, 2018
biancadanforth added a commit that referenced this issue Aug 24, 2018
This enables the same script to be used for training and running in the commerce webextension.

How to train a ruleset with Fathom:
1. Follow Fathom's [Trainer instructions](https://github.com/erikrose/fathom-fox#the-trainer).
2. Open the [Fathom Trainees](https://github.com/mozilla/fathom-trainees) add-on in a new profile.
3. Install FathomFox in that window from AMO.
4. Drag and drop the training corpus into that window.
  - Note: The training corpus consists of HTML files frozen using [FathomFox's DevTools panel](https://github.com/erikrose/fathom-fox#the-developer-tools-panel); our training corpus is on the shared "commerce" Google Drive.
  - Note: As of the date of this commit, the Corpus Collector is not a recommended option for building a training corpus due to a `freeze-dry` dependency bug that inserts a bunch of extra garbage when re-freezing a frozen page.
5. Copy ./src/fathom_ruleset.js into fathom-trainees/src/trainees.js and save over it.
6. Choose a feature to train, 'price', 'title' or 'image', and edit `trainees.set()` so that one of those features is the first argument.
7. Comment out the rules pertaining to all but that feature.
  - Currently, you can only train one ruleset at a time with Fathom, and only one `out` (e.g. 'title', 'image' or 'price') at a time for a given ruleset.
  - If you have multiple `out`s you'd like to train simultaneously, repeat this process for the remaining features so Fathom is running in a separate browser window for each feature and its corresponding rules.
8. Click the FathomFox browserAction and select "Train"
9. Select the feature from the dropdown list and click the "Train against the tabs in this window" button.
10. The array of coefficients displayed on the training page will update over time as Fathom optimizes them; this could take a while.
biancadanforth added a commit that referenced this issue Aug 24, 2018
These rules and coefficients yield the following accuracy based on a training corpus of 50 product pages from our top 5 sites (Amazon, Ebay, Walmart, Best Buy and Home Depot):
* 100% for product 'image'
* 96% for product 'title'
* 94% for product 'price'
Product 'price' and 'title' features have proximity rules based on the highest scoring product 'image' element. For now, this is done by accessing the image fnode using an internal '_ruleset' object; @erikrose is working on better support for this use case in the very near future, so this implementation can be improved at that time.
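
As an illustration of what a proximity rule can compute, here is a minimal sketch under stated assumptions, not the code in this commit. It scores a candidate element by the distance between its center and the center of the chosen image, using plain objects shaped like the result of `Element.getBoundingClientRect()`. The names `center` and `proximityScore` and the `falloffPx` constant are hypothetical.

```javascript
// Center point of a bounding box shaped like the result of
// Element.getBoundingClientRect(): {left, top, width, height}.
function center(rect) {
  return {x: rect.left + rect.width / 2, y: rect.top + rect.height / 2};
}

// Hypothetical proximity score: 1 when the two centers coincide, falling
// toward 0 as the distance grows. falloffPx is an assumed tuning constant:
// the distance in pixels at which the score drops to 0.5.
function proximityScore(candidateRect, imageRect, falloffPx = 500) {
  const a = center(candidateRect);
  const b = center(imageRect);
  const distance = Math.hypot(a.x - b.x, a.y - b.y);
  return falloffPx / (falloffPx + distance);
}

const imageRect = {left: 0, top: 0, width: 100, height: 100};
const nearTitle = {left: 100, top: 0, width: 100, height: 100};
const farFooter = {left: 0, top: 500, width: 100, height: 100};
console.log(proximityScore(nearTitle, imageRect) > proximityScore(farFooter, imageRect));
```

In a real rule, the rectangles would come from the candidate fnode's element and from the highest-scoring 'image' element.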
biancadanforth added a commit that referenced this issue Aug 24, 2018
1. The first script, 'ruleset_factory.js', exports a class to create a ruleset based on a set of coefficients; instances of this class are used in production (via 'fathom_extraction.js') and for Fathom training (via 'trainees.js').
2. The second script, 'trainees.js', is used exclusively for training using the FathomFox web extension and does not ship with the commerce web extension.

Additional changes and notes:
* I chose not to make use of the 'autobind' decorator in 'ruleset_factory.js', since it is also used in the training add-on, where devDeps like 'babel-core' and 'babel-plugin-transform-decorators-legacy' do not exist.
* I also turned off an eslint rule that requires class methods to use 'this', since some methods in RulesetFactory don't require it, and it would be tedious and confusing to call some methods on the class instance and others on the class itself.
* The new training script ('trainees.js') has three elements in the map it exports, one for each product feature ('image', 'title', 'price'). This allows us to select which feature to train from a dropdown menu on FathomFox's trainer page.
* Currently, for training, four files must be copied over into the 'fathom-trainees' add-on src directory:
  * config.js
  * fathom_default_coefficients.json
  * ruleset_factory.js
  * trainees.js (overwriting the existing file)
* In a separate commit, I will put all the Fathom extraction files into an 'extraction' (or similar) subfolder.
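
The three-entry map described above might look roughly like the following sketch. It is not the contents of 'trainees.js': the `RulesetFactory` stub, its `makeRuleset` method, the placeholder coefficients, and the `{coeffs, rulesetMaker}` entry shape are all assumptions made for illustration, so check the fathom-trainees README for the shape the trainer actually expects.

```javascript
// Stand-in for the class exported by ruleset_factory.js. The real class
// builds a Fathom ruleset; this stub only records the coefficients.
class RulesetFactory {
  constructor(coefficients) {
    this.coefficients = coefficients;
  }

  // Hypothetical method name.
  makeRuleset() {
    return {coefficients: this.coefficients};
  }
}

// Placeholder values; the real defaults live in fathom_default_coefficients.json.
const defaultCoefficients = {image: [1, 1], title: [1, 1], price: [1, 1]};

// One entry per product feature, so each can be picked from the trainer dropdown.
const trainees = new Map();
for (const feature of ['image', 'title', 'price']) {
  trainees.set(feature, {
    coeffs: defaultCoefficients[feature],
    rulesetMaker: coeffs => new RulesetFactory(coeffs).makeRuleset(),
  });
}

// In the actual trainees.js, this map would be the module's default export.
console.log([...trainees.keys()]);
```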
biancadanforth added a commit that referenced this issue Aug 24, 2018
Also updated the Code Organization section of the README to include the new 'extraction' subfolder.
biancadanforth added a commit that referenced this issue Aug 27, 2018
Also updated the Code Organization section of the README to include the new 'extraction' subfolder.
biancadanforth added a commit that referenced this issue Aug 27, 2018
This const was not actually needed by more than one file, which simplifies how 'trainees.js' and its imported scripts are used for training.
biancadanforth added a commit that referenced this issue Aug 27, 2018
Fix #36: Add initial Fathom rules with 98.7% average training accuracy