Investigate Fathom-based page extraction #36
Comments
@javaun What 5 shopping sites do we want to target for the first round? We can expand later if necessary.
Got Fathom working in the WebExtension based on [Swathi Iyer](https://github.com/swathiiyer2/fathom-products) and [Victor Ng’s](https://github.com/mozilla/fathom-webextension) prior work. Right now it just logs the highest-scoring nodes’ values for product title, image and price to the console from background.js.

While [`jsdom`](https://www.npmjs.com/package/jsdom) is a `fathom-web` dependency, it is used only for running `fathom-web` in the Node context for testing. To avoid build errors associated with `jsdom` and its dependencies, I added a `'null-loader'` for that `require` call, which mocks the module as an empty object. This loader is also used in webpack.config.test.js, from PR #32.

Issues:
* I get a notification from Firefox at times saying the extension is slowing down the browser.
* background.js is not always logging the results to the console with `npm run watch` enabled. I’m not yet sure why.
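For reference, the `null-loader` workaround described above could look roughly like the fragment below. This is a minimal sketch, not the project's actual config: the variable name `webpackConfig` and the path regex are assumptions, and the real rule in webpack.config.js may match `jsdom` differently.

```javascript
// Sketch of a webpack config fragment (assumed shape, not the project's file).
const webpackConfig = {
  module: {
    rules: [
      {
        // Match any file resolved from the jsdom package (POSIX or Windows paths).
        test: /node_modules[\\/]jsdom[\\/]/,
        // null-loader replaces each matched module with an empty module,
        // so jsdom and its Node-only dependencies never enter the bundle.
        use: 'null-loader',
      },
    ],
  },
};

// In a real webpack.config.js this object would be exported:
// module.exports = webpackConfig;
```

Because `jsdom` is only needed when `fathom-web` runs under Node, stubbing it out should not affect the bundle that runs in the browser.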
Last night, I was able to get Fathom running in our web extension, in large part thanks to Victor Ng's and Swathi Iyer's prior work. This currently runs Fathom in the content script for the tab, and on an Amazon product page I consistently get notifications from Firefox that the extension is slowing down the browser. @Osmose, would it be possible to run this script in a separate thread/process? In general, what might be some options to work through this? Currently I cannot get Fathom to run reliably without hanging.
This patch will run Fathom against the page (not distinguishing a product from a non-product page) and log the extracted price value and page URL to the console via 'background.js'. Failing that, it will fall back to extraction via CSS selectors if any exist for the site in 'product_extraction_data.json', and failing that, it will try extraction via Open Graph meta tags. This is heavily based on [Swathi Iyer](https://github.com/swathiiyer2/fathom-products) and [Victor Ng’s](https://github.com/mozilla/fathom-webextension) prior work.

Currently, there is only one ruleset with one naive rule for one product feature, price. This initial commit is intended to cover Fathom integration into the web extension. A later commit will add rules and take training data into account.

Note: The 'runRuleset' method in 'productInfo.js' returns 'NaN' if it doesn't find any elements for any of its rules.

Performance observations: Originally, I had dumped Swathi's three rulesets (one each for product title, image and price) and tried to run them against any page, similar to Victor Ng's web extension. However, that was [freezing up the tab](#36 (comment)). I then profiled the content script that runs Fathom before and after replacing Swathi's rulesets with a single ruleset containing only one rule for one attribute. With the single-rule ruleset, I did not see any warnings from Firefox, nor detect any significant performance hits in the DevTools profiler due to Fathom. It would therefore appear the performance hit was related to the complex rulesets and not Fathom itself.

Webpack observations: While [`jsdom`](https://www.npmjs.com/package/jsdom) is a `fathom-web` dependency, it is used only for running `fathom-web` in the Node context for testing. To avoid build errors associated with `jsdom` and its dependencies, I added a `'null-loader'` for that `require` call, which mocks the module as an empty object. This loader is also used in webpack.config.test.js, from PR #32.
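The fallback order described in this commit message (Fathom, then CSS selectors, then Open Graph) can be sketched as a chain of extractors. This is an illustration under stated assumptions, not the code in 'productInfo.js': `extractPrice` and the three stub extractors are hypothetical names, and the stubs read from a plain object instead of a real page.

```javascript
// Try each extractor in order and return the first usable value.
// null, undefined and NaN (what 'runRuleset' is said to return when no
// element matches) are all treated as "nothing found".
function extractPrice(page, extractors) {
  for (const extractor of extractors) {
    const value = extractor(page);
    if (value !== null && value !== undefined && !Number.isNaN(value)) {
      return value;
    }
  }
  return null;
}

// Hypothetical stand-ins for the three real extraction methods.
const fromFathom = (page) => ('fathomPrice' in page ? page.fathomPrice : NaN);
const fromCssSelectors = (page) => ('selectorPrice' in page ? page.selectorPrice : null);
const fromOpenGraph = (page) => ('ogPrice' in page ? page.ogPrice : null);

const extractors = [fromFathom, fromCssSelectors, fromOpenGraph];
```

Keeping the extractors in an ordered array makes it easy to reorder or drop a method later without touching the control flow.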
Top domains for EN-US:
My bad: for 5 let's use Homedepot.com instead of Macy's. Home Depot is slightly bigger and less likely to be roadkill like Macy's. Clothes are getting hammered. Hardware/home is much more defensible.
Dropping these here if you need more inspiration: Amazon, Walmart, Best Buy, Apple, Target, Etsy, Ikea, Gap, Macy's, Home Depot, Lowes, Staples, Office Depot, Zappos, B&H, Costco
This commit builds off of PR #38, so that PR should merge before this.

Open questions:
* How to test interdependent rules, such as 'isNearProductImage' for the product title and product price candidate elements?
* The only feature that is pulled out accurately on my test page (an Amazon product page) is the image. What rules can I add/modify to get title and price correct?

TODO:
* Add rule to remove ancestor elements that have the same 'innerText' value.
* Consider adding image rule to see if an image element is the largest image on the page (above the fold).
* Add price rule to see if innerText starts with '$'.
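Two of the TODO items above are simple enough to sketch as plain predicates. These are hypothetical helpers (`startsWithDollar` and `dropRedundantAncestors` are made-up names), written against the standard DOM `innerText` and `contains` members; how they would be wired into Fathom rules is left open.

```javascript
// True when the element's visible text begins with a dollar sign.
function startsWithDollar(element) {
  return element.innerText.trim().startsWith('$');
}

// Drop any candidate that contains another candidate with identical
// innerText, keeping only the innermost element for each repeated text.
function dropRedundantAncestors(candidates) {
  return candidates.filter((candidate) =>
    !candidates.some((other) =>
      other !== candidate &&
      candidate.contains(other) &&
      candidate.innerText === other.innerText));
}
```

Note that the dollar-sign check only covers prices formatted for US stores, which matches the EN-US scope of the first round.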
#36: Integrate Fathom-based page extraction with a simple ruleset.
Possible rules to consider for extracting product features (title, price, image) from a product page

Fathom scores elements in the DOM by how well they follow a given ruleset, with modifiers applied based on how good each rule seems to be at finding the correct element. Here is a list of some rule ideas collected from swathiiyer2, Osmose, erikrose and myself by feature, ranked in order of how well they seem to match a product page for our top 5 sites (listed here and here):

All three features:
For each feature below, these are additional, feature-specific considerations on top of the cross-feature considerations above: Product title:
Product price:
Product image:
These rules successfully pull out product title, price and image from the following product pages (one each from the 5 top sites): * [Amazon](https://www.amazon.com/KitchenAid-KL26M1XER-Professional-6-Qt-Bowl-Lift/dp/B01LYV1U30?smid=ATVPDKIKX0DER&pf_rd_p=0c7b792f-241a-4510-94f4-dd184a76f201&pf_rd_r=AZD7BGV3JZGTB23F30X3) * [Ebay](https://www.ebay.com/p/Best-Choice-Products-650W-6-speed-5-5QT-Kitchen-Food-Stand-Mixer-with-Stainless-Steels-Bowl-Black/3018375728?iid=253733404998) * [Walmart](https://www.walmart.com/ip/KitchenAid-Classic-Series-4-5-Quart-Tilt-Head-Stand-Mixer-Onyx-Black-K45SSOB/29474640) * [Best Buy](https://www.bestbuy.com/site/jbl-everest-elite-750nc-wireless-over-ear-noise-cancelling-headphones-gunmetal/5840136.p?skuId=5840136) * [Home Depot](https://www.homedepot.com/p/Husky-SAE-Combination-Wrench-Set-10-Piece-HCW10PCSAE/202934501) TODO: * Create a training set with FathomFox and run these rules against them to measure their accuracy for 50 product pages (10 from each top site). * Modify trimTitle method, so it doesn't cut off the color from the title for the product on Ebay. * Generalize formatPrice method. @Osmose, would you have any suggestions?
This commit adds 50 product pages with the correct product "title", "image" and "price" elements tagged using the [FathomFox](https://addons.mozilla.org/en-US/firefox/addon/fathomfox/) add-on for use as a training set with Fathom (e.g. with an attribute data-fathom="price"). There are 10 product pages each from our top 5 sites: Amazon, Ebay, Walmart, Best Buy and Home Depot, with one page per rough product category: Home, Food, Electronics, Books, Hardware, Clothing, Outdoors, Toys, Beauty and Crafts. There are at least a few pages that include a range of prices, a sale price, or some other common variation. #### Page URLs: Amazon: * https://www.amazon.com/KitchenAid-KL26M1XER-Professional-6-Qt-Bowl-Lift/dp/B01LYV1U30?smid=ATVPDKIKX0DER&pf_rd_p=0c7b792f-241a-4510-94f4-dd184a76f201&pf_rd_r=AZD7BGV3JZGTB23F30X3 * https://www.amazon.com/dp/B077YG22Y9/ref=sspa_dk_detail_3?psc=1&pd_rd_i=B077YG22Y9&pd_rd_wg=mfQm0&pd_rd_r=K27A84Z9Y01R47H7BJWR&pd_rd_w=km01K * https://www.amazon.com/Lindt-Assorted-Chocolate-Gourmet-Truffles/dp/B004XDMS3C/ref=sr_1_6_s_it?s=grocery&ie=UTF8&qid=1533437720&sr=1-6&keywords=chocolates&th=1 * https://www.amazon.com/MAXPOWER-8-piece-Metric-Combination-Wrench/dp/B074R5T7TB/ref=sr_1_17?s=power-hand-tools&ie=UTF8&qid=1533145143&sr=1-17&keywords=wrench * https://www.amazon.com/Signature-Levi-Strauss-Gold-Label/dp/B073V3TQHL/ref=sr_1_6?ie=UTF8&qid=1533601875&sr=8-6&keywords=pants * https://www.amazon.com/gp/product/B00E257T6C/ref=s9_acsd_ps_bw_c_x_2_w?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=merchandised-search-3&pf_rd_r=3ZSXZBJ3FGDJ421XVQMT&pf_rd_r=3ZSXZBJ3FGDJ421XVQMT&pf_rd_t=101&pf_rd_p=fe185ec9-c8f5-44c0-897e-4c0bde93268c&pf_rd_p=fe185ec9-c8f5-44c0-897e-4c0bde93268c&pf_rd_i=283155 * https://www.amazon.com/Rene-Furterer-Naturia-Balancing-Shampoo/dp/B008K3PEAK/ref=sr_1_1_a_it?ie=UTF8&qid=1533434644&sr=8-1 * 
https://www.amazon.com/gp/product/B001NCDE8O/ref=s9_acsd_al_bw_c_x_3_w?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=merchandised-search-9&pf_rd_r=RDKRBA4WA8KS1S18N086&pf_rd_r=RDKRBA4WA8KS1S18N086&pf_rd_t=101&pf_rd_p=9ce7601e-e994-4475-8a00-833f2e408af7&pf_rd_p=9ce7601e-e994-4475-8a00-833f2e408af7&pf_rd_i=3400371 * https://www.amazon.com/gp/product/B077T6V74Z/ref=s9_acsd_topr_hd_bw_bEKkv_c_x_2_w?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=merchandised-search-4&pf_rd_r=HRV980DM7J7PFN02MGN6&pf_rd_t=101&pf_rd_p=e88c96d5-222c-5d22-af37-9401d11728f4&pf_rd_i=3416381 * https://www.amazon.com/Leather-Handle-PERSONALIZATION-Vegetable-Cotton/dp/B0718XQXHL/ref=lp_15735446011_1_1?s=handmade&ie=UTF8&qid=1533434852&sr=1-1 Ebay: * https://www.ebay.com/p/Best-Choice-Products-650W-6-speed-5-5QT-Kitchen-Food-Stand-Mixer-with-Stainless-Steels-Bowl-Black/3018375728?iid=253733404998 * https://www.ebay.com/itm/Philadelphia-Candies-Milk-Chocolate-Covered-Assorted-Nuts-1-Pound-Gift-Box/263103368447?epid=1301555428&hash=item3d422eccff * https://www.ebay.com/itm/Dell-Inspiron-7570-15-6-Touch-Laptop-i7-8550U-1-8GHz-8GB-1TB-NVIDIA-940MX-W10/263827294291 * https://www.ebay.com/itm/Craftsman-32-piece-Inch-and-Metric-Combination-Wrench-Set-Free-Shipping-NEW/112762904805?epid=1951893956&hash=item1a413160e5%3Ag%3AML4AAOSwxOFaYPfu&_sacat=0&_nkw=wrenches&_from=R40&rt=nc&_trksid=p2380057.m570.l1313.TR12.TRC2.A0.H0.Xwrenches.TRS0 * https://www.ebay.com/itm/NEW-Calvin-Klein-Jeans-Mens-5-Pocket-Herringbone-Straight-Pant-Variety-NWT/172602115751?hash=item282fe346a7%3Am%3AmzGTxRA3ng5fU6LdZwdQssg%3Asc%3AUSPSPriority%2194041%21US%21-1&var=472089425020&_pgn=8&_sacat=0&_nkw=pants&_from=R40&rt=nc * https://www.ebay.com/itm/The-Eye-of-the-World-The-Wheel-of-Time-Book-1-by-Jordan-Robert/142480282854?epid=907028&hash=item212c7c94e6%3Ag%3A5AsAAOSwn3Vap4yC&_sacat=0&_nkw=book&_from=R40&rt=nc&_trksid=p2380057.m570.l1313.TR12.TRC2.A0.H0.Xbook.TRS0 * 
https://www.ebay.com/itm/Beard-Hair-Growther-Facial-Mustache-Growth-Fast-Grow-Rich-Texture-Natural-Oil/222292486406?epid=691256086&hash=item33c1aa2906%3Ag%3AcasAAOSwA3dYDofw&_sacat=0&_nkw=beard+oil&_from=R40&rt=nc&_trksid=p2047675.m570.l1313.TR12.TRC2.A0.H0.Xbeard+oil.TRS0 * https://www.ebay.com/itm/Portable-SAND-TENT-CAMPING-FISHING-BEACH-SHELTER-SUN-SHADE-OUTDOOR-CANOPY-NEW/252060937931?hash=item3ab000aecb%3Am%3AmINYPcBvUlI0fUVlLnP1HNQ&var=552341391997&_sacat=0&_nkw=camping&_from=R40&rt=nc&LH_TitleDesc=0 * https://www.ebay.com/itm/Magic-Track-Car-Toys-With-Flashing-Lights-Educational-Toys-For-Children-Boys-US/162809813967?hash=item25e8389bcf%3Am%3Am_1VuBe8hB83kv8pXINSySg&var=461850886879&_sacat=0&_nkw=toy&_from=R40&rt=nc&_trksid=p2047675.m570.l1313.TR12.TRC2.A0.H0.Xtoy.TRS0 * https://www.ebay.com/itm/99-9-Pure-Copper-Cu-Metal-Sheet-Foil-0-1x100x100MM-For-Handicraft-Aerospace/302042932136?epid=506287267&hash=item46532963a8%3Ag%3A6lQAAOSwxp9W7QXM&_pgn=2&_sacat=0&_nkw=handicraft&_from=R40&rt=nc Walmart: * https://www.walmart.com/ip/KitchenAid-Classic-Series-4-5-Quart-Tilt-Head-Stand-Mixer-Onyx-Black-K45SSOB/29474640 * https://www.walmart.com/ip/HP-15-bs212wm-15-6-Laptop-Windows-10-Intel-Celeron-N4000-Processor-4GB-Memory-500GB-Hard-Drive-DVD-Jet-Black/139270579 * https://www.walmart.com/ip/Reese-s-Kit-Kat-Chocolate-Candy-Miniatures-Assortment-40-Oz/10449915 * https://www.walmart.com/ip/Hyper-Tough-18-Piece-Combination-Wrench-Set/49701823 * https://www.walmart.com/ip/Athletic-Works-Women-s-Essential-Athleisure-Knit-Pant-Available-in-Regular-and-Petite/597959224 * https://www.walmart.com/ip/507-Mechanical-Movements-Mechanisms-and-Devices/53282684 * https://www.walmart.com/ip/Maracuja-Shea-Oils-Beard-Conditioning-Oil/177023383 * https://www.walmart.com/ip/GigaTent-Liberty-Trail-2-Dome-Tent-7-x-7-Sleeps-3/24547530 * https://www.walmart.com/ip/LEGO-Jurassic-World-Jurassic-Park-Velociraptor-Chase-75932/873652909 * 
https://www.walmart.com/ip/Sargent-Art-Tempera-Artist-Starter-Set/48002550 Best Buy: * https://www.bestbuy.com/site/kitchenaid-artisan-tilt-head-stand-mixer-ocean-drive/6133651.p?skuId=6133651 * https://www.bestbuy.com/site/swiss-miss-milk-chocolate-hot-cocoa-k-cup-pods-16-pack/4700838.p?skuId=4700838 * https://www.bestbuy.com/site/jbl-everest-elite-750nc-wireless-over-ear-noise-cancelling-headphones-gunmetal/5840136.p?skuId=5840136 * https://www.bestbuy.com/site/hangman-handy-hammer-kit-silver/5609603.p?skuId=5609603 * https://www.bestbuy.com/site/bioworld-call-of-duty-retribution-t-shirt-extra-large-blue/5580896.p?skuId=5580896 * https://www.bestbuy.com/site/amazon-kindle-oasis-e-reader-7-high-resolution-display-300-ppi-waterproof-built-in-audible-32gb-wi-fi-black-and-silver/6102700.p?skuId=6102700 * https://www.bestbuy.com/site/philips-norelco-7200-beard-trimmer-silver/4820905.p?skuId=4820905 * https://www.bestbuy.com/site/char-broil-tabletop-grill-black/7141853.p?skuId=7141853 * https://www.bestbuy.com/site/mattel-pokemon-gyarados-building-set-multi/5888525.p?skuId=5888525 * https://www.bestbuy.com/site/osmo-creative-kit-2017-multi/5452007.p?skuId=5452007 Home Depot: * https://www.homedepot.com/p/KitchenAid-Classic-4-5-Qt-Tilt-Head-White-Stand-Mixer-K45SSWH/202546032 * https://www.homedepot.com/p/Flossugar-1-2-Gal-Chocolate-93216CT/302429246 * https://www.homedepot.com/p/Commercial-Electric-Around-the-Neck-Premium-Bluetooth-Stereo-Earbuds-HD0855/300372796 * https://www.homedepot.com/p/Husky-SAE-Combination-Wrench-Set-10-Piece-HCW10PCSAE/202934501 * https://www.homedepot.com/p/Dickies-Men-s-Large-White-Pocket-T-Shirt-GS407WH-LG/203507188 * https://www.homedepot.com/p/The-Home-Depot-Tiling-2nd-Edition-0696228580/100491813 * https://www.homedepot.com/p/Dermatone-4-oz-50-SPF-Sunscreen-Tube-503172734/304343553 * https://www.homedepot.com/p/Wakeman-4-Person-Dome-Tent-M470026/302029202 * 
https://www.homedepot.com/p/Step2-Deluxe-Canyon-Road-Train-and-Track-Table-754700/100483237 * https://www.homedepot.com/p/ABOLOS-Handicraft-II-Brown-Orange-12-in-x-12-in-Glass-Mosaic-Tile-HMDHDCSQ-SF/303320516
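Given pages tagged this way, accuracy for a feature is the share of pages where the chosen element carries the matching `data-fathom` value. The sketch below assumes a hypothetical `pickElement(page, feature)` callback that returns whatever element the ruleset selected (or `null`); FathomFox's trainer computes its own accuracy figure, so this only illustrates the idea.

```javascript
// Fraction of pages where the picked element is tagged data-fathom="<feature>".
function trainingAccuracy(pages, feature, pickElement) {
  if (pages.length === 0) {
    return 0;
  }
  let correct = 0;
  for (const page of pages) {
    const element = pickElement(page, feature);
    if (element && element.getAttribute('data-fathom') === feature) {
      correct += 1;
    }
  }
  return correct / pages.length;
}
```

A page where nothing was picked counts as a miss, which keeps the figure honest when a rule fails to match at all.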
Product title rules previously pulled the unique 'title' element from the 'head' element on the page (part of the page's metadata). While this ostensibly requires less processing (we don't have to search the DOM or score any other elements), the title string often requires site-specific cleaning, such as removing the vendor name, and the final, cleaned-up string cannot be verified as accurate by Fathom, which only tells us if our rules picked the right element. The alternative approach, implemented here, is to pull the title from the corresponding element in the content of the page. Since Fathom can verify that the right element was selected, and the string from this element would not require any cleaning, this approach is a much better proxy for extracting the correct product title.
This enables the same script to be used for training and running in the commerce webextension. How to train a ruleset with Fathom: 1. Follow Fathom's [instructions](https://github.com/erikrose/fathom-fox). 2. Open the [Fathom Trainees](https://github.com/mozilla/fathom-trainees) add-on in a new profile. 3. Install FathomFox in that window from AMO. 4. Drag and drop the training corpus (HTML files in ./training-set) into that window. 5. Copy ./src/fathom_ruleset.js into fathom-trainees/src/trainees.js and save over it. 6. Choose a feature to train, 'price', 'title' or 'image', and edit `trainees.set()` so that one of those features is the first argument. 7. Comment out the rules pertaining to all but that feature. 8. Click the FathomFox browserAction and select "Train" 9. Select the feature from the dropdown list and click the "Train against the tabs in this window" button. 10. You will see the accuracy based on the initial coefficients passed in, and Fathom will start generating optimized coefficients. This could take a while. 11. When Fathom is done, those coefficients will be logged to the Fathom page.
This enables the same script to be used for training and running in the commerce webextension.

How to train a ruleset with Fathom:
1. Follow Fathom's [Trainer instructions](https://github.com/erikrose/fathom-fox#the-trainer).
2. Open the [Fathom Trainees](https://github.com/mozilla/fathom-trainees) add-on in a new profile.
3. Install FathomFox in that window from AMO.
4. Drag and drop the training corpus into that window.
    - Note: The training corpus consists of HTML files frozen using [FathomFox's DevTools panel](https://github.com/erikrose/fathom-fox#the-developer-tools-panel); our training corpus is on the shared "commerce" Google drive.
    - Note: As of the date of this commit, the Corpus Collector is not a recommended option for building a training corpus due to a `freeze-dry` dependency bug that inserts a bunch of extra garbage when re-freezing a frozen page.
5. Copy ./src/fathom_ruleset.js into fathom-trainees/src/trainees.js and save over it.
6. Choose a feature to train, 'price', 'title' or 'image', and edit `trainees.set()` so that one of those features is the first argument.
7. Comment out the rules pertaining to all but that feature.
    - Currently, you can only train one ruleset at a time with Fathom, and only one `out` (e.g. 'title', 'image' or 'product') at a time for a given ruleset.
    - If you have multiple `out`s you'd like to train simultaneously, repeat this process for the remaining features so Fathom is running in a separate browser window for each feature and its corresponding rules.
8. Click the FathomFox browserAction and select "Train".
9. Select the feature from the dropdown list and click the "Train against the tabs in this window" button.
10. The array of coefficients displayed on the training page will update over time as Fathom optimizes them; this could take a while.
Updating this issue to reflect that this is an investigation and we haven't yet committed to using Fathom-based extraction.
These rules and coefficients yield the following accuracy based on a training corpus of 50 product pages from our top 5 sites (Amazon, Ebay, Walmart, Best Buy and Home Depot): * 100% for product 'image' * 96% for product 'title' * 94% for product 'price' Product 'price' and 'title' features have proximity rules based on the highest scoring product 'image' element. For now, this is done by accessing the image fnode using an internal '_ruleset' object; @erikrose is working on better support for this use case in the very near future, so this implementation can be improved at that time.
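To make the proximity idea concrete, here is one way such a rule could score a candidate against the top-scoring image. It is a sketch only: the function names, the center-to-center distance measure and the 500px default threshold are all assumptions, and it presumes scores are combined by multiplication so that returning 1 is neutral. The rects are plain `{left, top, width, height}` objects of the kind `getBoundingClientRect()` returns.

```javascript
// Center point of a DOMRect-like object.
function center(rect) {
  return {x: rect.left + rect.width / 2, y: rect.top + rect.height / 2};
}

// Euclidean distance in pixels between the centers of two rects.
function distanceBetween(rectA, rectB) {
  const a = center(rectA);
  const b = center(rectB);
  return Math.hypot(a.x - b.x, a.y - b.y);
}

// Score multiplier: `coefficient` when the candidate sits within
// `maxDistance` pixels of the image, otherwise 1 (no effect).
function proximityScore(candidateRect, imageRect, coefficient, maxDistance = 500) {
  return distanceBetween(candidateRect, imageRect) <= maxDistance ? coefficient : 1;
}
```

The coefficient would be one of the values the trainer optimizes; the threshold could be trained the same way or replaced by a score that decays smoothly with distance.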
Referencing PR #45 , here are the conclusions from my Fathom assessment: Conclusions:
Additional information: Osmose and I met with erikrose last Friday and outlined our needs and aspirations for Fathom. There are two action items for Erik there that he said he could complete in a matter of weeks.
A follow-up commit will address [this comment](https://github.com/mozilla/webext-commerce/pull/45/files#r210363008) and [this comment](https://github.com/mozilla/webext-commerce/pull/45/files#r210361004).
1. The first script, 'ruleset_factory.js', exports a class to create a ruleset based on a set of coefficients; instances of this class are used in production (via 'fathom_extraction.js') and for Fathom training (via 'trainees.js').
2. The second script, 'trainees.js', is used exclusively for training using the FathomFox web extension and does not ship with the commerce web extension.

Additional changes and notes:
* I chose not to make use of the 'autobind' decorator in 'ruleset_factory.js', since it is also used in the training add-on, where devDeps like 'babel-core' and 'babel-plugin-transform-decorators-legacy' do not exist.
* I also turned off an eslint rule that requires class methods to use 'this', since some methods in RulesetFactory don't require it, and it would be tedious and confusing to call some methods on the class instance and others on the class itself.
* The new training script ('trainees.js') has three elements in the map it exports, one for each product feature ('image', 'title', 'price'). This allows us to select which feature to train from a dropdown menu on FathomFox's trainer page.
* Currently, for training, four files must be copied over into the 'fathom-trainees' add-on src directory:
    * config.js
    * fathom_default_coefficients.json
    * ruleset_factory.js
    * trainees.js (overwriting the existing file)
* In a separate commit, I will put all the Fathom extraction files into an 'extraction' (or similar) subfolder.
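The split between the two scripts could look roughly like the sketch below. Everything here is illustrative: the `RulesetFactory` body is a stub (the real class builds a Fathom ruleset), the coefficient values and their per-feature layout are made up, and the `{coeffs, rulesetMaker}` entry shape reflects my understanding of what fathom-trainees expected at the time, so check that repo before relying on it.

```javascript
// Stub for the class exported by ruleset_factory.js (assumed interface).
class RulesetFactory {
  constructor(coefficients) {
    this.coefficients = coefficients;
  }

  // The real method would return a Fathom ruleset; this returns a placeholder.
  makeRuleset() {
    return {coefficients: this.coefficients};
  }
}

// Hypothetical defaults; the real ones live in fathom_default_coefficients.json.
const defaultCoefficients = {image: [1, 1], title: [1, 1], price: [1, 1]};

// One map entry per product feature, so each can be picked from the
// dropdown on FathomFox's trainer page.
const trainees = new Map();
for (const feature of ['image', 'title', 'price']) {
  trainees.set(feature, {
    coeffs: defaultCoefficients[feature],
    rulesetMaker: (coeffs) => new RulesetFactory(coeffs).makeRuleset(),
  });
}
```

Because production and training both construct the ruleset through the same class, coefficients found by the trainer can be dropped into the defaults file without touching the rules themselves.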
Also updated the Code Organization section of the README to include the new 'extraction' subfolder.
This const was not actually needed by more than one file, which simplifies how 'trainees.js' and its imported scripts are used for training.
Fix #36: Add initial Fathom rules with 98.7% average training accuracy
As per the latest mockups, we probably want to extract the following info about a product:
Amazon
In lieu of a better way of verifying whether the Fathom ruleset, let's use this: