#36: Integrate Fathom-based page extraction with a simple ruleset. #38

biancadanforth · 2018-08-01T00:04:10Z

@Osmose , ready for your review.

Follow-up PR:
Add a ruleset (and coefficients) to extract product information as described in #36 with the target accuracy based on a training set.

Possible follow-up issue:
Currently, if Fathom finds no price elements and extractData is executed on a non-product page belonging to a domain we have CSS selectors for in product_extraction_data.json (e.g. amazon.com), it will try to find elements based on the selectors, find nothing, and will throw an error. I.e. in this situation, extractData will return nothing (not even null) to getProductInfo. Is that the behavior that we want? This smells like a potential follow-up issue.
One possible fix (and something we will likely end up doing anyway for performance considerations) is to create a ruleset for identifying a product page, so the extraction code only executes on those types of pages.

Osmose

Thanks for the PR!

Along with the feedback you'll want to rebase, since the sidebar removal landed.

Osmose · 2018-08-01T17:47:40Z

src/fathom_ruleset.js

+ * License, v. 2.0. If a copy of the MPL was not distributed with this
+ * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
+
+/*


Double asterisk for the doc comment.

This doc comment is an excellent summary of the file!

Osmose · 2018-08-01T19:07:02Z

src/fathom_ruleset.js

+ */
+function formatPrice(priceString) {
+  const formattedPriceStr = priceString.replace('$', '').replace(/([\s]|[^0-9$.-])/g, '');
+  return parseFloat(formattedPriceStr.substr(formattedPriceStr.indexOf('$') + 1));


Floats are unsuitable for representing price because they cannot accurately represent decimal numbers (that article is a bit wordy, you don't necessarily need to read all of it).

We're not actually doing anything with the prices yet anyway; for now just returning the string matched by Fathom is plenty.

I just removed this function entirely for now.

Osmose · 2018-08-01T19:12:26Z

src/product_info.js

+    from: 'content',
+    subject: 'ready',
+    data: {
+      price: runRuleset('price', tuningRoutines.price.coeffs) || extractData().price,


It may behoove us to define an API for price extraction that's common between the two different methods, and use that here instead of hardcoding them. Something like:

const extractedProduct = fathomExtraction.extractProduct(window.document); // extractedProduct == {price: '$4.95'}

That would make coding a fallback a bit easier if we want it, and also let us configure different extraction method later on without modifying the rest of the code.

Osmose · 2018-08-01T19:18:41Z

src/product_info.js

+  let gotText;
+  if (feature === 'price') {
+    // strip whitespace, dollar sign, words, and trailing zeros when comparing price
+    gotText = tuningRoutine(...coeffs)(window.document).map(fnode => fnode.element)[0];


This line is really difficult to follow. You have to follow the execution through tuningRoutine returning a function that takes a document object and returns a list of fnodes (which are a fathom-specific thing), which we then extract dom nodes from, and then grab just the first one (can more than one even be returned?).

I'm not a fan of the design of this API. there's no reason for runRuleset to take the coefficients as an argument; in this WebExtension we're never going to call that function with different coefficients. Also the coefficients are being imported from the same file, why couldn't the tuning routine just use them directly?

I think this file should be loosely coupled with the details of extraction. This file is more concerned with triggering the extraction and then sending that data back to the background script.

I updated the exported method from fathom_ruleset.js so that it simply returns the highest scoring element for the given feature in the given doc.

and then grab just the first one (can more than one even be returned?

Yes. Fathom is returning all elements with the highest score. Because our ruleset (with it's one naive rule) is really bad right now, and on some pages, there is no div element with a class of "price", it is returning all div elements in that doc.

If the list has multiple elements, then I imagine we are not confident in Fathom's results, and should use our fallback method, where applicable? That's what I have now in the updated PR. What would you recommend @Osmose ?

There should be a way to get the scores of the dom elements returned. We could have a minimum threshold below which we consider the matches to not be worth returning; for our simple ruleset this can just be whatever the score of a matching element would be.

Depending on the rules, maybe we also want to rule out ancestors? I could see a parent having the same innerText as a child and getting a similar score depending on the ruleset.

I suspect if we have a threshold, return the maximum-ranking results, and still find more than one matching dom element, it's prudent to just choose the first matching one.

Okay thanks for these thoughts.

I added a threshold based on the score for the element, and return the first element in the list if more than one exceed it.

Re: this comment:

Depending on the rules, maybe we also want to rule out ancestors? I could see a parent having the same innerText as a child and getting a similar score depending on the ruleset.

I agree that's a good consideration to include, but I am thinking that's best for the follow-up PR where I add a more detailed ruleset, since there is currently no innerText rule in this PR's simple ruleset.

Osmose · 2018-08-01T19:26:59Z

src/fathom_ruleset.js

+
+import {dom, out, rule, ruleset, score, type} from 'fathom-web';
+
+const tuningRoutines = {


This assumes that we have a different ruleset per attribute; I thought we were going to use a single ruleset with multiple types?

Osmose · 2018-08-01T19:30:29Z

src/fathom_ruleset.js

+import {dom, out, rule, ruleset, score, type} from 'fathom-web';
+
+const tuningRoutines = {
+  price: {routine: tunedPriceFnodes, coeffs: [2]},


I'm weary about coefficients being stored as an ordered list instead of, say, a named mapping. The reason Swathi did this in her code is because by default the simulated annealer deals with lists of coefficients, but when we add in annealing scripts to the codebase we can use Maps to get around that while still storing coefficients as named values.

What do you think about loading the coefficients from a JSON file?

{ "hasDivWithPriceClass": 1 }

How would Maps be better than just importing the JSON object and getting the corresponding coefficient such as coeff.hasDivWithPriceClass?

You're right that we should just import the JSON class here. What I meant was that the annealing API deals with lists of coefficients. Because key ordering in JS objects isn't guaranteed, you have to do some work to make sure the coefficients you're optimizing and put in and taken out of the lists that the optimization works with in the same order.

Maps could theoretically help since their keys are ordered by insertion, but @erikrose mentioned that Maps might be a perf hit during the annealing process depending on how we use them.

In any case, this is a moot point until we get around to adding in an optimization script to the repo, so we can just ignore Maps for now.

FWIW, I don't think you'll need to add any sort of optimization script to your repo; that's what the FathomFox Trainer is for. (The randomTransition() function of the optimizer will probably have to become pluggable or at least selectable, however, depending on what sort of math you're doing with your coefficients. Right now, it can reach integral values, even ones <0, which suits 0..1 bounded scores coming out of each scoring callback.)

Osmose · 2018-08-01T19:33:43Z

src/fathom_ruleset.js

+ * Ruleset for product prices
+ */
+function tunedPriceFnodes(coeffHasDivWithPriceClass = 1) {
+  function hasDivWithPriceClass(fnode) {


Defining these inside another function make them impossible to test. When we get into using regexes and other more complex rules I suspect unit tests of rules may have some value. Can we define these at a global level somehow?

Osmose · 2018-08-01T19:34:56Z

src/fathom_ruleset.js

+  return tuningRoutine;
+}
+
+export {tuningRoutines, formatPrice};


We should be able to just put an export keyword next to each of these instead of having an export line at the bottom.

Also, actually, I wasn't aware this syntax worked; I would have expected it to complain that we had an unnamed, non-default export. Interesting.

Osmose · 2018-08-01T19:36:21Z

src/product_info.js

+    subject: 'ready',
+    data: {
+      price: runRuleset('price', tuningRoutines.price.coeffs) || extractData().price,
+      url: window.document.URL,


Huh, are URL objects serializable such that they can be sent in a postMessage? TIL

This patch will run Fathom against the page (not distinguishing a product from a non-product page) and log the extracted price value and page URL to the console via 'background.js'. Failing that, it will fall back to extraction via CSS selectors if any exist for the site in 'product_extraction_data.json', and failing that, it will try extraction via Open Graph meta tags. This is heavily based on [Swathi Iyer](https://github.com/swathiiyer2/fathom-products) and [Victor Ng’s](https://github.com/mozilla/fathom-webextension) prior work. Currently, there is only one ruleset with one naive rule for one product feature, price. This initial commit is intended cover Fathom integration into the web extension. A later commit will add rules and take training data into account. Note: The 'runRuleset' method in 'productInfo.js' returns 'NaN' if it doesn't find any elements for any of its rules. Performance observations: Originally, I had dumped Swathi's three rulesets (one each for product title, image and price) and tried to run them against any page, similar to Victor Ng's web extension. However, that was [freezing up the tab](#36 (comment)), and after profiling the content script Fathom was running in before and after replacing Swathi's rulesets with a single ruleset with only one rule for one attribute, I did not see any warnings from Firefox, nor detect any significant performance hits in the DevTools profiler due to Fathom. It would therefore appear the performance hit was related to the complex rulesets and not Fathom itself. Webpack observations: While [`jsdom`](https://www.npmjs.com/package/jsdom) is a `fathom-web` dependency, it is used only for running `fathom-web` in the Node context for testing. To avoid build errors associated with `jsdom` and its dependencies, I added a `’null-loader’` for that `require` call, which mocks the module as an empty object. This loader is also used in webpack.config.test.js, from PR #32.

biancadanforth

Some replies and questions for you for my second commit.

biancadanforth · 2018-08-01T23:02:17Z

src/fathom_ruleset.js

+import {dom, out, rule, ruleset, score, type} from 'fathom-web';
+
+const tuningRoutines = {
+  price: {routine: tunedPriceFnodes, coeffs: [2]},


How would Maps be better than just importing the JSON object and getting the corresponding coefficient such as coeff.hasDivWithPriceClass?

biancadanforth · 2018-08-01T23:45:57Z

src/fathom_ruleset.js

+ */
+function formatPrice(priceString) {
+  const formattedPriceStr = priceString.replace('$', '').replace(/([\s]|[^0-9$.-])/g, '');
+  return parseFloat(formattedPriceStr.substr(formattedPriceStr.indexOf('$') + 1));


I just removed this function entirely for now.

biancadanforth · 2018-08-02T00:25:15Z

src/product_info.js

+  let gotText;
+  if (feature === 'price') {
+    // strip whitespace, dollar sign, words, and trailing zeros when comparing price
+    gotText = tuningRoutine(...coeffs)(window.document).map(fnode => fnode.element)[0];


I updated the exported method from fathom_ruleset.js so that it simply returns the highest scoring element for the given feature in the given doc.

biancadanforth · 2018-08-02T00:30:50Z

src/product_info.js

+  let gotText;
+  if (feature === 'price') {
+    // strip whitespace, dollar sign, words, and trailing zeros when comparing price
+    gotText = tuningRoutine(...coeffs)(window.document).map(fnode => fnode.element)[0];


and then grab just the first one (can more than one even be returned?

Yes. Fathom is returning all elements with the highest score. Because our ruleset (with it's one naive rule) is really bad right now, and on some pages, there is no div element with a class of "price", it is returning all div elements in that doc.

If the list has multiple elements, then I imagine we are not confident in Fathom's results, and should use our fallback method, where applicable? That's what I have now in the updated PR. What would you recommend @Osmose ?

biancadanforth · 2018-08-02T00:52:50Z

@Osmose , Ready for another look. The biggest open question is what to do when Fathom returns a multi-item list of elements, all having the highest score for a given product feature; see my Comment Review for more details.

Osmose

Thanks for this changes, this is getting pretty close. Nice work!

Osmose · 2018-08-02T17:15:37Z

src/fathom_ruleset.js

+ * Extracts the highest scoring element for a given feature contained
+ * in a page's HTML document.
+ */
+export default function runTuningRoutine(doc) {


The ruleset isn't really a "tuning routine". Maybe rename this runRuleset?

Yeah I'm not really sure what "tuningRoutine" means in this context, but we certainly aren't tuning anything here. This was a relic of Swathi's/Victor's code.

Osmose · 2018-08-02T17:20:06Z

src/fathom_ruleset.js

+);
+
+/**
+ * Extracts the highest scoring element for a given feature contained


The "given feature" bit of this comment doesn't really apply anymore. There's no feature being passed in anymore, it just returns all the features we currently know how to extract.

Osmose · 2018-08-02T17:21:35Z

src/product_info.js

-  for (const [vendor, attributeInfo] of Object.entries(extractionData)) {
-    if (hostname.includes(vendor)) {
-      return attributeInfo;
+const fallbackExtraction = {


Can we define each of these extraction objects in separate files and import them? That'd help focus this module a bit more.

For the Fathom one, you can probably just define that in the existing fathom file and export it instead of the "return DOM nodes" function.

Osmose · 2018-08-02T17:22:53Z

src/product_info.js

-      const metaEle = document.querySelector(`meta[property='${value}']`);
-      if (metaEle) {
-        data[key] = metaEle.getAttribute('content');
+    data.url = window.document.URL;


This extraction doesn't change per-extraction-method, so we can just add it in in getProductInfo instead of duplicating it.

Since it's possible for multiple elements to be returned by Fathom for a given ruleset applied to a page, I applied a threshold (currently just set as the score for the only rule we have, 'hasDivWithPriceClass') to help provide an additional measure of confidence. If more than one element meet or exceed the threshold, the first element in the list is returned.

biancadanforth · 2018-08-02T18:23:58Z

Okay @Osmose , added the score threshold; ready for another pass from you.

biancadanforth · 2018-08-02T19:54:15Z

Sorry for the confusion @Osmose ; this should be ready to go now.

The last two commits should incorporate all of your feedback from comments last night and the second review this morning.

This commit builds off of PR #38, so that PR should merge before this. Open questions: * How to test interdependent rules, such as 'isNearProductImage' for the product title and product price candidate elements? * The only feature that is pulled out accurately on my test page (an Amazon product page) is the image. What rules can I add/modify to get title and price correct? TODO: * Add rule to remove ancestor elements who have the same 'innerText' value. * Consider adding image rule to see if an image element is the largest image on the page (above the fold). * Add price rule to see if innerText starts with '$'.

Osmose

LGTM, nice work!

Osmose · 2018-08-03T17:56:27Z

src/fathom_extraction.js

+const SCORE_THRESHOLD = fathomCoeffs.hasDivWithPriceClass;
+
+/**
+ * Checks to see if an element is a <div> with a class of "price".


This description doesn't need to be quite so long. It's pretty clear that rules return coefficients in the context of Fathom, and this doesn't only score <div>s, it scores any DOM element. A one-liner is fine: Scores fnodes with a "price" class.

Probably also want to update the name of this to hasPriceClass or similar.

biancadanforth requested a review from Osmose August 1, 2018 00:44

Osmose suggested changes Aug 1, 2018

View reviewed changes

biancadanforth added 2 commits August 1, 2018 15:14

Incorporate Osmose's feedback and rebase.

d364866

biancadanforth force-pushed the 36-fathom branch from 2d56603 to d364866 Compare August 2, 2018 00:49

biancadanforth commented Aug 2, 2018

View reviewed changes

Osmose suggested changes Aug 2, 2018

View reviewed changes

biancadanforth force-pushed the 36-fathom branch from 45455d4 to afcb0bc Compare August 2, 2018 19:48

Move Fathom and fallback extraction to separate modules.

a99d3ba

biancadanforth force-pushed the 36-fathom branch from afcb0bc to a99d3ba Compare August 2, 2018 19:50

Osmose approved these changes Aug 3, 2018

View reviewed changes

Rename hasDivWithPriceClass to hasPriceClass and update description.

505c16d

biancadanforth force-pushed the 36-fathom branch from 0a1b328 to 505c16d Compare August 3, 2018 21:56

biancadanforth merged commit 6391a7d into master Aug 3, 2018

biancadanforth deleted the 36-fathom branch August 3, 2018 21:58

biancadanforth mentioned this pull request Aug 12, 2018

Fix #36: Add initial Fathom rules with 98.7% average training accuracy #45

Merged


		import {dom, out, rule, ruleset, score, type} from 'fathom-web';

		const tuningRoutines = {

#36: Integrate Fathom-based page extraction with a simple ruleset. #38

#36: Integrate Fathom-based page extraction with a simple ruleset. #38

Conversation

biancadanforth commented Aug 1, 2018 • edited Loading

Osmose left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

biancadanforth left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

biancadanforth commented Aug 2, 2018

Osmose left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

biancadanforth Aug 2, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

biancadanforth commented Aug 2, 2018

biancadanforth commented Aug 2, 2018

Osmose left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

biancadanforth commented Aug 1, 2018 •

edited

Loading

biancadanforth Aug 2, 2018 •

edited

Loading