Basically, copy ruleset_factory.js over from webext-commerce, copy the coeffs out of fathom_default_coefficients.json, and write a line or 2 of glue.
…ner menu. ...rather than having to edit the code between training or testing runs.
I'm hoping this sort of thing will help penalties like the cart one have a more consistent effect, since we'll no longer be blowing up bonuses without bound.
…Images still score 100%.
I'm not sure what scaling by the viewport size was getting us before. I could just be dense. Put the coefficients vector back; I had used a shrunken one for faster training when working on image rules.
Title and image are unaffected, since they don't really have many rules to balance atm.
…onfidences. Also the $ rule. Also break up both "price" rules into separate ones for parent and the actual element so the trainer can come up with optimal coeffs, rather than what I assume is a human guess of .75x difference. Start those new parent coeffs out at as close to .75x as I could.
Add a trapezoid function, mostly to clamp it to 0..1 before we raise it to the coeff's power. Re-spell largerImage() for consistency. ("Is" is what we've been using for fuzzy-truth values elsewhere, so we adopt it here.)
Express it in terms of fuzzy truth. Also greatly simplify. I'd like to test whether this does as well or better than the overlap-testing method. In doing so, change getHighestScoringImage() to return its fnode, not its element.
Also don't require the price to come at the end, in service of the "US $5.00 plus free shipping" use case which I misread "'US $5.00' on eBay" as. Let's see how that goes.
Price, title, image: all 100%!
One of those showed up in 4.html. Rename imageIsBig() because we're calling it on things other than images now, and it was always general anyway.
…ground images. This gets us to 100% on the training corpus of 75.
… doesn't make a different answer come out.
… all named samples).
…the JSON file. Order the coeffs alphabetically in the ruleset factory so getCoeffsInOrder() works once more. Upgrade to Fathom 2.8 so FathomFox's good/bad-clicking features work when you symlink from a fathom-trainees fork to here. (Yarn looks in pricewise, not in fathom-trainees, for a copy of fathom.)
So I'm trying to understand how exponentiation works for coefficients:

Base: Zeroish (0.08)
Base: Oneish (0.9)

So a rule with a positive exponent will always reduce the score, and rules with a negative exponent will always increase it, and rules are simply expected to return (which is weird, because the code seems to work the other way around?). If I'm getting this correctly, that means exponentiation does not allow for a rule that increases the score for matches and decreases it for elements that don't match, right?
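The behavior being asked about can be checked numerically. This is a quick illustration using the Zeroish/Oneish values quoted above; the exponents are made-up examples, not the ruleset's coefficients.

```javascript
// Fuzzy-truth constants quoted in the thread.
const ZEROISH = 0.08;
const ONEISH = 0.9;

// With a base below 1, a positive exponent shrinks the value and a
// negative exponent grows it past 1 (escaping the 0..1 range):
const shrunk = ZEROISH ** 2;   // ≈ 0.0064, a strong penalty
const grown = ZEROISH ** -2;   // ≈ 156.25, a large bonus
const mild = ONEISH ** 2;      // ≈ 0.81, a mild penalty
```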
I am a fan of your work. I am a fan of resolving dozens of issues at once, if we're lucky with this PR. I'm gonna test this against some of the pages we've had reports for during my next review pass to see if it fixes any of them.
Thanks!
 * make a falling trapezoid, where the result is ONEISH to the left and ZEROISH
 * to the right, use a zeroAt greater than oneAt.
 */
function trapezoid(number, zeroAt, oneAt) {
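For context, here is a runnable sketch of how such a function might be implemented: a linear interpolation clamped to the fuzzy-truth range. The ZEROISH/ONEISH values come from the 0.08/0.9 figures quoted in the thread; the body itself is an assumption, not the actual diff.

```javascript
const ZEROISH = 0.08;
const ONEISH = 0.9;

// Map `number` linearly so it yields ZEROISH at zeroAt and ONEISH at oneAt,
// clamped to [ZEROISH, ONEISH]. Passing zeroAt > oneAt gives the "falling"
// variant described in the doc comment.
function trapezoid(number, zeroAt, oneAt) {
    const slope = (ONEISH - ZEROISH) / (oneAt - zeroAt);
    const raw = ZEROISH + slope * (number - zeroAt);
    return Math.max(ZEROISH, Math.min(ONEISH, raw));
}
```

Clamping before exponentiation is what keeps the weighted value inside 0..1, as the commit message above describes.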
trapezoid is an unclear name for what this is doing, or at least it requires some explanation; the first time I ran into this particular function I had to read it like 3 times to understand what it was doing. A signature like linearScale(number, {zeroAt: ZEROISH, oneAt: ONEISH}) would go a long way towards making this understandable.
Alternatively, we could go with the more semantic meaning of this, which is something like:
return STRONG_MATCH; // rename ONEISH to STRONG_MATCH
return WEAK_MATCH; // rename ZEROISH to WEAK_MATCH
return scaledMatch(number, {weak: 15, strong: 7});
I like your linearScale spelling. Changing. I'm not sure what you have in mind with defaulting the zeroAt param to ZEROISH and oneAt to ONEISH. ZEROISH and ONEISH are the output values of linearScale, not typically the input ones.
I didn't mean to default them, rather was just putting them in as rando arguments.
        return 0.75 * this.hasPriceInIDCoeff;
    }
    return DEFAULT_SCORE;
contains(haystack, needle, coeff) {
This is a tricky naming problem, huh. iincludes is a bad name, but I'm wary of shifting away from JavaScript's nomenclature too. Maybe rename doesContain to icontains (or maybe just contains) and contains to containsScore? I mostly just want to make it clear that one is a standard contains and the other returns a score.
That's a good call. I named those super-fast and never made a second pass to clean them up.
What's the i for in iincludes and icontains?
case-insensitive. But I ended up spelling it out.
    return haystack.toLowerCase().includes(needle);
}
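A hedged sketch of how the renamed pair discussed above might look, split into a plain case-insensitive check and a scoring wrapper. The name containsScore is the reviewer's proposal, and the DEFAULT_SCORE value of 1 is an assumption; neither is necessarily the merged code.

```javascript
// Plain case-insensitive containment check; assumes the needle is already
// lowercase, matching the fragment quoted above.
function caseInsensitivelyContains(haystack, needle) {
    return haystack.toLowerCase().includes(needle);
}

// Scoring wrapper: return the rule's coefficient on a match, or a neutral
// DEFAULT_SCORE (assumed to be 1) that leaves the combined score untouched.
function containsScore(haystack, needle, coeff) {
    const DEFAULT_SCORE = 1;
    return caseInsensitivelyContains(haystack, needle) ? coeff : DEFAULT_SCORE;
}
```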
/** Return a weighted confidence of whether a substring is within a given |
We do a newline after the double stars for multiline doc comments.
}

hasPriceInParentID(fnode) {
    return this.contains(fnode.element.parentElement.id, 'price', this.hasPriceInParentIDCoeff);
parentElement could be null, right? Does the code fail in that case? I guess the old code had this problem too.
It actually comes out as "", so it's fine. :-D
    return this.contains(fnode.element.className, 'price', this.hasPriceInClassNameCoeff);
}
/** Scores fnode with 'price' in its class name */ |
In its parent's class name
rule(type('imageish'), score(this.isBig.bind(this))),
// punishment for extreme aspect ratios, to filter out banners or nav elements
rule(type('imageish'), score(fnode => trapezoid(this.aspectRatio(fnode.element), 10, 5)
    ** this.extremeAspectCoeff)),
Can we move this into a standalone function to be consistent? Same with the other inline functions.
Sure. I was experimenting to see if it was clearer inline, and I actually thought it was, but I'll revert for now. I think the New Math will be a good time to re-examine.
}
return DEFAULT_SCORE;
isNearImage(fnode) {
    const image = this.getHighestScoringImage(fnode);
Maybe we should be explicit about types?
const imageFnode = this.getHighestScoringImageFnode(fnode);
    return this.hasPriceishPatternCoeff;
}
return DEFAULT_SCORE;
const regExp = /\$?\d+\.\d{2}(?![0-9])/;
Should the negative lookahead be optional, for price strings at the end of the element text?
No, because the negative lookahead succeeds when the string ends. (After all, there is indeed no digit there.)
So this addition, (?![0-9]), is allowing any non-digits to follow the two digits after the decimal point, instead of requiring that the end of the string be those two digits after the decimal point?
// Trying it out in a REPL...
const regExpBefore = /\${0,1}\d+\.\d{2}$/;
const regExpAfter = /\$?\d+\.\d{2}(?![0-9])/;
const sample = 'US $4.99e';
regExpBefore.test(sample); // false
regExpAfter.test(sample); // true
@@ -264,6 +235,20 @@ export default class RulesetFactory {
    return true;
}

hasBackgroundImage(fnode) {
    const bgImage = getComputedStyle(fnode.element)['background-image'];
    return !!bgImage && bgImage !== 'none';
The !! should be unnecessary.
Good point!
@@ -275,40 +260,54 @@ export default class RulesetFactory {
 */
// consider all visible img elements
rule(dom('img').when(this.isVisible.bind(this)), type('imageish')),
// and divs, which sometimes have CSS background-images
// TODO: Consider a bonus for <img> tags.
rule(dom('div').when(fnode => this.isVisible(fnode) && this.hasBackgroundImage(fnode)), type('imageish')),
Do we actually successfully pull the background image from these elements? I think the wrapper code around this factory assumes image matches are img tags and pulls their src.
Good question! I had tunnel vision and didn't look beyond upping the recognizer's accuracy. Let me see if I can figure it out.
First, let me say I am not in love with the current math and, as I said in IRC yesterday, I have some new, simpler, hopefully much more intuitive and optimization-friendly math coming up. Addition and thresholds-via-sigmoids; no exponentiation. That said, I will lay out how things stand right now. Of course, what's most important is that the math produces accurate recognition. :-)
Right. I was hoping to avoid negatives in practice, since they violate the 0..1 range post-weighting, but it didn't turn out that way.
The other way around. ONEISH indicates a high confidence that the rule's signal was detected, while ZEROISH indicates a high confidence that it wasn't.
Correct. And this is actually going in the right direction, I believe. Originally, using Fathom's built-in multiplication-based score combination with no auxiliary math, you could scale either way, and I considered that a feature. However, if you think about it in terms of individual rules, rules are simple. They generally can't say anything about the overall confidence that an element is, say, the product's price. Their purview is one tiny dimension: whether a font size is big or a color is bright or there are decimal digits present. If a color is bright, great. Bonus for that element. But if it isn't, there shouldn't necessarily be a penalty assessed. What do you do in that case? Can't return 0-ish; that'd be a penalty. Can't return 1; that's as high as you can go in Fuzzy Logic Land and should thus represent the highest bonus. And we don't want to go back outside 0..1, because then we're dealing with unbounded ranges and no longer have stable thresholds to pin our confidences on.

So the right answer, I believe, is to have dedicated rules that emit bonuses and other rules that emit penalties and tell the framework which is which. The New Math is going to be that way, and the exponentiation-based scheme we're temporarily using here models it decently as well.

Think of the exponentiation-based fuzzy-logic scheme as a multiplication of doubts. If a rule returns 0.9, it doubts that this is the price a little; 0.1, a lot. (Notice that we're operating under the false idea that a rule has purview over the whole type's confidence, which I now believe is a stupid idea.) Those get weighted and then multiplied together, and the result is probably a very small number, but what really matters is the comparison of them, so it all shakes out in the end.

To get intuitively scaled confidences out of this, I had planned to take the nth root of the final result, n being the sum of all the coefficients. This would yield a weighted geometric mean, which is close enough to an arithmetic mean for the purpose of intuitive interpretation. (The whole current math is just a big weighted geometric mean.) However, I'm not sure that works with negative coefficients.

At any rate, I want to pave over the whole mess with simpler, more intuitive New Math, which I look forward to running by both you and the optimizer. Fortunately, the optimizer uses a general-purpose algorithm, so it doesn't really matter what we do, for the purposes of accuracy. The transition in this PR to 0..1 ranges will be required in the New Math anyway; they'll just have different coeffs computed to go along with them. If that doesn't make it clearer, at least it should give you a good Wikipedia hole to go down. :-)
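The multiplication-of-doubts scheme and the nth-root normalization described above can be sketched numerically. The rule outputs and coefficients below are made-up illustrations, not the ruleset's actual values.

```javascript
// Fuzzy-truth constants quoted earlier in the thread.
const ZEROISH = 0.08;
const ONEISH = 0.9;

// Made-up rule outputs and coefficients: two signals fired, one didn't.
const ruleValues = [ONEISH, ONEISH, ZEROISH];
const coeffs = [2, 1, 0.5];

// Weight each rule's fuzzy truth by raising it to its coefficient, then
// multiply the weighted values together. The product is tiny, but only the
// comparison between candidate elements matters.
const combined = ruleValues
    .map((value, i) => value ** coeffs[i])
    .reduce((a, b) => a * b, 1);

// Taking the nth root, n being the sum of the coefficients, yields the
// weighted geometric mean, which lands back on an interpretable scale.
const n = coeffs.reduce((a, b) => a + b, 0);
const weightedGeometricMean = combined ** (1 / n);
```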
…terisks for multi-line ones.
Aw, phooey. As I work on getting background-image extraction going, I see you guys are using scoring thresholds as well (
We don't handle non-url() specifications of images. Extraction will just fail nicely in the (hopefully uncommon) cases of image-sets and such.
Re-ran testing run. Still 92%. Didn't screw anything up on that end. :-)
r+wc. @biancadanforth is also doing a review pass; once she's done and you land the changes, I think this is good to land.
Issues that seem to be resolved with this patch:
- Walmart: Toilet Paper not listed as product #81
- Best Buy: TV product page not recognized #82
- Home Depot: Draft stopper product not recognized #132
- The "Add This Product" button is wrongly actionable on Walmart shopping cart page when the cart has saved items #181
- Wrong item description is displayed when adding an alert #236 (search page no longer actionable)
- Wrong item description is displayed when adding an alert #236
- Price of wrong version displayed #263
- Not able to add product to price wise, add button is grey #265
- Amazon Entry shows price drop, but Amazon site does not #273
Suffice it to say, this is excellent work.
}
// The other thing the ruleset can return is an arbitrary element with
// a CSS background image.
return urlFromCssDeclaration(getComputedStyle(element)['background-image']);
A bit of a nit, but I've always found that function declarations at the top break the flow of the function and make it harder to read. The function name helps the readability, but a variable can do just as well without interrupting the order:

const backgroundImage = getComputedStyle(element)['background-image'];
return backgroundImage.substring(5, backgroundImage.length - 2);
I wanted this to be ScreechinglyObviousCode. With magic numbers like 5 and -2, it's not otherwise screechingly obvious that what we're trying to do is pull the param out of url("…").
An extra variable name, then?
const backgroundImage = getComputedStyle(element)['background-image']; // "url('<image_url>')"
const backgroundImageUrl = backgroundImage.substring(5, backgroundImage.length - 2);
return backgroundImageUrl;
(Or not. This is very much yak shaving at this point.)
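For reference, here is a sketch of what urlFromCssDeclaration might look like, consistent with the magic numbers 5 and -2 debated above. The function name comes from the diff; the body is an assumption, not necessarily the merged code.

```javascript
// Pull the URL out of a computed background-image declaration like
// 'url("https://example.com/img.png")': 'url("' is 5 characters, and the
// trailing '")' is 2. Non-url() values (gradients, image-sets) aren't handled,
// matching the limitation acknowledged in the thread.
function urlFromCssDeclaration(declaration) {
    return declaration.substring(5, declaration.length - 2);
}
```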
I know this has already merged, but I went through it anyway and can double confirm that this is an improvement from our baseline extraction based on the original 50-page training corpus, with additional confidence provided by the 25-page test set at https://github.com/mozilla/webext-commerce-corpus/. Thank you for these improvements!
In doing the training runs, I noticed you added several changes to Fathom that improved the process from my original notes with Fathom 2.3.0, so that was great to see as well.
I asked a couple clarifying questions here on the ruleset, and I will follow up with you separately on a couple points related to Fathom/FathomFox/Fathom Trainees, etc.
/** Return whether the computed font size of an element is big. */
fontIsBig(fnode) {
    const size = parseInt(getComputedStyle(fnode.element).fontSize, 10);
    return trapezoid(size, 14, 50) ** this.bigFontCoeff;
Nit: It would be easier here if these threshold values (14, 50) were stored in const variables at the top for future tweaking. That way they could also be described in terms of what units they're in, and potentially used by other rules.
}

/** Scores fnode with 'price' in its class name */
hasPriceInParentClassName(fnode) {
Good idea to break these each out into their own rules with their own coefficients for node versus parent node.
if (top <= viewportHeight) {
    return ONEISH * featureCoeff;
}
const viewportHeight = 950;
The viewport height and/or width shows up in a lot of different places (here, trainees.js, price_updates.js)... Maybe we could have a Fathom config in ./src/extraction/fathom with these values? Not sure what else would go in there though -- what do you think @Osmose ?
I don't think we need a separate config for them, but putting them in the main config seems fine to me.
// Test nodeRect.top. They're probably not side by side with the title at
// the bottom. Rather, title will be below image.
const bottomDistance = Math.abs(imageRect.bottom - nodeRect.top);
I can't think of a product page I have seen where the title is below the image -- did you find some from developing your test corpus?
As per this morning's email, this bumps the price accuracy up to 92% and the title to 100%, for a total of 8 percentage points improvement.
It's also tested on a corpus twice as big:
Though my goal was a price accuracy bump, my strategy was to attack image recognition first, since everything else hinges on that. And it worked out pretty well! I added rules to…
I also re-expressed (and sometimes rewrote) the rules to compute [0, 1] fuzzy-logic confidences, from which we might be able to draw overall confidences that a given page is a product one, toward making the menu enable and disable more intelligently.
Please excuse the weird metadata on the commits. These are painstakingly extracted from the fathom-trainees fork where I developed them and retconned into this repo to preserve as much history as I could. One consequence of all this cherry-picking is that I change some stuff in the first few commits to glue it into fathom-trainees the way I wanted it, and then I change it back in the last "Put the glue code back" commit. So maybe look at the last commit before picking nits with the first few. :-)
Note that reality has suddenly shifted under my feet. I'd love for someone to softlink the trainees.js, coefficients.json, and ruleset_factory.js into this from a recent fork of fathom-trainees, as everything has started returning really low numbers on my machine in the past hour or so. I can't even repro my earlier training runs. I suspect (and hope) I broke something local to my machine.

I had mistranscribed a number. See ba89ed4. That's why this is now 92% accurate instead of 90%.