Basically, copy ruleset_factory.js over from webext-commerce, copy the coeffs out of fathom_default_coefficients.json, and write a line or 2 of glue.
…ner menu. ...rather than having to edit the code between training or testing runs.
I'm hoping this sort of thing will help penalties like the cart one have a more consistent effect, since we'll no longer be blowing up bonuses without bound.
…Images still score 100%.
I'm not sure what scaling by the viewport size was getting us before. I could just be dense. Put the coefficients vector back; I had used a shrunken one for faster training when working on image rules.
Title and image are unaffected, since they don't really have many rules to balance atm.
…onfidences. Also the $ rule. Also break up both "price" rules into separate ones for parent and the actual element so the trainer can come up with optimal coeffs, rather than what I assume is a human guess of .75x difference. Start those new parent coeffs out at as close to .75x as I could.
Add a trapezoid function, mostly to clamp it to 0..1 before we raise it to the coeff's power. Re-spell largerImage() for consistency. ("Is" is what we've been using for fuzzy-truth values elsewhere, so we adopt it here.)
Express it in terms of fuzzy truth. Also greatly simplify. I'd like to test whether this does as well or better than the overlap-testing method. In doing so, change getHighestScoringImage() to return its fnode, not its element.
Also don't require the price to come at the end, in service of the "US $5.00 plus free shipping" use case which I misread "'US $5.00' on eBay" as. Let's see how that goes.
Price, title, image: all 100%!
One of those showed up in 4.html. Rename imageIsBig() because we're calling it on things other than images now, and it was always general anyway.
…ground images. This gets us to 100% on the training corpus of 75.
… doesn't make a different answer come out.
… all named samples).
…the JSON file. Order the coeffs alphabetically in the ruleset factory so getCoeffsInOrder() works once more. Upgrade to Fathom 2.8 so FathomFox's good/bad-clicking features work when you symlink from a fathom-trainees fork to here. (Yarn looks in pricewise, not in fathom-trainees, for a copy of fathom.)
So I'm trying to understand how exponentiation works for coefficients:

Base: Zeroish (0.08)
Base: Oneish (0.9)

So a rule with a positive exponent will always reduce the score, and rules with a negative exponent will always increase it, and rules are simply expected to return (which is weird, because the code seems to work the other way around?). If I'm getting this correctly, that means exponentiation does not allow for a rule that increases the score for matches and decreases it for elements that don't match, right?
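The behavior being asked about can be checked numerically. This is a quick illustration using the Zeroish/Oneish values quoted above; the exponents are made-up examples, not the ruleset's coefficients.

```javascript
// Fuzzy-truth constants quoted in the thread.
const ZEROISH = 0.08;
const ONEISH = 0.9;

// With a base below 1, a positive exponent shrinks the value and a
// negative exponent grows it past 1 (escaping the 0..1 range):
const shrunk = ZEROISH ** 2;   // ≈ 0.0064, a strong penalty
const grown = ZEROISH ** -2;   // ≈ 156.25, a large bonus
const mild = ONEISH ** 2;      // ≈ 0.81, a mild penalty
```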
I am a fan of your work. I am a fan of resolving dozens of issues at once, if we're lucky with this PR. I'm gonna test this against some of the pages we've had reports for during my next review pass to see if it fixes any of them.
Thanks!
 * make a falling trapezoid, where the result is ONEISH to the left and ZEROISH
 * to the right, use a zeroAt greater than oneAt.
 */
function trapezoid(number, zeroAt, oneAt) {
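For context, here is a runnable sketch of how such a function might be implemented: a linear interpolation clamped to the fuzzy-truth range. The ZEROISH/ONEISH values come from the 0.08/0.9 figures quoted in the thread; the body itself is an assumption, not the actual diff.

```javascript
const ZEROISH = 0.08;
const ONEISH = 0.9;

// Map `number` linearly so it yields ZEROISH at zeroAt and ONEISH at oneAt,
// clamped to [ZEROISH, ONEISH]. Passing zeroAt > oneAt gives the "falling"
// variant described in the doc comment.
function trapezoid(number, zeroAt, oneAt) {
    const slope = (ONEISH - ZEROISH) / (oneAt - zeroAt);
    const raw = ZEROISH + slope * (number - zeroAt);
    return Math.max(ZEROISH, Math.min(ONEISH, raw));
}
```

Clamping before exponentiation is what keeps the weighted value inside 0..1, as the commit message above describes.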
trapezoid is an unclear name for what this is doing, or at least it requires some explanation; the first time I ran into this particular function I had to read it like 3 times to understand what it was doing. A signature like linearScale(number, {zeroAt: ZEROISH, oneAt: ONEISH}) would go a long way towards making this understandable.
Alternatively, we could go with the more semantic meaning of this, which is something like:
return STRONG_MATCH; // rename ONEISH to STRONG_MATCH
return WEAK_MATCH; // rename ZEROISH to WEAK_MATCH
return scaledMatch(number, {weak: 15, strong: 7});
I like your linearScale spelling. Changing. I'm not sure what you have in mind with defaulting the zeroAt param to ZEROISH and oneAt to ONEISH. ZEROISH and ONEISH are the output values of linearScale, not typically the input ones.
I didn't mean to default them, rather was just putting them in as rando arguments.
        return 0.75 * this.hasPriceInIDCoeff;
    }
    return DEFAULT_SCORE;
contains(haystack, needle, coeff) {
This is a tricky naming problem, huh. iincludes is a bad name, but I'm wary of shifting away from JavaScript's nomenclature too. Maybe rename doesContain to icontains (or maybe just contains) and contains to containsScore? I mostly just want to make it clear that one is a standard contains and the other returns a score.
That's a good call. I named those super-fast and never made a second pass to clean them up.
What's the i for in iincludes and icontains?
case-insensitive. But I ended up spelling it out.
    return haystack.toLowerCase().includes(needle);
}
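A hedged sketch of how the renamed pair discussed above might look, split into a plain case-insensitive check and a scoring wrapper. The name containsScore is the reviewer's proposal, and the DEFAULT_SCORE value of 1 is an assumption; neither is necessarily the merged code.

```javascript
// Plain case-insensitive containment check; assumes the needle is already
// lowercase, matching the fragment quoted above.
function caseInsensitivelyContains(haystack, needle) {
    return haystack.toLowerCase().includes(needle);
}

// Scoring wrapper: return the rule's coefficient on a match, or a neutral
// DEFAULT_SCORE (assumed to be 1) that leaves the combined score untouched.
function containsScore(haystack, needle, coeff) {
    const DEFAULT_SCORE = 1;
    return caseInsensitivelyContains(haystack, needle) ? coeff : DEFAULT_SCORE;
}
```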
/** Return a weighted confidence of whether a substring is within a given |
We do a newline after the double stars for multiline doc comments.
}

hasPriceInParentID(fnode) {
    return this.contains(fnode.element.parentElement.id, 'price', this.hasPriceInParentIDCoeff);
parentElement could be null, right? Does the code fail in that case? I guess the old code had this problem too.
It actually comes out as "", so it's fine. :-D
    return this.contains(fnode.element.className, 'price', this.hasPriceInClassNameCoeff);
}
/** Scores fnode with 'price' in its class name */ |
In its parent's class name
rule(type('imageish'), score(this.isBig.bind(this))),
// punishment for extreme aspect ratios, to filter out banners or nav elements
rule(type('imageish'), score(fnode => trapezoid(this.aspectRatio(fnode.element), 10, 5)
    ** this.extremeAspectCoeff)),
Can we move this into a standalone function to be consistent? Same with the other inline functions.
Sure. I was experimenting to see if it was clearer inline, and I actually thought it was, but I'll revert for now. I think the New Math will be a good time to re-examine.
}
return DEFAULT_SCORE;
isNearImage(fnode) {
    const image = this.getHighestScoringImage(fnode);
Maybe we should be explicit about types?
const imageFnode = this.getHighestScoringImageFnode(fnode);
    return this.hasPriceishPatternCoeff;
}
return DEFAULT_SCORE;
const regExp = /\$?\d+\.\d{2}(?![0-9])/;
Should the negative lookahead be optional, for price strings at the end of the element text?
No, because the negative lookahead succeeds when the string ends. (After all, there is indeed no digit there.)
So this addition, (?![0-9]), is allowing any non-digits to follow the two digits after the decimal point, instead of requiring that the end of the string be those two digits after the decimal point?
// Trying it out in a REPL...
const regExpBefore = /\${0,1}\d+\.\d{2}$/;
const regExpAfter = /\$?\d+\.\d{2}(?![0-9])/;
const sample = 'US $4.99e';
regExpBefore.test(sample); // false
regExpAfter.test(sample); // true
@@ -264,6 +235,20 @@ export default class RulesetFactory {
    return true;
}

hasBackgroundImage(fnode) {
    const bgImage = getComputedStyle(fnode.element)['background-image'];
    return !!bgImage && bgImage !== 'none';
The !! should be unnecessary.
Good point!
@@ -275,40 +260,54 @@ export default class RulesetFactory {
 */
// consider all visible img elements
rule(dom('img').when(this.isVisible.bind(this)), type('imageish')),
// and divs, which sometimes have CSS background-images
// TODO: Consider a bonus for <img> tags.
rule(dom('div').when(fnode => this.isVisible(fnode) && this.hasBackgroundImage(fnode)), type('imageish')),
Do we actually successfully pull the background image from these elements? I think the wrapper code around this factory assumes image matches are img tags and pulls their src.
Good question! I had tunnel vision and didn't look beyond upping the recognizer's accuracy. Let me see if I can figure it out.
First, let me say I am not in love with the current math and, as I said in IRC yesterday, I have some new, simpler, hopefully much more intuitive and optimization-friendly math coming up. Addition and thresholds-via-sigmoids; no exponentiation. That said, I will lay out how things stand right now. Of course, what's most important is that the math produces accurate recognition. :-)
Right. I was hoping to avoid negatives in practice, since they violate the 0..1 range post-weighting, but it didn't turn out that way.
The other way around. ONEISH indicates a high confidence that the rule's signal was detected, while ZEROISH indicates a high confidence that it wasn't.
Correct. And this is actually going in the right direction, I believe. Originally, using Fathom's built-in multiplication-based score combination with no auxiliary math, you could scale either way, and I considered that a feature. However, if you think about it in terms of individual rules, rules are simple. They generally can't say anything about the overall confidence that an element is, say, the product's price. Their purview is one tiny dimension: whether a font size is big or a color is bright or there are decimal digits present. If a color is bright, great. Bonus for that element. But if it isn't, there shouldn't necessarily be a penalty assessed. What do you do in that case? Can't return 0-ish; that'd be a penalty. Can't return 1; that's as high as you can go in Fuzzy Logic Land and should thus represent the highest bonus. And we don't want to go back outside 0..1, because then we're dealing with unbounded ranges and no longer have stable thresholds to pin our confidences on.

So the right answer, I believe, is to have dedicated rules that emit bonuses and other rules that emit penalties and tell the framework which is which. The New Math is going to be that way, and the exponentiation-based scheme we're temporarily using here models it decently as well.

Think of the exponentiation-based fuzzy-logic scheme as a multiplication of doubts. If a rule returns 0.9, it doubts that this is the price a little; 0.1, a lot. (Notice that we're operating under the false idea that a rule has purview over the whole type's confidence, which I now believe is a stupid idea.) Those get weighted and then multiplied together, and the result is probably a very small number, but what really matters is the comparison of them, so it all shakes out in the end.

To get intuitively scaled confidences out of this, I had planned to take the nth root of the final result, n being the sum of all the coefficients. This would yield a weighted geometric mean, which is close enough to an arithmetic mean for the purpose of intuitive interpretation. (The whole current math is just a big weighted geometric mean.) However, I'm not sure that works with negative coefficients.

At any rate, I want to pave over the whole mess with simpler, more intuitive New Math, which I look forward to running by both you and the optimizer. Fortunately, the optimizer uses a general-purpose algorithm, so it doesn't really matter what we do, for the purposes of accuracy. The transition in this PR to 0..1 ranges will be required in the New Math anyway; they'll just have different coeffs computed to go along with them. If that doesn't make it clearer, at least it should give you a good Wikipedia hole to go down. :-)
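The multiplication-of-doubts scheme and the nth-root normalization described above can be sketched numerically. The rule outputs and coefficients below are made-up illustrations, not the ruleset's actual values.

```javascript
// Fuzzy-truth constants quoted earlier in the thread.
const ZEROISH = 0.08;
const ONEISH = 0.9;

// Made-up rule outputs and coefficients: two signals fired, one didn't.
const ruleValues = [ONEISH, ONEISH, ZEROISH];
const coeffs = [2, 1, 0.5];

// Weight each rule's fuzzy truth by raising it to its coefficient, then
// multiply the weighted values together. The product is tiny, but only the
// comparison between candidate elements matters.
const combined = ruleValues
    .map((value, i) => value ** coeffs[i])
    .reduce((a, b) => a * b, 1);

// Taking the nth root, n being the sum of the coefficients, yields the
// weighted geometric mean, which lands back on an interpretable scale.
const n = coeffs.reduce((a, b) => a + b, 0);
const weightedGeometricMean = combined ** (1 / n);
```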
…terisks for multi-line ones.
Aw, phooey. As I work on getting background-image extraction going, I see you guys are using scoring thresholds as well (
We don't handle non-url() specifications of images. Extraction will just fail nicely in the (hopefully uncommon) cases of image-sets and such.
Re-ran testing run. Still 92%. Didn't screw anything up on that end. :-)
r+wc. @biancadanforth is also doing a review pass; once she's done and you land the changes, I think this is good to land.
Issues that seem to be resolved with this patch:
- Walmart: Toilet Paper not listed as product #81
- Best Buy: TV product page not recognized #82
- Home Depot: Draft stopper product not recognized #132
- The "Add This Product" button is wrongly actionable on Walmart shopping cart page when the cart has saved items #181
- Wrong item description is displayed when adding an alert #236 (search page no longer actionable)
- Wrong item description is displayed when adding an alert #236
- Price of wrong version displayed #263
- Not able to add product to price wise, add button is grey #265
- Amazon Entry shows price drop, but Amazon site does not #273
Suffice it to say, this is excellent work.
}
// The other thing the ruleset can return is an arbitrary element with
// a CSS background image.
return urlFromCssDeclaration(getComputedStyle(element)['background-image']);
A bit of a nit, but I've always found that function declarations at the top break the flow of the function and make it harder to read. The function name helps the readability, but a variable can do just as well without interrupting the order:

const backgroundImage = getComputedStyle(element)['background-image'];
return backgroundImage.substring(5, backgroundImage.length - 2);
I wanted this to be ScreechinglyObviousCode. With magic numbers like 5 and -2, it's not otherwise screechingly obvious that what we're trying to do is pull the param out of url("…").
An extra variable name, then?
const backgroundImage = getComputedStyle(element)['background-image']; // "url('<image_url>')"
const backgroundImageUrl = backgroundImage.substring(5, backgroundImage.length - 2);
return backgroundImageUrl;
(Or not. This is very much yak shaving at this point.)
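For reference, here is a sketch of what urlFromCssDeclaration might look like, consistent with the magic numbers 5 and -2 debated above. The function name comes from the diff; the body is an assumption, not necessarily the merged code.

```javascript
// Pull the URL out of a computed background-image declaration like
// 'url("https://example.com/img.png")': 'url("' is 5 characters, and the
// trailing '")' is 2. Non-url() values (gradients, image-sets) aren't handled,
// matching the limitation acknowledged in the thread.
function urlFromCssDeclaration(declaration) {
    return declaration.substring(5, declaration.length - 2);
}
```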
I know this has already merged, but I went through it anyway and can double confirm that this is an improvement from our baseline extraction based on the original 50-page training corpus, with additional confidence provided by the 25-page test set at https://github.com/mozilla/webext-commerce-corpus/. Thank you for these improvements!
In doing the training runs, I noticed you added several changes to Fathom that improved the process from my original notes with Fathom 2.3.0, so that was great to see as well.
I asked a couple clarifying questions here on the ruleset, and I will follow up with you separately on a couple points related to Fathom/FathomFox/Fathom Trainees, etc.
/** Return whether the computed font size of an element is big. */
fontIsBig(fnode) {
    const size = parseInt(getComputedStyle(fnode.element).fontSize, 10);
    return trapezoid(size, 14, 50) ** this.bigFontCoeff;
Nit: It would be easier here if these threshold values (14, 50) were stored in const variables at the top for future tweaking. That way they could also be described in terms of what units they're in, and potentially used by other rules.
}

/** Scores fnode with 'price' in its class name */
hasPriceInParentClassName(fnode) {
Good idea to break these each out into their own rules with their own coefficients for node versus parent node.
if (top <= viewportHeight) {
    return ONEISH * featureCoeff;
}
const viewportHeight = 950;
The viewport height and/or width shows up in a lot of different places (here, trainees.js, price_updates.js)... Maybe we could have a Fathom config in ./src/extraction/fathom with these values? Not sure what else would go in there though -- what do you think @Osmose ?
I don't think we need a separate config for them, but putting them in the main config seems fine to me.
// Test nodeRect.top. They're probably not side by side with the title at
// the bottom. Rather, title will be below image.
const bottomDistance = Math.abs(imageRect.bottom - nodeRect.top);
I can't think of a product page I have seen where the title is below the image -- did you find some from developing your test corpus?
As per this morning's email, this bumps the price accuracy up to 92% and the title to 100%, for a total of 8 percentage points improvement.
It's also tested on a corpus twice as big:
Though my goal was a price accuracy bump, my strategy was to attack image recognition first, since everything else hinges on that. And it worked out pretty well! I added rules to…
I also re-expressed (and sometimes rewrote) the rules to compute [0, 1] fuzzy-logic confidences, from which we might be able to draw overall confidences that a given page is a product one, toward making the menu enable and disable more intelligently.
Please excuse the weird metadata on the commits. These are painstakingly extracted from the fathom-trainees fork where I developed them and retconned into this repo to preserve as much history as I could. One consequence of all this cherry-picking is that I change some stuff in the first few commits to glue it into fathom-trainees the way I wanted it, and then I change it back in the last "Put the glue code back" commit. So maybe look at the last commit before picking nits with the first few. :-)
Note that reality has suddenly shifted under my feet. I'd love for someone to softlink the trainees.js, coefficients.json, and ruleset_factory.js into this from a recent fork of fathom-trainees, as everything has started returning really low numbers on my machine in the past hour or so. I can't even repro my earlier training runs. I suspect (and hope) I broke something local to my machine.

I had mistranscribed a number. See ba89ed4. That's why this is now 92% accurate instead of 90%.