This repository has been archived by the owner on Dec 3, 2020. It is now read-only.

92% price accuracy #275

Merged
merged 42 commits into from
Nov 20, 2018
Changes from 16 commits
2558ea5
Get trainer running on image out-rule from webext-commerce.
erikrose Oct 5, 2018
ca8d4c8
Add a trainee for each out() rule so we can choose them from the Trai…
erikrose Oct 19, 2018
ce27b0d
Respell a regex for clarity.
erikrose Oct 25, 2018
d8e6f43
Bring up to date with 9813ba8b59e6125b9ab18f51499e47bb2ec55745 in htt…
erikrose Oct 25, 2018
46b61d9
Rewrite isAboveTheFold using trapezoid() and fuzzy-logic scores.
erikrose Oct 25, 2018
2eb12b8
Refactor largerImage as well. This completes the image coefficients. …
erikrose Oct 25, 2018
51e6b30
Rewrite image y-axis scorer to constrain to 0..1 and for simplicity.
erikrose Oct 26, 2018
6909520
Retrain to fix the priceish coeff for isAboutTheFold().
erikrose Oct 29, 2018
8d38da1
Change rules that look for "price" in IDs and classes to emit fuzzy c…
erikrose Oct 29, 2018
525bd54
Re-express font-size rule as a confidence.
erikrose Oct 29, 2018
b567dd9
Rewrite rule that give a bonus to prices near the winning image.
erikrose Oct 29, 2018
90423da
Express hasPriceishPattern as a fuzzy truth.
erikrose Oct 29, 2018
74658e9
Fix the bugs that immediately kept the trainer from training.
erikrose Oct 29, 2018
c52af18
Remove a now-unused constant and an out-of-date comment.
erikrose Oct 30, 2018
737ca67
Add new coeffs to get to 100% on the training set!
erikrose Oct 30, 2018
47bbc7a
Rename hasPriceIn, since it doesn't actually have "price" hard-coded …
erikrose Nov 14, 2018
307fa3d
Consider divs with background images as well as img tags.
erikrose Nov 14, 2018
e1f4479
Typos
erikrose Nov 14, 2018
f0eba0d
There's no need to say "node". All scoring functions take nodes.
erikrose Nov 14, 2018
3fc5856
Add a rule to punish extreme aspect ratios and another to punish back…
erikrose Nov 14, 2018
dd16bf0
Hard-code a height for aboveTheFold so a user's different window size…
erikrose Nov 14, 2018
418a9cf
Make the image trainee train only the image-affecting coeffs, for speed.
erikrose Nov 14, 2018
3486879
Move tuned image coeffs into master vector.
erikrose Nov 15, 2018
9217d83
Make a more efficient training vector for price.
erikrose Nov 15, 2018
89da723
Improve price coeffs: 100% on 12-16.
erikrose Nov 15, 2018
0c888cb
Improve price coeffs. 93.3% on 12-16, 1-10. 92% on 1-25.
erikrose Nov 15, 2018
1735fe3
Improve price coeffs: 98.7% on all current training samples (1-25 and…
erikrose Nov 15, 2018
82709cf
Copy tuned price coeffs to master vector.
erikrose Nov 16, 2018
ac1d304
Put the glue code back how I found it, and move the coeffs back into …
erikrose Nov 16, 2018
ba89ed4
Fix a mistranscribed coefficient.
erikrose Nov 16, 2018
fd6b043
Make linter happy.
erikrose Nov 16, 2018
6ce3ea9
Rename trapezoid() to linearScale().
erikrose Nov 20, 2018
5d55d10
Use single-line doclets where possible. Put a newline after double as…
erikrose Nov 20, 2018
45db392
Rename contains() functions to indicate what returns bools and what r…
erikrose Nov 20, 2018
e9edf39
Un-inline the aspect ratio rule.
erikrose Nov 20, 2018
c0570d3
Un-inline hasBackgroundInID().
erikrose Nov 20, 2018
a57656a
Stick types in local names.
erikrose Nov 20, 2018
4cca17a
Remove unneeded !!.
erikrose Nov 20, 2018
ea275aa
Teach the application bits how to extract from CSS background-images.
erikrose Nov 20, 2018
57ca295
Damn you, linter.
erikrose Nov 20, 2018
aa4bd88
Use slice() for brevity and great justice.
erikrose Nov 20, 2018
3c6649a
Merge branch 'master' into 90%-price
Osmose Nov 20, 2018
225 changes: 109 additions & 116 deletions src/extraction/fathom/ruleset_factory.js
@@ -3,12 +3,9 @@
* file, You can obtain one at http://mozilla.org/MPL/2.0/. */

import {dom, out, rule, ruleset, score, type} from 'fathom-web';
// Since the fathom-trainees add-on currently uses a submodule of Fathom, for
// training, replace 'utils' with 'utilsForFrontend'
import {ancestors} from 'fathom-web/utils';
import {ancestors} from 'fathom-web/utilsForFrontend';
import {euclidean} from 'fathom-web/clusters';

const DEFAULT_BODY_FONT_SIZE = 14;
const DEFAULT_SCORE = 1;
const TOP_BUFFER = 150;
// From: https://github.com/mozilla/fathom-trainees/blob/master/src/trainees.js
const ZEROISH = 0.08;
@@ -26,165 +23,130 @@ export default class RulesetFactory {
[
this.hasDollarSignCoeff,
this.hasPriceInClassNameCoeff,
this.hasPriceInParentClassNameCoeff,
this.hasPriceInIDCoeff,
this.hasPriceInParentIDCoeff,
this.hasPriceishPatternCoeff,
this.isAboveTheFoldImageCoeff,
this.isAboveTheFoldPriceCoeff,
this.isNearbyImageXAxisPriceCoeff,
this.isNearImageCoeff,
this.isNearbyImageYAxisTitleCoeff,
this.largerFontSizeCoeff,
this.largerImageCoeff,
this.bigFontCoeff,
this.bigImageCoeff,
] = coefficients;
}

/**
* Scores fnode in direct proportion to its size
*/
largerImage(fnode) {
imageIsBig(fnode) {
const domRect = fnode.element.getBoundingClientRect();
const area = (domRect.width) * (domRect.height);
if (area === 0) {
return DEFAULT_SCORE;
}
return area * this.largerImageCoeff;
const area = domRect.width * domRect.height;

// Assume no product images as small as 80px^2. No further bonus over
// 1000^2. For one thing, that's getting into background image territory
// (though we should have distinct penalties for that sort of thing if we
// care). More importantly, clamp the upper bound of the score so we don't
// overcome other bonuses and penalties.
return trapezoid(area, 80 ** 2, 1000 ** 2) ** this.bigImageCoeff;
}

/**
* Scores fnode in proportion to its font size
*/
largerFontSize(fnode) {
const size = window.getComputedStyle(fnode.element).fontSize;
// Normalize the multiplier by the default font size
const sizeMultiplier = parseFloat(size, 10) / DEFAULT_BODY_FONT_SIZE;
return sizeMultiplier * this.largerFontSizeCoeff;
/** Return whether a */
fontIsBig(fnode) {
const size = parseInt(getComputedStyle(fnode.element).fontSize, 10);
return trapezoid(size, 14, 50) ** this.bigFontCoeff;
Collaborator

Nit: It would be easier here if these threshold values (14, 50) were stored in const variables at the top for future tweaking. That way they could also be described in terms of what units they're in, and potentially used by other rules.

Contributor

For as much as I don't like the trapezoid name, it is making some of these rules a lot easier to understand. 👍

}
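
The nit above about naming the thresholds could look roughly like the following sketch. The constant names, `risingScale()`, and `fontIsBigScore()` are hypothetical, and `ONEISH = 0.9` is an assumption (only `ZEROISH = 0.08` is visible in this diff). The function takes the font size directly so it runs without a DOM.

```javascript
// Output bounds of the fuzzy scale. ZEROISH is from the diff; ONEISH = 0.9 is assumed.
const ZEROISH = 0.08;
const ONEISH = 0.9;

// Hypothetical names for the thresholds passed to trapezoid() in fontIsBig().
const SMALLEST_INTERESTING_FONT_PX = 14; // at or below this size, confidence is ZEROISH
const LARGEST_INTERESTING_FONT_PX = 50; // at or above this size, confidence is ONEISH

// Simplified rising-only stand-in for trapezoid(), clamped to [ZEROISH, ONEISH].
function risingScale(number, zeroAt, oneAt) {
  if (number <= zeroAt) return ZEROISH;
  if (number >= oneAt) return ONEISH;
  return ZEROISH + (ONEISH - ZEROISH) * (number - zeroAt) / (oneAt - zeroAt);
}

// DOM-free version of fontIsBig(): the caller supplies the computed font size in px.
function fontIsBigScore(fontSizePx, coeff) {
  return risingScale(
    fontSizePx,
    SMALLEST_INTERESTING_FONT_PX,
    LARGEST_INTERESTING_FONT_PX,
  ) ** coeff;
}

console.log(fontIsBigScore(12, 1)); // clamped to ZEROISH
console.log(fontIsBigScore(60, 1)); // clamped to ONEISH
```

Named constants also give the units (px) a place to be documented, which is the other half of the reviewer's point.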

/**
* Scores fnode with a '$' in its innerText
*/
hasDollarSign(fnode) {
if (fnode.element.innerText.includes('$')) {
return this.hasDollarSignCoeff;
}
return DEFAULT_SCORE;
return (fnode.element.innerText.includes('$') ? ONEISH : ZEROISH) ** this.hasDollarSignCoeff;
}

/** Return a confidence of whether "price" is a word within a given string. */
contains(haystack, needle, coeff) {
Contributor

This is a tricky naming problem, huh. iincludes is a bad name, but I'm wary of shifting away from JavaScript's nomenclature too.

Maybe rename doesContain to icontains (or maybe just contains) and contains to containsScore? I mostly just want to make it clear that one is a standard contains and the other returns a score.

Contributor Author

That's a good call. I named those super-fast and never made a second pass to clean them up.

Collaborator

What's the i for in iincludes and icontains?

Contributor Author

case-insensitive. But I ended up spelling it out.

return (haystack.toLowerCase().includes(needle) ? ONEISH : ZEROISH) ** coeff;
}
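
One way to make the bool-versus-score split from the thread above explicit is sketched here. Both function names are hypothetical (the names finally chosen are not shown in this view), `ONEISH = 0.9` is assumed, and unlike the diff's `contains()` this version lowercases the needle as well as the haystack.

```javascript
const ZEROISH = 0.08; // from the diff
const ONEISH = 0.9; // assumed value

/** Return whether haystack contains needle, ignoring case. Returns a bool. */
function caselessIncludes(haystack, needle) {
  return haystack.toLowerCase().includes(needle.toLowerCase());
}

/** Return a confidence that haystack contains needle, raised to coeff. Returns a number. */
function weightedIncludes(haystack, needle, coeff) {
  return (caselessIncludes(haystack, needle) ? ONEISH : ZEROISH) ** coeff;
}

// A coefficient of 0 turns any confidence into 1, so the rule has no effect
// if scores are combined by multiplication.
console.log(weightedIncludes('product-Price', 'price', 0));

// A larger coefficient widens the gap between a hit and a miss.
console.log(weightedIncludes('product-Price', 'price', 2)); // ONEISH squared
console.log(weightedIncludes('sidebar', 'price', 2)); // ZEROISH squared
```

Using the coefficient as an exponent keeps every result in the range 0..1 for non-negative coefficients, which is what lets the trainer tune each rule's weight independently.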

/**
* Scores fnode with 'price' in its id or its parent's id
* Scores fnode with 'price' in its id
*/
hasPriceInID(fnode) {
const id = fnode.element.id;
const parentID = fnode.element.parentElement.id;
if (id.toLowerCase().includes('price')) {
return this.hasPriceInIDCoeff;
}
if (parentID.toLowerCase().includes('price')) {
return 0.75 * this.hasPriceInIDCoeff;
}
return DEFAULT_SCORE;
return this.contains(fnode.element.id, 'price', this.hasPriceInIDCoeff);
}

/**
* Scores fnode with 'price' in its class name or its parent's class name
*/
hasPriceInParentID(fnode) {
return this.contains(fnode.element.parentElement.id, 'price', this.hasPriceInParentIDCoeff);
Contributor

parentElement could be null, right? Does the code fail in that case? I guess the old code had this problem too.

Contributor Author

It actually comes out as "", so it's fine. :-D

}
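
The exchange above involves two distinct cases that behave differently in plain JavaScript: a parent that has no id reports `""`, while a parent that is actually `null` would throw on property access. The sketch below uses plain objects as stand-ins for DOM elements, and `parentID()` is a hypothetical helper, not part of this PR.

```javascript
// Plain-object stand-ins for DOM elements. A real Element reports id === ''
// when it has no id attribute.
const withPlainParent = {id: 'amount', parentElement: {id: ''}};
const detached = {id: 'amount', parentElement: null};

/** Hypothetical helper: return the parent's id, or '' when there is no parent. */
function parentID(element) {
  return element.parentElement ? element.parentElement.id : '';
}

console.log(JSON.stringify(parentID(withPlainParent))); // ""
console.log(JSON.stringify(parentID(detached))); // ""

// Without the guard, reading .id from a null parent throws a TypeError.
let threw = false;
try {
  detached.parentElement.id; // eslint-disable-line no-unused-expressions
} catch (error) {
  threw = error instanceof TypeError;
}
console.log(threw); // true
```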

/** Scores fnode with 'price' in its class name */
hasPriceInClassName(fnode) {
const className = fnode.element.className;
const parentClassName = fnode.element.parentElement.className;
if (className.toLowerCase().includes('price')) {
return this.hasPriceInClassNameCoeff;
}
if (parentClassName.toLowerCase().includes('price')) {
return 0.75 * this.hasPriceInClassNameCoeff;
}
return DEFAULT_SCORE;
return this.contains(fnode.element.className, 'price', this.hasPriceInClassNameCoeff);
}

/** Scores fnode with 'price' in its class name */
Contributor

In its parent's class name

hasPriceInParentClassName(fnode) {
Collaborator

Good idea to break these each out into their own rules with their own coefficients for node versus parent node.

return this.contains(fnode.element.parentElement.className, 'price', this.hasPriceInParentClassNameCoeff);
}

/**
* Scores fnode by its vertical location relative to the fold
*/
isAboveTheFold(fnode, featureCoeff) {
const viewportHeight = window.innerHeight;
const top = fnode.element.getBoundingClientRect().top;
const upperHeightLimit = viewportHeight * 2;
const imageTop = fnode.element.getBoundingClientRect().top;

// If the node is below the fold by more than a viewport's length,
// return a low score.
if (top >= upperHeightLimit) {
return ZEROISH * featureCoeff;
}

// If the node is above the fold, return a high score.
if (top <= viewportHeight) {
return ONEISH * featureCoeff;
}

// Otherwise, scale the score linearly between the fold and a viewport's
// length below it.
const slope = (ONEISH - ZEROISH) / (viewportHeight - upperHeightLimit);
return (slope * (top - upperHeightLimit) + ZEROISH) * featureCoeff;
// Stop giving additional bonus for anything closer than 200px to the top
// of the viewport. Those are probably usually headers.
return trapezoid(imageTop, viewportHeight * 2, 200) ** featureCoeff;
}

/**
* Scores fnode based on its x distance from the highest scoring image element
* Return whether the centerpoint of the element is near that of the highest-
* scoring image.
*/
isNearbyImageXAxisPrice(fnode) {
const viewportWidth = window.innerWidth;
const eleDOMRect = fnode.element.getBoundingClientRect();
const imageElement = this.getHighestScoringImage(fnode);
const imageDOMRect = imageElement.getBoundingClientRect();
const deltaRight = eleDOMRect.left - imageDOMRect.right;
const deltaLeft = imageDOMRect.left - eleDOMRect.right;
// True if element is completely to the right or left of the image element
const noOverlap = (deltaRight > 0 || deltaLeft > 0);
let deltaX;
if (noOverlap) {
if (deltaRight > 0) {
deltaX = deltaRight;
} else {
deltaX = deltaLeft;
}
// Give a higher score the closer it is to the image, normalized by viewportWidth
return (viewportWidth / deltaX) * this.isNearbyImageXAxisPriceCoeff;
}
return DEFAULT_SCORE;
isNearImage(fnode) {
const image = this.getHighestScoringImage(fnode);
Contributor

Maybe we should be explicit about types?

const imageFnode = this.getHighestScoringImageFnode(fnode);

return trapezoid(euclidean(fnode, image), 1000, 0) ** this.isNearImageCoeff;
}

/**
* Scores fnode based on its y distance from the highest scoring image element
* Return whether the potential title is near the top or bottom of the
* highest-scoring image.
*
* This is a makeshift ORing 2 signals: a "near the top" and a "near the
* bottom" one.
*/
isNearbyImageYAxisTitle(fnode) {
const viewportHeight = window.innerHeight;
const DOMRect = fnode.element.getBoundingClientRect();
const imageElement = this.getHighestScoringImage(fnode);
const imageDOMRect = imageElement.getBoundingClientRect();
// Some titles (like on Ebay) are above the image, so include a top buffer
const isEleTopNearby = DOMRect.top >= (imageDOMRect.top - TOP_BUFFER);
const isEleBottomNearby = DOMRect.bottom <= imageDOMRect.bottom;
// Give elements in a specific vertical band a higher score
if (isEleTopNearby && isEleBottomNearby) {
const deltaY = Math.abs(imageDOMRect.top - DOMRect.top);
// Give a higher score the closer it is to the image, normalized by viewportHeight
return (viewportHeight / deltaY) * this.isNearbyImageYAxisTitleCoeff;
}
return DEFAULT_SCORE;
isNearImageTopOrBottom(fnode) {
const image = this.getHighestScoringImage(fnode).element;
const imageRect = image.getBoundingClientRect();
const nodeRect = fnode.element.getBoundingClientRect();

// Should cover title above image and title in a column next to image.
// Could also consider using the y-axis midpoint of title.
const topDistance = Math.abs(imageRect.top - nodeRect.top);

// Test nodeRect.top. They're probably not side by side with the title at
// the bottom. Rather, title will be below image.
const bottomDistance = Math.abs(imageRect.bottom - nodeRect.top);
Collaborator

I can't think of a product page I have seen where the title is below the image -- did you find some from developing your test corpus?


const shortestDistance = Math.min(topDistance, bottomDistance);
return trapezoid(shortestDistance, 200, 0) ** this.isNearbyImageYAxisTitleCoeff;
}
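
The "makeshift OR" in the doc comment above can be checked with a small sketch. In fuzzy logic, OR is commonly taken as the maximum of two truth values. Because the scale falls as distance grows, scaling the smaller distance gives the same result as taking the larger of the two scaled scores. `fallingScale()` is a simplified, hypothetical stand-in for `trapezoid(distance, zeroAt, 0)`, and `ONEISH = 0.9` is assumed.

```javascript
const ZEROISH = 0.08; // from the diff
const ONEISH = 0.9; // assumed value

// ONEISH at distance 0, falling linearly to ZEROISH at zeroAt and beyond.
function fallingScale(distance, zeroAt) {
  if (distance >= zeroAt) return ZEROISH;
  if (distance <= 0) return ONEISH;
  return ONEISH - (ONEISH - ZEROISH) * distance / zeroAt;
}

// Example distances in px from the title's top to the image's top and bottom.
const topDistance = 30;
const bottomDistance = 170;

// What isNearImageTopOrBottom() does: scale the shorter distance.
const viaMin = fallingScale(Math.min(topDistance, bottomDistance), 200);

// The explicit fuzzy OR: score each signal, then take the maximum.
const viaMax = Math.max(
  fallingScale(topDistance, 200),
  fallingScale(bottomDistance, 200),
);

console.log(viaMin === viaMax); // true
```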

/**
* Scores fnode whose innerText matches a priceish RegExp pattern
* Return whether the fnode's innertext contains a dollars-and-cents number.
*/
hasPriceishPattern(fnode) {
const text = fnode.element.innerText;
/**
* With an optional '$' that doesn't necessarily have to be at the beginning
* of the string (ex: 'US $5.00' on Ebay), matches any number of digits before
* a decimal point and exactly two after, where the two digits after the decimal point
* are at the end of the string
* a decimal point and exactly two after.
*/
const regExp = /\${0,1}\d+\.\d{2}$/;
if (regExp.test(text)) {
return this.hasPriceishPatternCoeff;
}
return DEFAULT_SCORE;
const regExp = /\$?\d+\.\d{2}(?![0-9])/;
Contributor

Should the negative lookahead be optional, for price strings at the end of the element text?

Contributor Author

No, because the negative lookahead succeeds when the string ends. (After all, there is indeed no digit there.)

Collaborator

So this addition, (?![0-9]), is allowing any non-digits to follow the two digits after the decimal point, instead of requiring that the end of the string be those two digits after the decimal point?

// Trying it out in a REPL...
const regExpBefore = /\${0,1}\d+\.\d{2}$/;
const regExpAfter = /\$?\d+\.\d{2}(?![0-9])/;
const sample = 'US $4.99e';
regExpBefore.test(sample); // false
regExpAfter.test(sample); // true
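
A few more cases, as a sketch, illustrate the point made earlier in this thread: the negative lookahead `(?![0-9])` succeeds at the end of the string, so it needs no optional marker, and it rejects a match only when a third decimal digit follows. The sample strings are made up for illustration.

```javascript
const priceish = /\$?\d+\.\d{2}(?![0-9])/;

// The lookahead succeeds at end of input: there is no digit after "00".
console.log(priceish.test('US $5.00')); // true

// Trailing non-digit text is allowed, unlike with the old `$` anchor.
console.log(priceish.test('$4.99 each')); // true

// A third decimal digit makes the lookahead fail, and no other start position matches.
console.log(priceish.test('1.234')); // false

// No digits-dot-two-digits sequence at all.
console.log(priceish.test('Free shipping')); // false
```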

return (regExp.test(text) ? ONEISH : ZEROISH) ** this.hasPriceishPatternCoeff;
}

/**
@@ -238,7 +200,7 @@ export default class RulesetFactory {
isNearbyImageYAxisPrice(fnode) {
const element = fnode.element;
const DOMRect = element.getBoundingClientRect();
const imageElement = this.getHighestScoringImage(fnode);
const imageElement = this.getHighestScoringImage(fnode).element;
const imageDOMRect = imageElement.getBoundingClientRect();
if (DOMRect.top >= (imageDOMRect.top - TOP_BUFFER)
&& DOMRect.bottom <= imageDOMRect.bottom) {
@@ -278,7 +240,7 @@ export default class RulesetFactory {
// better score the closer the element is to the top of the page
rule(type('imageish'), score(fnode => this.isAboveTheFold(fnode, this.isAboveTheFoldImageCoeff))),
// better score for larger images
rule(type('imageish'), score(this.largerImage.bind(this))),
rule(type('imageish'), score(this.imageIsBig.bind(this))),
// return image element(s) with max score
rule(type('imageish').max(), out('image')),

@@ -288,27 +250,31 @@ export default class RulesetFactory {
// consider all eligible h1 elements
rule(dom('h1').when(this.isEligibleTitle.bind(this)), type('titleish')),
// better score based on y-axis proximity to max scoring image element
rule(type('titleish'), score(this.isNearbyImageYAxisTitle.bind(this))),
rule(type('titleish'), score(this.isNearImageTopOrBottom.bind(this))),
// return title element(s) with max score
rule(type('titleish').max(), out('title')),

/**
* Price rules
*/
// 72% by itself, at [4, 4, 4, 4...]!:
// consider all eligible span and h2 elements
rule(dom('span, h2').when(this.isEligiblePrice.bind(this)), type('priceish')),
// check if the element has a '$' in its innerText
rule(type('priceish'), score(this.hasDollarSign.bind(this))),
// better score the closer the element is to the top of the page
rule(type('priceish'), score(fnode => this.isAboveTheFold(fnode, this.isAboveTheFoldPriceCoeff))),

// check if the id has "price" in it
rule(type('priceish'), score(this.hasPriceInID.bind(this))),
rule(type('priceish'), score(this.hasPriceInParentID.bind(this))),
// check if any class names have "price" in them
rule(type('priceish'), score(this.hasPriceInClassName.bind(this))),
rule(type('priceish'), score(this.hasPriceInParentClassName.bind(this))),
// better score for larger font size
rule(type('priceish'), score(this.largerFontSize.bind(this))),
rule(type('priceish'), score(this.fontIsBig.bind(this))),
// better score based on x-axis proximity to max scoring image element
rule(type('priceish'), score(this.isNearbyImageXAxisPrice.bind(this))),
rule(type('priceish'), score(this.isNearImage.bind(this))),
// check if innerText has a priceish pattern
rule(type('priceish'), score(this.hasPriceishPattern.bind(this))),
// return price element(s) with max score
@@ -331,6 +297,33 @@ export default class RulesetFactory {
}

getHighestScoringImage(fnode) {
return fnode._ruleset.get('image')[0].element; // eslint-disable-line no-underscore-dangle
return fnode._ruleset.get('image')[0]; // eslint-disable-line no-underscore-dangle
}
}

/**
* Scale a number to the range [ZEROISH, ONEISH].
*
* For a rising trapezoid, the result is ZEROISH until the input reaches
* zeroAt, then increases linearly until oneAt, at which it becomes ONEISH. To
* make a falling trapezoid, where the result is ONEISH to the left and ZEROISH
* to the right, use a zeroAt greater than oneAt.
*/
function trapezoid(number, zeroAt, oneAt) {
Contributor

trapezoid is an unclear name for what this is doing, or at least it requires some explanation; the first time I ran into this particular function I had to read it like 3 times to understand what it was doing. A signature like linearScale(number, {zeroAt: ZEROISH, oneAt: ONEISH}) would go a long way towards making this understandable.

Contributor

Alternatively, we could go with the more semantic meaning of this, which is something like:

return STRONG_MATCH; // rename ONEISH to STRONG_MATCH
return WEAK_MATCH; // rename ZEROISH to WEAK_MATCH
return scaledMatch(number, {weak: 15, strong: 7});

Contributor Author

I like your linearScale spelling. Changing. I'm not sure what you have in mind with defaulting the zeroAt param to ZEROISH and oneAt to ONEISH. ZEROISH and ONEISH are the output values of linearScale, not typically the input ones.

Contributor

I didn't mean to default them, rather was just putting them in as rando arguments.

const isRising = zeroAt < oneAt;
if (isRising) {
if (number <= zeroAt) {
return ZEROISH;
} else if (number >= oneAt) {
return ONEISH;
}
} else {
if (number >= zeroAt) {
return ZEROISH;
} else if (number <= oneAt) {
return ONEISH;
}
}
const slope = (ONEISH - ZEROISH) / (oneAt - zeroAt);
return slope * (number - zeroAt) + ZEROISH;
}
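
Since the review thread splits the function across several comments, here is a compact, self-contained sketch of the same logic under the `linearScale` name adopted in the thread, with one rising and one falling use. It keeps the positional signature from the diff, and `ONEISH = 0.9` is an assumption because that constant is not visible in this view.

```javascript
const ZEROISH = 0.08; // from the diff
const ONEISH = 0.9; // assumed value

/** Scale a number to [ZEROISH, ONEISH], rising if zeroAt < oneAt, else falling. */
function linearScale(number, zeroAt, oneAt) {
  const isRising = zeroAt < oneAt;
  if (isRising ? number <= zeroAt : number >= zeroAt) return ZEROISH;
  if (isRising ? number >= oneAt : number <= oneAt) return ONEISH;
  const slope = (ONEISH - ZEROISH) / (oneAt - zeroAt);
  return slope * (number - zeroAt) + ZEROISH;
}

// Rising, as in imageIsBig(): bigger areas earn more confidence, up to a cap.
console.log(linearScale(50 ** 2, 80 ** 2, 1000 ** 2)); // clamped to ZEROISH
console.log(linearScale(2000 ** 2, 80 ** 2, 1000 ** 2)); // clamped to ONEISH

// Falling, as in isNearImage(): shorter distances earn more confidence.
console.log(linearScale(0, 1000, 0)); // ONEISH
console.log(linearScale(1500, 1000, 0)); // ZEROISH
console.log(linearScale(500, 1000, 0)); // halfway between ZEROISH and ONEISH
```

Clamping both ends is what keeps a single extreme measurement, such as a huge background image, from overwhelming the other rules.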