Skip to content
This repository has been archived by the owner on Dec 3, 2020. It is now read-only.

92% price accuracy #275

Merged
merged 42 commits into from
Nov 20, 2018
Merged
Show file tree
Hide file tree
Changes from 9 commits
Commits
Show all changes
42 commits
Select commit Hold shift + click to select a range
2558ea5
Get trainer running on image out-rule from webext-commerce.
erikrose Oct 5, 2018
ca8d4c8
Add a trainee for each out() rule so we can choose them from the Trai…
erikrose Oct 19, 2018
ce27b0d
Respell a regex for clarity.
erikrose Oct 25, 2018
d8e6f43
Bring up to date with 9813ba8b59e6125b9ab18f51499e47bb2ec55745 in htt…
erikrose Oct 25, 2018
46b61d9
Rewrite isAboveTheFold using trapezoid() and fuzzy-logic scores.
erikrose Oct 25, 2018
2eb12b8
Refactor largerImage as well. This completes the image coefficients. …
erikrose Oct 25, 2018
51e6b30
Rewrite image y-axis scorer to constrain to 0..1 and for simplicity.
erikrose Oct 26, 2018
6909520
Retrain to fix the priceish coeff for isAboutTheFold().
erikrose Oct 29, 2018
8d38da1
Change rules that look for "price" in IDs and classes to emit fuzzy c…
erikrose Oct 29, 2018
525bd54
Re-express font-size rule as a confidence.
erikrose Oct 29, 2018
b567dd9
Rewrite rule that give a bonus to prices near the winning image.
erikrose Oct 29, 2018
90423da
Express hasPriceishPattern as a fuzzy truth.
erikrose Oct 29, 2018
74658e9
Fix the bugs that immediately kept the trainer from training.
erikrose Oct 29, 2018
c52af18
Remove a now-unused constant and an out-of-date comment.
erikrose Oct 30, 2018
737ca67
Add new coeffs to get to 100% on the training set!
erikrose Oct 30, 2018
47bbc7a
Rename hasPriceIn, since it doesn't actually have "price" hard-coded …
erikrose Nov 14, 2018
307fa3d
Consider divs with background images as well as img tags.
erikrose Nov 14, 2018
e1f4479
Typos
erikrose Nov 14, 2018
f0eba0d
There's no need to say "node". All scoring functions take nodes.
erikrose Nov 14, 2018
3fc5856
Add a rule to punish extreme aspect ratios and another to punish back…
erikrose Nov 14, 2018
dd16bf0
Hard-code a height for aboveTheFold so a user's different window size…
erikrose Nov 14, 2018
418a9cf
Make the image trainee train only the image-affecting coeffs, for speed.
erikrose Nov 14, 2018
3486879
Move tuned image coeffs into master vector.
erikrose Nov 15, 2018
9217d83
Make a more efficient training vector for price.
erikrose Nov 15, 2018
89da723
Improve price coeffs: 100% on 12-16.
erikrose Nov 15, 2018
0c888cb
Improve price coeffs. 93.3% on 12-16, 1-10. 92% on 1-25.
erikrose Nov 15, 2018
1735fe3
Improve price coeffs: 98.7% on all current training samples (1-25 and…
erikrose Nov 15, 2018
82709cf
Copy tuned price coeffs to master vector.
erikrose Nov 16, 2018
ac1d304
Put the glue code back how I found it, and move the coeffs back into …
erikrose Nov 16, 2018
ba89ed4
Fix a mistranscribed coefficient.
erikrose Nov 16, 2018
fd6b043
Make linter happy.
erikrose Nov 16, 2018
6ce3ea9
Rename trapezoid() to linearScale().
erikrose Nov 20, 2018
5d55d10
Use single-line doclets where possible. Put a newline after double as…
erikrose Nov 20, 2018
45db392
Rename contains() functions to indicate what returns bools and what r…
erikrose Nov 20, 2018
e9edf39
Un-inline the aspect ratio rule.
erikrose Nov 20, 2018
c0570d3
Un-inline hasBackgroundInID().
erikrose Nov 20, 2018
a57656a
Stick types in local names.
erikrose Nov 20, 2018
4cca17a
Remove unneeded !!.
erikrose Nov 20, 2018
ea275aa
Teach the application bits how to extract from CSS background-images.
erikrose Nov 20, 2018
57ca295
Damn you, linter.
erikrose Nov 20, 2018
aa4bd88
Use slice() for brevity and great justice.
erikrose Nov 20, 2018
3c6649a
Merge branch 'master' into 90%-price
Osmose Nov 20, 2018
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 15 additions & 1 deletion src/extraction/fathom/index.js
Original file line number Diff line number Diff line change
Expand Up @@ -32,8 +32,22 @@ const FEATURE_DEFAULTS = {
const PRODUCT_FEATURES = {
image: {
...FEATURE_DEFAULTS,
/**
* @return the URL of an image resource
*/
getValueFromElement(element) {
return element.src;
/**
* Given a CSS url() declaration 'url("http://foo")', return 'http://foo'.
*/
function urlFromCssDeclaration(declaration) {
return declaration.substring(5, declaration.length - 2);
erikrose marked this conversation as resolved.
Show resolved Hide resolved
}
if (element.tagName === 'IMG') {
return element.src;
}
// The other thing the ruleset can return is an arbitrary element with
// a CSS background image.
return urlFromCssDeclaration(getComputedStyle(element)['background-image']);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit of a nit, but I've always found that function declarations at the top break the flow of the function and make it harder to read. The function name helps the readability, but a variable can do just as well without interrupting the order:

const backgroundImage = getComputedStyle(element)['background-image'];
return backgroundImage.substring(5, background.length - 2);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted this to be ScreechinglyObviousCode. With magic numbers like 5 and -2, it's not otherwise screechingly obvious that what we're trying to do is pull the param out of url("…").

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An extra variable name, then?

const backgroundImage = getComputedStyle(element)['background-image'];
const backgroundImageUrl = backgroundImage.substring(5, background.length - 2); // "url('<image_url>')"
return backgroundImageUrl;

(Or not. This is very much yak shaving at this point.)

},
},
title: {
Expand Down
93 changes: 50 additions & 43 deletions src/extraction/fathom/ruleset_factory.js
Original file line number Diff line number Diff line change
Expand Up @@ -38,9 +38,7 @@ export default class RulesetFactory {
] = coefficients;
}

/**
* Scores fnode in direct proportion to its size
*/
/** Scores fnode in direct proportion to its size */
isBig(fnode) {
const domRect = fnode.element.getBoundingClientRect();
const area = domRect.width * domRect.height;
Expand All @@ -50,74 +48,79 @@ export default class RulesetFactory {
// (though we should have distinct penalties for that sort of thing if we
// care). More importantly, clamp the upper bound of the score so we don't
// overcome other bonuses and penalties.
return trapezoid(area, 80 ** 2, 1000 ** 2) ** this.bigImageCoeff;
return linearScale(area, 80 ** 2, 1000 ** 2) ** this.bigImageCoeff;
}

/** Return whether the computed font size of an element is big. */
fontIsBig(fnode) {
const size = parseInt(getComputedStyle(fnode.element).fontSize, 10);
return trapezoid(size, 14, 50) ** this.bigFontCoeff;
return linearScale(size, 14, 50) ** this.bigFontCoeff;
}

/**
* Scores fnode with a '$' in its innerText
*/
/** Scores fnode with a '$' in its innerText */
hasDollarSign(fnode) {
return (fnode.element.innerText.includes('$') ? ONEISH : ZEROISH) ** this.hasDollarSignCoeff;
}

/** Return whether some substring is within a given string, case insensitively. */
doesContain(haystack, needle) {
/**
* Return whether some substring is within a given string, case
* insensitively.
*/
caselessIncludes(haystack, needle) {
erikrose marked this conversation as resolved.
Show resolved Hide resolved
return haystack.toLowerCase().includes(needle);
}

/** Return a weighted confidence of whether a substring is within a given
* string, case insensitively.
/**
* Return a weighted confidence of whether a substring is within a given
* string, case insensitively.
*/
contains(haystack, needle, coeff) {
return (this.doesContain(haystack, needle) ? ONEISH : ZEROISH) ** coeff;
weightedIncludes(haystack, needle, coeff) {
return (this.caselessIncludes(haystack, needle) ? ONEISH : ZEROISH) ** coeff;
}

/**
* Scores fnode with 'price' in its id
* Punish elements with "background" in their ID. Do nothing to those without.
*/
hasBackgroundInID(fnode) {
return this.caselessIncludes(fnode.element.id, 'background') ? (ZEROISH ** this.backgroundIdImageCoeff) : 1;
}

/** Scores fnode with 'price' in its id */
hasPriceInID(fnode) {
return this.contains(fnode.element.id, 'price', this.hasPriceInIDCoeff);
return this.weightedIncludes(fnode.element.id, 'price', this.hasPriceInIDCoeff);
}

hasPriceInParentID(fnode) {
return this.contains(fnode.element.parentElement.id, 'price', this.hasPriceInParentIDCoeff);
return this.weightedIncludes(fnode.element.parentElement.id, 'price', this.hasPriceInParentIDCoeff);
}

/** Scores fnode with 'price' in its class name */
hasPriceInClassName(fnode) {
return this.contains(fnode.element.className, 'price', this.hasPriceInClassNameCoeff);
return this.weightedIncludes(fnode.element.className, 'price', this.hasPriceInClassNameCoeff);
}

/** Scores fnode with 'price' in its class name */
/** Scores fnode with 'price' in its parent's class name */
hasPriceInParentClassName(fnode) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea to break these each out into their own rules with their own coefficients for node versus parent node.

return this.contains(fnode.element.parentElement.className, 'price', this.hasPriceInParentClassNameCoeff);
return this.weightedIncludes(fnode.element.parentElement.className, 'price', this.hasPriceInParentClassNameCoeff);
}

/**
* Scores fnode by its vertical location relative to the fold
*/
/** Scores fnode by its vertical location relative to the fold */
isAboveTheFold(fnode, featureCoeff) {
const viewportHeight = 950;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The viewport height and/or width shows up in a lot of different places (here, trainees.js, price_updates.js)... Maybe we could have a Fathom config in ./src/extraction/fathom with these values? Not sure what else would go in there though -- what do you think @Osmose ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need a separate config for them, but putting them in the main config seems fine to me.

const imageTop = fnode.element.getBoundingClientRect().top;

// Stop giving additional bonus for anything closer than 200px to the top
// of the viewport. Those are probably usually headers.
return trapezoid(imageTop, viewportHeight * 2, 200) ** featureCoeff;
return linearScale(imageTop, viewportHeight * 2, 200) ** featureCoeff;
}

/**
* Return whether the centerpoint of the element is near that of the highest-
* scoring image.
*/
isNearImage(fnode) {
const image = this.getHighestScoringImage(fnode);
return trapezoid(euclidean(fnode, image), 1000, 0) ** this.isNearImageCoeff;
const imageFnode = this.getHighestScoringImage(fnode);
return linearScale(euclidean(fnode, imageFnode), 1000, 0) ** this.isNearImageCoeff;
}

/**
Expand All @@ -128,8 +131,8 @@ export default class RulesetFactory {
* bottom" one.
*/
isNearImageTopOrBottom(fnode) {
const image = this.getHighestScoringImage(fnode).element;
const imageRect = image.getBoundingClientRect();
const imageElement = this.getHighestScoringImage(fnode).element;
const imageRect = imageElement.getBoundingClientRect();
const nodeRect = fnode.element.getBoundingClientRect();

// Should cover title above image and title in a column next to image.
Expand All @@ -141,7 +144,7 @@ export default class RulesetFactory {
const bottomDistance = Math.abs(imageRect.bottom - nodeRect.top);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't think of a product page I have seen where the title is below the image -- did you find some from developing your test corpus?


const shortestDistance = Math.min(topDistance, bottomDistance);
return trapezoid(shortestDistance, 200, 0) ** this.isNearbyImageYAxisTitleCoeff;
return linearScale(shortestDistance, 200, 0) ** this.isNearbyImageYAxisTitleCoeff;
}

/**
Expand All @@ -158,9 +161,7 @@ export default class RulesetFactory {
return (regExp.test(text) ? ONEISH : ZEROISH) ** this.hasPriceishPatternCoeff;
}

/**
* Checks to see if a 'priceish' fnode is eligible for scoring
*/
/** Checks to see if a 'priceish' fnode is eligible for scoring */
isEligiblePrice(fnode) {
return (
this.isVisible(fnode)
Expand All @@ -169,9 +170,7 @@ export default class RulesetFactory {
);
}

/**
* Checks to see if a 'titleish' fnode is eligible for scoring
*/
/** Checks to see if a 'titleish' fnode is eligible for scoring */
isEligibleTitle(fnode) {
return (
this.isVisible(fnode)
Expand Down Expand Up @@ -237,7 +236,9 @@ export default class RulesetFactory {

hasBackgroundImage(fnode) {
const bgImage = getComputedStyle(fnode.element)['background-image'];
return !!bgImage && bgImage !== 'none';
return bgImage && bgImage.startsWith('url("') && bgImage.endsWith('")');
// The app doesn't yet know how to deal with non-url() types of background
// images.
erikrose marked this conversation as resolved.
Show resolved Hide resolved
}

/**
Expand All @@ -249,6 +250,11 @@ export default class RulesetFactory {
return (rect.width > rect.height) ? (rect.width / rect.height) : (rect.height / rect.width);
}

/** Give a bonus for elements that have a non-extreme aspect ratio. */
hasSquareAspectRatio(fnode) {
return linearScale(this.aspectRatio(fnode.element), 10, 5) ** this.extremeAspectCoeff;
}

/**
* Using coefficients passed into the constructor method, returns a weighted
* ruleset used to score elements in an HTML document.
Expand All @@ -267,13 +273,14 @@ export default class RulesetFactory {
rule(type('imageish'), score(fnode => this.isAboveTheFold(fnode, this.isAboveTheFoldImageCoeff))),
// better score for larger images
rule(type('imageish'), score(this.isBig.bind(this))),
// punishment for extreme aspect ratios, to filter out banners or nav elements
rule(type('imageish'), score(fnode => trapezoid(this.aspectRatio(fnode.element), 10, 5)
** this.extremeAspectCoeff)),
// bonus for non-extreme aspect ratios, to filter out banners or nav elements
// TODO: Meant to make this a penalty, but it turns out to work as is.
// Try as a penalty.
rule(type('imageish'), score(this.hasSquareAspectRatio.bind(this))),
// no background images, even ones that have reasonable aspect ratios
// TODO: If necessary, also look at parents. I've seen them say
// "background" in their IDs as well.
rule(type('imageish'), score(fnode => (this.doesContain(fnode.element.id, 'background') ? (ZEROISH ** this.backgroundIdImageCoeff) : 1))),
rule(type('imageish'), score(this.hasBackgroundInID.bind(this))),
// return image element(s) with max score
rule(type('imageish').max(), out('image')),

Expand Down Expand Up @@ -337,12 +344,12 @@ export default class RulesetFactory {
/**
* Scale a number to the range [ZEROISH, ONEISH].
*
* For a rising trapezoid, the result is ZEROISH until the input reaches
* For a rising line, the result is ZEROISH until the input reaches
* zeroAt, then increases linearly until oneAt, at which it becomes ONEISH. To
* make a falling trapezoid, where the result is ONEISH to the left and ZEROISH
* make a falling line, where the result is ONEISH to the left and ZEROISH
* to the right, use a zeroAt greater than oneAt.
*/
function trapezoid(number, zeroAt, oneAt) {
function linearScale(number, zeroAt, oneAt) {
erikrose marked this conversation as resolved.
Show resolved Hide resolved
const isRising = zeroAt < oneAt;
if (isRising) {
if (number <= zeroAt) {
Expand Down