-
Notifications
You must be signed in to change notification settings - Fork 15
#36: Integrate Fathom-based page extraction with a simple ruleset. #38
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
Along with the feedback you'll want to rebase, since the sidebar removal landed.
src/fathom_ruleset.js
Outdated
* License, v. 2.0. If a copy of the MPL was not distributed with this | ||
* file, You can obtain one at http://mozilla.org/MPL/2.0/. */ | ||
|
||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Double asterisk for the doc comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doc comment is an excellent summary of the file!
src/fathom_ruleset.js
Outdated
*/ | ||
function formatPrice(priceString) { | ||
const formattedPriceStr = priceString.replace('$', '').replace(/([\s]|[^0-9$.-])/g, ''); | ||
return parseFloat(formattedPriceStr.substr(formattedPriceStr.indexOf('$') + 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Floats are unsuitable for representing price because they cannot accurately represent decimal numbers (that article is a bit wordy, you don't necessarily need to read all of it).
We're not actually doing anything with the prices yet anyway; for now just returning the string matched by Fathom is plenty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just removed this function entirely for now.
src/product_info.js
Outdated
from: 'content', | ||
subject: 'ready', | ||
data: { | ||
price: runRuleset('price', tuningRoutines.price.coeffs) || extractData().price, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may behoove us to define an API for price extraction that's common between the two different methods, and use that here instead of hardcoding them. Something like:
const extractedProduct = fathomExtraction.extractProduct(window.document);
// extractedProduct == {price: '$4.95'}
That would make coding a fallback a bit easier if we want it, and also let us configure different extraction method later on without modifying the rest of the code.
src/product_info.js
Outdated
let gotText; | ||
if (feature === 'price') { | ||
// strip whitespace, dollar sign, words, and trailing zeros when comparing price | ||
gotText = tuningRoutine(...coeffs)(window.document).map(fnode => fnode.element)[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is really difficult to follow. You have to follow the execution through tuningRoutine
returning a function that takes a document object and returns a list of fnodes (which are a fathom-specific thing), which we then extract dom nodes from, and then grab just the first one (can more than one even be returned?).
I'm not a fan of the design of this API. there's no reason for runRuleset
to take the coefficients as an argument; in this WebExtension we're never going to call that function with different coefficients. Also the coefficients are being imported from the same file, why couldn't the tuning routine just use them directly?
I think this file should be loosely coupled with the details of extraction. This file is more concerned with triggering the extraction and then sending that data back to the background script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the exported method from fathom_ruleset.js
so that it simply returns the highest scoring element for the given feature in the given doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and then grab just the first one (can more than one even be returned?
Yes. Fathom is returning all elements with the highest score. Because our ruleset (with it's one naive rule) is really bad right now, and on some pages, there is no div element with a class of "price", it is returning all div elements in that doc.
If the list has multiple elements, then I imagine we are not confident in Fathom's results, and should use our fallback method, where applicable? That's what I have now in the updated PR. What would you recommend @Osmose ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be a way to get the scores of the dom elements returned. We could have a minimum threshold below which we consider the matches to not be worth returning; for our simple ruleset this can just be whatever the score of a matching element would be.
Depending on the rules, maybe we also want to rule out ancestors? I could see a parent having the same innerText
as a child and getting a similar score depending on the ruleset.
I suspect if we have a threshold, return the maximum-ranking results, and still find more than one matching dom element, it's prudent to just choose the first matching one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay thanks for these thoughts.
I added a threshold based on the score for the element, and return the first element in the list if more than one exceed it.
Re: this comment:
Depending on the rules, maybe we also want to rule out ancestors? I could see a parent having the same innerText as a child and getting a similar score depending on the ruleset.
I agree that's a good consideration to include, but I am thinking that's best for the follow-up PR where I add a more detailed ruleset, since there is currently no innerText
rule in this PR's simple ruleset.
src/fathom_ruleset.js
Outdated
|
||
import {dom, out, rule, ruleset, score, type} from 'fathom-web'; | ||
|
||
const tuningRoutines = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This assumes that we have a different ruleset per attribute; I thought we were going to use a single ruleset with multiple types?
src/fathom_ruleset.js
Outdated
import {dom, out, rule, ruleset, score, type} from 'fathom-web'; | ||
|
||
const tuningRoutines = { | ||
price: {routine: tunedPriceFnodes, coeffs: [2]}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm weary about coefficients being stored as an ordered list instead of, say, a named mapping. The reason Swathi did this in her code is because by default the simulated annealer deals with lists of coefficients, but when we add in annealing scripts to the codebase we can use Maps to get around that while still storing coefficients as named values.
What do you think about loading the coefficients from a JSON file?
{
"hasDivWithPriceClass": 1
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would Maps be better than just importing the JSON object and getting the corresponding coefficient such as coeff.hasDivWithPriceClass
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right that we should just import the JSON class here. What I meant was that the annealing API deals with lists of coefficients. Because key ordering in JS objects isn't guaranteed, you have to do some work to make sure the coefficients you're optimizing and put in and taken out of the lists that the optimization works with in the same order.
Maps could theoretically help since their keys are ordered by insertion, but @erikrose mentioned that Maps might be a perf hit during the annealing process depending on how we use them.
In any case, this is a moot point until we get around to adding in an optimization script to the repo, so we can just ignore Maps for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I don't think you'll need to add any sort of optimization script to your repo; that's what the FathomFox Trainer is for. (The randomTransition() function of the optimizer will probably have to become pluggable or at least selectable, however, depending on what sort of math you're doing with your coefficients. Right now, it can reach integral values, even ones <0, which suits 0..1 bounded scores coming out of each scoring callback.)
src/fathom_ruleset.js
Outdated
* Ruleset for product prices | ||
*/ | ||
function tunedPriceFnodes(coeffHasDivWithPriceClass = 1) { | ||
function hasDivWithPriceClass(fnode) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defining these inside another function make them impossible to test. When we get into using regexes and other more complex rules I suspect unit tests of rules may have some value. Can we define these at a global level somehow?
src/fathom_ruleset.js
Outdated
return tuningRoutine; | ||
} | ||
|
||
export {tuningRoutines, formatPrice}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be able to just put an export
keyword next to each of these instead of having an export line at the bottom.
Also, actually, I wasn't aware this syntax worked; I would have expected it to complain that we had an unnamed, non-default export. Interesting.
src/product_info.js
Outdated
subject: 'ready', | ||
data: { | ||
price: runRuleset('price', tuningRoutines.price.coeffs) || extractData().price, | ||
url: window.document.URL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Huh, are URL objects serializable such that they can be sent in a postMessage
? TIL
This patch will run Fathom against the page (not distinguishing a product from a non-product page) and log the extracted price value and page URL to the console via 'background.js'. Failing that, it will fall back to extraction via CSS selectors if any exist for the site in 'product_extraction_data.json', and failing that, it will try extraction via Open Graph meta tags. This is heavily based on [Swathi Iyer](https://github.com/swathiiyer2/fathom-products) and [Victor Ng’s](https://github.com/mozilla/fathom-webextension) prior work. Currently, there is only one ruleset with one naive rule for one product feature, price. This initial commit is intended cover Fathom integration into the web extension. A later commit will add rules and take training data into account. Note: The 'runRuleset' method in 'productInfo.js' returns 'NaN' if it doesn't find any elements for any of its rules. Performance observations: Originally, I had dumped Swathi's three rulesets (one each for product title, image and price) and tried to run them against any page, similar to Victor Ng's web extension. However, that was [freezing up the tab](#36 (comment)), and after profiling the content script Fathom was running in before and after replacing Swathi's rulesets with a single ruleset with only one rule for one attribute, I did not see any warnings from Firefox, nor detect any significant performance hits in the DevTools profiler due to Fathom. It would therefore appear the performance hit was related to the complex rulesets and not Fathom itself. Webpack observations: While [`jsdom`](https://www.npmjs.com/package/jsdom) is a `fathom-web` dependency, it is used only for running `fathom-web` in the Node context for testing. To avoid build errors associated with `jsdom` and its dependencies, I added a `’null-loader’` for that `require` call, which mocks the module as an empty object. This loader is also used in webpack.config.test.js, from PR #32.
2d56603
to
d364866
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some replies and questions for you for my second commit.
src/fathom_ruleset.js
Outdated
import {dom, out, rule, ruleset, score, type} from 'fathom-web'; | ||
|
||
const tuningRoutines = { | ||
price: {routine: tunedPriceFnodes, coeffs: [2]}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would Maps be better than just importing the JSON object and getting the corresponding coefficient such as coeff.hasDivWithPriceClass
?
src/fathom_ruleset.js
Outdated
*/ | ||
function formatPrice(priceString) { | ||
const formattedPriceStr = priceString.replace('$', '').replace(/([\s]|[^0-9$.-])/g, ''); | ||
return parseFloat(formattedPriceStr.substr(formattedPriceStr.indexOf('$') + 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just removed this function entirely for now.
src/product_info.js
Outdated
let gotText; | ||
if (feature === 'price') { | ||
// strip whitespace, dollar sign, words, and trailing zeros when comparing price | ||
gotText = tuningRoutine(...coeffs)(window.document).map(fnode => fnode.element)[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the exported method from fathom_ruleset.js
so that it simply returns the highest scoring element for the given feature in the given doc.
src/product_info.js
Outdated
let gotText; | ||
if (feature === 'price') { | ||
// strip whitespace, dollar sign, words, and trailing zeros when comparing price | ||
gotText = tuningRoutine(...coeffs)(window.document).map(fnode => fnode.element)[0]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and then grab just the first one (can more than one even be returned?
Yes. Fathom is returning all elements with the highest score. Because our ruleset (with it's one naive rule) is really bad right now, and on some pages, there is no div element with a class of "price", it is returning all div elements in that doc.
If the list has multiple elements, then I imagine we are not confident in Fathom's results, and should use our fallback method, where applicable? That's what I have now in the updated PR. What would you recommend @Osmose ?
@Osmose , Ready for another look. The biggest open question is what to do when Fathom returns a multi-item list of elements, all having the highest score for a given product feature; see my Comment Review for more details. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this changes, this is getting pretty close. Nice work!
src/fathom_ruleset.js
Outdated
* Extracts the highest scoring element for a given feature contained | ||
* in a page's HTML document. | ||
*/ | ||
export default function runTuningRoutine(doc) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ruleset isn't really a "tuning routine". Maybe rename this runRuleset
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I'm not really sure what "tuningRoutine" means in this context, but we certainly aren't tuning anything here. This was a relic of Swathi's/Victor's code.
src/fathom_ruleset.js
Outdated
); | ||
|
||
/** | ||
* Extracts the highest scoring element for a given feature contained |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "given feature" bit of this comment doesn't really apply anymore. There's no feature being passed in anymore, it just returns all the features we currently know how to extract.
src/product_info.js
Outdated
for (const [vendor, attributeInfo] of Object.entries(extractionData)) { | ||
if (hostname.includes(vendor)) { | ||
return attributeInfo; | ||
const fallbackExtraction = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we define each of these extraction objects in separate files and import them? That'd help focus this module a bit more.
For the Fathom one, you can probably just define that in the existing fathom file and export it instead of the "return DOM nodes" function.
src/product_info.js
Outdated
const metaEle = document.querySelector(`meta[property='${value}']`); | ||
if (metaEle) { | ||
data[key] = metaEle.getAttribute('content'); | ||
data.url = window.document.URL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This extraction doesn't change per-extraction-method, so we can just add it in in getProductInfo
instead of duplicating it.
Since it's possible for multiple elements to be returned by Fathom for a given ruleset applied to a page, I applied a threshold (currently just set as the score for the only rule we have, 'hasDivWithPriceClass') to help provide an additional measure of confidence. If more than one element meet or exceed the threshold, the first element in the list is returned.
Okay @Osmose , added the score threshold; ready for another pass from you. |
45455d4
to
afcb0bc
Compare
afcb0bc
to
a99d3ba
Compare
Sorry for the confusion @Osmose ; this should be ready to go now. The last two commits should incorporate all of your feedback from comments last night and the second review this morning. |
This commit builds off of PR #38, so that PR should merge before this. Open questions: * How to test interdependent rules, such as 'isNearProductImage' for the product title and product price candidate elements? * The only feature that is pulled out accurately on my test page (an Amazon product page) is the image. What rules can I add/modify to get title and price correct? TODO: * Add rule to remove ancestor elements who have the same 'innerText' value. * Consider adding image rule to see if an image element is the largest image on the page (above the fold). * Add price rule to see if innerText starts with '$'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, nice work!
src/fathom_extraction.js
Outdated
const SCORE_THRESHOLD = fathomCoeffs.hasDivWithPriceClass; | ||
|
||
/** | ||
* Checks to see if an element is a <div> with a class of "price". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This description doesn't need to be quite so long. It's pretty clear that rules return coefficients in the context of Fathom, and this doesn't only score <div>
s, it scores any DOM element. A one-liner is fine: Scores fnodes with a "price" class
.
Probably also want to update the name of this to hasPriceClass
or similar.
0a1b328
to
505c16d
Compare
@Osmose , ready for your review.
Follow-up PR:
Add a ruleset (and coefficients) to extract product information as described in #36 with the target accuracy based on a training set.
Possible follow-up issue:
Currently, if Fathom finds no price elements and
extractData
is executed on a non-product page belonging to a domain we have CSS selectors for inproduct_extraction_data.json
(e.g. amazon.com), it will try to find elements based on the selectors, find nothing, and will throw an error. I.e. in this situation,extractData
will return nothing (not evennull
) togetProductInfo
. Is that the behavior that we want? This smells like a potential follow-up issue.One possible fix (and something we will likely end up doing anyway for performance considerations) is to create a ruleset for identifying a product page, so the extraction code only executes on those types of pages.