Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Combine efforts with enthec/webappanalyzer #52

Open
derekperkins opened this issue Jul 25, 2024 · 9 comments
Open

Combine efforts with enthec/webappanalyzer #52

derekperkins opened this issue Jul 25, 2024 · 9 comments

Comments

@derekperkins
Copy link

derekperkins commented Jul 25, 2024

After lots of looking, the two most active forks are this one and https://github.com/enthec/webappanalyzer. I was wondering if it made sense to join forces in a single repo.

cc @enthec

@RignonNoel
Copy link

As a community project that had a really big down, I think that the flag of @derekperkins is pretty legit and important.

Maybe that @tunetheweb and @max-ostapenko that do a lot of contribution here could also make an answer ?

If we don't communicate nor collaborate, I don't see the point to re-create any fork of this tech.

@tunetheweb
Copy link
Member

We evaluated the options in the doc referenced on that linked issue including 1) licensing, 2) using any well-maintained fork that sprung up from the community, or 3) maintaining our own fork on a best-efforts basis as and when members of the community opened PRs. That last option is what we went with and has been working pretty well for us. @max-ostapenko even automated CI checks since with our WebPageTest's instance to give us further confidence in merging, and reduce the maintenance overhead.

We're not looking to maintain a Wappalyzer fork and only have the limited resources to commit to maintain this fork, as noted in that doc.

We'll be happy to evaluate any other fork that does spring up and note that Enthec's fork was one of the most active ones (though that seems to have died off a little in last few months). However, they have also made some changes that may make it incompatible with our usage. So, at present there are a few concerns with moving to that fork and, as I said, our current decision is working out well enough for us.

So, TLDR, no plans to combine efforts with others at the moment but happy to leave this issue open in case the situation changes.

@enthec-opensource
Copy link

Hi!

We see the issue and we totally understand it, we plan to keep updating it, we just have our progress -- we are doing less commits but bigger than before also trying to keep up with the original one

Not sure about how your CI works but we have fully automated it aswell following the original structure.

About those refactors they were about inconsistencies in the structure and the difficulty for strongly typed languages such as Golang to map those values properly, we decided to convert those ambiguous values to one, for example "implies" can be both string and list, so we just converted it to list even if it only has 1 element, we found a lot of those while doing CI and have had some issues opened suggesting making it easier -- whatever implementation for the original wappalyzer you used it had to work with "string or string list" so those changes shouldn't make any difference. There are some that we are unable to do without breaking old wappalyzer implementations such as "dom" so we just tried to reduce them as much as possible.

We also plan on doing libs, like the python one we already have and have someone hired publishing but thats on the todo list.

We are really interested in getting as many techs as possible to make it as accurate as possible, maybe we can work something out about syncing technologies between repos.

Happy to talk to you, lets see how this evolves, we will leave our issue open aswell just if anything changes.

@derekperkins
Copy link
Author

we decided to convert those ambiguous values to one, for example "implies" can be both string and list, so we just converted it to list even if it only has 1 element

We had an internal fork that did this, since it is annoying to deal with.

There are some that we are unable to do without breaking old wappalyzer implementations such as "dom" so we just tried to reduce them as much as possible

Are there any specifics you have in mind that aren't backwards compatible?

@tunetheweb
Copy link
Member

tunetheweb commented Nov 4, 2024

Not sure about how your CI works but we have fully automated it aswell following the original structure.

So at the moment we allow people to provide a list of example URLs in the original PR comment, then run the branch though the HTTP Archive infrastructure and spit out the results, so we can see if the detection was successful or not. We also run against the https://almanac.httparchive.org/en/2022/ site by default to ensure it does NOT detect the new detection (assuming it does not use this). See this example PR: #71 where you can see the github actions comment with the new detections.

whatever implementation for the original wappalyzer you used it had to work with "string or string list" so those changes shouldn't make any difference

Oh that's good to know!

Happy to talk to you, lets see how this evolves, we will leave our issue open as well just if anything changes.

BTW happy for you to take out detections if you can easily automate that if you want?

@enthec-opensource
Copy link

enthec-opensource commented Nov 4, 2024

Are there any specifics you have in mind that aren't backwards compatible?

Of course we are trying to keep as much legacy as possible, but the jsons were chaotic and much of those values were not documented, we have been able to normalize every field except dom as you can see here: https://github.com/enthec/webappanalyzer/blob/main/.github/workflows/scripts/technology_validator.py#L311

The dom validator: https://github.com/enthec/webappanalyzer/blob/main/.github/workflows/scripts/technology_validator.py#L218
so apparently in dom there were 3 possibilities

With this situation we are not sure how to transform the list of strings to the second chaotic version as it kind of seems more complete, probably {"a[href*='kqsdesign.pl'][target='_blank']": {"exists": ""}} but also, i dont think a lot of wappalyzer implementations handle this thing, they probably just do the upper one.

This is what we did for our python lib: https://github.com/enthec/python-webappanalyzer/blob/master/webappanalyzer/webappanalyzer.py#L275

BTW happy for you to take out detections if you can easily automate that if you want?

Oh nice! we just do static structure validation in our CI, we can report any issues with that iyw.

@tunetheweb
Copy link
Member

Definitely do please let us know if you spot any issues with the detections we've added that have slipped past us!

@max-ostapenko
Copy link

max-ostapenko commented Nov 6, 2024

Not sure about how your CI works but we have fully automated it aswell following the original structure.

@enthec-opensource I like the implementation of your schema improvements.
Does your validation has a 'fix' option to adjust all those ambiguous attributes to lists automatically?
It would be help for a sync too.

We have around 15% of technologies that are not detected on a single website out of 12M.
I think it could be interesting to compare the rules with yours before we remove them as outdated.

@enthec-opensource
Copy link

sure, just pushed a script with those fixes https://github.com/enthec/webappanalyzer/blob/main/scripts/fix.py

just run python3 scripts/fix.py from project root

Will try to sync and validate everything else next week

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants