-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Workaround for the GitHub Search API limit #276
Workaround for the GitHub Search API limit #276
Conversation
Thanks for the contribution! Unfortunately we can't verify the commit author(s): Avimanyu Mukhopadhyay <a***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request. |
|
||
int numOfContentsFound = contentsWithImage.getTotalCount(); | ||
int numOfContentsFound = 0; | ||
for (int i = 0; i < contentsWithImage.size(); i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A "NullPointerException" could be thrown; "contentsWithImage" is nullable here.
@@ -319,19 +342,21 @@ public boolean thisUserIsOwner(GHRepository repo) throws IOException { | |||
* @param org GitHub organization | |||
* @param img image to find | |||
*/ | |||
public Optional<PagedSearchIterable<GHContent>> getGHContents(String org, String img) | |||
public Optional<List<PagedSearchIterable<GHContent>>> getGHContents(String org, String img) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Abbreviation in name 'getGHContents' must contain no more than '2' consecutive capital letters.
@@ -20,8 +20,7 @@ | |||
import org.slf4j.LoggerFactory; | |||
|
|||
import java.io.*; | |||
import java.util.List; | |||
import java.util.Optional; | |||
import java.util.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using the '.' form of import should be avoided - java.util..
@@ -83,7 +82,7 @@ public GHRepository getRepo(String repoName) throws IOException { | |||
return gitHubUtil.getRepo(repoName); | |||
} | |||
|
|||
public PagedSearchIterable<GHContent> findFilesWithImage(String image, String org) throws IOException { | |||
public List<PagedSearchIterable<GHContent>> findFilesWithImage(String image, String org) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line is longer than 100 characters (found 113).
return allFiles; | ||
} | ||
|
||
protected List<PagedSearchIterable<GHContent>> getSearchResultsForEachOrg(String image, PagedSearchIterable<GHContent> files) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line is longer than 100 characters (found 150).
@@ -20,8 +20,7 @@ | |||
import org.slf4j.LoggerFactory; | |||
|
|||
import java.io.*; | |||
import java.util.List; | |||
import java.util.Optional; | |||
import java.util.*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong lexicographical order for 'java.util.*' import. Should be before 'org.slf4j.LoggerFactory'.
@@ -30,6 +30,7 @@ | |||
import java.util.ArrayList; | |||
import java.util.List; | |||
import java.util.Optional; | |||
import java.util.concurrent.TimeUnit; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong lexicographical order for 'java.util.concurrent.TimeUnit' import. Should be before 'org.slf4j.LoggerFactory'.
@@ -58,29 +59,34 @@ public void execute(final Namespace ns, DockerfileGitHubUtil dockerfileGitHubUti | |||
new GitForkBranch(ns.get(Constants.IMG), ns.get(Constants.TAG), ns.get(Constants.GIT_BRANCH)); | |||
|
|||
log.info("Finding Dockerfiles with the given image..."); | |||
Optional<PagedSearchIterable<GHContent>> contentsWithImage = dockerfileGitHubUtil.getGHContents(ns.get(Constants.GIT_ORG), img); | |||
Optional<List<PagedSearchIterable<GHContent>>> contentsWithImage = dockerfileGitHubUtil.getGHContents(ns.get(Constants.GIT_ORG), img); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line is longer than 100 characters (found 142).
List<PagedSearchIterable<GHContent>> contentsFoundWithImage = contentsWithImage.get(); | ||
for (int i = 0; i < contentsFoundWithImage.size(); i++ ) { | ||
Multimap<String, GitHubContentToProcess> pathToDockerfilesInParentRepo = | ||
pullRequestSender.forkRepositoriesFoundAndGetPathToDockerfiles(contentsFoundWithImage.get(i), gitForkBranch); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line is longer than 100 characters (found 133).
exceptions.add(e); | ||
} | ||
} else { | ||
log.warn("Didn't find fork for {} so not changing Dockerfiles", currUserRepo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line is longer than 100 characters (found 102).
Codecov Report
@@ Coverage Diff @@
## master #276 +/- ##
============================================
+ Coverage 74.46% 74.92% +0.45%
- Complexity 241 250 +9
============================================
Files 17 17
Lines 885 969 +84
Branches 124 137 +13
============================================
+ Hits 659 726 +67
- Misses 200 213 +13
- Partials 26 30 +4
|
@@ -100,6 +106,7 @@ protected void changeDockerfiles(Namespace ns, | |||
String parentName = parent.getFullName(); | |||
|
|||
log.info("Fixing Dockerfiles in {} to PR to {}", forkedRepo.getFullName(), parent.getFullName()); | |||
TimeUnit.SECONDS.sleep(10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this for debugging or are you fixing something? If you're waiting on a resource, we should have a poll loop with a timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also note that, in some cases, the underlying GitHub library will do some retries (kinda wish they were exponential backoff with jitter but 🤷 ) and will do some reacting to rate-limit responses that GitHub will send back.
int totalCount = files.getTotalCount(); | ||
if (totalCount > 1000) { | ||
log.info("Number of files found for {}: {}", image, totalCount); | ||
if (totalCount > 1000 && org == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make 1000
a constant - or better yet make it configurable, so that we can make this changable in utests and itests
if (totalCount > 1000) { | ||
log.info("Number of files found for {}: {}", image, totalCount); | ||
if (totalCount > 1000 && org == null) { | ||
return getSearchResultsForEachOrg(image, files); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getSearchResultsForEachOrg
-> getSearchResultsForFiles
orgs.add(orgName); | ||
} | ||
for(String gitOrg : orgs) { | ||
contentsWithImage = findFilesWithImage(image, gitOrg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're making a call per org, which will increase # of gh api calls a lot. We should be excluding orgs that I've processed instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of calling findFilesWithImage
, we need to exclude the org that returned the most results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Linkifying the issue - I think we're looking at this: #169 (comment)
} | ||
if (orgToExclude != null) { | ||
String queryExcludingOrg = String.format("-org:{}", orgToExclude); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
String.format takes %s
not {}
@@ -73,6 +73,8 @@ static ArgumentParser getArgumentParser() { | |||
.help("regex of repository names to exclude from pull request generation"); | |||
parser.addArgument("-B") | |||
.help("additional body text to include in pull requests"); | |||
parser.addArgument("-l", "--" + Constants.GIT_API_SEARCH_LIMIT) |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
@@ -48,14 +48,22 @@ public void execute(final Namespace ns, final DockerfileGitHubUtil dockerfileGit | |||
Multimap<String, String> pathToDockerfilesInParentRepo = ArrayListMultimap.create(); | |||
|
|||
Set<Map.Entry<String, JsonElement>> imageToTagStore = parseStoreToImagesMap(ns.get(Constants.STORE)); | |||
Integer gitApiSearchLimit; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for any of the below integer parsing b/c of adding the .type
above
@@ -73,6 +73,8 @@ static ArgumentParser getArgumentParser() { | |||
.help("regex of repository names to exclude from pull request generation"); | |||
parser.addArgument("-B") | |||
.help("additional body text to include in pull requests"); | |||
parser.addArgument("-l", "--" + Constants.GIT_API_SEARCH_LIMIT) | |||
.help("limit the search results for github api (default: 1000)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also add .setDefault(1000)
...that way you don't need those constants you added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And you won't need the null check you added either ;)
forkRepositoriesFound(pathToDockerfilesInParentRepo, | ||
imagesFoundInParentRepo, contentsWithImage, image); | ||
List<PagedSearchIterable<GHContent>> contentsWithImage = | ||
this.dockerfileGitHubUtil.findFilesWithImage(image, ns.get(Constants.GIT_ORG), null, gitApiSearchLimit); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use Optional.empty()
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll need to change your method signature as well, but it is worth it to avoid null
checks
@@ -58,29 +58,41 @@ public void execute(final Namespace ns, DockerfileGitHubUtil dockerfileGitHubUti | |||
new GitForkBranch(ns.get(Constants.IMG), ns.get(Constants.TAG), ns.get(Constants.GIT_BRANCH)); | |||
|
|||
log.info("Finding Dockerfiles with the given image..."); | |||
Optional<PagedSearchIterable<GHContent>> contentsWithImage = dockerfileGitHubUtil.getGHContents(ns.get(Constants.GIT_ORG), img); | |||
|
|||
Integer gitApiSearchLimit; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto here as well...this just boils down to Integer gitApiSearchLimit = ns.get(Constants.GIT_API_SEARCH_LIMIT);
w/ the above recommendations
imagesFoundInParentRepo, contentsWithImage, image); | ||
List<PagedSearchIterable<GHContent>> contentsWithImage = | ||
this.dockerfileGitHubUtil.findFilesWithImage(image, ns.get(Constants.GIT_ORG), null, gitApiSearchLimit); | ||
for (int i = 0; i < contentsWithImage.size(); i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: cleaner to do this the modern java way:
contentsWithImage.forEach(pagedSearchIterable -> {
forkRepositoriesFound(pathToDockerfilesInParentRepo,
imagesFoundInParentRepo, contentsWithImage.get(i), image);
});
@@ -34,4 +34,6 @@ private Constants() { | |||
public static final String GIT_PR_BODY = "B"; | |||
public static final String GIT_ADDITIONAL_COMMIT_MESSAGE = "c"; | |||
public static final String GIT_REPO_EXCLUDES = "excludes"; | |||
public static final String GIT_API_SEARCH_LIMIT = "git-api-search-limit"; | |||
public static final Integer GIT_API_SEARCH_LIMIT_NUMBER = 1000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need based on above suggestion
GHContentSearchBuilder search = gitHubUtil.startSearch(); | ||
// Filename search appears to yield better / more results than language:Dockerfile | ||
// Root cause: linguist doesn't currently deal with prefixes of files: | ||
// https://github.com/github/linguist/issues/4566 | ||
search.filename("Dockerfile"); | ||
if (org != null) { | ||
search.user(org); | ||
if (orgToInclude != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if orgToInclude.equals(orgToExclude)
. Can that ever happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should pass in a Map<String, Boolean>
instead of orgToInclude and orgToInclude...if map is empty, then we do search.q
for each entry that has true value, we do search.user
otherwise, we build a string with multiple -org:
log.info("Number of files found for {}: {}", image, totalCount); | ||
return files; | ||
if (totalCount > gitApiSearchLimit && orgToInclude != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this warning any more?
String orgName; | ||
String orgWithMaximumHits = ""; | ||
int maxCount = 0; | ||
HashMap<String, Integer> orgHitsMap = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HashMap<String, Integer> orgHitsMap = new HashMap<>(); | |
Map<String, Integer> orgHitsMap = new HashMap<>(); |
|
||
protected String getOrgNameWithMaximumHits(PagedSearchIterable<GHContent> files) { | ||
GHRepository org; | ||
String orgName; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for expanded scope of these
return allFiles; | ||
} | ||
|
||
protected String getOrgNameWithMaximumHits(PagedSearchIterable<GHContent> files) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modern java way of doing this:
return files.toList().stream()
.collect(Collectors.groupingBy(ghContent -> ghContent.getOwner().getOwnerName(), Collectors.counting()))
.entrySet()
.stream()
.max(Map.Entry.comparingByValue())
.get()
.getKey();
int numOfContentsFound = contentsWithImage.getTotalCount(); | ||
int numOfContentsFound = 0; | ||
for (int i = 0; i < contentsWithImage.size(); i++) { | ||
numOfContentsFound += contentsWithImage.get(i).getTotalCount(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More idiomatic these days to do:
contentsWithImage.stream().mapToInt(PagedSearchIterable::getTotalCount).sum()
log.warn("Number of search results is above {}! The GitHub Search API will only return around 1000 results - https://developer.github.com/v3/search/#about-the-search-api", Constants.GIT_API_SEARCH_LIMIT_NUMBER.toString()); | ||
} | ||
else if (totalCount > gitApiSearchLimit) { | ||
return getSearchResultsExcludingOrgWithMostHits(image, files, gitApiSearchLimit); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not totally convinced that this will filter out multiple orgs if we need two several cycles of excludes.
For example 3 orgs:
OrgA - 1000 results
OrgB - 1000 results
OrgC - 1000 results
We'll need to do something like:
T0: query with all orgs, 3k total results, keep results for OrgA assuming those appeared and not others.
T1: query with all orgs minus OrgA, 2k total results, keep results for OrgB assuming those appeared and not others.
T2: query with all orgs minus OrgB and OrgC, 1k total results, we are finished.
I don't think we do that here b/c we don't seem to make the org exclude str additive, but I might be missing something.
contentsWithImage = findFilesWithImage(img, org); | ||
if (contentsWithImage.getTotalCount() > 0) { | ||
contentsWithImage = findFilesWithImage(img, orgsToIncludeInSearch, gitApiSearchLimit); | ||
if (contentsWithImage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call "Optional#isPresent()" before accessing the value.
return files; | ||
if (totalCount > gitApiSearchLimit | ||
&& orgsToIncludeOrExclude.size() == 1 | ||
&& orgsToIncludeOrExclude.entrySet() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call "Optional#isPresent()" before accessing the value.
@@ -319,23 +390,32 @@ public boolean thisUserIsOwner(GHRepository repo) throws IOException { | |||
* @param org GitHub organization | |||
* @param img image to find | |||
*/ | |||
public Optional<PagedSearchIterable<GHContent>> getGHContents(String org, String img) | |||
public Optional<List<PagedSearchIterable<GHContent>>> getGHContents(String org, String img, Integer gitApiSearchLimit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Abbreviation in name 'getGHContents' must contain no more than '2' consecutive capital letters.
orgsToExclude.put(orgWithMaximumHits, false); | ||
log.info("Running search by excluding the orgs {}.", orgsToExclude.keySet().toString()); | ||
Optional<List<PagedSearchIterable<GHContent>>> contentsExcludingOrgWithMaximumHits; | ||
contentsExcludingOrgWithMaximumHits = findFilesWithImage(image, orgsToExclude, gitApiSearchLimit); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line is longer than 100 characters (found 106).
@@ -319,23 +390,32 @@ public boolean thisUserIsOwner(GHRepository repo) throws IOException { | |||
* @param org GitHub organization | |||
* @param img image to find | |||
*/ | |||
public Optional<PagedSearchIterable<GHContent>> getGHContents(String org, String img) | |||
public Optional<List<PagedSearchIterable<GHContent>>> getGHContents(String org, String img, Integer gitApiSearchLimit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line is longer than 100 characters (found 122).
@@ -17,11 +17,11 @@ | |||
import com.salesforce.dockerfileimageupdate.storage.GitHubJsonStore; | |||
import org.kohsuke.github.*; | |||
import org.slf4j.Logger; | |||
import java.util.stream.Collectors; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong lexicographical order for 'java.util.stream.Collectors' import. Should be before 'org.slf4j.Logger'.
throws IOException, InterruptedException { | ||
PagedSearchIterable<GHContent> contentsWithImage = null; | ||
Optional<List<PagedSearchIterable<GHContent>>> contentsWithImage = Optional.empty(); | ||
Map<String, Boolean> orgsToIncludeInSearch = Collections.unmodifiableMap(Collections.singletonMap(org, true)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line is longer than 100 characters (found 118).
Optional<PagedSearchIterable<GHContent>> contentsWithImage = dockerfileGitHubUtil.getGHContents(ns.get(Constants.GIT_ORG), img); | ||
|
||
Integer gitApiSearchLimit = ns.get(Constants.GIT_API_SEARCH_LIMIT); | ||
Optional<List<PagedSearchIterable<GHContent>>> contentsWithImage = dockerfileGitHubUtil.getGHContents(ns.get(Constants.GIT_ORG), img, gitApiSearchLimit); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line is longer than 100 characters (found 161).
forkRepositoriesFound(pathToDockerfilesInParentRepo, | ||
imagesFoundInParentRepo, contentsWithImage, image); | ||
Optional<List<PagedSearchIterable<GHContent>>> contentsWithImage = | ||
this.dockerfileGitHubUtil.findFilesWithImage(image, orgsToIncludeInSearch, gitApiSearchLimit); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line is longer than 100 characters (found 114).
@@ -83,14 +83,27 @@ public GHRepository getRepo(String repoName) throws IOException { | |||
return gitHubUtil.getRepo(repoName); | |||
} | |||
|
|||
public PagedSearchIterable<GHContent> findFilesWithImage(String image, String org) throws IOException { | |||
public Optional<List<PagedSearchIterable<GHContent>>> findFilesWithImage( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Method findFilesWithImage
has 47 lines of code (exceeds 25 allowed). Consider refactoring.
@@ -83,14 +83,27 @@ public GHRepository getRepo(String repoName) throws IOException { | |||
return gitHubUtil.getRepo(repoName); | |||
} | |||
|
|||
public PagedSearchIterable<GHContent> findFilesWithImage(String image, String org) throws IOException { | |||
public Optional<List<PagedSearchIterable<GHContent>>> findFilesWithImage( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Method findFilesWithImage
has a Cognitive Complexity of 12 (exceeds 5 allowed). Consider refactoring.
@@ -16,12 +16,12 @@ | |||
import com.salesforce.dockerfileimageupdate.search.GitHubImageSearchTermList; | |||
import com.salesforce.dockerfileimageupdate.storage.GitHubJsonStore; | |||
import org.kohsuke.github.*; | |||
import java.util.stream.Collectors; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong lexicographical order for 'java.util.stream.Collectors' import. Should be before 'org.kohsuke.github.*'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@avimanyum Thanks! This is awesome! Could you edit the description of what the net change of this PR does? It seems to appear that it will potentially exclude one org with the top hits but I haven't given it a super deep look.
if (org != null) { | ||
search.user(org); | ||
if (!orgsToIncludeOrExclude.isEmpty()) { | ||
StringBuilder includeOrexcludeOrgsQuery = new StringBuilder(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StringBuilder includeOrexcludeOrgsQuery = new StringBuilder(); | |
StringBuilder includeOrExcludeOrgsQuery = new StringBuilder(); |
@@ -48,14 +48,27 @@ public void execute(final Namespace ns, final DockerfileGitHubUtil dockerfileGit | |||
Multimap<String, String> pathToDockerfilesInParentRepo = ArrayListMultimap.create(); | |||
|
|||
Set<Map.Entry<String, JsonElement>> imageToTagStore = parseStoreToImagesMap(ns.get(Constants.STORE)); | |||
Integer gitApiSearchLimit = ns.get(Constants.GIT_API_SEARCH_LIMIT); | |||
List<IOException> exceptions = new ArrayList<>(); | |||
Map<String, Boolean> orgsToIncludeInSearch = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we wanting a Set
instead? Should it be unmodifyable? What is the value
here? Should this be a Set
with typed IncludedOrgs
/ ExcludedOrgs
or something like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We wanted to have a map where the keys would be Org names and the value would be a Boolean value. True indicates the the org should be included in the search, and False indicates that it should not be included in the search. Using a map also ensures that the same org cannot be included and excluded at the same time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. I guess I could read this a few ways. Since the variable is named orgsToIncludeInSearch
I don't know that I'd immediately think of the include/exclude so that's where I was wondering if it deems a formal class for coherence. I guess the other way would be to add a comment for clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When the method call happens here, the map will only have an org to include in the search. Hence, I chose to name it orgsToIncludeInSearch
. In the actual method definition, this gets stored in a variable Map<String, Boolean> orgsToIncludeOrExclude
since this data structure can contain orgs to include or exclude depending on where it is being invoked from.
Integer gitApiSearchLimit = ns.get(Constants.GIT_API_SEARCH_LIMIT); | ||
List<IOException> exceptions = new ArrayList<>(); | ||
Map<String, Boolean> orgsToIncludeInSearch = new HashMap<>(); | ||
orgsToIncludeInSearch.put(ns.get(Constants.GIT_ORG), true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if GIT_ORG
wasn't specified? Does this work properly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this. Will add a check here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were you still going to add a check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, added a check.
@justinharringa I add the description here #276 (comment). Please let me know if that looks ok. Also, is there any other place you wanted me to add the description to? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you need to rebase. Added some more comments. Looks like there are some unresolved comments from Andrey and I.
Thanks for the description add! I think that makes sense.
@@ -48,14 +48,27 @@ public void execute(final Namespace ns, final DockerfileGitHubUtil dockerfileGit | |||
Multimap<String, String> pathToDockerfilesInParentRepo = ArrayListMultimap.create(); | |||
|
|||
Set<Map.Entry<String, JsonElement>> imageToTagStore = parseStoreToImagesMap(ns.get(Constants.STORE)); | |||
Integer gitApiSearchLimit = ns.get(Constants.GIT_API_SEARCH_LIMIT); | |||
List<IOException> exceptions = new ArrayList<>(); | |||
Map<String, Boolean> orgsToIncludeInSearch = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. I guess I could read this a few ways. Since the variable is named orgsToIncludeInSearch
I don't know that I'd immediately think of the include/exclude so that's where I was wondering if it deems a formal class for coherence. I guess the other way would be to add a comment for clarity.
Integer gitApiSearchLimit = ns.get(Constants.GIT_API_SEARCH_LIMIT); | ||
List<IOException> exceptions = new ArrayList<>(); | ||
Map<String, Boolean> orgsToIncludeInSearch = new HashMap<>(); | ||
orgsToIncludeInSearch.put(ns.get(Constants.GIT_ORG), true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were you still going to add a check?
if (org.getKey() != null) { | ||
if (org.getValue()) { | ||
log.info("Including the org {} in the search.", org.getKey()); | ||
includeOrExcludeOrgsQuery.append(String.format("user:%s ", org.getKey())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not simply do search.user(org.getKey())
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There can be multiple orgs that will need to be appended to the search query. From my understanding search.user(org)
internally reslolves to search.q(user:org)
. If we call call search.user(org)
wouldn't the search query keep getting overwritten with every call and have only the last org added? Instead, using the search.q
method allows us to construct the query string with all orgs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you dig into search.q()
you'll notice that it's just appending search terms to a GHSearchBuilder
. If it were the case that it was overriding, you'd have a similar problem with search.q()
here overriding search.filename()
above.
} | ||
} | ||
} | ||
search.q(includeOrExcludeOrgsQuery.toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to check to make sure we don't exceed any search term limits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please point me to some documentation about search term limits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @justinharringa!
Do we need to add an additional check here or can we rely on the check that Github will do internally and return a "Validation failed" error if the limit is exceeded?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not spotted anything in the github-api Java library that does this check but I'm not sure. I suspect that you'd likely get a failure from the actual GitHub API in this case.
Seems like it would be worth testing. A simple way to test would be to simply try and build a main()
that adds 500 search terms of length beyond the limit and see what happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running the search for every individual org will increase # of gh api calls a lot. But reaching a 128 chars limit would not be very hard like you mentioned.
CC @afalko
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need to do a search for each individual org unless each of them are 90 long. I think we'll need do the following if the total length of orgs + the Dockerfile search is >120:
- Sort orgs by length
- Build list of smallest length orgs until you have as many as will fit under the 120 limit
- Call search with those
- Repeat step
#2
until you don't have any orgs remaining.
We'll need a function that calls search again with another set of orgs if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be tracked as a follow up? I think in our case, the chances of the query hitting that limit is quite low. For most cases, excluding 1 or 2 git orgs should bring down the search space to less than 1000.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll let you 2 make the determination on follow-up or not - would be good to check one of our known large orgs to get an idea of how impactful this is. Note that github-api
does try and respect rate limiting in some cases with retries (although it's simple sleep where this does occur). I believe that this is one of those cases so that certainly helps with GH API calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@justinharringa : we determined to proceed forward without the above and create a follow up change to either do the above or take another approach all together.
return files; | ||
if (totalCount > gitApiSearchLimit | ||
&& orgsToIncludeOrExclude.size() == 1 | ||
&& orgsToIncludeOrExclude |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call "Optional#isPresent()" before accessing the value.
@@ -83,14 +84,29 @@ public GHRepository getRepo(String repoName) throws IOException { | |||
return gitHubUtil.getRepo(repoName); | |||
} | |||
|
|||
public PagedSearchIterable<GHContent> findFilesWithImage(String image, String org) throws IOException { | |||
public Optional<List<PagedSearchIterable<GHContent>>> findFilesWithImage( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Method findFilesWithImage
has 59 lines of code (exceeds 25 allowed). Consider refactoring.
import java.util.List; | ||
import java.util.Optional; | ||
import java.util.*; | ||
import java.util.HashMap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong lexicographical order for 'java.util.HashMap' import. Should be before 'org.slf4j.LoggerFactory'.
@@ -319,23 +403,38 @@ public boolean thisUserIsOwner(GHRepository repo) throws IOException { | |||
* @param org GitHub organization | |||
* @param img image to find | |||
*/ | |||
public Optional<PagedSearchIterable<GHContent>> getGHContents(String org, String img) | |||
public Optional<List<PagedSearchIterable<GHContent>>> getGHContents(String org, String img, Integer gitApiSearchLimit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Method getGHContents
has 28 lines of code (exceeds 25 allowed). Consider refactoring.
@@ -319,23 +403,38 @@ public boolean thisUserIsOwner(GHRepository repo) throws IOException { | |||
* @param org GitHub organization | |||
* @param img image to find | |||
*/ | |||
public Optional<PagedSearchIterable<GHContent>> getGHContents(String org, String img) | |||
public Optional<List<PagedSearchIterable<GHContent>>> getGHContents(String org, String img, Integer gitApiSearchLimit) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Method getGHContents
has a Cognitive Complexity of 6 (exceeds 5 allowed). Consider refactoring.
log.warn("Number of search results for a single org {} is above {}! The GitHub Search API will only return around 1000 results - https://developer.github.com/v3/search/#about-the-search-api", | ||
orgName, gitApiSearchLimit); | ||
} else if (totalCount > gitApiSearchLimit) { | ||
log.info("The number of files returned is greater than the git API search limit" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'+' should be on a new line.
orgName, gitApiSearchLimit); | ||
} else if (totalCount > gitApiSearchLimit) { | ||
log.info("The number of files returned is greater than the git API search limit" + | ||
" of {}. The orgs with the maximum number of hits will be recursively removed" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'+' should be on a new line.
} else if (totalCount > gitApiSearchLimit) { | ||
log.info("The number of files returned is greater than the git API search limit" + | ||
" of {}. The orgs with the maximum number of hits will be recursively removed" + | ||
" to reduce the search space. For every org that is excluded, a separate " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'+' should be on a new line.
@@ -83,14 +84,29 @@ public GHRepository getRepo(String repoName) throws IOException { | |||
return gitHubUtil.getRepo(repoName); | |||
} | |||
|
|||
public PagedSearchIterable<GHContent> findFilesWithImage(String image, String org) throws IOException { | |||
public Optional<List<PagedSearchIterable<GHContent>>> findFilesWithImage( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Method findFilesWithImage
has a Cognitive Complexity of 15 (exceeds 5 allowed). Consider refactoring.
…m/dockerfile-image-update into git_api_search_limit_workaround
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Despite the search space size limitation, this buys some time for a better solution.
Currently, due to a GIT API search limit of 1000, DFIU can only send PRs to up to 1000 downstream repos when a new version for a base image is available. With this change, if the number of downstream repos consuming a base image is more than 1000, the Git org with the most number of consumers is excluded from the search space. DFIU is run separately on that individual org while another search is executed while excluding that org. This process is repeated, by adding orgs recursively with the most number of consumers to the exclude list to reduce the search space till the resulting number of consumers is less than 1000. For each of the orgs that are added to the exclusion list, DFIU is run individually on each of those orgs.