Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workaround for the GitHub Search API limit #276

Merged
merged 21 commits into from
Nov 11, 2021
Merged

Workaround for the GitHub Search API limit #276

merged 21 commits into from
Nov 11, 2021

Conversation

avimanyum
Copy link
Contributor

@avimanyum avimanyum commented Oct 13, 2021

Currently, due to a GIT API search limit of 1000, DFIU can only send PRs to up to 1000 downstream repos when a new version for a base image is available. With this change, if the number of downstream repos consuming a base image is more than 1000, the Git org with the most number of consumers is excluded from the search space. DFIU is run separately on that individual org while another search is executed while excluding that org. This process is repeated, by adding orgs recursively with the most number of consumers to the exclude list to reduce the search space till the resulting number of consumers is less than 1000. For each of the orgs that are added to the exclusion list, DFIU is run individually on each of those orgs.

@salesforce-cla
Copy link

Thanks for the contribution! Unfortunately we can't verify the commit author(s): Avimanyu Mukhopadhyay <a***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.

@github-actions github-actions bot added the tests Improvements / maintenance to tests label Oct 13, 2021

int numOfContentsFound = contentsWithImage.getTotalCount();
int numOfContentsFound = 0;
for (int i = 0; i < contentsWithImage.size(); i++) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A "NullPointerException" could be thrown; "contentsWithImage" is nullable here.

@@ -319,19 +342,21 @@ public boolean thisUserIsOwner(GHRepository repo) throws IOException {
* @param org GitHub organization
* @param img image to find
*/
public Optional<PagedSearchIterable<GHContent>> getGHContents(String org, String img)
public Optional<List<PagedSearchIterable<GHContent>>> getGHContents(String org, String img)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Abbreviation in name 'getGHContents' must contain no more than '2' consecutive capital letters.

@@ -20,8 +20,7 @@
import org.slf4j.LoggerFactory;

import java.io.*;
import java.util.List;
import java.util.Optional;
import java.util.*;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the '.' form of import should be avoided - java.util..

@@ -83,7 +82,7 @@ public GHRepository getRepo(String repoName) throws IOException {
return gitHubUtil.getRepo(repoName);
}

public PagedSearchIterable<GHContent> findFilesWithImage(String image, String org) throws IOException {
public List<PagedSearchIterable<GHContent>> findFilesWithImage(String image, String org) throws IOException {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line is longer than 100 characters (found 113).

return allFiles;
}

protected List<PagedSearchIterable<GHContent>> getSearchResultsForEachOrg(String image, PagedSearchIterable<GHContent> files) throws IOException {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line is longer than 100 characters (found 150).

@@ -20,8 +20,7 @@
import org.slf4j.LoggerFactory;

import java.io.*;
import java.util.List;
import java.util.Optional;
import java.util.*;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong lexicographical order for 'java.util.*' import. Should be before 'org.slf4j.LoggerFactory'.

@@ -30,6 +30,7 @@
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.TimeUnit;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong lexicographical order for 'java.util.concurrent.TimeUnit' import. Should be before 'org.slf4j.LoggerFactory'.

@@ -58,29 +59,34 @@ public void execute(final Namespace ns, DockerfileGitHubUtil dockerfileGitHubUti
new GitForkBranch(ns.get(Constants.IMG), ns.get(Constants.TAG), ns.get(Constants.GIT_BRANCH));

log.info("Finding Dockerfiles with the given image...");
Optional<PagedSearchIterable<GHContent>> contentsWithImage = dockerfileGitHubUtil.getGHContents(ns.get(Constants.GIT_ORG), img);
Optional<List<PagedSearchIterable<GHContent>>> contentsWithImage = dockerfileGitHubUtil.getGHContents(ns.get(Constants.GIT_ORG), img);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line is longer than 100 characters (found 142).

List<PagedSearchIterable<GHContent>> contentsFoundWithImage = contentsWithImage.get();
for (int i = 0; i < contentsFoundWithImage.size(); i++ ) {
Multimap<String, GitHubContentToProcess> pathToDockerfilesInParentRepo =
pullRequestSender.forkRepositoriesFoundAndGetPathToDockerfiles(contentsFoundWithImage.get(i), gitForkBranch);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line is longer than 100 characters (found 133).

exceptions.add(e);
}
} else {
log.warn("Didn't find fork for {} so not changing Dockerfiles", currUserRepo);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line is longer than 100 characters (found 102).

@codecov
Copy link

codecov bot commented Oct 13, 2021

Codecov Report

Merging #276 (6eab75a) into master (68b00b3) will increase coverage by 0.45%.
The diff coverage is 66.08%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #276      +/-   ##
============================================
+ Coverage     74.46%   74.92%   +0.45%     
- Complexity      241      250       +9     
============================================
  Files            17       17              
  Lines           885      969      +84     
  Branches        124      137      +13     
============================================
+ Hits            659      726      +67     
- Misses          200      213      +13     
- Partials         26       30       +4     
Impacted Files Coverage Δ
...ce/dockerfileimageupdate/subcommands/impl/All.java 51.77% <0.00%> (-4.39%) ⬇️
...dockerfileimageupdate/subcommands/impl/Parent.java 51.51% <0.00%> (-2.46%) ⬇️
...kerfileimageupdate/utils/DockerfileGitHubUtil.java 86.24% <93.33%> (+4.79%) ⬆️
.../salesforce/dockerfileimageupdate/CommandLine.java 57.42% <100.00%> (+2.21%) ⬆️

@@ -100,6 +106,7 @@ protected void changeDockerfiles(Namespace ns,
String parentName = parent.getFullName();

log.info("Fixing Dockerfiles in {} to PR to {}", forkedRepo.getFullName(), parent.getFullName());
TimeUnit.SECONDS.sleep(10);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for debugging or are you fixing something? If you're waiting on a resource, we should have a poll loop with a timeout.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that, in some cases, the underlying GitHub library will do some retries (kinda wish they were exponential backoff with jitter but 🤷 ) and will do some reacting to rate-limit responses that GitHub will send back.

int totalCount = files.getTotalCount();
if (totalCount > 1000) {
log.info("Number of files found for {}: {}", image, totalCount);
if (totalCount > 1000 && org == null) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make 1000 a constant - or better yet make it configurable, so that we can make this changable in utests and itests

if (totalCount > 1000) {
log.info("Number of files found for {}: {}", image, totalCount);
if (totalCount > 1000 && org == null) {
return getSearchResultsForEachOrg(image, files);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getSearchResultsForEachOrg -> getSearchResultsForFiles

orgs.add(orgName);
}
for(String gitOrg : orgs) {
contentsWithImage = findFilesWithImage(image, gitOrg);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're making a call per org, which will increase # of gh api calls a lot. We should be excluding orgs that I've processed instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of calling findFilesWithImage, we need to exclude the org that returned the most results.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Linkifying the issue - I think we're looking at this: #169 (comment)

}
if (orgToExclude != null) {
String queryExcludingOrg = String.format("-org:{}", orgToExclude);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String.format takes %s not {}

@@ -73,6 +73,8 @@ static ArgumentParser getArgumentParser() {
.help("regex of repository names to exclude from pull request generation");
parser.addArgument("-B")
.help("additional body text to include in pull requests");
parser.addArgument("-l", "--" + Constants.GIT_API_SEARCH_LIMIT)

This comment was marked as resolved.

@@ -48,14 +48,22 @@ public void execute(final Namespace ns, final DockerfileGitHubUtil dockerfileGit
Multimap<String, String> pathToDockerfilesInParentRepo = ArrayListMultimap.create();

Set<Map.Entry<String, JsonElement>> imageToTagStore = parseStoreToImagesMap(ns.get(Constants.STORE));
Integer gitApiSearchLimit;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for any of the below integer parsing b/c of adding the .type above

@@ -73,6 +73,8 @@ static ArgumentParser getArgumentParser() {
.help("regex of repository names to exclude from pull request generation");
parser.addArgument("-B")
.help("additional body text to include in pull requests");
parser.addArgument("-l", "--" + Constants.GIT_API_SEARCH_LIMIT)
.help("limit the search results for github api (default: 1000)");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add .setDefault(1000)...that way you don't need those constants you added

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And you won't need the null check you added either ;)

forkRepositoriesFound(pathToDockerfilesInParentRepo,
imagesFoundInParentRepo, contentsWithImage, image);
List<PagedSearchIterable<GHContent>> contentsWithImage =
this.dockerfileGitHubUtil.findFilesWithImage(image, ns.get(Constants.GIT_ORG), null, gitApiSearchLimit);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use Optional.empty() here

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to change your method signature as well, but it is worth it to avoid null checks

@@ -58,29 +58,41 @@ public void execute(final Namespace ns, DockerfileGitHubUtil dockerfileGitHubUti
new GitForkBranch(ns.get(Constants.IMG), ns.get(Constants.TAG), ns.get(Constants.GIT_BRANCH));

log.info("Finding Dockerfiles with the given image...");
Optional<PagedSearchIterable<GHContent>> contentsWithImage = dockerfileGitHubUtil.getGHContents(ns.get(Constants.GIT_ORG), img);

Integer gitApiSearchLimit;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto here as well...this just boils down to Integer gitApiSearchLimit = ns.get(Constants.GIT_API_SEARCH_LIMIT); w/ the above recommendations

imagesFoundInParentRepo, contentsWithImage, image);
List<PagedSearchIterable<GHContent>> contentsWithImage =
this.dockerfileGitHubUtil.findFilesWithImage(image, ns.get(Constants.GIT_ORG), null, gitApiSearchLimit);
for (int i = 0; i < contentsWithImage.size(); i++) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: cleaner to do this the modern java way:

contentsWithImage.forEach(pagedSearchIterable -> {
                forkRepositoriesFound(pathToDockerfilesInParentRepo,
                        imagesFoundInParentRepo, contentsWithImage.get(i), image);
            });

@@ -34,4 +34,6 @@ private Constants() {
public static final String GIT_PR_BODY = "B";
public static final String GIT_ADDITIONAL_COMMIT_MESSAGE = "c";
public static final String GIT_REPO_EXCLUDES = "excludes";
public static final String GIT_API_SEARCH_LIMIT = "git-api-search-limit";
public static final Integer GIT_API_SEARCH_LIMIT_NUMBER = 1000;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need based on above suggestion

GHContentSearchBuilder search = gitHubUtil.startSearch();
// Filename search appears to yield better / more results than language:Dockerfile
// Root cause: linguist doesn't currently deal with prefixes of files:
// https://github.com/github/linguist/issues/4566
search.filename("Dockerfile");
if (org != null) {
search.user(org);
if (orgToInclude != null) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if orgToInclude.equals(orgToExclude). Can that ever happen?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should pass in a Map<String, Boolean> instead of orgToInclude and orgToInclude...if map is empty, then we do search.q for each entry that has true value, we do search.user otherwise, we build a string with multiple -org:

log.info("Number of files found for {}: {}", image, totalCount);
return files;
if (totalCount > gitApiSearchLimit && orgToInclude != null) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this warning any more?

String orgName;
String orgWithMaximumHits = "";
int maxCount = 0;
HashMap<String, Integer> orgHitsMap = new HashMap<>();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
HashMap<String, Integer> orgHitsMap = new HashMap<>();
Map<String, Integer> orgHitsMap = new HashMap<>();


protected String getOrgNameWithMaximumHits(PagedSearchIterable<GHContent> files) {
GHRepository org;
String orgName;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for expanded scope of these

return allFiles;
}

protected String getOrgNameWithMaximumHits(PagedSearchIterable<GHContent> files) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modern java way of doing this:

return files.toList().stream()
                .collect(Collectors.groupingBy(ghContent -> ghContent.getOwner().getOwnerName(), Collectors.counting()))
                .entrySet()
                .stream()
                .max(Map.Entry.comparingByValue())
                .get()
                .getKey();

int numOfContentsFound = contentsWithImage.getTotalCount();
int numOfContentsFound = 0;
for (int i = 0; i < contentsWithImage.size(); i++) {
numOfContentsFound += contentsWithImage.get(i).getTotalCount();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More idiomatic these days to do:
contentsWithImage.stream().mapToInt(PagedSearchIterable::getTotalCount).sum()

log.warn("Number of search results is above {}! The GitHub Search API will only return around 1000 results - https://developer.github.com/v3/search/#about-the-search-api", Constants.GIT_API_SEARCH_LIMIT_NUMBER.toString());
}
else if (totalCount > gitApiSearchLimit) {
return getSearchResultsExcludingOrgWithMostHits(image, files, gitApiSearchLimit);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not totally convinced that this will filter out multiple orgs if we need two several cycles of excludes.

For example 3 orgs:
OrgA - 1000 results
OrgB - 1000 results
OrgC - 1000 results

We'll need to do something like:
T0: query with all orgs, 3k total results, keep results for OrgA assuming those appeared and not others.
T1: query with all orgs minus OrgA, 2k total results, keep results for OrgB assuming those appeared and not others.
T2: query with all orgs minus OrgB and OrgC, 1k total results, we are finished.

I don't think we do that here b/c we don't seem to make the org exclude str additive, but I might be missing something.

contentsWithImage = findFilesWithImage(img, org);
if (contentsWithImage.getTotalCount() > 0) {
contentsWithImage = findFilesWithImage(img, orgsToIncludeInSearch, gitApiSearchLimit);
if (contentsWithImage
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call "Optional#isPresent()" before accessing the value.

return files;
if (totalCount > gitApiSearchLimit
&& orgsToIncludeOrExclude.size() == 1
&& orgsToIncludeOrExclude.entrySet()
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call "Optional#isPresent()" before accessing the value.

@@ -319,23 +390,32 @@ public boolean thisUserIsOwner(GHRepository repo) throws IOException {
* @param org GitHub organization
* @param img image to find
*/
public Optional<PagedSearchIterable<GHContent>> getGHContents(String org, String img)
public Optional<List<PagedSearchIterable<GHContent>>> getGHContents(String org, String img, Integer gitApiSearchLimit)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Abbreviation in name 'getGHContents' must contain no more than '2' consecutive capital letters.

orgsToExclude.put(orgWithMaximumHits, false);
log.info("Running search by excluding the orgs {}.", orgsToExclude.keySet().toString());
Optional<List<PagedSearchIterable<GHContent>>> contentsExcludingOrgWithMaximumHits;
contentsExcludingOrgWithMaximumHits = findFilesWithImage(image, orgsToExclude, gitApiSearchLimit);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line is longer than 100 characters (found 106).

@@ -319,23 +390,32 @@ public boolean thisUserIsOwner(GHRepository repo) throws IOException {
* @param org GitHub organization
* @param img image to find
*/
public Optional<PagedSearchIterable<GHContent>> getGHContents(String org, String img)
public Optional<List<PagedSearchIterable<GHContent>>> getGHContents(String org, String img, Integer gitApiSearchLimit)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line is longer than 100 characters (found 122).

@@ -17,11 +17,11 @@
import com.salesforce.dockerfileimageupdate.storage.GitHubJsonStore;
import org.kohsuke.github.*;
import org.slf4j.Logger;
import java.util.stream.Collectors;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong lexicographical order for 'java.util.stream.Collectors' import. Should be before 'org.slf4j.Logger'.

throws IOException, InterruptedException {
PagedSearchIterable<GHContent> contentsWithImage = null;
Optional<List<PagedSearchIterable<GHContent>>> contentsWithImage = Optional.empty();
Map<String, Boolean> orgsToIncludeInSearch = Collections.unmodifiableMap(Collections.singletonMap(org, true));
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line is longer than 100 characters (found 118).

Optional<PagedSearchIterable<GHContent>> contentsWithImage = dockerfileGitHubUtil.getGHContents(ns.get(Constants.GIT_ORG), img);

Integer gitApiSearchLimit = ns.get(Constants.GIT_API_SEARCH_LIMIT);
Optional<List<PagedSearchIterable<GHContent>>> contentsWithImage = dockerfileGitHubUtil.getGHContents(ns.get(Constants.GIT_ORG), img, gitApiSearchLimit);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line is longer than 100 characters (found 161).

forkRepositoriesFound(pathToDockerfilesInParentRepo,
imagesFoundInParentRepo, contentsWithImage, image);
Optional<List<PagedSearchIterable<GHContent>>> contentsWithImage =
this.dockerfileGitHubUtil.findFilesWithImage(image, orgsToIncludeInSearch, gitApiSearchLimit);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line is longer than 100 characters (found 114).

@@ -83,14 +83,27 @@ public GHRepository getRepo(String repoName) throws IOException {
return gitHubUtil.getRepo(repoName);
}

public PagedSearchIterable<GHContent> findFilesWithImage(String image, String org) throws IOException {
public Optional<List<PagedSearchIterable<GHContent>>> findFilesWithImage(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method findFilesWithImage has 47 lines of code (exceeds 25 allowed). Consider refactoring.

@@ -83,14 +83,27 @@ public GHRepository getRepo(String repoName) throws IOException {
return gitHubUtil.getRepo(repoName);
}

public PagedSearchIterable<GHContent> findFilesWithImage(String image, String org) throws IOException {
public Optional<List<PagedSearchIterable<GHContent>>> findFilesWithImage(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method findFilesWithImage has a Cognitive Complexity of 12 (exceeds 5 allowed). Consider refactoring.

@@ -16,12 +16,12 @@
import com.salesforce.dockerfileimageupdate.search.GitHubImageSearchTermList;
import com.salesforce.dockerfileimageupdate.storage.GitHubJsonStore;
import org.kohsuke.github.*;
import java.util.stream.Collectors;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong lexicographical order for 'java.util.stream.Collectors' import. Should be before 'org.kohsuke.github.*'.

@justinharringa justinharringa changed the title Workaround for the git api search limit Workaround for the GitHub Search API limit Nov 2, 2021
Copy link
Contributor

@justinharringa justinharringa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@avimanyum Thanks! This is awesome! Could you edit the description of what the net change of this PR does? It seems to appear that it will potentially exclude one org with the top hits but I haven't given it a super deep look.

if (org != null) {
search.user(org);
if (!orgsToIncludeOrExclude.isEmpty()) {
StringBuilder includeOrexcludeOrgsQuery = new StringBuilder();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
StringBuilder includeOrexcludeOrgsQuery = new StringBuilder();
StringBuilder includeOrExcludeOrgsQuery = new StringBuilder();

@@ -48,14 +48,27 @@ public void execute(final Namespace ns, final DockerfileGitHubUtil dockerfileGit
Multimap<String, String> pathToDockerfilesInParentRepo = ArrayListMultimap.create();

Set<Map.Entry<String, JsonElement>> imageToTagStore = parseStoreToImagesMap(ns.get(Constants.STORE));
Integer gitApiSearchLimit = ns.get(Constants.GIT_API_SEARCH_LIMIT);
List<IOException> exceptions = new ArrayList<>();
Map<String, Boolean> orgsToIncludeInSearch = new HashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we wanting a Set instead? Should it be unmodifyable? What is the value here? Should this be a Set with typed IncludedOrgs / ExcludedOrgs or something like that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We wanted to have a map where the keys would be Org names and the value would be a Boolean value. True indicates the the org should be included in the search, and False indicates that it should not be included in the search. Using a map also ensures that the same org cannot be included and excluded at the same time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. I guess I could read this a few ways. Since the variable is named orgsToIncludeInSearch I don't know that I'd immediately think of the include/exclude so that's where I was wondering if it deems a formal class for coherence. I guess the other way would be to add a comment for clarity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the method call happens here, the map will only have an org to include in the search. Hence, I chose to name it orgsToIncludeInSearch. In the actual method definition, this gets stored in a variable Map<String, Boolean> orgsToIncludeOrExclude since this data structure can contain orgs to include or exclude depending on where it is being invoked from.

Integer gitApiSearchLimit = ns.get(Constants.GIT_API_SEARCH_LIMIT);
List<IOException> exceptions = new ArrayList<>();
Map<String, Boolean> orgsToIncludeInSearch = new HashMap<>();
orgsToIncludeInSearch.put(ns.get(Constants.GIT_ORG), true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if GIT_ORG wasn't specified? Does this work properly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this. Will add a check here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were you still going to add a check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, added a check.

@avimanyum
Copy link
Contributor Author

@justinharringa I add the description here #276 (comment). Please let me know if that looks ok. Also, is there any other place you wanted me to add the description to?

afalko
afalko previously approved these changes Nov 5, 2021
Copy link
Contributor

@justinharringa justinharringa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you need to rebase. Added some more comments. Looks like there are some unresolved comments from Andrey and I.

Thanks for the description add! I think that makes sense.

@@ -48,14 +48,27 @@ public void execute(final Namespace ns, final DockerfileGitHubUtil dockerfileGit
Multimap<String, String> pathToDockerfilesInParentRepo = ArrayListMultimap.create();

Set<Map.Entry<String, JsonElement>> imageToTagStore = parseStoreToImagesMap(ns.get(Constants.STORE));
Integer gitApiSearchLimit = ns.get(Constants.GIT_API_SEARCH_LIMIT);
List<IOException> exceptions = new ArrayList<>();
Map<String, Boolean> orgsToIncludeInSearch = new HashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. I guess I could read this a few ways. Since the variable is named orgsToIncludeInSearch I don't know that I'd immediately think of the include/exclude so that's where I was wondering if it deems a formal class for coherence. I guess the other way would be to add a comment for clarity.

Integer gitApiSearchLimit = ns.get(Constants.GIT_API_SEARCH_LIMIT);
List<IOException> exceptions = new ArrayList<>();
Map<String, Boolean> orgsToIncludeInSearch = new HashMap<>();
orgsToIncludeInSearch.put(ns.get(Constants.GIT_ORG), true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were you still going to add a check?

if (org.getKey() != null) {
if (org.getValue()) {
log.info("Including the org {} in the search.", org.getKey());
includeOrExcludeOrgsQuery.append(String.format("user:%s ", org.getKey()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not simply do search.user(org.getKey())?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There can be multiple orgs that will need to be appended to the search query. From my understanding search.user(org) internally reslolves to search.q(user:org). If we call call search.user(org) wouldn't the search query keep getting overwritten with every call and have only the last org added? Instead, using the search.q method allows us to construct the query string with all orgs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you dig into search.q() you'll notice that it's just appending search terms to a GHSearchBuilder. If it were the case that it was overriding, you'd have a similar problem with search.q() here overriding search.filename() above.

}
}
}
search.q(includeOrExcludeOrgsQuery.toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check to make sure we don't exceed any search term limits?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please point me to some documentation about search term limits?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @justinharringa!
Do we need to add an additional check here or can we rely on the check that Github will do internally and return a "Validation failed" error if the limit is exceeded?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not spotted anything in the github-api Java library that does this check but I'm not sure. I suspect that you'd likely get a failure from the actual GitHub API in this case.

Seems like it would be worth testing. A simple way to test would be to simply try and build a main() that adds 500 search terms of length beyond the limit and see what happens.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running the search for every individual org will increase # of gh api calls a lot. But reaching a 128 chars limit would not be very hard like you mentioned.
CC @afalko

Copy link
Collaborator

@afalko afalko Nov 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to do a search for each individual org unless each of them are 90 long. I think we'll need do the following if the total length of orgs + the Dockerfile search is >120:

  1. Sort orgs by length
  2. Build list of smallest length orgs until you have as many as will fit under the 120 limit
  3. Call search with those
  4. Repeat step #2 until you don't have any orgs remaining.
    We'll need a function that calls search again with another set of orgs if

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be tracked as a follow up? I think in our case, the chances of the query hitting that limit is quite low. For most cases, excluding 1 or 2 git orgs should bring down the search space to less than 1000.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll let you 2 make the determination on follow-up or not - would be good to check one of our known large orgs to get an idea of how impactful this is. Note that github-api does try and respect rate limiting in some cases with retries (although it's simple sleep where this does occur). I believe that this is one of those cases so that certainly helps with GH API calls.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@justinharringa : we determined to proceed forward without the above and create a follow up change to either do the above or take another approach all together.

return files;
if (totalCount > gitApiSearchLimit
&& orgsToIncludeOrExclude.size() == 1
&& orgsToIncludeOrExclude
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call "Optional#isPresent()" before accessing the value.

@@ -83,14 +84,29 @@ public GHRepository getRepo(String repoName) throws IOException {
return gitHubUtil.getRepo(repoName);
}

public PagedSearchIterable<GHContent> findFilesWithImage(String image, String org) throws IOException {
public Optional<List<PagedSearchIterable<GHContent>>> findFilesWithImage(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method findFilesWithImage has 59 lines of code (exceeds 25 allowed). Consider refactoring.

import java.util.List;
import java.util.Optional;
import java.util.*;
import java.util.HashMap;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong lexicographical order for 'java.util.HashMap' import. Should be before 'org.slf4j.LoggerFactory'.

@@ -319,23 +403,38 @@ public boolean thisUserIsOwner(GHRepository repo) throws IOException {
* @param org GitHub organization
* @param img image to find
*/
public Optional<PagedSearchIterable<GHContent>> getGHContents(String org, String img)
public Optional<List<PagedSearchIterable<GHContent>>> getGHContents(String org, String img, Integer gitApiSearchLimit)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method getGHContents has 28 lines of code (exceeds 25 allowed). Consider refactoring.

@@ -319,23 +403,38 @@ public boolean thisUserIsOwner(GHRepository repo) throws IOException {
* @param org GitHub organization
* @param img image to find
*/
public Optional<PagedSearchIterable<GHContent>> getGHContents(String org, String img)
public Optional<List<PagedSearchIterable<GHContent>>> getGHContents(String org, String img, Integer gitApiSearchLimit)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method getGHContents has a Cognitive Complexity of 6 (exceeds 5 allowed). Consider refactoring.

log.warn("Number of search results for a single org {} is above {}! The GitHub Search API will only return around 1000 results - https://developer.github.com/v3/search/#about-the-search-api",
orgName, gitApiSearchLimit);
} else if (totalCount > gitApiSearchLimit) {
log.info("The number of files returned is greater than the git API search limit" +
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'+' should be on a new line.

orgName, gitApiSearchLimit);
} else if (totalCount > gitApiSearchLimit) {
log.info("The number of files returned is greater than the git API search limit" +
" of {}. The orgs with the maximum number of hits will be recursively removed" +
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'+' should be on a new line.

} else if (totalCount > gitApiSearchLimit) {
log.info("The number of files returned is greater than the git API search limit" +
" of {}. The orgs with the maximum number of hits will be recursively removed" +
" to reduce the search space. For every org that is excluded, a separate " +
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'+' should be on a new line.

@@ -83,14 +84,29 @@ public GHRepository getRepo(String repoName) throws IOException {
return gitHubUtil.getRepo(repoName);
}

public PagedSearchIterable<GHContent> findFilesWithImage(String image, String org) throws IOException {
public Optional<List<PagedSearchIterable<GHContent>>> findFilesWithImage(
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Method findFilesWithImage has a Cognitive Complexity of 15 (exceeds 5 allowed). Consider refactoring.

Copy link
Collaborator

@afalko afalko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite the search space size limitation, this buys some time for a better solution.

@justinharringa justinharringa added bug Bug fixes and removed tests Improvements / maintenance to tests labels Nov 11, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Bug fixes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle cases where there are more than 1000 downstream consumers of an image
3 participants