Negative/Near-zero acquisition values for Multi-Fidelity BO #1977
-
Hi, this question builds on discussion #1942 and the resulting PR, #1956. The problem setup is as follows: I have 8 continuous dimensions and 8 discrete data fidelity dimensions, giving a total of 864 possible combinations of values for the discrete fidelity dimensions. I have adapted code from the discrete multi-fidelity BO tutorial to allow for this setup. In the tutorial, […]
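For reference, here is a minimal sketch of how I'm constructing the model; the shapes, names, and fidelity indices are illustrative rather than my exact code, and this relies on the multiple data fidelity dimensions (the `data_fidelities` argument) enabled by #1956:

```python
import torch
from botorch.models.gp_regression_fidelity import SingleTaskMultiFidelityGP
from botorch.models.transforms.outcome import Standardize

# Illustrative data: 8 continuous dims (indices 0-7) followed by
# 8 discrete data fidelity dims (indices 8-15).
train_X = torch.rand(32, 16, dtype=torch.double)  # placeholder training inputs
train_Y = torch.rand(32, 1, dtype=torch.double)   # placeholder observations

model = SingleTaskMultiFidelityGP(
    train_X,
    train_Y,
    data_fidelities=[8, 9, 10, 11, 12, 13, 14, 15],  # the 8 discrete fidelity dims
    linear_truncated=False,  # assumption: the default linear truncated kernel
                             # supports at most two fidelity dimensions
    outcome_transform=Standardize(m=1),
)
```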
The problem I'm experiencing is that the acquisition value here is either close to zero, in the range of 10^(-4) to 10^(-10), or negative, with magnitudes in the range of 2*10^(-4) to 3*10^(-4). The acquisition value found when […]

Thanks in advance!
-
Interesting approach. You may also be interested in two potential alternatives: […]
Mathematically, the knowledge gradient should always be nonnegative. It being small isn't a concern per se: the changes in the posterior mean are usually quite small relative to the current value. But negative values are not great. The fact that you are seeing them here is likely due to numerical precision issues, or to the KG acquisition function not being optimized perfectly. This optimization is quite hard even in relatively standard cases, and your setting with many discrete variables should make it a lot harder. One thing you could try is to really crank up the budget you spend on optimizing the acquisition function in a single iteration (e.g. by actually enumerating all combinations) to see whether that mitigates the issue of negative values.
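For illustration, full enumeration could look roughly like the sketch below, using `optimize_acqf_mixed` with one fixed-features dict per discrete combination. The fidelity levels, dimension indices, budget values, and the names `mfkg_acqf` and `bounds` are placeholders standing in for your actual setup:

```python
import itertools

from botorch.optim import optimize_acqf_mixed

# Placeholder fidelity levels per discrete dimension; in your problem this
# enumeration would instead produce the 864 valid combinations.
fidelity_values = [[0.5, 1.0]] * 8
fixed_features_list = [
    {8 + i: v for i, v in enumerate(combo)}  # fidelity dims assumed at indices 8-15
    for combo in itertools.product(*fidelity_values)
]

# `mfkg_acqf` (the multi-fidelity KG acquisition function) and `bounds`
# (a 2 x 16 tensor over all dimensions) are assumed to be defined as in
# the tutorial you adapted.
candidate, acq_value = optimize_acqf_mixed(
    acq_function=mfkg_acqf,
    bounds=bounds,
    q=1,
    num_restarts=20,    # cranked up relative to the tutorial defaults
    raw_samples=1024,
    fixed_features_list=fixed_features_list,  # one entry per discrete combination
)
```

This optimizes the continuous dimensions separately for every discrete combination and returns the best candidate overall, so it is expensive, but it takes the discrete part of the search out of the picture entirely.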
-
Hi @Balandat, thanks for your timely response! I had to do a bit of a deep dive before I could get back to you on this.
I attempted enumerating all combinations and bumped up the […]; roughly, the call I'm making looks like the sketch below.
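The option values shown here are placeholders rather than my exact settings, and `mfkg_acqf`, `bounds`, and `fixed_features_list` are assumed to be defined as above:

```python
from botorch.optim import optimize_acqf_mixed

candidate, acq_value = optimize_acqf_mixed(
    acq_function=mfkg_acqf,
    bounds=bounds,
    q=1,
    num_restarts=20,
    raw_samples=1024,
    fixed_features_list=fixed_features_list,
    options={
        "maxiter": 500,  # forwarded to scipy.optimize.minimize (L-BFGS-B)
        "ftol": 1e-10,   # tighter function-value tolerance for L-BFGS-B
    },
)
```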
These options get passed to scipy's implementation of L-BFGS-B. According to the documentation here and the aforementioned stackexchange post, it seems relatively safe to reduce the tolerance to […]

As for the two alternatives you mentioned, both sound interesting (especially the probabilistic reparameterization), but I think I will regard them as future work for now while I chase an article deadline. 😅

I should also mention that I am optimising over 96 of the 864 discrete combinations. I do this by first shuffling a list of all combinations and then selecting the next combination for which to construct the […]

Thanks!
Alex
Yep, that is correct. I just put up #1987 that introduces a mixed alternating optimizer - this needs some cleanup before it can be merged in, but it should work if you check out the PR locally.
cc @saitcakmak, @dme65